Data Mining

Traditional ore mining begins with an exploration (prospecting) of a resource pool (stone), and proceeds to determine whether usable resources (ore) exist and to what degree. Prospectors generally have an idea of what they are looking for, and they run small tests to see whether they are correct. Sometimes they strike gold, other times they strike out. Alongside the physical mines that bring us everything from coal to diamonds, we now have a new type of mining: data mining.

Data mining is the discovery of patterns in large data sets. Like ore mining, data mining begins with an exploration (analysis) of a resource pool (data), and proceeds to determine whether usable resources (correlations) exist and to what degree (how strong they are). Not all data miners “strike it rich.” As in ore mining, the search can turn up no useful patterns at all, but sometimes it leads to a bonanza of useful information.

In data mining, the emphasis is on the discovery of new knowledge. Data miners want to find new patterns that were previously unobserved. They use statistical analysis of big data to discover what the human eye can’t see, just as an ore miner might use a pick, dynamite, or a lab test to uncover ore that was not visible to the naked eye. In this sense, data mining is a form of exploratory data analysis rather than statistical hypothesis testing.

Data Mining Strategies

Data mining involves six common classes of tasks. They are listed below, along with examples of how each strategy can be used in recommender systems, such as those used by Netflix, Pandora, Amazon, and many other content providers. Each description is followed by a Netflix-related example:

  • Anomaly detection (outlier/change/deviation detection): the identification of unusual data records, which may be interesting in their own right or may be data errors that require further investigation.
    • Movie X is unlike any of the other movies in User Y’s data set, so remove it from our calculations. (Example: The Texas Chainsaw Massacre appears on a list that mostly contains titles such as Teletubbies, Barney and Friends, and Clifford.)
  • Association rule learning (dependency modeling): the search for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
    • Users who like Movie X also tend to like Movie Y.
  • Clustering: the task of discovering groups and structures in the data that are in some way “similar,” without using known structures in the data.
    • Dynamically grouped movie categories: “Romantic Comedies in Paris starring former professional football players.”
  • Classification: the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as “legitimate” or as “spam.”
    • Movie X is a romantic comedy.
  • Regression: the attempt to find a function that models the data with the least error.
    • Type X users typically increase their movie consumption rate by four movies per year.
  • Summarization: providing a more compact representation of the data set, including visualization and report generation.
    • What type of movie does User X typically like? (That is, sum up User X’s preferences in Y words.)
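
To make the association rule idea more concrete, here is a minimal Python sketch of how "users who like Movie X also tend to like Movie Y" might be measured. The viewing histories, the movie names, and the rule_stats helper are all made up for illustration; real recommender systems work on far larger data and use more sophisticated algorithms. The sketch computes two standard measures: support (how often both movies appear together across all users) and confidence (how often users who like the first movie also like the second).

```python
# Hypothetical viewing histories: each set holds the titles one user liked.
histories = [
    {"Movie A", "Movie B", "Movie C"},
    {"Movie A", "Movie B"},
    {"Movie A", "Movie C"},
    {"Movie B", "Movie C"},
    {"Movie A", "Movie B", "Movie D"},
]


def rule_stats(histories, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    total = len(histories)
    # Users who liked both titles.
    both = sum(1 for h in histories if antecedent in h and consequent in h)
    # Users who liked the first title, with or without the second.
    has_antecedent = sum(1 for h in histories if antecedent in h)
    support = both / total
    confidence = both / has_antecedent if has_antecedent else 0.0
    return support, confidence


support, confidence = rule_stats(histories, "Movie A", "Movie B")
print(f"A -> B: support={support:.2f}, confidence={confidence:.2f}")
# prints "A -> B: support=0.60, confidence=0.75"
```

In this toy data set, three of the five users liked both movies (support 0.60), and three of the four users who liked Movie A also liked Movie B (confidence 0.75). A rule with high support and high confidence is a candidate for a recommendation.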

These strategies all have different purposes, are more effective on some data sets than on others, and often work best in conjunction with one another. Therefore, there is no one “best” way to perform data mining. Data miners use multiple strategies to uncover patterns and discover new knowledge.
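
As a rough illustration of strategies working together, the Python sketch below combines anomaly detection with regression. The yearly movie counts are invented, and the function names (flag_outliers, fit_line) are hypothetical. The outlier test uses a median-based "modified z-score" with the commonly used cutoff of 3.5, which is only one of many possible choices. It is also applied to the raw values for simplicity, whereas a more careful analysis would look at how far each point falls from the trend.

```python
from statistics import mean, median


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median-based) exceeds threshold."""
    med = median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Hypothetical data: movies watched per year by one user.
# Year 5 is assumed to be a data error (for example, a shared account).
years = [1, 2, 3, 4, 5, 6]
movies = [10, 14, 18, 22, 90, 30]

# Step 1 (anomaly detection): find the suspicious record.
flags = flag_outliers(movies)

# Step 2 (regression): fit a trend with and without the flagged record.
naive_slope, naive_intercept = fit_line(years, movies)
clean_years = [x for x, bad in zip(years, flags) if not bad]
clean_movies = [y for y, bad in zip(movies, flags) if not bad]
clean_slope, clean_intercept = fit_line(clean_years, clean_movies)

print(f"Trend after removing outliers: {clean_slope:.1f} more movies per year")
```

With the anomalous year removed, the fitted trend is an increase of about four movies per year. Fitting the line to all six years gives a noticeably steeper slope, because a single bad record pulls the line upward.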

Common misconception: Data mining is often confused with Artificial Intelligence (AI).
  • Data mining is actually an application of techniques commonly associated with AI. “Machine learning” and “decision support” are standard AI techniques, but when we apply them to “knowledge discovery in databases,” we refer to them collectively as “tools for data mining.”

How much power lies in data mining? Read the following article to see “How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did.”