When you’re working with data (regardless of the size of your data sets), you’re likely to encounter two terms that are often confused — data mining vs machine learning:
- Data mining is any way of extracting useful information or insights from data, primarily for the purpose of making better decisions about the future. (Note that you’re not mining data; you’re mining information and insights from that data.)
- Machine learning is the science of getting computers to learn from data so that they can perform tasks they weren't explicitly programmed to perform. (Machine learning is often used to mine data, that is, to extract valuable insights from it, but it is only one way of doing so.)
In short, data mining is much broader than machine learning, but it certainly includes machine learning.
More About Data Mining
Data mining uses a very broad toolset to extract meaning from data. This toolset includes data warehouses and data lakes to store and manage data; extract, transform, and load (ETL) processes to bring data into the data warehouse; and business intelligence (BI) and visualization tools, which make it easy to combine, filter, sort, summarize, and present data in ways similar to, though more sophisticated than, what a spreadsheet application can do.
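To make the ETL idea concrete, here is a minimal sketch in Python. It assumes a tiny, made-up CSV export and uses an in-memory SQLite database as a stand-in for a data warehouse; the table name, column names, and sample values are all hypothetical. A real pipeline would involve far more validation and far more data, but the three steps are the same.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = """order_id,region,amount
1, east ,100.50
2,WEST,200.00
3,East,50.25
4,west,not_available
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize region names and drop rows with invalid amounts."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        cleaned.append((int(row["order_id"]), row["region"].strip().lower(), amount))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a table (our stand-in warehouse)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

def sales_by_region(conn):
    """Summarize the data the way a BI tool might: total sales per region."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
summary = sales_by_region(conn)
print(summary)
```

Once the data is clean and loaded, the summary query at the end is the kind of question a BI tool answers for you through a point-and-click interface.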
Visualizations are particularly useful because they reveal patterns in the data that might otherwise go unnoticed.
More About Machine Learning
In the context of data mining, machine learning harnesses the computational power of a computer to find patterns, associations, and anomalies in large data sets and to use those patterns to make predictions. While BI and visualization tools enable humans to more readily identify patterns in data, machine learning automates much of that process and often goes one step further to act on the meaning extracted from the data. For example, machine learning may identify patterns in credit card transaction data that are indicative of fraud, use those patterns to classify each future transaction as fraudulent or legitimate, and block any transaction suspected of being fraudulent.
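The following Python sketch illustrates the "learn a pattern, then act on it" idea in its simplest form. It is not how a real fraud system works; it is a deliberately simplified statistical stand-in that learns each card's typical spending from made-up historical data and blocks new transactions that fall far outside that range. The card IDs, amounts, the threshold of three standard deviations, and the rule for unknown cards are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

# Hypothetical historical transactions: (card_id, amount). All values are made up.
HISTORY = [
    ("card_a", 20.0), ("card_a", 25.0), ("card_a", 22.0), ("card_a", 30.0), ("card_a", 23.0),
    ("card_b", 500.0), ("card_b", 450.0), ("card_b", 520.0), ("card_b", 480.0), ("card_b", 550.0),
]

def fit_profiles(history):
    """Learn each card's typical spending (mean and sample standard deviation).

    Each card needs at least two historical transactions for stdev to work.
    """
    amounts = {}
    for card_id, amount in history:
        amounts.setdefault(card_id, []).append(amount)
    return {card: (mean(vals), stdev(vals)) for card, vals in amounts.items()}

def is_suspicious(profiles, card_id, amount, threshold=3.0):
    """Flag a transaction whose amount is far from the card's learned pattern."""
    if card_id not in profiles:
        return True  # assumption: cards with no history are treated as suspicious
    avg, sd = profiles[card_id]
    if sd == 0:
        return amount != avg
    return abs(amount - avg) / sd > threshold

def screen(profiles, transactions):
    """Split incoming transactions into approved and blocked lists."""
    approved, blocked = [], []
    for card_id, amount in transactions:
        target = blocked if is_suspicious(profiles, card_id, amount) else approved
        target.append((card_id, amount))
    return approved, blocked

profiles = fit_profiles(HISTORY)
approved, blocked = screen(
    profiles, [("card_a", 24.0), ("card_a", 400.0), ("card_b", 510.0)]
)
print("approved:", approved)
print("blocked:", blocked)
```

Notice that a 400.00 charge is blocked on one card while a 510.00 charge is approved on another. The rule was not hand-coded per card; it was derived from each card's own history, which is the essential difference from a fixed, manually written threshold.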
Machine learning is also useful for clustering — grouping like items in a data set to reveal patterns in the data that humans may have overlooked or never imagined looking for. For example, machine learning has been used in medicine to identify patterns in medical images that help to distinguish different forms of cancer with a high level of accuracy.
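To show what clustering looks like in practice, here is a small sketch of k-means, one common clustering technique, written in plain Python. It is only one of many clustering methods and is not tied to the medical example above. The sample points are made up, and the way the starting centroids are chosen (evenly spaced points from the input) is a simplification to keep the example deterministic; real implementations typically use smarter initialization.

```python
import math

def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters with a basic k-means loop."""
    # Simplification: seed the centroids with evenly spaced points from the input.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical data: two loose groups of points, with no labels supplied.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

labels, centroids = kmeans(POINTS, k=2)
print(labels)
print(centroids)
```

The algorithm is never told which points belong together. It discovers the two groups on its own, which is what makes clustering useful when you don't know in advance what to look for.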
Choosing the Right Approach
When your goal is to extract meaning from data, don't get hung up on the terminology or the differences between data mining and machine learning. Focus instead on the question you’re trying to answer or the problem you’re trying to solve, and team up with or consult a data scientist to determine the best approach. Here are a couple of general guidelines:
- If you have a clear idea of the insight you hope to gain, such as the number of people visiting your website over a specific period of time, a database or data warehouse coupled with BI or data visualization software is probably sufficient.
- If you need to extract meaning from a large volume of data and do not have a clear idea of how to answer a question or solve a particular problem, then you probably need to employ some type of machine learning — supervised or unsupervised. (See my previous article "Comparing Supervised and Unsupervised Machine Learning" for details.)
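The first guideline covers questions whose shape you already know. As a rough illustration, counting website visitors over a period is a single query against a database. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table name, column names, and rows are hypothetical, and dates are stored as ISO-formatted text so that they compare correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (visitor_id TEXT, visit_date TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [
        # Made-up sample rows
        ("v1", "2024-01-01"),
        ("v2", "2024-01-01"),
        ("v1", "2024-01-02"),  # a repeat visit by v1
        ("v3", "2024-01-05"),
        ("v4", "2024-02-01"),  # falls outside January
    ],
)

def unique_visitors(conn, start, end):
    """Count distinct visitors with at least one visit between start and end, inclusive."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT visitor_id) FROM page_views "
        "WHERE visit_date BETWEEN ? AND ?",
        (start, end),
    ).fetchone()
    return row[0]

january = unique_visitors(conn, "2024-01-01", "2024-01-31")
print(january)
```

No learning is involved here. You knew exactly what to ask, so a query (or the BI tool that writes the query for you) is all you need.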
Think of it this way: Imagine you manage a hospital and you're trying to determine why certain patients have better outcomes than others. You could approach this challenge from several different angles, including these two:
- Use BI or data visualization software: Start by asking questions that you can answer by consulting the BI software, such as “Which doctors on staff have the highest success rates?” or “Which patient follow-up programs resulted in the fewest return visits to the doctor?” Based on your findings, you can produce reports that state and support the conclusions you've drawn. The reports could lead to more questions requiring additional analysis.
- Employ machine learning: Use unsupervised machine learning with an artificial neural network. You feed all the data into the artificial neural network in the hope that it will identify useful patterns. With patterns in hand, it’s up to you and your team to determine their relevance and find out the cause(s) behind them.
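The first of these two approaches boils down to grouping and summarizing records, which is what BI software does behind the scenes. Here is a minimal sketch of the "Which doctors have the highest success rates?" question in Python. The doctor names and outcomes are entirely made up, and treating an outcome as a simple success-or-failure flag is an assumption for the sake of the example.

```python
from collections import defaultdict

# Hypothetical, made-up records: (doctor, outcome_was_successful)
RECORDS = [
    ("Dr. Adams", True), ("Dr. Adams", True), ("Dr. Adams", False), ("Dr. Adams", True),
    ("Dr. Baker", True), ("Dr. Baker", False),
    ("Dr. Chen", True), ("Dr. Chen", True),
]

def success_rates(records):
    """Return (doctor, success_rate, case_count) tuples, highest rate first."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for doctor, ok in records:
        totals[doctor] += 1
        successes[doctor] += int(ok)
    rows = [(d, successes[d] / totals[d], totals[d]) for d in totals]
    return sorted(rows, key=lambda row: row[1], reverse=True)

ranking = success_rates(RECORDS)
for doctor, rate, cases in ranking:
    print(f"{doctor}: {rate:.0%} success over {cases} cases")
```

The case count is included on purpose: a perfect rate over two cases says much less than a slightly lower rate over hundreds, which is exactly the kind of follow-up question this approach tends to raise.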
Each of these approaches has its own advantages and disadvantages. With the BI software approach, you would probably develop a deeper knowledge of the data and be able to explain the reasoning that went into the conclusions you've drawn. The process might even lead you to ask more interesting questions. Machine learning with an artificial neural network is more likely to identify unexpected patterns, because the machine views the data differently than humans typically do. However, this approach can also surface patterns that are difficult to interpret: they may be meaningful to the machine but not to the humans reviewing them.
What's important is that you consider your options carefully. Avoid the common temptation to choose machine learning solely because it is the latest, greatest technology. Sometimes, Excel is all you need to answer a simple question.