In my previous article, Machine Learning Algorithms, I explain what machine-learning algorithms are and describe the following commonly used algorithms:
- Decision trees
- K-nearest neighbor
- K-means clustering
- Regression analysis
- Naïve Bayes
Based on the descriptions of the machine-learning algorithms I presented in that post, you can already start to figure out which algorithm is best suited to answering a certain type of question or solving a certain type of problem. In this article, I provide some additional guidance.
General Guidelines
Your choice of algorithm generally depends on what you want the algorithm to do:
- Decision: If you want the machine to make a decision, choose the best course of action, or draw a conclusion based on the evidence provided, a decision tree algorithm is probably the best choice.
- Classification and Clustering: If you want the machine to classify or categorize, consider a classification algorithm, such as K-nearest neighbor or Naïve Bayes. If you want the machine to group unlabeled items, consider a clustering algorithm, such as K-means clustering.
- Prediction/Estimation: If you want the machine to predict a value in a continuous range of values, a regression algorithm such as linear regression is best. (Logistic regression, despite its name, estimates the probability that an item belongs to a class, so it is typically used for classification.)
When choosing an algorithm, consider a more empirical (experimental) approach. After narrowing your choice to two or more algorithms, you can train and test the machine using each algorithm with the data you have and see which one delivers the most accurate results. For example, if you're looking at a classification problem, you can run your training data on K-nearest neighbor and Naïve Bayes and then run your test data through each of them to see which one is best able to accurately predict which class a particular unclassified entity belongs to.
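As a minimal sketch of this empirical approach, the following example trains and tests two classifiers on the same data and compares their accuracy. Everything in it is an assumption made for illustration: the data is synthetic (two well-separated groups of points), the helper names are hypothetical, and the K-nearest neighbor and Gaussian Naïve Bayes implementations are deliberately simplified rather than production-ready.

```python
import math
import random
from collections import Counter

def make_data(rng, n_per_class):
    """Build toy data: two classes drawn from well-separated 2-D Gaussians."""
    data = []
    for label, center in (("a", 0.0), ("b", 5.0)):
        for _ in range(n_per_class):
            point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
            data.append((point, label))
    rng.shuffle(data)
    return data

def knn_predict(train, point, k=3):
    """Label a point by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def fit_naive_bayes(train):
    """Estimate a prior and a per-feature mean and variance for each class."""
    model = {}
    for label in {lbl for _, lbl in train}:
        rows = [p for p, lbl in train if lbl == label]
        stats = []
        for feature in zip(*rows):
            mean = sum(feature) / len(feature)
            var = sum((x - mean) ** 2 for x in feature) / len(feature)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[label] = (len(rows) / len(train), stats)
    return model

def nb_predict(model, point):
    """Pick the class with the highest log posterior under Gaussian features."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for x, (mean, var) in zip(point, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

def accuracy(predict, test):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(predict(point) == label for point, label in test)
    return hits / len(test)

rng = random.Random(0)            # fixed seed so the run is repeatable
data = make_data(rng, 100)        # 200 labeled points in total
train, test = data[:160], data[160:]

nb_model = fit_naive_bayes(train)
knn_accuracy = accuracy(lambda p: knn_predict(train, p), test)
nb_accuracy = accuracy(lambda p: nb_predict(nb_model, p), test)

print(f"K-nearest neighbor accuracy: {knn_accuracy:.2f}")
print(f"Naive Bayes accuracy:        {nb_accuracy:.2f}")
```

On toy data this cleanly separated, both algorithms should score about the same. The comparison becomes informative on real data, where the test set must stay separate from the training set so that the accuracy figures reflect how each algorithm handles entities it has not seen.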
Taking a More Systematic Approach
There is a more formal method for choosing a machine-learning algorithm, as presented in the following sections.
Step 1: Categorize the Problem
The first step is to figure out the nature of the problem you are trying to solve via machine learning. Categorize the problem by both input and output:
1. Categorize the problem by input:
- Supervised learning, if the data is labeled.
- Unsupervised learning, if the data is unlabeled and your goal is to discover hidden patterns in the data.
- Reinforcement learning, if your goal is to optimize an objective by having the machine interact with a given environment, such as learning to play a game.
2. Categorize the problem by output:
- Regression, if the model’s output is a number.
- Classification, if the model’s output is a class.
- Clustering, if the model’s output is a set of groups.
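The two categorizations above can be sketched as a small lookup function. This is only an illustration of the decision logic; the function name, its parameters, and the output labels ("number", "class", "groups") are hypothetical names chosen for this example.

```python
def categorize_problem(labeled, interacts_with_environment, output):
    """Return (learning type, task type) for a problem description.

    All parameter names are hypothetical, chosen for this sketch:
      labeled: True if the training data carries labels.
      interacts_with_environment: True if the machine learns by trial and error.
      output: "number", "class", or "groups".
    """
    # Categorize by input.
    if interacts_with_environment:
        learning = "reinforcement"
    elif labeled:
        learning = "supervised"
    else:
        learning = "unsupervised"

    # Categorize by output.
    tasks = {"number": "regression", "class": "classification", "groups": "clustering"}
    if output not in tasks:
        raise ValueError(f"unknown output kind: {output!r}")
    return learning, tasks[output]

print(categorize_problem(labeled=True, interacts_with_environment=False, output="number"))
print(categorize_problem(labeled=False, interacts_with_environment=False, output="groups"))
```

Real problems do not always fit one cell of this grid, so treat the result as a starting point for narrowing down candidates rather than a final answer.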
Step 2: Examine Your Data
The data you have also informs your choice of machine-learning algorithm:
- Data quantity: Some algorithms perform well on small data sets, whereas others require very large data sets. For example, linear/logistic regression and Naïve Bayes algorithms (with only a few parameters) may work with certain small data sets. Reinforcement learning may also work well with a small data set, because the machine generates the data it needs through trial and error. In contrast, using unsupervised learning to solve a clustering problem often calls for a larger data set, because there are no labels to guide the learning.
- Descriptive (summary) statistics: Statistics that describe your data, such as percentiles, averages, medians, and correlations, can be valuable in identifying the right machine-learning algorithm. For example, if two variables have a strong linear correlation, a linear regression algorithm is probably a good candidate.
- Data visualizations: Chart (graph) your data in various ways to identify relationships, spreads, and outliers. For example, a scatter plot may reveal several groupings of data points, which would suggest that K-means clustering or K-nearest neighbor is likely to be an effective algorithm.
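As a rough sketch of the descriptive-statistics check, the following example summarizes one variable and measures the linear correlation between two variables using only Python's standard library. The sample values and the variable names (hours, scores) are made up for illustration, and the helper functions are hypothetical.

```python
import math
import statistics

def summarize(values):
    """Return the mean, median, and quartiles of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up sample data, for illustration only.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 74.0, 79.0, 85.0]

print(summarize(scores))
print(f"correlation: {pearson(hours, scores):.3f}")
```

A coefficient close to 1 or -1 points to a strong linear relationship, which makes linear regression worth trying first. A coefficient near 0 rules out only a linear relationship, so it is still worth charting the data before discarding regression altogether.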
Step 3: Consider the Constraints
Conditions beyond your control may influence your choice of machine-learning algorithm. For example:
- Limited storage or compute capacity may prevent the use of very large data sets.
- The speed at which the machine needs to learn may require an algorithm that trains quickly. For example, you may need to retrain the same model frequently on different data sets.
- The speed at which the machine needs to make predictions can also influence your choice of algorithm. For example, a driverless car needs to be able to make split-second decisions.
Also, ask the following questions:
- How accurate does the model need to be?
- How complex is the model?
- How scalable is the model?
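One way to check the speed constraints above is to time training and prediction separately for each candidate. The sketch below assumes a trivial stand-in model (it "learns" the average of the training values and always predicts it); the function names are hypothetical, and you would substitute your own training and prediction calls.

```python
import time

def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical stand-in for a real training routine.
def train_mean_model(values):
    return sum(values) / len(values)

# Hypothetical stand-in for a real prediction routine.
def predict(model, _features):
    return model

training_values = [float(i) for i in range(100_000)]

model, train_seconds = timed(train_mean_model, training_values)
prediction, predict_seconds = timed(predict, model, [1.0, 2.0])

print(f"training took   {train_seconds:.6f} s")
print(f"prediction took {predict_seconds:.6f} s")
```

Timings from a single run vary with the hardware and with whatever else the machine is doing, so repeat the measurement several times and on data of realistic size before drawing conclusions.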
Step 4: Choose an Algorithm
The final step involves making your choice. The following table provides a list of algorithms along with specific use cases in which each algorithm may be most suitable, as well as the pros and cons of each algorithm.
Remember, prior to building a machine learning model, it is always wise to consult others on your data science team, particularly your resident data scientist, if you are fortunate enough to have one. Choosing a machine learning algorithm is a combination of art and science, so you’re likely to benefit by having someone look at the problem from another perspective.