Published August 5, 2021

Doug Rose

Author | Agility | Artificial Intelligence | Data Ethics

In one of my previous articles, *Supervised and Unsupervised Machine Learning*, I pointed out that machine learning can be used to analyze data in four different ways, two of which are predictive and two of which are descriptive:

**Predictive**: With these methods, supervised learning enables the machine to forecast outcomes based on established patterns. Predictive analysis includes:

- *Classification*: Assigning items to different labeled classes
- *Regression*: Identifying the connection between a dependent variable and one or more independent variables

**Descriptive**: With these methods, unsupervised learning enables the machine to detect patterns that reveal deeper insights into the data. Descriptive analysis includes:

- *Clustering*: Creating groups of like things
- *Association*: Identifying associations between things

To understand machine learning regression analysis, imagine those tube-shaped balloons you see at children's parties. You squeeze one end, and the other end expands. Release, and the balloon returns to normal. Squeeze both ends, and the center expands. Release one end, and the expanded area moves to the opposite end. Each squeeze is an independent variable. Each bulge is a dependent variable; it differs depending on where you squeeze.

Now imagine a talented balloon sculptor twisting together five or six of these balloons to form a giraffe. The relationship between squeezing and expanding becomes more complex. If you squeeze the body, maybe the tail expands. If you squeeze the head, maybe two legs expand. Each change to an independent variable results in a change to one or more dependent variables. Sometimes that relationship is easy to predict, and at other times it may be very difficult.

Regression analysis is commonly used in the financial industry to analyze risk. For example, I once worked for a credit card company that was looking for a way to predict which customers would struggle to make their monthly payments. They used a regression algorithm to identify relationships between different variables and discovered that many customers start to use their credit card to pay for essentials just before they have trouble paying their bills. A customer who typically used their card only for large purchases, such as a television or computer, would suddenly start using it to buy groceries and gas and pay their electric bill. The company also discovered that people who had a lot of purchases of less than five dollars were likely to struggle with their monthly payments.

The dependent variable was whether the person would have enough money to cover the monthly payment. The independent variables were the items the customer purchased and the purchase amounts. Based on the results of the analysis, the credit card company could then decide whether to suspend the customer's account, reduce the account's credit line, or maintain the account's current status in order to limit the company's exposure to risk.
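Because the dependent variable in this example is a yes-or-no outcome, one common way to model it is logistic regression. I am not claiming this is the method the credit card company used; treat the following as a minimal sketch under stated assumptions. The two feature names and the eight customer rows are invented for illustration, and the model is fit with plain gradient descent using only the Python standard library.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, row):
    """Estimated probability of a missed payment; weights[0] is the bias."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], row))
    return sigmoid(z)


def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit logistic regression by batch gradient descent on the log loss."""
    n, k = len(X), len(X[0])
    weights = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for row, target in zip(X, y):
            err = predict_proba(weights, row) - target
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


# Hypothetical, made-up data (not from any real customer base).
# Independent variables per customer:
#   essentials_share: fraction of card spending on groceries, gas, utilities
#   small_purchase_share: fraction of purchases under five dollars
X = [
    [0.10, 0.1], [0.15, 0.2], [0.20, 0.0], [0.25, 0.3],
    [0.60, 0.8], [0.70, 0.6], [0.80, 0.9], [0.75, 1.0],
]
# Dependent variable: 1 = struggled with the monthly payment, 0 = did not.
y = [0, 0, 0, 0, 1, 1, 1, 1]

weights = fit_logistic(X, y)
risk_high = predict_proba(weights, [0.80, 0.9])
risk_low = predict_proba(weights, [0.10, 0.1])
print(f"high-essentials customer: {risk_high:.2f}")
print(f"low-essentials customer:  {risk_low:.2f}")
```

With real data you would hold out a test set and check how well the estimated probabilities match actual outcomes before acting on them.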

Businesses often use regression analysis to identify which factors contribute most to sales. For example, a company may want to know how it can get the most bang for its buck in terms of advertising: should it spend more money on its website, on social media, on television advertising, on pay-per-click (PPC) advertisements, and so on? Regression analysis can rank these items from those that contribute most to those that contribute not at all. The company can then use the results of that analysis to predict how its various advertising investments will perform.
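To make the advertising example concrete, here is a rough sketch of multiple linear regression fit by ordinary least squares. Everything in it is an assumption for illustration: the three channels, the spend figures, and the sales numbers, which are generated from a formula in which television is deliberately given no effect so you can see the regression recover that.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ols(X, y):
    """Ordinary least squares; returns [intercept, coef_1, ..., coef_k]."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of ones
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)


channels = ["website", "social", "tv"]  # hypothetical channels

# Made-up monthly spend per channel (same currency units for all three).
spend = [
    (1, 2, 5), (2, 1, 3), (3, 4, 1),
    (4, 3, 6), (5, 6, 2), (6, 5, 4),
]
# Synthetic sales: TV is given a true effect of zero on purpose.
sales = [50 + 3.0 * w + 1.5 * s + 0.0 * t for w, s, t in spend]

coefs = fit_ols(spend, sales)
ranked = sorted(zip(channels, coefs[1:]), key=lambda p: abs(p[1]), reverse=True)
for name, coef in ranked:
    print(f"{name}: {coef:+.2f} sales per unit of spend")
```

Comparing coefficients like this is only meaningful when the independent variables share the same units, as the spend columns do here. Real sales data is also noisy, so the estimates would come with uncertainty that this sketch ignores.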

When performing regression analysis, the first step is to identify the dependent and independent variables:

- The dependent variable is what you are trying to understand or predict; for example, whether a customer is about to miss a credit card payment.
- Independent variables are factors that may have an impact on the dependent variable; for example, the customer's spending habits or purchase amounts prior to the date on which the next credit card payment is due.
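In code, this first step often amounts to nothing more than deciding which column is the target and which columns are the inputs. A small sketch, assuming customer records stored as dictionaries with hypothetical field names:

```python
def split_variables(records, dependent, independents):
    """Return (X, y): rows of independent values and the dependent values."""
    X = [[rec[name] for name in independents] for rec in records]
    y = [rec[dependent] for rec in records]
    return X, y


# Hypothetical records; the field names and values are invented.
customers = [
    {"essentials_share": 0.10, "small_purchases": 2, "missed_payment": 0},
    {"essentials_share": 0.65, "small_purchases": 14, "missed_payment": 1},
]

X, y = split_variables(
    customers,
    dependent="missed_payment",
    independents=["essentials_share", "small_purchases"],
)
print(X)
print(y)
```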

Keep in mind that *correlation does not prove causation*. Just because regression analysis shows a correlation between an independent and a dependent variable, that does not mean that a change in the independent variable caused the change observed in the dependent variable, so avoid the temptation to assume it does.

Instead, perform additional research to test whether the correlation reflects a real cause-and-effect relationship, or dig deeper to find out what's really going on. For example, regression analysis may show a correlation between the use of certain colors on a web page and the amount of time users spend on that page, but other unidentified factors may be contributing, perhaps to a greater degree. A web designer would be wise to run one or more experiments before making any changes.
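A small simulation can show how a hidden factor produces a strong correlation with no causation behind it. The sketch below is entirely synthetic and rests on an assumption I made up for the web page example: some unmeasured factor (say, content quality) drives both how much a page uses a certain color and how long visitors stay. In the simulated experiment, color use is assigned at random, so the hidden factor can no longer influence it.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical hidden factor that nobody measured.
hidden = [rng.gauss(0, 1) for _ in range(n)]

# Observational data: the hidden factor drives BOTH variables.
# Color use has no direct effect on time on page in this simulation.
color_use = [h + rng.gauss(0, 0.5) for h in hidden]
time_on_page = [h + rng.gauss(0, 0.5) for h in hidden]
observed_r = pearson(color_use, time_on_page)

# Experiment: color use is assigned at random, independent of the hidden factor.
assigned_color = [rng.gauss(0, 1) for _ in range(n)]
time_in_experiment = [h + rng.gauss(0, 0.5) for h in hidden]
experiment_r = pearson(assigned_color, time_in_experiment)

print(f"correlation in observational data: {observed_r:.2f}")
print(f"correlation in randomized experiment: {experiment_r:.2f}")
```

The observational correlation comes out strong even though color never affects time on page here, while the randomized version sits near zero. That gap is the reason to run an experiment before redesigning anything.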

While regression analysis is very useful for identifying relationships between a dependent variable and one or more independent variables, use these relationships as a starting point for gathering more data and developing deeper insight into the data. Ask what the results mean and what else could be driving those results before drawing any hard-and-fast conclusions.
