Data science teams capture, store, and analyze data to extract valuable information and insight. In recent posts, I focused on capturing and storing three types of data — structured, semi-structured, and unstructured — and I encouraged readers to avoid the common trap of allowing big data to become big garbage.
In this post, I shift focus to analysis — using statistics, mathematics, and other analytical tools to extract meaning and insight from data. Although specific statistical methods vary considerably, they can be broken down into the following five categories:
- Description: Aggregating and summarizing data to reveal patterns.
- Probability: Determining the likelihood of an event or outcome.
- Correlation:Determining the degree to which one or more things are related.
- Causation: Determining whether one or more conditions or events were responsible for subsequent conditions or outcomes.
- Prediction:Forecasting future conditions or events based on currently available data.
In the following sections, I describe each of these five approaches to statistical analysis in greater detail and provide a word of caution at the end.
A descriptive statisticis a quantitative summary of a collection of data. Descriptive statistics include the following:
- Mean: The average of a set of values.
- Median: The middle value when all values are ranked from lowest to highest or vice versa.
- Mode: The most frequently occurring value in a set of values.
- Standard deviation: The amount of variation in a range of values.
- Sample minimum and maximum: The lowest and highest values in a value set.
- Kurtosis: A measure of the combined weight of the “tails” relative to the rest of the distribution in a value set. Imagine a bell curve with a tall, steep curve and a very wide base; such a curve would have long “tails” and relatively high kurtosis.
- Skewness:A measure of the symmetry or lack of symmetry in a value set.
Descriptive analytics are great for story-telling, proving a point, and hiding facts, which is why this approach is commonly used in political campaigns. One candidate may claim, “Over the last four years, average salaries have risen $5,000,” while her opponent claims, “Compared to four years ago, a typical middle class family now earns $10,000 less.” Who’s telling the truth? Maybe they both are. Opposing candidates often draw on the same data and use descriptive analytics to present it ways that support whatever point they’re trying to make.
In this example, the first candidate uses the meanto show that the average family earned about $5,000 more. The opposing candidate used the median(typical middle class family, not including poor or rich families) to make the case that a certain segment of the population was now earning less than they did four years ago. Both candidates are right, while neither candidate presents the entire truth.
Probabilityis the likelihood that something will happen. If you flip a coin, the probability is 50 percent it will land heads or tails. If you roll a six-sided die, you have a 1/6 or about a 17 percent probability of rolling any given number from one to six. Probability can also be used to gauge the likelihood of a coin landing heads twice in a row or rolling a specific number on a die twice in a row.
In data science, calculating probabilities can produce valuable insights. I once worked with a biotech company that was trying to determine the probability of people participating in a clinical trial, which is impacted by a number of factors. If participants are required to fast the night before, they’re about 30 percent less likely to participate. If a needle or a blood test is required, they’re about 20 percent less likely to participate. These results enabled the company to consider alternatives, such as replacing the blood test with a saliva test. However, they then had to analyze the possible impact on the results of the study if they made that change.
Data science is like that. The answer to one question may lead to other questions requiring additional analyses. When working on a data science team, be prepared to ask follow-up questions and harness the power of data to answer them.
Correlation is another very interesting area in data science. Many companies use correlation to analyze customer data and make targeted product recommendations. For example, Netflix looks at all the movies and TV shows you watched to recommend movies that are likely to appeal to you. Likewise, Amazon analyzes your purchase and search histories to recommend products you might like.
Correlations are commonly broken down into two categories:
- Positive: For example, cold weather and electricity usage — the colder it is outside, the more electricity people use to heat their homes.
- Negative: For example, the weight of a car and its gas mileage — generally, the heavier a car is, the lower its gas mileage. The two variables (weight and gas mileage) have an inverse relationship.
Data science teams look for correlations, as measured by thecorrelation coefficient— a value between –1 and 1 that indicates how closely related two variables are. Zero (0) indicates no correlation, 1 indicates a strong positive correlation, and –1 indicates a strong negative correlation.
Correlation is also useful for testing assumptions. For example, if a business assumes that customers who buy the most are the most satisfied, it could run correlation analysis to compare spending and satisfaction in order to prove or disprove that assumption.
Causationis correlation with a twist. The twist is that the two variables being studied are related through cause and effect. Keep in mind, however, that correlation does not prove causation. Just because it rains every time I forget to close my car windows doesn’t mean that forgetting to close the windows causes it to rain.
For example, when my parents got older, they moved to a retirement community in southern Florida. Statistically, their community is one of the most dangerous places on earth. People are constantly being hospitalized or buried. If you looked at the correlation between the community and rates of hospitalizations and deaths, you’d think they lived in a war zone. But the actual correlation is between age and rates of hospitalizations and deaths. The community is very safe.
If causation is proven, it can come in very handy for answering “Why” questions. Why do sales drop off in July? Why do people who live in a certain location have a greater incidence of lung cancer? Why are so many people returning our product? Careful analysis of the right data can answer these and other probing questions.
Perhaps the most fascinating and valuable application of statistical analysis is predictive analytics. Imagine having a crystal ball that enables you to see the future. By peering into that crystal ball, an organization could see the next big thing. It could tell what the competition was going to do before the competition knew. It could spot developing trends and be first to market with hot-selling products.
Think of predictive analytics as weather forecasting. Nobody really knows what the weather will be like the next day, but meteorologists can look at current data and compare it to historical data to make an accurate prediction of what the weather will be several days from today. They combine various types of data, including temperatures, pressures, humidity levels, and wind directions and speeds; analyze the data to spot patterns; correlate the patterns with historical data; and then draw their conclusions.
In the same way, organizations in a variety of sectors can analyze the data they have to spot patterns and trends and make predictions. And with the growing volume of data in the world, these predictions are becoming more and more precise.
Regardless of the approaches you use to analyze your data, be curious and skeptical. Data and the conclusions drawn from that data can be misleading, so challenge assumptions and conclusions. Ask questions, such as “Does this answer or conclusion make sense?” “Did we use the right data?” “What am I missing?” and “What else could this mean?” Use different analytical methods to examine the data from different perspectives. As a data scientist, you need to think like a scientist.