In my previous post, “Challenging Evidence and Conclusions in Data Science,” I encouraged data science teams to be skeptical of any claim and of the evidence offered to support it, and I provided several techniques for challenging both.
However, missing data can be just as misleading as wrong data, if not more so. One of the big problems with missing data is that people can’t see what’s not there. When you have data, you can check for errors and validate it. With missing data, you have nothing to check. You may not even think to ask about it or look for it.
For example, suppose you see a graph with the headline: “Major Heat Wave in Atlanta!”
Your initial reaction might be that temperatures are rising precipitously in Atlanta and something must be done to reverse this dangerous trend. What’s missing from this graph? The months along the horizontal axis: January through July. Of course monthly temperatures are going to rise dramatically over the spring and summer months!
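The effect is easy to reproduce. In this sketch (the temperatures are made-up illustrative values, not real climate data), the same series of numbers reads as an alarming spike when the month labels are hidden and as an ordinary seasonal cycle when they are restored:

```python
# Hypothetical monthly average highs for Atlanta in deg F, January-July.
# These are illustrative values, not actual climate data.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"]
temps = [53, 58, 66, 73, 81, 87, 89]

# Without the axis labels, this looks like runaway warming...
rise = temps[-1] - temps[0]
print(f"Temperatures rose {rise} deg F over the period!")

# ...with the labels restored, it is just winter turning into summer.
for month, temp in zip(months, temps):
    print(f"{month}: {temp} deg F")
```

The data never changes; only the context provided alongside it does.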
I once worked for an organization that was trying to figure out why more men than women were participating in their medication trials. A report from the company’s labs showed that 60 percent of its study participants were men compared to only 40 percent who were women. The data science team was assigned the job of finding out why men were more likely than women to participate in the company’s medication studies.
When team members received this report, they asked, “What significant information are we missing? What does it mean that men are more likely than women to participate?” Does it mean that more men applied? That equal numbers of men and women applied, but more men were accepted? Or that equal numbers of men and women applied and were accepted, but more men actually participated?
This additional data would shift the team’s exploration in different directions. If more men applied, the next question would be “Why are men more likely than women to apply for our medication studies?” If equal numbers of men and women applied but more men were accepted, the next question would be “Why are more men being accepted?” or “Why are more women being rejected?” If equal numbers of men and women applied and were accepted but more men actually participated, the next question would be “Why are men more likely to follow through?” As you can see, the missing data has a significant impact on where the team directs its future exploration.
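The three scenarios can be made concrete as a small recruitment funnel. The counts below are hypothetical (the report gives only the final 60/40 split); the point is that the same headline statistic can emerge at any stage, and stage-to-stage conversion rates show where the gap actually opens:

```python
# Hypothetical recruitment-funnel counts by gender.
# These numbers are illustrative assumptions, not the company's figures.
funnel = {
    "applied":      {"men": 500, "women": 500},
    "accepted":     {"men": 400, "women": 300},
    "participated": {"men": 300, "women": 200},  # the reported 60/40 split
}

# Compare conversion rates between consecutive stages for each group.
stages = list(funnel)
for prev, curr in zip(stages, stages[1:]):
    for group in ("men", "women"):
        rate = funnel[curr][group] / funnel[prev][group]
        print(f"{group}: {prev} -> {curr}: {rate:.0%}")
```

With these assumed counts, equal numbers applied, so the question the team should chase is acceptance, not application; different assumed counts would point the investigation elsewhere.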
When you encounter a scenario like this, consider both what data might be missing and why it might be missing:
- Why is certain data missing? Data may be omitted intentionally or unintentionally. For example, maybe these numbers reflect only one or two lab studies, and the percentages would even out if more data were provided. Maybe the studies had space or time limitations that affected the number of women who were willing to participate. Maybe the person providing the data had ulterior motives for withholding certain data.
- What does the claim based on this data actually mean? When a number is preceded or followed by a comparative adjective and nothing more, it could be a signal that something is missing. For example, if you see a phrase such as “60 percent faster,” “lighter by over 30 percent,” or “20 percent more,” you should naturally ask, “Than what?” If someone lost 60 percent more weight when taking a certain supplement, does that mean the person lost 60 percent more weight than others who didn’t take the supplement, or 60 percent more than when he or she was not taking it? And what does 60 percent more weight equate to? It could be the difference between 1 pound and 1.6 pounds over six months, which would be rather insignificant.
- Why does it matter? Take the negative view. For example, why does it matter that more men than women participate in these medication studies? Would it be beneficial for the company to have more women participate in its studies?
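The supplement example in the second question comes down to simple arithmetic: a “60 percent more” claim scales with whatever baseline goes unstated. A minimal sketch, using a hypothetical helper and made-up baselines:

```python
def with_60_percent_more(baseline_lbs: float) -> float:
    """Return the weight loss that is 60% greater than the baseline."""
    return baseline_lbs * 1.6

# The same "60 percent more" claim, attached to two different baselines
# (both baselines are made-up numbers for illustration):
small = with_60_percent_more(1.0)   # 1.6 lb over six months: negligible
large = with_60_percent_more(50.0)  # 80.0 lb: dramatic
print(small, large)
```

Without the baseline, the headline percentage carries almost no information.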
This last question turned out to be significant. The benefit of having more women participate in the company’s studies is that young women are more likely to be on prescription medication, which would make the studies more comprehensive: they would be able to test for a greater number of drug interactions. The flip side is that many women couldn’t participate precisely because they were taking a prescription medication that disqualified them from the study. The statistic could then be rephrased as “60 percent of those who are allowed to participate in our medication studies are men,” which tells an entirely different story.
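The rephrasing amounts to changing the denominator from everyone interested to everyone eligible. With hypothetical counts (the exclusion figure is an assumption for illustration), the same 60 percent reappears as a statement about eligibility rather than willingness:

```python
# Hypothetical counts; the medication-exclusion figure is an assumption
# for illustration, not data from the actual studies.
men_interested, women_interested = 300, 300
women_excluded_by_medication = 100  # barred by a prohibited prescription

eligible_men = men_interested
eligible_women = women_interested - women_excluded_by_medication

# Same 60 percent, but now the denominator is "allowed to participate".
share_men = eligible_men / (eligible_men + eligible_women)
print(f"{share_men:.0%} of those allowed to participate are men")
```

The number on the page is unchanged; only the population it describes, and therefore the story it tells, is different.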
Data science teams need to remain vigilant regarding missing information. If a claim seems too good or too bad to be true, the team needs to question it and ask, “What’s the rest of the story? What’s missing? What’s been omitted, intentionally or not?” The team also should always be asking, “Do we have all the relevant data?”