Data drives the data science team’s exploration and discovery, so the team must be constantly on the lookout for bad data, which can lead the team astray or result in erroneous conclusions. In this post, I present several ways to challenge the data the team is provided to ensure that the team is working with accurate information and to generate additional questions that may lead to valuable discoveries.
Questioning the “Facts”
Many organizations rely on what they believe to be facts in their daily operations. Questioning these “facts” may be taboo for the rest of the organization, but they are fair game to the data science team. After all, one of the data science team’s key obligations is to challenge assumptions.
Whenever your data science team encounters a “fact,” it should challenge the claim by asking the following questions:
- Should we believe it?
- What evidence is available to support or refute it?
- How strong is the evidence to support or refute it?
- Does a preponderance of the evidence support or refute it?
When you’re working on the data science team, you’ll see all kinds of well-established “facts.” The sources of these “facts” are numerous and varied: intuition, personal experiences, examples, expert opinions, analogies, tradition, whitepapers, and so on. Part of your job as a member of the data science team is to question these “facts,” not reject them outright. As you explore, you may find evidence to support the “fact,” evidence to refute it, a lack of evidence, or a mix of inconclusive evidence. Keep an open mind as you gather and examine the evidence.
Considering Alternate Causes
It’s easy to say that correlation doesn’t imply causation (just because one event follows another doesn’t mean that the first event caused the second), but distinguishing between correlation and causation is not always easy. Sometimes, it is easy. If you bump your head, and it hurts, you know the pain was caused by bumping your head.
However, sometimes, it is not so easy. For example, when a doctor noticed that many children were developing autism after receiving a vaccination to protect against measles, mumps, and rubella, he and some of his colleagues found it very tempting to suggest a possible cause-effect relationship between the vaccination and autism. Later research disproved any connection. It just so happens that the signs of autism tend to become apparent at about the same age at which children are scheduled to receive this vaccination.
Whenever your data science team encounters an alleged cause-effect relationship, it should look for the following:
- Whether the cause actually makes sense: Perform a reality check simply by asking whether the alleged cause-effect relationship makes any sense. For example, I know a guy who, for a time, refused to watch his favorite football team play because every time he watched a game, his team lost, and every time he didn’t watch, it won. Of course, after missing a few games in which his team lost, he realized the cause-effect relationship he had suspected was non-existent.
- Whether the cause is consistent with other effects: If the cause-effect relationship is similar to other cause-effect relationships in the same “family,” there’s a better chance it’s valid. For example, if you know that hot weather makes people buy more ice cream, chances are good that hot weather is also responsible for a recent spike in popsicle sales.
- Whether the event can be explained by other causes: The team should ask, “What else could have possibly caused what we’re observing?” If other causes are possible, and especially if they’re more probable, your team would be wise to run some tests to identify the most likely cause.
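One way to test for "other causes" is to check whether a correlation survives once a suspected common cause is accounted for. The sketch below is only an illustration on simulated data: the numbers, the variable names, and the helper functions `pearson` and `residuals` are all made up for this example. It assumes hot weather drives both ice cream and popsicle sales, then compares the raw correlation between the two with the correlation that remains after removing the part of each that temperature explains.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after subtracting a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

# Simulated (not real) data: temperature is the common cause of both sales.
rng = random.Random(0)
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream = [5 * t + rng.gauss(0, 10) for t in temperature]
popsicles = [3 * t + rng.gauss(0, 10) for t in temperature]

# Raw correlation: ice cream and popsicle sales move together.
raw = pearson(ice_cream, popsicles)

# Correlation after removing what temperature explains from each series.
partial = pearson(residuals(ice_cream, temperature),
                  residuals(popsicles, temperature))

print(f"raw correlation:     {raw:.2f}")
print(f"after temperature:   {partial:.2f}")
```

In this simulation the raw correlation is strong, while the correlation left after accounting for temperature is close to zero. That pattern suggests neither product's sales drive the other's; the weather explains both. Real data is rarely this clean, so treat a check like this as one piece of evidence rather than proof.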
Uncovering Misleading Statistics
Although it is true that “numbers don’t lie,” people frequently use numbers, specifically statistics, to lie or mislead. A classic example is the advertisement claiming that 80 percent of dentists recommend a specific toothpaste. The truth is that in many of these studies, dentists were allowed to choose several brands from a list of options, so other brands may have been just as popular, or even more popular, than the advertised brand.
When your team encounters statistics or a claim based on statistics, it needs to dig into those numbers and identify the source of the information and how the numbers were obtained. Don’t accept statistics at face value.
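To see how a multiple-choice survey can produce a headline like the toothpaste claim, here is a small sketch. The responses and brand names are invented for illustration, and `recommendation_share` is a hypothetical helper, not part of any real study. Each dentist is allowed to pick several brands, and the code computes the share of dentists who picked each one.

```python
from collections import Counter

# Hypothetical survey: each dentist could tick as many brands as they liked.
responses = [
    {"BrandA", "BrandB", "BrandC"},
    {"BrandA", "BrandB"},
    {"BrandB", "BrandC"},
    {"BrandA", "BrandB", "BrandC"},
    {"BrandA", "BrandC"},
]

def recommendation_share(responses):
    """Fraction of respondents who selected each brand."""
    counts = Counter(brand for picks in responses for brand in picks)
    return {brand: counts[brand] / len(responses) for brand in sorted(counts)}

shares = recommendation_share(responses)
for brand, share in shares.items():
    print(f"{brand}: {share:.0%} of dentists recommend it")

# Because respondents could pick several brands, the shares add up to
# more than 100 percent, a sign that they are not exclusive preferences.
print(f"total: {sum(shares.values()):.0%}")
```

With these made-up responses, every brand can truthfully advertise that 80 percent of dentists recommend it. The number is accurate, but it says nothing about whether dentists prefer that brand over the others, which is why knowing how the question was asked matters.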
Remember that a data science team can only be as good as the data (evidence) it has. Many teams get caught up in capturing more and more data while overlooking the data’s quality. Teams need to continuously evaluate the evidence. The techniques described in this post are a great start.
Bottom line, the data science team needs to be skeptical. When presented with a claim or evidence to back up a claim, it needs to challenge it. An old Russian proverb advises “Trust but verify.” I go a step further to recommend that you not trust at all — be suspicious of all claims and evidence that your data science team encounters.