Common Data Science Pitfalls

Published April 3, 2017

Doug Rose

Author | Agility | Artificial Intelligence | Data Ethics

Many people have the misconception that science is equivalent to truth. In fact, people often cite science as the authority on a specific issue. They seem to believe that anyone who challenges scientific claims is challenging the truth and, in so doing, is being completely unscientific. In reality, a large part of what goes on in the world of science is a continuous process of asking and answering questions and challenging theories and conclusions drawn from previous studies. Science, including data science, is about asking and seeking answers to relevant questions. Science is not truth; it is the search for truth.

However, all scientists are human. As such, they are susceptible to bias, logical fallacies, and other errors that can corrupt data and the conclusions drawn from that data. In this post, I describe the most common data science pitfalls in the hope that by knowing about them, you’ll be better equipped to avoid them.

Focusing on Capability

A major pitfall in data science is what I call the “cluster of dreams.” It’s based on the movie Field of Dreams starring Kevin Costner. The movie is about a man who spends his life savings building a baseball diamond in a cornfield. A mysterious voice urges him on with a line that is popularly remembered as “If you build it, they will come” (in the film, the line is “If you build it, he will come”).

With data science, organizations often believe that if they put the right technologies in place and gather enough data, all their questions will be answered and all their problems solved. They focus their energy on building a data warehouse and collecting massive amounts of data. They make large investments in software to run on large data clusters. With everything in place, they begin to capture large volumes of data. Then, progress grinds to a halt. They have no idea what to do with that data.

To avoid this trap, take the following precautions:

Keep in mind that data science is exploratory. There’s no prize for having the biggest, most powerful data warehouse or the most data, and data isn’t the product. What’s important are the knowledge and insights gleaned from that data.
Don’t try to make up in tools what your team lacks in expertise. Owning a fully equipped workshop doesn’t make you a carpenter. Focus more on personnel and training and less on having the latest, greatest technology.
Let your data science team get messy. Let them use whatever combination of tools and techniques they deem most helpful to get the job done.

Setting Objectives and Planning Outcomes

Many organizations underestimate the shift in mindset required for a successful data science initiative. It’s not as simple as having a group of statisticians looking at the data. It’s about treating your organization’s data in a different way — not as a static resource that needs to be controlled, but as a dynamic resource that needs to be explored.

Changing the organization’s mindset involves letting go of strategies that may have worked well in the past. If you want to explore, you can’t have project objectives and planned outcomes. These are often barriers to discovery. You have to be comfortable with the idea that you don’t know where the data may lead.

To avoid this trap of setting objectives and planning outcomes, take the following precautions:

Focus on exploration and discovery. You should be guided more by curiosity than by objectives.
Take advantage of serendipity. You may find patterns in your data that nobody ever imagined looking for. These discoveries may lead your team in an entirely new direction or challenge long-held beliefs.
Be prepared to shift gears and change direction. By shifting gears, I mean try to analyze the same data in different ways or pull in more data. Changing direction may require ending a certain line of inquiry, asking more questions to dig deeper, or asking follow-up questions to steer down another path.

Working without a Framework

Data science teams should be focused less on objectives and more on exploration, but they still must produce something of value — organizational knowledge and insight. Teams that explore without a framework in place tend to wander. They get lost in the data and often make a lot of insignificant and irrelevant discoveries.

To avoid this trap, approach data science as you would approach agile software development. Work in “sprints” with regularly scheduled team meetings to ask questions, troubleshoot problems, and share stories. Storytelling is a great way to encourage the team to extract meaning and insights from the data. If the team is unable to tell a compelling story with the data and discoveries it has made, it probably isn’t aligning its efforts with the organization’s business intelligence needs.

Focusing Too Much on Routine Work

In 1999, two psychologists conducted an experiment. They filmed six people passing a ball and showed the video to a group of students. Before playing the video, they instructed the students to count the number of times the ball was passed. Most of the students came up with an accurate count. However, none of them mentioned the person in the gorilla suit who walked into the center of the circle, stopped, and then walked off camera. When asked about the gorilla, about half the students said they hadn’t noticed it. In fact, some were so sure it wasn’t there that they had to watch the video again to be convinced. This phenomenon is known as inattentional blindness, also called perceptual blindness.

The same phenomenon occurs in organizations, and even with data science teams. People get so caught up in their routine work that they fail to notice what is most significant. On a data science team, members often get so involved in the process of capturing, cleaning, and storing the data that they overlook the purpose of that data.

For example, in a storytelling session, a data analyst drilled down into a data visualization to show the detail. The example was an ad for a red Ford Mustang. For some reason, this ad had done very well: its click-through rate was much higher than that of the other ads. One of the stakeholders on the team interrupted the presentation and asked why the ad was so successful. The data science team hadn’t even considered that question.
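One way to keep that kind of question from slipping past is to flag standout results automatically so they come up in the next storytelling session. The following is only a minimal sketch under stated assumptions: the ad names, the numbers, and the "twice the median" threshold are all made up for illustration, and the function names are hypothetical rather than part of any particular tool.

```python
from statistics import median

# Hypothetical ad data, invented for illustration only.
ads = {
    "red_mustang": {"impressions": 10_000, "clicks": 450},
    "blue_sedan": {"impressions": 12_000, "clicks": 130},
    "silver_suv": {"impressions": 9_000, "clicks": 95},
    "white_pickup": {"impressions": 11_000, "clicks": 120},
}


def click_through_rates(ads):
    """Return clicks / impressions per ad, skipping ads with no impressions."""
    return {
        name: stats["clicks"] / stats["impressions"]
        for name, stats in ads.items()
        if stats["impressions"] > 0
    }


def standout_ads(ads, ratio=2.0):
    """Return names of ads whose click-through rate is at least `ratio`
    times the median rate. The threshold is an arbitrary assumption."""
    rates = click_through_rates(ads)
    if not rates:
        return []
    typical = median(rates.values())
    return sorted(name for name, rate in rates.items() if rate >= ratio * typical)


# Turn each standout into a question for the team, not just a number.
for name in standout_ads(ads):
    print(f"{name}: click-through rate is well above typical. Why?")
```

The code does not answer the question. It only surfaces the anomaly so that someone on the team is prompted to ask why.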

Failing to Connect Insights to Business Value

When your data science team makes a discovery, it’s not “mission complete.” You have to connect the discovery to real business value, which isn’t always easy, because you rarely start out knowing what you’re looking for. At first, you may not even realize that what you discovered has any business value.

One of the benefits of working in sprints is that you discover and deliver insights a little at a time — every two weeks. With each sprint, you build on what you know. The research lead on the team has the opportunity to evaluate the insights every two weeks and connect them to business value. And if the team is on the wrong path, the research lead can point that out, and the team can discuss ways to change direction.

Forcing your data team to connect insights to business value is a pretty good way to avoid most, if not all, of the most common data pitfalls. As long as the team is focused on extracting valuable business intelligence from the organization’s data, it will remain on the right path.
