Democratizing data is a big challenge in larger organizations, where you need to make data available and accessible to everyone.
Many people have the misconception that science is equivalent to truth. In fact, people often cite science as the authority on a specific issue. They seem to believe that anyone who challenges scientific claims is challenging the truth and, in so doing, is being completely unscientific. In reality, a large part of what goes on in the world of science is a continuous process of asking and answering questions and challenging the theories and conclusions drawn from previous studies. Science, including data science, is about asking and seeking answers to relevant questions. Science is not truth; it is the search for truth.
However, all scientists are human. As such, they are susceptible to bias, logical fallacies, and other errors that can corrupt data and the conclusions drawn from that data. In this post, I describe the most common data science pitfalls in the hopes that by knowing about them, you’ll be better equipped to avoid them.
A major pitfall in data science is what I call the “cluster of dreams.” It’s based on the movie Field of Dreams starring Kevin Costner. The movie is about a man who spends his life savings building a baseball diamond in a cornfield. The ghosts of old players visit the area where the diamond is to be built and urge him on by saying, “If you build it, they will come.”
With data science, organizations often believe that if they put the right technologies in place and gather enough data, all their questions will be answered and all their problems solved. They focus their energy on building a data warehouse and collecting massive amounts of data. They make large investments in software to run on large data clusters. With everything in place, they begin to capture large volumes of data. Then, progress grinds to a halt. They have no idea what to do with that data.
To avoid this trap, take the following precautions:
Many organizations underestimate the shift in mindset required for a successful data science initiative. It’s not as simple as having a group of statisticians looking at the data. It’s about treating your organization’s data in a different way — not as a static resource that needs to be controlled, but as a dynamic resource that needs to be explored.
Changing the organization’s mindset involves letting go of strategies that may have worked well in the past. If you want to explore, you can’t have project objectives and planned outcomes. These are often barriers to discovery. You have to be comfortable with the idea that you don’t know where the data may lead.
To avoid this trap of setting objectives and planning outcomes, take the following precautions:
Data science teams should be focused less on objectives and more on exploration, but they still must produce something of value — organizational knowledge and insight. Teams that explore without a framework in place tend to wander. They get lost in the data and often make a lot of insignificant and irrelevant discoveries.
To avoid this trap, approach data science as you would approach agile software development. Work in “sprints” with regularly scheduled team meetings to ask questions, troubleshoot problems, and share stories. Storytelling is a great way to encourage the team to extract meaning and insights from the data. If the team is unable to tell a compelling story with the data and discoveries it has made, it probably isn’t aligning its efforts with the organization’s business intelligence needs.
In 1999, two psychologists conducted an experiment. They filmed six people passing a ball and showed the video to 40 students. Before playing the video, they instructed the students to count the number of times the ball was passed. Most of the students came up with an accurate count. However, none of them mentioned the person in the gorilla suit who walked into the center of the circle, stopped, and then walked off camera. When asked about the gorilla, half the students hadn’t noticed it. In fact, they were so sure it wasn’t there that they had to watch the video again to be convinced. This phenomenon is known as perceptual blindness, or inattentional blindness.
The same phenomenon occurs in organizations, and even with data science teams. People get so caught up in their routine work that they fail to notice what is most significant. On a data science team, members often get so involved in the process of capturing, cleaning, and storing the data that they overlook the purpose of that data.
For example, in a storytelling session, a data analyst drilled down into a data visualization to show the detail behind an ad for a red Ford Mustang. For some reason, this ad had done very well, with a much higher click-through rate than the other ads. One of the stakeholders on the team interrupted the presentation and asked why the ad was so successful. The data science team hadn’t even considered that question.
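As a rough illustration of how a team might surface this kind of standout ad so that the "why" question gets asked early, here is a minimal Python sketch. The ad names, impression counts, click counts, and the "twice the median" threshold are all invented for illustration; they are assumptions, not figures from the session described above.

```python
from statistics import median

# Hypothetical ad records: (ad name, impressions, clicks).
# All values are invented for illustration only.
ADS = [
    ("red_ford_mustang", 10_000, 420),
    ("blue_sedan", 12_000, 130),
    ("silver_suv", 9_000, 95),
    ("white_pickup", 11_000, 120),
]


def click_through_rate(impressions, clicks):
    """Return clicks divided by impressions, or 0.0 when there are no impressions."""
    return clicks / impressions if impressions else 0.0


def flag_standout_ads(ads, ratio=2.0):
    """Return the names of ads whose CTR is at least `ratio` times the median CTR."""
    rates = {name: click_through_rate(imp, clk) for name, imp, clk in ads}
    baseline = median(rates.values())
    # The threshold is an assumed rule of thumb, not a statistical test.
    return [
        name
        for name, rate in rates.items()
        if baseline and rate >= ratio * baseline
    ]


if __name__ == "__main__":
    for name in flag_standout_ads(ADS):
        print(f"{name}: unusually high click-through rate. Why?")
```

A flag like this only points at the anomaly. Explaining it, and deciding whether it matters to the business, is still the team's job.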
When your data science team makes a discovery, it’s not “mission complete.” You have to connect the discovery to real business value, which isn’t always easy, because you rarely start out knowing what you’re looking for. At first, you may not even realize that what you discovered has any business value.
One of the benefits of working in sprints is that you discover and deliver insights a little at a time — every two weeks. With each sprint, you build on what you know. The research lead on the team has the opportunity to evaluate the insights every two weeks and connect them to business value. And if the team is on the wrong path, the research lead can point that out, and the team can discuss ways to change direction.
Forcing your data team to connect insights to business value is a pretty good way to avoid most, if not all, of the most common data pitfalls. As long as the team is focused on extracting valuable business intelligence from the organization’s data, it will remain on the right path.
Data Science vs Software Engineering Projects. Data projects are much more about discovery than scope.
Big data applications will only give you useful insights if you have access to high-quality data. If your data contains a lot of garbage, it will be difficult to learn anything new from it.