How to deal with missing data in data science and possible solutions to avoid coming to the wrong conclusions.
In a previous post, "Data Science Projects," I point out some of the key differences that separate data science from traditional project management. While traditional project management is focused more on goals, planning, and tangible deliverables, data science is a more open-ended operation with the focus on discovery and innovation — less tangible, but no less valuable, deliverables.
To arrive at a deeper understanding of the differences between traditional project management and data science, consider the unique challenges of a data science project:
Let’s look at a traditional project and compare it to what a data science team does. Then, we'll look at what often happens when traditional project management is applied to a data science team.
Consider a typical software project. Your organization wants to develop a human resources (HR) self-help portal for its employees. The project charter is to create the portal as a way to lower costs and improve overall employee satisfaction. The project will have a set cost, but the organization will save money by reducing HR costs and employee turnover. The estimated return on investment (ROI) for this project is substantial. The plan lays out all the features in a requirements document and includes a development schedule and detailed budget. The project manager will oversee development and update the plan to account for any changes in schedule, budget, or product requirements.
In contrast, consider how a data science team operates. The team is small: four to five people, including a research lead, a couple of data analysts, and a project manager. Their "mission" is to help the organization better understand its customers’ needs and behaviors, in the hope that this deeper understanding reveals opportunities to generate more revenue.
The research lead starts by asking questions such as these:
The data analysts do their job — analyze the data — to come up with answers to these questions. They deliver the answers in the form of data visualizations — graphic summaries of the data. For example, the data visualizations may be graphs that shed light on customer income and spend, as shown here. The x-axis (horizontal) represents income, and the y-axis (vertical) represents spending. Note that customers with higher incomes don’t necessarily spend more. Those who have an income around $20k–$30k seem to spend the most.
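As a rough illustration of the kind of summary that sits behind an income-versus-spending chart, here is a minimal Python sketch. The customer records, the `average_spend_by_bracket` helper, and the $10,000 bracket width are all hypothetical, chosen only to show the idea; they are not the team's actual data or method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (annual income, annual spend) pairs in dollars.
# These values are invented for illustration only.
customers = [
    (18_000, 900),
    (24_000, 2_100),
    (27_000, 2_400),
    (45_000, 1_500),
    (62_000, 1_300),
    (88_000, 1_600),
]

def average_spend_by_bracket(records, bracket_size=10_000):
    """Group records into income brackets and return mean spend per bracket.

    Each bracket is keyed by its lower bound (for example, 20000 covers
    incomes from 20,000 up to 29,999 when bracket_size is 10,000).
    """
    buckets = defaultdict(list)
    for income, spend in records:
        lower = (income // bracket_size) * bracket_size
        buckets[lower].append(spend)
    return {lower: mean(spends) for lower, spends in sorted(buckets.items())}

summary = average_spend_by_bracket(customers)

# The bracket whose customers spend the most on average.
top_bracket = max(summary, key=summary.get)

for lower, avg in summary.items():
    print(f"${lower:,}-${lower + 9_999:,}: average spend ${avg:,.0f}")
print(f"Highest average spend: bracket starting at ${top_bracket:,}")
```

A table like this is usually the step before plotting: once spending is summarized per bracket, it becomes easier to see whether a particular income range stands out and to decide which follow-up questions are worth asking.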
The analysts could also look at data from social media platforms and create a word cloud of feedback from thousands of customers, as shown below. For example, some of the largest words in the word cloud are “travel,” “recipe,” and “restaurant.”
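A word cloud is essentially a picture of word counts: the more often a word appears, the larger it is drawn. The sketch below shows that counting step in Python using only the standard library. The comments, the stop-word list, and the `word_frequencies` helper are hypothetical and stand in for whatever feedback and tooling a real team would use.

```python
import re
from collections import Counter

# Hypothetical customer comments, invented for illustration only.
comments = [
    "Loved the travel tips and the pasta recipe",
    "Best restaurant guide for travel I have found",
    "Please post another recipe like this one",
    "Travel deals helped me pick a restaurant abroad",
]

# A tiny, assumed stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "and", "for", "i", "a", "me", "this", "like", "have", "one"}

def word_frequencies(texts, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in stop_words)
    return counts

frequencies = word_frequencies(comments)

# The most frequent words are the ones a word cloud would draw largest.
for word, count in frequencies.most_common(3):
    print(f"{word}: {count}")
```

The counts themselves carry the insight; the word cloud is just one way of presenting them to people outside the team.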
Based on the knowledge and insight gleaned from these data visualizations, the team is likely to ask more questions, such as "Why do customers in a certain income bracket spend more than customers in higher or lower income brackets?" and "Why do our customers like to travel?" and "When our customers travel, where are they most likely to go?"
As you can imagine, knowing more about customers can lead to higher sales. The team could then share its discoveries with others in the organization. Marketing may decide to advertise more in travel magazines. Product development may shift its focus to products that are more closely related to travel. Sales might focus more of its efforts on customers in a specific income bracket.
Then again, the team may hit a dead end. A data visualization created to analyze spending patterns among customers who travel and those who don't is inconclusive, as shown below. It reveals only that customers who travel outspend those who don't by a relatively small margin. It also shows that customers who travel visit a variety of destinations around the world, and that the total spend by customers who travel to those destinations is no greater than the total spend by customers who don't travel. The data visualization doesn't provide sufficient evidence to support a change in what the company is doing, so the team abandons this line of inquiry and shifts direction.
Imagine trying to shoe-horn data science into a traditional project management framework. How would you define the scope of the project when your exploration can lead you in so many different directions? How can you meet predetermined milestones when you're building an ever-increasing body of knowledge and insight about your customers? How can you possibly meet a deadline when you don't know, specifically, what you're looking for? How do you budget for time when you have no idea how long it will take to find the answers?
Data science is all about learning, and "learning" is a verb. Specifically, it is a verb in the form of a present participle, which conveys continuous action. Data science is engaged in ongoing discovery and innovation. It doesn't conform to the traditional project management framework. Don't try to force it to.