Data science, artificial intelligence (AI), and machine learning (ML) are complex fields. Amid this complexity, it is easy to lose sight of the fundamental challenges of executing a data science initiative. In this article, I take a step back to focus less on the inner workings of AI and ML and more on the challenges that often lead to mistakes and failed attempts at weaving data science into an organization's fabric. In the process, I explain how to overcome these key challenges.

Embrace Data Science

The term "data science" is often misinterpreted. People tend to place too much emphasis on "data" and too little on "science." It is important to realize that data science is rooted in science. It is, or at least should be, exploratory. As you begin a data science program, place data science methodology at the forefront:

  1. Observe. Examine your existing data to identify any problems with the data (such as missing data, irrelevant or outdated data, and erroneous data) and to develop a deeper understanding of the data you have. 
  2. Ask interesting questions related to business goals, objectives, or outcomes. Nurture a culture of curiosity in your organization. Encourage personnel at all levels to ask questions and challenge long-held beliefs.
  3. Gather relevant data. Your organization may not have all the data it needs to answer certain questions or solve specific problems. Develop ways to capture the needed data or acquire it from external source(s).
  4. Prepare your data. Data may need to be loaded into your data warehouse or data lake, cleaned, and aggregated prior to analysis.
  5. Develop your model. This is where AI and ML come into play. Your model will extract valuable insights from the data.
  6. Evaluate and adjust the model as necessary. You may need to experiment with multiple models or versions of a model to find out what works best.
  7. Deploy the model and repeat the process. Deliver the model to the people in your organization who will use it to inform their decisions, then head back to Step 1 to continue the data science process.

Get Large Volumes of Relevant Data

Even the most basic artificial neural networks require large volumes of relevant data to enable learning. While human beings often learn from one or two exposures to new data or experiences, modern neural networks are far less efficient. They may require hundreds or thousands of relevant inputs to fine-tune the parameters (weights and biases) to the point at which the network's performance is acceptable.

To overcome this limitation, AI experts have developed a new type of artificial neural network called a capsule network — a compact group of neurons that can extract more learning from smaller data sets. As of this writing, these networks are still very much in the experimental phase for most organizations.

Until capsule networks prove themselves or some other innovation enables neural networks to learn from smaller data sets, plan on needing a lot of high-quality, relevant data.

If you are lacking the data you need, consider obtaining data from external sources. Free data sources include government databases, such as the US Census Bureau database and the CIA World Factbook; medical databases, such as Healthdata.gov and the NHS Health and Social Care Information Centre; Amazon Web Services public datasets; Google Public Data Explorer; Google Finance; the National Climatic Data Center; The New York Times; and university data centers. Many organizations that collect data, including Acxiom, IRI, and Nielsen, make their data available for purchase. As long as you can figure out which data will be helpful, you can usually find a source.

Separate Training and Test Data

There are two approaches to machine learning — supervised and unsupervised learning. With supervised learning, you need two data sets — a training data set and a testing data set. The training data set contains inputs and labels. For example, you feed the network a picture of an elephant and tell it, "This is an elephant." Then, you feed it a picture of a giraffe and tell it, "This is a giraffe." After training, you switch to the testing data set, whose labels are withheld from the network. You feed the network a picture of an elephant, and the network tells you, "It's an elephant." By comparing the network's answers against the known labels, you measure how well its training generalized; if the accuracy is poor, you adjust the model and retrain.

Sometimes, when a data science team is unable to acquire the volume of data it needs to train its artificial neural network, the team mixes some of its training data with its test data. This workaround is a big no-no; it is the equivalent of giving students the answers before a test. The test results would be a poor reflection of the students' knowledge. In the same way, an artificial neural network that has already seen the test data will post accuracy scores that are a poor reflection of how it will perform on new data.

The moral of this story is this: Don’t mix test data with training data. Keep them separate.
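The separation can be enforced mechanically. Below is a minimal sketch in plain Python — the image IDs and labels are hypothetical — showing one common way to carve a labeled data set into disjoint training and test sets:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle and split so the test set never overlaps the training set."""
    rng = random.Random(seed)   # fixed seed for reproducibility
    shuffled = data[:]          # copy; leave the caller's list intact
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labeled examples: (image_id, label)
examples = [(i, "elephant" if i % 2 else "giraffe") for i in range(100)]
train, test = train_test_split(examples)
print(len(train), len(test))        # 80 20
print(set(train).isdisjoint(test))  # True — no leakage between the sets
```

Because the two slices partition one shuffled copy, no example can appear in both sets — exactly the property that keeps your test results honest.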

Carefully Choose Training and Test Data

When choosing training and test data for machine learning, select data that is representative of the task that the machine will ultimately be required to perform. If the training or test data is too easy, for example, the machine will struggle later with more challenging tasks. Imagine teaching students to multiply. Suppose you teach them multiplication tables up to 12 x 12 and then put problems on the test such as 35 x 84. They’re not going to perform very well. In the same way, training and test data should be as challenging as what the machine will ultimately be required to handle.

Also, avoid the common mistake of introducing bias when selecting data. For example, if you’re developing a model to predict how people will vote in a national election and you feed the machine training data that contains voting data only from conservative, older men living in Wyoming, your model will do a poor job of predicting the outcome.

Don't Assume Machine Learning Is the Best Tool for the Job

Machine learning is a powerful tool, but it’s not always the best tool for answering a question or solving a problem. Depending on the nature of the question or problem, other options — from spreadsheets to business intelligence software — may lead to better, faster outcomes.

As you introduce data science, artificial intelligence, and machine learning to your organization, remain aware of the key challenges you face, and avoid getting too wrapped up in the technologies and toolkits. Focus on areas that contribute far more to success, such as asking interesting questions and using your human brain to approach problems logically. Artificial intelligence and machine learning are powerful tools. Master the tools; do not let them master you.

Artificial intelligence and organizations are not always a great fit. While many organizations use artificial intelligence to answer specific questions and solve specific problems, they often overlook its potential as a tool for exploration and innovation — to look for patterns in data that they probably would not have noticed on their own. In these organizations, the focus is on supervised learning — training machines to recognize associations between inputs and labels or between independent variables and the dependent variable they influence. These organizations spend less time, if they spend any time at all, on unsupervised learning — feeding an artificial neural network large volumes of data to find out what the machine discovers in that data.

Observe and Question

With supervised learning, data scientists are primarily engaged in a form of programming, but instead of writing specific instructions in computer code, they develop algorithms that enable machines to learn how to perform specific tasks on their own — after a period of training and testing. Many data science teams today focus almost exclusively on toolkits and languages at the expense of data science methodology and governance.

Data science encompasses much more than merely training machines to perform specific tasks. To achieve the full potential of data science, organizations should place the emphasis on science and apply the scientific method to their data:

  1. Observe
  2. Question
  3. Research
  4. Hypothesize
  5. Experiment
  6. Test
  7. Draw conclusions
  8. Report

Note that the first step in the scientific method is to observe. This step is often overlooked by data science teams. They start using the data to drive their supervised machine learning projects before they fully understand that data.

A better approach is exploratory data analysis (EDA) — an approach to analyzing data sets that involves summarizing their main characteristics, typically through data visualizations. The purpose of EDA is to find out what the data reveals before conducting any formal modeling or hypothesis testing.
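As a concrete (if tiny) illustration, here is a quick EDA-style summary using only Python's standard library; the home prices are hypothetical:

```python
import statistics

# Hypothetical home-price sample (in thousands of dollars)
prices = [210, 340, 185, 410, 275, 305, 198, 520, 260, 330]

# Summarize the main characteristics before any modeling
summary = {
    "count": len(prices),
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "stdev": round(statistics.stdev(prices), 1),
    "min": min(prices),
    "max": max(prices),
}
print(summary)
```

Even a summary this small can flag problems — a suspicious maximum, a lopsided spread — before any model is built.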

Unsupervised learning is an excellent tool for conducting EDA because it can analyze volumes of data far beyond what humans can analyze, it looks at the data objectively, and it provides a unique perspective on that data, often revealing insights that data science team members would never have thought to look for.
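To make that concrete, here is a toy sketch of unsupervised clustering — a minimal one-dimensional k-means written in plain Python with hypothetical order counts; a real project would use a vetted library implementation:

```python
def kmeans_1d(values, k=2, iters=20):
    """A minimal 1-D k-means sketch: group like values without labels."""
    # Crude initialization: spread starting centers across the sorted values
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical daily order counts: two regimes hide in the data
data = [12, 14, 11, 13, 52, 55, 49, 51, 12, 53]
centers, clusters = kmeans_1d(data)
print(sorted(round(c) for c in centers))  # [12, 52]
```

Without ever being told that two regimes exist, the algorithm recovers centers near 12 and 52 — the kind of structure EDA is meant to surface.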

Note that the second step in the scientific method is to question. Unfortunately, many organizations disregard this step, usually because they have a deeply ingrained control culture — an environment in which leadership makes decisions and employees implement those decisions. Such organizations would be wise to change from a control culture to a culture of curiosity — one in which personnel on all levels of the organization ask interesting questions and challenge long-held beliefs.

Nurturing a Culture of Curiosity

People are naturally curious, but in some organizations, employees are discouraged from asking questions or challenging long-held beliefs. In organizations such as these, changing the culture is the first and most challenging step toward taking an exploratory approach to artificial intelligence, but it is a crucial first step. After all, without compelling questions, your organization will not be able to reap the benefits of the business insights and innovations necessary to remain competitive.

In one of my previous posts, Asking Data Science Questions, I present a couple of ways to encourage personnel to start asking interesting questions.

Another way to encourage curiosity is to reward personnel for asking interesting questions and, more importantly, avoid discouraging them from doing so. Simply providing public recognition to an employee who asked a question that led to a valuable insight is often enough to encourage that employee and others to keep asking great questions.

The takeaway here is that you should avoid the temptation to look at artificial intelligence as just another project. You don’t want your data science teams merely producing reports on customer engagement, for example. You want them to also look for patterns in data that might point the way to innovative new ideas or to problems you weren’t aware of and would never think to look for.

In one of my previous articles, How Machines Learn, I present a basic recipe for machine learning, including the essential ingredients and the step-by-step instructions for making it happen. One of the main ingredients is data, and sometimes lots of it. Just as people need data input to learn anything, so do machines. The key difference with machines is that the input needs to be digitized.

Another big difference is that machines are designed and built by humans, typically to perform specific tasks, such as driving a car, estimating a home's market value, recommending products, and so on. To a great degree, the purpose of the machine learning product and the data the machine needs to fulfill that purpose drive the design of the machine. The human developer needs to choose a statistical model that predicts values as close as possible to the ones observed in the data. This is called fitting the model to the data.

Why Fit the Model to the Data?

The purpose of fitting the model to the data is to improve the model's accuracy in the task it is designed to perform. Think of it as the difference between a suit off the rack and a tailored suit. With a suit off the rack, you usually have too much fabric in some areas and not enough in others. A tailored suit, on the other hand, is adjusted to match the contours of the wearer's body. Fitting the model to the data involves making adjustments to the model to optimize the accuracy of the output.

With machine learning, fitting the model involves setting hyperparameters — conditions or boundaries, defined by a human, within which the machine learning is to take place. Hyperparameters include the choice and arrangement of machine learning algorithm(s), the number of hidden layers in an artificial neural network, and the selection of different predictors.

The fine-tuning of hyperparameters is a big part of what data scientists do. They build models, run experiments on small datasets, analyze the results, and tweak the hyperparameters to get more accurate results.
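The tuning loop can be as simple as trying each candidate value and keeping the best. The sketch below tunes k, the number of neighbors in a toy nearest-neighbor classifier; the points, labels, and candidate values are all hypothetical:

```python
from collections import Counter

def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, holdout, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in holdout)
    return hits / len(holdout)

# Hypothetical 1-D points: small values are "a", large values are "b";
# (5, "b") is a deliberately mislabeled (noisy) training point
train = [(1, "a"), (2, "a"), (3, "a"), (40, "b"), (41, "b"), (42, "b"), (5, "b")]
holdout = [(2.5, "a"), (5.5, "a"), (39, "b"), (43, "b")]

# The number of neighbors k is the hyperparameter being tuned
best_k = max([1, 3, 5], key=lambda k: accuracy(train, holdout, k))
print(best_k)  # 3
```

Here k=1 is misled by the one noisy training point near the boundary, while k=3 votes past it — a miniature version of the experiments data scientists run at much larger scale.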

Underfitting and Overfitting

Poor performance of a model can often be attributed to underfitting or overfitting:

  1. Underfitting: The model is too simple to capture the underlying pattern in the data, resulting in high bias.
  2. Overfitting: The model is so complex that it captures the noise in the data along with the signal, resulting in high variance.

The ultimate goal of the tuning process is to minimize bias and variance.

Consider a real-world example. Imagine you work for a website like Zillow that estimates home values based on the values of comparable homes. To keep the model simple, you create a basic regression chart that shows the relationship between the location of a house, its square footage, and its price. Your chart shows that big houses in nice areas have higher values. The model benefits from being intuitive: you would expect a big house in a nice area to be more expensive than a small house in a run-down neighborhood. The model is also easy to visualize.

Unfortunately, this model isn't very flexible. A big house could be poorly maintained. It might have a lousy floor plan or be built on a floodplain. These factors would impact the home's value, but they wouldn't be considered in the model. Because it does not account for enough factors, this model is likely to make inaccurate predictions; it suffers from underfitting, resulting in high bias.

To reduce the bias, you add complexity to the model in the form of additional predictors.

As you add predictors, the model becomes more flexible, but also more complex and difficult to manage. You solved the bias problem, but now the model has too much variance due to overfitting. As a result, the machine's predictions are off the mark for too many homes in the area.

Increasing Signal and Reducing Noise

To avoid underfitting and overfitting, you want to capture more signal and less noise:

  1. Signal: The true underlying pattern in the data that you want the model to learn.
  2. Noise: The random variation in the data that obscures the signal.

In our Zillow example, you can capture more signal by choosing better predictors, such as the number of bedrooms, the number of bathrooms, and the quality of the school system, while eliminating less useful predictors, such as attic or basement storage. As the human developer, you would need to examine the data closely and put careful thought into determining the factors that truly impact a home's value.

Many organizations that try to implement machine learning are guilty of putting the cart before the horse. They build a machine learning team before they have any idea of what they will actually do with machine learning. They have no specific problems to solve or questions to answer that they cannot already solve or answer with their existing tools — their business intelligence software or spreadsheet application. They end up building a knowledgeable data science team, but all the team ends up doing is playing with the technology. Nobody on the team knows enough about the organization to identify areas in which machine learning could be of practical use. They need to start asking data science questions.

How Not to Launch a Machine Learning Initiative

I once worked for an organization whose leadership was committed to machine learning and made a considerable investment in it. They assembled a team of machine learning experts from a local university and provided them access to the organization’s data warehouse. The team built the infrastructure it needed to implement machine learning, but it quickly reached a dead end. Nobody in the organization had given much thought to how this amazing new technology would benefit the organization.

When the team began to ask, “What questions do you need answered?,” “What problems do you need to solve?,” and “What insights would help drive the business?,” nobody had an answer. In fact, nobody in the organization had ever imagined asking such questions. The organization had a strong control culture in place, so employees generally did what they were told. They were not rewarded for asking interesting questions and often felt discouraged from doing so. When they did ask a question, it was something like, "What type of promotions do our customers like?," which is something that could be answered with traditional database or spreadsheet tools.

The members of the machine learning team felt as though they had built a Formula One race car that was just sitting in a garage.

Nurturing a Culture of Curiosity

Whether you have a data science team in place or are planning to create such a team, the first step is to build a culture of curiosity. Start by educating everyone in the organization about machine learning, so that, at the very least, they can recognize various ways it can be applied. Next, encourage everyone in the organization to start asking questions, looking for problems to solve, and sharing their ideas. Machine learning can benefit every team in your organization — including research and development, manufacturing, shipping and receiving, marketing, sales, and customer service. Have each department maintain a list of problems, questions, and desired insights; prioritize the items on the list; and then consider which technology would be the most effective for addressing each item. Keep in mind that the best technology isn't necessarily machine learning; sometimes, all you need is a data warehouse and business intelligence software.

Of course, questions, problems, and desired insights vary depending on the organization; let your own business goals and pain points drive the questions you ask.

There are a couple of concrete ways to encourage people in your organization to start asking interesting questions — above all, reward curiosity rather than punishing it.

Asking questions and calling attention to problems seems like a no-brainer. For any organization to survive and thrive, innovation is essential, and what sparks innovation are compelling questions and difficult problems. Unfortunately, many organizations have a strong control culture in which people are not rewarded and are often punished for asking questions and challenging the status quo. If that sounds like your organization, you need to find a way to break it free from its control culture and make everyone in the organization feel free to share their ideas and concerns.

In one of my previous articles, Supervised and Unsupervised Machine Learning, I pointed out that machine learning can be used to analyze data in four different ways — two of which are predictive, and two of which are descriptive:

Predictive:

  1. Classification: Assigning items to different labeled classes
  2. Regression: Identifying the connection between a dependent variable and one or more independent variables

Descriptive:

  1. Clustering: Creating groups of like things
  2. Association: Identifying associations between things

Understanding Regression Analysis

To understand machine learning regression analysis, imagine those tube-shaped balloons you see at children's parties. You squeeze one end, and the other end expands. Release, and the balloon returns to normal. Squeeze both ends, the center expands. Release one end, and the expanded area moves to the opposite end. Each squeeze is an independent variable. Each bulge is a dependent variable; it differs depending on where you squeeze.

Now imagine a talented balloon sculptor twisting together five or six of these balloons to form a giraffe. Now the relationship between squeezing and expanding is more complex. If you squeeze the body, maybe the tail expands. If you squeeze the head, maybe two legs expand. Each change to the independent variable results in a change to one or more dependent variables. Sometimes that relationship is easy to predict; other times, it may be very difficult.

Business Applications of Regression Analysis

Regression analysis is commonly used in the financial industry to analyze risk. For example, I once worked for a credit card company that was looking for a way to predict which customers would struggle to make their monthly payments. They used a regression algorithm to identify relationships between different variables and discovered that many customers start to use their credit card to pay for essentials just before they have trouble paying their bills. A customer who typically used their card only for large purchases, such as a television or computer, would suddenly start using it to buy groceries and gas and pay their electric bill. The company also discovered that people who had a lot of purchases of less than five dollars were likely to struggle with their monthly payments. 

The dependent variable was whether the person would have enough money to cover the monthly payment. The independent variables were the items the customer purchased and the purchase amounts. Based on the results of the analysis, the credit card company could then decide whether to suspend the customer's account, reduce the account's credit line, or maintain the account's current status in order to limit the company's exposure to risk.

Businesses often use regression analysis to identify which factors contribute most to sales. For example, a company may want to know how to get the most bang for its buck in terms of advertising: Should it spend more money on its website, on social media, on television advertising, or on pay-per-click (PPC) advertisements? Regression analysis can identify which items contribute the most, which contribute less, and which contribute nothing at all. The company can then use the results of that analysis to predict how its various advertising investments will perform.

Identifying the Dependent and Independent Variables

When performing regression analysis, the first step is to identify the dependent and independent variables:

  1. Dependent variable: The outcome you are trying to predict or explain.
  2. Independent variables: The factors you believe influence the dependent variable.

An Important Reminder

Keep in mind that correlation does not prove causation. Just because regression analysis shows a correlation between an independent and a dependent variable, that does not mean that a change in the independent variable caused the change observed in the dependent variable, so avoid the temptation to assume it does.

Instead, perform additional research to prove or disprove the correlation or to dig deeper to find out what's really going on. For example, regression analysis may show a correlation between the use of certain colors on a web page and the amount of time users spend on those pages, but other unidentified factors may be contributing and perhaps to a greater degree. A web designer would be wise to run one or more experiments first before making any changes.

While regression analysis is very useful for identifying relationships between a dependent variable and one or more independent variables, use these relationships as a starting point for gathering more data and developing deeper insight into the data. Ask what the results mean and what else could be driving those results before drawing any hard-and-fast conclusions.

Whenever an organization is looking to extract meaning from data, its leaders would be wise to consult a data scientist — a person who specializes in mining data for information and insights. A data scientist is trained in various disciplines — science, programming, data management, statistics, and machine learning — to collect, analyze, and interpret data, typically in support of the organization's decision-making process.

Specifically, a data scientist performs the tasks described in the following sections.

Supporting the Data-Driven Decision-Making Process

In the past, many organizations based their decisions on organizational leadership's knowledge and insight. If they were honest, these leaders would have to admit that their decision-making process was more art than science. Decisions were based on historical data at best and pure hunches and conjecture at worst. 

With the growing availability of large volumes of diverse data, business intelligence (BI) software, and machine learning, decision-making has become more science than art. Now, machine learning algorithms can make highly accurate predictions and forecasts to guide the decision-making process. Algorithms can also be used to gain highly accurate insights into consumer behavior in order to market products and services to them much more effectively.

Another trend is the democratization of data — the availability of data and analytics at all levels to enable data-driven decision-making throughout the organization, not just at the upper echelons. We are now seeing everyone in a company — marketers, sales reps, customer service reps, product development specialists, and manufacturing supervisors — using BI software to inform their decisions.

Supporting this trend toward greater adoption of data-based decision-making is the data scientist, who ensures that everyone in the organization has access to the data and analytical tools they need.

Mining Data

Much of what a data scientist does involves data mining — the process of extracting value from data by using a combination of database management, statistics, mathematics, and machine learning. Although the methods can be complex, data mining relies primarily on old-school logical processes.

Increased Automation

Data scientists also play a role in artificial intelligence (AI), supporting the drive toward increased automation with their expertise in machine learning. Automation includes expert systems that perform specific, well-defined tasks.

Look for a Scientist Who Works with Data

If you are looking to hire a data scientist, stress the importance of scientist over that of data. A good data scientist thinks like a scientist and strictly adheres to the scientific method:

  1. Observe.
  2. Identify a problem or question.
  3. Research the problem or question.
  4. Develop a hypothesis.
  5. Design an experiment.
  6. Collect and analyze the results.
  7. Formulate a conclusion.

Look for a candidate with an inquisitive and skeptical mind who is also familiar with business intelligence software, in addition to statistics, programming, and machine learning. You want someone who is good at not only answering questions, but, much more importantly, asking the right questions and challenging the answers.

When you’re working with data (regardless of the size of your data sets), you’re likely to encounter two terms that are often confused — data mining and machine learning:

  1. Data mining: The broad process of extracting value from data using a combination of database management, statistics, mathematics, and machine learning.
  2. Machine learning: A set of techniques in which algorithms learn patterns from data and use those patterns to make predictions or decisions.

In short, data mining is much broader than machine learning, but it certainly includes machine learning. 

More About Data Mining

Data mining uses a very broad toolset to extract meaning from data. This toolset includes data warehouses and data lakes to store and manage data; extract, transform, and load (ETL) processes to bring data into the data warehouse; and business intelligence (BI) and visualization tools, which provide an easy means to combine, filter, sort, summarize, and present data in ways similar to (though more sophisticated than) what a spreadsheet application can do.

Visualizations are particularly useful because they reveal patterns in the data that might otherwise go unnoticed.
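Even a crude text histogram illustrates the point. The sketch below buckets hypothetical order values by tens; dedicated visualization tools do the same thing far more elegantly:

```python
from collections import Counter

# Hypothetical order values, bucketed to the nearest $10
orders = [12, 18, 22, 25, 27, 31, 33, 34, 36, 52]
buckets = Counter(10 * (v // 10) for v in orders)
for low in sorted(buckets):
    print(f"${low:>3}-{low + 9}: {'#' * buckets[low]}")
```

The cluster of orders in the $30 range — and the lone outlier above $50 — jump out immediately, in a way a raw column of numbers never would.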

More About Machine Learning

In the context of data mining, machine learning harnesses the computational power of a computer to find patterns, associations, and anomalies in large data sets and use those patterns to make predictions. While BI and visualization tools enable humans to more readily identify patterns in data, machine learning automates the process and often goes one step further to act on the meaning extracted from the data. For example, machine learning may identify patterns in credit card transaction data that are indicative of fraud, then use this insight to classify future transactions as fraudulent or legitimate and block any suspected fraudulent transactions.
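Real fraud models are far more sophisticated, but a toy anomaly check conveys the idea of acting on patterns in transaction data; the amounts below are hypothetical:

```python
from statistics import mean, stdev

# Hypothetical card transactions for one customer (dollars)
amounts = [23, 41, 18, 35, 29, 44, 31, 980, 26, 38]

# Flag anything far outside the customer's normal spending range
mu, sigma = mean(amounts), stdev(amounts)
flagged = [a for a in amounts if abs(a - mu) > 2 * sigma]
print(flagged)  # [980]
```

Any transaction far outside the customer's normal range gets flagged for review — the automated "act on the insight" step described above.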

Machine learning is also useful for clustering — grouping like items in a data set to reveal patterns in the data that humans may have overlooked or never imagined looking for. For example, machine learning has been used in medicine to identify patterns in medical images that help to distinguish different forms of cancer with a high level of accuracy. 

Choosing the Right Approach

When your goal is to extract meaning from data, don't get hung up on the terminology or the differences between data mining and machine learning. Focus instead on the question you’re trying to answer or the problem you’re trying to solve, and team up with or consult a data scientist to determine the best approach.

Think of it this way: Imagine you manage a hospital and you're trying to determine why certain patients have better outcomes than others. You could approach this challenge from several different angles, including these two:

  1. Use BI software to query, filter, and visualize the patient data yourself.
  2. Feed the data to an artificial neural network and let the machine look for patterns.

Each of these approaches has its own advantages and disadvantages. With the BI software approach, you would probably develop a deeper knowledge of the data and be able to explain the reasoning behind the conclusions you've drawn. The process might even lead you to ask more interesting questions. Machine learning with an artificial neural network is more likely to identify unexpected patterns; the machine views the data differently than humans typically do. This approach can also surface non-interpretable patterns, which may make sense to the machine but not to humans.

What's important is that you consider your options carefully. Avoid the common temptation to choose machine learning solely because it is the latest, greatest technology. Sometimes, Excel is all you need to answer a simple question.

How a Data Science Team Can Collaborate Effectively When Managing a Data Product

As I explained in a previous article, “Building a Data Science Team,” a data science team should consist of three to five members, including the following:

  1. Research lead: Knows the business, identifies assumptions, and drives questions.
  2. Data analyst: Prepares data, selects BI tools, and presents the team’s findings.
  3. Project manager: Distributes results, democratizes data, and enforces learning.

Together, the members of the data science team engage in a cyclical step-by-step process that generally goes like this:

  1. Question: The research lead or other members of the team ask compelling questions related to the organization’s strategy or objectives, a problem that needs to be solved, or an opportunity the organization may want to pursue.
  2. Research: The data analyst, with input from other team members, identifies the data sets required to answer the questions and the tools and techniques necessary to analyze the data. The data analyst conducts the analysis and presents the results to the team.
  3. Learn: The team meets to evaluate and discuss the results. Based on what they learn from the results, they ask more questions (back to Step 1). They continue the cycle until they reach consensus or arrive at a dead end and realize that they’ve been asking the wrong questions.
  4. Communicate and implement: The project manager communicates what the data science team learned to stakeholders in the organization who then work to enforce the learning or implement recommended changes.

Data science teams also commonly run experiments on their data to enhance their learning and to ground team collaboration in evidence rather than opinion.

Experiments generally comply with the scientific method:

  1. Ask a question.
  2. Perform background research.
  3. Construct a hypothesis.
  4. Test with an experiment.
  5. Analyze the results and draw conclusions.
  6. Record and communicate the results.
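To make step 4 concrete, here is a minimal sketch, in plain Python with entirely synthetic share counts, of testing a hypothesis such as "the new headline style increases shares" with a simple permutation test. All numbers are invented for illustration:

```python
import random

random.seed(42)

# Hypothetical share counts for articles published without and
# with a new headline style (synthetic data for illustration).
control = [12, 9, 15, 7, 11, 14, 8, 10]
variant = [18, 13, 21, 16, 12, 19, 15, 17]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(variant) - mean(control)

# Permutation test: shuffle the pooled values many times and count how
# often a difference at least as large as the observed one arises by chance.
pooled = control + variant
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[len(control):]) - mean(pooled[:len(control)])
    if diff >= observed:
        count += 1

p_value = count / trials
print(f"observed difference: {observed:.2f} shares, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone, which is the kind of conclusion the team would record and communicate in steps 5 and 6.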

Example

Suppose your data science team works for an online magazine. At the end of each story posted on the site is a link that allows readers to share the article. The data analyst on the team ranks the stories from most shared to least shared and presents the ranking to the team for discussion.
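The analyst's ranking step is simple to sketch in a few lines; the titles and share counts below are made up for illustration:

```python
# Hypothetical (title, share count) pairs standing in for the
# magazine's real tracking data.
articles = [
    ("Sneak Peek at Next Season", 1840),
    ("Editor's Letter", 120),
    ("Insider Rumors Roundup", 2310),
    ("Quarterly Industry Report", 430),
]

# Rank from most shared to least shared.
ranked = sorted(articles, key=lambda a: a[1], reverse=True)

for rank, (title, shares) in enumerate(ranked, start=1):
    print(f"{rank}. {title}: {shares} shares")
```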

The research lead asks, “What makes the top-ranked articles so popular? Are articles on certain topics more likely to be shared? Do certain key phrases trigger sharing? Are longer or shorter articles more likely to be shared?”

Your team works together to create a model that reveals correlations between the number of shares and several variables, such as the article's topic, author, length, and key phrases.

The research lead is critical here because she knows the most about the business. She may know that certain writers are more popular than others or that the magazine receives more positive feedback when it publishes on certain topics. She may also be best at coming up with key words and phrases to include in the correlation analysis; for example, key words and phrases such as “sneak peek,” “insider,” or “whisper” may suggest an article about industry rumors that readers tend to find compelling. The analyst can then visualize the resulting correlations so that people without a data skill set can understand the findings.
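A first pass at a correlation analysis like the one described might look as follows, using NumPy's `corrcoef` on synthetic data; the variables and numbers are invented for illustration:

```python
import numpy as np

# Synthetic per-article data: word count, whether the title contains a
# "rumor" phrase such as "sneak peek" (1/0), and share count.
word_count = np.array([800, 1200, 650, 1500, 900, 700, 1100, 600])
has_rumor_phrase = np.array([1, 0, 1, 0, 0, 1, 0, 1])
shares = np.array([2100, 400, 1800, 350, 500, 1900, 450, 2200])

# Pearson correlation of each candidate variable with shares.
r_length = np.corrcoef(word_count, shares)[0, 1]
r_phrase = np.corrcoef(has_rumor_phrase, shares)[0, 1]

print(f"length vs. shares:       r = {r_length:+.2f}")
print(f"rumor phrase vs. shares: r = {r_phrase:+.2f}")
```

In this toy data, the rumor-phrase flag correlates strongly with shares while article length correlates negatively, which is the kind of pattern the team would then discuss and probe with further questions.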

Based on the results, the analyst develops a predictive analytics model to be used to forecast the number of shares for any new articles. He tests the model on a subset of previous articles, tweaks it, tests it again, and continues this process until the model produces accurate “forecasts” on past articles.
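The test-tweak-retest loop can be sketched as a simple backtest: fit a model on some past articles, then score its "forecasts" on held-out ones. This toy version uses a one-variable linear model and invented numbers, not the analyst's actual model:

```python
import numpy as np

# Hypothetical past articles: word count and observed share count.
word_count = np.array([600, 650, 700, 800, 900, 1100, 1200, 1500], dtype=float)
shares = np.array([2200, 1800, 1900, 2100, 500, 450, 400, 350], dtype=float)

# Hold out every third article for testing; train on the rest.
test_idx = np.arange(len(shares)) % 3 == 0
train_idx = ~test_idx

# Fit a simple linear model: shares ~ slope * word_count + intercept.
slope, intercept = np.polyfit(word_count[train_idx], shares[train_idx], deg=1)

# Evaluate the "forecasts" on the held-out past articles.
predicted = slope * word_count[test_idx] + intercept
mae = np.mean(np.abs(predicted - shares[test_idx]))
print(f"mean absolute error on held-out articles: {mae:.0f} shares")
```

The analyst would repeat this cycle, adding variables or changing the model, until the error on held-out past articles is acceptably low.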

At this point, the project manager steps in to communicate the team’s findings and make the model available to the organization’s editors, so it can be used to evaluate future article submissions. She may even recommend the model to the marketing department to use as a tool for determining how to charge for advertising placements — perhaps the magazine can charge more for ads that are positioned alongside articles that are more likely to be shared by readers.

Striving for Innovation

Although you generally want to keep your data science team small, you also want people on the team who approach projects with different perspectives and have diverse opinions. Depending on the project, consider adding people to the team temporarily from different parts of the organization. If you run your team solely with data scientists, you’re likely to lack a significant diversity of opinion. Team member backgrounds and training will be too similar. They’ll be more likely to quickly come to consensus and sing in a chorus of monotones.

I once worked with a graduate school that was trying to increase its graduation rate by looking at past data. The best idea came from a project manager who was an avid scuba diver. He looked at the demographic data and suggested that a buddy system (a common safety precaution in the world of scuba diving) might have a positive impact. No one could have planned his insight. It came from his life experience.

This form of creative discovery is much more common than most organizations realize. In fact, a report from the patent office suggests that almost half of all discoveries are the result of simple serendipity: a team was looking to solve one problem, and then someone's insight or experience led it in an entirely new direction.

So what is data visualization? Data visualization is the process of communicating data graphically, in the form of tables, graphs, maps, timelines, matrices, tree diagrams, flow charts, and so on. Its purpose is to convey relationships, comparisons, distributions, compositions, trends, and workflows more clearly and succinctly than words alone can. You can think of a data science team’s reports as employing two forms of communication: words and visuals.

When building a report, the data science team combines the two forms of communication to tell the story revealed by the data with maximum clarity and impact. Visuals often provide the means of communicating complex information and insights with the greatest simplicity and effectiveness. Often, the audience immediately “gets it” upon viewing a simple graphic that summarizes the data.

Choose the Right Chart Type

When doing data visualizations, a key first step involves choosing the chart type that’s the best fit for the data and what you’re trying to illustrate. The following table provides general guidance to help you make the right choice.

Compare values: Bar, Column, Line, Pie, Scatter plot, Spider chart

Show composition: Area, Pie, Stacked bar, Stacked column, Waterfall

Show distribution: Bar, Column, Line, Scatter plot

Show trends: Column, Dual-axis line, Line

Show relationships: Bubble, Line, Scatter plot

Show locations: Map

Keep in mind that content and purpose should drive form. Don’t choose a chart or other visual just because it looks pretty. I’ve seen some beautiful charts that do a poor job of communicating the data, as well as ugly charts that are very informative. Ideally, you want a beautiful chart that’s informative and communicates the point you’re trying to make. However, if you have to make trade-offs, clarity trumps beauty.
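The guidance in the table can also be captured in a small lookup helper. This is a toy sketch, with the purposes and chart lists taken straight from the table above:

```python
# Chart-type guidance from the table, keyed by purpose.
CHART_GUIDE = {
    "compare values": ["bar", "column", "line", "pie", "scatter plot", "spider chart"],
    "show composition": ["area", "pie", "stacked bar", "stacked column", "waterfall"],
    "show distribution": ["bar", "column", "line", "scatter plot"],
    "show trends": ["column", "dual-axis line", "line"],
    "show relationships": ["bubble", "line", "scatter plot"],
    "show locations": ["map"],
}

def suggest_charts(purpose: str) -> list[str]:
    """Return candidate chart types for a purpose, or an empty list."""
    return CHART_GUIDE.get(purpose.lower().strip(), [])

print(suggest_charts("Show trends"))  # → ['column', 'dual-axis line', 'line']
```

A lookup like this only narrows the candidates; as noted above, the final choice should still be driven by clarity, not appearance.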

A Team Sport

Creating data visualizations is a team sport. The data analyst should work closely with the other members of the data science team to develop data visualizations that communicate the data most effectively. If the data analyst has to explain the charts to the research lead, they’re probably too complex for other stakeholders in the organization. The team is a good testing ground for ensuring that the visuals in a report will be effective.

Remember that your team works together to explore the data, which means that the majority of the first round of reports you design will be for each other. The research lead drives interesting questions; the data analyst creates a quick and dirty report to explore possible answers; and then the team might come up with a whole series of new questions. This means that most of your initial data visualizations will be quick exchanges — more like visual chitchat than full data reports.

After the team reaches consensus on the data and the visuals, spend some time polishing the data visualizations to share them with the rest of the organization. Your final data visualizations should be even simpler and easier to understand than the versions you shared with team members.

Work in Cycles

Think of your first round of data visualizations as whiteboard presentations in your data science team meetings. Although you’ll probably do most, if not all, of your data visualizations on a computer, treat them like mock-ups or scribbles on a whiteboard. These data visualizations may be oversimplified. Their purpose is to initiate productive and creative discussions. You may start with a quick and simple scatter plot or linear regression chart and then fine-tune it as you ask more questions and collect and analyze more data. Obtaining and responding to feedback from other team members is the best way to create effective and attractive data visualizations.

Your best charts will be the product of an emergent design. Start with simple reports and improve them over time. You’ll produce much better reports by going through several revisions.

Recommended Books on Data Visualization

If you’re interested in discovering more about data visualization, I recommend the following two books:

Note: There’s typically nothing in the training of data analysts that prepares them for producing good visualizations. Most graduate programs are still very much rooted in math and statistics. Good data visualization relies on aesthetics and design. It’s a learned skill and may not come easily.

Building a data science culture means different things to different organizations. It may mean introducing a new data science team to the organization, democratizing the data so everyone has access to the data and the business intelligence (BI) tools to do their jobs, or encouraging the entire organization to develop a data-science mindset.

Whatever the meaning, this kind of organizational change is difficult, especially if your organization strongly resists any major change — and many do. To effect a big change, you need some degree of competence in the field of change management: strategies and techniques to prepare, support, and assist individuals, teams, and organizations in adapting to new ideas.

Although change management is a complex topic, in this post I offer several suggestions to overcome common obstacles in implementing any change, including a change in your organization's culture.

Start with a Plan

Changing an organization's culture is an ongoing, often cyclical process, but before you start, draw up a linear step-by-step plan to ensure that you set out in the right direction. Here's a sample plan that you may want to tweak for your own use:

  1. Identify your organization's existing culture. See my previous post "Identifying Your Organization’s Culture." By knowing your existing culture, you have a better idea of the obstacles you're likely to encounter.
  2. Assemble a team of like-minded individuals — proponents of data science. As I explain in a previous post, "Busting Common Myths of Organizational Change," some people are more receptive to change than others. When recruiting members for your team, look for natural innovators and early adopters.
  3. Find a high-level sponsor, if possible. An executive or someone in senior management would be a good choice. A high-level sponsor can be very helpful in championing your cause. If you can't find a high-level sponsor, however, you can still effect the desired change — you simply need to work with your team to implement the change from the bottom up.
  4. Start with one small team. If you go too big too soon, you may meet with heavy resistance, and any failures will be amplified. A small team can work below the radar until it has achieved some success.
  5. Celebrate the wins. When the data science team answers a compelling question, helps the organization overcome a challenge or solve a problem, or introduces an innovation, make sure everyone in the organization hears about it.

Get More than Superficial Support from Your Top-Level Sponsor

Having a top-level sponsor to cheer on your team while you do the hard work to effect a change is better than having no top-level support at all. However, any tangible support your top-level sponsor provides adds fuel to the tank and sends a signal to the rest of the organization that people at the top truly support your efforts. Tangible support may be provided in various forms.

Set Reasonable Expectations

Transforming a culture in which status and expertise drive the decision-making process to one in which data drives the process requires a major overhaul in how everyone in the organization thinks. It requires a never-ending process of continuous improvement. If your expectations are too high regarding the level of change and the time in which it occurs, you and others may get discouraged when you don't see quick, dramatic improvements.

To improve your chance of long-term success, manage everyone's expectations, including your own. Prepare your organization for a long and bumpy ride. Steer clear of quick fixes. Slow and steady wins the race. While this approach may sap some of the energy that drives change, it will help to prevent major disappointments, which tend to threaten overall success.

Change Minds, Not Just Infrastructure and Processes

Building a data science culture is about much more than building a data warehouse and rolling out state-of-the-art business intelligence tools. It's about changing the way people think about what they do and how they do it. According to some schools of thought, you can change people’s thinking by changing their behaviors. Others believe that you can change people’s behaviors by changing their thoughts. I recommend doing both: change behaviors to shift thinking, and change thinking to shift behaviors.

Listen to the Skeptics

In any organization, you'll find pockets of resistance and even vocal critics of any proposed change. Don't ignore this resistance or merely try to steamroll a change over or past your critics. Listen to them and engage them in discussion. If data science truly holds value for your organization, you should have no trouble convincing skeptics. In addition, your critics may point out real weaknesses in your plan that you need to address for a successful implementation.

Don't Rely Solely on Outside Consultants to Drive Change

Many organizations hire outside consultants to implement a desired change in the organization. Some even treat consultants as disposable change agents — hiring a consultant to drive the change and then firing her when it fails. This practice gives management a convenient scapegoat.

A better approach is to choose a well-respected, longtime employee to drive the change internally with the mindset that the change is inevitable — failure is not an option. One or more consultants can then be brought in to provide expert knowledge and insight on how to implement a data science team more effectively. A charismatic insider has skin in the game and can lead the charge more effectively, communicating in a language the rest of the organization understands and using examples that resonate with its existing culture.
