“Data scientist” is harder to define than the terms used for other scientists, such as chemist, biologist, geneticist, or meteorologist. Part of the problem is that “data science” became a commonly used term long before it became a formal field of study. Even now, people who call themselves data scientists come from diverse fields and industries and interact with data in different ways. Some work more as database administrators (DBAs), others lean toward statistical analysis, and still others focus most of their efforts on writing algorithms. (An algorithm is a process or set of rules for performing calculations, processing data, or solving problems.)
Simply put, a data scientist is anyone who extracts value from data using a variety of skills, tools, and methods, including human logic, statistical analysis, machine learning, and visualization software. Specifically, data scientists do the following:
- Identify areas in which data can be used to the benefit of the organization
- Gather data from internal or external sources to answer questions, solve problems, and automate processes
- Clean and organize the data to prepare it for use
- Choose appropriate tools and methods for extracting value from data
- Clearly communicate the knowledge and insights gleaned from the data or provide the stakeholders with visualization tools that support self-service analytics
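To make the "clean and organize" step a little more concrete, here is a minimal Python sketch. The records, field names, and the `clean` helper are all hypothetical and invented for illustration; real projects usually involve far messier data and more careful rules about what to drop.

```python
# Hypothetical raw records, as they might arrive from two different sources.
raw_records = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "bob", "amount": ""},          # missing amount
    {"customer": "Alice", "amount": "120.50"},  # duplicate of the first row
    {"customer": "Carol", "amount": "80"},
]


def clean(records):
    """Normalize names, skip rows with no amount, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        name = rec["customer"].strip().title()
        amount_text = rec["amount"].strip()
        if not amount_text:
            continue  # assumption: rows without an amount are unusable
        row = (name, float(amount_text))
        if row in seen:
            continue  # already kept an identical row
        seen.add(row)
        cleaned.append({"customer": name, "amount": row[1]})
    return cleaned


cleaned = clean(raw_records)
total = sum(rec["amount"] for rec in cleaned)
print(cleaned)
print(total)
```

Whether to discard incomplete rows or try to fill them in is a judgment call that depends on the question being asked, which is one reason this step takes more thought than it might appear.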
If you’re a statistician, a data analyst, or a mathematician who specializes in developing machine learning algorithms, you can probably make a strong case that you’re a data scientist. However, as this field becomes more established, more and more organizations are looking for candidates who have a standardized skill set. Several universities, including Berkeley, Syracuse, and Columbia, are already moving in this direction, offering degree programs in the field of data science. Graduates are expected to have a wide variety of skills in the following areas:
- Computer science/programming
- Data warehousing
- Storytelling and data visualization
- Structured Query Language (SQL)
- Machine learning
Asking Interesting and Relevant Questions
A large part of what a data scientist does is ask interesting questions that are relevant to furthering (or challenging) the organization’s strategy and objectives.
Over the last 20 years, most organizations focused on increasing their operational efficiency by streamlining their business processes. They asked operational questions such as, “How can we work smarter, instead of harder?” and “How can we implement new technologies to save time and money?”
Data science is different; it isn’t objective-driven. It’s exploratory and follows the scientific method. It’s not about how well an organization operates; it’s about gaining useful business knowledge and insight. Part of the role of a data scientist is to work with leaders and other stakeholders in an organization to ask interesting and relevant questions and mine the data for answers. These questions are less objective-driven and more business-intelligence-driven, such as:
- What do we know about our customer?
- How can we deliver a better product?
- Why are we better than our competitors?
These are all questions that require a higher level of organizational thinking, and most organizations aren’t ready to ask these types of questions. They are driven to set milestones and create budgets. They haven’t been rewarded for being skeptical or inquisitive.
Data scientists engage in data mining, the process of extracting value from data by using a combination of database management, statistics, mathematics, and machine learning. Although the methods can be complex, data mining relies primarily on old-school logical processes, including the following:
- Descriptive statistics: Analyzing, describing, or summarizing data in a meaningful way to discover patterns in the data.
- Probability: Gauging the likelihood that something will happen.
- Correlation: Measuring the degree to which two things are related.
- Causation: Determining whether one event is actually the result of another event, rather than merely correlated with it.
- Predictive analytics: Applying statistical analysis to historical data in an attempt to predict the future.
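A few of the processes listed above can be sketched in a handful of lines of Python. The numbers below are made up purely for illustration, and the `pearson` and `empirical_probability` helpers are hypothetical names written for this example rather than part of any particular toolkit.

```python
from statistics import mean, median, stdev

# Hypothetical monthly figures, invented for illustration only.
ad_spend = [10, 12, 15, 17, 20, 22]
sales = [100, 110, 135, 150, 170, 195]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def empirical_probability(values, predicate):
    """Share of observed values that satisfy the predicate."""
    return sum(1 for v in values if predicate(v)) / len(values)


# Descriptive statistics: summarize the sales figures.
summary = {"mean": mean(sales), "median": median(sales), "stdev": stdev(sales)}

# Correlation: how closely do ad spend and sales move together?
r = pearson(ad_spend, sales)

# Probability: how often did monthly sales exceed 140 in this sample?
p_high = empirical_probability(sales, lambda s: s > 140)

print(summary)
print(round(r, 3), p_high)
```

A coefficient close to 1 says only that the two series rise together in this sample. Deciding whether ad spend actually drives sales is the causation question, and it typically calls for an experiment or additional evidence.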
Delivering the Goods
One of the best ways to understand what any professional does is to look at what they produce or deliver — the fruits of their labor. Deliverables for data scientists include the following:
- Automated processes (such as loan approval/rejection)
- Classifications (for example, distinguishing credit card charges as valid or fraudulent)
- Forecasts (for example, sales and revenue)
- Pattern detection without classification (for example, to identify patterns in medical lab tests that indicate the presence of certain diseases)
- Predictions (for example, assessing a driver’s risk of getting into a serious accident based on demographics and driving records)
- Product recommendations (for customers based on their purchase history or searches they’ve conducted)
- Recognition (for example, facial or speech recognition)
Data science is all about harnessing the power of data to gain knowledge and insight, solve problems, automate processes, and make better decisions. The data scientist plays a key role in this process.