The key ingredient for a successful data scientist is a curious, skeptical, and innovative mind. However, data scientists also need to be familiar with the tools of the trade, various technologies, which can be broken down into the following categories:
- Data storage: Hardware and software used to capture, organize, and store data. More and more organizations are moving toward storing and processing data remotely in cloud-based data warehouses, which offer virtually unlimited storage and compute resources. Whether the data storage hardware is on-premises or in the cloud, organizations have a wide variety of database software from which to choose, including Oracle, MySQL, PostgreSQL, and Hadoop.
- Extract, transform, and load (ETL): When organizations migrate to a new data warehouse or import external data, they must perform ETL. Extract is the process of reading data from the database. Transform translates the extracted data into the format required by the new database, so the new data can be used with existing data. Load is the process of writing the new data to the target database. Popular ETL tools include IBM InfoSphere DataStage, Informatica PowerCenter, Microsoft SQL Server Integration Services (SSIS), and Oracle Data Integrator.
- Data cleansing: Sometimes included in the ETL process is data cleansing (or scrubbing), which consists of amending data or deleting duplicate, incorrectly formatted, incorrect, or incomplete data. Popular data cleansing tools include OpenRefine, Trifacta Wrangler, Drake, and TIBCO Clarity. Text editors, scripting tools, and programming languages such as Python and Scala can also be used for this purpose.
- Database query: A database query is a request for data from one or more tables in a database. A query extracts a subset of data from a database and may combine data from two or more tables. Database query tools include structured query language (SQL, pronounced “sequel”), NoSQL, and PostgreSQL.
- Business intelligence (BI) and visualization: BI and visualization tools extract, analyze, and transform data into meaningful information, typically displaying the results in reports, graphs, tables, maps, and other graphic formats that make the results of the analysis more understandable. Popular business intelligence software includes Sisense, Looker, Tableau, Answer Dock, and SAP Business Intelligence.
Moving Beyond the Tools
Keep in mind that tools are just that — devices for reducing the amount of work required. Choosing the best tools available is certainly important. For example, many organizations currently struggle to manage “big data” because they made huge investments in on-premises data warehouses that run Hadoop, which requires highly specialized database administrators (DBAs) just to keep it up and running. However, choosing the right tools is only a small part of extracting value from data. Organizations also need to come up with creative uses for that data.
Imagine you manage a website for connecting prospective car buyers to dealers. You build hundreds of information tags into the website that indicate when a shopper is hovering over a clicking a link. All the data flows into your data warehouse, representing several terabytes of data per week. You’ve collected historical data spanning several years. Collecting the data was relatively easy. The challenge is figuring out what to do with that data.
This seems like a common challenge for many organizations starting out in data science. They approach it mostly as an operational challenge and focus on collecting the data because it’s relatively cheap and easy to understand. It’s meeting-friendly and everyone can get behind the effort. They’ll even create multiple clusters or data lakes to pool their data from across the organization. But that’s the easy part. What organizations struggle with is the science of data science. They’re unaccustomed to asking and answering interesting questions.
As the data scientist for this imaginary car website, think about the questions you could ask to extract value from the data. You could examine the data to determine whether customers were more likely to click on a car if it were red, blue, or yellow. If the reports showed that customers are 2% more likely to click on a car if it’s red, the organization could share that with car dealerships to increase their sales and revenue. You could also experiment with the number of cars listed on each page to determine the optimum number of listings per page.
As a data scientist, this is the type of empirical research you should be considering. Using the tools at your disposal, you need to ask interesting questions, run experiments, and produce well-designed reports to convey the newfound knowledge and insight to your organization’s decision-makers and stakeholders.
Remember, it’s less about the tools and more about what you can do with those tools. A master carpenter can do wonders with a hammer, a handsaw, and a few other basic tools and materials, whereas someone unskilled in carpentry would have trouble building a simple chair in a fully-equipped woodworking shop. As a data scientist, you want it all — the skills and expertise along with the best tools available!