When organizations capture and analyze big data to extract knowledge and insight from it, they often must aggregate three diverse data types:
In this post, I highlight the differences among these three data types.
Comprising about 10 percent of all available data, structured data is typically stored in a relational database consisting of interrelated tables. Every record has a unique identifier (a primary key), and every data entry is “labeled” with a field name to simplify the process of using and managing the data. Structured data is like the bricks and mortar of the database world. It’s cheap, inflexible, and requires a lot of upfront development.
A good example of structured data is a typical office spreadsheet. When you fill rows with data, you have to stick to a pretty rigid format and structure. For example, suppose you have a column called “Purchase Date.” Each entry in that column must be in a specific format. You can’t have “Tuesday” as one entry and “March” as another. For a purchase date, you want to specify the day, month, and year, and you want all the entries to be consistent. One way to ensure consistency is to create a rule that every entry in the Purchase Date column must be in the MM/DD/YYYY format.
Without a consistent format, you would have a difficult time extracting specific data and using it to create a report, for example. Imagine trying to create a sales report for a certain date range if some dates were spelled out (such as February 2, 2019) while others were numerical (such as 03/29/2020) and some were abbreviated (such as Jan 23, 2020). Any query used to extract date information would need to cover all the variations in the date format.
Most data that an organization generates and uses internally is transaction data, which can be highly structured. Sources of structured data include the following:
With structured data, you have all your data neatly arranged in a relational database, where everything is labeled and organized. You know where everything is, and you know exactly where to find it. It’s like storing spices in labeled jars arranged alphabetically in a spice rack. You know where everything is and you know exactly where to find it.
Semi-structured data contains tags or other markers to “label” data entries without forcing the data into a strict structure such as a table. Semi-structured data is typically stored in a text file that includes metadata— the tags or markers that identify or provide additional information about the data entries. Files that include semi-structured data are commonly saved in one of the following formats, which are generally easy for humans to read and understand:
Sources of semi-structured data include the following:
The most common type of data is everything that isn’t structured or semi-structured: it is unstructured data. Some analysts estimate that 80 percent of all data is unstructured. When you think about it, this makes a lot of sense. Think about the data you encounter every day:
What does all this data have in common? Not much, and that’s part of the problem. None of this data is structured in any uniform way. The data contained in these files are not arranged in tables or entered into specific fields. The content may or may not be tagged to describe what it contains. Even the file formats differ. If you had a dozen documents you couldn’t open, you’d struggle to figure out what was in each one. The challenge would be even greater if you needed to extract details from the documents and aggregate relevant data.
Fortunately, companies such as Google have been working for years to overcome these limitations. Every time you search Google, Bing, or Yahoo!, you reap the fruits of their labor. Whenever search for something on these sites, you’re presented with a long list of links to a variety of content, including web pages, Word documents, PDFs, images, audio, and video. Organizations in diverse industries use many of the same and similar technologies to capture, store, retrieve, aggregate, and analyze unstructured data.
Putting Your Data to Work
Suppose you own a business that sells running shoes online and you want to identify your best customers. First, you may want to define what you mean by “best customer.” Is your best customer the one who spends the most money in your webstore? Is it the person who suggests ideas for improving your products? Is it the customer who recommends your product the most on social media? Maybe your best customer meets all these criteria.
If your definition of best customer is the person who spends the most in your webstore, all you need is a simple query to your transactional database to rank customers in order from those who spent the most to those who spent the least. For a more sophisticated 360-degree perspective, you could aggregate data from your transactional database, from social media sites such as Facebook and Twitter, and from other sources. Using customer names and email addresses you already have, you can crawl social sites to find out what customers posted about your product — maybe a comment on Facebook or Twitter or a video on YouTube. You can then analyze the aggregated data (structured, semi-structured, and unstructured) to identify patterns that are characteristic of a “best customer.”
As time goes on, you can capture more and more of your customers’ unstructured data, which will allow you to ask more sophisticated questions about your customers. For example, you may want to know which races they run in, so you can sponsor those races. Or you may wonder what other running products they purchase, so you can figure out how to expand your product selection. Answers to each of these questions can help you connect with your customers and sell more products.
As you begin to work with data, realize that all types of data have value, whether structured, semi-structured, or unstructured. Part of your job is to figure out ways to extract value from data regardless of how structured or unstructured that data may be.
Businesses and other organizations typically have two types of database management systems (DBMSs) — one for online transactional processing (OLTP) and another for online analytical processing (OLAP):
Traditional databases are optimized for OLTP, where the emphasis is on capturing transactional data in real time, securing transactional data, maintaining data integrity, and processing queries as quickly as possible. On the other hand, enterprise data warehouses (EDWs) are optimized for OLAP, where the emphasis is on capturing and storing large volumes of historical data, aggregating that data, and mining it for business knowledge and insights to support data-driven decision-making.
The following table highlights the differences between OLTP and OLAP
|Purpose||To capture and store transactional data in support of daily business operations||To gain business insight, solve problems, support decision-making, automate tasks|
|Database||Traditional relational database management system (RDBMS)||Enterprise data warehouse (EDW)|
|Data source||Daily business transactions||Consolidation of data from multiple sources, including OLTP and external sources|
|Data structure||Highly structured||Structured, semi-structured, and unstructured|
|Focus||Present||Past, present, and future (using historical data to plan for and predict future events)|
|Tasks||Insert, update, delete, sort, filter||Aggregate and analyze data to support decision-making|
|Response time||Milliseconds||Seconds, minutes, hours, or days depending on the volume of data, complexity of the query, and capabilities of the EDW|
|Storage requirement||Ranges depending on the organization’s size and transaction load||Relatively large due to the large volume of data|
|Software||Database management||Database management plus BI tools|
OLTP and OLAP in Action
Suppose you want to sell running shoes online. You hire a database administrator (DBA) who creates dozens of different tables and relationships. You have a table for customer addresses, a table for shoes, a table for shipping options, and so on. The web server uses structured query language (SQL) statements to capture and store the transaction data. When a customer buys a pair of shoes, her address is added to the Customer Address table, the Shoes table is updated to reflect a change in inventory, the customer’s desired shipping method is captured, and so on. You want this database to be fast, accurate, and efficient. This is OLTP.
You also ask your DBA to create a script that uploads each day’s data to your EDW. You have a data analyst create a report to see whether customer addresses are related in any way to the shoes they buy. You find that people in warmer areas are more likely to buy brightly colored shoes. You use this information to change your website, so customers from warmer climates see more brightly colored shoes at the top of the page. This is an example of OLAP. While you don’t need real-time results, you do need to be able to aggregate and visualize data to extract meaning and insight from it.
Copying Data from OLTP to OLAP
Most organizations have separate OLTP and OLAP systems, and they copy data from their OLTP system to their OLAP system via a process referred to as extract, transform, and load (ETL):
For more about ETL, see my previous post Grasping Extract, Transform, and Load (ETL) Basics.
The Best of Both Worlds
Some newer database designs attempt to combine OLTP and OLAP into a single solution, commonly referred to as a translytical database. However, OLTP systems are highly normalized to reduce redundancy, while OLAP reduces the required degree of normalization to achieve optimal performance for analytics.
Normalization is a process of breaking down data into smaller tables to reduce or eliminate the need to repeat fields in different tables. If you have the same field entries in different tables, when you update an entry in one table, you have to update it in the other; failing to do so results in a loss of data integrity. With normalization, when you need to change a field entry, such as a customer’s phone number, you have only one table in which you need to change it.
Because OLTP and OLAP differ in the degree to which data must be structured, combining the two is a major challenge. However, organizations are encountering an increasing need to analyze transactional data in real time, so the benefits of a translytical database model are likely to drive database and data warehousing technology in that direction.
To analyze a body of data, that data must first be loaded into a data warehouse; that is, it must be copied from one or more systems, converted into a uniform format, and written to the new destination. This process is commonly referred to as extract, transform, and load (ETL). ETL provides the means to combine disparate data from multiple sources and create a homogenous data set that can be analyzed in order to extract business intelligence from it.
During extraction, data is read from one or more sources and held in temporary storage for transformation and loading. An organization may extract data from its own internal systems, such as a transaction processing system that records all order activities or from external sources, such as data it purchases or obtains for free from other organizations.
Extraction is commonly broken down into two logical extractions methods:
Extraction is also broken down into two physical extraction methods:
During the transform stage, data is processed to make all data consistent in structure and format so that it all conforms to a uniform schema. A schema provides the structure and rules for organizing data in a relational database. The source and target database systems may use different schemas; for example, the source database may store shipping information in a Customer table, whereas the target database stores shipping information in a separate Shipping table. Or, the source table may have dates in the MM/DD/YYYY format, whereas the target uses the DD/MM/YYYY format. To successfully copy data from the source to the target, certain transformations must be made to ensure that the source data is in an acceptable format.
Transformations can be handled in two ways:
During the load operation, all newly transformed data is written to the target data warehouse for storage. Various mechanisms can be used to load data into the target warehouse, including the following:
Variations on the Theme
ETL is commonly described as a three-step process primarily to make it easier to understand. In practice, ETL is not a series of clearly defined steps but more of a single process. As such, the sequence of events may vary. Depending on the approach, ETL may be more like one of the following:
The ETL Bottleneck
Given the increasing volumes of data that organizations must capture and integrate into their data warehouses, ETL often becomes a major bottleneck. Database administrators need to constantly revise their ETL procedures to accommodate variations in the data arriving from different sources. In addition, the volume and velocity of data can overwhelm an organization’s existing data warehouse storage and compute capabilities, leading to delays in producing time-sensitive reports and business intelligence. ETL operations often compete for the same storage and compute resources needed to handle data queries and analytics.
Fortunately, data warehousing technology has evolved to help reduce or eliminate the impact of the ETL bottleneck. For example, cloud data warehousing provides virtually unlimited storage and compute resources, so that ETL does not need to compete with queries and analytics for limited resources. In addition, data warehouse frameworks such as Hadoop take advantage of distributed, parallel processing to distribute work-intensive tasks such as ETL over multiple servers to complete jobs faster.
With the right tools and technologies in place, organizations can now stream diverse data from multiple sources into their data warehouses and query and analyze that data in near real time. If you or your team is in charge of procuring a new data warehouse solution for your organization, look for a solution that provides unlimited concurrency, storage, and compute, to avoid contention issues between ETL processes and people in the organization who need to use the same system to run queries and conduct analysis. Also look for a system that can live-stream data feeds and process structured, semi-structured, and unstructured data quickly and easily without complicated and costly ETL or ELT processes. In most cases, the ideal solution will be data warehouse built for the cloud.
A schema is a formal description and/or illustration of how data is structured and related in a relational database. The description or illustration includes the following:
When presented as an illustration, a schema is commonly referred to as an entity relationship diagram (ERD).
This very simple ERD shows the relationship of three entities — doctor, patient, and room. Each entity represents a table with several fields. For example, the Doctor entity has the fields ID, First Name, Last Name, and Patient ID. Of these fields, one is the primary key, which is designated “PK” in the diagram. A primary key is a unique identifier. “FK” stands for foreign key, which is a primary key in one table that appears as a field in another table. For example, the Doctor table includes as a foreign key the Patient ID field, which is a primary key in the Patient table.
The lines used to connect the entities define their relationships through the use of the following symbols:
As shown in the ERD above, zero or many patients may be under the care of a single doctor, but only one doctor can be the primary care physician for any given patient. Likewise, zero or many patients may be assigned to a room, but only zero or one room may be occupied by any given patient; for example, a patient may be assigned to a room or allowed to leave the hospital without being assigned to a room.
You can add detail to each table by creating a third column that describes the data type for each field; for example, you could use “number” or “integer” for the ID fields and something like “text (50)” for the First Name and Last Name fields, to indicate that entries must be text and no longer than 50 characters. If you had a table with a Phone Number field, you could use something like “integer (10)” to indicate that the field must contain 10 numerals.
The Importance of Schemas
Schemas are important because they set the rules for adding data to the database and facilitate the process of extracting, filtering, sorting, and analyzing data. A schema ensures that every record in the database has a unique identifier (primary key), that all data entries in a particular field are formatted the same way, and that essential data is not omitted. By following a well-constructed schema, you ensure that the data is properly organized, which makes working with that data much easier.
A Common Challenge
In data science, one of the biggest challenges in managing a data warehouse arises when organizations need to import data from other sources for analysis. The organization has a schema in place to organize its own internal data, but this schema rarely matches the schema used for the data from external sources. For example, one organization may call the Customer ID field “Cust#,” while another calls it “CustID.” In many cases, the data from the external source isn’t from a relational database and, as a result, has no clear schema in place. The challenge is how to import disparate data into the organization’s existing database without ending up with a huge mess.
Database administrators overcome this challenge by performing a process called extract, transform, and load (ETL):
ETL provides the means to combine disparate data from multiple sources and create a homogenous data set that can be analyzed in order to extract business intelligence from it.
Schema-Less Data Structures
Not all databases are relational or carefully structured. Some are considered schema-less, meaning they do not conform to a strict structure and set of rules. For example, data need not be stored in tables or obey specific data type constraints. Schema-less databases offer the following benefits:
The key ingredient for a successful data scientist is a curious, skeptical, and innovative mind. However, data scientists also need to be familiar with the tools of the trade, various technologies, which can be broken down into the following categories:
Moving Beyond the Tools
Keep in mind that tools are just that — devices for reducing the amount of work required. Choosing the best tools available is certainly important. For example, many organizations currently struggle to manage “big data” because they made huge investments in on-premises data warehouses that run Hadoop, which requires highly specialized database administrators (DBAs) just to keep it up and running. However, choosing the right tools is only a small part of extracting value from data. Organizations also need to come up with creative uses for that data.
Imagine you manage a website for connecting prospective car buyers to dealers. You build hundreds of information tags into the website that indicate when a shopper is hovering over a clicking a link. All the data flows into your data warehouse, representing several terabytes of data per week. You’ve collected historical data spanning several years. Collecting the data was relatively easy. The challenge is figuring out what to do with that data.
This seems like a common challenge for many organizations starting out in data science. They approach it mostly as an operational challenge and focus on collecting the data because it’s relatively cheap and easy to understand. It’s meeting-friendly and everyone can get behind the effort. They’ll even create multiple clusters or data lakes to pool their data from across the organization. But that’s the easy part. What organizations struggle with is the science of data science. They’re unaccustomed to asking and answering interesting questions.
As the data scientist for this imaginary car website, think about the questions you could ask to extract value from the data. You could examine the data to determine whether customers were more likely to click on a car if it were red, blue, or yellow. If the reports showed that customers are 2% more likely to click on a car if it’s red, the organization could share that with car dealerships to increase their sales and revenue. You could also experiment with the number of cars listed on each page to determine the optimum number of listings per page.
As a data scientist, this is the type of empirical research you should be considering. Using the tools at your disposal, you need to ask interesting questions, run experiments, and produce well-designed reports to convey the newfound knowledge and insight to your organization’s decision-makers and stakeholders.
Remember, it’s less about the tools and more about what you can do with those tools. A master carpenter can do wonders with a hammer, a handsaw, and a few other basic tools and materials, whereas someone unskilled in carpentry would have trouble building a simple chair in a fully-equipped woodworking shop. As a data scientist, you want it all — the skills and expertise along with the best tools available!
Data modeling is the process of defining the structure, flow, and relationships of data stored in a database management system (DBMS). A data model provides the framework for all the data an organization generates and stores, and it facilitates the process of extracting, combining, and analyzing that data later. The goal of the data modeling process is to define the following:
A simple example of a data model is a spreadsheet with customer contact information. Row and column headings label every data entry, and data validation can be used to constrain data; for example, you can limit text entries to a certain number of characters. In practice, however, data models are much more complex, typically consisting of hundreds of individual tables that are related to one another in various ways.
A data model ensures that an organization’s data is structured — meaning that all data entries are clearly labeled and in a consistent format. Structuring data ensures consistency in naming conventions, default values, and data entry formats.
The Data Modeling Process
To build a data model, data architects must work closely with everyone in the organization responsible for entering data into the system as well everyone who will be using the information system to extract data, conduct analysis, generate reports, and so on. Only the people who enter data into the system and use that data to develop and execute the organization’s strategy know what the data model must contain.
The process of developing a data model typically involves the following steps:
Additional steps are required to address access, storage, and performance requirements.
The Three Stages of Data Modeling
The data modeling process is broken down into three stages or levels:
However, a data architect may also take a bottom-up approach, starting with existing forms, fields, software, and reports to reverse-engineer a data model from an organization’s existing system.
Ensuring Data Integrity
One of the key benefits of data modeling is that it helps to ensure data integrity — accuracy and reliability, both in terms of entity integrity and referential integrity:
Normalization is the process of systematically breaking down a large, complex table into smaller ones to eliminate disparities and redundancy, thus improving data integrity while making the DBMS easier to manage. For example, by creating separate tables for customer data and order data, if a customer moves to a new address, you simply make that change in the customer table, and the change is automatically reflected in any new orders the customer places. You don’t have to make the change in the customer table and the order table, which would require more work and more storage and be more susceptible to introducing data discrepancies in the system.
Of course, data modeling is much more complex and involved than what I describe here, but this post provides a basic understanding to get you started. Keep in mind, however, that not all data an organization needs is as structured as its internal data. For business intelligence, organizations often rely on semi-structured and unstructured data, a topic I will cover in a later post.