A schema is a formal description and/or illustration of how data is structured and related in a relational database. The description or illustration includes the following:
- The tables that store the data
- The fields in each table
- The relationships among tables
When presented as an illustration, a schema is commonly referred to as an entity relationship diagram (ERD).
This very simple ERD shows the relationship of three entities — doctor, patient, and room. Each entity represents a table with several fields. For example, the Doctor entity has the fields ID, First Name, Last Name, and Patient ID. Of these fields, one is the primary key, which is designated “PK” in the diagram. A primary key is a unique identifier. “FK” stands for foreign key, which is a primary key in one table that appears as a field in another table. For example, the Doctor table includes as a foreign key the Patient ID field, which is a primary key in the Patient table.
The lines used to connect the entities define their relationships through the use of the following symbols:
As shown in the ERD above, zero or many patients may be under the care of a single doctor, but only one doctor can be the primary care physician for any given patient. Likewise, zero or many patients may be assigned to a room, but only zero or one room may be occupied by any given patient; for example, a patient may be assigned to a room or allowed to leave the hospital without being assigned to a room.
You can add detail to each table by creating a third column that describes the data type for each field; for example, you could use “number” or “integer” for the ID fields and something like “text (50)” for the First Name and Last Name fields, to indicate that entries must be text and no longer than 50 characters. If you had a table with a Phone Number field, you could use something like “integer (10)” to indicate that the field must contain 10 numerals.
The Importance of Schemas
Schemas are important because they set the rules for adding data to the database and facilitate the process of extracting, filtering, sorting, and analyzing data. A schema ensures that every record in the database has a unique identifier (primary key), that all data entries in a particular field are formatted the same way, and that essential data is not omitted. By following a well-constructed schema, you ensure that the data is properly organized, which makes working with that data much easier.
A Common Challenge
In data science, one of the biggest challenges in managing a data warehouse arises when organizations need to import data from other sources for analysis. The organization has a schema in place to organize its own internal data, but this schema rarely matches the schema used for the data from external sources. For example, one organization may call the Customer ID field “Cust#,” while another calls it “CustID.” In many cases, the data from the external source isn’t from a relational database and, as a result, has no clear schema in place. The challenge is how to import disparate data into the organization’s existing database without ending up with a huge mess.
Database administrators overcome this challenge by performing a process called extract, transform, and load (ETL):
- Extract: Data is read from one or more sources and held in temporary storage for the next two stages of the ETL process.
- Transform: Data is processed to make all data consistent in structure and format so that it all conforms to a uniform schema. For example, if some dates are formatted MM/DD/YYYY and others are formatted DD/MM/YYYY, a transformation would be necessary to format all dates consistently.
- Load: All data is loaded into the data warehouse for storage.
ETL provides the means to combine disparate data from multiple sources and create a homogenous data set that can be analyzed in order to extract business intelligence from it.
Schema-Less Data Structures
Not all databases are relational or carefully structured. Some are considered schema-less, meaning they do not conform to a strict structure and set of rules. For example, data need not be stored in tables or obey specific data type constraints. Schema-less databases offer the following benefits:
- No up-front data modeling is required. No schema is required, and no constraints need to be defined to ensure consistency in data entries.
- Data can be brought into the data warehouse without work-intensive ETL operations.
- The data warehouse has the capability of storing a variety of data.