The Key Differences Between Data Warehouses And Data Lakes

Data Warehouses and Data Lakes

Published August 5, 2021

Doug Rose

Author | Agility | Artificial Intelligence | Data Ethics

In my previous article, "The Role of Big Data in Machine Learning," I explained that the growing volume of diverse data that organizations now collect and store has been a driving force behind the development of machine learning. However, big data presents us not only with big opportunities in the world of machine learning; it also poses big problems in terms of capturing, storing, managing, and processing enormous volumes of data.

The problem that many organizations are having with big data is that their on-premises data warehouses simply cannot handle the volume, variety, and velocity of data being generated. The on-premises warehouses may also lack sufficient storage and processing power to generate reports or extract business intelligence from that data on a timely basis. Soon after an organization upgrades its on-premises data warehouse, it’s likely to outgrow that warehouse, and replacing a data warehouse is an expensive and time-consuming operation.

To delay the inevitable upgrade, many organizations run reports at the end of the day so that the results are ready the next morning or afternoon. In other organizations, where numerous employees frequently query the same data at the same time, users have to wait hours for results, and if the system crashes or freezes during the process for lack of processing capacity, they have to start over. These delays are costly, because many of these organizations rely on reporting in near real time to remain competitive.

The problem is growing. According to one estimate, within the next decade there will be more than 150 billion networked sensors in the world, each of which will be generating data around the clock, every day of the year. And just imagine all the data that humans generate in a single day on Facebook, Twitter, Google, online shopping sites, online gaming sites, and more.

Cloud Data Warehousing

To overcome the limitations of on-premises data warehousing solutions, more and more organizations are moving their data warehouses to the cloud — a vast network of storage and processing resources that are available via the Internet.

A cloud-based data warehouse offers the following advantages:

Virtually unlimited storage and compute: With a cloud-based data warehouse, an organization is unlikely to outgrow its warehouse; it can expand the warehouse simply by paying for more storage and compute capacity.
Superior performance and availability: More compute capacity translates into better performance and availability. The organization is far less likely to run into concurrency issues, which arise when personnel access the data warehouse at the same time and compete for resources.
Scalability on demand: Organizations can scale their use of storage and compute resources on demand, so they can scale up during busy periods and scale down when demand is reduced.
Pay-per-usage: Cloud data warehouse providers can charge customers based on the resources they use. With on-premises solutions, the organization needs to build a system that is large enough to handle its periods of highest demand, even though it may need that capacity for only limited periods of time, such as during holiday shopping sprees.
No maintenance costs: Because the cloud data warehouse provider maintains the warehouse, the organization does not need its own data warehouse administrators and security experts. In addition, the provider can spend more on top-quality security personnel and technology and spread the costs across its customer base, giving clients better security than they may be able to achieve in-house.
All data in one place: Prior to the availability of cloud-based data warehouses, organizations often needed to store different types of data in different warehouses; for example, structured data in one warehouse and semi-structured data in another. With improved technology developed specifically for cloud-based data warehouses, organizations can now store all their data in one place, simplifying the process of querying and analyzing the data as a collective whole.
Simplified data sharing: Organizations no longer need to move data (for example, via email or file transfer protocol [FTP]) to share it. They can simply provide login credentials and online business intelligence (BI) tools to anyone needing access to the data, enabling the user to query and analyze that data remotely via the Internet.
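To make the pay-per-usage point concrete, here is a minimal sketch of the arithmetic. Every number in it is hypothetical: the monthly demand figures, the unit price, and the simplifying assumption that a unit of compute costs the same on premises as in the cloud are made up for illustration and do not come from any provider's price list.

```python
# Hypothetical compute demand per month, in arbitrary "compute units".
# Ten quiet months, then a holiday spike in November and December.
monthly_demand = [10] * 10 + [40, 50]

# Hypothetical price of one compute unit for one month (assumed equal
# for both models purely to keep the comparison simple).
UNIT_PRICE = 100


def fixed_capacity_cost(demand, unit_price):
    """On-premises style: size the system for peak demand and pay for it every month."""
    peak = max(demand)
    return peak * unit_price * len(demand)


def pay_per_use_cost(demand, unit_price):
    """Cloud style: pay each month only for the units actually used."""
    return sum(units * unit_price for units in demand)


print("Sized for peak:", fixed_capacity_cost(monthly_demand, UNIT_PRICE))
print("Pay per usage: ", pay_per_use_cost(monthly_demand, UNIT_PRICE))
```

With these made-up figures, the system sized for the December peak sits mostly idle for ten months of the year, which is where the gap between the two totals comes from. Real pricing is more complicated, so treat this as a way to reason about the trade-off and not as a cost estimate.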

Data Warehouses Versus Data Lakes

As you explore the topic of data warehousing, you will also encounter the term "data lake," and you will probably wonder what the difference is. There are several differences between a data warehouse and a data lake, including the following:

Data flow into a warehouse is restricted, whereas data flows freely into a data lake. Data doesn't flow into a data warehouse unless that data has a predefined use.
A data warehouse is typically used to collect and store operational data (data generated from within the organization and its partners), whereas a data lake stores data from external sources as well.
Data in a warehouse is highly transformed and structured, whereas a data lake stores raw data.
While a data warehouse stores mostly structured and semi-structured data, a data lake stores all data types — structured, semi-structured, and unstructured.

Organizations typically use data lakes when they need to include external data sources in their analyses.
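The contrast between restricted and free data flow can be sketched in a few lines of code. This is a toy illustration, not how any real warehouse or lake product works: the schema, the function names, and the sample records are all hypothetical, and a simple schema check stands in for the "predefined use" that data needs before it enters a warehouse.

```python
import json

# Hypothetical predefined structure that the warehouse expects.
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}


def load_into_warehouse(table, record):
    """Accept a record only if it matches the predefined schema exactly."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        return False
    for field, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            return False
    # Store the record as a structured row in schema order.
    table.append(tuple(record[field] for field in WAREHOUSE_SCHEMA))
    return True


def load_into_lake(lake, raw):
    """Store the raw payload as-is; how to interpret it is decided later."""
    lake.append(raw)


# Hypothetical incoming data: an order, a social media post, a sensor reading.
incoming = [
    '{"order_id": 1, "amount": 19.99}',
    '{"tweet": "love the new store", "user": "@sam"}',
    "sensor-7,2021-08-05T10:00:00,21.4",
]

warehouse_table, data_lake = [], []
for raw in incoming:
    load_into_lake(data_lake, raw)  # the lake takes everything
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        continue  # not even parseable as JSON, so it never reaches the warehouse
    if isinstance(record, dict):
        load_into_warehouse(warehouse_table, record)

print("Rows in warehouse:", warehouse_table)
print("Items in lake:", len(data_lake))
```

In this sketch only the order record reaches the warehouse, while all three items land in the lake in their raw form, including the external social media post and the unstructured sensor line.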

Putting Big Data to Work

Big data is valuable when applied to two closely related areas:

Business intelligence (BI): As more data becomes available, organizations can analyze that data to gain insight into the past, present, and future of the organization, any competitors, the industry overall, consumer preferences, and more.
Machine learning (ML): Data fuels machine learning. The availability of more data facilitates machine learning, while a greater variety of data leads to the development of different applications of machine learning.

The takeaway here is that big data is both a problem and an opportunity: It’s a problem in terms of capturing, storing, and processing all that data; but it provides unlimited opportunities in terms of analyzing that data to obtain valuable business intelligence and using that data to facilitate machine learning.

Cloud-based data warehousing helps to solve the problem of big data by providing organizations with access to virtually unlimited storage and compute resources that can be scaled up or down on demand. This powerful combination of cloud-based data warehousing, business intelligence, and machine learning currently serves as a key driver of both innovation and growth.
