Defining Big Data is about better understanding big data sets and big data analytics. Also see the key characteristics of big data technologies.
In my previous article, "The Role of Big Data in Machine Learning," I explained that the growing volume of diverse data that organizations now collect and store has been a driving force behind the development of machine learning. However, big data presents not only big opportunities in the world of machine learning; it also poses big problems in terms of capturing, storing, managing, and processing enormous volumes of data.
The problem that many organizations are having with big data is that their on-premises data warehouses simply cannot handle the volume, variety, and velocity of data being generated. The on-premises warehouses may also lack sufficient storage and processing power to generate reports or extract business intelligence from that data on a timely basis. Soon after an organization upgrades its on-premises data warehouse, it’s likely to outgrow that warehouse, and replacing a data warehouse is an expensive and time-consuming operation.
To delay the inevitable need to upgrade their data warehouse, many organizations run reports at the end of the day so that the results are ready the next morning or afternoon. In other organizations, where numerous employees query the same data at the same time, users have to wait hours for results, and if the system crashes or freezes during the process because it lacks processing capacity, they have to start over. Yet many of these organizations rely on near-real-time reporting to remain competitive.
The problem is growing. According to one estimate, within the next decade there will be more than 150 billion networked sensors in the world, each generating data around the clock, 365 days a year. Add to that all the data that humans generate in a single day on Facebook, Twitter, Google, online shopping sites, online gaming sites, and more.
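To get a feel for the scale, here is a rough back-of-the-envelope sketch in Python. Only the sensor count comes from the estimate quoted above; the reading frequency (one per minute) and the reading size (100 bytes) are purely hypothetical assumptions chosen for illustration, so treat the result as an order-of-magnitude guess rather than a forecast.

```python
# Back-of-the-envelope estimate of daily sensor data volume.
# SENSOR_COUNT is the figure quoted in the article; the other two
# constants are hypothetical assumptions, not measured values.

SENSOR_COUNT = 150_000_000_000   # "more than 150 billion networked sensors"
READINGS_PER_DAY = 24 * 60       # assumption: one reading per minute
BYTES_PER_READING = 100          # assumption: 100 bytes per reading


def daily_volume_petabytes(sensors, readings_per_day, bytes_per_reading):
    """Return the raw data volume per day in petabytes (1 PB = 10**15 bytes)."""
    total_bytes = sensors * readings_per_day * bytes_per_reading
    return total_bytes / 10**15


if __name__ == "__main__":
    pb = daily_volume_petabytes(SENSOR_COUNT, READINGS_PER_DAY, BYTES_PER_READING)
    print(f"Roughly {pb:.1f} PB of raw sensor data per day")
```

Under these assumptions the sensors alone would produce on the order of tens of petabytes every day, before counting anything that people generate. Change either assumed constant and the total scales linearly with it.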
To overcome the limitations of on-premises data warehousing solutions, more and more organizations are moving their data warehouses to the cloud — a vast network of storage and processing resources that are available via the Internet.
A cloud-based data warehouse offers the following advantages:
As you explore the topic of data warehousing, you will also encounter the term "data lake," and probably wonder what the difference is. Actually, there are several differences between a data warehouse and a data lake, including the following:
Organizations typically use data lakes when they need to include external data sources in their analyses.
Big data is valuable when applied to two closely related areas:
The takeaway here is that big data is both a problem and an opportunity: it’s a problem in terms of capturing, storing, and processing all that data, but it provides enormous opportunities for analyzing that data to obtain valuable business intelligence and for using it to facilitate machine learning.
Cloud-based data warehousing helps to solve the problem of big data by giving organizations access to virtually unlimited storage and compute resources that can be scaled up or down on demand. This powerful combination of cloud-based data warehousing, business intelligence, and machine learning currently serves as a key driver of both innovation and growth.
Hadoop Basics is an Apache Hadoop tutorial for beginners on how to manage big data. See key concepts such as HDFS, Hive, and HBase.
There are generally three types of data: structured, semi-structured, and unstructured.