Big data is a term that describes an immense volume of diverse data, typically analyzed to identify patterns, trends, and associations. However, the term “big data” didn’t start out that way. In 1997, NASA researchers Michael Cox and David Ellsworth described a “big data problem” they were struggling with. Their supercomputers were running simulations of airflow around aircraft and generating massive volumes of data that couldn’t be processed or visualized effectively. The data were pushing the limits of their computer storage and processing, which was a problem, and a big one. In this context, “big data problem” described a big problem more than it described big data; NASA was facing a data problem that happened to be big, not so much a big-data problem.
In 2011, a McKinsey report entitled “Big data: The next frontier for innovation, competition, and productivity” reinforced the use of the term “big data” in the context of a problem that “leaders in every sector will have to grapple with.” The authors refer to big data as datasets whose size exceeds the capability of commonly used hardware and software tools to capture, store, and analyze them.
Over time, the definition of big data has taken on a life of its own, expanding beyond the context of a problem to encompass the potential value of that data as well. Now, big data poses both big problems and big opportunities.
Many organizations that start big-data projects don’t actually have big data. They may have a lot of data, but volume is just one criterion. These organizations may also mistakenly think they have a big-data problem because of the challenges they face in capturing, storing, and processing their data. However, data doesn’t constitute big data unless it meets the following criteria (also known as the four V’s):
Volume: In the world of big data, volume is no longer measured in megabytes and gigabytes; it is measured in terabytes, petabytes, exabytes, zettabytes, and yottabytes. Imagine the volume of data generated around the world every day by more than six billion smartphone users (see the back-of-the-envelope sketch after this list). Add to that Internet data and machine-generated data from the growing number of devices that make up the Internet of Things (IoT), along with data from numerous other sources.
Velocity: Big data flows in at high speed, often streaming continuously from sources such as sensors, transactions, and mobile devices, and it frequently must be processed in real time or near real time.
Variety: Big data comes in many forms from many sources, including structured database records, text, email, audio, video, and machine-generated sensor readings.
Veracity: Big data varies in accuracy and trustworthiness, so the quality and reliability of the data must be evaluated before it can be put to good use.
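To put those volume units in perspective, here’s a quick back-of-the-envelope sketch in Python. The 100 MB of data per user per day is a purely hypothetical figure chosen for illustration, not a measured statistic:

```python
# Rough arithmetic for the Volume criterion: how quickly do six billion
# smartphones reach petabyte and exabyte scale? The per-user daily data
# rate below is a hypothetical assumption, not a real measurement.

UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21, "YB": 10**24}

users = 6_000_000_000                  # roughly six billion smartphone users
bytes_per_user_per_day = 100 * 10**6   # assumed 100 MB per user per day

daily_total = users * bytes_per_user_per_day  # total bytes generated per day
for unit, size in UNITS.items():
    print(f"One day of data: {daily_total / size:,.4g} {unit}")
```

Even at that modest assumed rate, the world’s smartphones alone would generate roughly 600 petabytes, more than half an exabyte, every single day.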
An interesting example of a big data problem is the challenge surrounding self-driving cars. To enable a self-driving car to safely navigate from point A to point B without running over pedestrians or crashing into objects, you would need to collect, process, and analyze a high-volume stream of diverse data, including audio, video, traffic reports, GPS location data, and more, all flowing into the database in real time and at high velocity. You would also need to evaluate which data is most reliable: for example, the historical data showing that the left lane is open to traffic, or the live video of a sign telling drivers to merge right. Is that person standing at the corner going to dart out in front of the car or wait for the Walk signal? Whether the driver is human or the car is navigated by big data, a split-second decision is often required to prevent a serious accident. A driverless car would have to instantly process the video, audio, GPS, and traffic data, and then “decide” what to do. That’s a big data problem.
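To make that reliability question concrete, here’s a toy decision rule sketched in Python. It’s a minimal illustration, not how any real autonomous-driving system works; the data sources, trust scores, and five-second freshness half-life are all hypothetical assumptions:

```python
# Toy illustration of weighing live sensor data against historical data:
# fresher, more reliable evidence should win. Every number here is made up.

from dataclasses import dataclass
import time

@dataclass
class Reading:
    source: str         # where the claim came from, e.g., "camera"
    claim: str          # what the source asserts, e.g., "merge_right"
    reliability: float  # assumed trust score for the source, 0.0 to 1.0
    timestamp: float    # when the reading was captured (epoch seconds)

def freshness(reading: Reading, now: float, half_life: float = 5.0) -> float:
    """Halve a reading's weight for every half_life seconds of age."""
    return 0.5 ** ((now - reading.timestamp) / half_life)

def choose(readings: list[Reading], now: float) -> Reading:
    """Pick the claim whose reliability-times-freshness score is highest."""
    return max(readings, key=lambda r: r.reliability * freshness(r, now))

now = time.time()
readings = [
    # Day-old map data says the left lane is open to traffic...
    Reading("map_history", "left_lane_open", reliability=0.6, timestamp=now - 86_400),
    # ...but the live camera just saw a sign telling drivers to merge right.
    Reading("camera", "merge_right", reliability=0.9, timestamp=now - 0.1),
]
print(choose(readings, now).claim)  # prints "merge_right"
```

Because the camera reading is both fresher and assumed more trustworthy, it outweighs the day-old map data, and the sketch “decides” to merge right. A real system would have to fuse far more signals at far higher speed, which is exactly what makes this a big data problem.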
Technology is evolving to solve most big data problems, and the cloud is playing a key role in this process. The cloud offers virtually unlimited storage and compute, so organizations no longer need to bump up against limitations in their on-premises data warehouses. In addition, business intelligence (BI) software is becoming increasingly sophisticated, enabling organizations to extract value from data without requiring a high level of technical expertise from users.
Still, many organizations struggle with data problems, both big and small. Some continue to run into storage and compute limitations simply because they are reluctant to move their on-premises data warehouses to the cloud. Most organizations that struggle with data, however, simply don’t know what to do with the data they have and the vast amounts of diverse data now readily available to them. Their problem is that they haven’t developed the culture of curiosity and innovation required to put all that data to good use. In many ways, this shortcoming is the real big data problem.