You can’t have a discussion about big data or data science these days without mentioning Hadoop, but when you ask people what Hadoop is, they struggle to define it. I define Hadoop as a software framework for harnessing the power of distributed, parallel computing to store, process, and analyze large volumes of structured, semi-structured, and unstructured data.

Imagine having to write a 30 chapter book. You have two options — write it yourself or write the one chapter you know the most about and farm out the other 29 chapters to various experts on those topics. Which option would deliver the best book in the least amount of time? Obviously, distributing the work among 30 experts would expedite the process and result in a higher quality product. The same is true with distributed, parallel computing; instead of having one computer do all the work, you distribute tasks to dozens, hundreds, or even thousands of powerful “commodity” servers that perform the work as a collective. This group of servers is often referred to as a Hadoop cluster, and Hadoop coordinates their efforts.

Three Key Features
Hadoop has three key features that differentiate it from traditional relational databases, which make it an attractive option for working with big data:

Advantages of Hadoop
Hadoop has several advantages over traditional relational databases, including the following:

Disadvantages of Hadoop
Hadoop certainly helps to overcome many of the challenges of big data, but it is not, in itself, an ideal solution. It is weak in the following areas:

Commercial versions of Hadoop are available to help overcome these and other limitations of the open source Apache Hadoop, including Amazon Web Services Elastic MapReduce, Cloudera, Hortonworks, and IBM InfoSphere BigInsights.