What is Hadoop, anyway?

Andreas Neumann, Chief Architect, develops big data software at Cask and has previously done so at places that are known for massive scale. Prior to Cask, he was Chief Architect for Hadoop at Yahoo!, and he was previously a research engineer at Yahoo! and a search architect at IBM.

Recently my co-worker Derek posted an article about which versions of Hadoop infrastructure components are included in the various distributions. One of the reactions was this tweet, questioning whether such things as HBase and Spark should be considered part of core Hadoop:

[Image: the tweet, captioned “YACD”]

Even though this tweet was purely about core Hadoop, it made me think – what makes a technology a Hadoop technology? What is Hadoop, anyway? So I did a quick search of the organized information of the world, and before I even finished typing my query, I received these suggestions:

[Image: search suggestions for “What is”]

… which might suggest that Hadoop is some kind of disease. Luckily I knew that it is not, and I searched for “what is hadoop”, to find out that there are dozens of people, companies, and organizations who have given their own answer to the question. Knowing that I am not very likely to do a better job than all these people, I thought I should at least check the collected wisdom of the internet crowd. Here is what Wikipedia says:

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

So let’s see whether this definition helps us further. To begin with, Hadoop is an open-source project at the Apache Software Foundation, a place where a diverse community of contributors collaborates on the development of a project. That is certainly true for the Hadoop project itself, which includes the HDFS filesystem, YARN, and MapReduce. What about other technologies often considered part of the “Hadoop Stack”, such as HBase or Hive? They, too, are Apache open-source projects, are very tightly coupled with Hadoop itself, and are included in every Hadoop distribution that I know of. Fortunately, Wikipedia helps me out again:

The term “Hadoop” has come to refer … also to the “ecosystem”, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark, and others.

So the question I should try to answer is really: What is part of the Hadoop ecosystem? And what should be the criteria to include a technology? Let’s see.

Open source: Most of the technologies in the Hadoop space are open source, and a lot of them are actually Apache projects, for example, Apache HBase, Apache Hive and Apache Spark. These projects have healthy, diverse communities, and few people would hesitate to include them in the Hadoop ecosystem. There are also some technologies where the community has diverged, resulting in several open-source projects, each backed by only one of the major Hadoop distributions. An example of this is Hadoop security, with at least three different Apache projects: Knox, Sentry and Ranger, which between them cover authentication and authorization. Sentry is included only in the Cloudera distribution, whereas Ranger can only be found in the Hortonworks distro. For these projects, can we really say that the Hadoop community collaborates on them? I cannot answer that tricky question. Other open-source technologies, such as Cloudera’s Impala, are part of a Hadoop distribution but are not Apache projects, and are obviously maintained by contributors from a single organization. And then there is MapR’s distribution, which is not open source. Yet MapR is commonly referred to as a Hadoop distributor.

Apparently we cannot use open source as a criterion for whether something is part of the Hadoop ecosystem. What else could we use? Wikipedia mentions that Hadoop is written in Java. Is that something typical for Hadoop? Is Hadoop a Java technology? Again, this is certainly true for core Hadoop (even though it relies on some native libraries for compression and encryption) and many other technologies like HBase and Hive. However,

  • Apache Spark (which Wikipedia includes in the ecosystem) is written in Scala, and so is Kafka, which is also included in Cloudera’s CDH.
  • Cloudera’s Impala and large parts of MapR’s distribution are written in C/C++.
  • Hortonworks’ distribution includes Apache Storm, which is written in Clojure.

Although one might debate for other reasons whether these components are actually Hadoop technologies, my conclusion is that being implemented in Java is not a distinguishing factor.

For a second I am tempted to include everything that is part of a Hadoop distribution. But that is a seriously flawed idea, because now I have to answer the question “What is a Hadoop distribution?”, which will quickly get me into many controversies.

Perhaps I should attend a Hadoop conference – the technologies presented there must be Hadoop technologies, right? Attending Hadoop Summit last month quickly disillusioned me: there were exhibits by cloud service providers, database companies, virtualization vendors, storage manufacturers, networking companies, you name it. Is a t-shirt with an elephant on it really enough to call yourself a Hadoop technology? I feel that this cannot be my way of defining Hadoop.

Why is it so difficult to define this? For other platforms it is a lot easier: “What is a Windows application?” – An application that runs on Windows. Easy, isn’t it? But wait: Didn’t @acmurthy name YARN the “datacentre Hadoop operating system”? Perhaps it is this easy: If it runs in YARN, then it is a Hadoop technology. Hmmm… That would exclude HDFS, HBase and anything built before YARN, but include a myriad of Big Data technologies that exist today. And some of those technologies can run in YARN (such as Spark or Storm) but were not originally written for YARN. Apache Spark, for example, can run on top of YARN, but is also considered a core part of the Mesos ecosystem.
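To see the three criteria tried so far side by side, here is a minimal Python sketch. It only encodes what this post says about five of the components mentioned above; the `Component` class, the `CRITERIA` table and the `excluded_by` helper are names made up for illustration, and the YARN value for Impala is my assumption, since the post does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    apache_project: bool   # is it an Apache project, as described in this post?
    language: str          # main implementation language, as described in this post
    runs_on_yarn: bool     # can it run in YARN?


# Facts as stated in the post, except where marked as an assumption.
COMPONENTS = [
    Component("HDFS", True, "Java", False),
    Component("HBase", True, "Java", False),
    Component("Spark", True, "Scala", True),
    Component("Storm", True, "Clojure", True),
    Component("Impala", False, "C/C++", False),  # runs_on_yarn: assumption
]

# The three candidate definitions of "Hadoop technology" tried so far.
CRITERIA = {
    "is an Apache project": lambda c: c.apache_project,
    "is written in Java": lambda c: c.language == "Java",
    "runs on YARN": lambda c: c.runs_on_yarn,
}


def excluded_by(criterion: str) -> list[str]:
    """Return the names of the components that fail the given criterion."""
    test = CRITERIA[criterion]
    return [c.name for c in COMPONENTS if not test(c)]


if __name__ == "__main__":
    for criterion in CRITERIA:
        print(f"{criterion}: excludes {', '.join(excluded_by(criterion))}")
```

Each criterion throws out at least one component that ships in a Hadoop distribution, which is why none of them works as a definition on its own.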

Does this mean Hadoop has become a de facto synonym – or a hyponym – for Big Data? Another look at Google shows that this might be true:

[Image: Google search for “What is Hadoop”]

I, for my part, am giving up at this point. What is Hadoop, anyway? Does it really matter whether we agree on the answer to that question? In the end, everybody who builds an application or solution on Hadoop must pick the technologies that are right for the use case.

Integrating these technologies is like solving a puzzle with many pieces, some of which are still missing, and some of which look as if they belong to a different puzzle. To make this integration easier, we are building CDAP: a platform for developers of applications that span some or all of these different technologies. CDAP solves the puzzle; you focus on your application.

What do you think is a Hadoop technology? What would you like to see supported by CDAP? Comment on this post and help us make CDAP better!
