Hadoop is a well-established software framework for analysing structured and unstructured big data and for distributing applications across thousands of servers. Hadoop was created in 2005, and since then several projects have appeared in the Hadoop space that try to complement it. Sometimes these technologies overlap with each other and sometimes they are only partially complementary. In this post I will try to draw a brief map of them.
The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. The Apache Hadoop project provides an open source MapReduce implementation.
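To make the paradigm concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using a word count as the job. It is only an illustration and not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample fragments are my own assumptions.

```python
from collections import defaultdict

# Illustrative only: these names are hypothetical and not part of Hadoop.

def map_phase(fragment):
    """Emit a (word, 1) pair for every word in one input fragment.

    The function depends only on its input, so a fragment can be
    re-executed on any node and give the same pairs.
    """
    return [(word.lower(), 1) for word in fragment.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all the values emitted for one key into a single result."""
    return key, sum(values)

def word_count(fragments):
    """Run the three steps in sequence over a list of input fragments."""
    pairs = []
    for fragment in fragments:  # on a cluster, each fragment is a separate task
        pairs.extend(map_phase(fragment))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

fragments = ["Hadoop stores data", "Hadoop processes data"]
counts = word_count(fragments)
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; here everything runs in one loop just to show the shape of the computation.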
The scalability needed for big data processing is provided by the Hadoop Distributed File System (HDFS). Data in a Hadoop cluster is broken down into blocks and distributed throughout the cluster. Although there are many alternatives to the HDFS layer (some of them known as NoSQL stores), it is well established at present. For this reason, in this post I will only describe the technologies related to the data processing layer that can run on top of HDFS. The alternatives for the data management layer will be considered in a future post.
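The idea of breaking data into blocks and spreading them over the cluster can be sketched as follows. This is a toy model under stated assumptions, not HDFS code: the tiny block size, the node names and the round-robin placement are all hypothetical, and real HDFS uses much larger blocks and a rack-aware placement policy.

```python
# Illustrative only: a toy model of block splitting and replica placement.

def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an assumption made for simplicity; it is not the
    placement policy that HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }

data = b"0123456789" * 3                        # 30 bytes of sample data
blocks = split_into_blocks(data, block_size=8)  # blocks of 8, 8, 8 and 6 bytes
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(len(blocks), nodes, replication=3)
```

Because every block lives on several nodes, losing one node does not lose data, and a processing task can be scheduled on any node that already holds a copy of the block it needs.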
Beyond HDFS and MapReduce, Hadoop is now commonly considered to include a number of related Apache projects built to run on top of Hadoop clusters, known collectively as the Hadoop ecosystem. Three important ones are Apache Hive and Apache Pig, which add data processing and warehousing capabilities, and Apache Sqoop, which integrates HDFS with relational data stores. Other important Apache technologies in the open source Hadoop ecosystem are: Apache Mahout, an open source library for building scalable machine-learning applications; Apache Flume, a distributed service for efficiently collecting, aggregating, and moving large amounts of log data to HDFS; Apache ZooKeeper, a high-performance coordination service for distributed applications; and Apache Avro and Apache Thrift, two very popular data serialization systems; among other less prominent projects. Some projects replace the MapReduce programming model; for instance, Apache Giraph, an iterative graph-processing system that is used instead of MapReduce for graph workloads.
Managing compute resources
Although MapReduce's batch approach was a driving factor in the initial adoption of Hadoop, its inability to multitask and to provide satisfactory real-time processing has been a difficulty for developers in recent years. For this reason Apache Hadoop YARN (Yet Another Resource Negotiator), a cluster management technology, appeared. Basically, this new layer splits key functions into two separate daemons, with resource management in one and job scheduling and monitoring in the other, which broadens the range of processing that Hadoop can support. With YARN the community generalized Hadoop MapReduce into a general-purpose resource management framework in which MapReduce became merely one of the applications that can process data in a Hadoop cluster. However, developers needed a more general data-processing engine that could benefit the entire ecosystem, and this is the role of Apache Tez. Tez is a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG (directed acyclic graph) of tasks to process data for both batch and interactive use cases. Tez eliminates unnecessary tasks, synchronization barriers, and reads from and writes to HDFS. Tez is being adopted by Hive, Pig and other frameworks in the Hadoop ecosystem.
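The idea of executing an arbitrary DAG of tasks can be sketched in a few lines. This is not the Tez API, only a single-process illustration: `run_dag`, the task names and the sample input are hypothetical. Each task runs once all of its upstream tasks have finished, and intermediate results are handed over in memory instead of being written to HDFS between chained jobs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Illustrative only: run_dag and the task names below are hypothetical.

def run_dag(tasks, dependencies, source):
    """Run every task after all of its upstream tasks have finished.

    tasks:        task name -> function
    dependencies: task name -> ordered list of upstream task names
    source:       input handed to tasks that have no upstream task
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [results[dep] for dep in dependencies.get(name, [])]
        # Upstream outputs are passed directly as arguments, in memory.
        results[name] = tasks[name](*upstream) if upstream else tasks[name](source)
    return results

tasks = {
    "tokenize": lambda lines: [word for line in lines for word in line.split()],
    "count": lambda words: len(words),
    "distinct": lambda words: len(set(words)),
    "report": lambda total, unique: {"total": total, "unique": unique},
}
dependencies = {
    "tokenize": [],
    "count": ["tokenize"],
    "distinct": ["tokenize"],
    "report": ["count", "distinct"],
}
results = run_dag(tasks, dependencies, ["a b a", "c b"])
```

Note that `count` and `distinct` both consume the output of `tokenize`. Expressed as chained MapReduce jobs, that shared intermediate result would normally have to be materialized between jobs, which is the kind of overhead a DAG engine tries to avoid.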
An important requirement for many current big data applications is processing streaming data in real time. Apache Storm appeared for this purpose. Storm is a distributed real-time computation system for processing fast and large streams of data, and it adds reliable real-time data processing capabilities to Apache Hadoop. Storm is by far the most widely used real-time computation system at the moment. Mesosphere released a project that makes it easier to run Storm on Apache Mesos (an alternative to YARN), a cluster manager that simplifies running applications on a shared pool of servers. Storm is often used together with Apache Kafka, a distributed message broker for storing, publishing and subscribing to data streams. A less widespread alternative to Storm for stream processing is Apache S4. A related project is Suro, a pipeline service for large volumes of event data that can be used to dispatch events for both batch and real-time processing. Summingbird is an open source library from Twitter that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms. Finally, also related to streaming, we can find SAMOA, a distributed streaming machine-learning framework for mining big data streams.
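The difference from batch processing is that each item is handled as soon as it arrives, and results are updated continuously. The following sketch chains Python generators to imitate that style. It borrows Storm's vocabulary of a source ("spout") feeding processing steps ("bolts"), but it is not the Storm API: the function names and the sample sentences are assumptions of mine.

```python
from collections import Counter

# Illustrative only: generator functions standing in for a spout and two bolts.

def sentence_source(sentences):
    """Stand-in for a spout: emits one sentence at a time."""
    for sentence in sentences:
        yield sentence

def split_words(stream):
    """Stand-in for a bolt: turns each incoming sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def rolling_count(stream):
    """Stand-in for a bolt: emits the updated count after every word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]

pipeline = rolling_count(split_words(sentence_source(["storm kafka", "storm"])))
events = list(pipeline)
```

Each stage pulls one item, processes it and passes it on, so an updated count is available after every word rather than only at the end of a batch. In a real deployment the source would typically read from a broker such as Kafka and never terminate.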
Apache Spark deserves special attention, as it is a framework that will play an important role in the Big Data arena. It has been extensively featured in this blog (please refer to my previous posts about Spark). Another comparable project is Apache Stratosphere (since renamed Apache Flink). Both are distributed general-purpose compute engines that offer user-facing APIs, and both can run in a Hadoop cluster on top of HDFS and YARN. There are several other projects in the Hadoop space that offer user-facing abstractions, such as Cascading and Scalding. Cascading is a Java-based framework that abstracts and hides the complex implementation details involved in writing big data applications. Scalding is an extension to Cascading that enables application development in Scala, a powerful language with strong support for functional programming that is very popular in the Big Data community.
I hope that you find this post and the links to resources useful. Please let me know if you find any mistakes or have any suggestions for improving it. Thank you! I would also like to thank Marc de Palol and Nico Poggi for their comments on the first draft of this post.
Links to the related projects: