ABSTRACT

In the big data era, traditional relational database systems cannot effectively handle the huge volume of data due to their limited scalability. People are seeking new ways to tackle the problem of big data. After Google published its work on MapReduce, Hadoop (an open-source implementation of MapReduce) rose to become the de facto standard tool for big data processing. People have applied Hadoop to various big data application scenarios, which demonstrate the power of Hadoop. However, Hadoop 1.0 supports only one computing model, MapReduce, which is not efficient enough to deliver higher performance.

Hadoop has now evolved into Hadoop 2.0 (YARN). Hadoop 2.0 has a newly designed architecture that separates resource management from job scheduling. Hadoop 2.0 supports other computing models besides MapReduce, including complex computing work expressed as a DAG (directed acyclic graph). People have also tried to improve the execution layer of Hadoop, such as the work on Tez from Hortonworks, to provide lower latency.

In the meantime, the AMPLab at the University of California, Berkeley, brought out Spark, which is now drawing more and more attention from academia and industry. The Spark ecosystem consists of the core and four major components built on top of it: Spark SQL for structured data processing, Spark Streaming for stream data processing, MLlib for machine learning, and GraphX for graph data processing. In essence, Spark and Hadoop provide similar functionalities; however, in some application scenarios, Spark outperforms Hadoop many times over.

Hadoop and Spark are two ecosystems, and either can play the central role in future big data warehouses. On one hand, they are replacements for each other; on the other hand, they can be used together to get work done. For example, people can use Spark for exploratory analysis to get instant feedback, and use Hadoop to consolidate all data in one place and conduct a thorough analysis of the whole data set.

In this chapter, we analyze the limitations of different technologies and the business requirements behind the continuous innovations. We also try to point out some lessons that the database research community and the database industry should have learned.