ABSTRACT

Hadoop has become the de facto standard for distributed computing, for at least two reasons: (1) it provides the power and the tooling to manage distributed nodes and clusters, and (2) it is free software from the Apache Software Foundation. Hadoop consists of two main components: the MapReduce processing framework and the Hadoop Distributed File System (HDFS); a minimal MapReduce example appears after the project list below. Apache lists the following projects as related to Hadoop:

• Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop

• Avro: A data serialization system

• Cassandra: A scalable multimaster database with no single points of failure and excellent performance

• Chukwa: A data collection system for managing large distributed systems

• HBase: A scalable, distributed database that supports structured data storage for large tables

• Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying

• Mahout: A scalable machine learning and data mining library

• Pig: A high-level dataflow language and execution framework for parallel computation

• ZooKeeper: A high-performance coordination service for distributed applications
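
To make the division of labor between the two Hadoop components concrete, here is a minimal word-count job, closely following the canonical WordCount example from the Apache Hadoop MapReduce tutorial. Each mapper processes one split of the input (typically one HDFS block) independently, emitting (word, 1) pairs; reducers then sum the counts for each word. The class name and input/output paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper runs independently on one input split,
  // emitting (word, 1) for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: all counts for a given word are routed to one
  // reducer, which sums them into the final total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, the job would be submitted to a cluster with something like: hadoop jar wordcount.jar WordCount /input /output.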

Advantages:

• We can distribute data and computation, as sketched in the HDFS example below.

• Tasks are independent of one another; hence, partial failure is easy to handle, since a failed task can simply be rerun without affecting the others.
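
As a sketch of the first advantage, the fragment below writes a small file to HDFS and reads it back through the org.apache.hadoop.fs.FileSystem API; HDFS transparently splits files into blocks and replicates each block across DataNodes (three copies by default). The class name and path are hypothetical, and the code assumes a core-site.xml on the classpath that points at a running cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml; on a real cluster
    // this resolves to the NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt"); // hypothetical path

    // Write: HDFS chunks the file into blocks and replicates them
    // across DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello, distributed storage");
    }

    // Read: the client fetches each block from whichever DataNode
    // holds a replica.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}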