ABSTRACT
Hadoop is the de facto standard for distributed computing, for at least two reasons: (1) it provides the tools to manage distributed nodes and clusters, and (2) it is free software from the Apache Software Foundation. Hadoop has two main components: MapReduce and the Hadoop Distributed File System (HDFS). Apache lists the following projects as related to Hadoop:
• Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop
• Avro: A data serialization system
• Cassandra: A scalable multimaster database with no single points of failure and excellent performance
• Chukwa: A data collection system for managing large distributed systems
• HBase: A scalable, distributed database that supports structured data storage for large tables
• Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying
• Mahout: A scalable machine learning and data mining library
• Pig: A high-level dataflow language and execution framework for parallel computation
• ZooKeeper: A high-performance coordination service for distributed applications
Advantages:
• We can distribute both data and computation across nodes.
• Tasks are independent of one another, so they can run in parallel.
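The MapReduce model behind these advantages can be illustrated with a minimal, single-process word-count sketch. This is plain Python, not the Hadoop API: the map phase emits (word, 1) pairs from each document, and the reduce phase groups pairs by key and sums them, which is exactly the work Hadoop distributes across independent tasks on a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    # In Hadoop, each document (or file split) would be handled by a
    # separate, independent map task.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group intermediate pairs by key (the word).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: sum the counts for each word.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big clusters"]
counts = reduce_phase(map_phase(docs))
print(counts)  # {'big': 2, 'data': 1, 'clusters': 1}
```

Because each map call touches only its own document and each reduce call touches only one key's values, neither phase needs shared state; that independence is what lets Hadoop scale the same program from one machine to thousands.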