ABSTRACT

This chapter discusses the Hadoop distributed file system and MapReduce in detail and demonstrates with simple examples. It outlines the shortcomings of MapReduce and some possible solutions to overcome. A group of networked heterogeneous and homogeneous computers that work together as a single unit to accomplish a task is called a distributed system. In simple words, a group of computers provides a single computer view. Hadoop Distributed File System (HDFS) manages big data across a cluster of machines with streaming data access pattern. It employs distributed storage to provide a single disk view and Distributed File System to provide unique global namespace on distributed storage. HDFS provides file abstraction, which means that a file beyond a storage disk size is partitioned into chunks and stored across a cluster of nodes. The unit of data storage and access in HDFS is a block, which denotes the minimum amount of data that can be stored and retrieved from HDFS.