ABSTRACT

As the world emerged into an era of Big Data, demand grew for a computing paradigm that (a) is generally applicable and (b) works on distributed data. The latter term means that data is physically distributed over many chunks, possibly on different disks and maybe even different geographical locations. Having the data stored in a distributed manner facilitates parallel computation — different chunks can be read simultaneously — and also enables us to work with data sets that are too large to fit into the memory of a single machine. Demand for such computational capability led to the development of various systems using the MapReduce paradigm.