ABSTRACT

Hadoop, one of the most popular map-reduce frameworks, uses a slight variation that makes the reduce step also potentially parallelizable. The main idea is to regroup, or reshuffle, the list of results from the map step so that the regroupings are amenable to further mapping of a reducible function. The style presented here is the style used by well-known MapReduce frameworks such as Hadoop, as they try to parallelize the data-intensive problems as much as possible. The process of parallelizing a complex data-intensive problem can involve several layers of regrouping. At the data center scale, parallelization is done by sending blocks of data to servers that perform simple tasks. This form of MapReduce was popularized by Google in the early 2000s. Since then, several data center-wide MapReduce frameworks have emerged, some of them open source.