Data Locality and Dependency for MapReduce

doi:10.1201/9781315155678-16

ABSTRACT

This chapter presents dependency-aware locality for MapReduce (DALM), a scheme for processing real-world input data, which can be highly skewed and dependent. DALM accommodates data dependency in a data-locality framework, synthesizing the key components of data reorganization, replication, and placement. The chapter also explores the single-server virtualization overhead of MapReduce-based big data processing and shows that vLocality helps mitigate this overhead at data center scale by reducing cross-server traffic. In a typical virtual MapReduce cluster, a high-capacity core switch interconnects top-of-rack switches, which have relatively low capacity. DALM can be easily incorporated into this conventional three-level design with a physical machine cluster. Machine virtualization tools such as Xen, KVM, and VMware allow multiple virtual machines to run on a single physical machine, offering highly efficient hardware resource sharing and effectively reducing the operating costs of cloud providers. The chapter evaluates DALM through extensive simulations and test-bed experiments and compares it with state-of-the-art data locality solutions.