ABSTRACT

Distributed applications are becoming increasingly data-intensive, which calls for data-parallel frameworks such as Hadoop and its many offshoots to manage big-data workloads. In such applications, data must first be placed across clusters, and computations are then scheduled on the cluster nodes that hold the data. However, Hadoop's native data placement (HNDP) ignores information about data distribution and schedules computation within a Hadoop cluster without regard to the data requirements of jobs. This chapter advocates a data-location-aware scheduling scheme that improves performance by reducing the runtime overhead of data transfer among clusters. The proposed scheme, hereafter referred to as the data aware computation scheduler (DACS), has been implemented on an OpenStack cloud environment with Savanna Hadoop. The chapter compares the performance of DACS with that of the HNDP policy.