ABSTRACT

This chapter investigates the online optimal deployment of big data analytics jobs across geo-distributed regions when inter-datacenter bandwidths and task execution durations on different virtual machines (VMs) are unknown and uncertain. Geo-distributed big data analytics systems extend a single-cluster MapReduce, Spark, or parameter server-based system to the Wide Area Network (WAN) in order to process data generated in different geographic locations. The centralized processing approach, by contrast, is time-consuming because it transmits large volumes of data over bandwidth-constrained WAN links, and it is costly in terms of resource consumption. The chapter provides an online learning-based algorithm that does not rely on offline training but learns near-optimal placement decisions for each type of job over time. The algorithm computes the task deployment for each stage of each job within 1600 ms for 500 tasks, 10 data centers, and 9 VM types. The chapter also investigates the multiple-job scheduling problem with resource constraints under similar runtime uncertainties.