Multi-dimensional data analysis optimization

ABSTRACT

In this chapter, we present methods for improving the performance of MapReduce-based Multiple Group-by query processing [Pan et al., 2010c, Pan et al., 2010a, Pan et al., 2010b]. In a distributed shared-nothing architecture such as a MapReduce system, there are two approaches to optimizing query processing. The first is to choose an optimal job-scheduling policy so that the computation completes in minimum time; load balancing, data skew, and straggler nodes are among the issues involved in job scheduling. The second approach focuses on optimizing the individual jobs that constitute the parallel query processing. Individual job optimization must consider the characteristics of the computations involved, including low-level optimization of detailed operations. Although the two approaches operate at different levels, they influence each other: optimizing individual jobs sometimes affects the job-scheduling policy. We first discuss the optimization work for accelerating individual jobs during the parallel processing of the Multiple Group-by query. We then identify the factors that affect performance during this procedure and present the performance measurements. Execution time estimation models are proposed for query executions based on different data partitioning methods. Finally, an alternative compressed data structure is proposed at the end of the chapter; it enables more flexible job scheduling.