Enhancing fault tolerance in MapReduce tasks

doi:10.1201/9780429461903-73

ABSTRACT

MapReduce is a programming model and runtime environment for big-data processing on distributed systems (e.g., clusters, clouds, and grids). Task failure has become a critical issue in MapReduce: it can increase the cost of jobs and degrade resource utilization. The current MapReduce fault-tolerance mechanism reschedules failed tasks on other nodes, where they are re-executed, and this rescheduling affects both resource utilization and execution time. In this paper, a new rollback-recovery model called Pessimistic Log-based Rollback (PLR) is introduced for MapReduce fault tolerance. The central principle of the proposed PLR model is a logging process that records the task as the determinant in the log report, enabling rollback when a failure occurs. When a task fails, the PLR model reactivates the task on the same node, starting from the last state before the failure, which in the optimistic case resolves the task failure without rescheduling. In the worst case, the task is rescheduled onto another node for re-execution. Experimental results show that the proposed PLR model improves MapReduce performance when failures occur, reducing execution time by approximately 35%.
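
The recovery flow described above can be illustrated with a minimal, single-process sketch. It is not the PLR implementation evaluated in this paper: the word-count task, the JSON log format, and the names `run_map_task`, `load_checkpoint`, `run_with_recovery`, and `fail_at` are all assumptions made for illustration, and "rescheduling onto another node" is simulated by discarding the log and re-executing from scratch.

```python
import json
import os
import tempfile


class TaskFailure(Exception):
    """Simulated crash of a running task (hypothetical, for illustration)."""


def load_checkpoint(log_path):
    """Return (next_index, counts) from the last complete log entry."""
    if not os.path.exists(log_path):
        return 0, {}
    last = None
    with open(log_path) as log:
        for line in log:
            try:
                last = json.loads(line)
            except json.JSONDecodeError:
                break  # ignore a torn entry written during a crash
    if last is None:
        return 0, {}
    return last["next_index"], last["counts"]


def run_map_task(records, log_path, fail_at=None):
    """Word-count task that resumes from the log and logs after every record."""
    next_index, counts = load_checkpoint(log_path)
    with open(log_path, "a") as log:
        for i in range(next_index, len(records)):
            if fail_at is not None and i == fail_at:
                raise TaskFailure(f"simulated crash before record {i}")
            for word in records[i].split():
                counts[word] = counts.get(word, 0) + 1
            # Pessimistic logging: force the entry to stable storage
            # before the task moves on to the next record.
            entry = {"next_index": i + 1, "counts": counts}
            log.write(json.dumps(entry) + "\n")
            log.flush()
            os.fsync(log.fileno())
    return counts


def run_with_recovery(records, log_path, fail_at=None):
    """Try the task, then a local rollback, then a full re-execution."""
    try:
        return run_map_task(records, log_path, fail_at)
    except TaskFailure:
        pass
    try:
        # Optimistic case: restart on the same node from the last logged state.
        return run_map_task(records, log_path)
    except Exception:
        # Worst case: drop the log and re-execute from the beginning
        # (stands in for rescheduling the task onto another node).
        if os.path.exists(log_path):
            os.remove(log_path)
        return run_map_task(records, log_path)


# Usage: crash before the third record, then recover from the log.
with tempfile.TemporaryDirectory() as workdir:
    demo_log = os.path.join(workdir, "task_0.log")
    print(run_with_recovery(["a b", "b c", "c c"], demo_log, fail_at=2))
```

In this sketch the log entry is flushed and synced before the next record is processed, so a restarted task never repeats work that was already logged; the cost is one synchronous write per record, which a real system would likely amortize by logging at a coarser granularity.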