ABSTRACT

Lei Yu, Applied Mathematics and Systems Laboratory, Ecole Centrale Paris, Grande Voie des Vignes, 92295 Châtenay-Malabry, France

Frédéric Magoulès, Applied Mathematics and Systems Laboratory, Ecole Centrale Paris, Grande Voie des Vignes, 92295 Châtenay-Malabry, France

With the deployment of more and more heterogeneous clusters, grid computing has become an increasingly popular solution for leveraging existing IT infrastructure to optimize computing resources and manage data and computing workloads. Many grid projects have been launched to build national problem-solving systems on the grid, such as GrADS [Berman et al., 2001] and DIET [Caron and Desprez, 2006]. These projects aim to connect the nation’s computers, databases and instruments in a seamless grid, supporting emerging computation-intensive application concepts such as remote computing, distributed supercomputing, tele-immersion, smart instruments and data mining. In such large-scale systems, scheduling and fault tolerance are key technical obstacles to overcome. According to Hamscher and his colleagues [Hamscher et al., 2000], metascheduling architectures can be classified into three principal schemes: centralized scheduling, hierarchical scheduling and distributed scheduling. The main

that policies can be used for local and global job scheduling, the communication bottleneck of centralized scheduling is avoided and the system is more scalable. In the hierarchical and distributed structures, however, each resource belongs to its own administrative domain. These resources are geographically distributed and are connected through a WAN or even the Internet. These characteristics make such scheduling structures more error-prone than other computing environments. A fault-tolerant mechanism is therefore needed to detect component failures automatically and to ensure that a failure does not affect the whole grid system.