ABSTRACT

Scientific computing jobs now commonly run for several days to several months; for example, a protein folding program running on BlueGene needs several months (Du, 2008). The mean time between failures (MTBF) of existing systems clearly cannot meet the needs of such applications. Given this problem, fault-tolerant technology, as an important method for improving system reliability, deserves more in-depth investigation.