ABSTRACT

George Bosilca Innovative Computing Laboratory University of Tennessee, Department of Electrical Engineering and Computer Science bosilca@cs.utk.edu

Julien Langou University of Colorado at Denver and Health Sciences Center, Mathematical Sciences Department julien.langou@cudenver.edu

As the unquenchable desire of today’s scientists to run ever-larger simulations and analyze ever-larger data sets drives the size of high-performance computers from hundreds, to thousands, and even tens of thousands of processors, the mean-time-to-failure (MTTF) of these computers is becoming significantly shorter than the execution time of many current high-performance computing applications. Even with generous assumptions about the reliability of a single processor or link, it is clear that as the processor count in high-end clusters grows into the tens of thousands, the MTTF of these clusters will drop from a few years to a few days, or less. The current DOE ASCI computer (IBM Blue Gene/L) is designed with 131,000 processors. The MTTF of some nodes or links for this system is reported to be only six days on average [8].
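The scaling argument can be illustrated with a minimal sketch. It assumes a simple model that the text does not state: component failures are independent and exponentially distributed, and the system fails as soon as any one component fails, so the system MTTF is the component MTTF divided by the component count. The function name `system_mttf` and the 1,000-year per-component MTTF are hypothetical values chosen for illustration, not figures from the cited report.

```python
def system_mttf(component_mttf_days: float, n_components: int) -> float:
    """Return the MTTF (in days) of a system that fails when any component fails.

    Assumption (not from the source): failures are independent and
    exponentially distributed, so failure rates add and the system
    MTTF is the component MTTF divided by the number of components.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mttf_days / n_components


# Hypothetical, deliberately generous per-component MTTF: 1,000 years.
COMPONENT_MTTF_DAYS = 1000 * 365.0

if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 131_000):
        days = system_mttf(COMPONENT_MTTF_DAYS, n)
        print(f"{n:>7} components: system MTTF = {days:10.2f} days")
```

Under these assumptions, a machine with 1,000 components has an MTTF of about one year, while one with 131,000 components has an MTTF of under three days, which matches the order of magnitude of the drop described above.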