ABSTRACT

“Resilience” has become an overarching term for the reliability of both software and hardware in high-performance computing. The concept emerged from a philosophical shift away from merely tolerating faults and toward riding through failures. Although examples of resilience certainly exist, particularly in hardware, there are relatively few examples of true resilience at higher levels of the software/hardware stack.