Checkpointing Code for Restartability with HDF5

doi:10.1201/9781315382395-22

ABSTRACT

Checkpointing is a technique for providing fault-tolerance to a code, particularly when the code runs on a complex system with multiple potential points of failure, and which consumes considerable time and/or resources. It involves taking a snapshot of the state of the program and the values of all variables that will be accessed as the program continues to run and saving them, typically to disk, so that if a failure (power outage, hardware fault, network outage, etc.) occurs, the program may be restarted from this saved state information. In the move toward ever more parallelism, the issues of fault tolerance become more poignant. The larger and more complex a computing system is, and the more processors, circuits, and others, the greater the possibility of failure at some point in the system during the course of simulation.