ABSTRACT

The number of computing elements in large distributed systems is increasing rapidly. Failures and perturbations are becoming expected events rather than catastrophic exceptions. External intervention to restore normal operation or to reconfigure the system is difficult to obtain, and this will only become harder as systems grow. Means of recovery therefore have to be built into the system itself.