Reliability in Distributed Systems | 8 | Distributed System Design

ABSTRACT

This chapter discusses methods of handling node, communication, Byzantine, and software faults in a distributed system. The concept of dependability was initially proposed by Laprie. In general, dependability includes three components: reliability, safety, and security. The three basic fault-handling methods are active replication, passive replication, and semi-active replication. The chapter considers only software-based fault handling. Two software models are generally used for this purpose: the process-based model and the object-based model. Stable storage is a logical abstraction for a special storage that can survive system failure; that is, the contents of stable storage are not destroyed or corrupted by a failure. An atomic action is a set of operations that are executed indivisibly by hardware; that is, either all of the operations complete successfully or the state of the system remains unchanged.
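
As a rough illustration of the all-or-nothing property described above, the following Python sketch updates a file so that a reader sees either the old contents or the new contents, never a partial write. It is a single-machine software approximation, not the chapter's own mechanism: `atomic_write` is a hypothetical helper name, and the sketch assumes a file system on which `os.replace` is atomic (as POSIX requires for rename). It does not flush the containing directory, so it falls short of true stable storage.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the contents of `path` with `data`, all or nothing.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new contents to a temporary file in the same directory,
    # so the final rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to the storage device
        # Commit point: the old file is swapped for the new one in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Abort: discard the partial temporary file; `path` is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If anything fails before the call to `os.replace`, the original file is left exactly as it was, which mirrors the "state of the system remains unchanged" half of the definition.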