ABSTRACT

Distributed systems are popular and increasingly powerful compared with legacy systems, which are generally monolithic and massive. They are gaining prominence because of properties such as high availability and scalability. However, faults occur and recur in any distributed computing system, and failures are to be expected in both hardware and software components. Fault tolerance is therefore a central theme and core requirement for establishing and sustaining the distributed computing paradigm: it has to be built into distributed systems, and it has to be guaranteed if the advantages mentioned above are to be realized. Fault tolerance refers to the algorithmic control of the participating components so that the required services are still rendered in the presence of a fault or failure. This is typically accomplished by keeping redundant instances of each pivotal module of the distributed system.

Fault tolerance is a critical measure for any business workload or information technology (IT) service that must continue functioning reliably; it depends on proactively finding and isolating faults. Achieving fault tolerance in distributed systems is therefore an important goal and a challenging task. In this chapter, we present a brief introduction to distributed systems and how they can be made fault tolerant. There are many fault tolerance approaches and algorithms; software engineers and architects need to be aware of these techniques to succeed in the emerging era of distributed computing.