ABSTRACT
Fault tolerance is the ability of a system to continue correct performance of its tasks after the occurrence of
hardware or software faults. A fault is simply any physical defect, imperfection, or flaw that occurs in hardware
or software. Applications of fault-tolerant computing can be categorized broadly into four primary areas: long-
life, critical computations, maintenance postponement, and high availability. The most common examples of
long-life applications are unmanned space flight and satellites. Examples of critical-computation applications
include aircraft flight control systems, military systems, and certain types of industrial controllers.
Maintenance postponement applications appear most frequently when maintenance operations are extremely
costly, inconvenient, or difficult to perform. Remote processing stations and certain space applications are
good examples. Banking and other time-shared systems are good examples of high-availability applications.
Fault tolerance can be achieved in systems by incorporating various forms of redundancy, including hardware,
information, time, and software redundancy [Johnson, 1989].