Fault Tolerance | 128 | v3 | The Electrical Engineering Handbook

ABSTRACT

Fault tolerance is the ability of a system to continue correct performance of its tasks after the occurrence of

hardware or software faults. A fault is simply any physical defect, imperfection, or flaw that occurs in hardware

or software. Applications of fault-tolerant computing can be categorized broadly into four primary areas: long-

life, critical computations, maintenance postponement, and high availability. The most common examples of

long-life applications are unmanned space flight and satellites. Examples of critical-computation applications

include aircraft flight control systems, military systems, and certain types of industrial controllers.

Maintenance postponement applications appear most frequently when maintenance operations are extremely

costly, inconvenient, or difficult to perform. Remote processing stations and certain space applications are

good examples. Banking and other time-shared systems are good examples of high-availability applications.

Fault tolerance can be achieved in systems by incorporating various forms of redundancy, including hardware,

information, time, and software redundancy [Johnson, 1989].