Chapter 3
28 Pages

Optimal Periodic Software Rejuvenation Policies in Discrete Time—Survey and Applications

By Tadashi Dohi, Junjun Zheng, Hiroyuki Okamura

Present-day applications in computer systems impose stringent requirements in terms of software dependability, because system failure, caused by software failure in almost all cases, may lead to a huge economic loss or risk to human life. Guaranteeing that these requirements are met is very difficult, especially in applications of nontrivial complexity. In recent years, considerable attention has been paid to continuously running software systems whose performance characteristics degrade gradually over time. When a software application executes continuously for a long period of time, some of its faults cause the software to age because of error conditions that accrue with time and/or load. This phenomenon is called software aging and has been observed in many real software systems [1–6]. Common experience suggests that most software failures are transient in nature [7]. Since transient failures disappear if the operation is retried later in a slightly different context, it is difficult to characterize their root cause. Therefore, residual software faults are unavoidable in the operational phase. Grottke and Trivedi [8] classify several types of software bugs and point out that resource exhaustion in computer systems causes software aging. A complementary approach to handling transient software failures is software rejuvenation [9], which can be regarded as a preventive and proactive solution that is particularly useful for counteracting software aging. It involves occasionally stopping the running software, cleaning its internal state, and restarting it. Cleaning the internal state may involve garbage collection, flushing operating system kernel tables, reinitializing internal data structures, and so on. An extreme but well-known example of rejuvenation is a hardware reboot. As a result, software rejuvenation is becoming increasingly popular as a lightweight software fault-tolerance technique.