ABSTRACT

An inherent drawback of non-dedicated computing resources is the low and uncontrollable availability of individual hosts. This severely limits the range of scenarios in which such resources can be used and adds significant deployment overhead. Approaches that compensate for these difficulties use redundancy-based fault tolerance techniques supported by modeling and prediction of availability. In the first part of this chapter we discuss a variety of modeling techniques, ranging from probability distributions to machine learning-based prediction. Subsequently we focus on methods that provide resource-efficient and cost-minimizing fault tolerance. Redundancy is mandatory to mask the outages of individual machines, but it can also increase overhead and resource cost. We describe how availability models help to obtain statistical guarantees of (collective) availability and how the total cost of the resources can be balanced against reliability properties. We also consider how application architectures can be adjusted to tolerate partial resource failures. This promises to broaden the range of applications deployed on volunteer computing resources from embarrassingly parallel jobs to Map-Reduce-type applications or even Web services.

TABLE 9.1: Lower Bounds on the Probability of a Failure of at Least One Host in a Group of n Hosts (with individual failure probabilities p1, ..., pn)

Number of hosts n    2     3     4     5     6     8     10    16    32

pi ≥ 0.1             0.19  0.27  0.34  0.41  0.47  0.57  0.65  0.81  0.97
pi ≥ 0.2             0.36  0.49  0.59  0.67  0.74  0.83  0.89  0.97  1.00
pi ≥ 0.4             0.64  0.78  0.87  0.92  0.95  0.98  0.99  1.00  1.00
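The entries of Table 9.1 are consistent with the bound 1 - (1 - p)^n, which holds when every host fails with probability at least p and hosts fail independently of each other. Independence is an assumption made here for illustration; the excerpt does not state it. The following minimal Python sketch recomputes the table rows under that assumption. The names group_failure_lower_bound, table_row, and HOST_COUNTS are hypothetical and do not come from the chapter.

```python
# Sketch only: assumes independent host failures (not stated in the text).
# If each host i fails with probability p_i >= p_min, then
#   P(at least one failure) = 1 - prod(1 - p_i) >= 1 - (1 - p_min) ** n.

HOST_COUNTS = (2, 3, 4, 5, 6, 8, 10, 16, 32)  # group sizes used in Table 9.1


def group_failure_lower_bound(p_min: float, n: int) -> float:
    """Lower bound on the probability that at least one of n hosts fails."""
    if not 0.0 <= p_min <= 1.0:
        raise ValueError("p_min must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1.0 - (1.0 - p_min) ** n


def table_row(p_min: float) -> list[float]:
    """One row of the table: the bound for each group size, to two decimals."""
    return [round(group_failure_lower_bound(p_min, n), 2) for n in HOST_COUNTS]


if __name__ == "__main__":
    for p in (0.1, 0.2, 0.4):
        cells = "  ".join(f"{value:.2f}" for value in table_row(p))
        print(f"pi >= {p:.1f}  {cells}")
```

Run as a script, this prints one row per threshold. Even with a modest individual failure probability of 0.1, the bound exceeds 0.8 for a group of 16 hosts, which illustrates why redundancy is needed to mask individual outages.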