ABSTRACT

The cloud, with its reliance on commodity hardware and virtualization and its potential for enormous scale, presents many additional challenges to designing reliable applications. In all engineering disciplines, reliability is the ability of a system to perform its required functions under stated conditions for a specified period of time. For a software application, this becomes the ability of the application and every component it depends on (operating system, hypervisor, servers, disks, network connections, power supplies, etc.) to execute without faults or halts all the way to completion, where completion is defined by the application designer. Even with perfectly written software and no detected bugs in any underlying software system, an application that uses thousands of servers will encounter hardware failures: each component has a finite mean time to failure, so across a large enough fleet some component is failing at almost any given time, and some number of those instances will fail. Therefore, an application that depends on those instances will also fail.
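
The scale argument above can be made concrete with a short calculation. The following is a minimal sketch, not a model taken from this paper: it assumes servers fail independently with exponentially distributed lifetimes, and the per-server mean time to failure of three years is a hypothetical figure chosen only for illustration. The function names are likewise illustrative.

```python
import math


def prob_any_failure(n_servers: int, mttf_hours: float, window_hours: float) -> float:
    """Probability that at least one of n servers fails within the window.

    Assumes independent servers with exponentially distributed lifetimes
    (an illustrative simplification, not a claim about real hardware).
    """
    # Each server survives the window with probability exp(-window / MTTF),
    # so all n survive with probability exp(-n * window / MTTF).
    return 1.0 - math.exp(-n_servers * window_hours / mttf_hours)


def mean_time_to_first_failure(n_servers: int, mttf_hours: float) -> float:
    """Expected time until the first failure anywhere in the fleet."""
    # The minimum of n independent exponential lifetimes is itself
    # exponential, with mean MTTF / n.
    return mttf_hours / n_servers


if __name__ == "__main__":
    assumed_mttf_hours = 3 * 365 * 24  # hypothetical: 3 years per server
    for n in (1, 100, 1_000, 10_000):
        p_day = prob_any_failure(n, assumed_mttf_hours, 24.0)
        first = mean_time_to_first_failure(n, assumed_mttf_hours)
        print(f"{n:>6} servers: P(failure within 24 h) = {p_day:.3f}, "
              f"mean time to first failure = {first:.1f} h")
```

Under these assumed figures, a single server is expected to run for about three years before failing, while a fleet of 1,000 such servers would see its first failure after roughly 26 hours on average. Real failure rates vary with hardware age and workload, so the numbers indicate only the trend.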