ABSTRACT

Organization theorists and practitioners alike have become greatly interested in high reliability in the management of large hazardous technical systems and society’s critical service infrastructures. But much of the reliability analysis centers on particular organizations that exercise command and control over their technical cores. Many technical systems, including electricity generation, water, telecommunications, and other critical infrastructures, are not the exclusive domain of single organizations. Our chapter is organized around the following research question: How do organizations, many with competing if not conflicting goals and interests, provide highly reliable service in the absence of ongoing command and control and in the presence of rapidly changing task environments with highly consequential hazards? We analyze electricity restructuring in California as a specific case. Our conclusions have surprising and important implications both for high reliability theory and for the future management of critical infrastructures organized around large technical systems.

Interest among theorists in organizational reliability – the ability of organizations to manage hazardous technical systems safely and without serious error – has grown dramatically in recent years (LaPorte and Consolini, 1991; Roberts, 1993; Sagan, 1993; Schulman, 1993a; Rochlin and von Meier, 1994; Weick, Sutcliffe and Obstfeld, 1999; Perrow, 1999; Sanne, 2000; Beamish, 2002; Evan and Manion, 2002). Interest has also heightened over “critical infrastructures” and their reliability in the face of potential terrorist attack. Assault on any of the critical infrastructures for power, water, telecommunications, and financial services entails great consequences for their users as well as for the other interdependent critical infrastructures (Homer-Dixon, 2002; Mann, 2002).

A momentous debate has been taking place among policy and management experts over how best to protect critical infrastructures against attack. What are their key vulnerabilities? To what extent should the operation of critical services be centralized or decentralized? A report released by the US National Academies of Sciences and Engineering and the Institute of Medicine argues that, for a variety of infrastructures, including energy, transportation, information technology, and health care, “the interconnectedness within and across systems makes the infrastructure vulnerable to local disruptions that could lead to catastrophic failure” (NRC, 2002).

The highly reliable management of large-scale critical service systems presents a major paradox: demands for ever higher reliability, and not only against terrorist assault, surround these services as we grow more dependent on them, yet at the same time, the conventional organizational frameworks associated with high reliability are being dismantled. Deregulation in air traffic, electricity, and telecommunications has led to the unbundling and breakup of utilities and other organizations operating these systems. Increasing environmental requirements have brought new units and conflicting mandates into the management of large technical water systems (van Eeten and Roe, 2002). Dominant theories would predict that high reliability is unlikely or at great risk in such rapidly changing systems. In fact, high reliability in providing critical services has become a process that is achieved across organizations rather than a trait of any one organization. Critical infrastructures in communication, transportation, and water resources all display structurally and geographically diverse constituent elements. The inputs and disturbances to which they are subject are also diverse and place them under increasing pressure of fragmentation (e.g., deregulation and regulatory turmoil). Yet they are mandated to provide not just critical services to society, but reliably critical services, notwithstanding their turbulent task environments.

Examples of successful “high reliability management” continue to come forward: mobile telecom services are becoming more reliable across ever more complex hardware and service demands; the lights by and large stayed on during the California electricity crisis; Y2K passed without major incident; fairly rapid recovery in financial services from 9/11 was secured; and large hydropower systems reconcile hourly the conflicting reliability mandates across multiple uses of water.

How can such successful management be explained under conditions where theory tells us to expect otherwise? By way of explanation, we present a case study of how the reliability of critical services was maintained in the restructured California electricity sector during the 2000–2001 electricity crisis. This case study has been chosen for two reasons.

First, both Normal Accident Theory (NAT; Perrow, 1999) and most of the earlier High Reliability Organizations (HROs) research would predict that the restructuring of the California electricity sector should have demonstrably undermined the reliable provision of electricity. For its part, NAT would see the coupling of the state’s two major electricity grids into a newly configured grid, with altogether novel and higher flows of energy, as an increase in the technology’s tight coupling and complex interactivity. The probability of cascading failures would increase accordingly. For its part, the earlier HRO research would have concluded similarly. Since electricity HROs, especially at nuclear power plants, must stabilize both inputs and outputs, reliability would necessarily suffer to the extent that this stability was thrown into doubt (Schulman, 1993a). Indeed, subsequent events in California seemed to confirm both theories, as the chief feature of the California electricity crisis has been taken by many to be the unreliability of electricity provision.

Yet, notwithstanding the popular view of rolling blackouts sweeping California during its electricity crisis, in aggregate terms – both in hours and megawatts (MW) – blackouts were minimal. While no figures are available for the baseline before the crisis, it is important to note that rolling blackouts occurred on 6 days during 2001, accounting for no more than 30 hours. Load shedding ranged from 300 to 1000 MW in a state whose total daily load averaged in the upper 20,000 to lower 30,000 MW. The aggregate amount of energy actually shed during these rolling blackouts came to slightly more than 14,000 megawatt-hours (MWh), the rough equivalent of 10.5 million homes being out of power for 1 hour in a state having some 11.5 million households – with business and other nonresidential customers remaining unaffected. In short, the California electricity crisis had the effect of less than an hour’s worth of outage for every household in the state in 2001. Why the lights by and large actually stayed on – and reliably stayed on – was due, we argue, to factors related to a special type of reliability management.
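The homes-for-an-hour equivalence above can be verified with a quick back-of-envelope calculation. The average household draw of roughly 1.33 kW used below is our own illustrative assumption chosen to reproduce the chapter’s figures, not a number given in the text:

```python
# Back-of-envelope check of the chapter's outage figures.
TOTAL_SHED_MWH = 14_000      # aggregate energy shed in 2001 rolling blackouts
HOUSEHOLDS = 11_500_000      # approximate California households
AVG_HOUSEHOLD_KW = 1.33      # assumed average draw per home (illustrative)

# Homes-for-one-hour equivalent of the shed energy:
# 14,000 MWh = 14,000,000 kWh; at 1.33 kW per home for 1 hour,
# that energy corresponds to roughly 10.5 million home-hours.
homes_one_hour = TOTAL_SHED_MWH * 1_000 / AVG_HOUSEHOLD_KW

# Outage-hours per household if the shed energy were spread
# evenly across every household in the state.
hours_per_household = TOTAL_SHED_MWH * 1_000 / (HOUSEHOLDS * AVG_HOUSEHOLD_KW)

print(f"{homes_one_hour:,.0f} home-hours")       # about 10.5 million
print(f"{hours_per_household:.2f} hours/household")  # under one hour
```

On these assumptions the calculation reproduces both of the chapter’s claims: the shed energy equates to roughly 10.5 million homes dark for one hour, and to less than one outage-hour per California household over the whole crisis year.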

A second reason for focusing on the California case study is that it offers a challenge to the NAT notion of “complex interactivity and tight coupling” as an inevitable source of normal accidents in large technical systems. At least in the case of the electricity critical infrastructure, complexity and tight coupling can actually be important resources for high reliability management.