ABSTRACT

This paper describes the reliability demonstration activities for a high reliability disk file. This is a 5.25 inch form factor, high density disk file with a data storage capacity of one gigabyte. The file is used in a wide range of applications, from desktops to mid-range systems. The file utilises a Small Computer System Interface (SCSI).

Traditionally in IBM, the reliability of a new product has to be better than the field performance of its predecessor. However, this requirement has been demonstrated only through the traditional ‘paper’ reliability prediction, based on the summation of component failure rates, with only limited reliability testing carried out on production samples. This strategy has been successful for many products that were manufactured in small volumes in traditional batch production. It was not successful for the predecessor disk file, which was produced in large volumes through Continuous Flow Manufacture (CFM).

The primary lesson learned was the need to demonstrate reliability by testing early production samples. To this end, a Reliability Demonstration Test Plan was developed. This included testing early ‘current’ production samples to assess early life, and collating reliability data on files used at sister plants in system development and software proving. The test samples were subjected to temperature, voltage and power on/off stress conditions within the operating specifications of the product. However, as would be expected with any new strategy, particularly one that influences production, it was necessary to persuade management of the importance of reliability demonstration through testing.

Approximately two million power-on hours were accumulated on files run in system tests. No failures were observed beyond the first month. Excluding the first month, the file Intrinsic Failure Rate (IFR) was demonstrated with 95% confidence. More early life failures were observed than anticipated. These were due to latent design weaknesses and quality related problems. In recognition of these, the test activity was redirected to test large samples for shorter periods, with over 2000 files being stressed. The Test Analyse And Fix (TAAF) activity was successfully administered by a Failure Review Board (FRB). Twelve design and process changes were introduced. These improved the file early life reliability by a factor of twenty. The Duane reliability growth model was found suitable to describe the file early life reliability improvement.
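
As an illustration of the two calculations mentioned above, the following is a minimal sketch and not the analysis used in the test programme. It assumes a constant failure rate after early life, so that a test with zero failures gives a one-sided upper bound of -ln(1 - C) / T, and it assumes the usual Duane form in which the cumulative failure rate N(t)/t follows K * t^(-alpha). The function names and the two million hour input are illustrative assumptions; the exposure remaining after the first month is excluded is not stated here.

```python
import math


def zero_failure_upper_bound(total_hours, confidence=0.95):
    """Upper confidence bound on a constant failure rate (per hour)
    when no failures are observed in total_hours of testing."""
    if total_hours <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("total_hours must be > 0 and 0 < confidence < 1")
    return -math.log(1.0 - confidence) / total_hours


def fit_duane(times, cumulative_failures):
    """Least-squares fit of the Duane model N(t)/t = K * t**(-alpha).

    times: cumulative test hours at each observation (all > 0)
    cumulative_failures: cumulative failure counts at those times (all > 0)
    Returns (K, alpha); alpha > 0 indicates reliability growth.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(n / t) for t, n in zip(times, cumulative_failures)]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # slope of log(N/t) against log(t) equals -alpha
    k = math.exp(mean_y - slope * mean_x)
    return k, -slope


if __name__ == "__main__":
    # Illustrative input only: the full two million hours, zero failures.
    bound = zero_failure_upper_bound(2_000_000, 0.95)
    print(f"95% upper bound on failure rate: {bound:.3e} per hour")
```

On a log-log plot the fitted Duane line is straight, which is why the model is convenient for tracking the effect of successive design and process changes.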