ABSTRACT

In Chapter 2, we examined indices by which one can determine the extent of agreement between two or more methods of evaluating test subjects, where no single method can by itself be accepted as a standard. The methods being compared may take a variety of forms: diagnostic devices; raters or examiners using the same procedure; or several independent applications of a given instrument or procedure by one rater. Although the raters may differ, test-retest studies of reliability generally share the same basic design: the methods of evaluation are applied to one group of subjects at the same time or within an acceptable time interval, and the assumption is made that the subject being evaluated does not change from one evaluation to the next. The methods and models discussed there are applicable to continuous scale measurements (e.g., blood pressure, glucose level, bacterial counts). This and the following chapters are devoted to methods of evaluating agreement among several raters making categorical assignments. Here are some examples:

In a methodological study conducted for the U.S. National Health Survey, two different questionnaire forms were administered to the same respondents within a 7- to 10-day interval, in an effort to determine the degree of agreement between forms in the elicited responses concerning the presence or absence of certain disease conditions among respondents. In another study, by Westlund and Kurland (1953), two neurologists reviewed the same set of selected medical records of potential multiple sclerosis patients and classified each of the individuals involved into one of four categories, ranging from certain to doubtful multiple sclerosis. The purpose here was to determine the extent to which trained neurologists agreed in their diagnoses of multiple sclerosis based on a medical record review.
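Both examples share the same structure: two raters (or forms) assign the same subjects to categories, and we ask how much they agree beyond what chance alone would produce. One standard chance-corrected index for this setting is Cohen's kappa. The sketch below is purely illustrative; the data are hypothetical and do not come from either study cited above.

```python
# Illustrative sketch: Cohen's kappa for two raters classifying the same
# subjects into categories (here, binary present/absent judgments).
# The data below are hypothetical, not from the studies cited in the text.
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on the same subjects."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    # Observed proportion of agreement
    po = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected agreement under independence, from each rater's marginal frequencies
    m1, m2 = Counter(rater1), Counter(rater2)
    categories = set(rater1) | set(rater2)
    pe = sum((m1[c] / n) * (m2[c] / n) for c in categories)
    return (po - pe) / (1 - pe)

# Hypothetical responses: 1 = condition reported present, 0 = absent
form_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
form_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(round(cohens_kappa(form_a, form_b), 3))  # → 0.583
```

Here the two forms agree on 8 of 10 subjects (po = 0.8), but because agreement by chance is already pe = 0.52 given the marginal frequencies, kappa credits only the excess: (0.8 − 0.52)/(1 − 0.52) ≈ 0.583.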