ABSTRACT

Intercoder reliability is the most frequently used quantitative indicator of measurement quality in content studies. Researchers in psychology, sociology, education, medicine, marketing, and other disciplines also use reliability to evaluate the quality of diagnoses, tests, and other assessments. Many indices of reliability have been recommended for general use; this article analyzes 22 of them, organized into 18 chance-adjusted and four nonadjusted indices. The chance-adjusted indices are further organized into three groups: nine category-based indices, eight distribution-based indices, and one that is based on both category and distribution.

The main purpose of this work is to examine the assumptions behind each index. Most of these assumptions are unexamined in the literature, yet they have implications for assessments of reliability that need to be understood, and they give rise to paradoxes and abnormalities. This article discusses 13 paradoxes and nine abnormalities to illustrate the 24 assumptions. To facilitate understanding, the analysis focuses on categorical scales with two coders, and further on binary scales where appropriate. The discussion is situated mostly in the analysis of communication content, but the assumptions and patterns we discover also apply to studies, evaluations, and diagnoses in other disciplines involving more coders, raters, diagnosticians, or judges using binary or multicategory scales.
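To make the contrast between nonadjusted and chance-adjusted indices concrete, the following sketch (not taken from this work; the data are invented for illustration) compares raw percent agreement with Cohen's kappa, one well-known chance-adjusted index, for two coders on a binary scale. It also illustrates one flavor of the paradoxes discussed here: under a highly skewed distribution, observed agreement can be high while kappa is zero.

```python
from collections import Counter

def percent_agreement(a, b):
    """Nonadjusted index: share of items on which the two coders agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-adjusted index: kappa = (p_o - p_e) / (1 - p_e),
    where p_e is the agreement expected by chance from the two coders'
    marginal category distributions."""
    n = len(a)
    p_o = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Two coders rate 8 items on a binary scale; they agree on 6 of 8.
a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 0, 1, 1, 0]
print(percent_agreement(a, b))  # 0.75
print(cohens_kappa(a, b))       # 0.5, after discounting chance agreement of 0.5

# A paradox: agreement is 90%, yet kappa is 0, because the skewed
# marginals make the expected chance agreement 0.9 as well.
c = [1] * 10
d = [1] * 9 + [0]
print(percent_agreement(c, d))  # 0.9
print(cohens_kappa(c, d))       # 0.0
```

The point of the sketch is only the mechanism of chance adjustment; the article's analysis covers many other adjusted indices, each resting on different assumptions about what "chance" means.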