ABSTRACT

This chapter describes statistical tests that provide an index of how reliable a particular measure is or of how much agreement exists between two or more judges. For instance, we may wish to find out to what extent two or more judges categorise or rate subjects in the same way, or to what extent the answers to questions devised to measure the same quality are consistent. The type of test to use depends on whether the data are categorical. For categorical data, the most widely recommended index of agreement between two or more judges is Cohen’s (1960) kappa coefficient. For non-categorical data, the most common measure of reliability across three or more judges is Ebel’s (1951) intraclass correlation, while the most frequently used index of the internal reliability of a set of questions is Cronbach’s (1951) alpha.
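
As an orientation to the three indices named above, the sketch below shows how each might be computed with plain NumPy. The function names and the small data sets are illustrative assumptions, and the formulas are the standard textbook forms rather than necessarily the exact variants worked through later in the chapter.

```python
import numpy as np


def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two judges on categorical data."""
    a, b = np.asarray(a), np.asarray(b)
    categories = np.union1d(a, b)
    p_observed = np.mean(a == b)
    # Chance agreement expected from each judge's marginal category proportions.
    p_expected = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_expected) / (1.0 - p_expected)


def cronbachs_alpha(items):
    """Cronbach's alpha for an item-score matrix (rows = respondents, columns = questions)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1.0)) * (1.0 - item_variances.sum() / total_variance)


def intraclass_correlation(ratings):
    """A one-way intraclass correlation for a ratings matrix (rows = subjects, columns = judges)."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    # Between-subjects and within-subjects mean squares from a one-way ANOVA on subjects.
    ms_between = k * ratings.mean(axis=1).var(ddof=1)
    ms_within = ((ratings - ratings.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


if __name__ == "__main__":
    # Hypothetical data: two judges sorting 10 subjects into categories 1-3 ...
    print(cohens_kappa([1, 2, 2, 3, 1, 1, 2, 3, 3, 1],
                       [1, 2, 3, 3, 1, 2, 2, 3, 3, 1]))
    # ... three judges rating 5 subjects on a 1-7 scale ...
    print(intraclass_correlation([[4, 5, 4], [2, 3, 2], [6, 6, 7], [3, 3, 4], [5, 4, 5]]))
    # ... and 5 respondents answering 4 questions intended to measure the same quality.
    print(cronbachs_alpha([[3, 4, 3, 4], [2, 2, 3, 2], [5, 5, 4, 5], [1, 2, 1, 2], [4, 4, 5, 4]]))
```

Each function returns a value near 1 when agreement or internal consistency is high and near 0 (or below) when agreement is no better than chance; the chapter discusses how such values are interpreted and tested.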