ABSTRACT

Reliability and measurement error are classical psychometric properties of educational and psychological tests that require little introduction. In this chapter, I describe the concepts of test reliability and errors of measurement in the context of multistage testing and item response theory (IRT). I stress that the classical notion of test reliability applies to the sum of a number of variables, for example, the number of correctly answered items in a linear test (Guttman 1945). In a multistage test, however, test takers are administered test forms made up of different items of different difficulty, and their responses at each stage are used to determine which level of difficulty they receive next. Determining the reliability of a multistage test is therefore more complicated than for a linear test. It is nonetheless less involved than for an item-level adaptive test because the number of different test forms in a multistage test is generally limited. For example, if a three-stage test has three difficulty levels for stages 2 and 3, then the total number of test forms is nine. If each stage consists of ten items and no overlap is allowed, then the total number of items needed is ninety. In contrast, an item-by-item adaptive test of thirty items built from an item bank of ninety items has (90 × 89 × · · · × 61)/30! ≈ 6.73133 × 10^23 possible test forms, counting each distinct set of thirty items as one form. The classical notion of reliability as an indicator of the measurement precision of a fixed test administered to all test takers therefore seems difficult to retain for multistage and adaptive testing, because individuals are no longer administered the same set of items. Nevertheless, test reliability can still be useful in these cases, and I will demonstrate how to estimate appropriate reliability measures by making use of IRT methodology.
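
The counts in the example above can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical, and the single routing module at stage 1 is an assumption, since the text specifies difficulty levels only for stages 2 and 3. The sketch computes both the number of ordered item sequences (90 × 89 × · · · × 61) and the number of distinct thirty-item sets, the binomial coefficient "90 choose 30"; the figure of roughly 6.73 × 10^23 corresponds to the latter.

```python
from math import comb, perm  # available in Python 3.8+

# Assumption: one routing module at stage 1, then three difficulty
# levels at each of stages 2 and 3, as in the three-stage example.
levels_per_stage = [1, 3, 3]

# Each path through the stages is one multistage test form.
n_forms = 1
for levels in levels_per_stage:
    n_forms *= levels

# Item-level adaptive test: thirty items drawn from a bank of ninety.
bank_size, test_length = 90, 30

# Ordered sequences of items: 90 * 89 * ... * 61
ordered_sequences = perm(bank_size, test_length)

# Distinct sets of items, ignoring order: ordered_sequences / 30!
unordered_item_sets = comb(bank_size, test_length)

print("multistage forms:", n_forms)
print(f"ordered item sequences: {ordered_sequences:.5e}")
print(f"distinct item sets:     {unordered_item_sets:.5e}")
```

Either way of counting leaves the adaptive test with vastly more possible forms than the nine of the multistage design, which is the point of the comparison.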