A team of fourth-grade teachers are preparing lessons for the upcoming week and are using an assessment of grade-level mathematics readiness to select appropriate supplemental mathematics materials for groups of students. The teachers have in front of them item-level responses for each student, which they need to aggregate into a total test score from which they can draw inferences. This process of summing the items answered correctly to form a total test score (number or per cent correct) is at the very foundation of classical test theory. Within this framework, the total test score is referred to as an observed score. It shows the level of performance that we see, but not necessarily the performance of which the test taker is ultimately capable. Classical test theory helps test developers and test users, like the team of fourth-grade teachers described here, to understand the discrepancy between the observed score and the measure of a person’s true capability, referred to here as a true score. The theory also provides ways of evaluating assessments so that test developers and users alike can rely on these tools to inform practice.
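
The aggregation step the teachers face can be sketched in a few lines of Python. This is a minimal illustration only: the student labels, the five-item test and the 0/1 coding of responses are hypothetical and do not come from any real assessment.

```python
# Hypothetical item-level responses: 1 = answered correctly, 0 = answered incorrectly.
responses = {
    "student_a": [1, 0, 1, 1, 0],
    "student_b": [1, 1, 1, 1, 1],
    "student_c": [0, 0, 1, 0, 0],
}


def observed_scores(item_responses):
    """Return (number correct, per cent correct) for each student."""
    scores = {}
    for student, items in item_responses.items():
        number_correct = sum(items)  # total test score as number correct
        percent_correct = 100.0 * number_correct / len(items)
        scores[student] = (number_correct, percent_correct)
    return scores


scores = observed_scores(responses)
for student, (number_correct, percent_correct) in scores.items():
    print(student, number_correct, percent_correct)
```

Each resulting pair is an observed score in the sense used here: a summary of what was seen on this occasion, not a direct reading of the student's true capability.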

Classical test theory (CTT), also referred to as classical true score theory, is, like other test theories, a ‘symbolic representation of factors influencing observed test scores’ (Allen and Yen 1979, 56). This simple model is governed by a set of assumptions, and the conclusions that follow from them, which describe how errors of measurement affect observed scores on measurement instruments. Its central assumption is that the observed score (X) is the sum of a true score (T) and random error (E), that is, X = T + E (ibid., 57; Crocker and Algina 1986, 107; Hambleton and Jones 1993, 40).
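
The additive relationship between observed score, true score and random error can be made concrete with a small simulation. The sketch below is illustrative rather than definitive: the true score of 30, the error standard deviation of 2, the normally distributed error with mean zero and the idea of many independent administrations to the same person are all assumptions chosen for the example.

```python
import random


def simulate_observed_scores(true_score, error_sd, n_administrations, seed=0):
    """Simulate X = T + E for one test taker over repeated administrations.

    The error E is assumed (for illustration) to be normally distributed
    with mean zero and standard deviation error_sd.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        true_score + rng.gauss(0.0, error_sd)  # X = T + E
        for _ in range(n_administrations)
    ]


observed = simulate_observed_scores(
    true_score=30.0, error_sd=2.0, n_administrations=10_000
)
mean_observed = sum(observed) / len(observed)
print(round(mean_observed, 2))
```

Any single observed score may sit above or below the true score, but under these assumptions the average over many administrations settles close to T, which is why the error is described as random rather than systematic.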

Test development and evaluation are often based on the standard procedures of CTT (Allen and Yen 1979, 56). Psychometricians and test developers commonly utilise CTT to compute the reliability of test scores, evaluate the validity of test scores, perform item analysis, estimate variance components to evaluate sources of error, and equate test scores for various purposes. These uses are discussed in greater detail in the following sections. This entry begins with a brief history of classical test theory, followed by an explanation of the classical true score model. Reliability and the concept of item analysis are then examined, followed by a brief acknowledgement of the limitations of CTT and a short overview of its counterpart, item response theory.