ABSTRACT

Educational tests yield scores, and a score needs to have a meaning. If a student takes a test and receives a result of “ten,” then a natural question arises: what does the “ten” represent? There are two classic answers to this question: norm referencing and criterion referencing. The earliest work on normative testing dates from the late nineteenth century, when Galton built his Anthropometric Laboratory and addressed the question of why people need to be measured. For example, he wrote about human eyesight: “Measurement would give an indication of the eyesight becoming less good, long before the child would find it out for himself, or before its impairment could attract the observation of others” (Galton, 1890: 2). Such a stance is normative because it views eyesight in relative terms: “less good.” To return to the opening question: the norm referenced meaning of a “ten” on a language test would be the rank of the test taker in a group of peers, the norm group. How many scored above ten? How many scored below? And, more crucially, what is the normative score value for some particular decision, such as admission to or exit from a particular language program?
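
To make the normative reading concrete, here is a minimal Python sketch that answers those questions for a score of “ten.” The norm group data and the helper name norm_referenced_report are invented for illustration; the percentile-rank convention shown is one common choice among several.

```python
# Illustrative only: a toy norm group of peer scores (invented data).
norm_group = [4, 6, 7, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15, 16]

def norm_referenced_report(score, group):
    """Describe a score relative to its norm group: how many peers
    scored above it, how many below, and its percentile rank."""
    above = sum(1 for s in group if s > score)
    below = sum(1 for s in group if s < score)
    ties = sum(1 for s in group if s == score)
    # One common percentile-rank convention: scores below the mark,
    # plus half of the ties, as a share of the whole group.
    percentile = 100 * (below + 0.5 * ties) / len(group)
    return above, below, percentile

above, below, pct = norm_referenced_report(10, norm_group)
print(f"{above} peers scored above ten, {below} below; "
      f"percentile rank of a ten is about {pct:.0f}")
```

A program could then anchor a normative decision rule, such as admitting only test takers above a chosen percentile, to a distribution of this kind.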

In a classic paper, Glaser (1963) coined the notion of criterion referencing, which nowadays is typically defined in contrast to norm referenced testing. A problem arises when a test is given after an instructional sequence, such as a module or unit in a semester-long academic class. In such a case, the score range may be very narrow. That is exactly what a teacher wants in an achievement test: very little score spread, because it shows that the teaching was successful. The teacher’s chief concern is the skills and abilities that the students can demonstrate: the criteria against which they will be measured. If the entire group of students can display mastery of the entire suite of criteria for a particular teaching unit or module, then the variation of scores could even be zero: there may have been ten separate skills in the teaching module, and if all of the students performed all ten skills flawlessly, then all of them scored “ten.” In contrast to norm referencing, there would be little reason to rank and compare the students: if they all achieve, they all get full value. This distinction led to a body of criterion referenced statistics. If a score of “ten” on a language test is deemed to be a passing mark on a criterion referenced test, the presumption is that a “ten” indicates mastery of a sufficient number of skills in a very well-defined domain of language ability. Unlike normative item statistics (J.D. Brown, this volume), criterion referenced statistics look at the dependability of judging mastery and view that dependability in much the same way that norm referenced statistics view reliability (Fulcher and Davidson, 2007: Chapters A4 and C4).
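
The “dependability of judging mastery” can also be made concrete. The Python sketch below, with invented scores, an invented cut score, and two invented parallel forms, computes two standard classification-consistency indices used in criterion referenced work, the raw agreement index and its chance-corrected counterpart, Cohen’s kappa; these are common examples rather than necessarily the statistics this chapter goes on to discuss.

```python
# Illustrative only: invented scores for the same ten students on two
# parallel forms of a ten-skill test, with mastery set at 8 of 10.
form_a = [10, 9, 8, 10, 7, 6, 9, 10, 5, 8]
form_b = [10, 8, 9, 10, 8, 5, 9, 9, 6, 7]
CUT = 8  # mastery cut score

def mastery(scores, cut=CUT):
    """Classify each score as master (True) or nonmaster (False)."""
    return [s >= cut for s in scores]

def agreement(a, b):
    """Proportion of students classified the same way
    (master/nonmaster) on both forms: the agreement index p0."""
    same = sum(1 for x, y in zip(mastery(a), mastery(b)) if x == y)
    return same / len(a)

def kappa(a, b):
    """Cohen's kappa: agreement corrected for the agreement expected
    by chance, given each form's overall mastery rate."""
    ma, mb = mastery(a), mastery(b)
    p0 = agreement(a, b)
    pa, pb = sum(ma) / len(ma), sum(mb) / len(mb)
    pc = pa * pb + (1 - pa) * (1 - pb)  # chance agreement
    return (p0 - pc) / (1 - pc)

print(f"agreement p0 = {agreement(form_a, form_b):.2f}, "
      f"kappa = {kappa(form_a, form_b):.2f}")
```

The point of such indices is the one made above: what matters is not how finely the test spreads students out, but how consistently it sorts them into master and nonmaster on the defined criteria.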

Modern test score reports obtain their richness, in part, from their bank of detailed test specifications (“test specs”), and if the test still serves a normative function (and many do), then it is no longer accurate to say that detailed test specs are the sole province of criterion referencing. They have become part of all good testing practice. Detailed test specs can come in many formats and designs. All formats share two common elements: sample test tasks and “guiding language” about how to produce such samples (Fulcher and Davidson, 2007: Chapter A4). The following is an example of a detailed test spec.