ABSTRACT

When two runners race against one another, no fancy measurement schemes are required to determine who wins; just a careful look at the finish line. Similarly, if everyone takes the same test, it is not surprising that whoever gets the most items correct is considered to have shown the greatest mastery of the material. But suppose the two runners ran on different days, or on different courses, or both. Then we would need to somehow compare them at a distance. An accurate timing mechanism suffices if they have run on different days, but what do we do if the courses were different?

Two approaches are possible. The more common one is to establish standards for all race courses: a standard length, a standard flatness, standardized conditions for wind, temperature, and so on. We would then be better placed to say that, as nearly as possible, both runners competed under identical circumstances. A second approach, if such control were not possible, would be to correct for the differences statistically. Sometimes this statistical correction is done formally, as when comparing performances in metric distances to those in English measures (“A 10-second time in 100 meters is equivalent to a 9.1 for 100 yards”); sometimes it is done on the basis of expert judgment (“Joe Louis would have beaten Muhammad Ali, if they were both at their peaks”). Sometimes this is stretched further than some might consider prudent (“Mike Powell’s 29-foot broad jump was a greater athletic feat than Mark McGwire’s 70 home runs”).

The term Standardized Test explicitly tells which adjustment strategy is used for most large-scale paper-and-pencil tests. Although examinees may take different forms of the test, containing different items, and may take them at different times and in different parts of the country, strenuous attempts are made to make both the tests and the testing situations identical in all aspects that might have an impact on test performance.
Each form of the test is a sample of items from a specified item pool, and any differences in the overall difficulty of the total test are corrected for statistically (see Chapter 6, Scaling and Equating, for more on this). This is roughly analogous to the corrections that are made when comparing times in the Boston Marathon to those in the New York Marathon; though they are exactly the same length, the latter tends to be a minute or two faster. Traditional equating methods (see Chapter 6) deal very well with such variants.
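
As a rough illustration of what such a statistical correction can look like, the minimal sketch below applies linear equating, one traditional method: a score on one form is mapped onto the scale of another by matching the means and standard deviations of the two score distributions. The function name `linear_equate` and the score lists are hypothetical and chosen only for illustration; the methods treated in Chapter 6 may differ in detail.

```python
from statistics import mean, pstdev


def linear_equate(score, form_x_scores, form_y_scores):
    """Map a Form X score onto the Form Y scale by matching means and SDs."""
    mu_x, mu_y = mean(form_x_scores), mean(form_y_scores)
    sd_x, sd_y = pstdev(form_x_scores), pstdev(form_y_scores)
    # A score z standard deviations above the Form X mean is placed
    # z standard deviations above the Form Y mean.
    return mu_y + (sd_y / sd_x) * (score - mu_x)


# Hypothetical raw scores from comparable groups of examinees.
# Form X is slightly harder here: every Form Y score is 3 points higher.
form_x = [52, 58, 61, 65, 70, 74]
form_y = [55, 61, 64, 68, 73, 77]

equated = linear_equate(65, form_x, form_y)
print(round(equated, 2))
```

In this made-up case the two forms differ only by a constant shift in difficulty, so a 65 on Form X is treated as equivalent to roughly a 68 on Form Y, much as a marathon time on a slower course would be adjusted by a minute or two.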