ABSTRACT

This chapter addresses fairness; the example of the test of writing proficiency described at the beginning of the chapter sets the stage for this specific topic. The chapter organizes the description of relevant methods according to the three types of item scores introduced in that example: automated scores of multiple-choice (MC) items, human-rater scores of constructed-response (CR) items, and automated scores of CR items. Many differential item functioning (DIF) methodologies first stratify examinees according to an observed measure of proficiency and then directly evaluate between-subgroup differences in item performance within each proficiency stratum. Human-rater scores are based on judgment guided by a set of established scoring rules. When the scoring rules are relatively complex, as in scoring an essay or a complex performance task, the human judgment underlying the scoring process may introduce inconsistency into rater scores.
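As a concrete illustration of the stratify-then-compare approach mentioned above, the sketch below computes a Mantel-Haenszel common odds ratio for a single dichotomously scored item. This is one common way to summarize between-subgroup differences within proficiency strata; the stratification by total score, the five-stratum split, the function names, and the synthetic data are all assumptions for illustration, not details taken from the chapter.

```python
# Minimal sketch of a stratified DIF check for one dichotomously scored item,
# in the Mantel-Haenszel spirit: examinees are grouped into strata by an
# observed proficiency measure (here, total test score), and reference/focal
# differences in item performance are summarized within each stratum.
# All data and names below are synthetic and illustrative.

from collections import defaultdict

def mantel_haenszel_odds_ratio(records, n_strata=5, max_score=20):
    """records: iterable of (total_score, group, item_correct), with
    group in {"reference", "focal"} and item_correct in {0, 1}."""
    # Assign each examinee to a proficiency stratum based on total score.
    tables = defaultdict(lambda: [[0, 0], [0, 0]])  # stratum -> 2x2 counts
    for total, group, correct in records:
        stratum = min(int(total / (max_score / n_strata)), n_strata - 1)
        row = 0 if group == "reference" else 1
        col = 0 if correct else 1
        tables[stratum][row][col] += 1

    # Mantel-Haenszel common odds ratio: sum(A*D/N) / sum(B*C/N) over strata,
    # where each stratum's 2x2 table is [[A, B], [C, D]]
    # (rows: reference/focal; columns: correct/incorrect).
    num = den = 0.0
    for (a, b), (c, d) in tables.values():
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den if den > 0 else float("nan")


if __name__ == "__main__":
    import random
    random.seed(0)
    # Synthetic examinees: the item is slightly harder for the focal group
    # at the same total score, which is the pattern DIF methods look for.
    data = []
    for _ in range(2000):
        group = random.choice(["reference", "focal"])
        total = random.randint(0, 20)
        p_correct = min(0.95, 0.2 + 0.035 * total - (0.10 if group == "focal" else 0.0))
        data.append((total, group, int(random.random() < p_correct)))
    print("MH common odds ratio:", round(mantel_haenszel_odds_ratio(data), 2))
```

A common odds ratio near 1 indicates comparable item performance for reference and focal examinees of similar proficiency; values well above or below 1 flag the item for closer review.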