ABSTRACT

Human raters are used to assess speaking and writing performance in many contexts, and understanding rater behavior is important for achieving accurate measurement in language testing and SLA research. Tools to reduce the differences that raters inevitably introduce include well-designed rating schemes; appropriate rater selection, training, and standardization; ongoing monitoring and feedback; routine double marking; and statistical moderation that accounts for patterns of leniency/severity and checks intra-rater consistency. This chapter explores issues in rater behavior and argues that SLA and language testing researchers must attend to the training and monitoring of raters because otherwise unreliable assessment practices may invalidate test results and research findings.