ABSTRACT

Ratings of rich response formats have been part of the assessment landscape for as long as there have been assessments. Short-answer and multiple-choice question formats largely eliminate extraneous variability in scoring, but in many areas of student and teacher assessment, rater bias, rater variability, and other factors affect assessment scores. Unusual rating behavior (DeCarlo, 2008; Patz et al., 2002; Wolfe and McVay, 2002), factors in raters' backgrounds (Winke et al., 2011), the circumstances of rating (Mariano and Junker, 2007), and their effects on procedures for producing and reporting assessment scores (e.g., Yen et al., 2005) continue to be of central interest. In the case of student work, the relative merits of automated machine-scoring algorithms versus human scoring remain an active question (Attali and Burstein, 2005; CTB/McGraw-Hill, 2012; Steedle and Elliot, 2012; J. Wang and Brown, 2012). In the case of teacher assessment, human rating continues to be the only way to assess certain aspects of professional practice (e.g., Casabianca and Junker, 2013; Casabianca et al., 2013; Farrokhi et al., 2011; Junker et al., 2006).