ABSTRACT

The scoring of essays by multiple raters is an obvious area of application for hierarchical models. We can include effects for the writers, the raters, and the characteristics of the essay that are rated. Bayesian methodology is especially useful because it allows one to include previous knowledge about the parameters. As explained in Johnson and Albert (1999), the situation that arises when multiple raters grade an essay is like that of a person who has more than one watch: if the watches don't show the same time, that person can't be sure what time it is. Similarly, the essay raters may not agree on the quality of an essay; each rater may have a different opinion of the quality and relative importance of certain characteristics of any given essay. Some raters are more stringent than others, and some may have less well-defined standards. To determine the overall quality of the essay, one may want to pool the ratings in some way; Bayesian methods make this process easy. In our analysis of a dataset that includes multiple ratings of essays by multiple raters, we examine the differences between the raters and between the categories in which the ratings are assessed. In the end, we are most interested in the differences in the precision of the raters (as measured by their variances) and in the relationships between the ratings.
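
For concreteness, the sketch below shows the kind of hierarchical rater model the abstract describes, written in PyMC. Everything here is an illustrative assumption rather than the paper's actual specification: the simulated data, the normal likelihood, the priors, and the variable names are all placeholders. The key structural points it illustrates are that each essay has a latent quality, each rater has a stringency (bias) and a rater-specific variance, and the posterior on essay quality pools the ratings automatically.

```python
import numpy as np
import pymc as pm

# Illustrative simulated data (not the paper's dataset): 20 essays,
# each scored by 5 raters on a single holistic category.
rng = np.random.default_rng(0)
n_essays, n_raters = 20, 5
essay_idx = np.repeat(np.arange(n_essays), n_raters)
rater_idx = np.tile(np.arange(n_raters), n_essays)
true_quality = rng.normal(0.0, 1.0, n_essays)
true_bias = rng.normal(0.0, 0.5, n_raters)       # rater stringency
true_sigma = rng.uniform(0.3, 1.0, n_raters)     # rater imprecision
ratings = rng.normal(true_quality[essay_idx] + true_bias[rater_idx],
                     true_sigma[rater_idx])

with pm.Model() as rater_model:
    # Latent quality of each essay (the writer effect).
    quality = pm.Normal("quality", mu=0.0, sigma=1.0, shape=n_essays)
    # Rater-specific stringency; in a real analysis one might constrain
    # these to sum to zero for identifiability.
    bias = pm.Normal("bias", mu=0.0, sigma=1.0, shape=n_raters)
    # Rater-specific noise scale: small sigma means a precise rater.
    sigma = pm.HalfNormal("sigma", sigma=1.0, shape=n_raters)
    # Observed rating = essay quality + rater bias + rater-specific noise.
    pm.Normal("rating",
              mu=quality[essay_idx] + bias[rater_idx],
              sigma=sigma[rater_idx],
              observed=ratings)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# The posterior on sigma summarizes each rater's (im)precision, the
# quantity the abstract identifies as being of primary interest.
print(idata.posterior["sigma"].mean(dim=("chain", "draw")).values)
```

A Gaussian likelihood is the simplest choice for a sketch like this; actual essay ratings are typically ordinal, and an ordinal (e.g., cumulative-logit) likelihood of the kind developed in Johnson and Albert (1999) would be the more faithful model.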