ABSTRACT

Moderation to improve marker consistency (inter-scorer reliability) is widely practised in education when two or more markers appraise complex student responses to assessment tasks. Early studies into the judgements made by different markers, when acting largely autonomously, showed that the different sets of scores were typically poorly correlated or characterised by substantially different means and variances (Starch and Elliott 1912; Hartog and Rhodes 1935). Subsequent research has produced essentially similar results, although a number of techniques are available that do lead to improved consistency, among them following a common set of guidelines closely. Moderation is intended to ensure that the mark a particular student is awarded is independent of which marker does the marking. The English verb ‘to moderate’ dates from about 1400, and originally meant to regulate or abate excessiveness, to smooth out extremes. Simply averaging the scores from two or three markers literally does that, but without delving below the marks to find the reasons for the differences. Other approaches take a different tack. Linn (1993) reviewed a number of moderation models, one of which involved different assessors reaching consensus on how marks should be awarded. Consensus moderation (Linn’s term was ‘social moderation’) provides the starting point for the analysis in this article.