ABSTRACT

In this chapter we discuss best practices for quality control of automated scoring (AS) systems in large-scale assessment contexts, oriented around four key guiding principles for oversight of an AS modeling pipeline: (1) that it produces valid scores soundly grounded in the target construct, (2) that defensible procedures and standards are employed in its execution, (3) that it produces accurate scores that reliably measure mastery or ability, and (4) that the processes and methods used to produce those scores are interpretable by key stakeholders. We highlight the interdependencies between the technical systems used to compute AS scores and the human and organizational systems that provide supporting context in the overall ecology of the AS modeling pipeline, illustrating the positive and negative impacts these systems can have on one another. We review specific quality-control practices and statistical measures that can provide greater insight into the efficacy of the AS pipeline and instill confidence in the quality of scores produced by both human and machine raters. We discuss key aspects of feature engineering, calibration pool scoring, model evaluation analyses, and documentation of design decisions.