ABSTRACT

Automated scoring refers to the use of statistical and computational linguistic methods to assign scores or labels to unconstrained, open-ended test items. Most current engines use feature-based approaches, in which experts create features and one or more statistical models predict scores from those features. Deep learning engines instead learn features alongside the predictive model using very large, multilayered neural networks, often with millions of parameters. The success of deep learning engines rests on their ability to consider word use in context and on their reliance on models pretrained on large corpora and fine-tuned to a scoring task. When properly trained, deep learning engines demonstrate accuracy improvements over feature-based models. Regardless of approach, the core methods remain the same, including the flow of responses through preprocessing, feature extraction, and score prediction phases. Both types of engines require appropriately sampled data, high-quality hand-scoring, and held-out validation samples to build models. Performance aside, deep learning engines face four psychometric challenges. First, methods for explaining how an engine arrives at a score are not yet well understood. Second, the impact of using pre-built models on engine score quality has not been examined in depth. Third, calibration methods tend to be empirically rather than theoretically driven, and they require extensive compute power and time. Finally, it is unclear how robust models will be during live scoring, because the models are so new. These challenges offer fertile areas for research and require study by psychometricians, data scientists, computer scientists, and computational linguists.
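
The shared pipeline named in the abstract (preprocessing, feature extraction, score prediction) can be sketched minimally as follows. This is an illustrative toy, not the paper's engine: the function names, the surface features (response length, type-token ratio), and the rule-based predictor standing in for a trained statistical model are all hypothetical.

```python
import re

def preprocess(response: str) -> list[str]:
    """Preprocessing phase: lowercase, strip punctuation, tokenize."""
    return re.findall(r"[a-z']+", response.lower())

def extract_features(tokens: list[str]) -> dict[str, float]:
    """Feature extraction phase: expert-designed surface features
    (hypothetical examples: length and lexical diversity)."""
    n = len(tokens)
    return {
        "length": float(n),
        "type_token_ratio": len(set(tokens)) / n if n else 0.0,
    }

def predict_score(features: dict[str, float]) -> int:
    """Score prediction phase: a toy rule-based stand-in for a trained
    model, mapping longer, more varied responses to a 0-2 scale."""
    if features["length"] >= 20 and features["type_token_ratio"] > 0.5:
        return 2
    if features["length"] >= 5:
        return 1
    return 0

def score(response: str) -> int:
    """Run a raw response through all three phases."""
    return predict_score(extract_features(preprocess(response)))
```

In a feature-based engine the middle step is hand-engineered as above; in a deep learning engine the feature extractor and predictor are learned jointly by a pretrained, fine-tuned network, but the response still flows through the same three phases.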