ABSTRACT

In educational assessment, there is an obvious role for the short answer item as a complement not only to the longer essay but also to the traditional selected response item. It is now well established that automated scoring can contribute to the reliability, feasibility, and cost-effectiveness of large-scale essay grading. Can automated scoring do the same for the short answer? In our view, the answer is a qualified “Yes.” The purpose of the present chapter is to introduce the technology, explain some of the uses to which it can be put, and indicate why our generally positive conclusion must still be qualified. We will refer primarily to Educational Testing Service’s (ETS’s) c-rater engine, because that is what we know best. The main points of this chapter are:

Short answer items are not the same as essays. Typical essay tasks emphasize grammar and mechanics, while typical short answer tasks emphasize content. In addition, there is much more variation among short answer rubrics than there is among essay rubrics.

Along with the substantive differences between essays and short answers come corresponding differences in the technological approaches that are needed. One reason for these differences is the simple fact that short answers are short, and therefore usually contain less exploitable information than longer responses do.

Much of the work of an automated scoring engine for essays can be done at the levels of spelling, grammar, and vocabulary, whereas an engine for short answers must address meaning as a primary concern. From the perspective of computational linguistics, an essay scoring engine is primarily, but not exclusively, an application of computational syntax and stylistics, while a short answer scoring engine is primarily an application of computational semantics. The former fields have the more mature technology.

The automated components cannot work alone; they depend on prompt-specific analysis and knowledge engineering that draw on human expertise. When this preparatory work is done well, it is possible to get extremely good results, but it is not currently feasible to build a generic engine capable of scoring unseen items for which the work has not been done.

From an assessment perspective, it is desirable that items be as rich as possible, giving test-takers the maximum opportunity to show their knowledge and skills, but from the perspective of automated scoring it is desirable that the open-endedness of the items be restricted. Successful designs will strike a balance between these desiderata. One way to do this is to ensure that both the needs of assessment and the needs of automated scoring are represented in the design process. This suggests that an effective strategy for achieving high quality in the automated scoring of short answer items will be to foster long-term collaboration between content specialists and automated scoring experts.