ABSTRACT

While the scalability and efficiency promised by automated scoring have led to its widespread application in language assessment, our field still lacks an established approach to measuring how fit for purpose these systems are; in other words, how valid are they? More daunting still for the educational practitioner are the complexity and unexplainable nature of elements of “black box” scoring systems, a situation exacerbated by the reluctance of some EdTech vendors to share details of proprietary technology. In this chapter, we focus on the British Council’s partnership with a technology vendor to leverage the Model Card approach proposed by Mitchell et al. (2019) in the validation of a machine-scored placement test. The process entailed integrating the general principles of AI model development identified by Mitchell and colleagues into our existing validation process. In addition to classical analyses estimating the accuracy of the machine scoring model against human ratings, Many-Facet Rasch Measurement analyses were used to explore the data for bias. Critical learnings from the project included the importance of good communication between the EdTech and language testing teams and of well-structured, well-organised data.