ABSTRACT

A major application of natural language processing (NLP) in education is automated or artificial intelligence (AI) scoring, which entails the use of algorithms, including those from NLP, to score written or other constructed responses to test questions. It has a long history and is currently used by many test companies for both low- and high-stakes testing applications. The Standards for Educational and Psychological Testing emphasizes the importance of ensuring that test scores are fair, which requires that test scores support their intended inferences and uses for all test-takers. In applications of educational measurement, fairness is often operationalized as the absence of measurement bias. Measurement bias occurs when test-takers of equal ability or proficiency from different identifiable groups, such as racial, ethnic, or native-language groups, have different score distributions on an item or test. In this chapter, we review the criteria for fairness and connect the definitions used in the AI and educational measurement literatures. We consider methods for testing the fairness of automated scores under the assumption that human ratings are free from bias. We demonstrate that a commonly used test for fairness in automated scores, the standardized difference between the group means of the human rating and the automated score, can fail to detect some forms of bias and can flag bias where none exists under other definitions. We develop new tests for fairness that align with specific definitions of bias and account for the errors in the human ratings, and we propose possible remedies for unfair scores. We apply our methods to scoring student responses to three open-ended reading comprehension problems.
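For reference, the standardized-difference criterion mentioned above is commonly computed, for a given group g, roughly as follows (the notation and threshold here sketch common practice and are not necessarily the chapter's own formulation):

\[
\mathrm{SMD}_g \;=\; \frac{\bar{S}_g - \bar{H}_g}{\mathrm{SD}(H)},
\]

where \(\bar{S}_g\) and \(\bar{H}_g\) are the group means of the automated score and the human rating, and \(\mathrm{SD}(H)\) is a standard deviation of the human ratings (often pooled over all test-takers). Automated scores are typically flagged for a group when \(|\mathrm{SMD}_g|\) exceeds a conventional cutoff.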