ABSTRACT

Student responses to conceptual physics questions were analyzed with latent semantic analysis (LSA) using different text corpora. Expert evaluations of the student answers were correlated with LSA measures of the similarity between each student response and an ideal answer. We compared the adequacy of several text corpora for this LSA performance evaluation, including corpora that contained written incorrect reasoning and tangentially relevant historical information. The results revealed no benefit to meticulously eliminating the incorrect or irrelevant information that normally accompanies a textbook. Results are also reported on the impact of corpus size and of adding information that is not topic relevant.
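The similarity metric described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the toy corpus, the choice of two latent dimensions, and the raw count weighting are all hypothetical stand-ins; LSA systems typically use a large textbook-derived corpus and log-entropy weighting.

```python
import numpy as np

# Toy corpus standing in for textbook passages; in the study the corpus
# would be drawn from physics textbook text.
corpus = [
    "force equals mass times acceleration",
    "acceleration is the rate of change of velocity",
    "the net force on the book is zero",
    "velocity is the rate of change of position",
]

ideal_answer = "the net force equals mass times acceleration"
student_response = "force is mass times acceleration"

# Build a term-document count matrix over the corpus vocabulary.
vocab = sorted({w for doc in corpus for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}

def bow(text):
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1
    return v

X = np.column_stack([bow(d) for d in corpus])  # terms x documents

# LSA: truncated SVD of the term-document matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # number of latent dimensions (hypothetical choice)
Uk, sk = U[:, :k], s[:k]

def project(text):
    # Fold a new text into the k-dimensional latent space:
    # q_k = q^T U_k Sigma_k^{-1}.
    return (bow(text) @ Uk) / sk

def cosine(a, b):
    # Cosine similarity; small epsilon guards against zero vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# The LSA similarity score that would be correlated with expert ratings.
similarity = cosine(project(student_response), project(ideal_answer))
print(round(similarity, 3))
```

Scores like `similarity` would be computed for each student response and then correlated with the expert evaluations; varying `corpus` (adding incorrect or tangential passages, changing its size) is what the abstract's comparisons vary.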