ABSTRACT

Latent semantic analysis (LSA) is a “bag of words” technique. A training corpus is represented as a word-by-document matrix. The weight entered into each cell is based on the number of times a word appears in a document, but no attempt is made to capture the order in which the words appear within that document. Furthermore, when the meaning representation of a new document is constructed, the vectors representing each unique word in the document are simply added, so word order is again ignored. This insensitivity to word order has been raised as an important limitation of LSA.
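
The order insensitivity described above can be illustrated with a minimal sketch. It makes several simplifying assumptions: the toy corpus is invented, cell weights are raw counts with no further weighting, and the dimensionality-reduction step that LSA normally applies to the matrix (singular value decomposition) is omitted, so each word vector is just a row of the count matrix. The function names are hypothetical.

```python
from collections import Counter

def build_word_by_document_matrix(corpus):
    """Return (vocabulary, matrix), where matrix[i][j] counts word i in document j."""
    tokenized = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    counts = [Counter(doc) for doc in tokenized]
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix

def compose_document_vector(text, vocabulary, matrix):
    """Add the row vectors of each unique known word in the text."""
    index = {word: i for i, word in enumerate(vocabulary)}
    n_docs = len(matrix[0]) if matrix else 0
    total = [0] * n_docs
    for word in set(text.lower().split()):  # unique words only; order is discarded
        if word in index:
            row = matrix[index[word]]
            total = [t + r for t, r in zip(total, row)]
    return total

# Toy training corpus, invented for illustration.
corpus = [
    "the dog chased the cat",
    "the man walked the dog",
    "a dog bites a man",
]
vocabulary, matrix = build_word_by_document_matrix(corpus)

# Two new documents with the same words in a different order.
v1 = compose_document_vector("dog bites man", vocabulary, matrix)
v2 = compose_document_vector("man bites dog", vocabulary, matrix)
print(v1 == v2)  # prints True: both documents receive the same vector
```

Because vector addition is commutative, any reordering of the same words yields the same representation, whatever weighting or reduction is applied to the individual word vectors beforehand.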