There are two main ways to represent text in a computer for the purposes of analysis: either documents are reduced to a “bag-of-words” representation, in which word ordering is discarded, or sentences are processed to determine their grammatical structure. Any text corpus requires preprocessing to remove errors and unwanted characters and words. The specifics of the preprocessing are highly corpus dependent and depend on the desired inference. A vector space representation of a corpus of documents assigns a vector of numbers to each document. In the standard bag-of-words model, the vector is a collection of weights, one per word in the lexicon, where each weight is some measure of the “importance” of the word in the document, or of the “information” the word carries relative to the corpus. Latent semantic indexing is a way of reducing the dimensionality of the term-document matrix by embedding it in a lower-dimensional space.
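The pipeline described above can be sketched in a few lines of code. This is a minimal illustration, assuming a toy corpus, whitespace tokenization, and tf-idf as the “importance” weighting; the variable names and the choice of two latent dimensions are illustrative, not prescriptive.

```python
import numpy as np

# Hypothetical toy corpus; real preprocessing (lowercasing, stripping
# punctuation, removing stopwords) is corpus dependent.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Build the lexicon: one weight slot per word in the vocabulary.
docs = [doc.split() for doc in corpus]
lexicon = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(lexicon)}

# Term-document matrix with tf-idf weights: raw term frequency scaled
# by inverse document frequency, one standard notion of "importance".
n_docs = len(docs)
tdm = np.zeros((len(lexicon), n_docs))
for j, doc in enumerate(docs):
    for w in doc:
        tdm[index[w], j] += 1.0
df = np.count_nonzero(tdm, axis=1)      # document frequency of each term
idf = np.log(n_docs / df)
tfidf = tdm * idf[:, None]

# Latent semantic indexing: a truncated SVD embeds the term-document
# matrix in a lower (here k = 2) dimensional space.
k = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
doc_embeddings = (np.diag(s[:k]) @ Vt[:k]).T   # one k-vector per document

print(doc_embeddings.shape)   # three documents, two latent dimensions
```

Each row of `doc_embeddings` is a dense low-dimensional document vector; document similarity can then be measured by, for example, cosine similarity between rows rather than between sparse tf-idf columns.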