ABSTRACT

In many cases, text is the earliest type of data that the people are exposed to. Increases in computational power, the development of new methods, and the enormous availability of text, mean that there has been a great deal of interest in using text as data. Text modeling is an exciting area of research. But, and this is true more generally, the cleaning and preparation aspect is often at least as difficult as the modeling. Stop words are words such as “the”, “and”, and “a”. For instance, if the text corpus was particularly messy or the existence of particular words was informative. A tuple is an ordered list of elements. In the context of text it is a series of words. Stemming and lemmatizing words is another common approach for reducing the dimensionality of a text dataset. Duplication is a major concern with text datasets because of their size.