ABSTRACT

A key challenge of working with text data is that it is, by nature, unstructured, coming in a wide and growing variety of formats, lengths, languages, and styles. For analysis, text data must ultimately be converted into numbers that are amenable to mathematical characterization. This chapter provides an introduction to some of the simplest specialized text analysis methods and their implementation in R. Among the tools it presents is the class of specialized packages such as tm and quanteda, both of which are used in the examples. Normalizing, or preprocessing, the text to eliminate irrelevant differences is a critically important task that is unique to text data analysis, so it is discussed briefly and illustrated in the example presented in Sec. An increasingly popular way to represent frequent terms in text data is the wordcloud, which displays words in different sizes depending on how frequently they occur in a document.