ABSTRACT

Text is data too. It may come from books archived by Project Gutenberg, web pages downloaded from the Internet, writings scraped from social media, articles in Word .docx format, information loaded into comma-separated values (CSV) files, and many other sources. In the first "Quick Start" exercise, the "quanteda" package is used to create keyword-in-context (KWIC) indexing of the Declaration of Independence and the U.S. Constitution. A second "Quick Start" exercise uses the "quanteda", "tidytext", and "ggplot2" packages to create word frequency tables and histograms comparing word usage by various U.S. presidents in their inaugural addresses. After this, web scraping using the "htm2txt" and "rvest" packages is explained. Social media scraping is then illustrated, focusing on the Trump Administration and its use of the term "fake news" with regard to the New York Times. Parts of this chapter explain core text analysis concepts: placing text data in the format needed by various text analysis packages; tokenization of text into words, sentences, or other units; character encoding; and cleaning of text prior to analysis. Specific forms of text analysis set forth in worked examples include multigroup word frequency comparisons, word clouds and comparison clouds, and word maps and word correlations. Sentiment analysis and topic modeling are also detailed with examples, among other topics.