ABSTRACT

Much data analysis focuses on numerical data, but there is a whole field of research that focuses on textual data. Fields such as natural language processing and computational linguistics work directly with text documents to extract meaning algorithmically. Text mining is often performed not on a single text document but on a collection of many text documents, called a corpus. The chapter explains how text can be ingested, how corpora can be created, and how regular expressions can be used to automate searches that would otherwise be excruciatingly labor-intensive. An important technique in text mining is the calculation of term frequency-inverse document frequency (tf-idf) weights, which are commonly stored in a document-term matrix. Twitter keeps track of which hashtags or phrases are popular in real time; these are known as trending topics. Trending topics are available for many major cities and might be used to study how certain populations respond to news or world events.