ABSTRACT

New technologies are producing data at an incredible rate, sometimes faster than it can be stored and analyzed. By some accounts, Google plans to have scanned every book in existence by the end of this decade. The first sequencing of the approximately three billion base pairs of the human genome, published in 2000, took 10 years; it can now be done in less than a week. Social scientists now have at their disposal a steady stream of data from online social networks. Analyzing data of this magnitude, most of which is stored as text, requires computational tools. In this chapter, we will discuss the fundamental algorithmic techniques for developing these tools. We will look at how text is represented in a computer and how to read text from both files and the web, and we will develop algorithms to process and analyze textual data.