Creating the data | 5 | Corpus collection | Matteo Di Cristofaro

ABSTRACT

This chapter focuses on what should be considered when working with language data from the web, and presents different options for collecting such data through computational techniques. It details the theoretical and technical aspects of data collection: it begins with general introductions to practices and techniques, describes their underlying implications, including potential legal issues related to data scraping, and then presents a number of tools for data collection. These tools are divided into two major categories: general-purpose scrapers, which are usually applied to web pages that are not part of social media platforms, and platform-specific scrapers, which are built to collect data from one or more social media platforms. Each tool is described and exemplified through practical commands and scripts. Readers are then introduced to how the contents of the collected data can be processed and subsequently used to build an annotated corpus.