ABSTRACT

Much social interaction and interpersonal communication now takes place on the internet. This produces large amounts of textual data as well as the behavioral traces people leave when browsing the web. Social media has gained considerably in importance in recent years and is a rich, indispensable data source for social research. Moreover, much data is available only online. This raises two questions: how to identify appropriate methods for collecting such data, and how to process it to gain the desired insights. In answering these questions, the chapter focuses on the digital collection of textual data. Web scraping, also known as web harvesting, is the extraction of information from web pages, normally the text published on them. Although such extraction can be done manually by copying and pasting, automatic extraction by parsing techniques and software bots (such as the crawlers used by search engines) is increasingly the norm. The chapter first outlines current techniques and then sets up a web page that is used to explain how to extract the information published on it. This fictitious example page covers a use case of particular interest to the social sciences: the analysis of textual data published on blogs about scientific topics. The website of the European Commission, for instance, maintains such a blog on artificial intelligence, which serves as a reference here.