ABSTRACT

This chapter covers a variety of approaches for gathering data. It begins with the use of Application Programming Interfaces (APIs) and semi-structured data, such as JSON and XML. The chapter then turns to web scraping, which one may want to use when data are available on a website. Finally, it considers gathering data from PDFs, which enables the construction of interesting datasets, especially from material contained in government reports and old books. The advantage of using an API is that the data provider usually specifies both the data they are willing to provide and the terms under which they will provide it. Web scraping instead takes advantage of the underlying structure of a webpage. PDF files were developed in the 1990s by the technology company Adobe, and it is often possible to copy and paste data directly from a PDF. When that is not possible, Optical Character Recognition (OCR), which has been used to parse images of characters since the 1950s, initially through manual approaches, can be used to extract the text.
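To illustrate the first approach, a minimal sketch of requesting semi-structured JSON data from an API might look like the following. The endpoint URL and query parameters are hypothetical placeholders, and the `requests` package is assumed to be installed.

```python
# A minimal sketch of gathering JSON data from an API.
# The URL and parameters are hypothetical placeholders.
import requests

response = requests.get(
    "https://api.example.org/v1/records",  # hypothetical endpoint
    params={"year": 2020, "format": "json"},
    timeout=30,
)
response.raise_for_status()

records = response.json()  # parse the JSON body into Python objects
for record in records:
    print(record)
```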
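Web scraping, by contrast, works from the structure of a page's HTML. A sketch of that idea, assuming the `requests` and `beautifulsoup4` packages are installed and that the data of interest sit in an HTML table at an illustrative URL, might be:

```python
# A sketch of web scraping: use the page's HTML structure to pull out values.
# The URL, and the assumption that the data sit in a table, are illustrative.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.example.org/population-table", timeout=30)
page.raise_for_status()

soup = BeautifulSoup(page.text, "html.parser")
for row in soup.find_all("tr"):  # each table row becomes one record
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if cells:
        print(cells)
```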
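For PDFs that contain a text layer, the text can often be extracted programmatically rather than copied and pasted by hand. A sketch using the `pypdf` package (assumed installed, with a placeholder filename) is below; a scanned PDF with no text layer would instead require OCR.

```python
# A sketch of pulling text out of a PDF that has a text layer.
# "report.pdf" is a placeholder filename; scanned images would need OCR instead.
from pypdf import PdfReader

reader = PdfReader("report.pdf")
for page in reader.pages:
    print(page.extract_text())
```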