ABSTRACT

Information on the Internet grows exponentially every day. If the problem of political science in the 20th century was a lack of data with which to test hypotheses, the 21st century presents the opposite challenge: information is abundant and within reach, but one has to know how to collect and analyze it. One of the most widely used techniques for extracting information from websites is web scraping. Web scraping is becoming increasingly popular in data analysis because of its versatility in dealing with different websites. Before putting web scraping into practice, however, one should understand what it is and how the robots.txt file, present on most websites, works. This file implements the so-called Robots Exclusion Standard, a series of instructions directed at programs that seek to index the content of web pages (for example, the Google bot that "saves" new pages as they are created on the Internet).
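
To illustrate how the Robots Exclusion Standard is consulted in practice, the sketch below uses Python's standard-library `urllib.robotparser` to parse a hypothetical robots.txt file (the rules and the `MyScraper` user-agent name are illustrative assumptions, not taken from any real site) and check whether a given URL may be fetched:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only:
# it disallows the /private/ directory for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved scraper checks can_fetch() before requesting a page.
print(rp.can_fetch("MyScraper", "https://example.com/private/data.html"))  # False
print(rp.can_fetch("MyScraper", "https://example.com/public/page.html"))   # True
```

In a real scraper one would instead point the parser at the live file with `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()`; the check itself is the same.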