ABSTRACT

In this chapter, the authors continue to develop data wrangling skills. In particular, they discuss tidy data, common file formats, and techniques for scraping and cleaning data, especially dates. Using the Gapminder data as an example, they argue that there are substantive reasons to prefer the long (or tall), narrow version of a data table. This long, narrow format is called tidy data, and tidy data exists in systematically defined data tables. The wrangling itself is accomplished with data verbs, each of which takes a tidy data table and transforms it into another tidy data table in a different form. Conforming to the rules for tidy data simplifies summarizing and analyzing data. The wrangling process described in the chapter also maintains the provenance of the data and allows analyses to be updated with new data without starting the data wrangling from scratch.