Managing Dirty Data | 7 | Data Science for Water Utilities

ABSTRACT

Data is almost always produced to manage operations and is rarely collected with future analysis in mind. The available data is therefore seldom in an ideal format and needs to be converted before it is suitable for analysis. We need, at minimum, a clear data structure, the correct variable types, and readable variable names. Preparing data for analysis, sometimes called data munging or wrangling, is an essential part of the data science workflow. This chapter introduces techniques for cleaning data with R and the Tidyverse in reproducible code, and works through a case study on customer perceptions of tap water. The learning objectives for this session are:

Use the dplyr package to transform data.

Apply the principles of tidy data.

Develop a script to automate data cleaning.
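
As a rough illustration of these three objectives, the sketch below cleans a small, invented survey extract. The data, the column names, and the `clean_survey()` helper are assumptions made up for this example and are not the chapter's case-study data or code. It shows one plausible way to give variables readable names, reshape the data so each row is one observation, and convert text scores to numbers inside a reusable function.

```r
library(dplyr)
library(tidyr)

# Hypothetical raw export (invented for illustration): cryptic column names,
# scores stored as text, and one column per survey question.
raw <- tibble(
  `Cust ID`  = c("001", "002", "003"),
  `Q1 Taste` = c("4", "5", "n/a"),
  `Q2 Odour` = c("3", "n/a", "2")
)

# Hypothetical helper: wrapping the steps in a function makes the cleaning
# repeatable whenever a new extract arrives.
clean_survey <- function(data) {
  data %>%
    # Readable variable names
    rename(customer_id = `Cust ID`, taste = `Q1 Taste`, odour = `Q2 Odour`) %>%
    # Tidy structure: one row per customer and survey item
    pivot_longer(c(taste, odour), names_to = "item", values_to = "score") %>%
    # Correct variable type: treat "n/a" as missing, then convert to integer
    mutate(score = as.integer(na_if(score, "n/a"))) %>%
    # Drop responses without a score
    filter(!is.na(score))
}

survey <- clean_survey(raw)
print(survey)
```

Keeping every step in a script rather than editing the source file by hand means the raw data stays untouched and the same cleaning can be rerun and reviewed later.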