ABSTRACT

Data wrangling often runs throughout a data-intensive research project. At a general level, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data. Because oftentimes no one person holds all of the knowledge necessary for conducting an analysis, bringing together, or joining, data from different sources is a critical component of most data-intensive research projects. Effective and efficient data wrangling is often programmed or scripted, meaning that the steps involved in importing, cleaning, and merging data are written out as machine-readable commands that can be revisited, repurposed, and debugged over time. Merging datasets requires common identifiers, often referred to as key variables, across the datasets being merged. The growing use of predictive modeling and early warning systems regularly requires researchers to merge datasets from multiple systems.
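
The role of key variables in merging can be sketched with a minimal pandas example; the dataset names, columns, and values below are hypothetical and purely illustrative, not drawn from the study itself:

```python
import pandas as pd

# Two hypothetical datasets from different systems, linked by a shared
# key variable ("student_id"). Values are illustrative only.
enrollment = pd.DataFrame({
    "student_id": [101, 102, 103],
    "grade_level": [9, 10, 11],
})
attendance = pd.DataFrame({
    "student_id": [101, 103, 104],
    "days_absent": [2, 5, 1],
})

# A left merge keeps every enrollment record and attaches attendance
# data where the key variable matches; unmatched records (here,
# student 102) receive missing values that must be handled downstream.
merged = enrollment.merge(attendance, on="student_id", how="left")
print(merged)
```

Because the merge is scripted rather than done by hand, the same steps can be rerun, audited, and debugged as the underlying data change.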