ABSTRACT

One of the most common data wrangling challenges is extracting numeric data contained in character strings and converting it into the numeric representation required to make plots, compute summaries, or fit models in R. Many of the string processing challenges a data scientist faces are unique and often unexpected. In general, string processing tasks can be divided into detecting, locating, extracting, or replacing patterns in strings. Base R includes functions for these tasks, but they do not follow a unifying convention, which makes them somewhat hard to memorize and use. Most of the examples come from the second case study, which deals with heights self-reported by students, and most of the chapter is dedicated to learning regular expressions (regex) and functions in the stringr package. A regular expression is a way to describe specific patterns of characters in text. Character classes define a set of characters that can be matched. Groups are a powerful aspect of regex that permit the extraction of values.
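
As a rough illustration of the four task types, the sketch below applies stringr functions to a small vector of made-up height entries. The vector `heights`, the variable names, and the specific pattern are assumptions chosen for illustration and are not taken from the case study data. The pattern uses a character class (`[4-7]`) for feet and two groups to capture feet and inches.

```r
library(stringr)

# Hypothetical self-reported heights (illustrative values, not the case study data)
heights <- c("5'4\"", "6'1\"", "5 feet 10 inches", "165")

# Replacing: rewrite the spelled-out units as the feet'inches" notation
cleaned <- str_replace(heights, "\\s*feet\\s*", "'")
cleaned <- str_replace(cleaned, "\\s*inches$", "\"")

# Character class [4-7] matches one digit for feet; \\d{1,2} matches inches.
# The parentheses define two groups: feet and inches.
pattern <- "^([4-7])'(\\d{1,2})\"$"

# Detecting: which entries follow the pattern?
matches_pattern <- str_detect(cleaned, pattern)
# TRUE TRUE TRUE FALSE

# Locating: position of the feet symbol in each entry (NA when absent)
feet_symbol_pos <- str_locate(cleaned, "'")

# Extracting: str_match returns the full match plus one column per group
parts <- str_match(cleaned, pattern)
feet <- as.numeric(parts[, 2])
inches <- as.numeric(parts[, 3])

# Convert to a numeric height in inches; entries that do not match stay NA
total_inches <- feet * 12 + inches
# 64 73 70 NA
```

The last entry, "165", does not match the pattern and is left as NA rather than guessed at; deciding how to treat such entries (for example, as centimeters) is a separate judgment call.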