ABSTRACT

The airline on-time data holds many possible stories: for example, we could look at how many flights were cancelled by date, airline, airport, or some combination of these. However a user wants to group the data, dplyr’s group_by() function makes this type of analysis simple and elegant. This chapter covers calculating statistics by group with dplyr’s group_by() and summarize() functions, using a lookup table, understanding missing values, and graphing counts in a data frame. Counting cancelled flights by date, or by date and airport, requires two steps: grouping the data by those categories and then performing the analysis on each group; dplyr makes both steps easy. A useful feature of summarize() is that it can create more than one column at a time, and the data in the first summary column is immediately available for additional calculations within the same call.
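As a minimal sketch of the group-then-summarize pattern described above, the following R snippet counts cancelled flights by date and airport. The toy data frame and its column names (date, airport, cancelled) are assumptions made for illustration and are not the chapter's actual dataset.

```r
library(dplyr)

# Hypothetical toy data: one row per flight, cancelled coded as 1 or 0.
flights <- data.frame(
  date      = as.Date(c("2015-01-01", "2015-01-01", "2015-01-01",
                        "2015-01-02", "2015-01-02")),
  airport   = c("ORD", "ORD", "JFK", "ORD", "JFK"),
  cancelled = c(1, 0, 1, 0, 0)
)

cancel_summary <- flights %>%
  # Step 1: group the data by the categories of interest.
  group_by(date, airport) %>%
  # Step 2: compute statistics for each group.
  summarize(
    n_flights     = n(),             # number of flights in the group
    n_cancelled   = sum(cancelled),  # number of cancelled flights
    # Columns created earlier in the same call can be reused right away.
    pct_cancelled = 100 * n_cancelled / n_flights,
    .groups = "drop"                 # return an ungrouped data frame
  )

print(cancel_summary)
```

Grouping by date alone would only require changing the group_by() call to group_by(date); the summarize() step stays the same.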