ABSTRACT

Anyone who has worked with data knows that data is sometimes not just messy but in a format that is downright analysis-hostile. This chapter takes election results in a less-than-ideal format and turns them into ‘tidy’ format for easier analysis. It covers converting a PDF to Excel, reshaping the data into analysis-friendly tidy format, finding the “top 2” results within each group, and adding rankings from low to high or high to low. An R package on CRAN, pdftables, can extract tables from PDFs. However, the user has to sign up for an API key at pdftables.com and, after converting 50 PDF pages for free, must pay for the service. Several packages make it easy to reshape a data frame from “wide” (with important information embedded in column names) to “long”. The gathered (“melted”) data frame is easier to group, filter, and rank.
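
As a rough illustration of the reshaping, “top 2”, and ranking steps summarized above, the following minimal sketch uses the tidyr and dplyr packages. The data frame, its column names, and the vote counts are hypothetical and are not taken from the chapter's election data; tidyr and dplyr are only one of several possible package choices.

```r
library(dplyr)
library(tidyr)

# Hypothetical "wide" data: one row per precinct, one column per candidate.
wide <- data.frame(
  precinct = c("P1", "P2"),
  Adams = c(120, 80),
  Baker = c(95, 110),
  Clark = c(40, 150)
)

# Wide to long: candidate names move from the column headers into a column.
long <- pivot_longer(
  wide,
  cols = -precinct,
  names_to = "candidate",
  values_to = "votes"
)

# Rank within each precinct, both high to low and low to high.
ranked <- long %>%
  group_by(precinct) %>%
  mutate(
    rank_high_to_low = min_rank(desc(votes)),
    rank_low_to_high = min_rank(votes)
  ) %>%
  ungroup()

# Keep the two highest vote-getters in each precinct.
top2 <- ranked %>%
  filter(rank_high_to_low <= 2) %>%
  arrange(precinct, rank_high_to_low)

print(top2)
```

Because min_rank() gives tied values the same rank, filtering on the rank can return more than two rows per group when votes are tied; whether that is desirable depends on the analysis.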