ABSTRACT

Statistics and data science begin with data collection. A primary focus of study design is how best to collect data that provides evidence for answering the questions of interest. In some contexts data collection can be designed so that the investigator controls how people are assigned to the groups being compared; in other contexts it is neither feasible nor ethical to assign people to groups. In the latter case the only choice is to use data in which people ended up in groups by some other process, so the probability of belonging to a group is unknown. This chapter gives a very brief overview of using R to read, manipulate, summarize, visualize, and generate data. Data collected as part of a study will almost always be stored in computer files. Raw data from a study is usually not in a format that is ready to be analysed, so it must be manipulated, or wrangled, before analysis.