ABSTRACT

The book equips students with the end-to-end skills needed to do data science. That means gathering, cleaning, preparing, and sharing data, then using statistical models to analyse data, writing about the results of those models, drawing conclusions from them, and finally, using the cloud to put a model into production, all done in a reproducible way.

Many books teach data science, but most assume that you already have the data. This book fills that gap by detailing how to gather datasets, and how to clean and prepare them, before analysing them. Many books also teach statistical modelling, but few teach how to communicate the results of the models and how those models help us learn about the world. Very few data science textbooks cover ethics, and most of those that do confine it to a token chapter. Finally, reproducibility is not often emphasised in data science books.

This book is built around a straightforward workflow conducted in an ethical and reproducible way: gather data, prepare data, analyse data, and communicate the findings. It achieves these goals by working through extensive case studies on gathering and preparing data, and by integrating ethics throughout. It is specifically designed to teach how to write about data and models, so writing is covered explicitly. Finally, GitHub and the open-source statistical language R are used throughout the book.

Key Features:

  • Extensive code examples.
  • Ethics integrated throughout.
  • Reproducibility integrated throughout.
  • Focus on data gathering, messy data, and cleaning data.
  • Extensive formative assessment throughout.

Part I|82 pages

Foundations

Chapter 1|14 pages

Telling stories with data

Chapter 2|32 pages

Drinking from a fire hose

Chapter 3|34 pages

Reproducible workflows

Part II|74 pages

Communication

Chapter 4|24 pages

Writing research

Chapter 5|48 pages

Static communication

Part III|112 pages

Acquisition

Chapter 6|36 pages

Farm data

Chapter 7|42 pages

Gather data

Chapter 8|32 pages

Hunt data

Part IV|76 pages

Preparation

Chapter 9|52 pages

Clean and prepare

Chapter 10|22 pages

Store and share

Part V|112 pages

Modeling

Chapter 11|38 pages

Exploratory data analysis

Chapter 12|32 pages

Linear models

Chapter 13|40 pages

Generalized linear models

Part VI|94 pages

Applications

Chapter 14|44 pages

Causality from observational data

Chapter 15|22 pages

Multilevel regression with post-stratification

Chapter 16|20 pages

Text as data

Chapter 17|6 pages

Concluding remarks