ABSTRACT

Data Science: A First Introduction focuses on using the R programming language in Jupyter notebooks to perform data manipulation and cleaning, create effective visualizations, and extract insights from data using classification, regression, clustering, and inference.

The text emphasizes workflows that are clear, reproducible, and shareable, and covers the basics of version control. All source code is available online and demonstrates good reproducible project workflows.

Based on educational research and active learning principles, the book uses a modern approach to R and includes accompanying autograded Jupyter worksheets for interactive, self-directed learning. The book will leave readers well-prepared for data science projects.

The book is designed for learners from all disciplines with minimal prior knowledge of mathematics and programming. The authors have honed the material through years of experience teaching thousands of undergraduates in the University of British Columbia’s DSCI100: Introduction to Data Science course.

Chapter 1: R and the Tidyverse (26 pages)

Chapter 2: Reading in data locally and from the web (40 pages)

Chapter 3: Cleaning and wrangling data (54 pages)

Chapter 4: Effective data visualization (50 pages)

Chapter 5: Classification I: training & predicting (34 pages)

Chapter 6: Classification II: evaluation & tuning (38 pages)

Chapter 7: Regression I: K-nearest neighbors (24 pages)

Chapter 8: Regression II: linear regression (22 pages)

Chapter 9: Clustering (26 pages)

Chapter 10: Statistical inference (32 pages)

Chapter 11: Combining code and text with Jupyter (16 pages)

Chapter 12: Collaboration with version control (40 pages)

Chapter 13: Setting up your computer (8 pages)