ABSTRACT

Introduction to Data Science: Data Analysis and Prediction Algorithms with R introduces concepts and skills that can help you tackle real-world data analysis challenges. It covers concepts from probability, statistical inference, linear regression, and machine learning. It also helps you develop skills such as R programming, data wrangling, data visualization, predictive algorithm building, file organization with the UNIX/Linux shell, version control with Git and GitHub, and reproducible document preparation.

This book is a textbook for a first course in data science. No previous knowledge of R is necessary, although some experience with programming may be helpful. The book is divided into six parts: R, data visualization, statistics with R, data wrangling, machine learning, and productivity tools. Each part has several chapters, and each chapter is meant to be presented as one lecture.

The author uses motivating case studies that realistically mimic a data scientist’s experience. He starts by asking specific questions and answers them through data analysis, so concepts are learned as a means of answering the questions. The case studies include US murder rates by state, self-reported student heights, trends in world health and economics, the impact of vaccines on infectious disease rates, the financial crisis of 2007-2008, election forecasting, building a baseball team, image processing of hand-written digits, and movie recommendation systems.

The statistical concepts used to answer the case study questions are only briefly introduced, so complementing the book with a probability and statistics textbook is highly recommended for an in-depth understanding of these concepts. If you read and understand the chapters and complete the exercises, you will be prepared to learn the more advanced concepts and skills needed to become an expert.

A complete solutions manual is available to registered instructors who require the text for a course.

TABLE OF CONTENTS

chapter 1|10 pages

Getting started with R and RStudio

part I|2 pages

R

chapter 2|32 pages

R basics

chapter 3|8 pages

Programming basics

chapter 4|22 pages

The tidyverse

chapter 5|10 pages

Importing data

part II|2 pages

Data Visualization

chapter 6|4 pages

Introduction to data visualization

chapter 7|18 pages

ggplot2

chapter 8|32 pages

Visualizing data distributions

chapter 9|30 pages

Data visualization in practice

chapter 10|34 pages

Data visualization principles

chapter 11|8 pages

Robust summaries

part III|2 pages

Statistics with R

chapter 12|2 pages

Introduction to statistics with R

chapter 13|24 pages

Probability

chapter 14|20 pages

Random variables

chapter 15|26 pages

Statistical inference

chapter 16|34 pages

Statistical models

chapter 17|14 pages

Regression

chapter 18|38 pages

Linear models

chapter 19|12 pages

Association is not causation

part IV|1 page

Data Wrangling

chapter 20|2 pages

Introduction to data wrangling

chapter 21|8 pages

Reshaping data

chapter 22|10 pages

Joining tables

chapter 23|8 pages

Web scraping

chapter 24|34 pages

String processing

chapter 25|6 pages

Parsing dates and times

chapter 26|14 pages

Text mining

part V|2 pages

Machine Learning

chapter 27|22 pages

Introduction to machine learning

chapter 28|14 pages

Smoothing

chapter 29|16 pages

Cross validation

chapter 30|6 pages

The caret package

chapter 31|44 pages

Examples of algorithms

chapter 32|8 pages

Machine learning in practice

chapter 33|58 pages

Large datasets

chapter 34|6 pages

Clustering

part VI|2 pages

Productivity Tools

chapter 35|2 pages

Introduction to productivity tools

chapter 36|18 pages

Organizing with Unix

chapter 37|16 pages

Git and GitHub