ABSTRACT

This chapter introduces regression through simple linear regression. Three examples illustrate fitting a line to data graphically using ggplot and finding the slope and intercept of the line of best fit using lm. The equations for the slope and intercept of the line that minimizes the sum of squared errors are derived using geometry and calculus.
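For reference, the least squares estimates the chapter derives have the standard closed forms, where $\bar{x}$ and $\bar{y}$ denote the sample means:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

The following is a minimal sketch of the fitting workflow in R; the built-in cars data set stands in for the chapter's examples, which the abstract does not name.

```r
library(ggplot2)

# Fit the least squares line with lm; cars is a stand-in data set
mod <- lm(dist ~ speed, data = cars)
coef(mod)  # intercept and slope of the line of best fit

# Plot the data and overlay the fitted line with ggplot
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```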

A crucial part of this chapter is the examination of the residuals. We provide real data sets that illustrate the main ways in which residuals can be problematic: lack of a linear relationship between the response and predictor, heteroscedasticity, non-normality, outliers, and lack of independence. After learning how to check whether the data satisfy the assumptions of the model, readers see how to perform inference. Special attention is given to testing whether the slope is zero. We then present confidence intervals for the slope and intercept, as well as confidence and prediction intervals for the response. All of the inference topics are applied to real data sets, and we use simulations to investigate what happens when the data come from a process that does not meet the assumptions of regression.
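A hedged sketch of these inference steps, continuing with the cars stand-in model from above (the predictor value and simulation settings are illustrative):

```r
mod <- lm(dist ~ speed, data = cars)

summary(mod)  # includes the t-test of H0: slope = 0
confint(mod)  # confidence intervals for the intercept and slope

# Confidence and prediction intervals for the response at speed = 15
new_data <- data.frame(speed = 15)
predict(mod, new_data, interval = "confidence")
predict(mod, new_data, interval = "prediction")

# Residual check: plot residuals against fitted values to look for
# nonlinearity and heteroscedasticity
plot(mod$fitted.values, mod$residuals)

# Simulation of a process that violates the equal-variance assumption:
# the error standard deviation grows with x
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)
summary(lm(y ~ x))
```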

We conclude the chapter with sections on leave-one-out cross-validation and the bias-variance tradeoff, which also serve as an introduction to predictive modeling. Consequences of overfitting are discussed in the context of the penguins data set. A vignette introduces simple logistic regression with an example.
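A minimal sketch of leave-one-out cross-validation and a simple logistic regression fit, assuming the penguins data set comes from the palmerpenguins package and using illustrative variable choices:

```r
library(palmerpenguins)

# Leave-one-out cross-validation for a simple linear model:
# refit with each observation held out and record its squared error
df <- na.omit(penguins[, c("flipper_length_mm", "body_mass_g")])
n <- nrow(df)
errs <- numeric(n)
for (i in seq_len(n)) {
  fit <- lm(body_mass_g ~ flipper_length_mm, data = df[-i, ])
  pred <- predict(fit, newdata = df[i, ])
  errs[i] <- (df$body_mass_g[i] - pred)^2
}
mean(errs)  # LOOCV estimate of mean squared prediction error

# Simple logistic regression: glm with a binomial family
df2 <- na.omit(penguins[, c("sex", "body_mass_g")])
glm(sex ~ body_mass_g, family = binomial, data = df2)
```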