ABSTRACT

The linear model is a class of statistical models in which the function f(x) is linear in its parameters. This most frequently used class of models includes simple and multiple regression models (for continuous predictors) and analysis of variance models (ANOVA, for categorical predictors). As a transition, let us first revisit the t-test for comparing two populations. In Chapter 4, one example compared water column TP concentrations from impacted sites with those from background sites; sample means from the two populations were calculated and compared to examine the effect of human impact on water column TP concentrations in the Everglades. When performing the two-sample t-test, we first subset the data, keeping only 1994 data from sites “E4” and “F4” for the impacted sites and excluding sites “E5” and “F5” from the reference sites:

#### R code ####

subI <- (TP.impacted$SITE=="E4" | TP.impacted$SITE=="F4") &
        TP.impacted$Year==1994
subR <- TP.reference$Year==1994 & TP.reference$SITE!="E5" &
        TP.reference$SITE!="F5"
x <- log(TP.impacted$RESULT[subI])
y <- log(TP.reference$RESULT[subR])
t.test(x, y, var.equal=T)

#### R output ####

        Two Sample t-test

data:  x and y
t = 5.4022, df = 49, p-value = 1.922e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.52947 1.15672
sample estimates:
mean of x mean of y
   2.8909    2.0478
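The t statistic reported above is the ratio of the difference in sample means to its standard error, which is based on the pooled standard deviation. A minimal sketch of this arithmetic, using simulated stand-in vectors (imp, ref) rather than the Everglades data:

```r
## Reproducing the two-sample t statistic by hand; `imp` and `ref`
## are simulated stand-ins for the impacted and reference log TP values.
set.seed(42)
imp <- rnorm(25, mean = 2.9, sd = 0.55)
ref <- rnorm(26, mean = 2.0, sd = 0.55)
n1 <- length(imp); n2 <- length(ref)
## pooled SD: sample SD of the residuals about the two group means
s.pooled <- sqrt((sum((imp - mean(imp))^2) + sum((ref - mean(ref))^2)) /
                 (n1 + n2 - 2))
## standard error of the difference in sample means
se.diff <- s.pooled * sqrt(1/n1 + 1/n2)
t.by.hand <- (mean(imp) - mean(ref)) / se.diff
## matches the statistic reported by t.test()
all.equal(t.by.hand, unname(t.test(imp, ref, var.equal = TRUE)$statistic))
```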

The t-test process includes the estimation of a sample mean for each population (2.89 and 2.05) and the t statistic (5.4), which is the ratio of the difference in sample means to its standard error. The standard error of the sample mean difference is based on the pooled standard deviation, calculated as the sample standard deviation of the residuals, the differences between the observed log TP concentrations and their respective sample means. The same t-test can be conducted in R using a very different syntax, the formula interface:

#### R code ####

two.sample <- data.frame(TP = c(TP.impacted$RESULT[subI],
                                TP.reference$RESULT[subR]),
                         x = c(rep("I", sum(subI)),
                               rep("R", sum(subR))))
t.test(log(TP) ~ x, data=two.sample, var.equal=T)
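The formula call and the two-vector call produce identical results. As a quick check using R's built-in sleep data (a stand-in here, not the Everglades data):

```r
## t.test() with a formula splits the response by the grouping factor;
## the result matches calling t.test() on the two vectors directly.
a <- t.test(extra ~ group, data = sleep, var.equal = TRUE)
b <- t.test(sleep$extra[sleep$group == 1],
            sleep$extra[sleep$group == 2], var.equal = TRUE)
all.equal(a$statistic, b$statistic)
```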

The formula y ∼ x is the R convention for specifying a “model.” The tilde (∼) separates the response variable from the predictor variable(s). When used in the two-sample t-test, the function t.test knows that the predictor separates the data into two groups. The use of a formula also suggests that the estimated means can be interpreted as predictions of a sort; for example, we can predict a future sample from the reference sites using the estimated mean. A two-sample t-test problem can therefore be seen as a statistical modeling problem. For each observation, we introduce a predictor x indicating whether the sample was taken from the reference sites (x = 0) or the impacted sites (x = 1). The data set now has two variables: the response variable y (the log TP concentrations) and the predictor x. (Later we will discuss the advantages of treating a categorical variable with two levels as a numeric variable taking values 0 and 1.) The t-test problem is equivalent to a modeling problem with the model form

y = β0 + β1x + ε

The model coefficient β0 is the log TP mean of the reference sites, that is, y = β0 + β1 × 0 + ε when x = 0. Assuming ε ∼ N(0, σ), the model with x = 0 (reference sites) is equivalent to y ∼ N(β0, σ), and with x = 1 (impacted sites) it is equivalent to y ∼ N(β0 + β1, σ). Therefore, β1 is the difference between the impacted site log TP mean and the reference site log TP mean. The t-test problem discussed earlier is now the hypothesis testing problem of whether β1 differs from 0. This model can be directly implemented in R using the function lm():

#### R code ####

two.sample <- data.frame(y = c(TP.impacted$RESULT[subI],
                               TP.reference$RESULT[subR]),
                         x = c(rep(1, sum(subI)),
                               rep(0, sum(subR))))
t2lm <- lm(log(y) ~ x, data=two.sample)
summary(t2lm)

#### R output ####

Call:
lm(formula = log(y) ~ x, data = two.sample)

Residuals:
   Min     1Q Median     3Q    Max
-0.694 -0.331 -0.102  0.151  1.763

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    2.048      0.100    20.4  < 2e-16
x              0.843      0.156     5.4  1.9e-06
---
Residual standard error: 0.549 on 49 degrees of freedom
Multiple R-Squared: 0.373, Adjusted R-squared: 0.36
F-statistic: 29.2 on 1 and 49 DF, p-value: 1.92e-06

The R output lists the estimated model coefficients, their standard errors, the t-values for testing whether the true value of each coefficient is 0, and the p-values of these tests. As expected, the estimated intercept (β̂0) is the sample mean of log TP for the reference sites, and the estimated slope (β̂1) is the difference between the two sample means. The linear model result is exactly the same as the t-test result because the underlying statistical assumptions of the two-sample t-test and of the linear model are exactly the same. The t-test is performed specifically to answer whether there is a difference between the two means, while the linear model aims to develop a general predictive model; the two-sample t-test problem is a special case. Representing the two populations using a predictor variable as we did here generalizes the t-test to a problem of comparing means for more than two populations. When the predictor is a continuous variable, we have the familiar linear regression problem. All these variations of linear models can be expressed in terms of a linear function of the predictor(s):

y = β0 + β1x1 + · · · + βpxp + ε

or in matrix notation:

y = Xβ + ε (5.1)

In R, the model is expressed in a formula: y ∼ x1 + · · · + xp.
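The same lm() machinery extends beyond two groups. A hypothetical sketch with simulated data for three groups (none of these values come from the examples in this chapter):

```r
## With a factor predictor, lm() estimates one mean per group,
## generalizing the two-sample comparison to three (or more) groups.
set.seed(1)
grp <- factor(rep(c("A", "B", "C"), each = 10))
y3 <- rnorm(30, mean = c(1, 2, 3)[as.numeric(grp)], sd = 0.5)
fit3 <- lm(y3 ~ grp)
coef(fit3)   ## intercept = mean of group A; grpB, grpC = differences
anova(fit3)  ## F test of the hypothesis of equal group means
```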

In a linear model, the model error term ε is assumed to be a random variable from a normal distribution with mean 0 and a constant standard deviation, ε ∼ N(0, σ). Residuals are assumed to be independent random samples from the same distribution. Statistical inference about the fitted linear model is largely based on these assumptions and the estimated model error standard deviation σ̂. In this chapter, analysis of variance (ANOVA) is first introduced as a special case of a linear model. Although ANOVA was developed for experimental data and causal inference, it is often used in observational studies as an exploratory data analysis tool. The Everglades example is not a data set typically analyzed using ANOVA; it serves here as a transition between hypothesis testing and statistical modeling. Model fitting and diagnostics are illustrated using the PCB in fish example and the Finnish lakes example.
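The diagnostics used in those examples rest on checking the residual assumptions above. A minimal sketch, using R's built-in cars data as a stand-in for a fitted model:

```r
## Checking the constant-variance and normality assumptions
## of the error term for a fitted linear model.
fit <- lm(dist ~ speed, data = cars)
## residuals vs fitted values: look for constant spread around zero
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
## normal Q-Q plot: points close to the line support normality
qqnorm(resid(fit)); qqline(resid(fit))
## the estimated model error standard deviation (sigma-hat)
summary(fit)$sigma
```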