Linear Regression Models | 9 | A Whistle-Stop Tour of Statistics

ABSTRACT

Regression: A frequently applied statistical technique that serves as a basis for studying and characterizing a system of interest, by formulating a reasonable mathematical model of the relationship between a response variable $y$ and a set of $p$ explanatory variables $x_1, x_2, \ldots, x_p$. The choice of an explicit form of the model may be based on previous knowledge of a system or on considerations such as ‘smoothness’ and continuity of $y$ as a function of the explanatory variables (sometimes called the independent variables, although they are rarely independent; explanatory variables is the preferred term).

Simple linear regression: A linear regression model with a single explanatory variable. The data consist of $n$ pairs of values $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$. The model for the observed values of the response variable is

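$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n$$

where $\beta_0$ and $\beta_1$ are, respectively, the intercept and slope parameters of the model and the $\varepsilon_i$ are error terms assumed to have a $N(0, \sigma^2)$ distribution. The parameters $\beta_0$ and $\beta_1$ are estimated from the sample observations by least squares, i.e., the minimization of $S = \sum_{i=1}^{n} \varepsilon_i^2$:

$$S = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

$$\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i), \qquad \frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i)$$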

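Setting $\partial S / \partial \beta_0 = 0$ and $\partial S / \partial \beta_1 = 0$ leads to the following estimators of the two model parameters:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$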
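As a minimal sketch of the closed-form least-squares estimators of the intercept and slope, the following Python computes both from paired data using only the standard library. The function name `fit_simple_linear_regression` and the data values are hypothetical, chosen for illustration only.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1): least-squares estimates of the intercept and slope."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-products of deviations from the means.
    sxy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    # Denominator: sum of squared deviations of x from its mean.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


# Made-up data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_simple_linear_regression(x, y)
print(b0, b1)
```

Because the illustrative points lie exactly on a straight line, the estimates recover that line's intercept and slope; with noisy data they would give the line minimizing the sum of squared residuals.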

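The variance $\sigma^2$ is estimated by

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}.$$

The estimated variance of the estimated slope parameter is

$$\mathrm{Var}(\hat{\beta}_1) = \frac{s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

The estimated variance of a predicted value $y_{\mathrm{pred}}$ at a given value of $x$, say, $x_0$, is

$$\mathrm{Var}(y_{\mathrm{pred}}) = s^2 \left[ \frac{1}{n} + 1 + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right].$$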
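The three variance estimates for simple linear regression can be illustrated numerically. The sketch below assumes the usual residual variance with divisor n − 2 and the prediction variance for a new observation at x0. The function name `simple_regression_variances` and the data are hypothetical.

```python
def simple_regression_variances(x, y, x0):
    """Return (s2, var_slope, var_pred) for a simple linear regression.

    s2        : residual variance estimate, RSS / (n - 2)
    var_slope : estimated variance of the slope estimate
    var_pred  : estimated variance of a predicted value at x0
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    # Least-squares fit, needed to form the residuals.
    b1 = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)
    var_slope = s2 / sxx
    var_pred = s2 * (1.0 / n + 1.0 + (x0 - x_bar) ** 2 / sxx)
    return s2, var_slope, var_pred


# Made-up data for illustration; x0 is set to the mean of x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 4.0, 8.0]
s2, var_slope, var_pred = simple_regression_variances(x, y, x0=1.5)
print(s2, var_slope, var_pred)
```

The prediction variance is smallest when x0 equals the mean of the observed x values and grows as x0 moves away from it, which is why predictions far outside the observed range are less precise.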

Multiple linear regression: A generalization to more than a single explanatory variable of the simple linear regression model. The multiple linear regression model is given by

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$$

where $y_i$ is the value of the response variable for the $i$th individual and $x_{i1}, x_{i2}, \ldots, x_{ip}$ represent this individual’s values on the $p$ explanatory variables, with $i = 1, 2, \ldots, n$. As usual, $n$ represents the sample size. The residual or error terms $\varepsilon_i$, $i = 1, \ldots, n$, are assumed to be independent random variables having a normal distribution with mean zero and constant variance $\sigma^2$. So the response variable $y$ also has a normal distribution, with expected value

$$E(y \mid x_1, x_2, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

and variance $\sigma^2$.
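The text above does not say how the multiple regression coefficients are estimated. As a hedged illustration, the sketch below assumes ordinary least squares, as in the simple case, and solves the resulting normal equations by Gauss-Jordan elimination using only the Python standard library. The function name `fit_multiple_linear_regression` and the data are hypothetical.

```python
def fit_multiple_linear_regression(X, y):
    """Return [b0, b1, ..., bp], the least-squares coefficient estimates.

    X is a list of n rows, each holding one individual's p explanatory values.
    Assumption: coefficients are found by solving the normal equations.
    """
    n = len(X)
    if n != len(y):
        raise ValueError("X and y must have the same number of rows")
    # Design matrix with a leading column of ones for the intercept.
    design = [[1.0] + [float(v) for v in row] for row in X]
    k = len(design[0])
    # Augmented matrix [Z'Z | Z'y] of the normal equations.
    aug = []
    for a in range(k):
        row = [sum(design[i][a] * design[i][b] for i in range(n)) for b in range(k)]
        row.append(sum(design[i][a] * y[i] for i in range(n)))
        aug.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[r][k] for r in range(k)]


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2 (no error term).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
coefficients = fit_multiple_linear_regression(X, y)
print(coefficients)
```

In practice a numerically more stable routine from an established library would usually be preferred; solving the normal equations directly is shown here only because it mirrors the derivation for the single-variable case.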