ABSTRACT

In Chapter 5, we discussed that a linear regression model can be expressed in terms of a probability distribution. In both linear and nonlinear regression problems, a normality (more specifically, conditional normality) assumption is made for the response variable. That is, a linear or nonlinear model can be expressed as:

y ∼ N(f(x, θ), σ²)    (8.1)

where f(x, θ) is the mean function. For linear regression models, f(x, θ) = Xβ is a linear function of the predictors. This probabilistic assumption allows us to use the least squares method to estimate the model coefficients θ. The least squares method is computationally simple and conceptually easy to understand. Consequently, when the response variable is known to follow a distribution other than the normal, we often consider transformations that make the residual distribution approximately normal. When the normality assumption of equation 8.1 holds, the likelihood of observing the data yᵢ and xᵢ (i = 1, · · · , n) is the product of the normal density evaluated at each observation:

L(y|β, σ²) = ∏ᵢ₌₁ⁿ (1/(√(2π)σ)) exp(−(yᵢ − xᵢᵀβ)²/(2σ²))
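As a concrete sketch (using hypothetical simulated data; none of these numbers come from the text), we can draw y from the model in equation 8.1 with f(x, θ) = Xβ, fit β by least squares, and check numerically that the least squares fit also maximizes the normal log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])   # design matrix with intercept
beta_true = np.array([1.0, 2.0])       # assumed true coefficients for the demo
sigma = 1.5

# Simulate y ~ N(X beta, sigma^2), i.e., equation 8.1 with f(x, theta) = X beta
y = rng.normal(loc=X @ beta_true, scale=sigma)

# Least squares estimate of beta
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

def log_lik(beta, sigma2):
    """Normal log-likelihood: the log of the product of densities above."""
    r = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - r @ r / (2 * sigma2)

s2_mle = np.mean((y - X @ beta_ls) ** 2)  # MLE of sigma^2 (divides by n)

# Perturbing the least squares estimate never increases the log-likelihood,
# illustrating that the MLE of beta coincides with the least squares estimate.
for d in rng.normal(scale=0.1, size=(5, 2)):
    assert log_lik(beta_ls + d, s2_mle) <= log_lik(beta_ls, s2_mle)
```

Because the log-likelihood depends on β only through the residual sum of squares, minimizing that sum (least squares) and maximizing the likelihood give the same estimate of β.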

which is a function of the unknown parameters β and σ². The maximum likelihood estimator of β is identical to the least squares estimator (the maximum likelihood estimator of σ² divides the residual sum of squares by n rather than by n − p). When the response variable follows a different distribution, the least squares method is no longer appropriate, and maximum likelihood estimation is used instead. The generalized linear model (GLM) is a class of models for response variables whose distributions belong to the exponential family of distributions. The exponential family includes many familiar distributions, among them the normal distribution. The probability density function of an exponential-family distribution can be expressed in a general form (equation 8.2).
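For a non-normal response, the same maximum likelihood idea applies. As a hedged illustration (simulated data, with iteratively reweighted least squares as one standard way to compute the MLE; neither the data nor the algorithm comes from the text), a Poisson response with a log link can be fit as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])      # design matrix with intercept
beta_true = np.array([0.5, 1.2])          # assumed coefficients for the demo
y = rng.poisson(np.exp(X @ beta_true))    # Poisson response, log link

# Iteratively reweighted least squares (IRLS) for the Poisson MLE
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)                 # current fitted means
    z = X @ beta + (y - mu) / mu          # working response
    W = mu                                # working weights (Poisson, log link)
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
```

Each IRLS step is itself a weighted least squares fit, which is one reason the least squares machinery of earlier chapters remains useful even when the response is not normal.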