ABSTRACT

The multiple regression model as described in the previous chapter assumes that the response variable, y, is continuous and, given the values of the explanatory variables, has a normal distribution with a mean that is a linear function of the explanatory variables and variance, σ2. But in many studies in medicine, the response variable is binary-for example, improved or not improved, diseased or not diseased, or even dead or alive. Many of the data sets considered in Chapter 4 were of this type. In this chapter, we examine a suitable technique, logistic regression, for exploring the effects of explanatory variables on a binary response variable. (Logistic regression can also be applied to categorical responses with more than two categories; see, for example, Hosmer and Lemeshow 2000.)

In any regression problem, the key quantity is the mean or expected value of the response variable, given the values of the explanatory variables. In linear regression, the expected value of a response variable, y, is modelled as a linear function of the explanatory variables x1, x2, … , xp, that is,

E (y|x1, ... , xp) = β0 + β1x1 ... + βp xp (9.1)

For a dichotomous response variable coded 0 and 1, this expected value is simply the probability, π, that the response variable takes the value one. This could be modelled directly as before, but there are two clear problems:

r The predicted value of the probability, π, must satisfy 0 ≤ π < 1, whereas a linear predictor can yield values from minus infinity to

r The observed values of y conditional on the values of the explanatory variables will not now follow a normal distribution with mean π, but rather a Bernoulli distribution, as we shall explain later.