ABSTRACT

Ideally, every epidemiological study would be designed and subsequently analyzed with attention given to a small set of risk factors and a further set of possible confounding or interacting variables, with the role of each variable identified a priori. In this case, it would make sense to build logistic regression models from the “bottom up,” beginning with the few exposures of interest and then examining the issues of confounding and interaction associated with extraneous variables, much as we have done in the two examples considered at length in Chapters 12 to 14. Unfortunately, in most cases it is difficult to select suspected exposures and elucidate their exact nature in the design phase of a study; thus, many candidate exposure variables (e.g., possible proxies for some underlying proposed risk factor such as social support or stress) are measured on sampled individuals. In these studies, we face the construction of a regression model with only limited prior knowledge to guide us. In this chapter, we consider three statistical issues that direct a sensible approach to regression model building: (1) choosing the scale of a selected explanatory variable, (2) general model-building strategies, and (3) methods to assess whether a “final” model fits the sample data adequately. All of these issues also arise in traditional regression analyses of continuous outcome variables, though the details differ here because the outcome, D, is binary. We only touch on these topics, which are discussed further in monographs on logistic regression analysis, particularly Hosmer and Lemeshow (2000).