ABSTRACT

Epidemiological studies for assessing risk factors often use logistic regression, log-linear models, or other generalized linear models. They involve many decisions, including the choice and coding of risk factors and control variables. It is common practice to select independent variables using a series of significance tests and to choose the way variables are coded subjectively. The overall properties of such a procedure are not well understood, and conditioning on a single model ignores model uncertainty, leading to underestimation of uncertainty about quantities of interest (QUOIs). We describe a Bayesian modeling strategy that formalizes the model selection process and propagates model uncertainty through to inference about QUOIs. Each possible combination of modeling decisions defines a different model, and the models are compared using Bayes factors. Inference about a QUOI is based on an average of its posterior distributions under the individual models, weighted by their posterior model probabilities; the models included in the average are selected by the Occam's Window algorithm. In an initial exploratory phase, the ACE (Alternating Conditional Expectations) algorithm is used to suggest ways to code the variables, but the final coding decisions are based on Bayes factors. The methods can be implemented using GLIB, an S function available free of charge from StatLib. For the special case of logistic regression, the additional S functions ACE.LOGIT and BIC.LOGIT are available. We apply our strategy to an epidemiological study of fat and alcohol consumption as risk factors for breast cancer. In our previously published analysis, the regression model chosen included not only fat and alcohol consumption but also an interaction term between these two variables. Here, however, the Bayes factors favor a simpler and more interpretable model that includes transformed variables but no interaction term.
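
As a rough illustration of the averaging step described above, the following Python sketch weights per-model estimates of a QUOI by approximate posterior model probabilities. It is not GLIB or BIC.LOGIT. It rests on several assumptions that are not taken from the text: equal prior model probabilities, the BIC approximation to the Bayes factor (with BIC defined so that smaller is better), a simplified Occam's Window that keeps only models within a factor of 20 of the best one, and hypothetical function names and input values.

```python
import math

# All names, the cutoff of 20, and the numbers below are illustrative assumptions.

def posterior_model_probs(bics):
    """Approximate posterior model probabilities from BIC values.

    Assumes equal prior model probabilities and the convention
    BIC = -2 * loglik + p * log(n), so that smaller BIC is better.
    """
    best = min(bics)
    # Subtract the best BIC before exponentiating for numerical stability.
    weights = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]

def occams_window(probs, cutoff=20.0):
    """Simplified Occam's Window: keep models whose posterior probability
    is within a factor `cutoff` of the best model, then renormalize.

    The full algorithm applies further rules; this applies only the ratio rule.
    """
    top = max(probs)
    kept = [i for i, p in enumerate(probs) if p * cutoff >= top]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def model_averaged_mean(estimates, weights):
    """Posterior mean of a QUOI averaged over the retained models."""
    return sum(w * estimates[i] for i, w in weights.items())

# Hypothetical inputs: three candidate models and their estimates of one QUOI.
bics = [100.0, 102.0, 110.0]
estimates = [0.5, 0.8, 2.0]

weights = occams_window(posterior_model_probs(bics))
print(weights)
print(model_averaged_mean(estimates, weights))
```

In this made-up example the third model is far less supported than the best one and falls outside the window, so the averaged estimate depends only on the first two models.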