ABSTRACT

Patterned missing covariate data is a challenging issue in environmental epidemiology. For example, particulate matter measures of air pollution are often collected only every third day or every sixth day, while morbidity and mortality outcomes are collected daily. In this setting, many desirable models cannot be directly fit. We investigate such a setting in so-called “distributed lag” models when the lagged predictor is collected on a cruder time scale than the response. In multi-site studies with complete predictor data at some sites, multilevel models can be used to inform imputation for the sites with missing data.

We focus on the implementation of such multilevel models, in terms of both model development and computational implementation of the sampler. Specifically, we parallelize single-chain runs of the sampler. This is of note, since the Markovian structure of Markov chain Monte Carlo (MCMC) samplers typically makes effective parallelization of single chains difficult. However, the conditional independence relationships of our developed model allow us to exploit parallel computing to run the chain. As a first attempt at using parallel MCMC for Bayesian imputation on such data, this chapter largely represents a proof of principle, though we demonstrate some promising potential for the methodology. Specifically, the methodology reduces run-time to a fraction of the nonparallelized run-time that is near one over the number of available nodes.

In addition, we describe a novel software implementation of parallelization that is uniquely suited to disk-based shared memory systems. We use a “blackboard” parallel computing scheme where shared network storage is used as a blackboard to tally currently completed and queued tasks. This strategy allows for easy addition and subtraction of compute nodes and control of load balancing. Moreover, it builds in automatic checkpointing.

Our investigation is motivated by multi-site time series studies of the short-term effects of air pollution on disease or death rates. A common measure of air pollution used for such studies is the amount in micrograms per cubic meter of particulate matter of a specified maximum aerodynamic diameter. We focus on PM2.5 (see Samet et al., 2000). Unfortunately, the definitive source of particulate matter data in the United States, the Environmental Protection Agency’s air pollution network of monitoring stations, collects data only a few times per week at some locations.

One of the most frequent observed data patterns for

2.5 are collected daily.

In this setting, directly fitting a model that includes several lags of PM2.5 simultaneously is not possible. Such models are useful, for example, to investigate a cumulative weekly effect of air pollution on health. They are also useful to investigate more finely the dynamics of the relationship between the exposure and response. As an example, one might postulate that after an increase in air pollution, high air pollution levels on later days may have a smaller impact, as the risk set has been depleted from the initial increase (Dominici et al., 2002; Schwartz, 2000; Zeger et al., 1999).

We focus on distributed lag models that relate the current-day disease rate to particulate matter levels over the past week. That is, our model includes the current day’s PM2.5 levels as well as those of the previous six days. While direct estimation of the effect for any particular lag is possible, joint estimation of the distributed lag model is not possible (see Section 20.3). Moreover, missing-data imputation for counties with patterned missing data is difficult. We consider a situation where several independent time series are observed at different geographical regions, some with complete PM2.5 data. We use multilevel models to borrow information across series to fill in the missing data via Bayesian imputation. The hierarchical model is also used to combine county-specific distributed lag effects into national estimates.

The rest of the chapter is organized as follows. In Section 20.2 we outline the data set used for analysis and follow in Section 20.3 with a discussion of Bayesian imputation. In Section 20.4 we describe the distributed lag models of interest, and in Section 20.5 we illustrate a multiple imputation strategy. Section 20.6 uses the imputation algorithm to analyze hospitalization rates of chronic obstructive pulmonary disease (COPD). Finally, Section 20.7 gives some conclusions, discussion and proposals for future work.
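
As a small illustration of the estimation problem noted earlier (any single lag can be estimated directly, but the seven-lag model cannot be fit jointly), the following Python sketch counts usable days under an every-third-day sampling pattern. The series length and all variable names are assumptions made for illustration; this is not the chapter's code or data.

```python
import numpy as np

n_days = 365   # hypothetical length of the daily outcome series
n_lags = 7     # lags 0..6: the current day plus the previous six days

# Assumed sampling pattern: PM2.5 is recorded on every third day only.
observed = (np.arange(n_days) % 3 == 0)

# Outcome days for which all seven lagged days fall inside the series.
days = np.arange(n_lags - 1, n_days)

# Number of days on which the exposure at each single lag is observed.
usable_per_lag = [int(observed[days - lag].sum()) for lag in range(n_lags)]

# Number of days on which all seven lagged exposures are observed together.
all_lags_observed = np.all([observed[days - lag] for lag in range(n_lags)], axis=0)
complete_cases = int(all_lags_observed.sum())

print("usable days per single lag:", usable_per_lag)
print("days with all seven lags observed:", complete_cases)
```

Under this assumed pattern each single lag is observed on roughly a third of the days, yet no day has all seven lags observed, since seven consecutive days cannot all be sampling days. This is why the missing exposures must be imputed before the joint model can be fit.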