ABSTRACT

As discussed in Chapter 4, two main approaches to the analysis of clustered binary data are the cluster-specific (CS) approach (Section 4.3) and the population-averaged (PA) approach (grouping the conditional and marginal approaches of Sections 4.2 and 4.1). Cluster-specific models include cluster effects and thus are useful for assessing the effects of individual-level covariates. Individual-level covariates may take on different values, either by design or chance, for every unit in the cluster. These have also been referred to as cluster-varying covariates in the literature, since the values may vary within a cluster. Examples of CS models are mixed-effect logistic regression, with either parametric or nonparametric mixing distributions for the cluster effects, and conditional logistic regression. A number of cluster-specific approaches have been introduced in Section 4.3. In contrast, population-averaged models do not include cluster effects, and thus are most useful for assessing the effects of cluster-level covariates. Cluster-level covariates take on the same values for every unit in the cluster. The effects of individual-level covariates can also be estimated from population-averaged models, but their interpretations are based on the overall population, without adjusting for cluster effects. Quasi-likelihood models and models based on generalized estimating equations (GEEs, Chapter 5) fall under the heading of PA models. Excellent reviews of these modeling approaches for clustered binary data are provided by Prentice (1988), Fitzmaurice, Laird and Rotnitzky (1993), Diggle, Liang and Zeger (1994), and Pendergast et al. (1996).

Three examples will serve to illustrate the concepts of cluster-level versus individual-level covariates. First, consider a developmental toxicity study which evaluates the occurrence of fetal malformations in response to an environmental or chemical exposure. Clustered data result from the fact that binary outcomes (malformation versus no malformation) are evaluated on the offspring, while the exposure is administered to the pregnant female. In most developmental toxicity studies, as for most toxicity studies in general, the primary interest is in evaluating dose-response effects. Since the exposure level is a cluster-level covariate, many models encountered in the developmental toxicity literature are “population-averaged” (PA) models. In particular, GEEs have become a very popular choice for the analysis of developmental toxicity studies (Ryan 1992).

In contrast, consider a study conducted in 52 human subjects, 23 of whom were HIV-infected, in order to determine whether the lymphocyte proliferation assay (LPA) could be run on blood samples which had been shipped or stored rather than requiring fresh blood samples (Weinberg et al. 1998, Betensky and Williams 2001). The LPA measurements were performed on up to 36 combinations of conditions on each subject’s blood sample, reflecting the three possible storage methods (fresh, shipped, or stored blood samples), three different anticoagulants, and four possible stimulants. In this study, the individual subject defines the cluster and the repeated LPA measurements on each subject form the clustered outcomes. The primary interest focused on the handling method, which is a cluster-varying (i.e., individual-level) covariate. Anticoagulant and stimulant are also individual-level covariates, since they pertain to the processing of an individual blood sample within a study subject. However, HIV infection status is a cluster-level covariate since it remains constant for each study subject. For this type of study, cluster-specific models are likely to be more appropriate.

Finally, suppose a multicenter clinical trial has been conducted in HIV-infected subjects to compare the effects of two combination antiretroviral regimens on HIV-1 RNA viral load. The viral load may be analyzed as a continuous outcome after log-transformation (ignoring, for the moment, the issue of censored data resulting from measurements below the limit of quantification of the viral load assay), or by dichotomizing as above or below the limit of quantification. In either case, such studies often measure the viral load at each clinic visit, resulting in repeated measurements for each subject. One of the primary interests of such a study might be to compare the trajectories of viral load over follow-up time between patients randomized to the two regimens. In this scenario, treatment regimen is a cluster-level covariate, while week on therapy is a cluster-varying covariate. However, because primary interest revolves around identifying treatment differences, a population-averaged model would be appropriate.

In contrast to models for dependent continuous outcomes, the two approaches for dependent binary data produce parameters with different interpretations and actually address different questions. From the above examples, it should be clear that the choice between one modeling approach and another depends primarily on the scientific question of utmost importance to the study. However, there may be instances in which both cluster-level and individual-level covariates are of interest within the same study. For example, in the heatshock studies described in Section 2.2, the embryos are explanted from the uterus of the maternal dam and exposed in vitro to various combinations of heat stress (increased temperature) and exposure duration; thus, the exposure covariate does not remain constant within a litter. Yet the genetic similarity of offspring from the same litter may still induce an intralitter correlation. Analysis of such data with a CS model may allow distinction between the genetic and environmental components of the intralitter effect, whereas studies with only cluster-level covariates can account for a “litter effect” but cannot disentangle this any further. In some such cases it may be reasonable to consider both approaches as equally valid. In these situations, issues of efficiency and robustness should be considered.

In this chapter, modeling approaches are described for addressing individual-level covariates in the context of clustered binary outcomes. First, cluster-specific models for binary data are addressed, and are further broken down into conditional and marginal inferential approaches (Section 13.1). In Section 13.2, population-averaged models for binary data are reviewed, and are similarly subdivided into conditional and marginal model forms. The marginal models are further classified as likelihood-based versus those based on generalized estimating equations. Issues of efficiency are discussed in Section 13.3, for situations in which more than one modeling approach might produce estimates with valid interpretations. An example of analysis by these various modeling approaches for the particular case of the heatshock data (see Section 2.2) is provided in Section 13.4. In Section 13.5, cluster-specific (or random-effects) models are discussed when the outcomes of interest are continuous. Let us first turn attention to binary outcomes.
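
As a brief numerical aside on the earlier remark that the two approaches yield parameters with different interpretations for binary data, the following minimal sketch may be helpful. It assumes a random-intercept logistic model with a normally distributed cluster effect, which is only one of the CS models mentioned above, and the intercept, slope, and standard deviation values are hypothetical choices made purely for illustration. The function names are likewise illustrative and do not come from the text. The sketch averages the cluster-specific probabilities over the cluster-effect distribution and reports the implied population-averaged log odds ratio.

```python
import math


def expit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def marginal_prob(eta, sigma, n_grid=2001, width=8.0):
    """Average expit(eta + b) over b ~ N(0, sigma^2) with a midpoint rule.

    The integral is taken over +/- `width` standard deviations, which is an
    assumption that leaves a negligible amount of normal mass in the tails.
    """
    step = 2.0 * width / n_grid
    total = 0.0
    for k in range(n_grid):
        z = -width + (k + 0.5) * step  # grid point on the standard-normal scale
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += expit(eta + sigma * z) * density * step
    return total


def pa_log_odds_ratio(beta0, beta_cs, sigma):
    """Population-averaged log odds ratio for x = 1 versus x = 0 implied by
    the cluster-specific model logit P(Y = 1 | b, x) = beta0 + beta_cs * x + b."""
    p0 = marginal_prob(beta0, sigma)
    p1 = marginal_prob(beta0 + beta_cs, sigma)
    return logit(p1) - logit(p0)


# Hypothetical parameter values, chosen only for illustration.
beta0, beta_cs = -1.0, 1.0
for sigma in (0.0, 1.0, 2.0):
    pa = pa_log_odds_ratio(beta0, beta_cs, sigma)
    print(f"sigma = {sigma:.1f}: CS slope = {beta_cs:.3f}, implied PA slope = {pa:.3f}")
```

Under these assumptions the implied PA slope equals the CS slope only when there is no cluster-to-cluster variation, and it shrinks toward zero as the cluster-effect standard deviation grows. With an identity link and a continuous outcome, the same averaging over a zero-mean random intercept leaves the slope unchanged, which is consistent with the contrast drawn above between continuous and binary outcomes.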