ABSTRACT

In applied sciences, one is often confronted with the collection of correlated data. This generic term embraces a multitude of data structures, such as multivariate observations, clustered data, repeated measurements, longitudinal data, and spatially correlated data.

Historically, multivariate data have received the most attention in the statistical literature (e.g., Seber 1984, Krzanowski 1988, Johnson and Wichern 1992). Techniques devised for this situation include multivariate regression and multivariate analysis of variance. In addition, a suite of specialized tools exists, such as principal components analysis, canonical correlation analysis, discriminant analysis, factor analysis, cluster analysis, and so forth.

The generic example of multivariate continuous data is given by Fisher’s iris data set (e.g., Johnson and Wichern 1992), where, for each of 150 specimens, petal length, petal width, sepal length, and sepal width are recorded. This is different from a clustered setting where, for example, for a number of families, body mass index is recorded for all of their members. A design where, for each subject, blood pressure is recorded under several experimental conditions is often termed a repeated measures study. In the case that body mass index is measured repeatedly over time for each subject, we are dealing with longitudinal data. Although one could view all of these data structures as special cases of multivariate designs, there clearly are many fundamental differences, thoroughly affecting the mode of analysis. First, certain multivariate techniques, such as principal components, are hardly useful for the other designs. Second, in a truly multivariate set of outcomes, the variance-covariance structure is usually unstructured and hardly of direct scientific interest, in contrast to, for example, clustered or longitudinal data. Therefore, the methodology of the general linear model is too restrictive to perform satisfactory data analyses of these more complex data.

Replacing the time dimension in a longitudinal setting with one or more spatial dimensions leads naturally to spatial data. While ideas in the longitudinal and spatial areas have developed relatively independently, efforts have been made to bridge the gap between the two disciplines. In 1996, a workshop was devoted to this idea: “The Nantucket Conference on Modeling Longitudinal and Spatially Correlated Data: Methods, Applications, and Future Directions” (Gregoire et al. 1997).

Among the clustered data settings, longitudinal data perhaps require the most elaborate modeling of the random variability. Diggle, Liang, and Zeger (1994) distinguish among three components of variability. The first one groups traditional random effects (as in a random-effects ANOVA model) and random coefficients (Longford 1993). It stems from interindividual variability (i.e., heterogeneity between individual profiles). The second component, serial association, is present when residuals close to each other in time are more similar than residuals further apart. This notion is well known in the time-series literature (Ripley 1981, Diggle 1983, Cressie 1991). Finally, in addition to the other two components, there is potentially also measurement error. This results from the fact that, for delicate measurements (e.g., laboratory assays), even immediate replication will not be able to avoid a certain level of variation. In longitudinal data, these three components of variability can be distinguished by virtue of both replication and a clear distance concept (time), one of which is lacking in classical spatial and time-series analysis and in clustered data.
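To fix ideas, the three components can be displayed together in a linear mixed model for a continuous response; the following is a generic sketch in the spirit of the decomposition just described, not a formula quoted from the references above:

\[
Y_i = X_i\beta + Z_i b_i + W_i + \varepsilon_i, \qquad b_i \sim N(0, D), \qquad \varepsilon_i \sim N(0, \sigma_m^2 I_{n_i}),
\]

where $W_i$ is a zero-mean serial process with $\mathrm{Cov}\{W_i(t_{ij}), W_i(t_{ik})\} = \sigma_w^2\,\rho(|t_{ij} - t_{ik}|)$. The implied covariance matrix,

\[
\mathrm{Var}(Y_i) = Z_i D Z_i' + \sigma_w^2 H_i + \sigma_m^2 I_{n_i}, \qquad (H_i)_{jk} = \rho(|t_{ij} - t_{ik}|),
\]

collects the random-effects, serial-association, and measurement-error components in its three terms.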

These considerations imply that adapting models for longitudinal data to other data structures is in many cases relatively straightforward. For example, clustered data of the type considered in this book can often be analyzed by leaving out all aspects of the model that refer to time. In some cases, a version of serial association can be considered for clustered data with individual-level exposures. We refer to Chapter 4 for an overview of the modeling families that arise in this context.

A very important characteristic of data to be analyzed is the type of outcome. Methods for continuous data form no doubt the best developed and most advanced body of research; the same is true for software implementation. This is natural, since the special status and the elegant properties of the normal distribution simplify model building and ease software development. A number of software tools, such as the SAS procedure MIXED, the SPlus function lme, and MLwiN, have been developed in this area. However, categorical (nominal, ordinal, and binary) and discrete outcomes are also very prominent in statistical practice. For example, quality of life outcomes are often scored on ordinal scales. In many surveys, all or part of the information is recorded on a categorical scale.
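To give a flavor of these tools, a minimal random-intercepts analysis of a continuous outcome might be specified in the SAS procedure MIXED as follows; the data set GROWTH and the variables WEIGHT, AGE, and CHILD are hypothetical placeholders rather than data used in this book.

/* Random-intercepts model for a continuous outcome (illustrative sketch) */
proc mixed data=growth method=reml;
   class child;                          /* clustering variable               */
   model weight = age / solution;        /* fixed effects: intercept and age  */
   random intercept / subject=child;     /* between-child random intercept    */
run;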

Two fairly different views can be adopted. The first one, supported by large-sample results, states that normal theory should be applied as much as possible, even to non-normal data such as ordinal scores and counts. A different view is that each type of outcome should be analyzed using instruments that exploit the nature of the data. Extensions of GLIM to the longitudinal case are discussed in Diggle, Liang, and Zeger (1994), where the main emphasis is on generalized estimating equations (Liang and Zeger 1986). Generalized linear mixed models have been proposed by, for example, Breslow and Clayton (1993). Fahrmeir and Tutz (1994) devote an entire book to GLIM for multivariate settings. Subscribing to the second point of view, we will present methodology specific to the case of categorical data.

The main emphasis will be on clustered binary data from developmental toxicity studies (Section 1.2) and from survey data (Section 1.3). However, the modeling and analysis strategies described in this text have a much broader applicability.

In clustered settings, each unit typically has a vector Y of responses. This leads to several, generally nonequivalent, extensions of univariate models. In a marginal model, marginal distributions are used to describe the outcome vector Y, given a set X of predictor variables. The correlation among the components of Y can then be captured either by adopting a fully parametric approach or by means of working assumptions, such as in the semiparametric approach of Liang and Zeger (1986). Alternatively, in a random-effects model, the predictor variables X are supplemented with a vector b of random (or cluster-specific) effects, conditional upon which the components of Y are usually assumed to be independent. This does not preclude that more elaborate models are possible if residual dependence is detected (Longford 1993). Finally, a conditional model describes the distribution of the components of Y, conditional on X but also conditional on (a subset of) the other components of Y. Well-known members of this class of models are log-linear models. Several examples are given in Fahrmeir and Tutz (1994).
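For a clustered binary outcome $Y_{ij}$ (measurement $j$ within cluster $i$) with covariates $x_{ij}$, the three families can be contrasted schematically as follows; this is a sketch in generic notation, not a formulation taken verbatim from the references above:

\[
\text{marginal:}\qquad \mathrm{logit}\,\Pr(Y_{ij} = 1 \mid x_{ij}) = x_{ij}'\beta^{M},
\]
\[
\text{random effects:}\qquad \mathrm{logit}\,\Pr(Y_{ij} = 1 \mid x_{ij}, b_i) = x_{ij}'\beta^{RE} + b_i, \qquad b_i \sim N(0, \sigma_b^2),
\]
\[
\text{conditional:}\qquad \mathrm{logit}\,\Pr(Y_{ij} = 1 \mid x_{ij}, y_{ik}, k \neq j) = x_{ij}'\beta^{C} + \sum_{k \neq j} \alpha\, y_{ik}.
\]

The parameters $\beta^{M}$, $\beta^{RE}$, and $\beta^{C}$ generally differ both in value and in interpretation, which is why the choice among the families matters.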

For normally distributed data, marginal models can easily be fitted, for example, with the SAS procedure MIXED, the SPlus function lme, or within the MLwiN package. For such data, integrating a mixed-effects model over the random effects produces a marginal model, in which the regression parameters retain their meaning and the random effects contribute in a simple way to the variance-covariance structure. For example, the marginal model corresponding to a random-intercepts model is a compound symmetry model that can be fitted without explicitly acknowledging the random-intercepts structure. In the same vein, certain types of transition models induce simple marginal covariance structures. For example, some first-order stationary autoregressive models imply an exponential or AR(1) covariance structure. As a consequence, many marginal models derived from random-effects and transition models can be fitted with mixed-models software.
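As a small worked example of this correspondence, consider the random-intercepts model $Y_{ij} = x_{ij}'\beta + b_i + \varepsilon_{ij}$ with $b_i \sim N(0, \sigma_b^2)$ independent of $\varepsilon_{ij} \sim N(0, \sigma^2)$. Integrating over the random intercepts yields the marginal moments

\[
E(Y_{ij}) = x_{ij}'\beta, \qquad \mathrm{Var}(Y_{ij}) = \sigma_b^2 + \sigma^2, \qquad \mathrm{Cov}(Y_{ij}, Y_{ik}) = \sigma_b^2 \quad (j \neq k),
\]

so that every pair of measurements within a cluster shares the constant correlation $\rho = \sigma_b^2 / (\sigma_b^2 + \sigma^2)$, which is precisely the compound symmetry structure referred to above.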

It should be emphasized that the above elegant properties of normal models do not extend to the general GLIM case. For example, opting for a marginal model for clustered binary data precludes the researcher from answering conditional and transitional questions in terms of simple model parameters. This implies that each model family requires its own specific analysis and, consequently, its own software tools. In many cases, standard maximum likelihood analyses are prohibitive in terms of computational requirements. Therefore, specific methods such as generalized estimating equations (Chapter 5) and pseudo-likelihood (Chapters 6 and 7) have been developed. Both apply to marginal models, whereas pseudo-likelihood methodology can be used in the context of conditional models as well. When random-effects models are used, the likelihood function involves integration over the random-effects distribution, for which generally no closed forms are available. Estimation methods then either employ approximations to the likelihood or score functions, or resort to numerical integration techniques.
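For instance, in a random-effects model the likelihood contribution of cluster $i$ takes the generic form

\[
L_i(\beta, D) = \int \prod_{j=1}^{n_i} f(y_{ij} \mid b_i, \beta)\, \phi(b_i; 0, D)\, db_i,
\]

which for binary outcomes admits no closed form and must either be approximated or be evaluated numerically, for example by Gaussian quadrature.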

Some estimation methods have been implemented in standard software. For example, an analysis based on generalized estimating equations can be performed within the GENMOD procedure in SAS. Mixed-effects models for non-Gaussian data can be fitted using the MIXOR program (Hedeker and Gibbons 1994, 1996), MLwiN, or the SAS procedure NLMIXED. In many cases, however, specialized software, either commercially available or user-defined, will be needed.
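The two SAS routes mentioned here might be sketched as follows for a clustered binary outcome; the data set TOX and the variables Y, DOSE, and LITTER are hypothetical placeholders, and the exchangeable working correlation and the number of quadrature points are illustrative choices only.

/* Marginal model fitted with generalized estimating equations */
proc genmod data=tox descending;        /* descending: model Pr(Y=1)         */
   class litter;
   model y = dose / dist=binomial link=logit;
   repeated subject=litter / type=exch; /* exchangeable working correlation  */
run;

/* Random-intercepts logistic model fitted by numerical integration */
proc nlmixed data=tox qpoints=20;
   parms beta0=0 beta1=0 sigb=1;        /* starting values                        */
   eta = beta0 + beta1*dose + b;        /* linear predictor with random intercept */
   p   = exp(eta) / (1 + exp(eta));
   model y ~ binary(p);
   random b ~ normal(0, sigb*sigb) subject=litter;
run;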

In this book, we will focus on clustered binary data, arising from developmental toxicity studies, complex surveys, etc. These contexts will be introduced in the remainder of this chapter, whereas actual motivating examples will be introduced in Chapter 2. After discussing specific and general issues in modeling such data (Chapter 3) and reviewing the model families (Chapter 4), specific tools for analysis will be presented and exemplified in subsequent chapters. While the emphasis is on binary data, we also deal with the specifics of continuous outcomes (Chapter 13) and mixtures of binary and continuous outcomes (Chapter 14). Apart from model formulation and parameter estimation, specific attention is devoted to assessing model fit (Chapter 9), quantitative risk assessment (Chapter 10), model misspecification (Chapter 11), exact dose-response inference (Chapter 12), and individual-level covariates (Chapter 13), as opposed to cluster-level covariates.

In the next sections, we will deal with the specifics of developmental toxicity studies and complex surveys.