ABSTRACT

When disease incidence locations are observed in a region, there is often interest in studying whether there is clustering about landmarks representing possible centralized sources of the disease. In this article we study a Bayesian approach to the detection and estimation of such landmarks. Spatial point processes are used to specify both the observation process and the prior distribution of the landmarks. We develop a perfect sampling algorithm for the posterior distribution of landmarks under various conditions on the prior and likelihood.

Bayesian cluster models of the type we consider were introduced by Baddeley and van Lieshout (1993), primarily for applications in computer vision. The dissertation of van Lieshout (1995) (see also Chapter 4) focused on the special case of the Neyman-Scott cluster process, in which the observations arise from a superposition of inhomogeneous Poisson processes associated with each landmark (Neyman and Scott 1958); she applied it to the well-known redwood seedling data used by Strauss (1975). Hurn (1998) applied the Baddeley-van Lieshout approach to the study of changes in the size and shape of living cells. Lawson and Clark (1999b) survey the statistical literature on disease clustering models.

Markov chain Monte Carlo (MCMC) techniques are indispensable for the application of point process models in statistics; see, for example, the survey of Møller (1999). The typical MCMC sampler obtains draws that are at best only approximately from the target distribution, and is often plagued by convergence problems, even when a long “burn-in” period is used. Moreover, if independent draws are required, then every draw must be produced by a separate chain. However, using an algorithm developed by Kendall and Møller (2000), it is possible to sample perfectly from the posterior distribution in the Bayesian cluster model.
Perfect samplers originate in the seminal work of Propp and Wilson (1996), whose coupling from the past (CFTP) algorithm delivers an exact draw from the target distribution. The most important practical advantage of perfect samplers over traditional MCMC schemes is that the need to assess convergence of the sampler is eliminated. There are many examples of perfect samplers in the literature. Mira et al. (2001) apply perfect simulation to slice samplers for bounded target distributions; Casella et al. (1999) create perfect slice samplers for mixtures of exponential distributions and mixtures of normal distributions. For an example of perfect sampling of a conditioned Boolean model, see Cai and Kendall (1999). Häggström et al. (1999) obtain perfect samples from the area-interaction point process (attractive case) using auxiliary variables. Another version of perfect sampling called read-once CFTP, which runs the Markov chain forward in time and never restarts it at previous past times (thus avoiding the need to store random numbers for reuse), is given by Wilson (2000). For recent applications of perfect simulation in statistics, see Green and Murdoch (1999) and Møller and Nicholls (1999).

The Kendall-Møller version of perfect simulation was developed for spatial point processes that are locally stable. We show that the posterior distribution in the Bayesian cluster model is locally stable on its support, provided the prior is locally stable and the likelihood satisfies some mild conditions. In particular, this shows that the posterior density is proper (has unit total mass). The Kendall-Møller algorithm is computationally feasible, however, only when the locally stable point process is either attractive (favoring clustered patterns) or repulsive (discouraging clustered patterns), or a product of attractive and repulsive components. We examine the feasibility of the sampler in two special cases: the Neyman-Scott process and the pure silhouette model (in which only the support of the observations is determined by the landmarks).

Our approach is applied to data on leukemia counts in an eight-county area of upstate New York during the years 1978-82.
There is an extensive literature on the analysis of these data, the main goal being the detection of disease clusters; see, for example, Ghosh et al. (1999), Ahrens et al. (1999) and Denison and Holmes (2001). The study area includes 11 inactive hazardous waste sites, and it is natural in any study to ask whether any of these sites contributes to the incidence of leukemia in the area. We find evidence for an elevated leukemia incidence in the neighborhood of one of the sites.

The paper is organized as follows. Section 5.2 gives background material and describes the general model. That section also states the main result of the paper, showing that the posterior is locally stable on its support. Section 5.4 examines several examples of the basic model, and addresses the question of whether perfect sampling is feasible in each case. Section 5.5 contains the disease clustering application, and a further application to the classic redwood seedlings data is given in Section 5.6.
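For readers unfamiliar with coupling from the past, the following minimal Python sketch illustrates the basic Propp-Wilson idea on a toy example. It is not the Kendall-Møller algorithm for point processes used in this paper. It assumes a reflecting random walk on {0, ..., n} whose update rule is monotone in the current state, so that only the chains started from the smallest and largest states need to be tracked. The names `update` and `monotone_cftp` and the parameters `n`, `p` and `rng` are hypothetical and chosen only for illustration.

```python
import random


def update(x, u, n, p):
    """One step of a reflecting random walk on {0, ..., n}.

    For a fixed uniform variate u the map x -> update(x, u, n, p) is
    non-decreasing in x, which is the monotonicity that this form of
    CFTP relies on.
    """
    return min(x + 1, n) if u < p else max(x - 1, 0)


def monotone_cftp(n=10, p=0.5, rng=None):
    """Return one draw from the stationary law of the toy walk via CFTP."""
    rng = rng or random.Random()
    us = []  # us[k] drives the transition from time -(k + 1) to time -k
    T = 1
    while True:
        # Extend the randomness further into the past. Variates already
        # generated for times closer to 0 are reused, never redrawn.
        while len(us) < T:
            us.append(rng.random())
        lo, hi = 0, n  # minimal and maximal starting states at time -T
        for k in range(T - 1, -1, -1):  # run from time -T up to time 0
            lo = update(lo, us[k], n, p)
            hi = update(hi, us[k], n, p)
        if lo == hi:
            # Every starting state is sandwiched between lo and hi, so
            # all chains started at time -T have coalesced by time 0.
            return lo
        T *= 2  # no coalescence yet: start further back in the past
```

With `p = 0.5` the transition matrix of this toy chain is symmetric, so its stationary distribution is uniform on {0, ..., n}, and repeated calls to `monotone_cftp` give independent draws from it. Reusing the stored variates when the starting time is pushed back is essential: redrawing them would in general bias the output.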