ABSTRACT

Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee

Christopher Cassa

Children’s Hospital Informatics Program, Harvard-MIT Division of Health Sciences and Technology, Boston, Massachusetts

Murat Kantarcioglu

Computer Science Department, University of Texas at Dallas, Dallas, Texas

Numerous government-sponsored ventures, such as the U.S. Department of Health and Human Services’ Personalized Health Care Initiative [22], have fueled growth in personalized medicine research and its application. This, in combination with the increasing ubiquity of clinical information systems and DNA sequencing technology, has stimulated an explosion in the quantity of patient-specific clinical and genomic data stockpiled in electronic medical records [1, 35] and biomedical research environments [40, 96]. Until recently, the collection, analysis, and application of clinical and genomic data were mainly localized to specific investigators or institutions [81]. Increasingly, however, scientists need to share data to strengthen the statistical power of complex association studies, allow the research community the opportunity to replicate and verify clinically relevant findings, and comply with a host of laws and regulations [29]. To assist in data sharing, agencies around the globe have invested considerable monetary and social capital to construct information technology infrastructure, such as the Database of Genotypes and Phenotypes (dbGaP) at the U.S. National Library of Medicine, National Institutes of Health (NIH) [61], which will facilitate the consolidation, standardization, and dissemination of person-specific records from disparate investigators. The availability of such databanks for widespread data mining activities is contingent on the protection of patient anonymity [70] and, while biomedical privacy policies and protection technologies exist, many studies (e.g., [64]) show they are vulnerable to various privacy-compromising attacks. This is particularly a concern because demographic and clinical data derived from patients’ medical records are increasingly shared [56], thus heightening the probability that sensitive information can be “re-identified” to the originating patient [58]. As DNA sequences cannot be revoked or changed once they are released, any disclosure of such data poses a life-long privacy risk.

Beyond the clinical environment, there is an increasing push to circumvent
anonymity issues and make personal genomic information available in a public setting in a fully identifiable format. A clear example of this grassroots movement is the Personal Genome Project [12], which is publishing the complete genome sequences and fully identified medical records of ten initial volunteers for its PGP-10 project with hopes of scaling up to 100,000 volunteers [23]. However, even as such projects march toward an “open” access environment, there is still a concern over privacy. Several years ago, for instance, James Watson, one of the most preeminent scientists of the twentieth century and a co-discoverer of the double-helix model of DNA, agreed to have his full genome sequenced and made available in an online searchable format [103]; that is, all of his genome except the sequence of his APOE gene, which is an indicator of an individual’s potential to develop Alzheimer’s disease [84]. Following suit, members of the PGP-10 program have made the same request for APOE suppression [80]. Despite such requests, the residual information in one’s genome sequence may be sufficient to infer the status of a suppressed gene, and several scientists recently reported on a statistical model developed to infer Watson’s APOE gene status with a high degree of certainty [75].

The aforementioned violation is only one of many concerns looming for
biomedical data sharing. In this chapter, we review various threats to privacy in the context of genomic data collection, sharing, and mining. In general, we believe that several aspects of genomic data make it unique with respect to databases, data mining, and privacy. First, the inclusion of genomic data in the clinical realm creates a complex regulatory context (e.g., medical privacy laws and data sharing requirements) that is not found elsewhere. Second, the semantics of the data are unique in that they allow for the direct inference of genetic and clinical information about family members who have not consented to having their data collected or shared. Third, genomic
data occupies a very high-dimensional space (each individual’s genome comprises over 3 billion base pairs of DNA), which allows it to be tracked across locations and poses a challenge to standard database obfuscation techniques. To derive and apply real privacy protection solutions, it is necessary to understand the interplay of these facets of the problem.

The focus of this chapter is not on the specific technical minutiae of the attacks, the database architecture in which such attacks are applicable, or the details of how to prototype the solutions to thwart the problems. Rather, because many genomic database systems are currently in initial development or beta stage, we aim to provide the reader with a broader perspective of how such privacy violations come to be and the potential ways in which they can be addressed in the context of such data. In doing so, we hope to provide genomic database and system architects with general guidelines and reasoning tools they can apply when considering potential privacy concerns. Where necessary, we provide the reader with pointers to further reading and details on the various attacks and protection mechanisms.

The remainder of this chapter is organized as follows. Section 12.2 surveys
the social mechanisms that have led to biomedical data sharing as well as certain policies that govern and prescribe privacy protection requirements. Given this regulatory basis, Section 12.3 reviews various approaches that have been devised for re-identifying seemingly anonymous biomedical records, as well as for inferring knowledge suppressed from such records. Section 12.4 follows with a review of emerging computational models and technologies for formally protecting biomedical records from privacy violations. We close this chapter in Sections 12.6 and 12.7, where we discuss the direction of this field and opportunities for genomic data privacy methods development, deployment, and integration.