ABSTRACT

Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey

Mehmet Ercan Nergiz

Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey

Sequence data can be defined as ordered data units which may be extended with other dimensions such as space and time for specific applications. Many forms of new and emerging data types are actually in the form of sequences. For example, DNA or protein sequences from bio-informatics, click streams from the Web, and spatio-temporal trajectories of moving objects from GPS or GSM applications are few of such popular data types. Masses of collected sequence data, naturally attracted data mining researchers to develop algorithms for extracting patterns or descriptive models from sequences. In fact, one of the early data mining papers considers customer purchase sequences as the target data source [4] to discover frequently appearing sequences of customer purchases. Another seminal data mining paper is on mining frequent episodes from event sequences which proposed methods for discovering event sequences in the form of sequential or concurrent events. [17]. The sequence data type is versatile enough to provide data models for many

applications. For example DNA is a sequence of amino acids, and time series is an ordered sequence of temporally annotated values. When we include location as well as time information, then we obtain spatio-temporal traces of

moving objects in the form of trajectories. Knowledge hidden in masses of raw sequence data could be harvested with data mining technology obtaining frequently occurring subsequences, or patterns within sequence data, clusters of sequences, or predictive models. For example, in the case of trajectory data which is a sequence of time-stamped locations of an individual or an object in general, we may find the frequently followed paths, in the form of subsequences, or we can obtain places periodically visited by the same person. These frequent subsequences show the movement habits of individuals. Using sequence clustering techniques, we can obtain the collective movement behavior of people at different periods of time. Such descriptive models could be used for traffic management and city planning. With classification algorithms, we can learn the activities of people with respect to their movement behavior, movement speeds, etc. In the light of these applications, one can easily see that sequence data contains valuable information to be harvested with data mining technology. Besides being an invaluable source of information, sequence data about in-

dividuals are also extremely sensitive from a privacy perspective. DNA data contain a lot of information about how prone we are to certain types of disease, which we may not want to share with health insurance companies. A typical example is the Huntington’s disease which is a neurological degenerative disease that affects people after their 30’s and shows itself by degeneration in cognition and movement. There are genetic predictive tests for persons who have an affected parent or relative. The test can predict in almost all cases whether the subjects will develop the disease at some stage in their life. Our trajectories also contain very sensitive information since they show the places we visit, our habits and so on. An interesting work by Malin and Sweeney shows that location data and genomic data may in fact be used together to identify genetic information of individuals [16]. Their re-identification method is based on the fact that people visit multiple hospitals in sequence, which collect both sensitive data such as our DNA, and non-sensitive data which can be linked with further analysis. Privacy issues need to be considered when performing data mining tasks

on sensitive data which includes sequence data as well. From a general perspective, privacy may be compromised when the released data contain private information. Privacy may also be compromised even when only the data mining models are released [10]. For example, there may be patterns which could be linked to individuals. This issue has been investigated in the context of association rules in [26, 6] where rules with very high confidence could be used to infer sensitive values. Therefore, privacy can be handled at the data preprocessing level or model construction level depending on how the data mining is going to be performed and by whom. When the datasets are going to be released, proper de-identification tech-

niques need to be applied to prevent the private data to be linked with personal identifiers [24, 25]. One effective way of de-identification is anonymization in which data values (e.g., nation: USA, age: 12) are replaced with more gen-

Domains

eral values (e.g., nation: America, age: [10-20]) to create a certain degree of indistinguishability between sets of tuples. Research on this area proposed several privacy metrics to address various types of applications. Among these metrics, k-anonymity [24, 25] ensures each individual in the dataset is indistinguishable with at least k − 1 other individuals, however does not fully protect against the identification of sensitive attributes. ℓ-Diversity [14], tcloseness [13], (α,k)-anonymity [29], and discernibility [22] enforce diversity in the sensitive attributes of indistinguishable groups thus bound the probability that an individual in the dataset possesses a sensitive attribute. Another metric, δ-presence [19], protects the existence or non-existence of individuals in a released dataset when the information on the presence of individuals is considered sensitive (e.g., releasing a diabetes dataset). Although the early anonymization algorithms did not consider data min-

ing as the main utility of the released data, later on this has changed and data mining aware anonymization techniques have been developed [31, 7, 12]. A recently proposed method for privacy preserving data mining is based on condensation of the data which was partially inspired by the notion of kanonymity. The main idea of condensation is to release the “condensed drops rather than the vapor itself” [2]. In database terms, condensation-based approach partitions the multidimensional data into groups with well-defined statistical properties which are essential for data mining. This ensures privacy at the data preprocessing step of knowledge discovery process. For example, suppose a dataset contains a vector of values <10,11,13,20>. An anonymization would make these values indistinguishable by replacing them with their generalizations; <[10-20],[10-20],[10-20],[10-20]> where a-b denotes the range of values from a to b. Note that such a process destroys the statistics (e.g., mean, variance, · · · ) in the original data. A condensation-based approach would return four values drawn from a random variableX with domain 10-20. We pick the distribution of X such that we preserve enough statistics from the data. An example distribution might be a normal with mean and variance of the original values. Such a condensation of the vector might look like <11,12,13,17>. In this chapter, we will first define the general notion of sequences which

can be extended with other dimensions such as time and space for different applications. Then we will briefly discuss the state of the art data mining algorithms on sequence data; after that we will look at the privacy issues arising from the emerging applications. We will then discuss how the currently proposed condensation-based methods for privacy preserving data mining apply for sequence data, and provide future directions for research.