ABSTRACT

As more and more data become available electronically, data privacy has become a serious concern. Recent advances in data mining technology and its widespread adoption make privacy an even more critical issue. On the one hand, data producers must protect the privacy of their data because of business interests or government regulations. On the other hand, data producers want to publish their data for analysis and mining to find global patterns and trends that are not present in local data. To resolve these conflicting requirements, data mining researchers have proposed privacy preserving data mining [1, 14]. In privacy preserving data mining, data privacy is protected without seriously hurting data mining performance.

Many approaches have been proposed for privacy preserving data mining. They can be broadly grouped into two categories: data perturbation approaches and secure multiparty computing approaches. To preserve privacy, data perturbation approaches modify the original data through noise addition, generalization, transformation, swapping, and so on. Secure multiparty computing approaches assume that data are distributed among multiple parties who securely compute global patterns without revealing their data. Both categories of approaches assume that data are in the format of a relational table with rows representing objects and columns representing attributes.

A very popular kind of data in data mining is the time series. A time series consists of data points measured at constant time intervals that make up the time dimension. Time series data mining has attracted a lot of interest due to the abundance of time series in nature and society, and their special characteristics not found in relational data. Examples of time series include daily stock prices, monthly sales, and daily weather. However, the time dimension also makes privacy in time series more complicated than in relational data.

With the high dimensionality of time series data, secure multiparty computing approaches are impractical due to the overhead in computation and communication. Meanwhile, many existing data perturbation approaches developed for relational data are ineffective if applied directly to time series data. For example, one data perturbation method adds random noise to the original data and publishes the noisy data. For time series data, if the noise is independent of the original data, the noise can be filtered out to reveal the original time series.

In this chapter, we first discuss issues related to privacy protection in time series data mining. Existing methods for preserving privacy are summarized and their applicability to time series data is examined. We then propose a method to preserve privacy in time series data mining by adding segment-based noise. In our method, a time series is first divided into segments. In each segment, noise that is dependent on the segment data is added. Our method can prevent privacy breaches under attacks such as filtering and regression. We also study the effect of noise on classification accuracy. The method is implemented and tested using a real dataset. Its effectiveness in privacy protection and its impact on classification accuracy are reported.

The rest of the chapter is organized as follows. In Section 11.2, issues related to privacy protection in time series data mining are discussed, including the protection of and threats to time series privacy, methods for privacy preserving data mining, and measurements of privacy breach. Our method for privacy preservation using segment-based noise is given in Section 11.3 with an algorithm. Experimental results and performance evaluation are presented in Section 11.4. Section 11.5 concludes our study and points to some future work.
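
To make the filtering threat and the segment-based idea concrete, the following is a minimal sketch in Python, not the algorithm of Section 11.3. It assumes a synthetic sine-wave series, Gaussian noise, and a simple moving-average filter as the attack. The function `add_segment_noise`, its parameters `segment_length` and `scale`, and the choice of tying the noise level to each segment's standard deviation are all hypothetical, used only to illustrate what "noise that is dependent on the segment data" could look like.

```python
import math
import random
import statistics


def moving_average(series, window):
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def add_independent_noise(series, sigma, rng):
    """Baseline perturbation: Gaussian noise drawn independently of the data."""
    return [x + rng.gauss(0.0, sigma) for x in series]


def add_segment_noise(series, segment_length, scale, rng):
    """Hypothetical segment-based scheme (an assumption, not the chapter's
    algorithm): the noise level in each segment is proportional to that
    segment's own standard deviation."""
    noisy = []
    for start in range(0, len(series), segment_length):
        segment = series[start:start + segment_length]
        sigma = scale * statistics.pstdev(segment)
        noisy.extend(x + rng.gauss(0.0, sigma) for x in segment)
    return noisy


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic smooth time series (an assumption; not the chapter's dataset).
original = [math.sin(2 * math.pi * t / 100) for t in range(500)]

# Publish with independent noise, then attack with a moving-average filter.
published = add_independent_noise(original, sigma=0.5, rng=rng)
recovered = moving_average(published, window=21)
error_published = rmse(original, published)
error_recovered = rmse(original, recovered)
print(f"RMSE of published series vs. original: {error_published:.3f}")
print(f"RMSE of filtered series vs. original:  {error_recovered:.3f}")

# Publish with the hypothetical segment-dependent noise instead.
segment_published = add_segment_noise(original, segment_length=50, scale=0.5, rng=rng)
```

In this sketch the filtered series lies much closer to the original than the published one, which is the sense in which independent noise "can be filtered out". The segment-based function only shows the structure of the idea (divide, then perturb each segment using its own statistics); it makes no claim about how well this particular choice of noise resists filtering or regression.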