ABSTRACT

Ling Liu

Distributed Data Intensive Systems Lab, College of Computing, Georgia Institute of Technology, Atlanta, Georgia

Privacy preservation is widely recognized as an important requirement in large-scale data integration and data mining systems that involve private personal information, e.g., medical data or census data. Individuals are usually unwilling to contribute their information if they know that its privacy could be compromised. A plethora of work has been done on preserving input privacy for static data [4, 8, 17, 26, 29]; these techniques assume an untrusted mining-service provider and enforce privacy restrictions and regulations by modifying the raw data before performing the mining task, with the goal of achieving a desired balance between privacy and utility.

In a strict sense, however, privacy preservation not only refers to preventing unauthorized access to raw data that could lead to leakage of sensitive information, but also includes eliminating unwanted disclosure of sensitive patterns through the mining output; that is, the published mining output should not be informative enough to infer properties possessed by only a few individuals, which we refer to as output privacy.

The existence of output-privacy risk is explained by the fact that existing input-privacy-preserving techniques are designed to make the patterns constructed over sanitized data as close as possible, if not identical, to those built over the raw data, in order to guarantee the output utility. This no-outcome-change property is considered a pillar of privacy-preserving data mining [6]. As long as the significant statistical information of the raw data is preserved, the risk of disclosing private information persists; one can thus conclude that preserving input privacy and preserving output privacy are not necessarily equivalent.

For instance, consider a nursing-care database that records the observed symptoms of the patients in a hospital. By mining such a database, one can discover valuable information about syndromes characterizing particular diseases. The released mining output, however, can also be leveraged to uncover combinations of symptoms so special that only rare people match them, which poses a severe threat to individuals' privacy. Assume that Alice knows that Bob has symptoms a and b, but not c (c̄). By analyzing the mining output, she finds that only one person in the hospital matches the specific combination {a, b, c̄}, and only one has all of {a, b, c̄, d}. She can then conclude that this person must be Bob, who therefore also has symptom d. Furthermore, by studying another medical database, she might learn that the combination {a, b, c̄, d} is linked to a rare disease with high probability. This is clearly a privacy-intruding conclusion.
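
To make this inference concrete, the following sketch (ours, not taken from the chapter; all item names and support counts are hypothetical illustrations) shows how an adversary can combine published itemset supports via inclusion-exclusion to isolate a unique individual:

    # A minimal sketch of the inference above, assuming the mining output
    # publishes exact supports of frequent itemsets. All counts below are
    # hypothetical illustrations, not data from the chapter.
    published_supports = {
        frozenset("ab"):   24,
        frozenset("abc"):  23,   # so exactly one record has a, b but not c
        frozenset("abd"):  13,
        frozenset("abcd"): 12,   # so exactly one record has a, b, d but not c
    }

    def support(items):
        return published_supports[frozenset(items)]

    # Inclusion-exclusion over the published supports:
    n_ab_not_c  = support("ab")  - support("abc")    # records with {a, b, not-c}
    n_abd_not_c = support("abd") - support("abcd")   # records with {a, b, d, not-c}

    if n_ab_not_c == 1 and n_abd_not_c == 1:
        # The single record matching {a, b, not-c} must be the same record
        # matching {a, b, d, not-c}; the unique person (Bob) thus has d.
        print("Unique match on {a, b, not-c}: that individual also has d.")

Patterns whose derived counts are this small are precisely the ones that output-privacy countermeasures must suppress or perturb.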

The example above exposes an output-privacy breach in a single release of mining output; the issue is even more severe in stream mining, wherein the mining results need to be published in a continuous and timely manner: not only might a single release contain privacy breaches, but multiple releases could also be exploited in combination, given the overlap of the corresponding input data.
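
As an illustration of such cross-release inference (again our sketch, not the chapter's; the stream contents and window sizes are hypothetical), differencing the supports published for two overlapping sliding windows cancels the shared transactions and leaks information about the records entering and leaving the window:

    # A minimal sketch of combining two overlapping releases.
    from collections import Counter
    from itertools import combinations

    def supports(window):
        """Exact support counts of all non-empty itemsets in a window."""
        counts = Counter()
        for transaction in window:
            for r in range(1, len(transaction) + 1):
                for items in combinations(sorted(transaction), r):
                    counts[frozenset(items)] += 1
        return counts

    stream = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "d"}]
    release1 = supports(stream[0:3])   # window over transactions 0..2
    release2 = supports(stream[1:4])   # window over transactions 1..3

    # Shared transactions 1..2 cancel out; every nonzero delta is attributable
    # solely to the expired transaction 0 and the newly arrived transaction 3.
    for itemset in set(release1) | set(release2):
        delta = release2.get(itemset, 0) - release1.get(itemset, 0)
        if delta:
            print(sorted(itemset), delta)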

This chapter presents a survey of a variety of countermeasures against mining-output inference in the literature, with the aim of pinpointing individual research efforts on the grand map of output-privacy protection and identifying open issues and technical challenges worthy of further investigation.

The first part of this chapter presents an overview of privacy-preservation techniques in data mining. Specifically, we attempt to answer the following three questions. First, what is the current state of research on privacy protection in data mining, especially stream mining? Second, what is the difference between input- and output-privacy protection? Third, is it sufficient to apply the input-privacy-preserving models and techniques developed to date to protecting output privacy?

The second part is devoted to a brief survey of the research on output privacy, taking a close look at a variety of countermeasures against adversarial inference over mining output. In particular, we discuss two categories of approaches, namely reactive and proactive approaches, and analyze their strengths and weaknesses.

In the final part of this chapter, we conduct a concrete case study of a lightweight countermeasure in the context of frequent-pattern mining over data streams, a well-known data mining problem. We use this case to illustrate the general principles of designing output-privacy solutions for data mining, especially stream mining. We conclude this chapter by discussing

FIGURE 6.1: Overall framework of privacy-preserving data mining. © IEEE 2008 [35].