ABSTRACT

Div. of Mathematical and Computer Sciences and Engineering, The King Abdullah University of Science and Technology, Saudi Arabia

Nowadays, most human activities leave some trace in a digital registry. Whether one buys something in a supermarket or is admitted to a hospital, an entry will be made in her credit card history or her medical record. Data collected this way can be used to improve the quality of services, e.g., companies can adjust faster to the needs of consumers, or even contribute to scientific advancement; doctors can study the medical histories of thousands or even millions of patients. On the other hand, such data collections pose a significant threat to the privacy of the individuals associated with them. Even if information that directly identifies a person (e.g., the name or the social security number) is removed from a database record, the individual associated with the record can still be identified by linking the remaining information (e.g., age, zip code of residence) to external knowledge (e.g., a public voter list). In most privacy protection settings, we assume that a part of the disseminated information is publicly known and can be used as a link between the anonymized database and the external knowledge possessed by the attacker.

FIGURE 2.1: Example database. (a) original, (b) anonymized

The attributes that comprise this linking information are usually termed the quasi identifier. For example, in the database of Fig. 2.1.a, where medical information about patients is revealed, “age” and “region” form the quasi identifier, since an attacker might obtain this knowledge about individuals from another source. Using this knowledge, she could identify the person associated with each record. For example, if there is a unique person in a demographics database who is 22 years old and resides in Athens, then an observer of the database of Fig. 2.1.a can infer that this person suffers from AIDS. With k-anonymity [23] the initial database is transformed in such a way that each tuple is indistinguishable from at least k−1 other tuples, based on the quasi identifier. Each group of tuples with identical quasi identifier values forms an equivalence class. A simple transformation that prevents linkage to an external knowledge source is to generalize the values of the quasi identifiers so as to obtain equivalence classes of cardinality at least k. For instance, the transformed table of Figure 2.1.b satisfies 3-anonymity, since each entry is indistinguishable from at least two other entries based on the quasi identifier. Anonymization algorithms have to balance two conflicting aims: a) providing the required privacy guarantee and b) preserving as much of the original information as possible. This translates into an effort to perform the minimum number of transformations needed to achieve the desired privacy guarantee. In most cases the transformations create equivalence classes where all quasi identifier values are identical; to achieve this they employ a variety of techniques such as generalization, suppression and perturbation. l-diversity [17] imposes an even stricter requirement than k-anonymity: the publicly known information cannot be linked to any sensitive value in the anonymized table with probability higher than 1/l. For example, the transformation shown in Figure 2.1.b may not be acceptable, since the probability (2/3) that a tuple in the “Northern Greece” equivalence class is associated with cancer may be considered too high.
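To make these definitions concrete, the following minimal Python sketch (an illustration, not part of the original chapter; the attribute names and table contents are hypothetical, mimicking Figure 2.1.b) checks whether a table satisfies k-anonymity and l-diversity.

```python
from collections import Counter, defaultdict

def check_k_anonymity(rows, qi, k):
    """Every combination of quasi-identifier values must occur in >= k rows."""
    classes = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(count >= k for count in classes.values())

def check_l_diversity(rows, qi, sensitive, l):
    """In each equivalence class, no sensitive value may have frequency > 1/l."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in qi)].append(row[sensitive])
    for values in classes.values():
        max_count = Counter(values).most_common(1)[0][1]
        if max_count / len(values) > 1.0 / l:
            return False
    return True

# Hypothetical generalized table in the spirit of Figure 2.1.b.
table = [
    {"age": "20-30", "region": "Northern Greece", "disease": "cancer"},
    {"age": "20-30", "region": "Northern Greece", "disease": "cancer"},
    {"age": "20-30", "region": "Northern Greece", "disease": "flu"},
    {"age": "20-30", "region": "Southern Greece", "disease": "AIDS"},
    {"age": "20-30", "region": "Southern Greece", "disease": "flu"},
    {"age": "20-30", "region": "Southern Greece", "disease": "cancer"},
]
qi = ["age", "region"]
print(check_k_anonymity(table, qi, k=3))             # True: both classes have 3 tuples
print(check_l_diversity(table, qi, "disease", l=3))  # False: cancer appears in 2/3 > 1/3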

If the domain of the quasi identifiers is small compared to the size of the database, then it is likely that many different entries will share the same quasi identifiers and few transformations will be needed. Imagine, for example, that in the database of Figure 2.1.a we only have 100 different values for age (1-100 years old) and 10 distinct regions. There are 1000 different quasi identifiers based on these attribute domains. If the database contained millions of entries and we wanted to achieve k-anonymity for a very small k, it is quite likely that no transformations would be required. On the other hand, if the database contained more attributes that could be used as quasi identifiers, e.g., nationality, gender, weight, height, etc., then the anonymizer’s job would be harder; the domain of a quasi identifier grows exponentially with the number of its attributes, and it would be much easier for the adversary to uniquely identify a person in the original database.
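The arithmetic behind this reasoning can be sketched in a few lines of Python; the domain sizes for the extra attributes below are assumptions chosen only to illustrate the exponential growth.

```python
from math import prod

def qi_domain_size(domain_sizes):
    # The quasi-identifier domain is the product of the attribute domains,
    # so it grows exponentially with the number of attributes.
    return prod(domain_sizes)

# Two attributes, as in the example: 100 ages x 10 regions.
print(qi_domain_size([100, 10]))              # 1000 combinations

# Expected equivalence class size for 5 million records, assuming
# (unrealistically) uniformly distributed values.
print(5_000_000 / qi_domain_size([100, 10]))  # 5000.0 records per combination

# Adding hypothetical domains for nationality, gender, weight, height
# blows up the domain and makes most records unique.
print(qi_domain_size([100, 10, 200, 2, 150, 120]))  # 7,200,000,000 combinations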

Anonymization techniques can be seen as efforts to bring together data points scattered in the quasi-identifier space. Collections of such points are replaced by a single point or by a multidimensional region that encloses them. Most utility metrics that measure the quality of the anonymized data are affected by how scattered the grouped data are. Thus, quality is inversely proportional to the volume of the multidimensional region that encloses an equivalence class; the more clustered the original data are, the less information loss is introduced when they are placed in the same equivalence class. As the number of attributes in the quasi identifier grows, it becomes harder to detect clusters of high quality, and the information loss becomes very high. This effect, known as the curse of dimensionality, also appears in other areas of data management, for example in indexing. Indeed, it has been shown that as the dimensionality of the data increases, there is always a point where privacy cannot be protected by any means other than completely suppressing all the data [4] (or generalizing them to a single point).
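As a rough illustration (one possible volume-based measure; the chapter does not prescribe a specific metric), the following sketch computes the volume of the bounding box of an equivalence class, normalized by the volume of the full domain, so that 0 means no loss and 1 means total loss.

```python
def information_loss(points, domain_ranges):
    """Normalized volume of the axis-aligned bounding box of a class."""
    loss = 1.0
    for dim, (lo, hi) in enumerate(domain_ranges):
        values = [p[dim] for p in points]
        loss *= (max(values) - min(values)) / (hi - lo)
    return loss

domain = [(0, 100), (0, 100)]             # two quasi-identifier attributes
tight  = [(20, 30), (22, 31), (25, 33)]   # well-clustered class: low loss
spread = [(20, 30), (80, 95), (50, 10)]   # scattered class: high loss
print(information_loss(tight, domain))    # 0.0015
print(information_loss(spread, domain))   # 0.51

# Note: with d dimensions, a per-dimension relative spread of s gives a
# normalized volume of s**d, which is why loss escalates as d grows.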

impossible in the general case, in many real-world settings it becomes feasible, because the data in these applications are sparse, i.e., they have nontrivial values only in a small subset of their dimensions. Examples of such data are numerous: market basket data are sparse multidimensional binary vectors, with each different product being a distinct dimension whose domain has two values, 1 (bought) and 0 (not bought). The same holds for medical histories: each different disease is a distinct dimension with a two-value domain indicating whether the disease appears in the medical history or not. In general, any type of data where an arbitrary number of different entries may refer to the same person must be treated as a single multidimensional entry, since the combination of these individual entries can be used to identify the associated person.
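A sketch of the sparse, set-based representation described above (the item and disease names are hypothetical): each record stores only the dimensions with value 1, and all entries of the same person are merged into one multidimensional record, since their combination is what identifies the person.

```python
# Each record is the set of dimensions with value 1; all other dimensions
# are implicitly 0, so a million-product catalogue costs nothing for items
# a customer never bought.
basket  = {"milk", "bread", "wine"}   # market-basket record
history = {"flu", "bronchitis"}       # medical-history record

def merge_entries(entries):
    """All entries of the same person form one multidimensional record:
    individually harmless entries can be identifying in combination."""
    record = set()
    for e in entries:
        record |= e
    return record

# Three separate supermarket visits by the same customer.
visits = [{"milk"}, {"bread", "wine"}, {"diapers"}]
print(merge_entries(visits))  # e.g. {'milk', 'bread', 'wine', 'diapers'} (set order varies)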

This chapter provides a survey of the anonymization techniques that appear in the research literature for sparse multidimensional data. In Section 2.2 we provide a detailed overview of some generic anonymization techniques that do not specialize in a single type of data processing or in data from a specific application area. We focus mostly on three methods: multirelational k-anonymity, an adaptation of l-diversity to the sparse multidimensional space, and k^m-anonymity. In Section 2.3 we examine the privacy problems and the anonymization solutions for web log data. Finally, in Section 2.4 we conclude the chapter.