Issues with Privacy Preservation in Query Log Mining

doi:10.1201/b10373-26

ABSTRACT

The Web has profoundly changed the way we live and work in a very short period of time. One of the main drivers of that change is trust. Trust encompasses trusting information found on the Web, trusting e-commerce applications, trusting that personal data will not be shared, and so on. The main form of trust discussed in this chapter is trusting that search engines keep the queries issued by a person private. In each of the cases mentioned above, guarding and enhancing that trust should be one of the main goals of Web service providers. In the case of search engine query logs, this is crucial for their business success. In particular, we show that even a sequence of queries can disclose private information, and that it is important to be aware of what type and how much information we are potentially exposing when using a search engine.

Query log mining is of great interest to industrial and academic researchers because of its tremendous capability to provide useful information. Applications of query log mining range from improving the performance, quality and functionality of search engines [7, 35, 30] to extracting knowledge from the wisdom of the crowds of users searching the Web. The world knowledge that can be derived from query logs encompasses knowledge about language, including semantic relationships [17, 8] and spelling [16], and even sociology [31].

In August 2006, the online services provider America Online (AOL) publicly released 6 months of anonymized Web search logs [4]. The intention was to aid the academic research community, but the outcome was a public relations nightmare. One of the people whose queries were in the log was quickly identified by journalists from The New York Times [11]. This led to a great increase in public awareness of the potential privacy risks involved in Web search.

This confidentiality issue is not only important for people, but can also become a problem for businesses. For example, the queries issued from a company, taken together, may not disclose private information about any individual, yet may disclose information that the company considers confidential. As we will discuss later, this type of business privacy breach may not even be as explicit as it is in this example.

An important technique used for the analysis of privacy preservation is k-anonymity [32, 34, 33, 36], which we describe in Section 14.3. In this work, we walk through the possible applications of k-anonymity to independent query log issues, and discuss the problems that still stand in the way of this technique's success on real query logs.

This chapter is organized as follows. In Section 14.2 we define the notion of privacy in our context and characterize query logs and the risks behind sharing them. In Section 14.3 we discuss the problem of query log privacy preservation from a k-anonymity perspective. Section 14.4 gives an overview of the main and most recent privacy-enhancing techniques for query log privacy preservation. An excellent survey of query log privacy techniques from a policy perspective is provided by Cooper [15]. We complement that survey by analyzing the state of the art within the k-anonymity framework, and update it by covering the most recent work in the area. We close the chapter by discussing the main issues that remain unsolved in privacy preservation for query logs.