Institute for Security, Technology, and Society, Dartmouth College, Hanover, New Hampshire

Michael E. Locasto

University of Calgary, Calgary, Canada

David Kotz

Institute for Security Technology Studies, Dartmouth College, Hanover, New Hampshire

ABSTRACT

The sharing of network trace data provides important benefits to both network researchers and administrators. Sharing traces helps scientists and network engineers compare and reproduce results and the behavior of network tools. The practice of sharing such information, however, faces a number of obstacles. Network traces contain significant amounts of sensitive information about the network structure and its users. Thus, researchers wishing to share traces must “sanitize” them to protect this information. We distinguish the terms “anonymization” and “sanitization”: “anonymization” attempts to protect the privacy of network users, whereas “sanitization” attempts to protect both the privacy of network users and the secrecy of operational network information. By contrast, freely sharing full-capture (unsanitized) traces happens rarely and usually requires either close, pre-established personal relationships between researchers or extensive legal agreements (as in the PREDICT repository [51]). Furthermore, most real-world traces contain a large volume of information with features along many different dimensions, making the problem of identifying and masking sensitive data non-trivial. It remains difficult to precisely specify a policy regarding the type and structure of information that should be sanitized, let alone to provide a reliable method that ensures the conclusive suppression of such information in the shared trace. Two main categories of concerns thus arise: (1) legal and ethical obstacles to capturing, for research purposes, information derived from human interaction, and (2) operational difficulties arising from a lack of effective tools and techniques for suppressing sensitive information. In this chapter, we survey a selection of both seminal and recent papers to summarize the reasons for these concerns, identify the work that has been done to help address or overcome them, and frame what we have come to view as the next major problem in this space: the invention of metrics describing the quality of a particular sanitization or anonymization technique on a given dataset. We find that network researchers face a dilemma: although they can hypothesize about the properties of network data and prototype sanitization tools, they find it difficult to obtain the real network traces needed to verify that those hypotheses are correct or that those tools operate with any utility on real-world networks. Fortunately, network research is far from stagnant, because researchers have put significant effort into obtaining access (or have found creative ways to gain access) to large, meaningful traffic traces from real production networks.