ABSTRACT

Data in the digital world are increasing at a tremendous pace. A report by the International Data Corporation (IDC) estimates that by 2011, the data storage requirement would grow to 2 zettabytes (2 × 10^21 bytes). By 2011, every household could have a data footprint of over 1 TB, comprising data stored on personal computers (PCs) or laptops and external storage devices, as well as digital data from smartphones or cameras. This increase is attributed to widely used applications that are becoming increasingly popular: media and Internet applications, including e-mail, search engines, social networking sites, and photo (Flickr) and video (YouTube) sharing; database applications that generate structured data; office tools for presentations, spreadsheets, and documents that create unstructured data; data-intensive applications such as animation rendering and scientific computing; and even digitized data created by personal devices such as cell phones, smartphones, and personal digital assistants (PDAs). Because storage is cheap ($0.21/GB for hard disks), most of these data, once created, are never destroyed, even after their utility ends. People continue to archive their outdated personal data on the local disk drives of their PCs or in the cloud. Organizations transfer outdated data to archival storage for future reference in case the need arises. Further, generating duplicates of existing data is fairly common. For instance, copies of data are maintained, possibly in geographically distant locations, as backups to avoid data loss due to system failures or natural disasters. Data can also be duplicated frequently by e-mail servers that handle content such as e-mail messages and attachments, of which several highly similar versions may exist.