ABSTRACT

In everyday life and work, including in software development, we often need to determine whether an element is in a set. For example, in word processing softwares, we need to check whether an English word is spelled correctly (whether it is in a known dictionary); the FBI needs to verify whether a suspect's name is on the suspects' list; a web crawler needs to confirm whether a site has already been visited, etc. The most direct solution is to save all of the elements in the set on the computer, and compare each new element against all existing elements in the set. Generally speaking, a set is stored in a hash table on a computer. The main advantages of a hash table are its speed and accuracy, but these come at the expense of storage space. When the set is relatively small, this problem is not apparent. However, when the set is extremely large, the low space efficiency problem of hash tables becomes obvious. For example, public email providers Yahoo, Gmail, and Hotmail have to set up a system to filter out junk emails from spammers. One solution is to record the sender addresses of all spam emails. Because spammers will keep making new email accounts, there are at least billions of spam addresses all around the world that require many servers to store. Using hash tables, we need 1.6 GB of storage for every hundred million email addresses. (The specific implementation is to store 8-bit information fingerprints that correspond to email addresses). Because the storage efficiency of hash tables is only 50%, each email address requires 16 bits of storage. A hundred million email addresses would take up 1.6 GB of storage space. Hence, storing billions of addresses could require hundreds of gigabytes of storage. Other than supercomputers, average servers cannot store such a large number of addresses.