Data Mining | 6 | Understanding Complex Datasets

ABSTRACT

When data was primarily generated using pen and paper, there was never very much of it. The contents of the United States Library of Congress, which represent a large fraction of formal text written by humans, have been estimated at 20TB, that is, about 20 thousand billion characters. Large web search engines, at present, index about 20 billion pages, whose average size can be conservatively estimated at 10,000 characters, giving a total size of 200TB, a factor of 10 larger than the Library of Congress. Data collected about the interactions of people, such as transaction data, and, even more so, data collected about the interactions of computers, such as message logs, can be larger still. Finally, some organizations specialize in gathering data, for example NASA and the CIA, and these collect data at rates of about 1TB per day. Computers make it easy to collect certain kinds of data, for example transactions or satellite images, and to generate and save other kinds of data, for example driving directions. The costs of storage are so low that it is often easier to store ‘everything’ in case it is needed than to do the work of deciding what could be deleted. The economics of personal computers, storage, and the Internet makes pack rats of us all.