ABSTRACT

This chapter takes readers on a short safari into the jungle of clustering in static big data (also known as massive data, huge data, etc.). Unlike real jungles on Planet Earth, the big data jungle is growing like a teenager juiced with steroids. But concerns about how to process big data sets are hardly new: Hill provided a lucid discussion of scaling up computational methods more than 30 years ago. Scalability of clustering algorithms applied to big data is often confused with the acceleration of clustering algorithms that are (or were) developed for small data. Sampling is a natural way to attack the problem of finding clusters in big data. There are many ways to get samples and many ways to use them. Three main issues arise when constructing samples: how to get the samples, how many samples to collect, and how to evaluate their quality for a specified purpose.