ABSTRACT

The CPU time required by clustering algorithms is unacceptable when large data sets are involved. This is due largely to the time required to compute nearest neighbours. This chapter discusses some methods to find nearest neighbours efficiently and then to implement some of these methods to clustering large data sets. It investigates the efficiency of variant methods when applied to clustering large two-dimensional data sets in (x, y) space. The methods are first tested on different data sets to find nearest neighbours. Then, they are used within the k-means clustering algorithm. The chapter also discusses three widely used methods: k-d trees, quadtrees and the cell method. These methods are essentially aiming to reduce the time spent on the search for nearest neighbours. The chapter presents a modification of the cell method. This modification leads to further savings in CPU time when applied to clustering problems.