ABSTRACT

This chapter discusses the clustering process and the various types of clustering techniques, such as partitional, hierarchical, and fuzzy clustering algorithms. It provides an overview of the MapReduce programming model, the Hadoop architecture, and the RHadoop platform. The chapter also discusses the metrics used to evaluate the performance of the serial and parallel K-means++ algorithms on the RHadoop platform using datasets of different sizes. The main goal is to model the serial and parallel K-means++ algorithms as MapReduce tasks on the RHadoop platform. The chapter presents a comparative review of serial and parallel K-means++ under the MapReduce paradigm using different datasets. It shows how initial centroid selection strategies help to improve the accuracy and performance of the conventional K-means algorithm. The chapter also presents the use of principal component analysis to map a dataset to lower dimensions so that its data points can be visualized in scatter plots.