Chapter 2
Feature Selection for Clustering: A Review
By Salem Alelyani, Jiliang Tang, Huan Liu
Pages 32

The growth of high-throughput technologies has led to exponential growth in harvested data with respect to both dimensionality and sample size. As a consequence, storing and processing these data has become more challenging. Figure (2.1) shows the trend of this growth for the UCI Machine Learning Repository. This growth has made manual processing of these datasets impractical. Therefore, data mining and machine learning tools were proposed to automate pattern recognition and the knowledge discovery process. However, applying data mining techniques to raw data is mostly useless due to the high level of noise associated with the collected samples. Usually, data noise is due either to imperfections in the technologies that collected the data or to the nature of the data source itself. For instance, in the medical imaging domain, any deficiency in the imaging device will be reflected as noise in the resulting dataset. This kind of noise is caused by the device itself. On the other hand, text datasets crawled from the Internet are noisy by nature because they are usually informally written and suffer from grammatical mistakes, misspellings, and improper punctuation. Undoubtedly, extracting useful knowledge from such huge and noisy datasets is a painful task.