ABSTRACT

Many existing databases are unlabeled because the sheer volume of data makes it difficult for humans to manually label the category of each instance. Moreover, human labeling is expensive and subjective. Hence, unsupervised learning is needed. Besides being unlabeled, the data in several applications are high-dimensional (e.g., text, images, gene data). However, not all of the features that domain experts use to represent these data are important for the learning task. We have seen the need for feature selection in the supervised learning case; the same need arises in the unsupervised case. Unsupervised means that there is no teacher in the form of class labels. One type of unsupervised learning problem is clustering, whose goal is to group “similar” objects together. “Similarity” is typically defined in terms of a metric or a probability density model, both of which depend on the features representing the data.