ABSTRACT

High-dimensional data are common in many practical machine learning and data mining problems and pose a challenge for both classification and clustering tasks. For example, document classification and clustering often involve tens of thousands of input features under a bag-of-words representation, where each unique word is one feature dimension. In market basket analysis, the input dimensionality equals the number of distinct products seen in transactions, which can likewise be huge. Although some algorithms can handle high-dimensional data directly (e.g., support vector machines and naïve Bayes models), it is still good practice to reduce the number of input features. There are several good reasons for this practice: a) many features may be irrelevant to, or uninformative about, the target of the classification/clustering task; b) reduced dimensionality widens the choice of applicable classification/clustering algorithms; and c) lower dimensionality reduces the computational cost of training and inference.