3.1 Introduction

An important problem in mining data sets that are large in both dimension and size is that of selecting a subset of the original features [66]. Preprocessing the data to obtain a smaller set of representative features that retains the salient characteristics of the data not only decreases the processing time but also leads to more compact learned models and better generalization. Dimensionality reduction can be done in two ways, namely, feature selection and feature extraction. As mentioned in Section 1.2.2, feature selection refers to reducing the dimensionality of the measurement space by discarding redundant or least-information-carrying features. One uses supervised feature selection when class labels of the data are available; otherwise, unsupervised feature selection is appropriate. In many data mining applications class labels are unknown, which makes unsupervised feature selection particularly significant there. Feature extraction methods, on the other hand, utilize all the information contained in the measurement space to obtain a new transformed space, thereby mapping a higher dimensional pattern to a lower dimensional one.
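The distinction between the two routes can be illustrated with a minimal sketch. It is not a method from this chapter: the variance ranking used for selection and the single principal direction used for extraction are stand-in criteria chosen only for brevity, and the names select_features, extract_feature, and the toy data are hypothetical. Neither function uses class labels, so both are unsupervised in the sense described above.

```python
from statistics import pvariance


def select_features(data, k):
    """Feature selection: keep k of the ORIGINAL columns.

    Assumption: variance is used as a stand-in measure of information
    content; the k highest-variance columns are retained unchanged.
    """
    n_features = len(data[0])
    variances = [pvariance([row[j] for row in data]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in data]


def extract_feature(data, iterations=100):
    """Feature extraction: build ONE new feature from all original columns.

    Assumption: the new feature is the projection onto the first principal
    direction, approximated by power iteration on the covariance matrix.
    The all-ones start vector must not be orthogonal to that direction.
    """
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    w = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    projected = [sum(r[j] * w[j] for j in range(d)) for r in centered]
    return w, projected


# Toy data: four patterns, three features; the third feature is nearly constant.
data = [
    [1.0, 2.0, 5.0],
    [2.0, 4.1, 5.0],
    [3.0, 5.9, 5.1],
    [4.0, 8.0, 5.0],
]

keep, reduced = select_features(data, k=2)   # subset of the original columns
w, projected = extract_feature(data)         # one transformed feature
```

In this sketch, selection returns a subset of the measured columns, so each retained feature keeps its original meaning, whereas extraction returns a weighted combination of all columns, so the reduced representation lives in a new transformed space.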