ABSTRACT

Feature selection is an integral component of knowledge discovery and machine learning. It helps build robust and cost-effective learning models for the extraction of interesting hidden patterns by selecting a subset of relevant features. Feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is a multistep process particularly useful in analyzing real-life high-dimensional data such as network intrusion data, biological data and criminal investigation data. In supervised classification, a feature selection method may improve the performance of the learning model in several ways. It helps alleviate the effect of the high dimensionality problem. It also enhances generalization capability and speeds up learning. Finally, feature selection helps in acquiring a better understanding of the data by discovering important features and how they are related to one another. Feature selection has been a focus of interest for quite some time and substantial work is available. With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems have arisen and novel approaches for feature selection are in demand. This chapter is a comprehensive review of many existing approaches, methods and tools from the 1970s to the present. It identifies four steps in a typical feature selection method, categorizes existing methods in terms of generation procedures and evaluation functions and also discusses combinations of generation procedures and evaluation functions. Representative methods are chosen from each category for detailed explanation and discussion via example. Benchmark datasets with different characteristics are used for comparative study. The strengths and weaknesses of the methods are explained. Guidelines for applying feature selection methods are given based on data types and domain characteristics.
This chapter identifies future research areas in feature selection, introduces newcomers to this field and paves the way for practitioners who need suitable methods for solving domain-specific real-world applications.