A survey on dimension reduction techniques in text classification

Zhi Juan Wang, Ruo Song Zhou

doi:10.1201/b18828-141

ABSTRACT

Text classification is an important part of text mining and plays a significant role in improving both the speed and the accuracy of information retrieval. A text is usually represented by its tokens (the words appearing in it), which define the text's feature space d = (T1, T2, …, Tn). As a result, the dimension of the feature space ranges from thousands to tens of thousands, which is too high for traditional classification methods to process. Under these circumstances, dimension reduction is both the primary task and the key problem to be solved in machine-learning-based text classification. Dimension reduction comprises two families of methods: 1) feature selection, including Document Frequency (DF), Mutual Information (MI), Correlation Coefficient (CC), and the χ² statistic (CHI) [1-4]; and 2) feature extraction, including Random Projection (RP), Latent Semantic Analysis (LSA), and Concept Indexing (CI) [5-7]. Feature selection picks a subset of meaningful features from the original features to form a new low-dimensional space, without changing the properties of the original feature space. Feature extraction, by contrast, projects the original feature space onto a new feature space by constructing an evaluation function over the features. The procedure of dimension reduction is shown in Figure 1.
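The contrast between the two families can be illustrated with a minimal sketch. The toy corpus, the helper names (build_vocab, to_vector, select_by_df, random_projection), and the choice of k = 3 are all assumptions made for illustration and do not come from the surveyed methods' original papers. DF is shown here as "keep the k tokens with the highest document frequency", and RP as multiplication by a random Gaussian matrix; both are simplified variants.

```python
import random
from collections import Counter

def build_vocab(docs):
    """Sorted list of distinct tokens across all documents."""
    return sorted({tok for doc in docs for tok in doc.split()})

def to_vector(doc, vocab):
    """Term-count vector d = (T1, ..., Tn) for one document."""
    counts = Counter(doc.split())
    return [counts[t] for t in vocab]

def select_by_df(docs, vocab, k):
    """Feature selection: keep the k tokens with the highest document frequency."""
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    # Rank by descending DF; break ties alphabetically so the result is deterministic.
    ranked = sorted(vocab, key=lambda t: (-df[t], t))
    return sorted(ranked[:k])

def random_projection(vectors, k, seed=0):
    """Feature extraction: map n-dim vectors to k dims with a random Gaussian matrix."""
    rng = random.Random(seed)
    n = len(vectors[0])
    matrix = [[rng.gauss(0.0, 1.0) / k ** 0.5 for _ in range(n)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, vec)) for row in matrix] for vec in vectors]

# Hypothetical toy corpus (illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
vocab = build_vocab(docs)
full = [to_vector(d, vocab) for d in docs]        # original high-dimensional space
kept = select_by_df(docs, vocab, k=3)             # subset of the original tokens
selected = [to_vector(d, kept) for d in docs]     # selection: axes are still tokens
projected = random_projection(full, k=3)          # extraction: axes are new combinations
```

In this sketch the selected space keeps three of the original token axes unchanged, whereas each axis of the projected space is a weighted mix of all original tokens, which is the distinction the abstract draws between selection and extraction.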