Chapter 11
Text Classification
By Charu C. Aggarwal, ChengXiang Zhai
Pages 50

Classification has been widely studied in the database, data mining, and information retrieval communities. The problem is defined as follows. Given a set of records D = {X1, …, XN} and a set of k discrete values indexed by {1, …, k}, each representing a category, the task is to assign one category (equivalently, the corresponding index value) to each record Xi. The problem is usually solved with a supervised learning approach, in which a set of training records (i.e., records with known category labels) is used to construct a classification model that relates the features of a record to one of the class labels. For a test instance whose class is unknown, the trained model is then used to predict a class label. The problem may also be solved with unsupervised approaches that do not require labeled training data; in that case, keyword queries characterizing each class are often created manually, and bootstrapping may be used to heuristically obtain pseudo-training data. Our review focuses on supervised learning approaches.
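The supervised setting described above can be illustrated with a minimal sketch. The abstract does not prescribe a particular classifier, so a multinomial naive Bayes model with add-one smoothing is assumed here purely for illustration. The function names `train` and `predict`, the toy records, and the two categories (indexed 1 and 2) are hypothetical and do not come from the chapter.

```python
import math
from collections import Counter, defaultdict


def train(records, labels):
    """Build a model from labeled records (each record is a list of tokens).

    Assumption: multinomial naive Bayes; the source names no specific model.
    """
    class_counts = Counter(labels)        # number of training records per class
    word_counts = defaultdict(Counter)    # word frequencies within each class
    vocab = set()                         # all words seen during training
    for tokens, label in zip(records, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class index with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    n_records = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(count / n_records)           # log prior
        for token in tokens:
            if token not in vocab:
                continue                              # skip words unseen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data: category 1 = spam, category 2 = work mail.
train_records = [
    "cheap pills buy now".split(),
    "buy cheap watches now".split(),
    "meeting agenda for monday".split(),
    "project meeting notes attached".split(),
]
train_labels = [1, 1, 2, 2]

model = train(train_records, train_labels)
prediction = predict(model, "buy cheap pills".split())
```

In this sketch, `train` plays the role of constructing the classification model from records with known labels, and `predict` assigns one of the k category indices to a test instance whose class is unknown. On the toy data, the test record "buy cheap pills" is assigned to category 1.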