A Graph-Based Text Classification Model for Web Text Documents

doi:10.1201/9780429277573-5

ABSTRACT

Internet technology has brought a substantial change to our day-to-day life. Nowadays, we need digital means to automatically manage, filter, store and retrieve information from text documents. Automated text document analysis and mining are becoming an essential part of computer applications, and various classification and clustering approaches are therefore required to carry out these tasks. Classification relies on training datasets, which are used to train a model that assigns text documents to their respective categories or domains. Text analysis has thus become one of the major aspects of text data mining. This chapter presents a graph-based text classification approach for analysing web text documents obtained from various online sources. Sets of text documents are represented as sets of graphs, to which a weighted graph algorithm is applied to extract subgraphs; these subgraphs are then used to build a feature vector for each text document being classified. A weighted graph algorithm is chosen because it extracts the most relevant subgraphs, which in turn increases classification and computational efficiency. In contrast to the traditional vector space model and bag-of-words approaches, the graph-based model performs more efficiently because it takes the structural information of the text documents into account. Graph-based models are becoming an alternative way of representing text because they can encapsulate important facts about the text, such as the ordering of terms, the co-occurrence of terms and the relationships among different terms. The proposed model is validated and evaluated by applying several popular classification algorithms to the obtained datasets. The performance of the system is reported in terms of precision, recall and F₁ measure. Compared with the methodologies adopted by other researchers, the proposed approach is found to be accurate and more efficient for classifying text documents of multiple categories.
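
The abstract does not specify the weighted graph algorithm itself, so the following is only a minimal sketch of the general idea, under stated assumptions: terms become nodes, a directed edge links each term to the terms that follow it within a small window, and the edge weight counts how often that ordered pair occurs. The function names (`build_graph`, `prune`, `feature_vector`), the `window` and `min_weight` parameters, and the simple threshold used to keep "relevant" edges are all hypothetical choices made for illustration, not the chapter's method. Sentence boundaries are ignored for brevity.

```python
import re
from collections import Counter


def build_graph(text, window=1):
    """Build a weighted, directed term graph for one document.

    The graph is returned as a Counter mapping (source, target) edges to
    weights. An edge links a term to each of the next `window` terms, so
    the weight records how often that ordered pair occurs (an assumption
    made for this sketch).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    edges = Counter()
    for i, source in enumerate(tokens):
        for target in tokens[i + 1 : i + 1 + window]:
            edges[(source, target)] += 1
    return edges


def prune(edges, min_weight=2):
    """Keep only edges whose weight reaches `min_weight`.

    This threshold is a stand-in for subgraph extraction; the chapter's
    actual selection criterion is not given in the abstract.
    """
    return {edge: w for edge, w in edges.items() if w >= min_weight}


def feature_vector(edges, edge_vocabulary):
    """Map a document graph onto a fixed list of edges (0 if absent)."""
    return [edges.get(edge, 0) for edge in edge_vocabulary]


doc = "Graph models represent text. Graph models capture order."
graph = build_graph(doc)
relevant = prune(graph)
vector = feature_vector(graph, [("graph", "models"), ("models", "capture")])
```

Unlike a bag-of-words count, the edge `("graph", "models")` keeps the order and adjacency of the two terms, which is the kind of structural information the abstract refers to. The resulting vectors could then be passed to any standard classifier.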