A Study of Proximity of Domains for Text Categorization

doi:10.1201/9780429277573-7

ABSTRACT

This chapter presents a novel approach to text representation for the automatic classification of Bangla text documents obtained from various web sources. Text classification, or text categorization (TC), is a major challenge for Indian languages, and especially for Bangla, because of its complex morphology. The considerable growth of Bangla text documents on the web has made an automatic text classification system necessary. Text classification in Bangla demands considerable analysis of document content, for which matching terms against bags of words (BoWs) alone is not sufficient. In this chapter, the unique bag of words (UBoWs) model and the proximity between the text categories, or domains, under consideration are used to increase the efficiency of the automatic text classification system. The proximity of categories is determined based on the UBoWs model. The observed outcomes show that some proximity results are more informative than others, which barely improve the classification rate. A scoring algorithm is therefore employed to assign different scores to the proximity of domains based on how frequently that closeness occurs in the corpus. Once the feature set has been developed, a multinomial naive Bayes classifier is applied. The proposed model is validated and evaluated against other widely recognized classifiers on the obtained datasets. The performance of the system is reported in terms of precision, recall and F₁ measure. Besides classifying text documents into their respective categories or domains, the model also states the proximity of one domain to another, which helps in understanding the relationship between them.
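To make the idea of unique bags of words and domain proximity concrete, the following is a minimal Python sketch on toy English tokens. It is not the chapter's algorithm: the abstract does not specify how proximity or its score is computed, so the sketch assumes that a domain's unique bag holds the terms seen in that domain only, and that the proximity of domain A to domain B can be approximated by how often B is the best-matching other domain for documents labelled A. The function names `build_ubows`, `ubow_hits` and `proximity_counts`, the whitespace tokenization, and the toy data are all hypothetical, and the classification step itself is omitted.

```python
from collections import Counter


def build_ubows(domain_docs):
    """Return, per domain, the terms that occur in that domain only (assumed UBoW definition)."""
    vocab = {
        domain: {tok for doc in docs for tok in doc.split()}
        for domain, docs in domain_docs.items()
    }
    # Number of domains in which each term appears.
    spread = Counter(tok for terms in vocab.values() for tok in terms)
    return {
        domain: {tok for tok in terms if spread[tok] == 1}
        for domain, terms in vocab.items()
    }


def ubow_hits(text, ubows):
    """Count how many tokens of a document fall into each domain's unique bag."""
    tokens = text.split()
    return {domain: sum(tok in bag for tok in tokens) for domain, bag in ubows.items()}


def proximity_counts(labelled_docs, ubows):
    """Tally, for each labelled document, the other domain with the most hits (assumed scoring)."""
    counts = Counter()
    for label, text in labelled_docs:
        hits = ubow_hits(text, ubows)
        others = {d: h for d, h in hits.items() if d != label and h > 0}
        if others:
            nearest = max(others, key=others.get)
            counts[(label, nearest)] += 1
    return counts


# Toy corpus; English tokens stand in for Bangla terms.
train = {
    "sports": ["cricket match score team", "football team goal match"],
    "business": ["market share price bank", "bank loan market trade"],
    "medical": ["doctor hospital patient health", "patient health medicine team"],
}
held_out = [
    ("sports", "cricket team score doctor health goal"),
    ("sports", "football match patient"),
    ("business", "bank loan price goal"),
]

ubows = build_ubows(train)
proximity = proximity_counts(held_out, ubows)
print(proximity)
```

In this sketch the frequency stored for each ordered pair of domains plays the role of a proximity score. The per-domain hit counts returned by `ubow_hits` could also serve as input features for a multinomial naive Bayes classifier, although that pairing is likewise an assumption rather than something stated in the abstract.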