A Study of Distance Metrics in Document Classification

doi:10.1201/9780429277573-6

ABSTRACT

Automated text classification has become a necessity because of the huge number of text documents encountered in day-to-day life, and an increasingly large volume of extensive documents must be managed and processed digitally. An automated system is appealing because it eliminates manual processing of text documents, which is expensive, tedious, and time-consuming, and is not feasible given the time available and the number of documents involved. Automatic text classification, or text categorization, is the task of automatically assigning each document in a set to its respective category or domain from a predefined set of domains. It has major applications such as spam email filtering, opinion classification for movie or product reviews, article indexing, sentiment analysis of texts, hierarchical classification of web data, and many more. For classification purposes, the relationship between two text documents is often measured using a distance metric. Several classification techniques have already been proposed and implemented together with measures such as cosine similarity and Euclidean distance. In this chapter, various distance measures are analyzed, namely squared Euclidean distance, Manhattan distance, Mahalanobis distance, Minkowski distance, Chebyshev distance, and Canberra distance, to evaluate their effectiveness in classifying 9000 Bangla text documents acquired from various web sources. The outcomes are presented in terms of precision, recall, and F₁ measure.
The obtained results are compared with those of other commonly used distance measures, and Mahalanobis distance and Minkowski distance are observed to perform better than all the other distance metrics adopted for this experiment.
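
For readers unfamiliar with the measures named above, the following is a minimal Python sketch of their standard textbook definitions, applied to two small feature vectors. It is an illustration only and not the implementation used in this chapter: the function names, the toy vectors doc_a and doc_b, the identity matrix passed as the inverse covariance, and the convention of skipping Canberra terms whose denominator is zero are all assumptions made for the example.

```python
import math

def minkowski(x, y, p):
    # Minkowski distance of order p; p=1 gives Manhattan, p=2 gives Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def squared_euclidean(x, y):
    # Euclidean distance without the final square root.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def chebyshev(x, y):
    # Largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    # Sum of |a - b| / (|a| + |b|); terms with a zero denominator are
    # skipped (an assumed convention for features absent in both vectors).
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

def mahalanobis(x, y, inv_cov):
    # sqrt(d^T * inv_cov * d), where inv_cov is a precomputed inverse
    # covariance matrix given as a list of rows.
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return math.sqrt(sum(d[i] * inv_cov[i][j] * d[j]
                         for i in range(n) for j in range(n)))

# Hypothetical term-weight vectors for two documents.
doc_a = [1.0, 0.0, 2.0]
doc_b = [0.0, 0.0, 4.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(manhattan(doc_a, doc_b))
print(squared_euclidean(doc_a, doc_b))
print(chebyshev(doc_a, doc_b))
print(canberra(doc_a, doc_b))
print(minkowski(doc_a, doc_b, 3))
print(mahalanobis(doc_a, doc_b, identity))
```

With an identity inverse covariance, the Mahalanobis distance reduces to the ordinary Euclidean distance; in practice the matrix would be estimated from the training documents.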