ABSTRACT

Computing document similarity is fundamental to many text-analysis applications, including information retrieval, document classification, and document clustering. Choosing a good similarity measure is no less important than choosing a good document representation. The similarity between two documents is computed from their feature vectors using one of several similarity measures, e.g., the cosine measure, the Jaccard measure, or the Euclidean distance.
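
As a rough illustration of the three measures named above, the following minimal sketch computes each one on a pair of equal-length feature vectors. The function names and the toy vectors are hypothetical, and the Jaccard variant shown here is the binary form over the sets of nonzero features, which is only one of several ways the measure can be defined on vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_similarity(u, v):
    """Binary Jaccard: shared nonzero features over all nonzero features.

    Assumption: a feature counts as present when its weight is nonzero.
    """
    present_u = {i for i, a in enumerate(u) if a != 0}
    present_v = {i for i, b in enumerate(v) if b != 0}
    union = present_u | present_v
    if not union:
        return 0.0
    return len(present_u & present_v) / len(union)


def euclidean_distance(u, v):
    """Straight-line distance between u and v (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy term-presence vectors over a three-term vocabulary (illustrative only).
doc_a = [1, 0, 1]
doc_b = [1, 1, 0]

cos_ab = cosine_similarity(doc_a, doc_b)
jac_ab = jaccard_similarity(doc_a, doc_b)
euc_ab = euclidean_distance(doc_a, doc_b)
```

Note that the cosine and Jaccard measures grow with similarity, whereas the Euclidean distance shrinks, so the three values are not directly comparable without a conversion.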