ABSTRACT

This chapter provides an overview of general-purpose document clustering and focuses on advances in the next frontier of document clustering: long and short documents. The proliferation of documents, both on the Web and in private systems, makes knowledge discovery in document collections arduous. Clustering has long been recognized as a useful tool for the task. While most document clustering research to date has focused on moderate-length, single-topic documents, real-life collections are often made up of very short or very long documents. The choice of clustering algorithm and of the measure used to compute similarity between documents depends heavily on the chosen document model. Several document models have been proposed to overcome the limitations of the vector space model. Some build corpus representations that allow semantic similarity between documents to be computed. The Generalized Vector Space Model addresses the vector space model's assumption that term vectors are pairwise orthogonal.
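
To make the orthogonality point concrete, the following is a minimal sketch and not the chapter's own formulation. It assumes one common simplification of the Generalized Vector Space Model in which the inner product between documents is weighted by a term-term co-occurrence matrix estimated from a corpus. The toy vocabulary, the corpus, and the helper names `cosine` and `cooccurrence_gram` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v, gram=None):
    """Cosine similarity between term-weight vectors u and v.

    With gram=None the terms are treated as mutually orthogonal (plain
    vector space model). Otherwise gram[i][j] supplies the assumed inner
    product between term i and term j.
    """
    def inner(a, b):
        if gram is None:
            return sum(x * y for x, y in zip(a, b))
        n = len(a)
        return sum(a[i] * gram[i][j] * b[j] for i in range(n) for j in range(n))

    denom = math.sqrt(inner(u, u) * inner(v, v))
    return inner(u, v) / denom if denom else 0.0


def cooccurrence_gram(docs):
    """Term-term matrix G[i][j] = sum over corpus documents of d[i] * d[j]."""
    n = len(docs[0])
    return [[sum(d[i] * d[j] for d in docs) for j in range(n)] for i in range(n)]


# Hypothetical vocabulary: ["cluster", "grouping", "banana"]
corpus = [
    [1, 1, 0],  # "cluster" and "grouping" co-occur
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
G = cooccurrence_gram(corpus)

d_a = [1, 0, 0]  # document containing only "cluster"
d_b = [0, 1, 0]  # document containing only "grouping"
d_c = [0, 0, 1]  # document containing only "banana"

vsm_sim = cosine(d_a, d_b)       # no shared terms, so similarity is zero
gvsm_sim = cosine(d_a, d_b, G)   # co-occurring terms give a nonzero similarity
```

Under these assumptions, two documents that share no terms are unrelated in the plain vector space model but receive a positive similarity once term co-occurrence is taken into account, while a document about an unrelated term stays at zero.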