ABSTRACT

Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, and Suvrit Sra

It is long-standing folklore in the information retrieval community that vector space representations of text data have directional properties, i.e., the direction of a vector is much more important than its magnitude. This belief has led to practices such as using the cosine of the angle between two vectors to measure the similarity of the corresponding text documents, and scaling vectors to unit L2 norm (41; 40; 20).
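
As a small illustration of why these two practices go together, the sketch below computes the cosine between two toy term-count vectors and checks that it does not change when one vector is rescaled, and that it equals the plain dot product once both vectors are scaled to unit L2 norm. The vectors and the helper names `l2_norm`, `l2_normalize`, and `cosine_similarity` are hypothetical and chosen only for this example; they are not taken from the chapter.

```python
import math


def l2_norm(v):
    """Euclidean (L2) length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def l2_normalize(v):
    """Scale v to unit L2 norm; the zero vector has no direction."""
    norm = l2_norm(v)
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    norm_u, norm_v = l2_norm(u), l2_norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for the zero vector")
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm_u * norm_v)


# Hypothetical term-count vectors for two documents (illustrative data only).
doc_a = [3.0, 0.0, 1.0, 2.0]
doc_b = [1.0, 1.0, 0.0, 4.0]

# The same document "repeated ten times": same direction, larger magnitude.
doc_a_long = [10.0 * x for x in doc_a]

sim = cosine_similarity(doc_a, doc_b)
sim_scaled = cosine_similarity(doc_a_long, doc_b)

# After unit L2 scaling, the cosine reduces to a plain dot product.
unit_a, unit_b = l2_normalize(doc_a), l2_normalize(doc_b)
unit_dot = sum(a * b for a, b in zip(unit_a, unit_b))

print(sim, sim_scaled, unit_dot)
```

All three printed values agree up to floating-point rounding, which is the sense in which only the direction of a document vector matters under this similarity measure.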