Principal Component Analysis for Information Retrieval

doi:10.1201/9781420028683-20

ABSTRACT

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399 13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399 13.2 Related Eigensystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402 13.3 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404 13.4 Lanczos Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405 13.5 Parallelization of the Lanczos Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406 13.6 Benchmark Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407 13.7 Testing Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407 13.8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408 13.9 Summary and Future Directions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413

Advances in computational statistics and high-performance computing have significantly reduced the computational burden of converting raw text to encoded vectors for the purpose of information retrieval (IR) modeling and testing. Using principal component analysis (or PCA) to encode large text collections into low-dimensional vector spaces within a few minutes is an important advancement in the study and development of IR models such as latent semantic indexing (LSI). The development and tuning of parallel Lanczos algorithms for computing the singular value decomposition (SVD) of sparse term-by-document matrices has greatly facilitated the use of LSI-based models for large-scale text mining and indexing applications. However, future software development (in languages such as C++) is still needed for efficient up-and down-dating of low-rank singular subspaces. Time-sensitive or dynamic text collections require regular updates and simply computing the principal components of an updated (even larger) term-by-document matrix may be unwieldy with limited RAM. Exploiting distributed storage (Internet-based) may well be the next advancement for vector space IR models. Storing (and replicating) the principal components or singular subspaces themselves on distributed (remote) servers could significantly increase query response time. Through parallel processing, the fusion of separate information sources into one global index can be accomplished.