Information Retrieval Methods for Big Data Analytics on Text

doi:10.1201/9780429446177-4

ABSTRACT

This chapter focuses on different approaches to retrieving text documents in ranked order from a vast pool of unstructured textual data. It addresses some of the concerns related to capturing the intent of the user and developing a context-based information retrieval system. The chapter introduces basic concepts related to the vector space model and how words are represented for deriving word embeddings. It discusses the term-frequency approach to creating a vector for each document. The chapter also discusses why the vector space model is needed for mining textual data to extract meaningful insights for information retrieval, classification, or sentiment analysis. A document-term matrix is a numerical representation, in vector format, of the terms present in a corpus. The goal of text categorization is to classify the topic or theme of a document. One important application is classifying news into sections such as sports, politics, financial news, and country-specific news.
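
As an illustration of the term-frequency approach and the document-term matrix described above, the following is a minimal sketch in Python. It is not taken from the chapter. The function names `tokenize` and `document_term_matrix`, the whitespace tokenizer, and the two sample sentences are all assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    # Assumption: a deliberately naive tokenizer (lowercase, split on whitespace).
    return text.lower().split()


def document_term_matrix(docs):
    """Build a raw term-frequency document-term matrix.

    Returns the sorted vocabulary and one row of counts per document,
    with columns in vocabulary order.
    """
    tokenized = [tokenize(doc) for doc in docs]
    # The vocabulary is every distinct term in the corpus, sorted for a stable column order.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Hypothetical two-document corpus, used only for illustration.
docs = [
    "the team won the match",
    "the market fell",
]
vocab, dtm = document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)
```

Each row is the term-frequency vector of one document, so two documents can be compared by comparing their rows. A practical system would typically add steps this sketch omits, such as stop-word removal or term weighting.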