ABSTRACT

Any text-based system requires some representation of documents, and the appropriate representation depends on the kind of task to be performed. Content-based text processing systems can be broadly divided into classification systems and understanding systems. Text classification systems have been the primary focus of information retrieval (IR) researchers. These systems include text retrieval systems, which retrieve texts in response to a user query, as well as text categorization systems, which assign texts to one or more of a fixed set of categories. Text understanding systems go beyond classification to transform text in some way, such as producing summaries, answering questions, or extracting data.