Métodos de clasificación automática de textos para el español (Automatic text classification methods for Spanish)

doi:10.4324/9780429329296-37

ABSTRACT

Text classification, or categorization, is one of the fundamental problems that gave rise to text linguistics and corpus linguistics; it has also long been an objective of linguistics itself. At the same time, the intense and rapid development of digital technologies, social networks, and other social media has produced the phenomenon of big data: enormous volumes of data have accumulated in digital format. Most of these data consist of natural-language texts whose manual classification is no longer feasible, which has created a need for robust automatic classification methods. We therefore present the topic of text classification within the framework of corpus linguistics. In general, text classification is defined as assigning a text to one or more predefined classes. Commonly, classes are topics, for example, the headings of newspaper sections: world, politics, business, science, health, sport, arts, and travel, among others. However, any other concept of interest can serve as a class: genres, emotions, time of writing and events, or the author’s name, age, gender, or language, to mention a few. To automate the classification process, a wide range of computational techniques has been developed. In this chapter, we first offer a panoramic view of these techniques and present the latest advances in the classification of texts in Spanish corpora, and then discuss some classification methods in more detail. In describing the performance of each classifier, we analyze its strengths and limitations with reference to recent state-of-the-art experimental results.