ABSTRACT

In this chapter, I analyze a linguistic tradition that has sought to document its theories and to contextualize the linguistic facts in its proposals. The interest in this tradition is motivated by the fact that its theories and methodologies helped establish the foundations of current corpus linguistics. I also analyze two recent developments that have contributed to the growth of corpus linguistics: (1) the explicit intention of cognitive linguistics to study language on the basis of discourse produced in real communicative contexts, and (2) the possibility, currently offered by computational linguistics, of semantically and syntactically processing texts of billions of words. Second, I present a Spanish POS tagger and lemmatizer that uses an electronic dictionary of Spanish with 634,000 word forms, automatically generated by expanding a dictionary of 114,000 lemmas. The electronic dictionary of word forms includes both single words and multiword idioms. Chunking is carried out by intersecting the output of the tagger and lemmatizer with finite-state transducers, which specify the structure of noun phrases, prepositional phrases, etc. The chunking process is also used to create subcorpora of sentences in which predicates appear in specific constructions. Applying frame semantics makes it possible to create specific subcorpora according to the way semantic roles are mapped onto sentence constituents. Finally, I show the link between this semantico-syntactic approach and the lexicographic tradition started by Cuervo, who defines the meaning of lexical entries on the basis of the semantic contribution they make to the constructions in which they appear.