Etiquetadores morfosintácticos para corpus en español (Morphosyntactic taggers for Spanish corpora)

doi:10.4324/9780429329296-31

ABSTRACT

In languages with rich inflectional morphology, such as Spanish, automatic processing of the information encoded inside words is essential. The first part of the chapter addresses the characteristics and problems of morphological taggers, whose function is to detect the PoS, lemma and grammatical features of each word. The formal variation of words is responsible for the computational complexity of this task. Any tagger has to recognize and generate only grammatical forms. Two serious problems are "unknown" words (usually neologisms) and multiwords (lexical units formed by more than one word).

Listing every wordform together with its linguistic information produces many homographs: the same wordform can have several morphosyntactic analyses. Any tagger therefore needs a disambiguation component that resolves homography by selecting the appropriate analysis according to the syntactic context in which the word appears. All current taggers use a statistical model for disambiguation, based on corpora hand-annotated by linguists. The second part of the chapter deals with the different options that a corpus user faces. The performance of automatic taggers depends basically on the tagset (the descriptive tags provided by the program) and on the semantic domain or language register for which the disambiguation model is trained. The choice of the most appropriate tagger for a given corpus therefore depends on these two factors. The chapter also analyzes the best-known Spanish taggers and their use within corpus tools.
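
As an illustration of the homography problem described above, the following minimal Python sketch lists several analyses for a wordform and chooses among them with tag-bigram counts taken from a tiny annotated sample. Everything in it is an assumption made for illustration: the lexicon entries, the three-sentence sample, the tag names and the function names `train` and `disambiguate` are hypothetical, and the greedy left-to-right choice is a simplification that does not reproduce the model of any tagger discussed in the chapter.

```python
from collections import Counter

# Hypothetical full-form lexicon: wordform -> list of (lemma, PoS) analyses.
# "vino" and "tarde" are homographs with two analyses each.
LEXICON = {
    "el": [("el", "DET")],
    "la": [("el", "DET")],
    "ella": [("ella", "PRON")],
    "vino": [("vino", "NOUN"), ("venir", "VERB")],
    "tarde": [("tarde", "ADV"), ("tarde", "NOUN")],
}

# Toy stand-in for a hand-annotated corpus: sentences of (wordform, PoS).
ANNOTATED = [
    [("el", "DET"), ("vino", "NOUN")],
    [("ella", "PRON"), ("vino", "VERB"), ("tarde", "ADV")],
    [("la", "DET"), ("tarde", "NOUN")],
]


def train(corpus):
    """Count how often each tag follows the previous tag ('<s>' = sentence start)."""
    counts = Counter()
    for sentence in corpus:
        prev = "<s>"
        for _, pos in sentence:
            counts[(prev, pos)] += 1
            prev = pos
    return counts


def disambiguate(tokens, lexicon, counts):
    """Greedily pick, for each token, the analysis whose PoS most often
    followed the previously chosen PoS in the annotated sample."""
    result = []
    prev = "<s>"
    for token in tokens:
        # Words missing from the lexicon get a placeholder analysis.
        analyses = lexicon.get(token.lower(), [(token, "UNK")])
        # On ties, max() keeps the first analysis listed in the lexicon.
        lemma, pos = max(analyses, key=lambda a: counts[(prev, a[1])])
        result.append((token, lemma, pos))
        prev = pos
    return result


COUNTS = train(ANNOTATED)
print(disambiguate(["el", "vino"], LEXICON, COUNTS))
print(disambiguate(["ella", "vino", "tarde"], LEXICON, COUNTS))
```

In this sketch, "vino" is analyzed as a noun (lemma "vino") after a determiner and as a verb (lemma "venir") after a pronoun, which mirrors the dependence on syntactic context described above. A word absent from the lexicon simply receives the placeholder tag `UNK`, a crude stand-in for the unknown-word problem.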