ABSTRACT

Words are the building blocks of natural language texts. Because a significant proportion of a text’s words are morphologically complex, it makes sense for text-oriented applications to register a word’s structure. This chapter is about the techniques and mechanisms for performing text analysis at the level of the word, that is, lexical analysis. A word can be thought of in two ways: either as a string in running text, for example, the verb delivers, or as a more abstract object that is the cover term for a set of strings. So the verb DELIVER names the set {delivers, deliver, delivering, delivered}. A basic task of lexical analysis is to relate morphological variants to their lemma, which is stored in a lemma dictionary together with its invariant semantic and syntactic information. Lemmatization is used in different ways depending on the task of the natural language processing (NLP) system. In machine translation (MT), the lexical semantics of word strings can be accessed via the lemma dictionary. In transfer models, it can be used as part of the source language linguistic analysis to yield the morphosyntactic representation of strings that can occupy certain positions in the syntactic trees produced by syntactic analysis. This requires that lemmas be furnished not only with semantic but also with morphosyntactic information. So delivers is referenced by the item DELIVER + {3rd, Sg, Present}. In what follows we will see how the mapping between deliver and DELIVER, and between the substring s and {3rd, Sg, Present}, can be elegantly handled using finite state transducers (FSTs).
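As a rough illustration of the mapping just described, the following Python sketch builds a tiny deterministic transducer for the single stem deliver and runs surface strings through it. It is only a minimal sketch under stated assumptions, not the chapter's own formalism: the function names build_transducer and transduce, the arc representation, and the feature labels Base, Progressive, and Past are hypothetical choices made for this example; only DELIVER + {3rd, Sg, Present} comes from the text.

```python
def build_transducer(stem, lemma, suffixes):
    """Build a deterministic transducer for one stem and its suffixes."""
    arcs = {}    # (state, input char) -> (output string, next state)
    finals = {}  # accepting state -> feature string emitted at end of input
    state = 0
    for i, ch in enumerate(stem):
        # Emit the whole lemma on the first arc; later stem arcs emit nothing.
        arcs[(state, ch)] = (lemma if i == 0 else "", state + 1)
        state += 1
    stem_end, fresh = state, state + 1
    for suffix, features in suffixes.items():
        cur = stem_end
        for ch in suffix:
            # Add a new suffix arc only if this path does not exist yet.
            if (cur, ch) not in arcs:
                arcs[(cur, ch)] = ("", fresh)
                fresh += 1
            cur = arcs[(cur, ch)][1]
        finals[cur] = features
    return arcs, finals


def transduce(word, arcs, finals):
    """Map a surface string to 'LEMMA + {features}', or None if rejected."""
    state, output = 0, ""
    for ch in word:
        if (state, ch) not in arcs:
            return None
        out, state = arcs[(state, ch)]
        output += out
    if state not in finals:
        return None
    return output + " + " + finals[state]


# Feature labels other than {3rd, Sg, Present} are assumptions for this sketch.
SUFFIXES = {
    "": "{Base}",
    "s": "{3rd, Sg, Present}",
    "ing": "{Progressive}",
    "ed": "{Past}",
}
ARCS, FINALS = build_transducer("deliver", "DELIVER", SUFFIXES)

for form in ["delivers", "deliver", "delivering", "delivered", "deliverx"]:
    print(form, "->", transduce(form, ARCS, FINALS))
```

The stem arcs consume deliver and emit the lemma DELIVER, while the suffix arcs consume s, ing, or ed and the accepting state contributes the feature set. A string that leaves the network or stops in a non-accepting state is rejected. A realistic system would compile many stems and spelling rules into one network rather than build one machine per lemma.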