ABSTRACT

This chapter takes a close look at the computational methods for empirical collocation extraction that are widely used in corpus-based studies, sometimes without proof of their effectiveness. A great deal of effort has been invested in the automatic extraction of collocations from corpora. Evaluating collocation extraction, as compared with evaluating morphological taggers or syntactic parsers, is a challenging task for many reasons. First, any lexicon is much larger than a repertoire of grammatical features. Even in Russian, a morphologically rich language, there are only 156 morphosyntactic features, whereas the number of lemmas is a thousand times greater. Second, and this is one of the main difficulties in collocation extraction, collocability is realized in different ways in a language: some collocations emerge simply because they are frequently used, even though both collocates have an open distribution and combine frequently with other words. All standard methods behave similarly and produce overlapping results.