Lexicography and corpus linguistics

doi:10.4324/9781315104942-9

ABSTRACT

This chapter surveys the past and present of corpus linguistics. The issues under consideration are representativeness, re-usability, corpus annotation and access to data. Corpus encoding covers the ways in which linguistic annotations can be assigned to tokens. Some annotation tools write XML annotations into the corpus file itself; other, more recent ones generate standoff XML markup by first assigning byte ranges of the given document to token IDs. The first instruments to make corpus data available for lexicography were the so-called "Key Word In Context" (KWIC) tools, which show non-annotated text with the search word in a central position. The Open Corpus Workbench (OCWB) not only offers tools to encode corpora, but also includes the Corpus Query Processor (CQP). Corpus linguistics has come a long way since its first appearance, and lexicographers using corpora have often been the ones who pushed the matter forward.