ABSTRACT

Language corpora are often annotated with linguistic information such as parse trees or part-of-speech (POS) tags. Annotating 'messy' spontaneous spoken language, such as the CallHome Japanese (CHJ) data, is particularly challenging, but it has nevertheless begun to attract interest in the language research community. An important resource for our linguistic annotation work was Goi-Taikei (GT), a 400,000-word electronic semantic dictionary of Japanese developed by NTT for machine translation applications. The POS column of the lexicon specifies one or more part-of-speech categories for each lexical entry. These categories are taken from the inventory of 60 POS tags used by the Linguistic Data Consortium in their original Japanese lexicon. GT's semantic ontology was designed to enumerate and classify the concepts necessary for expressing relationships between words. In the GT dictionary, as in the CHJ lexicon, homonyms and homophones are listed as separate lexical entries.