Automatically compiling bilingual legal glossaries based on Chinese–English parallel corpora

doi:10.4324/9781003006688-14

Chapter

ABSTRACT

Bilingual legal glossaries are indispensable to translators working on specialized and technical texts, but constructing them manually is very time-consuming. This chapter explores methods for automatically compiling bilingual legal glossaries from parallel Chinese–English legal corpora of 600,000 words by integrating state-of-the-art tools in machine translation and natural language processing. It investigates the pros and cons of linguistic and statistical approaches to bilingual terminology extraction, as well as those of neural and phrase-based statistical machine translation systems. The proposed system employs Chinese and English noun phrase recognizers, a customized phrase-based statistical machine translation system based on Moses, bilingual word and n-gram alignment tools, Google Translate, and partial matching. Our experiment shows that, for a corpus of 600,000 words, the phrase-based statistical machine translation toolkit Moses outperformed neural machine translation systems such as OpenNMT and Google Translate in deriving Chinese–English bilingual terminologies. Our study suggests that while terminologies tend to be fixed in the source language, their translations seem less rigid. In addition, OpenNMT, a neural MT toolkit, is found to be more sensitive to the size of the training corpus and the length of text alignment than a phrase-based statistical machine translation system such as Moses. Finally, error analysis of our customized MT system suggests that missing words are more frequent than redundant words.
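To make the statistical approach to bilingual terminology extraction more concrete, the following is a minimal sketch and not the chapter's actual system. It assumes that candidate terms have already been extracted for each side of every aligned sentence pair (for example, by noun phrase recognizers) and pairs them with a simple Dice co-occurrence score. The function names, the threshold, and the toy corpus are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def dice_scores(pairs):
    """Score every (source term, target term) pair by the Dice coefficient.

    `pairs` is a list of (src_terms, tgt_terms) tuples, one per aligned
    sentence pair. This input format is an assumption of this sketch.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in pairs:
        s, t = set(src_terms), set(tgt_terms)
        src_count.update(s)            # sentences containing each source term
        tgt_count.update(t)            # sentences containing each target term
        joint.update(product(s, t))    # sentences containing both terms
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint.items()
    }


def best_translations(pairs, threshold=0.5):
    """Keep, for each source term, its highest-scoring target term."""
    best = {}
    for (s, t), score in dice_scores(pairs).items():
        if score >= threshold and score > best.get(s, ("", 0.0))[1]:
            best[s] = (t, score)
    return best


# Toy, hand-made data (not drawn from the chapter's corpus).
toy_corpus = [
    (["合同", "当事人"], ["contract", "party"]),
    (["合同", "法院"], ["contract", "court"]),
    (["法院", "当事人"], ["court", "party"]),
]
glossary = best_translations(toy_corpus)
for src, (tgt, score) in sorted(glossary.items()):
    print(src, tgt, round(score, 2))
```

A purely co-occurrence-based score like this is only reliable when terms recur often enough across the corpus. That is one reason a real pipeline would combine it with word alignment and machine translation output, as the system described above does.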