ABSTRACT

This chapter argues that spoken language translation quality can be improved by enriching textual training data with comparable corpora. It examines how to build and explore comparable corpora in search of parallel data. The chapter suggests that training parameters must be adapted to each language and text domain independently, and it shows that improvements in corpus quality are essential to spoken language statistical machine translation (SMT). Methods for exploring comparable corpora were developed and tested. The Yalign tool was adapted and greatly improved, and almost 500,000 bi-sentences were obtained from Wikipedia. The obtained data proved to be of acceptable quality and improved the quality of the SMT systems. A successful trial was made to create an evaluation metric better suited to morphologically rich languages. Many pre-processing tools were also implemented and used. Supplying SMT systems with neural-based language models has already proven to be a quality-improving approach.