ABSTRACT

Current research in Natural Language Processing (NLP) tends to exploit corpus resources as a way of overcoming the knowledge acquisition problem. Machine Translation (MT) is no exception to this trend: many MT researchers have attempted to extract knowledge from parallel corpora (Brown et al., 1990; Dagan, Itai, and Schwall, 1991; Matsumoto, Ishimoto, and Utsuro, 1993). A parallel corpus is a pair of texts in two languages, where one is a translation of the other (Melamed, 2000). Arabic remains challenging for MT systems. It is a highly inflected language with a rich and complex morphological system, in which a single word often consists of a stem together with multiple affixes and clitics, so that one Arabic word can stand as a complete sentence. Furthermore, because Arabic is usually written without the diacritical marks that indicate short vowels, Arabic words are highly ambiguous (Maamouri, Bies, and Kulick, 2006). This chapter outlines a method for extracting translation equivalents from a parallel corpus for potential use in a broader Arabic–English MT system, and investigates the effect of two preprocessing tools, namely stemming and part-of-speech (POS) tagging, on the extraction process. Although it is applied here to Arabic and English, the method is applicable to any language pair. The method uses the statistical technique of co-occurrence frequency in the parallel corpus and focuses on open-class translation equivalents, excluding closed-class words such as prepositions, conjunctions, and particles. Extracting such translation lexicons automatically saves time and effort. A number of preprocessing steps are carried out. First, POS taggers for Arabic and English are built to tag the parallel texts. Second, stemmers for Arabic and English are developed to reduce each word to its stem, base, or root form.
The stemmers and POS taggers are then used to annotate the parallel corpus with POS tags and stem forms for both the Arabic and the English words. This yields two parallel corpora: a raw corpus with no linguistic information about stems or POS categories, and a corpus annotated with that information. The extraction method is applied to both corpora to assess how much stemming and POS tagging help in selecting the correct translation equivalent. Experimental results show that the accuracy of the extracted equivalents improves by roughly 30% when both stemming and POS tagging are used.
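
As a rough illustration of the idea summarized above, the following minimal sketch counts how often each open-class source stem co-occurs with each open-class target stem across aligned sentence pairs and proposes the most frequent partner as the translation equivalent. It is not the chapter's actual algorithm: the function name `extract_equivalents`, the tag set in `OPEN_CLASS`, the alphabetical tie-breaking, and the three-sentence toy corpus (transliterated, already stemmed and tagged) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Assumed coarse tag set for open-class words (hypothetical, for illustration).
OPEN_CLASS = {"NOUN", "VERB", "ADJ", "ADV"}


def extract_equivalents(sentence_pairs, open_class=OPEN_CLASS):
    """Propose one target stem per source stem by raw co-occurrence frequency.

    sentence_pairs: list of (source_tokens, target_tokens), where each token
    is a (stem, pos_tag) tuple. Closed-class tokens are ignored.
    """
    cooc = defaultdict(Counter)
    for src_tokens, tgt_tokens in sentence_pairs:
        # Use sets so a stem is counted at most once per sentence pair.
        src_stems = {stem for stem, pos in src_tokens if pos in open_class}
        tgt_stems = {stem for stem, pos in tgt_tokens if pos in open_class}
        for s in src_stems:
            for t in tgt_stems:
                cooc[s][t] += 1
    # Pick the most frequent partner; ties are broken alphabetically (assumption).
    return {s: max(sorted(counts), key=counts.get) for s, counts in cooc.items()}


# Invented toy corpus: transliterated Arabic stems paired with English stems.
toy_corpus = [
    ([("walad", "NOUN"), ("qaraa", "VERB"), ("kitab", "NOUN")],
     [("the", "DET"), ("boy", "NOUN"), ("read", "VERB"), ("a", "DET"), ("book", "NOUN")]),
    ([("kitab", "NOUN"), ("jadid", "ADJ")],
     [("a", "DET"), ("new", "ADJ"), ("book", "NOUN")]),
    ([("walad", "NOUN"), ("jadid", "ADJ")],
     [("a", "DET"), ("new", "ADJ"), ("boy", "NOUN")]),
]

result = extract_equivalents(toy_corpus)
print(result["kitab"], result["walad"], result["jadid"])
```

In this toy corpus the verb stem occurs only once, so all of its candidate partners are tied and the choice is arbitrary. Raw co-occurrence counts need enough data to be reliable, and stemming helps by merging inflected variants of a word into a single count.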