A novel technique for script identification in trilingual optical character recognition

doi:10.1201/9781351124140-144

Chapter

ABSTRACT

In a multilingual environment, script identification makes searching, editing, and storing documents easier. It also helps in selecting a script-specific Optical Character Recognition (OCR) engine for multilingual documents. India is a multilingual country, so a single document may contain more than one script. In Kerala, a state in India, documents may contain text in three languages: Malayalam, the official language of the state; Hindi, the national language; and English, the global language. To process such multiscript documents, the script of each text line must be identified before the line is fed to the corresponding OCR. This paper presents a novel and efficient technique for script identification in trilingual documents containing English, Hindi, and Malayalam. Features for classification are extracted from the horizontal projection of text-line images. Training and testing are done on our own data set, developed from documents containing these three languages.
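To make the idea of a horizontal projection concrete, the following is a minimal sketch, assuming a binarized text-line image stored as a NumPy array in which nonzero pixels are ink. The function names and the three summary features are hypothetical illustrations and are not the features used in the chapter, which the abstract does not specify.

```python
import numpy as np


def horizontal_projection(binary_line):
    """Count foreground (ink) pixels in each row of a binary text-line image."""
    img = np.asarray(binary_line)
    return (img > 0).sum(axis=1)


def profile_features(binary_line):
    """Illustrative summary features of the projection profile (assumed, not from the chapter)."""
    profile = horizontal_projection(binary_line).astype(float)
    total = profile.sum()
    if total == 0:
        # Blank line: no ink, so every feature is defined as zero.
        return {"peak_row_ratio": 0.0, "peak_share": 0.0, "upper_half_share": 0.0}
    height = len(profile)
    peak_row = int(profile.argmax())
    return {
        # Relative vertical position of the densest row (0 = top, 1 = bottom).
        "peak_row_ratio": peak_row / (height - 1) if height > 1 else 0.0,
        # Fraction of all ink that lies in the densest row.
        "peak_share": profile[peak_row] / total,
        # Fraction of all ink that lies in the upper half of the line.
        "upper_half_share": profile[: height // 2].sum() / total,
    }
```

One reason such a profile can separate scripts is that Devanagari text, used for Hindi, typically has a continuous headline along the top of each word, which tends to produce a pronounced peak near the top of the profile. Whether the chapter exploits this particular property is not stated in the abstract.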