Pre-Processing of Dogri Text Corpus

doi:10.1201/9781003052098-24

ABSTRACT

Pre-processing is the most important task for any Natural Language Processing (NLP) application. It can include a variety of steps such as tokenization, stemming, stop-word removal, lemmatization, and POS tagging. Dogri is one of the under-resourced languages included in the Eighth Schedule of the Indian Constitution. Creating a Dogri language corpus is itself a challenging task because digitized resources are not available. In this paper, the methodology used for Dogri corpus creation is presented, and the pre-processing tasks of tokenization and stop-word removal are taken up. The Dogri corpus consists of a collection of articles from a Dogri newspaper. Around 276 articles are considered for inclusion in the corpus. As Dogri is written in the Devanagari script, the danda (‘।’) is used as the delimiter for tokenization. An algorithm based on word frequency, combined with a named-entity list, is proposed for stop-word list generation. This approach yields a list of 155 stop-words that can be used for stop-word removal. The effect of stop-word removal on document size is also presented.
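The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the chapter's algorithm, whose details are not given here. It assumes that the tokenization delimiter is the Devanagari danda (U+0964), that words within a sentence are separated by whitespace, and that stop-word candidates are simply the most frequent tokens that do not appear in the named-entity list. The function names, the `top_k` cut-off, and the toy input are all hypothetical.

```python
from collections import Counter

DANDA = "\u0964"  # Devanagari danda; assumed to be the sentence delimiter


def split_sentences(text):
    """Split text into sentences on the danda, dropping empty pieces."""
    return [s.strip() for s in text.split(DANDA) if s.strip()]


def tokenize(text):
    """Split each sentence into whitespace-separated tokens.

    Real text would also need punctuation handling, which is omitted here.
    """
    tokens = []
    for sentence in split_sentences(text):
        tokens.extend(sentence.split())
    return tokens


def build_stop_words(documents, named_entities, top_k):
    """Return the top_k most frequent tokens that are not named entities.

    This frequency ranking is an assumed stand-in for the proposed algorithm.
    """
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    candidates = [w for w, _ in counts.most_common() if w not in named_entities]
    return candidates[:top_k]


def remove_stop_words(text, stop_words):
    """Tokenize text and drop every token found in the stop-word list."""
    stop = set(stop_words)
    return [t for t in tokenize(text) if t not in stop]


if __name__ == "__main__":
    # Toy placeholder tokens, not real Dogri words.
    docs = ["ka ka ram ram ram\u0964 ka te", "ka te go\u0964"]
    stop_words = build_stop_words(docs, named_entities={"ram"}, top_k=2)
    before = tokenize(docs[0])
    after = remove_stop_words(docs[0], stop_words)
    print(stop_words, len(before), len(after))
```

Comparing `len(before)` with `len(after)` for each document gives a simple measure of how stop-word removal affects document size. The named-entity filter keeps frequent proper nouns, which are common in newspaper text, out of the stop-word list.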