Noise Removal as Pre Processing Task and its Implementation for Gujarati Named Entity Recognition

doi:10.1201/9781003052098-29

ABSTRACT

In recent years there has been a considerable amount of work on improving Information Retrieval (IR) systems for Indian languages other than English. One of the most commonly used resources in Information Retrieval is the stopword list. To date, no standard stopword list has been compiled for the Gujarati language, and the lists that do exist are incomplete and imperfect. Digitized texts were gathered from numerous authentic sources and used to support execution of the algorithm. In this paper, we propose a dictionary-based method for building a standard stopword list for Gujarati. We have compiled a large stopword list for Gujarati containing more than 800 stopwords. The precision of an algorithm that eliminates the most common stopwords from a document depends entirely on the stopword list it uses. By implementing the algorithm, we achieved a 10% to 25% reduction in file size, which in turn will improve system accuracy and overall performance.
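The dictionary-based removal step and the file size measurement described above could look roughly like the following minimal Python sketch. It is not the chapter's implementation: the seven-word sample list, the function names, whitespace tokenization, and the use of UTF-8 byte counts as a stand-in for file size are all assumptions made for illustration.

```python
# Illustrative sample only (assumption); the chapter's own list
# contains more than 800 stopwords and is not reproduced here.
SAMPLE_STOPWORDS = frozenset({"અને", "છે", "તે", "આ", "માટે", "પણ", "કે"})


def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Drop whitespace-delimited tokens found in the stopword dictionary.

    Assumes tokens are separated by whitespace; a token with attached
    punctuation (for example a trailing full stop) will not match.
    """
    kept = [token for token in text.split() if token not in stopwords]
    return " ".join(kept)


def size_reduction_percent(original, cleaned):
    """Return the percentage reduction in UTF-8 encoded size."""
    before = len(original.encode("utf-8"))
    after = len(cleaned.encode("utf-8"))
    if before == 0:
        return 0.0
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    sentence = "આ એક પુસ્તક છે"
    cleaned = remove_stopwords(sentence)
    print(cleaned)
    print(round(size_reduction_percent(sentence, cleaned), 1))
```

Keeping the list in a `frozenset` makes each lookup constant time on average, which matters once the dictionary grows to several hundred entries. The reduction printed for a single short sentence is not comparable to the 10% to 25% figure, which was measured on whole files.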