ABSTRACT

India is an epicentre of linguistic diversity, with over 456 spoken languages and 22 constitutional languages spanning four language families. Although a growing population of users consumes content in Indian languages, the NLP ecosystem for these languages remains poorly developed. This is due to the morphological complexity of the languages, the lack of documented standards and consolidated efforts, the unavailability of large-scale datasets and NLP models, and infrastructural limitations. This chapter surveys the existing NLP packages for Indian languages, covering their architecture, workflow, usage, limitations, and adaptability, as well as cross-lingual information retrieval, standardization, and evaluation procedures. It also investigates how far cross-lingual features can be shared between related Indic languages and proposes a workbench for a multilingual model that can understand and transfer knowledge across multiple Indian languages.