ABSTRACT

This chapter explores the challenges faced by multilingual digital humanities (DH) researchers when using natural language processing (NLP) techniques in their work. While recent developments in neural language models have improved the usefulness of NLP tools for DH experts working with complex datasets, these models are biased towards languages of the Global North, particularly English. This bias arises from data imbalance, unequal script and vocabulary coverage, and differences in linguistic typology, leading to poor performance on most of the world’s languages. To address these issues, the chapter proposes closer collaboration between NLP and DH research efforts, centred on:

Data-efficient fine-tuning and open data creation, for instance following the FAIR principles.

Custom model vocabularies, an area where DH datasets and expertise may prove especially useful.

Careful consideration of the semantic and morphosyntactic properties of the languages at hand.

The chapter argues that by integrating NLP techniques and DH expertise more closely, linguistic diversity can be better addressed and fairness promoted in the application of NLP and multilingual DH across linguistic communities.