ABSTRACT

The scarcity of multilingual corpora is an empirical obstacle to progress in the field of language contact. This chapter advances a data science approach for aggregating patterns from multiple small data sources. The approach allows linguists to work with raw data that are heterogeneous both in the languages they represent and in the systems used to annotate them. We demonstrate that these small data sets can be transformed so that they can be reliably analyzed and compared in order to formulate data-driven models of language mixing.