Big data and word frequency: Measuring the consistency of Russian corpora

doi:10.4324/9781315105048-2

ABSTRACT

Quantitative methods in linguistics have a long history. Corpus linguistics, the branch of linguistics concerned with building corpora and investigating their data, has marked fifty-five years since the appearance of the Brown corpus. Large corpora exceeding 100 million tokens, compiled automatically or semi-automatically, appeared in the early 2000s. The idea of creating such large text collections is closely tied to the available technical resources and to the gradually changing paradigm of corpus linguistics as it moves away from a "manual" approach toward more automated methods. Analysis of the corpus data leads to the general conclusion that the two corpora are largely similar in the syntactic relations they feature. The majority of Russian texts in web corpora come from news websites, blogs, commercial websites, social media groups, and the like. Overall, the data suggest that the texts selected for large corpora reflect the language of the web.