A Corpus Based Quantitative Analysis of Gurmukhi Script

doi:10.1201/9781003224068-12

ABSTRACT

The present study examines the syntactic aspects of the Gurmukhi script by applying several standard statistical measures. The analysis is performed on text written in seven distinct genres, amounting to more than 6 million words and more than 440 thousand sentences. The textual data are assessed at two syntactic levels: words and sentences. Statistical techniques are applied to parameters such as word length, character frequency, vowel usage, word frequency, word length frequency, type-token ratio (TTR), character usage per sentence, word usage per sentence, word and character usage per sentence after the removal of stop-words, and correlation. This manuscript brings out quantitative characteristics of the Gurmukhi script and lays the groundwork for future research in quantitative linguistics.
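As an illustration of two of the word-level measures named above, the following minimal Python sketch computes a type-token ratio and a word length frequency table. It is not the procedure used in this study: the function names, the whitespace tokenization, the choice to measure word length in Unicode code points, and the toy sample sentence are all assumptions made for the example.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Return the number of distinct word forms (types) divided by the
    total number of words (tokens); 0.0 for an empty input."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def word_length_frequency(tokens):
    """Count how many tokens have each length.

    Assumption: length is measured in Unicode code points, so a Gurmukhi
    consonant and its dependent vowel sign count as two characters.
    """
    return Counter(len(token) for token in tokens)


# Hypothetical toy sample, not taken from the study's corpus.
sample = "ਇਹ ਘਰ ਹੈ ਇਹ ਸਕੂਲ ਹੈ"

# Assumption: words are separated by whitespace.
tokens = sample.split()

ttr = type_token_ratio(tokens)
length_counts = word_length_frequency(tokens)

print(f"tokens={len(tokens)} types={len(set(tokens))} TTR={ttr:.3f}")
for length, count in sorted(length_counts.items()):
    print(f"length {length}: {count} word(s)")
```

In this toy sample there are six tokens and four types, so the TTR is 4/6. On a real corpus the result depends on tokenization, normalization, and how word length is defined, and TTR decreases as text size grows, so values are comparable only across samples of similar size.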