ABSTRACT

Machine learning and deep learning models for text are executed by computers, but they are designed and created by people using language generated by people. As natural language processing practitioners, we bring our assumptions about what language is and how language works into the task of creating modeling features from natural language and using those features as inputs to statistical models. We can improve our machine learning models for text by deepening our knowledge of how language works. Linguistics is exactly that: the study of how language works. NLP practitioners don't need to be experts in linguistics, but learning from such domain experts can improve both the accuracy of our models and our understanding of why they do (or don't!) perform well. Predictive models for text reflect the characteristics of their training data, so differences in language over time, between dialects, and across cultural contexts can prevent a model trained on one data set from being appropriate for another. Much of the text modeling literature focuses on English, but English is not the first language of most of the world's population.
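The point about practitioners' assumptions entering at the feature-creation step can be made concrete with a minimal sketch (function names and the whitespace-plus-punctuation tokenization rule are illustrative assumptions, not from the source): even the simplest bag-of-words pipeline encodes a theory of how language works.

```python
from collections import Counter

PUNCT = ".,!?;:"

def tokenize(text):
    # Assumption baked into the features: words are separated by
    # whitespace and surrounding punctuation can be stripped. This
    # already encodes a view of how language works, and it fails for
    # languages that do not delimit words with spaces.
    tokens = (tok.strip(PUNCT).lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def bag_of_words(text):
    # Turn a document into count features suitable as inputs
    # to a statistical model.
    return Counter(tokenize(text))

features = bag_of_words(
    "Language is generated by people; models are run by computers."
)
```

Every choice here, from lowercasing to the punctuation list, is a linguistic assumption that shapes what the downstream model can learn.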