ABSTRACT

This chapter introduces the basic workflow adopted in corpus-based research and explores its components, from the relevant types of data to the kinds of theoretical sources that inform the micro-models used at different stages of processing and analysis. It discusses a standard linguistic model for parts of speech, including a historical perspective, and describes in more detail the role of modeling assumptions in computational approaches to part-of-speech tagging. The chapter outlines the implications of these perspectives on modeling in the language- and text-oriented humanities more widely. The predominant domain of corpus linguistics is language variation, with the aim of making statements about relative differences and similarities between linguistic varieties. Linguistics is concerned with modeling language from the cognitive, social, and historical perspectives. When practiced as a science, linguistics is characterized by the tension between two methodological dispositions, rationalism and empiricism. Traditionally, linguistics is concerned with classification, that is, abstracting from observations of linguistic instances to classes.