ABSTRACT

The last five to ten years have seen the release of several very large corpora of Spanish (defined here as corpora of at least 100 million words): corpora from Sketch Engine, corpora from the web, the Corpus del Español, and the corpora of the Real Academia Española. As Davies discusses (regarding English corpora), there is a fairly close relation between corpus size and the range of linguistic phenomena that can be explored. A one- or two-million-word corpus is best suited to very high-frequency phenomena, such as discourse markers, prepositions, or frequent syntactic constructions (passive, perfect, or progressive). But corpora this small are often quite inadequate for detailed investigation of lexis, morphology, collocational preferences, or medium- and low-frequency syntactic constructions (such as verbal complementation). In addition, very large corpora, if they have the right type of architecture and interface, can also provide valuable insight into variation between genres, dialects, and time periods. In this chapter, we will first outline (in Section 1) the corpora that are the focus of our discussion, describing their size, composition, time periods, dialects, and so on. Section 2 will discuss some basic issues of annotation for these large corpora, mainly part-of-speech tagging and lemmatization, as well as the accuracy of annotation in the different corpora. Section 3 will provide a more detailed examination of several aspects of corpus functionality and will discuss how the size and composition of the corpora affect each of them. These will include: (i) frequency lists, including matching strings and full wordlists; (ii) grammatical constructions (searched by part of speech and lemma); and (iii) collocates (to examine word meaning and usage).

Section 4 will consider how large corpora can be used to examine genre-based, dialectal, and historical variation. Section 5 will briefly consider large parallel corpora, which can be used to compare usage and find translations across two languages, and will then offer some concluding remarks and projections.