ABSTRACT

This chapter examines the usual criteria for sound corpus building (size, representativeness, authenticity and balance) in order to show that the MetCLIL corpus rests on solid corpus linguistics (CL) principles. I then describe the key variables that determined which events were selected for inclusion in MetCLIL, paying particular attention to the definition of the academic seminar as a genre and to the institutions, countries and participants represented in the corpus. Next, the data collection process is explained in detail to illustrate some of the decisions researchers have to make regarding the recruitment of participants and the recording of speech events. The next two sections are devoted to the processes that immediately follow the recording phase of any spoken corpus, namely transcription and tokenisation. The decision to adopt the VOICE guidelines on how recordings should be rendered is explained first, followed by a full account of how these guidelines were adapted to MetCLIL, given that the two corpora differ markedly in size and objectives. The VOICE transcription criteria adopted are classified according to the type of data that the research team incorporated into the transcription: non-verbal and verbal data, together with contextual and structural mark-up. The chapter then turns to the important process of removing any details that could identify the participants. Finally, the often-forgotten issue of what the corpus counts as a token unit is addressed, given its implications for important metaphor analyses such as those measuring metaphor density (see, e.g., Nacey, 2013).