ABSTRACT

In recent years, research in foreign language acquisition and in Information and Communication Technologies (ICT) has turned towards the compilation of learner corpora, both because empirical data are needed on learners’ proficiency in the foreign language and because such databases can be incorporated into the development of computer-assisted language learning (CALL) applications. Where the acquisition of a foreign language sound system and its phonological rules is concerned, learners’ utterances should be represented by means of a symbolic system that shows how they actually pronounce words. In written corpora, a common distinction is drawn between the source data, i.e., the text, and the other pieces of information included to describe that text, i.e., the annotations. In spoken corpora, by contrast, the linguistic content of the data is not directly accessible; a representation of the speech in symbolic form, viz. a transcription, is therefore needed to process and analyze the data. Hence, transcriptions can be considered linguistic annotations that spoken corpora require in order to represent speech in an abstract way, but they must not be confused with the original speech data. Sometimes the transcription is treated as the speech itself, and this misconception can lead to overgeneralizations about the data (Cucchiarini, 1993; Gibbon, Moore, & Winski, 1998).