ABSTRACT

This chapter provides a detailed overview of some of the issues related to the design and construction of spoken corpora. It begins with a broad survey of existing spoken corpora, with examples of what they may be used for. The chapter reviews some of the practical and technical challenges of creating spoken language corpora: sourcing and recording data (and metadata), including novel crowdsourcing approaches; transcribing, coding and marking up datasets; and representing spoken data for subsequent analysis. The chapter also discusses ethical and copyright considerations in distributing and sharing spoken corpora.