Annotating disfluencies in spontaneous Japanese: A corpus-based study

doi:10.4324/9781003648369-3

ABSTRACT

This chapter discusses how to annotate disfluent phenomena in spontaneously spoken Japanese. Descriptive studies of Japanese based on spoken corpora date back to the 1950s at the National Institute for Japanese Language and Linguistics (NINJAL). Researchers there recorded 30 hours of daily conversations and 9.5 hours of monologues, then accurately transcribed and quantitatively analyzed all utterances, including various disfluencies. This study was pioneering, even in a global context, in addressing disfluency directly. Today, several spoken Japanese corpora have been constructed and made publicly available, including the Corpus of Spontaneous Japanese (CSJ) and the Corpus of Everyday Japanese Conversation (CEJC). In these corpora, simple disfluencies such as filled pauses and word fragments are transcribed at the initial transcription stage, but higher-order disfluencies such as self-repairs, self-addressed questions, and insertions have not been annotated, partly because no established methodology exists for annotating them. This chapter illustrates how to classify and annotate the disfluencies that occur in Japanese speech corpora, and explains how self-repairs can be classified and annotated using a framework grounded in a theory of language production.