Toward expressive and disfluent speech synthesis

doi:10.4324/9781003648369-13

ABSTRACT

This paper describes the use of a publicly available AI-based speech synthesizer that was newly trained on 32,535 utterances from recordings of everyday conversations by a Japanese female speaker. It focuses on two efforts to make the synthesizer more expressive in a human-like way: (1) accounting for the category of interlocutor, and (2) allowing for disfluencies. Interlocutors in the original data were categorized into four groups (family, friend, child and stranger), and this information was used in the training process. Our survey results showed that respondents perceived acoustic differences in speaking style between speech directed “inward” (i.e., to family members) and speech directed “outward” (i.e., to strangers), both in the original data and in the synthesized speech. Incorporating disfluencies frequently observed in general Japanese speech, such as fillers, phrase-final rising intonation or word-internal prolongation, makes the synthesized speech sound more natural, not only because human speech is inevitably disfluent but also because, in Japanese communication, certain disfluencies are connected to a speaker’s attitude, such as hesitation or surprise. Being able to change the speaking style depending on who one is talking to, and to generate disfluent speech, can make the synthesizer ever more expressive.