ABSTRACT

The third area of speech technology applications, covered in Part IV of this book, is speech synthesis. Strictly speaking, speech synthesis refers to the automatic generation of a speech signal, also called waveform synthesis, typically starting from a phonetic transcription with its associated prosody and often drawing on previously analyzed digital speech data. Such waveform synthesis is usually the final module of a larger text-to-speech (TTS) system, which starts from textual input (such as a word sequence). A TTS system first performs linguistic and text processing, including text normalization. It then uses "letter-to-sound" conversion to generate the phonetic transcription, which is fed into the speech synthesis module. A separate prosody or intonation generation module provides an additional input to the speech (waveform) synthesis module. Together, the text processing, letter-to-sound conversion, and prosody generation modules act as a front-end processor to the waveform synthesizer. As in the two application areas covered previously, speech recognition and speech enhancement, both dynamics and optimization play important roles in speech synthesis. This chapter emphasizes these roles while covering basic techniques for waveform synthesis, as well as the text processing and intonation generation that form integral parts of a full TTS system.