ABSTRACT

The goal of text-to-speech (TTS) synthesis is to convert arbitrary input text to intelligible and natural-sounding speech so as to transmit information from a machine to a person. The methodology used in TTS is to exploit acoustic representations of speech for synthesis, together with linguistic analyses of text to extract correct pronunciations and prosody in context. The text analysis and normalization module in the front-end determines to a large extent the “what” and “how” of the resulting synthetic speech. Linguistic analysis in the front-end encompasses the determination of parts of speech, word sense, emphasis, appropriate speaking style, and speech acts. With the availability of good automatic speech labeling tools, concatenative speech synthesis has embraced the use of multi-hour voice databases. Eliciting the desired voice characteristics from a voice talent who is being recorded for a unit-selection synthesis voice database can be essential for customer acceptance of an automated dialog system that speaks with a TTS voice.