ABSTRACT

Text-to-speech (TTS) synthesis has had a long history, one that can be traced back at least to Dudley's "Voder," developed at Bell Laboratories and demonstrated at the 1939 World's Fair [1]. Practical systems for automatically generating speech parameters from a linguistic representation (such as a phoneme string) were not available until the 1960s, and systems for converting from ordinary text into speech were first completed in the 1970s, with MITalk being the best-known such system [2]. Many projects in TTS conversion have been initiated in the intervening years, and papers on many of these systems have been published.*

It is tempting to think of the problem of converting written text into speech as "speech recognition in reverse": current speech recognition systems are generally deemed successful if they can convert speech input into the sequence of words that was uttered by the speaker, so one might imagine that a TTS synthesizer would start with the words in the text, convert each word one-by-one into speech (being careful to pronounce each word correctly), and concatenate the results together. However, when one considers what literate native speakers of a language must do when they read a text aloud, it quickly becomes clear that things are much more complicated than this simplistic view suggests. Pronouncing words correctly is only part of the problem faced by human readers: in order to sound natural and to sound as if they understand what they are reading, they must also appropriately emphasize (accent) some words and de-emphasize others; they must "chunk" the sentence into meaningful (intonational) phrases; they must pick an appropriate F0 (fundamental frequency) contour; they must control certain aspects of their voice quality; and they must pronounce a word longer in some positions in the sentence than in others, because "segmental durations" are affected by various factors, including phrasal position.
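To make these tasks concrete, the following is a minimal, hypothetical sketch of a text-analysis front end that assigns accents, phrase breaks, durations, and F0 targets to each word. All of the names, rules, and numbers in it (the FUNCTION_WORDS list, the per-letter duration heuristic, the 110 Hz baseline) are illustrative assumptions introduced here for exposition, not the design of any actual synthesizer.

# A toy sketch of the text-analysis tasks described above: accenting,
# phrasing, segmental duration, and F0 assignment. All rules and
# constants here are illustrative assumptions, not a real system.

from dataclasses import dataclass, field
from typing import List

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "and", "in", "that", "it"}

@dataclass
class Word:
    text: str
    accented: bool = False          # should the word be emphasized?
    phrase_final: bool = False      # does an intonational phrase end here?
    duration_ms: int = 0            # crude per-word segmental duration
    f0_targets: List[float] = field(default_factory=list)  # pitch targets, Hz

def analyze(text: str) -> List[Word]:
    # Tokenize, treating a comma as a cue for an intonational phrase break.
    tokens = text.replace(",", " ,").split()
    words: List[Word] = []
    for tok in tokens:
        if tok == ",":
            if words:
                words[-1].phrase_final = True
            continue
        words.append(Word(tok))
    if words:
        words[-1].phrase_final = True  # sentence end closes the final phrase

    for w in words:
        # Accent content words; de-emphasize function words.
        w.accented = w.text.lower().strip(".!?") not in FUNCTION_WORDS
        # Toy duration rule: base length per letter, lengthened phrase-finally.
        w.duration_ms = 60 * len(w.text)
        if w.phrase_final:
            w.duration_ms = int(w.duration_ms * 1.4)
        # Toy F0 contour: a rise on accented words, a fall at phrase ends.
        base = 110.0
        peak = base + (40.0 if w.accented else 5.0)
        end = base - (20.0 if w.phrase_final else 0.0)
        w.f0_targets = [base, peak, end]
    return words

if __name__ == "__main__":
    for w in analyze("The cat sat on the mat, and then it slept."):
        print(w)

A real synthesizer would replace each of these toy rules with linguistically informed models, but the division of labor, from text analysis through prosodic assignment, follows the tasks enumerated above.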