ABSTRACT

The goal of text-to-speech (TTS) synthesis is to convert arbitrary input text to intelligible and

natural-sounding speech so as to transmit information from a machine to a person. Therefore, TTS goes beyond

simple “cut-and-paste” systems used, for example, in some telecom applications to read back a phone number.

Such systems string together words spoken in isolation, and the artifacts of such a scheme are often

perceptible. The methodology used in TTS is to exploit acoustic representations of speech for synthesis,

together with linguistic analyses of text to extract correct pronunciations (“content”: what is being said) and

prosody in context (the “melody” of a sentence: how it is being said). Synthesis systems are commonly evaluated

in terms of three characteristics: accuracy of rendering the input text (does the TTS system pronounce, e.g.,

acronyms, names, URLs, email addresses as a knowledgeable human would?), intelligibility of the resulting

voice message (measured as a percentage of a test set that is understood), and perceived naturalness of the

resulting speech (does the TTS sound like a recording of a live human?). Today, applications of TTS are in

automated telecom services (e.g., name and address rendering), as a part of a network voice server for e-mail

(e-mail by phone), in directory assistance, as an aid in providing up-to-the-minute information to a telephone

user (e.g., business locator services, banking services, helplines), in computer games, and last but not least, in

aids to the handicapped (e.g., cosmologist Stephen Hawking). For a much more detailed overview of TTS and

its applications, see References 1 and 2.