ABSTRACT

Speech recognition systems are expected to play important roles in an advanced multimedia society with user-friendly human-machine interfaces [1]. The field of automatic speech recognition has witnessed a number of significant advances in the past 5-10 years, spurred on by advances in signal processing, algorithms, computational architectures, and hardware. These advances include the widespread adoption of a statistical pattern recognition paradigm, a data-driven approach that makes use of a rich set of speech utterances from a large population of speakers, the use of stochastic acoustic and language modeling, and the use of dynamic programming-based search methods [2, 3, 4].
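The dynamic programming-based search mentioned above can be illustrated with the classic Viterbi algorithm, which finds the most likely state sequence through a stochastic model. The following is a minimal sketch over a toy two-state hidden Markov model; the states, probabilities, and observation labels are hypothetical and chosen only for illustration, not taken from any system described here.

```python
# Minimal sketch of dynamic-programming (Viterbi) search over a toy HMM.
# All states, probabilities, and observations are hypothetical examples.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, state sequence) of the best path for obs."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # Choose the best predecessor state for s at time t
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    # Pick the best final state and backtrack via the stored paths
    prob, best = max((V[-1][s], s) for s in states)
    return prob, path[best]

# Hypothetical acoustic model: frames are "low" or "high" energy
states = ("silence", "speech")
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {"silence": {"silence": 0.7, "speech": 0.3},
           "speech": {"silence": 0.2, "speech": 0.8}}
emit_p = {"silence": {"low": 0.9, "high": 0.1},
          "speech": {"low": 0.3, "high": 0.7}}

prob, seq = viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p)
# seq -> ["silence", "speech", "speech"]
```

Real recognizers apply this same recursion over phone- or word-level HMM networks with language-model scores folded into the transitions, but the dynamic-programming structure is the same.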

The state of the art in automatic speech recognition can be addressed in several ways. Figure 6.1 illustrates the progress of speech recognition and understanding technology according to generic application areas, ranging from isolated word or command recognition to natural conversation between human and machine. The complexity of these generic application areas is characterized along two dimensions: the size of the vocabulary and the speaking style. It should be obvious that the larger the vocabulary, the more difficult the application task. Similarly, the degree of constraint in the speaking style has a very direct influence on the complexity of the application; a free conversation full of slurring and extraneous sounds such as “uh”, “um”, and partial words is far more difficult than words spoken in a rigidly discrete manner. Thus, the difficulty of an application grows from the lower left corner to the upper right corner in the figure. The three bars in the figure demarcate the applications that can and cannot be supported by the technology for viable deployment in the corresponding time frame. It should be