ABSTRACT

Modern automatic speech recognition (ASR) technology [9, 10, 56, 105, 84, 57, 20, 125, 46] is based on a communication-theoretic view of the generation, acquisition and transmission, and perception of speech [6]. Figure 3.1 (adapted from Juang’s keynote speech at NNSP’96 [68]) shows a conceptual model of a noisy channel for speech generation and signal capturing. The goal of speech recognition is then defined as recovering the word sequence, W, from the acoustic signal, X. This can also be viewed as a decision problem: based on the information in X and the other relevant aspects of the problem, we attempt to make the best decision (in some sense) about the W that has been embedded in X. For simplicity of discussion, we can view each possible word sequence W as a class. Let us assume there are a total of M unique classes. Speech recognition then consists of finding optimal (in some sense) decision rules for classifying the observation X into one of M fixed classes. Many decision rules exist, depending on the criterion adopted, and not all of them are of equal value in practice. Because of the different sources of variability shown in Figure 3.1, the speech signal X is usually characterized by uncertainty, variability, lack of determinism, and stochasticity. This makes the statistical pattern recognition approach [100, 44, 18, 71, 55, 19] a natural choice for formulating and solving the ASR problem, as described briefly in the following.

0-8493-1232-9/03/$0.00+$1.50 © 2003 by CRC Press LLC

First, the statistical models for the channels in Figure 3.1 are simplified as follows:

• A word sequence W and the associated acoustic observation X are viewed as a jointly distributed random pair (W, X). For notational simplicity, we will use the same symbol to denote both the random variable and the value it may assume.
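To make the idea of a jointly distributed pair (W, X) and a decision rule concrete, the following is a minimal Python sketch under stated assumptions. The three word sequences, the two discrete observations, and all probabilities are invented for illustration. The rule shown picks the class with the largest joint probability, which is equivalent to the maximum a posteriori choice because P(X) does not depend on W. It is only one possible criterion, chosen here as an assumption, since the text so far speaks of optimality "in some sense". The names `joint`, `decide`, and `error_probability` are hypothetical.

```python
# Toy joint distribution P(W, X) over M = 3 hypothetical word sequences (classes)
# and 2 discrete acoustic observations. All numbers are made up and sum to 1.
joint = {
    ("yes", "x1"): 0.30, ("yes", "x2"): 0.05,
    ("no", "x1"): 0.10, ("no", "x2"): 0.25,
    ("maybe", "x1"): 0.10, ("maybe", "x2"): 0.20,
}


def decide(x, joint):
    """Return the class W with the largest joint probability P(W, x).

    Maximizing P(W, x) over W gives the same answer as maximizing P(W | x),
    because P(x) is a constant for a fixed observation x.
    """
    candidates = {w: p for (w, obs), p in joint.items() if obs == x}
    return max(candidates, key=candidates.get)


def error_probability(joint):
    """Probability that decide() picks the wrong class under the toy model."""
    observations = {obs for (_, obs) in joint}
    # For each observation, the decision is correct with probability P(W_hat, x).
    correct = sum(joint[(decide(x, joint), x)] for x in observations)
    return 1.0 - correct
```

Under this toy table the rule assigns the observation "x1" to the class "yes" and "x2" to the class "no". A real recognizer works with continuous acoustic features and a very large number of word sequences, so the table lookup here stands in for statistical models that must be estimated from data.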