ABSTRACT

In the currently most popular ASR framework, which is based on the fundamental equation (12.1), an inventory of elementary HMMs corresponding to basic linguistic units of speech (i.e., phonemes) is used to build a larger HMM for each word. The sequence of acoustic feature vectors extracted from the input speech waveform is treated as a realization of a concatenation of the elementary processes described by these HMMs. As discussed in Section 3.5, an HMM is a composition of two stochastic processes: a hidden Markov chain, which accounts for temporal variability, and an observable process, which accounts for spectral variability. This combination has proven powerful enough to capture the most important sources of speech variability, while remaining flexible enough to allow efficient construction of practical ASR systems.
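
The construction described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the text: the three-state left-to-right topology, the self-loop probability of 0.6, the phoneme labels "k", "ae", "t", the scalar Gaussian emissions with unit variance, and the helper names `left_to_right_hmm`, `concatenate`, and `sample`. It shows elementary HMMs being chained into a word HMM, and then the two stochastic processes at work: a hidden state path (temporal variability) and an observation drawn from each visited state (spectral variability).

```python
import random

def left_to_right_hmm(n_states=3, self_loop=0.6):
    """Transition matrix of a hypothetical left-to-right phoneme HMM.

    Row i holds P(next state | state i). The last row sums to self_loop;
    the missing mass (1 - self_loop) is the probability of leaving the unit.
    """
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        A[i][i] = self_loop
        if i + 1 < n_states:
            A[i][i + 1] = 1.0 - self_loop
    return A

def concatenate(units):
    """Chain elementary HMMs into one larger HMM (e.g., for a word).

    The exit probability of each unit's last state becomes the transition
    into the first state of the next unit.
    """
    total = sum(len(A) for A in units)
    word = [[0.0] * total for _ in range(total)]
    offset = 0
    for k, A in enumerate(units):
        n = len(A)
        for i in range(n):
            for j in range(n):
                word[offset + i][offset + j] = A[i][j]
        if k + 1 < len(units):
            word[offset + n - 1][offset + n] = 1.0 - sum(A[-1])
        offset += n
    return word

def sample(word, means, rng):
    """Draw one hidden state path and one observation sequence."""
    state, path, obs = 0, [], []
    while True:
        path.append(state)
        # Observable process: assumed scalar Gaussian with unit variance.
        obs.append(rng.gauss(means[state], 1.0))
        # Hidden Markov chain: pick the next state from the current row.
        r, cumulative, next_state = rng.random(), 0.0, None
        for j, p in enumerate(word[state]):
            cumulative += p
            if r < cumulative:
                next_state = j
                break
        if next_state is None:  # leftover mass: the model is exited
            return path, obs
        state = next_state

# Hypothetical inventory of elementary HMMs, one per phoneme label.
inventory = {ph: left_to_right_hmm() for ph in ("k", "ae", "t")}
word_hmm = concatenate([inventory[ph] for ph in ("k", "ae", "t")])
means = [float(s) for s in range(len(word_hmm))]  # assumed emission means
path, obs = sample(word_hmm, means, random.Random(0))
```

The state path can only stay in place or move forward, so the number of frames spent in each state varies from one sample to the next while the order of the units is fixed. Real systems use vector-valued observations and trained emission densities rather than the scalar Gaussians assumed here.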