ABSTRACT

Spectral and time-frequency representations, together with production and synthesis models, are presented for speech coding, speech synthesis, speech recognition, and speech enhancement, giving an understanding of the basic pillars of current speech technology and providing hints towards future improvements. Classical quasi-stationary representations such as the short-time Fourier transform (STFT) and the associated speech spectrogram are presented, and linear prediction (LP) or autoregressive (AR) models are introduced. In the development of LP analysis, some important parameters arise that provide links with speech production models and with source-filter models. Next, some successful classifiers for speech recognition and speaker identification are presented, such as hidden Markov models (HMMs) and Gaussian mixture models (GMMs). The important issues of model duration and adaptation are also discussed in connection with speech recognition. Even though the speech signal is intrinsically time-varying, quasi-time-invariant models have proven useful when combined with a segmentation of the signal suited to the application at hand. However, more stringent application requirements demand a time-varying treatment of the models, which depends on their adaptation. This perspective is introduced by means of time-varying amplitude modulation-frequency modulation (AM-FM) models.