ABSTRACT

Within some limits, the various classes of speech sounds (phonemes) of any human language system, that is, the respective inventory of consonants and vowels, can be characterized in terms of fairly specific features of the acoustic signal. For example, the spectrogram of isolated vowels or stationary vocalic segments shows rather distinct steady-state maxima of spectral energy distribution (formants). Initial rapid up- or down-going shifts of these formant structures, extending across a few tens of milliseconds (formant transitions), cue the voiced stops in consonant-vowel (CV) concatenations such as /ba/ and /da/ (Liberman, 1996; Fig. 11.1). These CV units represent the most frequent syllables across the world’s languages and are mastered first during speech development. Formant transitions have been assumed to represent the acoustic correlates of articulatory lip and tongue movements, for example, the shift from bilabial contact to the subsequent configuration of the vowel /a/ during production of /ba/. Most noteworthy, the various acoustic “information-bearing elements” (Suga, 1994) of spoken utterances exhibit considerable variability: the same phoneme may be signaled by quite different acoustic patterns depending upon preceding and succeeding sounds (coarticulation effects) or even within the same linguistic context (trading relations; for an example, see Fig. 11.2, right panel). Listeners, therefore, must integrate multiple cues, within the time constraints of ongoing verbal communication, to extract speech sound categories from the acoustic signal.
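The notion of a formant transition described above can be made concrete with a small numerical sketch. The function below generates a piecewise-linear formant trajectory: a rapid shift from an onset frequency toward the vowel's steady-state value over a few tens of milliseconds, followed by a steady segment. The specific frequency values are hypothetical, chosen only to illustrate that /ba/ and /da/ can share the same vowel target while differing in the direction of the initial F2 transition (rising vs. falling); they are not measurements from the chapter.

```python
def formant_track(onset_hz, target_hz, transition_ms=40.0, steady_ms=160.0, step_ms=5.0):
    """Piecewise-linear formant trajectory sampled every step_ms:
    a linear transition from onset_hz to target_hz over transition_ms,
    then a steady state at target_hz for steady_ms."""
    track = []
    t = 0.0
    while t < transition_ms + steady_ms:
        if t < transition_ms:
            # Linear interpolation during the rapid initial transition.
            f = onset_hz + (target_hz - onset_hz) * (t / transition_ms)
        else:
            # Steady-state portion of the vowel.
            f = target_hz
        track.append(round(f, 1))
        t += step_ms

    return track

# Hypothetical F2 onsets: a low onset rising toward the vowel for /ba/,
# a higher onset falling toward the same vowel target for /da/.
ba_f2 = formant_track(onset_hz=900.0, target_hz=1200.0)
da_f2 = formant_track(onset_hz=1700.0, target_hz=1200.0)
```

The point of the sketch is that the two trajectories converge on the identical vowel steady state; only the brief initial portion carries the consonantal cue, which is why such transitions are described as information-bearing despite their short duration.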