ABSTRACT

An account is presented of the processes whereby the acoustic waveform of speech is converted by the human listener to a representation that is isomorphic with a sequence of allophones. This account is based on the author's auditory-perceptual theory of phonetic recognition (Miller, 1984a, b, c). Three stages of processing are identified. In Stage I, the acoustic waveform is converted into sensory variables that represent the short-term spectral patterns associated with the waveform, as well as their loudnesses and goodnesses. A key notion is that these patterns are represented as points in a phonetically relevant auditory-perceptual space, which is usually conceived as having three dimensions. In Stage II, these variables are integrated by a sensory-perceptual transformation into a single, unitary response. This perceptual response (the perceptual pointer) can also be represented at any moment as a point in the auditory-perceptual space, and over time it generates a sequence of points, or perceptual path. Stage III is the perceptual-linguistic transformation: the dynamics of the perceptual pointer in relation to perceptual target zones within the auditory-perceptual space cause those target zones to issue category codes, or neural symbols, that are isomorphic with the allophones of the language. In a fourth stage, not dealt with here, the sequence of category codes so generated is converted into units isomorphic with the language's lexicon. This paper describes the auditory-perceptual space, some characteristics of the preliminary estimates of the target zones, the hypothesized sensory-perceptual transformation, and a possible segmentation maneuver.
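The following is a minimal computational sketch of the three stages, offered only to make the architecture concrete; the dimension values, zone centres and radii, smoothing constant, and allophone labels are illustrative assumptions and not part of the theory.

```python
import numpy as np

# Hypothetical target zones: each allophone category is modelled here as a
# sphere in the three-dimensional auditory-perceptual space.  Centres and
# radii are invented for illustration only.
TARGET_ZONES = {
    "i": (np.array([0.2, 1.5, 0.8]), 0.3),
    "a": (np.array([0.9, 0.6, 0.4]), 0.3),
    "u": (np.array([0.3, 0.5, 1.2]), 0.3),
}

def sensory_stage(frames):
    """Stage I (caricature): map each short-term spectral frame to a point in
    the auditory-perceptual space.  The mapping is assumed already done, so
    each frame is simply a 3-vector."""
    return [np.asarray(f, dtype=float) for f in frames]

def perceptual_stage(points, smoothing=0.5):
    """Stage II (caricature): integrate the sensory points into a single
    perceptual pointer whose successive positions form a perceptual path.
    A first-order low-pass filter stands in for the sensory-perceptual
    transformation."""
    path, pointer = [], points[0]
    for p in points:
        pointer = (1 - smoothing) * pointer + smoothing * p
        path.append(pointer)
    return path

def linguistic_stage(path):
    """Stage III (caricature): whenever the perceptual pointer enters a
    target zone, that zone issues its category code (allophone label)."""
    codes, last = [], None
    for pointer in path:
        for code, (centre, radius) in TARGET_ZONES.items():
            if np.linalg.norm(pointer - centre) < radius:
                if code != last:       # do not re-emit the same code
                    codes.append(code)
                    last = code
                break
        else:
            last = None                # pointer is between zones
    return codes

# Example: a path drifting from near /i/ toward /a/ yields ["i", "a"]
frames = [[0.2, 1.5, 0.8], [0.3, 1.3, 0.7], [0.6, 0.9, 0.5],
          [0.9, 0.6, 0.4], [0.9, 0.6, 0.4], [0.9, 0.6, 0.4]]
print(linguistic_stage(perceptual_stage(sensory_stage(frames))))
```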