ABSTRACT

When speech perception is made difficult by acoustical distortions such as noise or reverberation, the vast majority of people benefit from seeing the face of the talker, and thus the visible movement of the lips, teeth, tongue, and jaw, but they do so to markedly varying extents. The size of this benefit correlates highly with the ability to lipread without acoustical information. Although such measures can be reliable, their correlations with performance on other perceptual and cognitive tasks are inconsistent and surprisingly low. These observations mesh neatly with Alvin Liberman's suggestions that phonetic perception is subserved by a biologically (and thus perceptually and cognitively) distinct system specialized for recovering the articulatory gestures produced by talkers. A key issue in understanding how such a system might operate is to describe the process of audio-visual integration and, in particular, the representations of the auditory and visual streams of information at their conflux. This chapter will consider the merits of several such representations, including a) vectors describing the magnitudes of independent acoustical and optical parameters of the speech waveform and the visible shape of the mouth, and b) time-varying kinematic patterns that provide evidence of articulatory dynamics. The aim is not so much to list the empirical evidence for and against each representation as to clarify the essential differences between them, and to specify the contexts in which each has application and explanatory power.
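
As a purely illustrative sketch of the contrast between the two candidate representations (not drawn from the chapter itself), the example below caricatures a) a static vector of independent acoustical and optical parameter magnitudes and b) a time-varying kinematic record of a single articulatory quantity. All parameter names and values (e.g., f2_hz, lip_aperture_mm) are hypothetical placeholders, not measurements reported by the author.

    import numpy as np

    # a) A static parameter vector: magnitudes of independent acoustical and
    #    optical measurements taken at a single moment (names hypothetical).
    static_features = {
        "f2_hz": 1500.0,          # second-formant frequency of the acoustic signal
        "rms_amplitude": 0.42,    # overall acoustic energy
        "lip_aperture_mm": 8.0,   # vertical mouth opening visible to the perceiver
        "lip_spread_mm": 55.0,    # horizontal mouth width
    }
    feature_vector = np.array(list(static_features.values()))

    # b) A time-varying kinematic pattern: the same optical quantity traced
    #    over time, carrying information about articulatory dynamics.
    t = np.linspace(0.0, 0.3, 100)                                 # 300 ms of movement
    lip_aperture_trajectory = 8.0 * np.sin(np.pi * t / 0.3) ** 2   # an open-close gesture

    # Integration over a) combines co-occurring magnitudes; integration over b)
    # must operate on the shared time course of the auditory and visual streams.
    aperture_velocity = np.gradient(lip_aperture_trajectory, t)    # rate of change

    print(feature_vector.shape, lip_aperture_trajectory.shape, aperture_velocity.shape)

The essential difference the sketch is meant to expose is that the first representation discards time, whereas the second is defined only over time, so any account of audio-visual integration must say which kind of object the two streams contribute at their conflux.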