ABSTRACT

This work investigates how viewers extract phonetically relevant visual information from dynamic audiovisual speech. It tests the hypothesis that low-resolution spatial and temporal information is sufficient for speech perception. Audiovisual perception studies were carried out using spatial and temporal low-pass filters applied to video image sequences of Japanese and English sentences recorded at 30 frames/s. In the time domain, the results indicate that the information contained in the frequency band above the average rate of opening and closing of the vocal tract (i.e., >6 Hz) can be removed without significant degradation of audiovisual speech intelligibility. In the space domain, intelligibility was not degraded as long as spatial frequencies above 18.8 cycles/face were removed and those below were preserved. The tests used Gaussian filters, whose monotonic, smooth attenuation prevents visual artifacts. However, the lack of a flat pass-band and the wide transition band of Gaussian filters made it difficult to accurately analyze the combined effects of temporal and spatial filtering. For this reason, Chebyshev filters, which have a sharper attenuation and a flatter pass-band, were used in the time domain to allow a more precise analysis of how audiovisual speech information is distributed in space and time. A detailed analysis of the frequency content of the video sequences also allows a deeper understanding of audiovisual speech perception.
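
As an illustration of the filtering operations described above, the following minimal sketch (in Python, using NumPy and SciPy) applies a Chebyshev Type I temporal low-pass filter and a Gaussian spatial low-pass filter to a grayscale video array. The function names, the filter order and ripple values, and the conversion from cycles/face to a Gaussian sigma are illustrative assumptions, not the study's actual implementation.

```python
import numpy as np
from scipy.signal import cheby1, filtfilt
from scipy.ndimage import gaussian_filter

def temporal_lowpass_chebyshev(frames, cutoff_hz, fps=30.0, order=4, ripple_db=0.5):
    """Low-pass filter a video along the time axis with a Chebyshev Type I filter.

    frames:    array of shape (T, H, W), one grayscale frame per time step.
    cutoff_hz: temporal cutoff frequency (e.g., 6 Hz, as in the study).
    Order and ripple are placeholder values, not taken from the paper.
    """
    nyquist = fps / 2.0
    b, a = cheby1(order, ripple_db, cutoff_hz / nyquist, btype="low")
    # Zero-phase filtering so that lip motion is attenuated but not delayed.
    return filtfilt(b, a, frames, axis=0)

def spatial_lowpass_gaussian(frames, cutoff_cycles_per_face, face_width_px):
    """Approximate spatial low-pass filtering with a Gaussian blur per frame.

    The sigma (in pixels) is chosen so the Gaussian's half-power point roughly
    matches the requested cutoff expressed in cycles per face width; this
    normalization is an assumption for illustration only.
    """
    cutoff_cpp = cutoff_cycles_per_face / face_width_px   # cycles per pixel
    sigma = np.sqrt(np.log(2.0)) / (2.0 * np.pi * cutoff_cpp)
    return np.stack([gaussian_filter(f, sigma) for f in frames])

# Example usage on a synthetic 30 frames/s clip (T=90 frames, 128x128 pixels):
video = np.random.rand(90, 128, 128)
temporally_filtered = temporal_lowpass_chebyshev(video, cutoff_hz=6.0, fps=30.0)
spatially_filtered = spatial_lowpass_gaussian(video, cutoff_cycles_per_face=18.8,
                                              face_width_px=128)
```

The sketch separates the two operations so that, as in the experiments, temporal and spatial cutoffs can be varied independently; the Chebyshev design is used only in time, where its flat pass-band and sharp roll-off make the effective cutoff easier to interpret than a Gaussian's.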