ABSTRACT

I believe that the long-term solution to the problem of connected-speech recognition will depend on two future developments. One is a very sophisticated acoustically based model of speech production; this will involve synthesis by rule from linguistic units of some sort, and will copy human performance much more closely than any rule system has so far (and will include the ability to model particular speakers). The second development will be a distance metric for the pattern-matching process that really takes into account the phonetically important properties of the acoustic signal. The process won't detect features, but it will highlight, or make more explicit, those properties that are known to be phonetically significant. Instead of merely working with the output of a simulation of the peripheral auditory system, one would need some functional model of the higher levels that would not make categorical decisions, but would give prominence to such aspects as rapid movements of spectral peaks or sudden changes of level.
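
To make the second idea concrete, the sketch below (in Python with NumPy) shows one possible shape such a distance metric could take: rather than comparing raw spectral frames, it augments each frame with its frame-to-frame spectral movement and its change of overall level, and weights those dynamic components more heavily. The feature construction, weights, and function names are illustrative assumptions for this sketch, not a description of any existing or proposed system.

    import numpy as np

    def emphasis_features(frames):
        """Augment log-spectral frames (shape T x F) with dynamic properties.

        Appends frame-to-frame spectral differences (a crude stand-in for
        rapid movement of spectral peaks) and the change in overall level
        (a stand-in for sudden changes of level). Purely illustrative.
        """
        delta = np.diff(frames, axis=0, prepend=frames[:1])      # spectral movement
        level = frames.mean(axis=1, keepdims=True)                # overall level per frame
        dlevel = np.diff(level, axis=0, prepend=level[:1])        # sudden level change
        return np.concatenate([frames, delta, dlevel], axis=1)    # shape T x (2F + 1)

    def frame_distance(a, b, w_static=1.0, w_delta=3.0, w_level=3.0):
        """Weighted distance between two augmented frames.

        The dynamic terms get larger (assumed) weights, so the metric gives
        prominence to spectral movement and level change rather than making
        categorical feature decisions.
        """
        n = (a.shape[-1] - 1) // 2                                # number of spectral bins
        static = np.sum((a[:n] - b[:n]) ** 2)
        delta = np.sum((a[n:2 * n] - b[n:2 * n]) ** 2)
        dlevel = (a[-1] - b[-1]) ** 2
        return np.sqrt(w_static * static + w_delta * delta + w_level * dlevel)

In use, frame sequences augmented this way could be compared by an ordinary template-matching procedure such as dynamic time warping, with frame_distance supplying the local cost.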