ABSTRACT

This chapter considers issues relevant to systems for recognizing continuously spoken utterances using large vocabularies, which may range from a few thousand up to around 100,000 different words. In speech-understanding tasks it is the semantic content of the message that is required, so recognition errors do not matter provided that the meaning is not changed. The interactive nature of many speech-understanding tasks, together with the fact that the subject area is often restricted, means that the relevant vocabulary at any one point can be much smaller than the total vocabulary needed for more general transcription tasks. The need to make the best use of any available acoustic training data has important consequences for the design of the acoustic-model component. Initial developments in parameter sharing between context-dependent models concentrated on clustering whole triphone models to give generalized triphones. Sharing parameters at the finer level of individual model states (state tying) gives more flexible control over which contexts are merged; in addition to the benefits in terms of robustness, computation and storage, state tying has the potential to lead to models with better discrimination.
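The clustering of triphone models into generalized triphones mentioned above can be illustrated with a minimal sketch: bottom-up agglomerative merging of the triphones whose model parameters are closest, until a target number of clusters remains. Everything here is hypothetical and simplified; each triphone is reduced to a single mean vector, and a Euclidean distance stands in for the likelihood-loss merging criterion used in real systems.

```python
def distance(a, b):
    # Euclidean distance between two mean vectors (a crude stand-in
    # for a likelihood-based merging cost).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def merge(a, b, na, nb):
    # Occupancy-weighted average of the two models' means.
    return [(na * x + nb * y) / (na + nb) for x, y in zip(a, b)]

def cluster_triphones(models, counts, target):
    """models: {triphone_name: mean_vector}; counts: training occupancy.
    Greedily merges the closest pair until `target` clusters remain."""
    clusters = {name: (vec, counts[name]) for name, vec in models.items()}
    while len(clusters) > target:
        names = list(clusters)
        # Find the closest pair of clusters.
        a, b = min(
            ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
            key=lambda p: distance(clusters[p[0]][0], clusters[p[1]][0]),
        )
        (va, na), (vb, nb) = clusters.pop(a), clusters.pop(b)
        clusters[a + "+" + b] = (merge(va, vb, na, nb), na + nb)
    return {name: vec for name, (vec, _) in clusters.items()}

# Hypothetical data: four triphones of /ih/ in different contexts.
models = {
    "s-ih+t": [1.0, 2.0],
    "z-ih+t": [1.1, 2.1],   # acoustically close to s-ih+t
    "b-ih+n": [5.0, 0.0],
    "p-ih+n": [5.2, 0.1],   # acoustically close to b-ih+n
}
counts = {"s-ih+t": 30, "z-ih+t": 10, "b-ih+n": 20, "p-ih+n": 20}
generalized = cluster_triphones(models, counts, target=2)
print(sorted(generalized))  # → ['b-ih+n+p-ih+n', 's-ih+t+z-ih+t']
```

The same greedy loop, applied to individual HMM states rather than whole models, is the essence of the state-tying approach discussed in the chapter.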