ABSTRACT

In a system that performs name retrieval from spellings, neural networks are used for three components: a pitch tracker, a broad phonetic classifier used to find letters and letter-internal segment boundaries, and a letter classifier. The broad phonetic classifier uses spectral features in a fixed window around each 3 ms frame of the utterance. Its output is a score for each of several broad phonetic classes (e.g. voiced stop, or iy, the vowel in E). A Viterbi search finds the most likely sequence of letter segments based on these scores, which defines the letter boundaries and the internal segmentation of each letter. The letter classifier uses carefully selected features from the whole letter, chosen on the basis of our knowledge of the acoustic differences between the letters. The features are anchored at segment boundaries because measurements taken there are especially useful for fine phonetic distinctions (e.g. ‘B’ versus ‘D’). A neural network classifier is trained on these features, extracted from letters spoken by 120 different speakers. We achieved 96% accuracy on an independent test set from 30 new speakers. When searching a database of 50,000 names, we achieved 95% first-choice name retrieval for 1020 spelled names. This section describes all three neural networks briefly, but focuses on the broad phonetic and letter classifiers.
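
To make the segmentation step concrete, the following is a minimal sketch of a Viterbi search over frame-level class scores. It is not the system's implementation: the function name `viterbi_segment`, the two-class toy scores, and the simple class-to-class transition matrix are all assumptions made for illustration. The search described above is constrained by the allowed segment sequences of letters, which this sketch replaces with a generic transition penalty.

```python
import numpy as np

def viterbi_segment(log_scores, log_trans):
    """Find the best class sequence and the frames where the class changes.

    log_scores: (n_frames, n_classes) log scores from a frame classifier.
    log_trans:  (n_classes, n_classes) log transition scores (hypothetical;
                stands in for the letter-structure constraints).
    """
    n_frames, n_classes = log_scores.shape
    best = np.empty((n_frames, n_classes))
    back = np.zeros((n_frames, n_classes), dtype=int)
    best[0] = log_scores[0]
    for t in range(1, n_frames):
        # cand[i, j]: best score ending in class i at t-1, then moving to j
        cand = best[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0) + log_scores[t]
    # Trace back from the best final class
    path = [int(best[-1].argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    # A segment boundary is any frame whose class differs from the previous one
    boundaries = [t for t in range(1, n_frames) if path[t] != path[t - 1]]
    return path, boundaries

# Toy example: class 0 = voiced stop, class 1 = iy (illustrative scores only)
frame_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.45, 0.55],
                        [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
stay_or_switch = np.array([[0.9, 0.1], [0.1, 0.9]])
path, boundaries = viterbi_segment(np.log(frame_probs), np.log(stay_or_switch))
print(path, boundaries)
```

In this toy input the third frame slightly favors the second class on its own, but the transition penalty keeps the path in the first class, so a single boundary is found rather than three. This is the kind of smoothing a sequence search provides over frame-by-frame decisions.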