ABSTRACT

Tagging by the Hidden Markov Model (HMM) is a successful technique for assigning grammatical information to words in a corpus. In outline, the tagger picks the most probable grammatical category, or “tag”, for each word by combining the probability of possible tag sequences with the probability of each hypothesized tag for a word considered in isolation from its context. There are two standard algorithms, known as forward–backward (FB) and Viterbi. Viterbi achieves greater computational efficiency than FB by keeping only the best path to each state and pruning the remaining hypotheses; FB has the advantage of assigning a score to every hypothesized tag. The model, in the form of the word–tag (lexical) and tag–tag (transition) probabilities, may be estimated from a training corpus which has been tagged by some other means, for example by a human annotator. Alternatively, the model may be derived using a procedure called Baum–Welch re-estimation, which iteratively improves the quality of an approximate model using probabilities derived by the FB algorithm. The tagging algorithms are robust, in the sense that every word receives a tag, even in ungrammatical text. For a more detailed description, including the mathematical foundations of the technique, see Sharman (1990) and Cutting et al. (1992).
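
As a concrete illustration of the decoding step sketched above, the following is a minimal Viterbi tagger over a toy model; the tag set, the lexical and transition probability tables, and the smoothing floor are illustrative assumptions, not values from the literature or from the papers cited above.

from math import log

# Toy model: P(word | tag) lexical probabilities and P(tag | previous tag)
# transition probabilities. All values here are illustrative assumptions.
LEXICAL = {
    ("the", "DET"): 0.7,
    ("dog", "NOUN"): 0.4,
    ("barks", "VERB"): 0.3,
    ("barks", "NOUN"): 0.01,
}
TRANSITION = {
    ("<s>", "DET"): 0.5,   # "<s>" marks the start of the sentence
    ("DET", "NOUN"): 0.6,
    ("NOUN", "VERB"): 0.4,
    ("NOUN", "NOUN"): 0.2,
}
TAGS = ("DET", "NOUN", "VERB")
FLOOR = 1e-12  # crude smoothing floor for unseen events

def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {t: (log(TRANSITION.get(("<s>", t), FLOOR))
                + log(LEXICAL.get((words[0], t), FLOOR)), [t])
            for t in TAGS}
    for w in words[1:]:
        step = {}
        for t in TAGS:
            # Keep only the single best predecessor for each tag: this is
            # the pruning that makes Viterbi cheaper than full FB scoring.
            step[t] = max(
                (score + log(TRANSITION.get((p, t), FLOOR))
                       + log(LEXICAL.get((w, t), FLOOR)), path + [t])
                for p, (score, path) in best.items())
        best = step
    return max(best.values())[1]

print(viterbi(["the", "dog", "barks"]))   # -> ['DET', 'NOUN', 'VERB']

Summing over predecessors instead of maximising gives the forward probabilities used by FB, which, combined with a symmetric backward pass, assign a score to every hypothesized tag rather than only those on the single best path.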