ABSTRACT

On the basis of the statistical distribution of words in texts, Burgess (1998) and others have computed semantic representations that produce human-like performance in tasks considered to presuppose language understanding. These representations are calculated in three steps: First, the co-occurrences of words are counted. A co-occurrence is defined as the joint appearance of two words within a maximum distance from each other. In the second step, a context vector is computed for each word, whose elements are the normalized co-occurrence counts with all other words of the vocabulary. In the third step, similarities between words are computed as the dot products of their context vectors. These models make no assertions about how the postulated semantic representations are learned. We have developed and implemented an unsupervised learning algorithm based on the stimulus sampling theory of Estes (1950). It computes incrementally, while reading, a net of associations between words that corresponds to the normalized co-occurrence values. These associations agree with the results observed in free word association experiments.
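
The following is a minimal sketch of the three-step procedure described above, assuming a window size of two, a toy corpus, and row-normalization (each context vector sums to one); the function names and these settings are illustrative choices, not the authors' exact implementation, and the incremental learning algorithm itself is not reproduced here.

```python
"""Illustrative sketch: co-occurrence counts, normalized context
vectors, and dot-product similarities (assumed settings, toy data)."""

from collections import defaultdict


def cooccurrence_counts(tokens, max_distance=2):
    """Step 1: count joint appearances of word pairs that occur
    within max_distance positions of each other."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + max_distance, len(tokens))):
            counts[w][tokens[j]] += 1
            counts[tokens[j]][w] += 1
    return counts


def context_vectors(counts, vocabulary):
    """Step 2: for each word, a vector of normalized co-occurrences
    with all other words of the vocabulary (rows sum to one here)."""
    vectors = {}
    for w in vocabulary:
        total = sum(counts[w].values()) or 1
        vectors[w] = [counts[w][v] / total for v in vocabulary]
    return vectors


def similarity(vectors, a, b):
    """Step 3: similarity as the dot product of two context vectors."""
    return sum(x * y for x, y in zip(vectors[a], vectors[b]))


# Toy usage example (assumed corpus).
tokens = "the dog chased the cat the cat chased the mouse".split()
vocab = sorted(set(tokens))
counts = cooccurrence_counts(tokens, max_distance=2)
vecs = context_vectors(counts, vocab)
print(similarity(vecs, "dog", "cat"))
```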