ABSTRACT

The sheer extent of linguistic data has traditionally both stimulated statistical approaches, such as the work of Zipf (1932) and Yule (1944), and at the same time frustrated them, because no more than a small corpus could be managed by hand. Cheap computers have encouraged the development of large machine-readable corpora, such as the Associated Press newswire, which provides 40 million words per year, and have also made the management of such large corpora possible. The stimulus therefore remains while the frustration is reduced, which has led to an increased demand for new statistical approaches to linguistics, as exemplified in Church and Gale (1991), Gale and Church (1990), and Church et al. (1990).