ABSTRACT

This paper describes a model for discovering coherent texts from unconstrained textual data. 1 The model is both linguistically and statistically motivated. Linguistically, it draws on the view that discourse consists of textual units called “discourse segments” (Nomoto & Nitta 1993); statistically, it is guided by ideas from the information retrieval literature (Salton 1988). Like research on information retrieval, previous quantitative approaches to text segmentation (Hearst 1993; Kozima 1993; Youmans 1991) have paid little attention to the linguistic structure of discourse, treating it as a mere bag of words or sentences. 2 Part of our concern here is to explicate the possible effects of discourse segments on the quantitative structuring of discourse.