ABSTRACT

Thematic analysis is best illustrated by contrasting collocations 1 such as “shipping pacemakers” vs. “shipping departments”. In the first collocation, the pacemakers are being shipped, while in the second, the departments are probably engaged in some shipping activity but are not themselves being shipped.

Text pre-processors, intended to inject corpus-based intuition into the parsing process, have blurred the distinction between such cases. Although statistical tagging has attained impressive results overall, the analysis of multiple-content-word strings (i.e., collocations) has remained a weakness and has degraded accuracy.

In this paper we present a tagging algorithm designed to serve as a front end for a syntactic parser. Training over a large corpus, and exploiting distributional properties of collocations, the tagger performs accurate thematic analysis.

The critical advantage of this algorithm is that training can be performed over a raw corpus (i.e., with no need for manual tagging), thus enabling immediate training over any new corpus that requires text processing.

We provide empirical results: NLcp (NL corpus processing) acquired a database of 250,000 thematic relations from the 85-million-word Wall Street Journal Corpus. Tested over a 66,000-word collection of financial news stories, it drastically improved the tagging of content words. The integration of the tagger with a parser is now under way, in a system that extracts joint venture data from newspapers. 2