ABSTRACT

The People’s Republic of China began work on a word segmentation standard as early as 1987. The resulting standard, The Word Segmentation Standard of Contemporary Chinese Language for Information Processing, explicitly recognized that its segmentation units are not equivalent to words and that its target is not the linguistic word but a processing unit for the information processing of Chinese texts. Under its segmentation rules, numerals and classifiers preceding nouns are segmented as independent units, except for a few compounds with specific meanings. Ideally, a segmentation standard should provide complete and robust guidance for segmenting a corpus. The lexicon is the foundational reference for segmentation, and no segmentation decision can be made without referring to it. This design creates a strong dependency between the standardized lexicon and the segmentation standard. The segmentation standard has undergone years of preparation and fine-tuning by experts through multiple meetings and discussions.