“Sou” WenJieZi | 5 | Studies on Identification of Words and Segmentati

ABSTRACT

The segmentation standard proposed in the original CNS 14366 is designed with three levels of implementation. Bound morphemes should be attached to neighboring words to form a segmentation unit whenever possible. A string of characters that has a high frequency in the language, or whose components have a high co-occurrence frequency, should be treated as a segmentation unit whenever possible. A string whose meaning cannot be derived from the sum of its components should be treated as a segmentation unit. Bi-syllabic modifier-modified verbs should be treated as one segmentation unit as far as possible. Units that are listed in standard dictionaries should be segmented as independent words. Units that follow word formation rules should be combined under the principle of expressiveness.
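As an illustration only, two of the principles above (dictionary units as independent words, and bound morphemes attached to a neighboring word) can be sketched in a few lines of Python. This is a minimal sketch, not the procedure defined by the standard: forward maximum matching is a common baseline chosen here for simplicity, and the names `LEXICON`, `BOUND_SUFFIXES`, `max_match`, `attach_bound_morphemes`, and `segment`, as well as the toy word lists, are hypothetical.

```python
# Hypothetical toy data, assumed for illustration; not taken from CNS 14366.
LEXICON = {"學生", "喜歡", "讀書"}   # stand-in for a standard dictionary
BOUND_SUFFIXES = {"們"}              # stand-in list of bound morphemes (suffixes)


def max_match(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon entry.

    A single character is always accepted as a fallback unit.
    """
    units = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                units.append(candidate)
                i += size
                break
    return units


def attach_bound_morphemes(units, suffixes):
    """Merge each bound suffix into the unit immediately to its left."""
    merged = []
    for unit in units:
        if unit in suffixes and merged:
            merged[-1] += unit
        else:
            merged.append(unit)
    return merged


def segment(text):
    """Dictionary lookup first, then attachment of bound morphemes."""
    return attach_bound_morphemes(max_match(text, LEXICON), BOUND_SUFFIXES)


print(segment("學生們喜歡讀書"))
```

With this toy lexicon, the dictionary pass yields 學生 / 們 / 喜歡 / 讀書, and the attachment pass merges the suffix 們 into 學生 to give a single unit 學生們. The frequency-based and non-compositionality principles would need corpus statistics and are not covered by this sketch.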