ABSTRACT

The first task in establishing a word segmentation standard is to define the basic unit of segmentation. This chapter presents two basic principles and six subsidiary principles, so that corpus segmentation both reflects speakers' linguistic intuition and remains felicitous with respect to linguistic theory. The basic principles justify a segmentation unit on two grounds: semantics and grammar. It is important to note that both are principles of combination. This is because the null hypothesis of word segmentation is that each character is a word; hence the task of word segmentation can be viewed as going through a character string to determine which substrings of characters should be combined to form word units. In addition to the basic theoretical principles, we must also have operational principles to guide the actual segmentation or combination decisions made when implementing word segmentation.
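
The combination view of segmentation can be illustrated with a minimal sketch. It starts from the null hypothesis (each character is a word) and combines characters only when some criterion licenses the combination. The criterion used here, membership in a toy lexicon with greedy longest match, is an assumption made purely for illustration; it stands in for the semantic and grammatical principles and is not the procedure the chapter proposes. The names `segment`, `TOY_LEXICON`, and `max_len` are hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Combine characters into word units, starting from one character per word.

    `lexicon` is a hypothetical stand-in for whatever principles decide
    that a substring of characters should be combined into one unit.
    """
    words = []
    i = 0
    while i < len(text):
        # Null hypothesis: the single character at position i is a word.
        unit = text[i]
        # Try longer substrings first; combine only if the lexicon licenses it.
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in lexicon:
                unit = text[i:j]
                break
        words.append(unit)
        i += len(unit)
    return words


# Toy data, invented for this example only.
TOY_LEXICON = {"中文", "分词", "标准"}
print(segment("中文分词的标准", TOY_LEXICON))
```

With an empty lexicon the function returns one unit per character, which is exactly the null hypothesis; every multi-character unit in the output is the result of an explicit combination decision.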