ABSTRACT

For the following Chinese sentence, the term extraction process is as follows:

Chinese sentence: bai zuo zi shang fang le yi ge ping guo bi ji ben dian nao bao .

Word segmentation results: bai/ zuo zi/ shang/ fang/ le/ yi/ ge/ ping guo/ bi ji ben/ dian nao/ bao/ ./

Part of speech tagging results: bai/a zuo zi/n shang/f fang/v le/u yi/m ge/q ping guo/n bi ji ben/n dian nao/n bao/n ./w

Matched patterns: a+n, n+n, n+n+n, n+n+n+n

Candidate terms: bai/ zuo zi/, ping guo/ bi ji ben/, bi ji ben/ dian nao/, dian nao/ bao/, ping guo/ bi ji ben/ dian nao/, ping guo/ bi ji ben/ dian nao/ bao/
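The pattern-matching step of this example can be sketched in Python. This is a minimal illustration, not the authors' implementation: the names TAGGED, PATTERNS, and extract_candidates are hypothetical, and scanning every contiguous window of tagged words is an assumed matching strategy. Under that assumption the scan also returns bi ji ben/ dian nao/ bao/ (n+n+n), which is not among the candidate terms listed in the example.

```python
from typing import List, Set, Tuple

# Tagged sentence from the example, as (word, part-of-speech tag) pairs.
TAGGED: List[Tuple[str, str]] = [
    ("bai", "a"), ("zuo zi", "n"), ("shang", "f"), ("fang", "v"), ("le", "u"),
    ("yi", "m"), ("ge", "q"), ("ping guo", "n"), ("bi ji ben", "n"),
    ("dian nao", "n"), ("bao", "n"), (".", "w"),
]

# The four matched patterns from the example, written as tag tuples.
PATTERNS: Set[Tuple[str, ...]] = {
    ("a", "n"),
    ("n", "n"),
    ("n", "n", "n"),
    ("n", "n", "n", "n"),
}


def extract_candidates(
    tagged: List[Tuple[str, str]],
    patterns: Set[Tuple[str, ...]] = PATTERNS,
) -> List[Tuple[str, ...]]:
    """Return every contiguous word sequence whose tags match a pattern.

    Assumption: all windows are tested, so overlapping and nested
    candidates are all kept.
    """
    lengths = sorted({len(p) for p in patterns})
    candidates = []
    for start in range(len(tagged)):
        for length in lengths:
            window = tagged[start:start + length]
            if len(window) < length:
                break  # ran past the end of the sentence
            tags = tuple(tag for _, tag in window)
            if tags in patterns:
                candidates.append(tuple(word for word, _ in window))
    return candidates


if __name__ == "__main__":
    for candidate in extract_candidates(TAGGED):
        print("/ ".join(candidate) + "/")
```

Keeping the patterns as a set of tag tuples makes each window check a single membership lookup, and new patterns can be added without changing the scanning loop.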

3 CHINESE TERM CLASSIFIER

An important characteristic of a term is that it occurs frequently in the corpus. If two or more Chinese words co-occur frequently, they are more likely to constitute a term. Frequency can therefore be used to measure the probability that a sequence of Chinese words is a term. For a sequence of Chinese words S=w1, w2, …, wn, its frequency is denoted as freq(S). It can be estimated from a large corpus, as shown in formula (1). Here, count(S) is the number of times S appears in the corpus and n is the number of Chinese words in the corpus.
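The frequency estimate can be illustrated with a short Python sketch. Formula (1) is not reproduced in this passage, so the sketch assumes the relative frequency count(S) / n implied by the definitions of count(S) and n. The function names and the toy corpus are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def count_occurrences(corpus: Sequence[str], seq: Sequence[str]) -> int:
    """Count how many times the word sequence `seq` appears contiguously in `corpus`."""
    k = len(seq)
    if k == 0:
        return 0
    target = tuple(seq)
    return sum(
        1
        for i in range(len(corpus) - k + 1)
        if tuple(corpus[i:i + k]) == target
    )


def freq(corpus: Sequence[str], seq: Sequence[str]) -> float:
    """Assumed form of formula (1): count(S) divided by the corpus size in words."""
    total_words = len(corpus)
    if total_words == 0:
        return 0.0
    return count_occurrences(corpus, seq) / total_words


if __name__ == "__main__":
    # Toy segmented corpus, invented for this illustration only.
    corpus = ["bi ji ben", "dian nao", "bao", "he", "bi ji ben", "dian nao"]
    print(freq(corpus, ["bi ji ben", "dian nao"]))
```

In practice the counts would come from a large segmented corpus. A hash map of n-gram counts built in one pass would avoid rescanning the corpus for every candidate.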