ABSTRACT

The data imbalance problem has become a challenge in many real-life classification applications. Although numerous synthetic over-sampling techniques have been put forward to alleviate this problem, most of them do not consider the distribution of the minority examples and may generate noisy synthetic minority examples that overlap the majority examples. To address this, this paper presents an improved synthetic over-sampling algorithm, named Clustering Based Random Over-Sampling Examples (CBROSE), for balancing binary-class data sets. CBROSE generates synthetic minority examples by combining the K-means clustering algorithm with the basic mechanism of existing synthetic over-sampling methods. The synthetic minority examples created by CBROSE are always located in an elliptical area centered at an observed minority example. Experimental results based on 5-fold cross-validation show the effectiveness of CBROSE on several real-life data sets in terms of AUC.
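To make the idea concrete, the following is a minimal sketch of cluster-guided over-sampling in plain Python. It is an illustration under stated assumptions, not the CBROSE procedure itself: the abstract does not specify how the elliptical area is sized, so the sketch assumes axis-aligned semi-axes equal to a hypothetical `scale` factor times the per-feature standard deviation of the K-means cluster that contains the chosen minority example. The function names `kmeans`, `sample_in_ellipsoid`, and `oversample` are likewise illustrative.

```python
import math
import random


def kmeans(points, k, iters=20, rng=None):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    labels = [0] * len(points)

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def sample_in_ellipsoid(center, semi_axes, rng):
    """Uniform draw inside an axis-aligned ellipsoid centered at `center`."""
    d = len(center)
    # Uniform point in the unit ball: Gaussian direction, radius U^(1/d).
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g)) or 1.0
    r = rng.random() ** (1.0 / d)
    # Stretch each coordinate by its semi-axis and shift to the center.
    return tuple(c + a * r * x / norm
                 for c, a, x in zip(center, semi_axes, g))


def oversample(minority, n_new, k=2, scale=1.0, seed=0):
    """Generate n_new synthetic minority points around observed ones."""
    rng = random.Random(seed)
    labels = kmeans(minority, k, rng=rng)
    # ASSUMPTION: semi-axes = scale * per-feature std of the point's cluster.
    axes = {}
    for c in set(labels):
        members = [p for p, lab in zip(minority, labels) if lab == c]
        axes[c] = []
        for col in zip(*members):
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col)
            axes[c].append(scale * math.sqrt(var))
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))  # pick an observed minority example
        synthetic.append(sample_in_ellipsoid(minority[i], axes[labels[i]], rng))
    return synthetic


if __name__ == "__main__":
    minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
                (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
    for point in oversample(minority, n_new=4, k=2):
        print(point)
```

Because each synthetic point is drawn inside an ellipsoid centered at an observed minority example, and the ellipsoid is sized from that example's own cluster, tight clusters yield tight synthetic neighborhoods. This is one plausible way to limit overlap with majority examples; a different rule for the semi-axes would change only the `axes` computation.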