ABSTRACT

In Chapter 14, we proposed a conservative weighted feature selection algorithm in which each feature in a cluster is assigned a weight according to how relevant the feature is to that cluster, and we presented a method for estimating this relevance based on histogram analysis. In this chapter, we propose two more weighted feature selection mechanisms. One is an aggressive histogram analysis (aggressive relative to the conservative method of Chapter 14); the other is based on chi-square calculation. The idea behind the aggressive method is similar to that of the conservative method: if the distribution of a feature's values is denser for a cluster, the feature is more important to, or more representative of, that cluster, and it carries more weight for that cluster. Intuitively, a feature carries more weight when its values are sparsely distributed, that is, concentrated in a few regions, rather than uniformly distributed. We first calculate the density of a feature in a particular cluster and then compute the weight from the density in an aggressive manner. The conservative approach works gracefully; the aggressive approach, in contrast, takes the reciprocal of density into account, which gives density a stronger effect on the weight. The chi-square method is concerned with the difference between the global feature distribution (GD), based on all training data points, and the local feature distribution (LD), based on the data points in one specific cluster. If the GD and LD of a feature are very similar, the feature is not representative of the corresponding cluster, and we assign it less weight. The question is then how to quantify the difference between a feature's LD and GD; we apply the chi-square method to measure this difference. We also study linear discriminant analysis for feature weighting.
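
The comparison of a feature's local and global distributions described above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the chapter's implementation: the function name chi_square_score is hypothetical, the feature is assumed to take discrete (or already binned) values, and the expected count of each value in a cluster is assumed to be the cluster size times that value's global proportion.

```python
from collections import Counter


def chi_square_score(all_values, cluster_values):
    """Chi-square statistic between a feature's LD and GD.

    all_values: the feature's values over all training points (GD).
    cluster_values: the feature's values over one cluster's points (LD);
        assumed to be a subset of all_values.
    Hypothetical helper for illustration only.
    """
    if not cluster_values:
        raise ValueError("cluster_values must not be empty")

    global_counts = Counter(all_values)     # GD as counts per value
    local_counts = Counter(cluster_values)  # LD as counts per value
    n_global = len(all_values)
    n_local = len(cluster_values)

    score = 0.0
    for value, g_count in global_counts.items():
        # Count we would expect in the cluster if LD matched GD exactly.
        expected = n_local * g_count / n_global
        observed = local_counts.get(value, 0)
        score += (observed - expected) ** 2 / expected
    return score


# Toy data: the feature takes value "a" or "b" equally often overall.
all_values = ["a", "a", "b", "b", "a", "b", "a", "b"]

# A cluster where the feature is always "a": LD differs strongly from GD.
skewed = chi_square_score(all_values, ["a", "a", "a", "a"])

# A cluster that mirrors the global distribution: LD matches GD.
matching = chi_square_score(all_values, ["a", "b", "a", "b"])

print(skewed, matching)
```

A larger score indicates a feature whose LD departs further from its GD, which by the argument above makes it more representative of the cluster; a score near zero marks a feature that should receive less weight. How the score is converted into a weight is not fixed by this sketch.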