FIGURE 4.8: The PR (left) and cumulative recall (right) curves of the LOF and BPrule models.

4.4.1.3 Clustering-Based Outlier Rankings (ORh)

The next outlier ranking method we consider is based on the results of a clustering algorithm. The ORh method (Torgo, 2007) uses a hierarchical agglomerative clustering algorithm to obtain a dendrogram of the given data. Dendrograms are visual representations of the merging process of these clustering methods. Cutting these trees at different height levels produces different clusterings of the data. At the lowest level we have a solution with as many groups as there are observations in the given training data. This is the initial solution of the iterative algorithm used by these methods. Each subsequent step of the algorithm decides which two groups from the previous step should be merged into a single cluster. This merging process is guided by some criterion that tries to put together observations that are most similar to each other. The iterative process stops when the last two groups are merged into a single cluster containing all observations. The dendrogram describes this entire merging process. The function hclust() of the base package stats implements several variants of this type of clustering. The object returned by this function includes a data structure (merge) with information on which cases are involved in each merging step.

The ORh method uses the information in this data structure as the basis for the following outlier ranking method. The basic idea is that outliers should offer greater resistance to being merged and thus, when they are finally merged, the size difference between the group to which they belong and the group with which they are being merged should be very large. This reflects the idea that outliers are rather different from other observations, so merging them with a group of more typical cases should clearly decrease the homogeneity of the resulting group. Occasionally, outliers are merged at the initial stages with other observations, but only if these are similar outliers. Otherwise, they will only be merged at later stages of the clustering process, and usually with a much larger group of cases. This is the general idea that is captured by the ORh method, which calculates the outlier score of each case as follows. For each merging step i involving two groups ($g_{x,i}$ and $g_{y,i}$), we calculate the following value:
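The following small example is not part of the case study; it only illustrates, on artificial data, the structure of the merge component returned by hclust() and how an outlying observation resists the merging process:

## Toy illustration (artificial data): cluster 20 "normal" values plus one
## clear outlier and inspect the merge component of the hclust() result.
set.seed(1234)
x <- c(rnorm(20), 10)                      # observation 21 is the outlier
h <- hclust(dist(x), method = "average")   # agglomerative hierarchical clustering
h$merge
## Each row of h$merge describes one merging step: negative entries identify
## single observations, positive entries identify groups formed at earlier
## steps. Observation 21 (shown as -21) appears only in the last row, i.e.,
## the outlier is merged only at the final step, against the group containing
## all the other observations.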

$$ of_i(x) = \max\left(0,\; \frac{|g_{y,i}| - |g_{x,i}|}{|g_{y,i}| + |g_{x,i}|} \right) \qquad (4.3) $$

where $g_{x,i}$ is the group to which $x$ belongs at merging step i, and $|g_{x,i}|$ is the cardinality of that group. Note that the members of the larger group involved in a merge get the score 0, as we are interested in the members of small groups. Each observation can be involved in several merges throughout the iterative process of the hierarchical clustering algorithm, sometimes as a member of the larger group and sometimes as a member of the smaller group. The final outlier score of each case in the data sample is given by

$$ OF_H(x) = \max_i \; of_i(x) \qquad (4.4) $$
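To make Equations (4.3) and (4.4) concrete, the following sketch computes these scores directly from the merge component of an hclust object. It is only a didactic re-implementation of the formulas, not the code of the outliers.ranking() function discussed next.

## Didactic re-implementation of Equations (4.3) and (4.4): compute the OF_H
## score of every observation from the merge component of an hclust object.
orh.scores <- function(h) {
  n <- nrow(h$merge) + 1L                    # number of observations
  sizes   <- integer(nrow(h$merge))          # size of the group formed at each step
  members <- vector("list", nrow(h$merge))   # observations in that group
  scores  <- numeric(n)                      # current OF_H score of each observation
  grp <- function(i) if (i < 0) -i else members[[i]]  # members of a merge operand
  sz  <- function(i) if (i < 0) 1L else sizes[i]      # size of a merge operand
  for (s in seq_len(nrow(h$merge))) {
    a <- h$merge[s, 1]; b <- h$merge[s, 2]
    members[[s]] <- c(grp(a), grp(b))
    sizes[s] <- sz(a) + sz(b)
    of <- abs(sz(a) - sz(b)) / (sz(a) + sz(b))        # Equation (4.3); it is 0 for
    small <- if (sz(a) <= sz(b)) grp(a) else grp(b)   # members of the larger group
    scores[small] <- pmax(scores[small], of)          # Equation (4.4): max over merges
  }
  scores
}
## On the toy example above, orh.scores(h) gives its highest score to the
## outlier at position 21.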

The function outliers.ranking() of our book package implements this method. The following function uses the ORh method to obtain the outlier scores of a test set of reports, from which the usual evaluation statistics can then be computed:
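A minimal sketch of such a function is shown below. It assumes the sales data layout used throughout this case study (columns Prod and Uprice), handles each product separately, and only computes the scores of the test reports; the component name prob.outliers used to extract the scores from the result of outliers.ranking() is an assumption, and the actual function in the book package may differ.

## Sketch (not the exact code of the book package): ORh outlier scores for the
## test reports, clustering the unit prices of each product separately.
orh.test.scores <- function(train, test) {
  all <- rbind(train, test)
  test.rows <- (nrow(train) + 1):nrow(all)   # positions of the test reports
  scores <- numeric(nrow(all))
  for (p in unique(all$Prod)) {
    idx <- which(all$Prod == p)
    if (length(idx) > 3) {                   # need a few reports to cluster
      o <- outliers.ranking(data.frame(Uprice = all$Uprice[idx]))
      scores[idx] <- o$prob.outliers         # assumed name of the score component
    }                                        # with too few reports the score stays 0
  }
  scores[test.rows]                          # scores of the test reports only
}

Given these scores and the inspection labels (Insp) of the test reports, the PR and cumulative recall curves can then be obtained in the same way as for the previous models.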

Once again we have used the approach of handling the products individually, primarily for the same reasons described for the LOF method. Nevertheless, the outliers.ranking() function can receive as argument a distance matrix of the observations being ranked, instead of the data set itself. This means that we could obtain this matrix using any distance function that handles mixed-mode data (e.g., the function daisy() in package cluster). However, if you decide to try this, you will need considerable computational resources, as clustering such a large data set requires a large amount of main memory and a fast processor. Even with the approach of handling each product separately, running the full hold-out experiments for this method will take quite a while on any ordinary computer.
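As an illustration of this distance-matrix alternative, the following sketch ranks a small random sample of the reports; the sales data frame and the columns Prod and Uprice are the ones assumed above, and sampling is used precisely to avoid the memory problems just mentioned.

## Sketch of the distance-matrix alternative on a small random sample of the
## reports (clustering the full data set would require too much memory).
library(cluster)                             # provides daisy()
set.seed(1234)
ok <- complete.cases(sales[, c("Prod", "Uprice")])   # skip reports with missing values
smpl <- sales[sample(which(ok), 1000), c("Prod", "Uprice")]
d <- daisy(smpl)                             # distance matrix for mixed-mode data
o <- outliers.ranking(d)                     # rank directly from the distance matrix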