ABSTRACT

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 14.4 CPC Design and Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

14.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 14.4.2 MPQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 14.4.3 The CPC Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 14.4.4 CPC Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 14.4.5 Optimization and Implementation Details . . . . . . . . . . . . . . 209

14.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 14.5.1 Datasets and Clustering Algorithms . . . . . . . . . . . . . . . . . . . . 210 14.5.2 CPC Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 14.5.3 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 14.5.4 Categorical Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 14.5.5 Numerical Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 14.5.6 Document Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 14.5.7 CPC Execution Time and Memory Use . . . . . . . . . . . . . . . . . 214 14.5.8 Effect of Pattern Limit on Clustering Quality . . . . . . . . . . 215

14.6 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 14.6.1 Alternate MPQ Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 14.6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

Cluster analysis is concerned with grouping objects according to measured or intrinsic characteristics or similarity [196]. Clustering is often applied for

exploratory data analysis, where prior domain knowledge is scarce. This can be problematic for traditional clustering approaches, which often rely on a distance function to define the similarity between objects and evaluate the clustering’s quality. The design of a good distance function requires a deep understanding of the dataset under consideration, and standard distance functions (e.g. Euclidean) may be inappropriate for the domain. Moreover, it is well known that distance is not very meaningful in high-dimensional data [51], which includes most cases of categorical data.