Significance Testing in Clustering

doi:10.1201/b19706-21

ABSTRACT

Abstract
15.1 Introduction
15.2 Overview of Significance Tests for Clustering
    15.2.1 Null Models and Test Statistics for Euclidean Data
    15.2.2 Null Models and Test Statistics for Dissimilarity Data
    15.2.3 Nonstandard Tests, Null Models, and Parametric Bootstrap
    15.2.4 Testing Clusters and Their Number
15.3 The Method of SigClust
    15.3.1 Criterion
    15.3.2 Invariance Principle
    15.3.3 Eigenvalue Estimation
    15.3.4 Diagnostics for Assessing Validity of SigClust Assumptions
15.4 Cancer Microarray Data Application
15.5 Open Problems of SigClust
15.6 Conclusion
References

In this chapter, we give an overview of the principles and ideas behind significance testing in cluster analysis. We review test statistics and null models proposed in the literature and discuss related issues: the parametric bootstrap, the use of significance tests to estimate the number of clusters, and p-values for single clusters. We then focus on Statistical Significance of Clustering (SigClust), a recently developed cluster evaluation tool designed specifically for testing clustering results on high-dimensional, low-sample-size data. SigClust assesses the significance of departures from a Gaussian null distribution and uses invariance properties to reduce the amount of parameter estimation required. We illustrate the basic idea and implementation of SigClust and give examples.
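To make the idea of testing against a Gaussian null concrete, the following is a minimal Python sketch of a SigClust-style Monte Carlo test. It is an illustration under stated assumptions, not the chapter's implementation. It assumes the test statistic is the 2-means cluster index (within-cluster sum of squares divided by total sum of squares). It uses the raw eigenvalues of the sample covariance as a stand-in for the eigenvalue estimation treated in Section 15.3.3. It also applies a common add-one convention to the empirical p-value. The function names `two_means_cluster_index` and `sigclust_sketch` are hypothetical, and the sketch expects data with at least two columns.

```python
import numpy as np


def two_means_cluster_index(x, rng, n_starts=5, n_iter=50):
    """Cluster index from 2-means: within-cluster SS / total SS.

    Smaller values indicate a stronger two-cluster structure.
    Uses a plain Lloyd's algorithm with a few random restarts.
    """
    total_ss = ((x - x.mean(axis=0)) ** 2).sum()
    best = np.inf
    for _ in range(n_starts):
        # Initialise the two centres at distinct random observations.
        centres = x[rng.choice(len(x), size=2, replace=False)]
        for _ in range(n_iter):
            d = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            if labels.min() == labels.max():
                break  # one cluster became empty; keep the current centres
            new_centres = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centres, centres):
                break
            centres = new_centres
        # Sum of squared distances from each point to its nearest centre.
        d = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d.min(axis=1).sum())
    return best / total_ss


def sigclust_sketch(x, n_sim=200, seed=0):
    """Monte Carlo p-value for the 2-means cluster index under a Gaussian null."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    observed = two_means_cluster_index(x, rng)

    # ASSUMPTION: raw sample-covariance eigenvalues stand in for the
    # eigenvalue estimation step discussed in the chapter.
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(x, rowvar=False)), 0.0, None)

    # The cluster index is unchanged by shifts and rotations of the data,
    # so the null can be simulated from N(0, diag(eigvals)) with no need
    # to estimate the mean or the eigenvectors.
    null = np.empty(n_sim)
    for b in range(n_sim):
        z = rng.standard_normal((n, d)) * np.sqrt(eigvals)
        null[b] = two_means_cluster_index(z, rng)

    # ASSUMPTION: add-one empirical p-value (fraction of null indices
    # at least as small as the observed one).
    p_value = (1 + np.sum(null <= observed)) / (1 + n_sim)
    return observed, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two well-separated groups in three dimensions (synthetic data).
    x = np.vstack([rng.normal(-5, 1, (30, 3)), rng.normal(5, 1, (30, 3))])
    ci, p = sigclust_sketch(x, n_sim=100)
    print(f"cluster index = {ci:.3f}, Monte Carlo p-value = {p:.3f}")
```

A small cluster index relative to the simulated null values suggests that the observed two-cluster split is stronger than a single Gaussian would typically produce. In the high-dimensional, low-sample-size setting the raw sample eigenvalues are a crude choice, which is why the sketch flags that step as an assumption.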