Solving Data Clustering Problem as a String Search Problem

doi:10.1201/9780203497159-29

ABSTRACT

CONTENTS 25.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415 25.2 Mathematical Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417 25.3 Stability of MST Structure Under Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421 25.4 Statistical Assessment of Identified Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422 25.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423 25.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429

The various high-throughput efforts such as the Human Genome Project have opened the flood gate of biological data, including enormous amount of sequences, structure, expression, and interaction data. The rates of data generation far exceeds our current capability of analysis and interpretation. New ideas and approaches are urgently needed to establish greatly improved capabilities for biological data analysis. Data clustering is fundamental to mining a large quantity of biological data. In this paper, we present a new approach to address this challenge based on a rigorous relationship between data clusters and a sequential representation derived from the Prim algorithm for constructing a minimum spanning tree of a data set. This approach converts a clustering problem to a problem of partitioning a one-dimensional profile (a sequential representation), which facilitates easy identification of “dense” clusters from a noisy background. We have applied the method to a number of data mining problems for high-throughput biological data, including cluster identification in gene expression profiles and regulatory binding site identification. The results have shown that our method can identify underlying patterns from large-scale biological data effectively and efficiently.