ABSTRACT

CONTENTS
19.1 Introduction
19.2 Principal Component Analysis
19.3 Global PCA for Distributed Homogeneous Databases
19.4 Global PCA for Distributed Heterogeneous Databases
19.5 Experiments and Results
19.6 Conclusion
References

Most previous data mining work has focused on mining a centralized database, whose main drawback is limited scalability. Because many businesses are distributed by nature and the amount of data generated from numerous sources is increasing exponentially, a distributed database becomes an attractive alternative. The challenge in distributed data mining is to learn as much from distributed databases as from a centralized database without consuming excessive communication bandwidth. Both unsupervised classification (clustering) and supervised classification are common in data mining applications, where dimensionality reduction is often a necessary step. Principal component analysis (PCA) is a popular technique for dimensionality reduction. This paper develops a distributed PCA algorithm that derives the global principal components from distributed databases by integrating local covariance matrices. We prove that, for homogeneous databases, the algorithm derives global principal components that are exactly the same as those calculated from a centralized database. We also provide a quantitative measurement of the error introduced in the recompiled global principal components when the databases are heterogeneous.
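The idea of integrating local covariance information can be illustrated with a minimal Python sketch. It is not the algorithm developed in this chapter. It assumes that "homogeneous" means every site stores the same attributes for different records, and that each site transmits only its record count, mean vector, and scatter matrix. The function names (`local_summary`, `global_covariance`, `principal_components`) are hypothetical and introduced here for illustration only.

```python
import numpy as np

def local_summary(X):
    """Summarize one site's data (rows are records): count, mean, scatter matrix."""
    n = X.shape[0]
    mean = X.mean(axis=0)
    centered = X - mean
    scatter = centered.T @ centered  # (n - 1) times the local covariance
    return n, mean, scatter

def global_covariance(summaries):
    """Combine per-site summaries into the covariance of the pooled data."""
    n_total = sum(n for n, _, _ in summaries)
    global_mean = sum(n * m for n, m, _ in summaries) / n_total
    scatter = np.zeros_like(summaries[0][2])
    for n, m, s in summaries:
        d = (m - global_mean)[:, None]
        # within-site scatter plus the between-site term for this site
        scatter += s + n * (d @ d.T)
    return scatter / (n_total - 1)

def principal_components(cov):
    """Eigen-decompose a covariance matrix, largest variance first."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Illustrative synthetic data (an assumption, not from the chapter's experiments).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
sites = np.array_split(X, 3)  # three sites with the same five attributes

cov_distributed = global_covariance([local_summary(site) for site in sites])
cov_centralized = np.cov(X, rowvar=False)
print(np.allclose(cov_distributed, cov_centralized))
```

Under these assumptions each site sends a summary whose size depends on the number of attributes rather than the number of records, and the combined covariance matrix matches the one computed from the pooled data up to floating-point rounding, so its principal components match as well.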