CONTENTS

20.1 Introduction
20.2 Astrophysics at the Crossroads of Science and Technology
20.2.1 Virtual Observatories: Federated Data Collections from Multiple Sky Surveys
20.2.2 Mining of Distributed Data
20.2.3 VO-Enabled Data Mining Use Cases
20.2.4 Distributed Mining of Data
20.3 Concluding Remarks
Acknowledgments
References

20.1 INTRODUCTION

New modes of discovery are enabled by the growth of data and computational resources (i.e., cyberinfrastructure) in the sciences. This cyberinfrastructure includes structured databases, virtual observatories (distributed data, as described in Section 20.2.1 of this chapter), high-performance computing (petascale machines), distributed computing (e.g., the Grid, the Cloud, and peer-to-peer networks), intelligent search and discovery tools, and innovative visualization environments. Data streams from experiments, sensors, and simulations are increasingly complex and growing in volume. This is true in most sciences, including astronomy, climate simulations, Earth observing systems, remote sensing data collections, and sensor networks. At the same time, we see an emerging confluence of new technologies and approaches to science, most clearly visible in the growing synergism of the four modes of scientific discovery: sensors-modeling-computing-data (Eastman et al. 2005). This has been driven by numerous developments, including the information explosion, development of large-array sensors, acceleration in high-performance computing (HPC) power, advances in algorithms, and efficient modeling techniques. Among these, the most extreme is the growth in new data. Specifically, the acquisition of data in all scientific disciplines is rapidly accelerating and causing a data glut (Bell et al. 2007).