ABSTRACT

Modern scientific applications and experiments are becoming increasingly data intensive. Large experiments, such as high-energy physics simulations, genome mapping, and climate modeling, generate data volumes reaching hundreds of terabytes.41 Similarly, remote sensors and satellites are producing extremely large amounts of data for scientists.19,82 To process these data, scientists are turning toward distributed resources owned by the collaborating parties to provide them with the computing power and storage capacity needed to push their research forward. But the use of distributed resources imposes new challenges.52 Even simply sharing and disseminating subsets of the data to the scientists’ home institutions is difficult. The systems managing these resources must provide robust scheduling and allocation of storage and networking resources, as well as efficient management of data movement.