ABSTRACT

Performance in the Spectral Transform Method
15.5 Performance Portability: Supporting Options and Delaying Decisions
15.6 Case Study: Engineering Performance Portability into the Community Atmosphere Model
15.7 Case Study: Porting the Parallel Ocean Program to the Cray X1
15.8 Monitoring Performance Evolution
15.9 Performance at Scale
15.10 Summary
15.11 Acknowledgments

The Community Climate System Model (CCSM) is a modern world-class climate code consisting of atmosphere, ocean, land, and sea-ice components coupled through the exchange of mass, momentum, energy, and chemical species [98, 101]. Investigating the impact of climate change is computationally expensive and requires significant computational resources [263]. Making progress on this problem also requires achieving reasonable throughput rates for individual experiments when integrating out to hundreds or thousands of simulation years. Climate models employ time-accurate numerical methods, and exploitation of significant parallelism in the time direction has yet to be demonstrated in production climate models. For the CCSM this leaves functional parallelism between the component models, parallelism over the spatial dimensions, and loop-level parallelism exploited within a shared-memory multiprocessor compute node or a single processor. Because of as yet unavoidable parallel inefficiencies, the spatial computational grids that can be used while still achieving the required throughput rates for long time integrations are small compared to those used elsewhere in peta- and exa-scale computational science. As a consequence, the maximum number of processors that can be applied in a single experiment is also relatively small. Parallel algorithms need to be highly optimized for even a modest number of computational threads to make the best use of the limited amount of available parallelism.