
Laboratoire d’Informatique de Paris 6, Université Pierre et Marie Curie, France

Stan Scott

School of Electronics, Electrical Engineering & Computer Science, The Queen’s University of Belfast, United Kingdom

16.1 Introduction
16.2 2DRMP and the PROP program
  16.2.1 Principles of R-matrix propagation
  16.2.2 Description of the PROP program
  16.2.3 CAPS implementation
16.3 Numerical validation of PROP in single precision
  16.3.1 Medium case study
  16.3.2 Huge case study
16.4 Towards a complete deployment of PROP on GPUs
  16.4.1 Computing the output R-matrix on GPU
  16.4.2 Constructing the local R-matrices on GPU
  16.4.3 Scaling amplitude arrays on GPU
  16.4.4 Using double-buffering to overlap I/O and computation
  16.4.5 Matrix padding
16.5 Performance results
  16.5.1 PROP deployment on GPU
  16.5.2 PROP execution profile
16.6 Propagation of multiple concurrent energies on GPU
16.7 Conclusion and future work
Bibliography

16.1 Introduction

As described in Chapter 1, GPUs are characterized by hundreds of cores and can theoretically perform one order of magnitude better than CPUs. An important factor to consider when programming on GPUs is the cost of data transfers between CPU memory and GPU memory. Thus, to obtain good performance on GPUs, applications should be coarse-grained and have a high arithmetic intensity (i.e., a high ratio of arithmetic operations to memory operations). Another important aspect of GPU programming is that floating-point operations are preferably performed in single precision, provided the validity of the results is not affected by this format: the GPU compute power for floating-point operations is indeed greater in single precision than in double precision. The peak performance ratio between single precision and double precision varies, for example, for NVIDIA GPUs from 12 for the first Tesla GPUs (C1060), to 2 for the Fermi GPUs (C2050 and C2070), and to 3 for the latest Kepler architecture (K20/K20X). As far as AMD GPUs are concerned, the latest AMD GPU (Tahiti HD 7970) presents a ratio of 4. Moreover, GPU internal memory accesses and CPU-GPU data transfers are faster in single precision than in double precision, because single-precision data occupy half as many bytes as double-precision data.
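To make these two points concrete, the following minimal CUDA sketch (not taken from the PROP code; the kernel name, array sizes, and coefficients are illustrative assumptions) performs all arithmetic in single precision, copies its input to the GPU once, runs a kernel that executes several floating-point operations per element accessed, and copies the result back, so that the cost of the CPU-GPU transfers is amortized over the computation.

/* Illustrative CUDA sketch: explicit CPU<->GPU transfers and a
   single-precision kernel with a high arithmetic intensity.
   Names and sizes are assumptions, not part of the PROP program. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Each thread performs about 9 flops for one 4-byte load and one
   4-byte store, i.e., several arithmetic operations per memory access. */
__global__ void poly_eval(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = a * v * v * v + 2.0f * v * v + 3.0f * v + 4.0f;
    }
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);   /* single precision: 4 bytes/element */

    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) hx[i] = (float)i;

    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);

    /* CPU -> GPU transfer: performed once, then reused by the kernel */
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);

    poly_eval<<<(n + 255) / 256, 256>>>(n, 1.5f, dx, dy);

    /* GPU -> CPU transfer of the result */
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[10] = %f\n", hy[10]);
    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return 0;
}

Switching the arrays and literals of this sketch to double precision would double the volume of every memory access and of both CPU-GPU transfers, in addition to the lower double-precision peak performance mentioned above.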