Sylvain Contassot-Vivier
Université de Lorraine, Loria UMR 7503 & AlGorille INRIA Project Team, Nancy, France
Stéphane Vialle
SUPELEC, UMI GT-CNRS 2958 & AlGorille INRIA Project Team, Metz, France
Jens Gustedt
INRIA Nancy-Grand Est, AlGorille INRIA Project Team, Strasbourg, France
7.1 Introduction
7.2 General scheme of synchronous code with computation/communication overlapping in GPU clusters
    7.2.1 Synchronous parallel algorithms on GPU clusters
    7.2.2 Native overlap of CPU communications and GPU computations
    7.2.3 Overlapping with sequences of transfers and computations
    7.2.4 Interleaved communications-transfers-computations overlapping
    7.2.5 Experimental validation
7.3 General scheme of asynchronous parallel code with computation/communication overlapping
    7.3.1 A basic asynchronous scheme
    7.3.2 Synchronization of the asynchronous scheme
    7.3.3 Asynchronous scheme using MPI, OpenMP, and CUDA
    7.3.4 Experimental validation
7.4 Perspective: a unifying programming model
    7.4.1 Resources
    7.4.2 Control
    7.4.3 Example: block-cyclic matrix multiplication (MM)
    7.4.4 Tasks and operations
7.5 Conclusion
7.6 Glossary
Bibliography
This chapter draws upon several development methodologies for obtaining efficient codes in classical scientific applications. These methodologies are based on feedback from several research works involving GPUs, either in a single machine or in a cluster of machines. Indeed, our past collaborations with industry have shown that, in their economic context, companies can adopt a parallel technology only if its implementation and maintenance costs are small compared with the potential benefits (performance, accuracy, etc.). In such contexts, GPU programming is still regarded with some caution, due to its specific field of applicability (the SIMD/SIMT model: Single Instruction Multiple Data/Thread) and its comparatively high programming and maintenance complexity. In the academic domain, things are a bit different, but studies on efficiently integrating GPU computations into multicore clusters, with maximal overlapping of computations with communications and/or other computations, are still rare.
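To make the notion of overlapping concrete before the detailed schemes of Section 7.2, the following sketch illustrates the basic CUDA mechanism on which such overlap relies: asynchronous transfers and kernel launches placed in separate streams, so that a device-to-host copy (feeding a later MPI communication) proceeds concurrently with the next computation. All names here (`update`, `iterate`, the buffers) are illustrative, not from the chapter's code.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for one iteration of the computation.
__global__ void update(double *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 0.5 * d[i];
}

// One iteration with transfer/computation overlap:
// the copy of the previous results (prev_d -> buf_h) runs in stream s_copy
// while the kernel for the current iteration runs in stream s_comp.
// buf_h must be pinned memory (cudaMallocHost) for the copy to be truly async.
void iterate(double *buf_d, double *prev_d, double *buf_h, int n,
             cudaStream_t s_comp, cudaStream_t s_copy) {
    cudaMemcpyAsync(buf_h, prev_d, n * sizeof(double),
                    cudaMemcpyDeviceToHost, s_copy);
    update<<<(n + 255) / 256, 256, 0, s_comp>>>(buf_d, n);
    // The CPU thread returns from both calls immediately, so it can post
    // MPI communications here; it only waits for the data it needs.
    cudaStreamSynchronize(s_copy);  // buf_h now ready for, e.g., MPI_Send
}
```

The key point, developed at length in Section 7.2, is that the CPU thread is not blocked by either the copy or the kernel launch, which leaves it free to drive the cluster-level communications in parallel with the GPU work.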