ABSTRACT

Université de Lorraine, Loria UMR 7503 & AlGorille INRIA Project Team, Nancy, France

Stéphane Vialle

SUPELEC, UMI GT-CNRS 2958 & AlGorille INRIA Project Team, Metz, France

Jens Gustedt

INRIA Nancy-Grand Est, AlGorille INRIA Project Team, Strasbourg, France

7.1 Introduction
7.2 General scheme of synchronous code with computation/communication overlapping in GPU clusters
    7.2.1 Synchronous parallel algorithms on GPU clusters
    7.2.2 Native overlap of CPU communications and GPU computations
    7.2.3 Overlapping with sequences of transfers and computations
    7.2.4 Interleaved communications-transfers-computations overlapping
    7.2.5 Experimental validation
7.3 General scheme of asynchronous parallel code with computation/communication overlapping
    7.3.1 A basic asynchronous scheme
    7.3.2 Synchronization of the asynchronous scheme
    7.3.3 Asynchronous scheme using MPI, OpenMP, and CUDA
    7.3.4 Experimental validation
7.4 Perspective: a unifying programming model
    7.4.1 Resources
    7.4.2 Control
    7.4.3 Example: block-cyclic matrix multiplication (MM)
    7.4.4 Tasks and operations
7.5 Conclusion
7.6 Glossary
Bibliography

This chapter draws upon several development methodologies to obtain efficient codes for classical scientific applications. These methodologies are based on feedback from several research works involving GPUs, either in a single machine or in a cluster of machines. Indeed, our past collaborations with industry have shown that, in their economic context, companies can adopt a parallel technology only if its implementation and maintenance costs are small compared with the potential benefits (performance, accuracy, etc.). In such contexts, GPU programming is still regarded with some reservation, due to its specific field of applicability (the SIMD/SIMT model: Single Instruction Multiple Data/Threads) and its higher programming and maintenance complexity. In the academic domain, things are a bit different, but studies on efficiently integrating GPU computations into multicore clusters, with maximal overlapping of computations with communications and/or other computations, are still rare.