BLAS for GPUs

doi:10.1201/b10376-13

ABSTRACT

Rajib Nath, Department of Electrical Engineering and Computer Science, University of Tennessee

Stanimire Tomov, Department of Electrical Engineering and Computer Science, University of Tennessee

Jack Dongarra, Department of Electrical Engineering and Computer Science, University of Tennessee; Computer Science and Mathematics Division, Oak Ridge National Laboratory; School of Mathematics & School of Computer Science, Manchester University

Recent activities of major chip manufacturers, such as Intel, AMD, IBM and NVIDIA, make it more evident than ever that future designs of microprocessors and large HPC systems will be hybrid/heterogeneous in nature, relying on the integration (in varying proportions) of two major types of components:

1. Multi-/many-core CPU technology, where the number of cores will continue to escalate while avoiding the power wall, the instruction-level parallelism wall, and the memory wall [1]; and

2. Special-purpose hardware and accelerators, especially GPUs, which are in commodity production, have outpaced standard CPUs in performance, and have become as easy to program as multicore CPUs, if not easier.