ABSTRACT

Contemporary graphics processing units (GPUs) are massively parallel many-core processors. NVIDIA's Tesla GPUs, for example, have 240 scalar processing cores (SPs) per chip [22]. These cores are partitioned into 30 streaming multiprocessors (SMs), each comprising eight SPs. The SPs within an SM share a 16-KB on-chip memory (called shared memory) and a total of 16,384 32-bit registers that may be used by the threads running on that SM. Besides registers and shared memory, the on-chip memory shared by the cores of an SM also includes constant and texture caches. All 240 on-chip cores further share a 4-GB off-chip global (or device) memory. Figure 7.1 shows a schematic of the Tesla architecture. With the introduction of CUDA (Compute Unified Device Architecture) [35], it has become possible to program GPUs in C. This has spurred an explosion of research directed at expanding the applicability of GPUs from their native domain of computer graphics to a wide variety of high-performance computing applications.
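
For concreteness, the minimal CUDA sketch below shows how a kernel touches each level of the memory hierarchy just described: per-thread registers, per-SM shared memory, and off-chip global memory. The kernel name, array size, and 256-thread block size are illustrative assumptions, not details taken from the text.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: stages a tile of global-memory data through
// per-SM shared memory, with scalar locals held in registers.
__global__ void scaleAndCopy(const float *in, float *out, int n) {
    // One tile per thread block, resident in the SM's shared memory
    // and visible to all threads of the block.
    __shared__ float tile[256];

    // Index computation lives in per-thread registers.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = 2.0f * in[i];  // global memory -> shared memory
    }
    __syncthreads();                       // block-wide barrier on the SM
    if (i < n) {
        out[i] = tile[threadIdx.x];        // shared memory -> global memory
    }
}

int main() {
    const int n = 1024;
    float *d_in, *d_out;

    // Allocate (and zero, for a clean sketch) buffers in the off-chip
    // global (device) memory shared by all SMs.
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    // 256 threads per block, matching the shared-memory tile size above.
    scaleAndCopy<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}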