ABSTRACT

This chapter examines how the Graphics Processing Unit (GPU) actually executes thread blocks. It explores how the concept of a warp ties to the design of the GPU cores and their placement inside a streaming multiprocessor (SM). The chapter presents many different versions of the kernels inside the imflipG.cu and imedgeG.cu programs, runs them, and observes their performance. It then repeats these experiments on four different GPU architecture families: Fermi, Kepler, Maxwell, and Pascal. The most noticeable difference in the Kepler SMX structure, as compared to Fermi, is the introduction of the double precision units (DPUs). Every Compute-Unified Device Architecture (CUDA) core is designed to work at a base clock frequency and a boost clock frequency. For example, the GTX Titan Z has a base clock frequency of 705 MHz and a boost clock frequency of 875 MHz, which is 24% higher.
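
The 24% figure follows directly from the two clock values quoted above. As a minimal sketch of that arithmetic, the helper below computes the relative gain of a boost clock over a base clock. The function name boost_gain_percent is hypothetical and is not taken from imflipG.cu or imedgeG.cu; it is plain C, so it would also compile as host code in a .cu file.

```c
#include <assert.h>

/* Percentage by which the boost clock exceeds the base clock.
   Both arguments are in MHz. The function name is hypothetical
   and is used here for illustration only. */
double boost_gain_percent(double base_mhz, double boost_mhz)
{
    assert(base_mhz > 0.0);  /* guard against division by zero */
    return (boost_mhz - base_mhz) / base_mhz * 100.0;
}
```

Calling boost_gain_percent(705.0, 875.0) with the GTX Titan Z values gives roughly 24.1, which rounds to the 24% stated above.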