ABSTRACT

This chapter focuses on the parts of Graphics Processing Unit (GPU) programming that involve both the CPU and the GPU: the launch dimensions of a GPU kernel, Peripheral Component Interconnect (PCI) Express bandwidth and its impact on overall performance, and the memory bandwidth of the CPU and the GPU. It presents example output of imflipG.cu, run with the 'V' option (vertical flip) and a block size of 256 on the astronaut.bmp image, and examines the kernel execution time in particular. A grid is a collection of blocks arranged in one or two dimensions. Blocks are the unit element of launch: the GPU programmer conceptualizes a program as one giant task chopped up into blocks that can execute independently. Programmers who study Compute Unified Device Architecture (CUDA) assembly language will find that a thread is called a lane in the Parallel Thread Execution (PTX) language.
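To make the idea of chopping a task into blocks concrete, the following is a minimal sketch of the host-side arithmetic behind a one-dimensional launch. It assumes one thread per work item. The helper names blocks_needed and idle_threads are hypothetical and do not come from imflipG.cu, and the work-item counts in the usage note are made-up illustrations, not properties of astronaut.bmp.

```c
#include <stddef.h>

/* Hypothetical helper (not taken from imflipG.cu): the number of blocks
   required so that blocks * threads_per_block covers every work item.
   This is a ceiling division; threads_per_block must be nonzero. */
static size_t blocks_needed(size_t total_threads, size_t threads_per_block)
{
    return (total_threads + threads_per_block - 1) / threads_per_block;
}

/* Hypothetical helper: threads in the final block that have no work item,
   which a kernel would typically skip with a bounds check. */
static size_t idle_threads(size_t total_threads, size_t threads_per_block)
{
    size_t blocks = blocks_needed(total_threads, threads_per_block);
    return blocks * threads_per_block - total_threads;
}
```

With a block size of 256, blocks_needed(1000, 256) evaluates to 4 and idle_threads(1000, 256) to 24, whereas 1024 work items fill exactly 4 blocks with no idle threads. Rounding up rather than down is what guarantees that every work item is assigned to some block.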