ABSTRACT

This chapter presents optimized solutions for non-separable two-dimensional (2D), three-dimensional (3D), and four-dimensional (4D) convolution in CUDA, using single-precision floats and fast shared memory. Filtering is an important step in many image processing applications, such as image denoising, image registration, and image segmentation. A substantial body of work has addressed the acceleration of filtering using graphics processing units (GPUs). The CUDA SDK contains two examples of separable 2D convolution, one example of fast Fourier transform (FFT)-based filtering in 2D, and a single example of separable 3D convolution. While separable filters are less computationally demanding, a number of image processing operations can only be performed using non-separable filters. The chapter also covers a CUDA implementation of non-separable 2D convolution that uses texture memory, since the texture memory cache can speed up local reads. The implementations presented were written with CUDA 5.0 and are optimized for the Nvidia GTX 680 graphics card.
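
To pin down what non-separable 2D convolution computes, the following is a minimal CPU reference sketch in plain C++. It is not the chapter's CUDA implementation. The function name `convolve2d`, the row-major layout, and the zero-padding border policy are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for non-separable 2D convolution (illustrative sketch only).
// image:  row-major, height x width.
// filter: row-major, fsize x fsize, with fsize odd.
// Pixels outside the image are treated as zero (assumed border policy).
std::vector<float> convolve2d(const std::vector<float>& image, int width, int height,
                              const std::vector<float>& filter, int fsize)
{
    assert(fsize % 2 == 1);
    assert(image.size() == static_cast<std::size_t>(width) * height);
    assert(filter.size() == static_cast<std::size_t>(fsize) * fsize);

    const int r = fsize / 2;  // filter radius
    std::vector<float> out(image.size(), 0.0f);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            // Visit every filter coefficient: fsize * fsize multiply-adds per
            // pixel, since the filter cannot be split into two 1D passes.
            for (int fy = -r; fy <= r; ++fy) {
                for (int fx = -r; fx <= r; ++fx) {
                    const int iy = y - fy;  // convolution flips the filter
                    const int ix = x - fx;
                    if (iy < 0 || iy >= height || ix < 0 || ix >= width) {
                        continue;  // zero padding outside the image
                    }
                    sum += image[iy * width + ix] *
                           filter[(fy + r) * fsize + (fx + r)];
                }
            }
            out[y * width + x] = sum;
        }
    }
    return out;
}
```

Convolving a 3 x 3 image that contains a single 1 at its center with a 3 x 3 filter returns the filter coefficients unchanged, which is a quick sanity check on the index flipping. A separable filter of the same size could instead be applied as two 1D passes, at roughly 2 * fsize rather than fsize * fsize multiply-adds per pixel.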