Efficient Parallel Scan Algorithms for Manycore GPUs
FIGURE 19.1: A simple memory access pattern in which each processor reads a contiguous bounded neighborhood of input (each neighborhood has a different hatching pattern) and produces one output item.
We have witnessed a phenomenal increase in the computational resources of graphics processing units (GPUs) over the last few years. The highest-performing graphics processors from both ATI and NVIDIA already have billions of transistors, resulting in more than a teraflop of peak processing power. This processing power comes from hundreds of processing cores, all on the same chip.