ABSTRACT

Sorting is one of the most basic building blocks of many algorithms. In graphics, a sort is commonly used for depth-sorting for transparency [Patney et al. 2010] or to get better Z-cull performance. It is a key part of collision detection [Lin 2000]. Dynamic state sorting is critical for minimizing state changes in a scene graph renderer. Recently, Garanzha and Loop [2010] demonstrated that it is highly profitable to buffer and sort rays within a ray tracer to extract better coherency, which is key to high GPU performance. Ray sorting is one example of a well-known practice in scientific computing, where parallel sorts are used to handle irregular communication patterns and workloads.

Well, can't we just use the standard template library (STL) sort? We can, but we can also do better. How about up to six times better? Quicksort is probably the best comparison-based sort and, on average, works well. However, its worst-case behavior can be O(n²), and its memory access pattern is not very cache-friendly. Radix sort is the only practical O(n) sort out there (see the appendix for a quick overview of radix sort). Its memory access pattern during the first pass, where we are building counts, is very cache-friendly. However, the final output phase uses random scatter writes. Is there a way for us to use radix sort but minimize its weaknesses?

Modern parallel external sort (e.g., AlphaSort [Nyberg et al. 1995]) almost always uses a two-pass approach of in-memory sort followed by a merge. Each item only has to be read from disk twice. More importantly, the merge phase is very I/O friendly since the access is purely sequential. Substitute "disk" with "main memory" and "memory" with "cache," and the same considerations apply: we want to minimize reads from main memory, and we also love the sequential access pattern of the merge phase.
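
To make the two-pass idea concrete, here is a minimal sketch, assuming cache-sized chunks and using std::sort as a stand-in for an in-cache sort (the chunk size and the function name chunkedSortMerge are illustrative, not part of the actual implementation described later). Each chunk is sorted independently, and the sorted chunks are then combined with a k-way merge whose reads and writes are purely sequential.

    // Sketch: sort cache-sized chunks, then merge them sequentially.
    #include <algorithm>
    #include <cstdint>
    #include <queue>
    #include <vector>

    // chunkSize would be chosen so that one chunk fits in the cache (assumption).
    void chunkedSortMerge(std::vector<uint32_t>& data, size_t chunkSize)
    {
        const size_t n = data.size();

        // Pass 1: sort each chunk in place (stand-in for an in-cache radix sort).
        for (size_t begin = 0; begin < n; begin += chunkSize)
        {
            size_t end = std::min(begin + chunkSize, n);
            std::sort(data.begin() + begin, data.begin() + end);
        }

        // Pass 2: k-way merge of the sorted chunks; each chunk is read
        // sequentially and the output buffer is written sequentially.
        struct Cursor { size_t pos, end; };
        auto greater = [&](const Cursor& a, const Cursor& b)
        { return data[a.pos] > data[b.pos]; };
        std::priority_queue<Cursor, std::vector<Cursor>, decltype(greater)> heap(greater);

        for (size_t begin = 0; begin < n; begin += chunkSize)
            heap.push({ begin, std::min(begin + chunkSize, n) });

        std::vector<uint32_t> out;
        out.reserve(n);
        while (!heap.empty())
        {
            Cursor c = heap.top();
            heap.pop();
            out.push_back(data[c.pos]);
            if (++c.pos < c.end)
                heap.push(c);
        }
        data.swap(out);
    }

The point of the sketch is only the access pattern: the first pass touches main memory once per element while working within the cache, and the merge pass streams through memory sequentially, which is exactly the property we want to preserve.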