Appendix 447
time, which is critical for real-time rendering environments. It can also be used as
an external sorter, where inputs are streamed from the disk and each core per-
forms the radix sort independently. The merging step would be executed on a
single core, but the I/O is purely sequential. The merger is very efficient; it took
6 milliseconds to merge one million items on a single core of a 2.4-GHz Q6600.
It can be further improved by simple loop unrolling since the number of tourna-
ments is always equal to the height of the tree.
This chapter is heavily influenced by research in cache-aware and cache-
oblivious sorting. Interested readers are encouraged to follow up with excellent
works like Brodal et al. [2008], where funnel sort is introduced. While this work
is independently conceived, a paper by Satish et al. [2009] shares a lot of the
same ideas and shows great results by carefully using radix sort to get great par-
allelism on a GPU.
Appendix
Radix sort is a refinement of distribution sort. If we have a list of n input bytes,
we need 256 bins of size n to distribute the inputs into, since in the worst case, all
the inputs can be the same byte value. Of course, we could have dynamically
grown each of the bins, but that would incur a large overhead in calls to the
memory allocator and also fragments our heap. The solution is a two-pass ap-
proach, where we allocate 256 counters and only count how many items belong
in each bin during the first pass. Next, we form the prefix sum within the array of
counters and use that to distribute each input digit to an output buffer of size n.
We have just discovered counting sort.
The most commonly used form of radix sort is called the least-significant-
digit sort. That is, we start from the least-significant digit and work ourselves
towards more significant digits. For a 32-bit integer, the most common practice is
to use radix-256—i.e., we break the integer into four 8-bit digits and perform the
sort in four passes. For each pass, we perform the above counting sort. Since
counting sort is a stable sort, the ordering of a pass is preserved in the later
passes.
Acknowledgements
We would like to thank Michael Herf for his permission to include his radix sort code and
for inventing the bit hack for floating-point values. We would also like to thank Warren
Hunt for supplying his excellent radix sort implementation for another project that uses
negative indexing and the improved version of the bit hack.
448 28.ACacheAwareHybridSorter
References
[Brodal et al. 2008] Gerth Stølting Brodal, Rolf Fagerberg, and Kristoffer Vinther. “En-
gineering a Cache-Oblivious Sorting Algorithm.” Journal of Experimental Algo-
rithms 12 (June 2008).
[Herf 2001] Michael Herf. “Radix Tricks.” 2001. Available at http://www.stereopsis.
com/radix.html.
[Knuth 1973] Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and
Searching. Reading, MA: Addison-Wesley, 1973.
[Garanzha and Loop 2010] Kirill Garanzha and Charles Loop. “Fast Ray Sorting and
Breadth-First Packet Traversal for GPU Ray Tracing.” Computer Graphics Forum
29:2 (2010).
[Lin 2000] Ming C. Lin. “Fast Proximity Queries for Large Game Environments.” Game
Developers Conference Course Notes. Available at http://www.cs.unc.edu/~lin/
gdc2000_files/frame.htm.
[Marin 1997] Mauricio Marin. “Priority Queues Operations on EREW-PRAM.” Pro-
ceedings of Euro-Par ’97 Parallel Processing. Springer, 1997.
[Nyberg et al. 1995] Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, and Dave
Lomet. “AlphaSort: A Cache Sensitive Parallel External Sort.” The VLDB Journal
4:4 (October 1995), pp. 603–627.
[Patney et al. 2010] Anjul Patney, Stanley Tzeng, and John D. Owens. “Fragment-
Parallel Composite and Filter.” Computer Graphics Forum 29:4 (June 2010), pp.
1251–1258.
[Satish et al. 2009] Nadathur Satish, Mark Harris, and Michael Garland. “Designing Ef-
ficient Sorting Algorithms for Manycore GPUs.” Proceedings of the 2009 IEEE
International Symposium on Parallel & Distributed Processing.
[Terdiman 2000] Pierre Terdiman. “Radix Sort Revisited.” April 1, 2000. Available at
http://codercorner.com/RadixSortRevisited.htm.
[Wright 2006] Christopher Wright. “Using CPUID for SIMD Detection.” 2006. Availa-
ble at http://softpixel.com/~cwright/programming/simd/cpuid.php.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.220.88.62