The pursuit of better performance

Our implementation already performs the optimal number of operations. However, we can optimize it further by reducing shared memory bank conflicts: at certain points, several CUDA threads access the same memory bank at the same time, and those accesses are serialized. NVIDIA's GPU Gems 3 covers prefix sum (scan) in Chapter 39, Parallel Prefix Sum (Scan) with CUDA (https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html), and addresses this issue in Section 39.2.3, Avoiding Bank Conflicts. You can adapt that solution to our implementation, but if you do, set NUM_BANKS to 32 and LOG_NUM_BANKS to 5. The chapter assumes 16 banks, whereas current CUDA architectures have 32 shared memory banks.
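To see why the padding helps, the following host-side sketch models the bank mapping with the updated constants. It uses the simple form of the CONFLICT_FREE_OFFSET macro from GPU Gems 3 and assumes 4-byte shared memory words, so that word index modulo 32 gives the bank. The helpers bank_of and distinct_banks are hypothetical names introduced only for this illustration; they are not part of our scan implementation or of any CUDA API.

```cpp
#include <set>

// Values for GPUs with 32 shared memory banks
// (GPU Gems 3 uses 16 and 4 for the hardware of its time).
#define NUM_BANKS 32
#define LOG_NUM_BANKS 5
// Simple padding scheme: skip one extra word after every NUM_BANKS elements.
#define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)

// Hypothetical helper: bank holding the 4-byte word at index idx.
static int bank_of(int idx) { return idx % NUM_BANKS; }

// Hypothetical helper: number of distinct banks touched when the 32 threads
// of one warp access elements that are `stride` words apart.
// 32 distinct banks means the access is conflict-free.
static int distinct_banks(int stride, bool padded) {
    std::set<int> banks;
    for (int tid = 0; tid < NUM_BANKS; ++tid) {
        int idx = tid * stride;
        if (padded) {
            idx += CONFLICT_FREE_OFFSET(idx);  // shift index past the padding words
        }
        banks.insert(bank_of(idx));
    }
    return static_cast<int>(banks.size());
}
```

In this model, distinct_banks(32, false) returns 1, meaning all 32 threads hit the same bank, while distinct_banks(32, true) returns 32. In a kernel, the same macro would be applied to every shared memory index, and the shared memory allocation would need to grow to hold the padding words.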
