Questions

  1. In the atomic operations example, try changing the grid size from 1 to 2 in the kernel launch while leaving the block size at 100. If this gives you the wrong output for add_out (anything other than 200), why is it wrong, given that atomicExch is thread-safe?
  2. In the atomic operations example, try removing __syncthreads and running the kernel with the original launch parameters of grid size 1 and block size 100. If this gives you the wrong output for add_out (anything other than 100), why is it wrong, given that atomicExch is thread-safe?
  3. Why do we not have to use __syncthreads to synchronize across a block of 32 threads or fewer?
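As an aid for experimenting with the first two questions, here is a minimal pure-Python sketch (not the book's PyCUDA code) of how the atomic kernel's operations can interleave across two blocks. Each block is assumed to perform atomicExch(add_out, 0), a __syncthreads barrier, and then one atomicAdd(add_out, 1) per thread; the barrier orders threads within a block only, so the two blocks may be scheduled in any order relative to each other:

```python
BLOCK_SIZE = 100

def run(schedule):
    """Apply a sequence of ('exch', v) / ('add', v) ops to add_out."""
    add_out = 0
    for op, v in schedule:
        if op == "exch":
            add_out = v       # atomicExch: thread-safe, but overwrites
        else:
            add_out += v      # atomicAdd: thread-safe increment
    return add_out

def block_ops():
    # One block: exchange-to-zero, then BLOCK_SIZE atomic increments.
    return [("exch", 0)] + [("add", 1)] * BLOCK_SIZE

# Lucky schedule: both blocks hit atomicExch before either starts adding.
lucky = [("exch", 0), ("exch", 0)] + [("add", 1)] * (2 * BLOCK_SIZE)

# Legal but unlucky schedule: block 1 is scheduled only after block 0
# finishes, so its atomicExch wipes out block 0's 100 increments.
unlucky = block_ops() + block_ops()

print(run(lucky))    # 200
print(run(unlucky))  # 100
```

Every individual operation here is "thread-safe" in the sense that it happens in one indivisible step, which is exactly why the wrong result in the second schedule has to come from ordering, not from a torn update.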


  1. We saw that sum_ker is around five times faster than PyCUDA's sum operation for random-valued arrays of length 640,000 (10000*2*32). If you try adding a zero to the end of this number (that is, multiply it by 10), you'll notice that the performance drops to the point where sum_ker is only about 1.5 times as fast as PyCUDA's sum. If you add another zero to the end of that number, you'll notice that sum_ker is only 75% as fast as PyCUDA's sum. Why do you think this is the case? How can we improve sum_ker to be faster on larger arrays?
  2. Which algorithm performs more addition operations (counting both calls to the C + operator and each atomicAdd as a single operation): sum_ker or PyCUDA's sum?
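As a starting point for the operation count, here is a back-of-the-envelope sketch. It assumes (hedged: this is a common warp-shuffle reduction layout, not necessarily the book's exact kernel) that sum_ker reduces each 32-thread warp with log2(32) = 5 rounds of lockstep shuffle-adds, in which all 32 lanes perform an add every round, followed by one atomicAdd per warp, while a work-efficient tree reduction of N values needs about N - 1 pairwise additions in total:

```python
import math

N = 640_000          # 10000 * 2 * 32, the array length from question 1
WARP = 32

# Assumed sum_ker structure: every lane of a warp adds once in each of
# log2(32) = 5 shuffle rounds, then each warp does a single atomicAdd.
rounds = int(math.log2(WARP))              # 5
warps = N // WARP                          # 20000
sum_ker_adds = warps * (WARP * rounds + 1)

# A work-efficient tree reduction needs about N - 1 additions in total.
tree_adds = N - 1

print(sum_ker_adds)  # 3220000
print(tree_adds)     # 639999
```

The redundancy comes from the lockstep execution: in each shuffle round all 32 lanes add, even though half the partial sums produced are never used again.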