Figure 4.1
Figure 4.2
Architecture | Intel Nehalem X5550 | NVIDIA T10P C1060 | NVIDIA GT200 GTX 285 | NVIDIA Fermi C2050
---|---|---|---|---
Clock (GHz) | 2.66 | 1.44 | 1.47 | 1.15
Sockets | 2 | 1 | 1 | 1
Cores/socket (SMs/GPU) | 4 | (30) | (30) | (14)
Peak GFLOPS (single) | 170.6 | 933 | 1060 | 1030
Peak GFLOPS (double) | 85.3 | 78 | 88 | 515
Peak GB/s | 51.2 | 102 | 159 | 144
Watts (sockets only) | 200 | 200 | 204 | 247
64-bit GFLOPS/watt | 0.43 | 0.39 | 0.43 | 2.09
32-bit GFLOPS/watt | 0.85 | 4.67 | 5.20 | 4.17
Metric | Description
---|---
active warps/active cycle | The average number of warps that are active on a multiprocessor per cycle. This is calculated as: (active warps)/(active cycles)
Divergent branches (%) | The percentage of branches, out of all branches in the kernel, that cause divergence within a warp. Divergence within a warp serializes execution. This is calculated as: 100 * (divergent branch)/(divergent branch + branch)
Control flow divergence (%) | The percentage of thread instructions that were not executed by all threads in the warp, hence causing divergence. This should be as low as possible. This is calculated as: 100 * ((32 * instructions executed) - thread instructions executed)/(32 * instructions executed)
Achieved kernel occupancy | The actual occupancy of the kernel, based on the number of warps executing per cycle on the SM. It is the ratio of active warps to active cycles, divided by the maximum number of warps that can execute on an SM (48 on Fermi). This is calculated as: (active warps/active cycles)/48
Architecture | Maximum Occupancy | Maximum Registers | Increase
---|---|---|---
GF100 | 20 at 100% occupancy | 63 at 33% occupancy | 3x more registers per thread
GT200 | 16 at 100% occupancy | ≈128 at 12.5% occupancy | 8x more registers per thread
Figure 4.3
Thread 1 | Thread 2 | Thread 3 | Thread 4 |
---|---|---|---|
x = x + c | y = y + c | z = z + c | w = w + c |
x = x + b | y = y + b | z = z + b | w = w + b |
x = x + a | y = y + a | z = z + a | w = w + a |
Thread |  |
---|---|---
Instructions → | w = w + b | Four independent operations
 | z = z + b |
 | y = y + b |
 | x = x + b |
 | w = w + a | Four independent operations
 | z = z + a |
 | y = y + a |
 | x = x + a |
Compute Generation | GPU Architecture | Latency (Cycles) | Throughput (Cores/SM) | Parallelism (Operations/SM) |
---|---|---|---|---|
Compute 1.x | G80-GT200 | ≈24 | 8 | ≈192 |
Compute 2.0 | GF100 | ≈18 | 32 | ≈576 |
Compute 2.1 | GF104 | ≈18 | 48 | ≈864 |
Metric | Description
---|---
Instruction throughput | The ratio of the achieved instruction rate to the peak single-issue instruction rate. The achieved instruction rate is calculated using the profiler counter "instructions"; the peak instruction rate is calculated from the GPU clock speed. When instruction dual-issue comes into play, this ratio can exceed 1. This is calculated as: (instructions)/(gpu_time * clock_frequency)
Ideal instruction/byte ratio | The ratio of the peak instruction throughput to the peak memory throughput of the CUDA device. This is a property of the device and is independent of the kernel.
Instruction/byte | The ratio of the total number of instructions issued by the kernel to the total number of bytes the kernel accesses from global memory. If this ratio is greater than the ideal instruction/byte ratio, the kernel is compute bound; if it is less, the kernel is memory bound. This is calculated as: (32 * instructions issued * #SM)/(32 * (l2 read requests + l2 write requests + l2 read texture requests))
IPC (instructions per cycle) | The number of instructions issued per cycle. This should be compared to the maximum IPC possible for the device. The range provided is for single-precision floating-point instructions. This is calculated as: (instructions issued)/(active cycles)
Replayed instructions (%) | The percentage of instructions replayed during kernel execution. Replayed instructions are the difference between the number of instructions actually issued by the hardware and the number of instructions to be executed by the kernel. Ideally, this should be zero. This is calculated as: 100 * (instructions issued - instructions executed)/(instructions issued)
 | Latency | Throughput | Parallelism
---|---|---|---
Arithmetic | ≈18 cycles | 32 ops/cycle | ≈576 operations
Memory | <800 cycles | <177 GB/s | <100 KB
Figure 4.4