Chapter 1. Introduction
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Parallel Programming Languages and Models
Chapter 2. History of GPU Computing
2.1 Evolution of Graphics Pipelines
2.2 GPGPU: An Intermediate Step
References and Further Reading
Chapter 3. Introduction to Data Parallelism and CUDA C
3.4 Device Global Memory and Data Transfer
3.5 Kernel Functions and Threading
Chapter 4. Data-Parallel Execution Model
4.2 Mapping Threads to Multidimensional Data
4.3 Matrix–Matrix Multiplication—A More Complex Kernel
4.4 Synchronization and Transparent Scalability
4.5 Assigning Resources to Blocks
4.6 Querying Device Properties
4.7 Thread Scheduling and Latency Tolerance
Chapter 5. CUDA Memories
5.1 Importance of Memory Access Efficiency
5.3 A Strategy for Reducing Global Memory Traffic
5.4 A Tiled Matrix–Matrix Multiplication Kernel
5.5 Memory as a Limiting Factor to Parallelism
Chapter 6. Performance Considerations
6.1 Warps and Thread Execution
6.3 Dynamic Partitioning of Execution Resources
6.4 Instruction Mix and Thread Granularity
Chapter 7. Floating-Point Considerations
7.3 Special Bit Patterns and Precision in IEEE Format
7.4 Arithmetic Accuracy and Rounding
Chapter 8. Parallel Patterns: Convolution: With an Introduction to Constant Memory and Caches
8.2 1D Parallel Convolution—A Basic Algorithm
8.3 Constant Memory and Caching
8.4 Tiled 1D Convolution with Halo Elements
8.5 A Simpler Tiled 1D Convolution—General Caching
Chapter 9. Parallel Patterns: Prefix Sum: An Introduction to Work Efficiency in Parallel Algorithms
9.3 Work Efficiency Considerations
9.4 A Work-Efficient Parallel Scan
9.5 Parallel Scan for Arbitrary-Length Inputs
Chapter 10. Parallel Patterns: Sparse Matrix–Vector Multiplication: An Introduction to Compaction and Regularization in Parallel Algorithms
10.3 Padding and Transposition
10.4 Using a Hybrid Format to Control Padding
10.5 Sorting and Partitioning for Regularization
Chapter 11. Application Case Study: Advanced MRI Reconstruction
Chapter 12. Application Case Study: Molecular Visualization and Analysis
12.2 A Simple Kernel Implementation
12.3 Thread Granularity Adjustment
Chapter 13. Parallel Programming and Computational Thinking
13.1 Goals of Parallel Computing
Chapter 14. An Introduction to OpenCL™
14.5 Device Management and Kernel Launch
14.6 Electrostatic Potential Map in OpenCL
Chapter 15. Parallel Programming with OpenACC
15.5 Future Directions of OpenACC
Chapter 16. Thrust: A Productivity-Oriented Library for CUDA
Chapter 17. CUDA FORTRAN
17.1 CUDA FORTRAN and CUDA C Differences
17.2 A First CUDA FORTRAN Program
17.3 Multidimensional Arrays in CUDA FORTRAN
17.4 Overloading Host/Device Routines With Generic Interfaces
17.5 Calling CUDA C via iso_c_binding
17.6 Kernel Loop Directives and Reduction Operations
17.8 Asynchronous Data Transfers
17.9 Compilation and Profiling
17.10 Calling Thrust from CUDA FORTRAN
Chapter 18. An Introduction to C++ AMP
18.2 Details of the C++ AMP Execution Model
18.5 C++ AMP Graphics Features
Chapter 19. Programming a Heterogeneous Computing Cluster
19.4 MPI Point-to-Point Communication Types
19.5 Overlapping Computation and Communication
19.6 MPI Collective Communication
Chapter 20. CUDA Dynamic Parallelism
20.2 Dynamic Parallelism Overview
Chapter 21. Conclusion and Future Outlook
21.3 Kernel Execution Control Evolution
Appendix A. Matrix Multiplication Host-Only Version Source Code
Appendix B. GPU Compute Capabilities
B.1 GPU Compute Capability Tables