Chapter 1. Introduction
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Parallel Programming Languages and Models
Chapter 2. History of GPU Computing
2.1 Evolution of Graphics Pipelines
2.2 GPGPU: An Intermediate Step
References and Further Reading
Chapter 3. Introduction to Data Parallelism and CUDA C
3.4 Device Global Memory and Data Transfer
3.5 Kernel Functions and Threading
Chapter 4. Data-Parallel Execution Model
4.2 Mapping Threads to Multidimensional Data
4.3 Matrix–Matrix Multiplication—A More Complex Kernel
4.4 Synchronization and Transparent Scalability
4.5 Assigning Resources to Blocks
4.6 Querying Device Properties
4.7 Thread Scheduling and Latency Tolerance
Chapter 5. CUDA Memories
5.1 Importance of Memory Access Efficiency
5.3 A Strategy for Reducing Global Memory Traffic
5.4 A Tiled Matrix–Matrix Multiplication Kernel
5.5 Memory as a Limiting Factor to Parallelism
Chapter 6. Performance Considerations
6.1 Warps and Thread Execution
6.3 Dynamic Partitioning of Execution Resources
6.4 Instruction Mix and Thread Granularity
Chapter 7. Floating-Point Considerations
7.3 Special Bit Patterns and Precision in IEEE Format
7.4 Arithmetic Accuracy and Rounding
Chapter 8. Parallel Patterns: Convolution: With an Introduction to Constant Memory and Caches
8.2 1D Parallel Convolution—A Basic Algorithm
8.3 Constant Memory and Caching
8.4 Tiled 1D Convolution with Halo Elements
8.5 A Simpler Tiled 1D Convolution—General Caching
Chapter 9. Parallel Patterns: Prefix Sum: An Introduction to Work Efficiency in Parallel Algorithms
9.3 Work Efficiency Considerations
9.4 A Work-Efficient Parallel Scan
9.5 Parallel Scan for Arbitrary-Length Inputs
Chapter 10. Parallel Patterns: Sparse Matrix–Vector Multiplication: An Introduction to Compaction and Regularization in Parallel Algorithms
10.3 Padding and Transposition
10.4 Using a Hybrid Format to Control Padding
10.5 Sorting and Partitioning for Regularization
Chapter 11. Application Case Study: Advanced MRI Reconstruction
Chapter 12. Application Case Study: Molecular Visualization and Analysis
12.2 A Simple Kernel Implementation
12.3 Thread Granularity Adjustment
Chapter 13. Parallel Programming and Computational Thinking
13.1 Goals of Parallel Computing
Chapter 14. An Introduction to OpenCL™
14.5 Device Management and Kernel Launch
14.6 Electrostatic Potential Map in OpenCL
Chapter 15. Parallel Programming with OpenACC
15.5 Future Directions of OpenACC
Chapter 16. Thrust: A Productivity-Oriented Library for CUDA
Chapter 17. CUDA FORTRAN
17.1 CUDA FORTRAN and CUDA C Differences
17.2 A First CUDA FORTRAN Program
17.3 Multidimensional Arrays in CUDA FORTRAN
17.4 Overloading Host/Device Routines With Generic Interfaces
17.5 Calling CUDA C via iso_c_binding
17.6 Kernel Loop Directives and Reduction Operations
17.8 Asynchronous Data Transfers
17.9 Compilation and Profiling
17.10 Calling Thrust from CUDA FORTRAN
Chapter 18. An Introduction to C++ AMP
18.2 Details of the C++ AMP Execution Model
18.5 C++ AMP Graphics Features
Chapter 19. Programming a Heterogeneous Computing Cluster
19.4 MPI Point-to-Point Communication Types
19.5 Overlapping Computation and Communication
19.6 MPI Collective Communication
Chapter 20. CUDA Dynamic Parallelism
20.2 Dynamic Parallelism Overview
Chapter 21. Conclusion and Future Outlook
21.3 Kernel Execution Control Evolution
Appendix A. Matrix Multiplication Host-Only Version Source Code
Appendix B. GPU Compute Capabilities
B.1 GPU Compute Capability Tables