
Explore different GPU programming methods using libraries and directives, such as OpenACC, with extensions to languages such as C, C++, and Python

Key Features

  • Learn parallel programming principles and practices and performance analysis in GPU computing
  • Get to grips with distributed multi-GPU programming and other approaches to GPU programming
  • Understand how GPU acceleration in deep learning models can improve their performance

Book Description

Compute Unified Device Architecture (CUDA) is NVIDIA's GPU computing platform and application programming interface. It's designed to work with programming languages such as C, C++, and Python. With CUDA, you can leverage a GPU's parallel computing power for a range of high-performance computing applications in the fields of science, healthcare, and deep learning.
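
To give a flavor of what this looks like in code, below is a minimal, hypothetical CUDA C++ sketch (not taken from the book's own examples) of a kernel that adds two vectors, with one GPU thread handling one element; the file, variable, and kernel names are illustrative only:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each GPU thread adds one pair of elements.
    __global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }

    int main() {
        const int n = 1 << 20;                 // one million elements
        const size_t bytes = n * sizeof(float);

        // Unified memory is visible to both the CPU and the GPU.
        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        // Launch enough 256-thread blocks to cover all n elements.
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        vectorAdd<<<blocks, threads>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);           // expected: 3.000000
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Compiled with NVIDIA's nvcc compiler (for example, nvcc vector_add.cu), the same source file mixes ordinary C++ host code with a __global__ kernel that runs on the GPU, which is the programming model the early chapters unpack step by step.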

Learn CUDA Programming will help you learn GPU parallel programming and understand its modern applications. In this book, you'll discover CUDA programming approaches for modern GPU architectures. Not only will you be guided through GPU features, tools, and APIs, but you'll also learn how to analyze performance using sample parallel programming algorithms. This book will help you optimize the performance of your applications by giving insights into CUDA programming platforms with various libraries, compiler directives (OpenACC), and other languages. As you progress, you'll learn how additional computing power can be harnessed using multiple GPUs in a single box or across multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

By the end of this CUDA book, you'll be equipped with the skills you need to integrate the power of GPU computing in your applications.

What you will learn

  • Understand general GPU operations and programming patterns in CUDA
  • Uncover the difference between GPU programming and CPU programming
  • Analyze GPU application performance and implement optimization strategies
  • Explore GPU programming, profiling, and debugging tools
  • Grasp parallel programming algorithms and how to implement them
  • Scale GPU-accelerated applications across multiple GPUs and multiple nodes
  • Delve into GPU programming platforms with accelerated libraries, Python, and OpenACC
  • Gain insights into deep learning accelerators in CNNs and RNNs using GPUs

Who this book is for

This beginner-level book is for programmers who want to delve into parallel computing, become part of the high-performance computing community, and build modern applications. Basic C and C++ programming experience is assumed. For deep learning enthusiasts, this book covers Python interoperability, DL libraries, and practical examples of performance estimation.

Downloading the example code for this ebook: You can download the example code files for this ebook from GitHub at the following link: https://github.com/PacktPublishing/-Hands-On-GPU-programming-with-CUDA. If you require support, please email: [email protected]

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Learn CUDA Programming
  3. Dedication
  4. About Packt
    1. Why subscribe?
  5. Contributors
    1. About the authors
    2. About the reviewers
    3. Packt is searching for authors like you
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  7. Introduction to CUDA Programming
    1. The history of high-performance computing
      1. Heterogeneous computing
      2. Programming paradigm
      3. Low latency versus higher throughput
      4. Programming approaches to GPU
    2. Technical requirements 
    3. Hello World from CUDA
      1. Thread hierarchy
      2. GPU architecture
    4. Vector addition using CUDA
      1. Experiment 1 – creating multiple blocks
      2. Experiment 2 – creating multiple threads
      3. Experiment 3 – combining blocks and threads
      4. Why bother with threads and blocks?
      5. Launching kernels in multiple dimensions
    5. Error reporting in CUDA
    6. Data type support in CUDA
    7. Summary
  8. CUDA Memory Management
    1. Technical requirements 
    2. NVIDIA Visual Profiler
    3. Global memory/device memory
      1. Vector addition on global memory
      2. Coalesced versus uncoalesced global memory access
      3. Memory throughput analysis
    4. Shared memory
      1. Matrix transpose on shared memory
      2. Bank conflicts and their effect on shared memory
    5. Read-only data/cache
      1. Computer vision – image scaling using texture memory
    6. Registers in GPU
    7. Pinned memory
      1. Bandwidth test – pinned versus pageable
    8. Unified memory
      1. Understanding unified memory page allocation and transfer
      2. Optimizing unified memory with warp per page
      3. Optimizing unified memory using data prefetching
    9. GPU memory evolution
      1. Why do GPUs have caches?
    10. Summary
  9. CUDA Thread Programming
    1. Technical requirements
    2. CUDA threads, blocks, and the GPU
      1. Exploiting a CUDA block and warp
    3. Understanding CUDA occupancy
      1. Setting NVCC to report GPU resource usage
        1. Settings for Linux
        2. Settings for Windows
      2. Analyzing the optimal occupancy using the Occupancy Calculator
      3. Occupancy tuning – bounding register usage
      4. Getting the achieved occupancy from the profiler
    4. Understanding parallel reduction
      1. Naive parallel reduction using global memory
      2. Reducing kernels using shared memory
      3. Writing performance measurement code
      4. Performance comparison for the two reductions – global and shared memory
    5. Identifying the application's performance limiter
      1. Finding the performance limiter and optimization
    6. Minimizing the CUDA warp divergence effect
      1. Determining divergence as a performance bottleneck
        1. Interleaved addressing
        2. Sequential addressing
    7. Performance modeling and balancing the limiter
      1. The Roofline model
      2. Maximizing memory bandwidth with grid-strided loops
      3. Balancing the I/O throughput
    8. Warp-level primitive programming
      1. Parallel reduction with warp primitives
    9. Cooperative Groups for flexible thread handling
      1. Cooperative Groups in a CUDA thread block
      2. Benefits of Cooperative Groups
        1. Modularity
        2. Explicit grouped threads' operation and race condition avoidance
        3. Dynamic active thread selection
        4. Applying to the parallel reduction
        5. Cooperative Groups to avoid deadlock
    10. Loop unrolling in the CUDA kernel
    11. Atomic operations
    12. Low/mixed precision operations
      1. Half-precision operation
      2. Dot product operations and accumulation for 8-bit integers and 16-bit data (DP4A and DP2A)
      3. Measuring the performance
    13. Summary
  10. Kernel Execution Model and Optimization Strategies
    1. Technical requirements
    2. Kernel execution with CUDA streams
      1. The usage of CUDA streams
      2. Stream-level synchronization
      3. Working with the default stream
    3. Pipelining the GPU execution
      1. Concept of GPU pipelining
      2. Building a pipelining execution
    4. The CUDA callback function
    5. CUDA streams with priority
      1. Priorities in CUDA
      2. Stream execution with priorities
    6. Kernel execution time estimation using CUDA events
      1. Using CUDA events
      2. Multiple stream estimation
    7. CUDA dynamic parallelism
      1. Understanding dynamic parallelism
      2. Usage of dynamic parallelism
      3. Recursion
    8. Grid-level cooperative groups
      1. Understanding grid-level cooperative groups
      2. Usage of grid_group
    9. CUDA kernel calls with OpenMP
      1. OpenMP and CUDA calls
      2. CUDA kernel calls with OpenMP
    10. Multi-Process Service
      1. Introduction to Message Passing Interface
      2. Implementing an MPI-enabled application
      3. Enabling MPS
      4. Profiling an MPI application and understanding MPS operation
    11. Kernel execution overhead comparison
      1. Implementing three types of kernel executions
      2. Comparison of three executions
    12. Summary 
  11. CUDA Application Profiling and Debugging
    1. Technical requirements
    2. Profiling focused target ranges in GPU applications
      1. Limiting the profiling target in code
      2. Limiting the profiling target with time or GPU
    3. Profiling with NVTX
    4. Visual profiling against the remote machine
    5. Debugging a CUDA application with CUDA error
    6. Asserting local GPU values using CUDA assert
    7. Debugging a CUDA application with Nsight Visual Studio Edition
    8. Debugging a CUDA application with Nsight Eclipse Edition
    9. Debugging a CUDA application with CUDA-GDB
      1. Breakpoints of CUDA-GDB
      2. Inspecting variables with CUDA-GDB
        1. Listing kernel functions
        2. Variables investigation
    10. Runtime validation with CUDA-memcheck
      1. Detecting memory out of bounds
      2. Detecting other memory errors
    11. Profiling GPU applications with Nsight Systems
    12. Profiling a kernel with Nsight Compute
      1. Profiling with the CLI
      2. Profiling with the GUI
        1. Performance analysis report
        2. Baseline compare
        3. Source view
    13. Summary
  12. Scalable Multi-GPU Programming
    1. Technical requirements 
    2. Solving a linear equation using Gaussian elimination
      1. Single GPU hotspot analysis of Gaussian elimination
    3. GPUDirect peer to peer
      1. Single node – multi-GPU Gaussian elimination
    4. Brief introduction to MPI
    5. GPUDirect RDMA
      1. CUDA-aware MPI
      2. Multinode – multi-GPU Gaussian elimination
    6. CUDA streams
      1. Application 1 – using multiple streams to overlap data transfers with kernel execution
      2. Application 2 – using multiple streams to run kernels on multiple devices
    7. Additional tricks
      1. Benchmarking an existing system with an InfiniBand network card
      2. NVIDIA Collective Communication Library (NCCL)
        1. Collective communication acceleration using NCCL
    8. Summary
  13. Parallel Programming Patterns in CUDA
    1. Technical requirements
    2. Matrix multiplication optimization
      1. Implementation of the tiling approach
      2. Performance analysis of the tiling approach
    3. Convolution
      1. Convolution operation in CUDA
      2. Optimization strategy
      3. Filtering coefficients optimization using constant memory
      4. Tiling input data using shared memory
      5. Getting more performance
    4. Prefix sum (scan)
      1. Blelloch scan implementation
      2. Building a global size scan
      3. The pursuit of better performance
      4. Other applications for the parallel prefix-sum operation
    5. Compact and split
      1. Implementing compact
      2. Implementing split
    6. N-body
      1. Implementing an N-body simulation on GPU
      2. Overview of an N-body simulation implementation
    7. Histogram calculation
      1. Compile and execution steps
      2. Understanding a parallel histogram 
      3. Calculating a histogram with CUDA atomic functions
    8. Quicksort in CUDA using dynamic parallelism
      1. Quicksort and CUDA dynamic parallelism 
      2. Quicksort with CUDA
      3. Dynamic parallelism guidelines and constraints
    9. Radix sort
      1. Two approaches
      2. Approach 1 – warp-level primitives
      3. Approach 2 – Thrust-based radix sort
    10. Summary
  14. Programming with Libraries and Other Languages
    1. Linear algebra operation using cuBLAS
      1. cuBLAS SGEMM operation
      2. Multi-GPU operation
    2. Mixed-precision operation using cuBLAS
      1. GEMM with mixed precision
      2. GEMM with TensorCore
    3. cuRAND for parallel random number generation
      1. cuRAND host API
      2. cuRAND device API
      3. cuRAND with mixed precision cuBLAS GEMM
    4. cuFFT for Fast Fourier Transformation in GPU
      1. Basic usage of cuFFT
      2. cuFFT with mixed precision
      3. cuFFT for multi-GPU
    5. NPP for image and signal processing with GPU
      1. Image processing with NPP
      2. Signal processing with NPP
      3. Applications of NPP
    6. Writing GPU accelerated code in OpenCV
      1. CUDA-enabled OpenCV installation
      2. Implementing a CUDA-enabled blur filter
      3. Enabling multi-stream processing
    7. Writing Python code that works with CUDA
      1. Numba – a high-performance Python compiler
        1. Installing Numba
        2. Using Numba with the @vectorize decorator
        3. Using Numba with the @cuda.jit decorator
      2. CuPy – GPU accelerated Python matrix library 
        1. Installing CuPy
        2. Basic usage of CuPy
        3. Implementing custom kernel functions
      3. PyCUDA – Pythonic access to CUDA API
      4. Installing PyCUDA
        1. Matrix multiplication using PyCUDA
    8. NVBLAS for zero coding acceleration in Octave and R
      1. Configuration
      2. Accelerating Octave's computation
      3. Accelerating R's computation
    9. CUDA acceleration in MATLAB
    10. Summary
  15. GPU Programming Using OpenACC
    1. Technical requirements
      1. Image merging on a GPU using OpenACC
    2. OpenACC directives
      1. Parallel and loop directives
      2. Data directive
      3. Applying the parallel, loop, and data directive to merge image code
    3. Asynchronous programming in OpenACC
      1. Structured data directive
      2. Unstructured data directive
      3. Asynchronous programming in OpenACC
      4. Applying the unstructured data and async directives to merge image code
    4. Additional important directives and clauses
      1. Gang/vector/worker
      2. Managed memory
      3. Kernel directive
      4. Collapse clause
      5. Tile clause
      6. CUDA interoperability
        1. DevicePtr clause
        2. Routine directive
    5. Summary
  16. Deep Learning Acceleration with CUDA
    1. Technical requirements
    2. Fully connected layer acceleration with cuBLAS 
      1. Neural network operations
      2. Design of a neural network layer
      3. Tensor and parameter containers
      4. Implementing a fully connected layer
        1. Implementing forward propagation
        2. Implementing backward propagation
      5. Layer termination
    3. Activation layer with cuDNN
      1. Layer configuration and initialization
      2. Implementing layer operation
        1. Implementing forward propagation
        2. Implementing backward propagation
    4. Softmax and loss functions in cuDNN/CUDA
      1. Implementing the softmax layer
        1. Implementing forward propagation
        2. Implementing backward propagation
      2. Implementing the loss function
      3. MNIST dataloader
      4. Managing and creating a model
      5. Network training with the MNIST dataset
    5. Convolutional neural networks with cuDNN
      1. The convolution layer
        1. Implementing forward propagation
        2. Implementing backward propagation
      2. Pooling layer with cuDNN
        1. Implementing forward propagation
        2. Implementing backward propagation
      3. Network configuration
      4. Mixed precision operations
    6. Recurrent neural network optimization
      1. Using the cuDNN LSTM operation
      2. Implementing a virtual LSTM operation
      3. Comparing the performance between cuDNN and SGEMM LSTM
    7. Profiling deep learning frameworks
      1. Profiling the PyTorch model
      2. Profiling a TensorFlow model
    8. Summary
  17. Appendix
    1. Useful nvidia-smi commands
      1. Getting the GPU's information 
      2. Getting formatted information
      3. Power management mode settings
      4. Setting the GPU's clock speed
      5. GPU device monitoring
      6. Monitoring GPU utilization along with multiple processes
      7. Getting GPU topology information
    2. WDDM/TCC mode in Windows
      1. Setting TCC/WDDM mode
    3. Performance modeling 
      1. The Roofline model
      2. Analyzing the Jacobi method
    4. Exploring container-based development
      1. NGC configuration for a host machine
      2. Basic usage of the NGC container
      3. Creating and saving a new container from the NGC container
      4. Setting the default runtime as NVIDIA Docker
  18. Another Book You May Enjoy
    1. Leave a review - let other readers know what you think