Title Page
Copyright and Credits
  Hands-On GPU Programming with Python and CUDA
Dedication
About Packt
  Why subscribe?
  Packt.com
Contributors
  About the author
  About the reviewer
  Packt is searching for authors like you
Preface
  Who this book is for
  What this book covers
  To get the most out of this book
    Download the example code files
    Download the color images
    Conventions used
  Get in touch
    Reviews
Why GPU Programming?
  Technical requirements
  Parallelization and Amdahl's Law
    Using Amdahl's Law
    The Mandelbrot set
  Profiling your code
    Using the cProfile module
  Summary
  Questions
Setting Up Your GPU Programming Environment
  Technical requirements
  Ensuring that we have the right hardware
    Checking your hardware (Linux)
    Checking your hardware (Windows)
  Installing the GPU drivers
    Installing the GPU drivers (Linux)
    Installing the GPU drivers (Windows)
  Setting up a C++ programming environment
    Setting up GCC, Eclipse IDE, and graphical dependencies (Linux)
    Setting up Visual Studio (Windows)
  Installing the CUDA Toolkit
    Installing the CUDA Toolkit (Linux)
    Installing the CUDA Toolkit (Windows)
  Setting up our Python environment for GPU programming
    Installing PyCUDA (Linux)
    Creating an environment launch script (Windows)
    Installing PyCUDA (Windows)
    Testing PyCUDA
  Summary
  Questions
Getting Started with PyCUDA
  Technical requirements
  Querying your GPU
    Querying your GPU with PyCUDA
  Using PyCUDA's gpuarray class
    Transferring data to and from the GPU with gpuarray
    Basic pointwise arithmetic operations with gpuarray
    A speed test
  Using PyCUDA's ElementWiseKernel for performing pointwise computations
    Mandelbrot revisited
    A brief foray into functional programming
    Parallel scan and reduction kernel basics
  Summary
  Questions
Kernels, Threads, Blocks, and Grids
  Technical requirements
  Kernels
    The PyCUDA SourceModule function
  Threads, blocks, and grids
    Conway's game of life
  Thread synchronization and intercommunication
    Using the __syncthreads() device function
    Using shared memory
  The parallel prefix algorithm
    The naive parallel prefix algorithm
    Inclusive versus exclusive prefix
    A work-efficient parallel prefix algorithm
      Work-efficient parallel prefix (up-sweep phase)
      Work-efficient parallel prefix (down-sweep phase)
    Work-efficient parallel prefix implementation
  Summary
  Questions
Streams, Events, Contexts, and Concurrency
  Technical requirements
  CUDA device synchronization
    Using the PyCUDA stream class
    Concurrent Conway's game of life using CUDA streams
  Events
    Events and streams
  Contexts
    Synchronizing the current context
    Manual context creation
  Host-side multiprocessing and multithreading
    Multiple contexts for host-side concurrency
  Summary
  Questions
Debugging and Profiling Your CUDA Code
  Technical requirements
  Using printf from within CUDA kernels
    Using printf for debugging
  Filling in the gaps with CUDA-C
  Using the Nsight IDE for CUDA-C development and debugging
    Using Nsight with Visual Studio in Windows
    Using Nsight with Eclipse in Linux
    Using Nsight to understand the warp lockstep property in CUDA
  Using the NVIDIA nvprof profiler and Visual Profiler
  Summary
  Questions
Using the CUDA Libraries with Scikit-CUDA
  Technical requirements
  Installing Scikit-CUDA
  Basic linear algebra with cuBLAS
    Level-1 AXPY with cuBLAS
    Other level-1 cuBLAS functions
    Level-2 GEMV in cuBLAS
    Level-3 GEMM in cuBLAS for measuring GPU performance
  Fast Fourier transforms with cuFFT
    A simple 1D FFT
    Using an FFT for convolution
    Using cuFFT for 2D convolution
  Using cuSolver from Scikit-CUDA
    Singular value decomposition (SVD)
    Using SVD for Principal Component Analysis (PCA)
  Summary
  Questions
The CUDA Device Function Libraries and Thrust
  Technical requirements
  The cuRAND device function library
    Estimating π with Monte Carlo
  The CUDA Math API
    A brief review of definite integration
    Computing definite integrals with the Monte Carlo method
    Writing some test cases
  The CUDA Thrust library
    Using functors in Thrust
  Summary
  Questions
Implementation of a Deep Neural Network
  Technical requirements
  Artificial neurons and neural networks
    Implementing a dense layer of artificial neurons
  Implementation of the softmax layer
  Implementation of the cross-entropy loss
  Implementation of a sequential network
    Implementation of inference methods
    Gradient descent
    Conditioning and normalizing data
  The Iris dataset
  Summary
  Questions
Working with Compiled GPU Code
  Launching compiled code with Ctypes
    The Mandelbrot set revisited (again)
    Compiling the code and interfacing with Ctypes
  Compiling and launching pure PTX code
  Writing wrappers for the CUDA Driver API
    Using the CUDA Driver API
  Summary
  Questions
Performance Optimization in CUDA
  Dynamic parallelism
    Quicksort with dynamic parallelism
  Vectorized data types and memory access
  Thread-safe atomic operations
  Warp shuffling
  Inline PTX assembly
  Performance-optimized array sum
  Summary
  Questions
Where to Go from Here
  Furthering your knowledge of CUDA and GPGPU programming
    Multi-GPU systems
    Cluster computing and MPI
    OpenCL and PyOpenCL
  Graphics
    OpenGL
    DirectX 12
    Vulkan
  Machine learning and computer vision
    The basics
    cuDNN
    Tensorflow and Keras
    Chainer
    OpenCV
  Blockchain technology
  Summary
  Questions
Assessment
  Chapter 1, Why GPU Programming?
  Chapter 2, Setting Up Your GPU Programming Environment
  Chapter 3, Getting Started with PyCUDA
  Chapter 4, Kernels, Threads, Blocks, and Grids
  Chapter 5, Streams, Events, Contexts, and Concurrency
  Chapter 6, Debugging and Profiling Your CUDA Code
  Chapter 7, Using the CUDA Libraries with Scikit-CUDA
  Chapter 8, The CUDA Device Function Libraries and Thrust
  Chapter 9, Implementation of a Deep Neural Network
  Chapter 10, Working with Compiled GPU Code
  Chapter 11, Performance Optimization in CUDA
  Chapter 12, Where to Go from Here
Other Books You May Enjoy
  Leave a review - let other readers know what you think