Using the NVIDIA nvprof profiler and Visual Profiler

We will end with a brief overview of the command-line NVIDIA nvprof profiler. In contrast to the Nsight IDE, nvprof lets us freely profile any Python code that we have written; we won't be compelled here to write full-on, pure CUDA-C test function code.

We can do a basic profiling of a binary executable program with the command nvprof program; we can likewise profile a Python script by passing python as the first argument and the script as the second, as follows: nvprof python program.py. Let's profile the simple matrix-multiplication CUDA-C executable that we wrote earlier by running nvprof matrix_ker.
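To make the Python case concrete, the following is a minimal sketch of the sort of PyCUDA script that nvprof can profile; the file name program.py and the kernel double_ker are hypothetical stand-ins for illustration, not code from earlier chapters. When run under nvprof, the host-to-device copy, the kernel execution, and the device-to-host copy each appear as separate entries in the profiler's summary:

# program.py -- a hypothetical minimal PyCUDA script, profiled with:
#   nvprof python program.py
import numpy as np
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

# A trivial kernel; nvprof reports its execution time along with
# the memory copies issued below.
ker = SourceModule("""
__global__ void double_ker(float *vec)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    vec[i] *= 2.0f;
}
""")
double_ker = ker.get_function("double_ker")

vec = gpuarray.to_gpu(np.random.randn(512).astype('float32'))  # [CUDA memcpy HtoD]
double_ker(vec, block=(512, 1, 1), grid=(1, 1, 1))             # kernel launch
result = vec.get()                                             # [CUDA memcpy DtoH]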

We see that this is very similar to the output of the Python cProfile module that we first used to analyze a Mandelbrot algorithm back in Chapter 1, Why GPU Programming?, only now it reports exclusively the CUDA operations that were executed. So, we can use this when we specifically want to optimize on the GPU, without concerning ourselves with any of the Python or other commands that executed on the host. (If we add the --print-gpu-trace command-line option, we can further analyze each individual CUDA kernel launch, including its block and grid launch parameters.)

Let's look at one more trick to help us visualize the execution time of all of a program's operations: we will use nvprof to dump a file that can then be read by the NVIDIA Visual Profiler, which displays it graphically. Let's do this using an example from the last chapter, multi-kernel_streams.py (available in the repository under the Chapter 5 directory). Recall that this was one of our introductory examples for CUDA streams, which allow us to execute and organize multiple GPU operations concurrently. We dump the output to a file with the .nvvp suffix using the -o command-line option, as follows: nvprof -o m.nvvp python multi-kernel_streams.py. We can then load this file into the NVIDIA Visual Profiler with the nvvp m.nvvp command.
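Before looking at the timeline, it may help to recall the shape of that program. The following is a paraphrased sketch of the multi-stream pattern it uses, not the exact repository code; the array sizes and iteration counts here are illustrative stand-ins. The important point is that each array gets its own stream, so the copies and kernel launches issued on different streams are free to overlap:

# Sketch of the multi-stream launch pattern in multi-kernel_streams.py
# (paraphrased; sizes and kernel body are illustrative).
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

num_arrays = 200
array_len = 1024 * 1024

ker = SourceModule("""
__global__ void mult_ker(float *array, int array_len)
{
    int thd = blockIdx.x * blockDim.x + threadIdx.x;
    int num_iters = array_len / blockDim.x;
    for (int j = 0; j < num_iters; j++)
    {
        int i = j * blockDim.x + thd;
        for (int k = 0; k < 50; k++)
            array[i] *= 2.0;
        for (int k = 0; k < 50; k++)
            array[i] /= 2.0;
    }
}
""")
mult_ker = ker.get_function("mult_ker")

# One stream per array: operations issued on different streams may
# run concurrently, which is exactly what nvvp's timeline visualizes.
streams = [drv.Stream() for _ in range(num_arrays)]
data_gpu = []
for stream in streams:
    arr = np.random.randn(array_len).astype('float32')
    data_gpu.append(gpuarray.to_gpu_async(arr, stream=stream))

for arr_gpu, stream in zip(data_gpu, streams):
    mult_ker(arr_gpu, np.int32(array_len),
             block=(64, 1, 1), grid=(1, 1, 1), stream=stream)

out = [arr_gpu.get_async(stream=stream)
       for arr_gpu, stream in zip(data_gpu, streams)]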

We should see a timeline of activity across all of the CUDA streams (remembering that the kernel used in this program is named mult_ker).

We can see not only every kernel launch, but also memory allocations, memory copies, and other operations. This can be useful for getting an intuitive, visual understanding of how a program uses the GPU over time.
