Profiling a TensorFlow model

Profiling TensorFlow graphs requires a plugin that enables NVTX annotations. To use NVTX annotations in TensorFlow, we need to install the nvtx-plugins-tf Python package using the following command:

$ pip install nvtx-plugins-tf

However, we don't have to do this if we use an NGC TensorFlow container, version 19.08 or later.

TensorFlow's graph APIs are symbolic, so annotations have to be inserted into the graph itself. The NVTX plugin provides two options for this: a decorator and a pair of Python functions.

Here is an example of an NVTX decorator:

import tensorflow as tf
import nvtx.plugins.tf as nvtx_tf

ENABLE_NVTX = True

@nvtx_tf.ops.trace(message='Dense Block', domain_name='Forward',
                   grad_domain_name='Gradient', enabled=ENABLE_NVTX,
                   trainable=True)
def dense_layer(x):
    x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1')
    x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2')
    return x

The following is an example of an NVTX Python function:

import tensorflow as tf
import nvtx.plugins.tf as nvtx_tf

ENABLE_NVTX = True

x, nvtx_context = nvtx_tf.ops.start(x, message='Dense Block',
                                    domain_name='Forward',
                                    grad_domain_name='Gradient',
                                    enabled=ENABLE_NVTX, trainable=True)
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1')
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2')
x = nvtx_tf.ops.end(x, nvtx_context)

The NVTX plugin provides NVTXHook, which allows us to profile the TF estimator and session. For example, we can use the hook as follows:

from nvtx.plugins.tf.estimator import NVTXHook

nvtx_callback = NVTXHook(skip_n_steps=1, name='Train')
training_hooks = []
training_hooks.append(nvtx_callback)

We can then attach the hooks to a monitored session, as follows:

with tf.train.MonitoredSession(hooks=training_hooks) as sess:

Alternatively, when using the Estimator API, we can pass the hooks to the train() call:

estimator.train(input_fn=input_fn, hooks=training_hooks)

Now, let's apply this to the sample ResNet-50 code and review how it works. The example code can be found in the 05_framework_profile/tensorflow/RN50v1.5 folder:

  1. Let's begin by applying NVTXHook to the estimator. The training graph's definition can be found in the runtime/runner.py file, at line 312. Before building the graph, we append NVTXHook to the list of hooks, as shown in the following block of code:
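A minimal sketch of this change might look as follows; the training_hooks list name is taken from the earlier example and is an assumption, since runner.py may use a different variable name:

from nvtx.plugins.tf.estimator import NVTXHook

# In runtime/runner.py, before the training graph is built:
# append NVTXHook to the hook list that is later passed to the
# session or estimator.
training_hooks = []
training_hooks.append(NVTXHook(skip_n_steps=1, name='Train'))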

  2. Then, we will apply the NVTX annotation to the model-building function. The model_build() function can be found in the ResnetModel class in the model/resnet_v1_5.py file. The following code shows an example of placing an NVTX annotation, using the Python functions, on the conv1 layer in the model_build() function:
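Since the original listing is not reproduced here, the following is only a sketch; the model_build_excerpt() wrapper and the tf.layers.conv2d() arguments are illustrative stand-ins for the actual conv1 definition in resnet_v1_5.py:

import tensorflow as tf
import nvtx.plugins.tf as nvtx_tf

def model_build_excerpt(inputs):
    # Open an NVTX range before the conv1 layer...
    net, nvtx_context = nvtx_tf.ops.start(inputs, message='conv1',
        domain_name='Forward', grad_domain_name='Gradient',
        enabled=True, trainable=True)
    # ...run the layer (placeholder arguments)...
    net = tf.layers.conv2d(net, filters=64, kernel_size=7, strides=2,
                           padding='same', name='conv1')
    # ...and close the range, passing its output onward.
    net = nvtx_tf.ops.end(net, nvtx_context)
    return net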

In the preceding code, we need to be careful to wire up the inputs and outputs correctly when using the nvtx_tf.ops.start() and nvtx_tf.ops.end() functions. We place NVTX annotations on the other layers in the same way, making sure that the final fully connected layer's output remains the output of the network.

We also have to disable the code that checks the number of trainable variables. When NVTX's trainable parameter is set to True, that count changes. At line 174 in the resnet_v1_5.py file, there's a block of assertion code that checks this count. Simply comment it out, as follows:
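The exact assertion varies between versions of the code, so the following is only a hypothetical stand-in for the check around line 174:

# In model/resnet_v1_5.py (illustrative; comment out whatever
# trainable-variable count check appears in your copy):
# assert nb_trainable_variables == expected_count, \
#     "Unexpected number of trainable variables"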

  3. We also use NVTX decorators for the ResNet building blocks. In the model/blocks directory, we can find the conv2d and ResNet bottleneck block implementations in conv2d_blocks.py and resnet_bottleneck_block.py. In the conv2d_blocks.py file, we can decorate the conv2d_block() function to annotate it for NVTX profiling, as follows:
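A sketch of the decorated function might look as follows; the signature and body are simplified stand-ins for the repository's actual conv2d_block():

import tensorflow as tf
import nvtx.plugins.tf as nvtx_tf

# Only the decorator is the actual change; the body is simplified.
@nvtx_tf.ops.trace(message='conv2d_block', domain_name='Forward',
                   grad_domain_name='Gradient', enabled=True,
                   trainable=True)
def conv2d_block(inputs, filters, kernel_size, strides):
    net = tf.layers.conv2d(inputs, filters=filters,
                           kernel_size=kernel_size,
                           strides=strides, padding='same')
    return net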

In the same way, we can decorate the building block in the resnet_bottleneck_block.py file:
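Again, this is a sketch; the function name and parameters are assumptions standing in for the file's actual implementation:

import nvtx.plugins.tf as nvtx_tf

@nvtx_tf.ops.trace(message='bottleneck_block', domain_name='Forward',
                   grad_domain_name='Gradient', enabled=True,
                   trainable=True)
def bottleneck_block(inputs, depth, depth_bottleneck, stride):
    ...  # the file's existing implementation, unchanged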

  4. Now, let's profile the model. As we did with PyTorch, we will use an NGC container, this time TensorFlow's. We will assume that the ImageNet dataset's tfrecord files are located in the /raid/datasets/imagenet/tfrecord directory. The following bash shell script executes the container and profiles the network:

#!/bin/bash

CODE_PATH="RN50v1.5"
DATASET_PATH="/raid/datasets/imagenet/tfrecord"
OUTPUT_NAME="resnet50_tf"

# default profile
docker run --rm -ti --runtime=nvidia \
    -v $(pwd):/result \
    -v $(pwd)/${CODE_PATH}:/workspace \
    -v ${DATASET_PATH}:/imagenet \
    nvcr.io/nvidia/tensorflow:19.08-py3 \
    nsys profile -t cuda,nvtx,cudnn,cublas -o ${OUTPUT_NAME} \
        -f true -w true -y 40 -d 20 \
        python /workspace/main.py --mode=training_benchmark \
            --warmup_steps 200 \
            --num_iter 500 --iter_unit batch \
            --results_dir=results --batch_size 64

When we execute this script, we will get the resnet50_tf.qdrep file in the RN50v1.5 directory.

  5. Finally, let's review the profiled output using NVIDIA Nsight Systems:

Here, we can confirm that backward propagation takes roughly twice as long as the forward pass. This example code does not synchronize the CPU with the GPU, so we see a larger time difference between the host and the GPU. As we place additional annotations in the building blocks, we will be able to see the sub-block annotations within the layers.

Profiling with NVIDIA Nsight Systems provides additional benefits when it comes to monitoring the execution time of all-reduce operations in multi-GPU training. The following screenshot shows the profiling result of training with two GPUs:

In the highlighted row, we can see the ncclAllReduce() function, which runs concurrently with backward propagation. Thanks to this overlap, we don't pay the delay of the all-reduce operation. This example code uses Horovod to train on multiple GPUs. If you want to learn more about it, visit Horovod's GitHub page, https://github.com/horovod/horovod, where you can find its documentation and example code.
