This IBM® Redpaper publication describes the architecture, installation procedure, and results of running a typical training application on an automotive data set in an orchestrated and secured environment. The environment provides horizontal scalability of GPU resources across physical node boundaries for deep neural network (DNN) workloads.

This paper is intended mainly for systems engineers, system administrators, and system architects who are responsible for data center infrastructure management and typical day-to-day operations such as system monitoring, operational control, asset management, and security audits.

This paper also describes IBM Spectrum® LSF® as a workload manager and IBM Spectrum Discover as a metadata search engine to find the right data for an inference job and automate the data science workflow. With the help of this solution, the data location, which may be on different storage systems, and time of availability for the AI job can be fully abstracted, which provides valuable information for data scientists.

Table of Contents

  1. Front cover
  2. Notices
    1. Trademarks
  3. Preface
    1. Authors
    2. Now you can become a published author, too!
    3. Comments welcome
    4. Stay connected to IBM Redbooks
  4. Chapter 1. Overview
    1. 1.1 Proof of concept background
  5. Chapter 2. Proof of concept environment
    1. 2.1 Overview
    2. 2.2 Prerequisites
  6. Chapter 3. Installation
    1. 3.1 Configuring the NVIDIA Mellanox EDR InfiniBand network
    2. 3.2 Integrating DGX-1 systems as worker nodes into a Red Hat OpenShift 4.4.3 cluster
    3. 3.2.1 Installing the Red Hat Enterprise Linux 7.6 and DGX software
    4. 3.2.2 Installing NVIDIA Mellanox InfiniBand drivers (MLNX_OFED)
    5. 3.2.3 Installing the GDRDMA kernel module
    6. 3.2.4 Installing the NVIDIA Mellanox SELinux module
    7. 3.2.5 Adding DGX-1 systems as worker nodes to the Red Hat OpenShift cluster
    8. 3.3 Adding DGX-1 systems as client nodes to the IBM Spectrum Scale cluster
    9. 3.4 Installing and configuring more components in the Red Hat OpenShift 4.4.3 stack
    10. 3.4.1 Special Resource Operator
    11. 3.4.2 NVIDIA Mellanox RDMA Shared Device plug-in
    12. 3.4.3 Enabling the IPC_LOCK capability in the user namespace for the RDMA Shared Device plug-in
    13. 3.4.4 MPI Operator
    14. 3.4.5 IBM Spectrum Scale CSI
  7. Chapter 4. Preparation and functional testing
    1. 4.1 Testing remote direct memory access through an InfiniBand network
    2. 4.2 Preparing persistent volumes with IBM Spectrum Scale Container Storage Interface
    3. 4.3 MPIJob definition
    4. 4.4 Connectivity tests with the NVIDIA Collective Communications Library
    5. 4.5 Multi-GPU and multi-node GPU scaling with TensorFlow ResNet-50 benchmark
  8. Chapter 5. Deep neural network training on the Audi Autonomous Driving Dataset semantic segmentation data set
    1. 5.1 Description of the A2D2
    2. 5.2 Multi-node GPU scaling results for deep neural network training jobs
    3. 5.3 Application
    4. 5.4 Integrating IBM Spectrum Discover and IBM Spectrum LSF to find the correct data based on labels
  9. Related publications
    1. Online resources
    2. Help from IBM
  10. Back cover