Contents

Foreword

Preface

Acknowledgments

About the Contributors

Chapter 1: OpenACC in a Nutshell

1.1 OpenACC Syntax

1.1.1 Directives

1.1.2 Clauses

1.1.3 API Routines and Environment Variables

1.2 Compute Constructs

1.2.1 Kernels

1.2.2 Parallel

1.2.3 Loop

1.2.4 Routine

1.3 The Data Environment

1.3.1 Data Directives

1.3.2 Data Clauses

1.3.3 The Cache Directive

1.3.4 Partial Data Transfers

1.4 Summary

1.5 Exercises

Chapter 2: Loop-Level Parallelism

2.1 Kernels Versus Parallel Loops

2.2 Three Levels of Parallelism

2.2.1 Gang, Worker, and Vector Clauses

2.2.2 Mapping Parallelism to Hardware

2.3 Other Loop Constructs

2.3.1 Loop Collapse

2.3.2 Independent Clause

2.3.3 Seq and Auto Clauses

2.3.4 Reduction Clause

2.4 Summary

2.5 Exercises

Chapter 3: Programming Tools for OpenACC

3.1 Common Characteristics of Architectures

3.2 Compiling OpenACC Code

3.3 Performance Analysis of OpenACC Applications

3.3.1 Performance Analysis Layers and Terminology

3.3.2 Performance Data Acquisition

3.3.3 Performance Data Recording and Presentation

3.3.4 The OpenACC Profiling Interface

3.3.5 Performance Tools with OpenACC Support

3.3.6 The NVIDIA Profiler

3.3.7 The Score-P Tools Infrastructure for Hybrid Applications

3.3.8 TAU Performance System

3.4 Identifying Bugs in OpenACC Programs

3.5 Summary

3.6 Exercises

Chapter 4: Using OpenACC for Your First Program

4.1 Case Study

4.1.1 Serial Code

4.1.2 Compiling the Code

4.2 Creating a Naive Parallel Version

4.2.1 Find the Hot Spot

4.2.2 Is It Safe to Use kernels?

4.2.3 OpenACC Implementations

4.3 Performance of OpenACC Programs

4.4 An Optimized Parallel Version

4.4.1 Reducing Data Movement

4.4.2 Extra Clever Tweaks

4.4.3 Final Result

4.5 Summary

4.6 Exercises

Chapter 5: Compiling OpenACC

5.1 The Challenges of Parallelism

5.1.1 Parallel Hardware

5.1.2 Mapping Loops

5.1.3 Memory Hierarchy

5.1.4 Reductions

5.1.5 OpenACC for Parallelism

5.2 Restructuring Compilers

5.2.1 What Compilers Can Do

5.2.2 What Compilers Can’t Do

5.3 Compiling OpenACC

5.3.1 Code Preparation

5.3.2 Scheduling

5.3.3 Serial Code

5.3.4 User Errors

5.4 Summary

5.5 Exercises

Chapter 6: Best Programming Practices

6.1 General Guidelines

6.1.1 Maximizing On-Device Computation

6.1.2 Optimizing Data Locality

6.2 Maximize On-Device Compute

6.2.1 Atomic Operations

6.2.2 Kernels and Parallel Constructs

6.2.3 Runtime Tuning and the If Clause

6.3 Optimize Data Locality

6.3.1 Minimum Data Transfer

6.3.2 Data Reuse and the Present Clause

6.3.3 Unstructured Data Lifetimes

6.3.4 Array Shaping

6.4 A Representative Example

6.4.1 Background: Thermodynamic Tables

6.4.2 Baseline CPU Implementation

6.4.3 Profiling

6.4.4 Acceleration with OpenACC

6.4.5 Optimized Data Locality

6.4.6 Performance Study

6.5 Summary

6.6 Exercises

Chapter 7: OpenACC and Performance Portability

7.1 Challenges

7.2 Target Architectures

7.2.1 Compiling for Specific Platforms

7.2.2 x86_64 Multicore and NVIDIA

7.3 OpenACC for Performance Portability

7.3.1 The OpenACC Memory Model

7.3.2 Memory Architectures

7.3.3 Code Generation

7.3.4 Data Layout for Performance Portability

7.4 Code Refactoring for Performance Portability

7.4.1 HACCMK

7.4.2 Targeting Multiple Architectures

7.4.3 OpenACC over NVIDIA K20x GPU

7.4.4 OpenACC over AMD Bulldozer Multicore

7.5 Summary

7.6 Exercises

Chapter 8: Additional Approaches to Parallel Programming

8.1 Programming Models

8.1.1 OpenACC

8.1.2 OpenMP

8.1.3 CUDA

8.1.4 OpenCL

8.1.5 C++ AMP

8.1.6 Kokkos

8.1.7 RAJA

8.1.8 Threading Building Blocks

8.1.9 C++17

8.1.10 Fortran

8.2 Programming Model Components

8.2.1 Parallel Loops

8.2.2 Parallel Reductions

8.2.3 Tightly Nested Loops

8.2.4 Hierarchical Parallelism (Non-Tightly Nested Loops)

8.2.5 Task Parallelism

8.2.6 Data Allocation

8.2.7 Data Transfers

8.3 A Case Study

8.3.1 Serial Implementation

8.3.2 The OpenACC Implementation

8.3.3 The OpenMP Implementation

8.3.4 The CUDA Implementation

8.3.5 The Kokkos Implementation

8.3.6 The TBB Implementation

8.3.7 Some Performance Numbers

8.4 Summary

8.5 Exercises

Chapter 9: OpenACC and Interoperability

9.1 Calling Native Device Code from OpenACC

9.1.1 Example: Image Filtering Using DFTs

9.1.2 The host_data Directive and the use_device Clause

9.1.3 API Routines for Target Platforms

9.2 Calling OpenACC from Native Device Code

9.3 Advanced Interoperability Topics

9.3.1 acc_map_data

9.3.2 Calling CUDA Device Routines from OpenACC Kernels

9.4 Summary

9.5 Exercises

Chapter 10: Advanced OpenACC

10.1 Asynchronous Operations

10.1.1 Asynchronous OpenACC Programming

10.1.2 Software Pipelining

10.2 Multidevice Programming

10.2.1 Multidevice Pipeline

10.2.2 OpenACC and MPI

10.3 Summary

10.4 Exercises

Chapter 11: Innovative Research Ideas Using OpenACC, Part I

11.1 Sunway OpenACC

11.1.1 The SW26010 Manycore Processor

11.1.2 The Memory Model in the Sunway TaihuLight

11.1.3 The Execution Model

11.1.4 Data Management

11.1.5 Summary

11.2 Compiler Transformation of Nested Loops for Accelerators

11.2.1 The OpenUH Compiler Infrastructure

11.2.2 Loop-Scheduling Transformation

11.2.3 Performance Evaluation of Loop Scheduling

11.2.4 Other Research Topics in OpenUH

Chapter 12: Innovative Research Ideas Using OpenACC, Part II

12.1 A Framework for Directive-Based High-Performance Reconfigurable Computing

12.1.1 Introduction

12.1.2 Baseline Translation of OpenACC-to-FPGA

12.1.3 OpenACC Extensions and Optimization for Efficient FPGA Programming

12.1.4 Evaluation

12.1.5 Summary

12.2 Programming Accelerated Clusters Using XcalableACC

12.2.1 Introduction to XcalableMP

12.2.2 XcalableACC: XcalableMP Meets OpenACC

12.2.3 Omni Compiler Implementation

12.2.4 Performance Evaluation on HA-PACS

12.2.5 Summary

Index
