Chapter 1: OpenACC in a Nutshell
1.1.3 API Routines and Environment Variables
Chapter 2: Loop-Level Parallelism
2.1 Kernels Versus Parallel Loops
2.2 Three Levels of Parallelism
2.2.1 Gang, Worker, and Vector Clauses
2.2.2 Mapping Parallelism to Hardware
Chapter 3: Programming Tools for OpenACC
3.1 Common Characteristics of Architectures
3.3 Performance Analysis of OpenACC Applications
3.3.1 Performance Analysis Layers and Terminology
3.3.2 Performance Data Acquisition
3.3.3 Performance Data Recording and Presentation
3.3.4 The OpenACC Profiling Interface
3.3.5 Performance Tools with OpenACC Support
3.3.7 The Score-P Tools Infrastructure for Hybrid Applications
3.4 Identifying Bugs in OpenACC Programs
Chapter 4: Using OpenACC for Your First Program
4.2 Creating a Naive Parallel Version
4.2.2 Is It Safe to Use kernels?
4.3 Performance of OpenACC Programs
4.4 An Optimized Parallel Version
5.1 The Challenges of Parallelism
5.2.2 What Compilers Can’t Do
Chapter 6: Best Programming Practices
6.1.1 Maximizing On-Device Computation
6.1.2 Optimizing Data Locality
6.2 Maximize On-Device Compute
6.2.2 Kernels and Parallel Constructs
6.2.3 Runtime Tuning and the If Clause
6.3.2 Data Reuse and the Present Clause
6.3.3 Unstructured Data Lifetimes
6.4.1 Background: Thermodynamic Tables
6.4.2 Baseline CPU Implementation
6.4.4 Acceleration with OpenACC
Chapter 7: OpenACC and Performance Portability
7.2.1 Compiling for Specific Platforms
7.2.2 x86_64 Multicore and NVIDIA
7.3 OpenACC for Performance Portability
7.3.1 The OpenACC Memory Model
7.3.4 Data Layout for Performance Portability
7.4 Code Refactoring for Performance Portability
7.4.2 Targeting Multiple Architectures
7.4.3 OpenACC over NVIDIA K20X GPU
7.4.4 OpenACC over AMD Bulldozer Multicore
Chapter 8: Additional Approaches to Parallel Programming
8.1.8 Threading Building Blocks
8.2 Programming Model Components
8.2.4 Hierarchical Parallelism (Non-Tightly Nested Loops)
8.3.2 The OpenACC Implementation
8.3.3 The OpenMP Implementation
8.3.5 The Kokkos Implementation
8.3.7 Some Performance Numbers
Chapter 9: OpenACC and Interoperability
9.1 Calling Native Device Code from OpenACC
9.1.1 Example: Image Filtering Using DFTs
9.1.2 The host_data Directive and the use_device Clause
9.1.3 API Routines for Target Platforms
9.2 Calling OpenACC from Native Device Code
9.3 Advanced Interoperability Topics
9.3.2 Calling CUDA Device Routines from OpenACC Kernels
10.1.1 Asynchronous OpenACC Programming
Chapter 11: Innovative Research Ideas Using OpenACC, Part I
11.1.1 The SW26010 Manycore Processor
11.1.2 The Memory Model in the Sunway TaihuLight
11.2 Compiler Transformation of Nested Loops for Accelerators
11.2.1 The OpenUH Compiler Infrastructure
11.2.2 Loop-Scheduling Transformation
11.2.3 Performance Evaluation of Loop Scheduling
11.2.4 Other Research Topics in OpenUH
Chapter 12: Innovative Research Ideas Using OpenACC, Part II
12.1 A Framework for Directive-Based High-Performance Reconfigurable Computing
12.1.2 Baseline Translation of OpenACC to FPGA
12.1.3 OpenACC Extensions and Optimization for Efficient FPGA Programming
12.2 Programming Accelerated Clusters Using XcalableACC
12.2.1 Introduction to XcalableMP
12.2.2 XcalableACC: XcalableMP Meets OpenACC
12.2.3 Omni Compiler Implementation