Contents

Foreword

Preface

Acknowledgments

About the Contributors

Chapter 1: OpenACC in a Nutshell

1.1 OpenACC Syntax

1.1.1 Directives

1.1.2 Clauses

1.1.3 API Routines and Environment Variables

1.2 Compute Constructs

1.2.1 Kernels

1.2.2 Parallel

1.2.3 Loop

1.2.4 Routine

1.3 The Data Environment

1.3.1 Data Directives

1.3.2 Data Clauses

1.3.3 The Cache Directive

1.3.4 Partial Data Transfers

1.4 Summary

1.5 Exercises

Chapter 2: Loop-Level Parallelism

2.1 Kernels Versus Parallel Loops

2.2 Three Levels of Parallelism

2.2.1 Gang, Worker, and Vector Clauses

2.2.2 Mapping Parallelism to Hardware

2.3 Other Loop Constructs

2.3.1 Loop Collapse

2.3.2 Independent Clause

2.3.3 Seq and Auto Clauses

2.3.4 Reduction Clause

2.4 Summary

2.5 Exercises

Chapter 3: Programming Tools for OpenACC

3.1 Common Characteristics of Architectures

3.2 Compiling OpenACC Code

3.3 Performance Analysis of OpenACC Applications

3.3.1 Performance Analysis Layers and Terminology

3.3.2 Performance Data Acquisition

3.3.3 Performance Data Recording and Presentation

3.3.4 The OpenACC Profiling Interface

3.3.5 Performance Tools with OpenACC Support

3.3.6 The NVIDIA Profiler

3.3.7 The Score-P Tools Infrastructure for Hybrid Applications

3.3.8 TAU Performance System

3.4 Identifying Bugs in OpenACC Programs

3.5 Summary

3.6 Exercises

Chapter 4: Using OpenACC for Your First Program

4.1 Case Study

4.1.1 Serial Code

4.1.2 Compiling the Code

4.2 Creating a Naive Parallel Version

4.2.1 Find the Hot Spot

4.2.2 Is It Safe to Use kernels?

4.2.3 OpenACC Implementations

4.3 Performance of OpenACC Programs

4.4 An Optimized Parallel Version

4.4.1 Reducing Data Movement

4.4.2 Extra Clever Tweaks

4.4.3 Final Result

4.5 Summary

4.6 Exercises

Chapter 5: Compiling OpenACC

5.1 The Challenges of Parallelism

5.1.1 Parallel Hardware

5.1.2 Mapping Loops

5.1.3 Memory Hierarchy

5.1.4 Reductions

5.1.5 OpenACC for Parallelism

5.2 Restructuring Compilers

5.2.1 What Compilers Can Do

5.2.2 What Compilers Can’t Do

5.3 Compiling OpenACC

5.3.1 Code Preparation

5.3.2 Scheduling

5.3.3 Serial Code

5.3.4 User Errors

5.4 Summary

5.5 Exercises

Chapter 6: Best Programming Practices

6.1 General Guidelines

6.1.1 Maximizing On-Device Computation

6.1.2 Optimizing Data Locality

6.2 Maximize On-Device Compute

6.2.1 Atomic Operations

6.2.2 Kernels and Parallel Constructs

6.2.3 Runtime Tuning and the If Clause

6.3 Optimize Data Locality

6.3.1 Minimum Data Transfer

6.3.2 Data Reuse and the Present Clause

6.3.3 Unstructured Data Lifetimes

6.3.4 Array Shaping

6.4 A Representative Example

6.4.1 Background: Thermodynamic Tables

6.4.2 Baseline CPU Implementation

6.4.3 Profiling

6.4.4 Acceleration with OpenACC

6.4.5 Optimized Data Locality

6.4.6 Performance Study

6.5 Summary

6.6 Exercises

Chapter 7: OpenACC and Performance Portability

7.1 Challenges

7.2 Target Architectures

7.2.1 Compiling for Specific Platforms

7.2.2 x86_64 Multicore and NVIDIA

7.3 OpenACC for Performance Portability

7.3.1 The OpenACC Memory Model

7.3.2 Memory Architectures

7.3.3 Code Generation

7.3.4 Data Layout for Performance Portability

7.4 Code Refactoring for Performance Portability

7.4.1 HACCMK

7.4.2 Targeting Multiple Architectures

7.4.3 OpenACC over NVIDIA K20x GPU

7.4.4 OpenACC over AMD Bulldozer Multicore

7.5 Summary

7.6 Exercises

Chapter 8: Additional Approaches to Parallel Programming

8.1 Programming Models

8.1.1 OpenACC

8.1.2 OpenMP

8.1.3 CUDA

8.1.4 OpenCL

8.1.5 C++ AMP

8.1.6 Kokkos

8.1.7 RAJA

8.1.8 Threading Building Blocks

8.1.9 C++17

8.1.10 Fortran

8.2 Programming Model Components

8.2.1 Parallel Loops

8.2.2 Parallel Reductions

8.2.3 Tightly Nested Loops

8.2.4 Hierarchical Parallelism (Non-Tightly Nested Loops)

8.2.5 Task Parallelism

8.2.6 Data Allocation

8.2.7 Data Transfers

8.3 A Case Study

8.3.1 Serial Implementation

8.3.2 The OpenACC Implementation

8.3.3 The OpenMP Implementation

8.3.4 The CUDA Implementation

8.3.5 The Kokkos Implementation

8.3.6 The TBB Implementation

8.3.7 Some Performance Numbers

8.4 Summary

8.5 Exercises

Chapter 9: OpenACC and Interoperability

9.1 Calling Native Device Code from OpenACC

9.1.1 Example: Image Filtering Using DFTs

9.1.2 The host_data Directive and the use_device Clause

9.1.3 API Routines for Target Platforms

9.2 Calling OpenACC from Native Device Code

9.3 Advanced Interoperability Topics

9.3.1 acc_map_data

9.3.2 Calling CUDA Device Routines from OpenACC Kernels

9.4 Summary

9.5 Exercises

Chapter 10: Advanced OpenACC

10.1 Asynchronous Operations

10.1.1 Asynchronous OpenACC Programming

10.1.2 Software Pipelining

10.2 Multidevice Programming

10.2.1 Multidevice Pipeline

10.2.2 OpenACC and MPI

10.3 Summary

10.4 Exercises

Chapter 11: Innovative Research Ideas Using OpenACC, Part I

11.1 Sunway OpenACC

11.1.1 The SW26010 Manycore Processor

11.1.2 The Memory Model in the Sunway TaihuLight

11.1.3 The Execution Model

11.1.4 Data Management

11.1.5 Summary

11.2 Compiler Transformation of Nested Loops for Accelerators

11.2.1 The OpenUH Compiler Infrastructure

11.2.2 Loop-Scheduling Transformation

11.2.3 Performance Evaluation of Loop Scheduling

11.2.4 Other Research Topics in OpenUH

Chapter 12: Innovative Research Ideas Using OpenACC, Part II

12.1 A Framework for Directive-Based High-Performance Reconfigurable Computing

12.1.1 Introduction

12.1.2 Baseline Translation of OpenACC-to-FPGA

12.1.3 OpenACC Extensions and Optimization for Efficient FPGA Programming

12.1.4 Evaluation

12.1.5 Summary

12.2 Programming Accelerated Clusters Using XcalableACC

12.2.1 Introduction to XcalableMP

12.2.2 XcalableACC: XcalableMP Meets OpenACC

12.2.3 Omni Compiler Implementation

12.2.4 Performance Evaluation on HA-PACS

12.2.5 Summary

Index
