Preface

It is known that scaling the number of cores in a multicore central processing unit (CPU) into the thousands is not possible due to architectural and power limitations, whereas a graphics processing unit (GPU) can easily exploit thousands of simple, efficient cores, each executing a single thread of instructions.

The enormous appetite of graphics applications for both computation and bandwidth has led to the emergence of GPUs as the dominant massively parallel architecture. Today, general-purpose GPUs (GPGPUs) with floating-point computation capability are widely used for general-purpose computation, and most of the top machines feature hybrid CPU-GPU architectures. A recent such GPGPU, the NVIDIA Pascal, has 3840 cores.

The main problem with such powerful computing chips is managing such a huge number of cores. New architectural techniques to further increase the number of cores, while accounting for the technological limitations affecting their reliability, performance, and energy characteristics, are of great importance for future needs. The most challenging problem, however, is the lack of efficient programming models and system software to exploit the full capabilities of such a huge number of cores in real applications.

This book focuses on research and practice in GPU-based systems and tries to address their most important issues. The topics range from hardware and architectural concerns to high-level matters immediately facing application or system users, including parallel programming tools and middleware support for such computing systems. The book is divided into four parts, each comprising chapters authored by well-known researchers who share their recent research findings on the topic of that part.

Part 1 deals with different programming issues and tools. It comprises five chapters: 1–5. Chapter 1 discusses how to program a GPU in a reliable manner. Although GPU programs deliver high computational throughput in many areas, it is difficult to write and optimize them correctly due to the subtleties of GPU concurrency. For this reason, the chapter discusses several approaches to making GPU programming easier, and provides insights into recent progress on rigorous methods for the formal analysis of GPU software. The latter is increasingly important as GPU programming advances; verification methods should therefore be taught to new GPU programmers.
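
To give a flavor of these subtleties (a minimal CUDA sketch of our own, not code from the chapter), the following reduction kernel places a barrier under divergent control flow, a classic defect that formal analysis tools are designed to catch, shown alongside a corrected version:

```cuda
// Buggy reduction (launch with blockDim.x == 256): the barrier sits under
// divergent control flow, so threads with tid >= stride never reach it.
__global__ void buggy_reduce(const int *in, int *out) {
    __shared__ int partial[256];
    int tid = threadIdx.x;
    partial[tid] = in[tid];
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            __syncthreads();               // BUG: barrier under divergence
            partial[tid] += partial[tid + stride];
        }
    }
    if (tid == 0) *out = partial[0];
}

// Corrected version: every thread of the block executes every barrier, and
// the partial sums are published before the first read.
__global__ void fixed_reduce(const int *in, int *out) {
    __shared__ int partial[256];
    int tid = threadIdx.x;
    partial[tid] = in[tid];
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                   // reached uniformly by all threads
    }
    if (tid == 0) *out = partial[0];
}
```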

Chapter 2 introduces SnuCL, a freely available, open-source, unified Open Computing Language (OpenCL) framework for heterogeneous clusters. OpenCL is a programming model for heterogeneous parallel computing systems that defines an abstraction layer between traditional processors and accelerators. Its advantage is that programmers write an OpenCL application once and run it on any OpenCL-compliant system; however, OpenCL cannot target a heterogeneous cluster. SnuCL provides the programmer with the illusion of a single, unified OpenCL platform image for the whole cluster. With SnuCL, OpenCL applications can utilize compute devices in any compute node as if they were in the host node. Moreover, SnuCL makes it possible to integrate multiple OpenCL platforms from different vendors into a single platform, so that OpenCL objects are shared among compute devices from different vendors, achieving both high performance and ease of programming for heterogeneous systems.

Chapter 3 discusses thread communication and synchronization on massively parallel GPUs. On GPUs, where thousands of threads run simultaneously, the performance of the processor largely depends on the efficiency of communication and synchronization among threads. Understanding which mechanisms modern GPUs support, and their implications for algorithm design, is key to writing efficient GPU code. Because conventional GPGPU workloads are massively parallel, with little cooperation among threads, early GPUs supported only coarse-grained thread communication and synchronization. The current trend, however, is to accelerate more diverse workloads, for which coarse-grained mechanisms become a major limiting factor in exploiting parallelism. The latest industry-standard programming framework, OpenCL 2.0, introduces fine-grained thread communication and synchronization support to address this issue. This chapter discusses both the coarse-grained and the fine-grained thread synchronization and communication mechanisms available on modern GPUs.
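
To make the distinction concrete, here is a minimal CUDA sketch (our illustration, not code from the chapter) that combines coarse-grained block-wide barriers with fine-grained atomic updates:

```cuda
// Coarse-grained cooperation: the threads of a block communicate through
// shared memory and synchronize with block-wide barriers. Fine-grained
// cooperation: individual threads update single words with atomics.
__global__ void histogram256(const unsigned char *data, int n,
                             unsigned int *hist) {
    __shared__ unsigned int local[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        local[i] = 0;
    __syncthreads();                              // coarse-grained barrier

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);           // fine-grained atomic update

    __syncthreads();                              // wait for all local counts
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&hist[i], local[i]);            // merge into the global result
}
```

The barriers order whole phases of the block's execution, while the atomics let individual threads cooperate on single memory words.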

Chapter 4 focuses on software-level task scheduling on GPUs. Task scheduling is critical to exploiting the full potential of a many-core processor. Unlike CPUs, GPUs lack the APIs that programmers or compilers need to control scheduling, so the hardware scheduler on modern GPUs is difficult to use in a flexible manner. This chapter presents a compiler and runtime framework that automatically transforms and optimizes GPU programs to enable controllable task scheduling onto the streaming multiprocessors (SMs). At the center of the framework is the SM-centric transformation, which circumvents the complexities of the hardware scheduler and provides the scheduling capability. The framework opens up many opportunities for new optimizations, of which the chapter presents three, targeting parallelism, locality, and processor partitioning. Extensive experiments reveal that these optimizations can substantially improve the performance of a set of GPU programs in multiple scenarios.
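
The underlying idea can be sketched as follows. This is a hypothetical fragment, not the chapter's actual framework: it assumes only the PTX special register %smid, which reports the executing SM, and task_of_sm is an invented name for a host-prepared mapping:

```cuda
// Read the ID of the streaming multiprocessor (SM) executing the current
// block, via the PTX special register %smid.
__device__ unsigned int get_smid() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Hypothetical worker: instead of accepting the hardware scheduler's block
// placement, each block discovers its SM and pulls the task that a
// host-side scheduler assigned to that SM.
__global__ void sm_centric_worker(const int *task_of_sm, int *log) {
    unsigned int sm = get_smid();
    int task = task_of_sm[sm];               // the task mapped to this SM
    if (threadIdx.x == 0)
        log[blockIdx.x] = task;              // record the mapping for inspection
    // ... the block would then execute 'task' ...
}
```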

Chapter 5 investigates the complexity of data placement on GPUs and introduces PORPLE, a software framework that resolves this complexity automatically for a given GPU application. Data placement is an important issue because modern GPU memory systems consist of a number of components with different properties, raising the question of which memory component each piece of data should be placed on.
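
As a hypothetical illustration of the choice involved (names and sizes invented here, not taken from PORPLE), the same read-only table can be placed in constant, global, or shared memory, with very different performance depending on the access pattern:

```cuda
// Three possible placements for the same 256-entry lookup table.
__constant__ float lut_const[256];         // constant memory: best when all
                                           // threads of a warp read one entry

__global__ void apply_lut(const float *lut_global,  // alternative: global memory
                          const unsigned char *idx, float *out, int n) {
    __shared__ float lut_shared[256];      // shared memory: good for scattered
                                           // accesses reused within a block
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        lut_shared[i] = lut_global[i];     // stage the table once per block
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = lut_shared[idx[i]];       // the placement chosen here decides
                                           // the kernel's memory behavior
}
```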

Part 2 presents useful algorithms and applications for GPUs. It includes nine chapters: 6–14. Chapter 6 focuses on the analysis of biological sequences. High-throughput DNA sequencing techniques have led to exponential growth of biological databases. These biological sequences must be analyzed and interpreted in order to determine their function and structure. The problem is that biological databases are growing faster than the performance of a single-core processor, so with the emergence of many-core processors it becomes possible to accelerate biological sequence analysis tools. This chapter discusses recent advances in GPUs for two main sequence comparison problems: pairwise sequence comparison and sequence-profile comparison.

Chapter 7 discusses graph algorithms on GPUs. It starts by presenting and comparing the main data structures and techniques for representing and analyzing graphs on GPUs. It then provides theory and an up-to-date review of efficient graph algorithms for GPUs, focusing mainly on traversal algorithms (breadth-first search), single-source shortest path (Dijkstra, Bellman-Ford, delta stepping, hybrids), and all-pairs shortest path (Floyd-Warshall). Finally, the chapter discusses load balancing and memory access techniques, with an overview of their main issues and management techniques.
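
As a taste of such algorithms, the following is a simple vertex-parallel formulation of one BFS level over a graph in CSR form (our own sketch; the chapter covers far more refined, load-balanced variants):

```cuda
// One level of a level-synchronous BFS. dist[] is initialized on the host
// to -1 everywhere except the source vertex, which gets 0.
__global__ void bfs_level(const int *row_ptr, const int *col_idx, int *dist,
                          int level, int n, int *frontier_nonempty) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || dist[v] != level) return;       // only frontier vertices work
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
        int u = col_idx[e];
        if (dist[u] == -1) {                      // unvisited neighbor
            dist[u] = level + 1;                  // benign race: any writer wins
            *frontier_nonempty = 1;               // ask the host for one more pass
        }
    }
}
```

The host relaunches the kernel with an incremented level until frontier_nonempty remains 0.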

Chapter 8 considers sequence alignment, where the authors address the optimal alignment of two and three sequences using GPUs. The problem of aligning two sequences is commonly referred to as pairwise alignment. Experimental results on the NVIDIA Tesla C2050 are then presented.
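
As background, pairwise alignment is typically solved by dynamic programming over a score matrix H. In the classic Needleman-Wunsch formulation (a standard recurrence, not reproduced from the chapter), for sequences a and b, substitution score s, and gap penalty g:

```latex
H_{i,j} = \max\!\left\{\,
  H_{i-1,\,j-1} + s(a_i, b_j),\;
  H_{i-1,\,j} - g,\;
  H_{i,\,j-1} - g
\,\right\}
```

All cells on an anti-diagonal of H are mutually independent, which is precisely the parallelism GPU implementations exploit.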

Chapter 9 introduces the Augmented Block Cimmino Distributed (ABCD) algorithm for solving tridiagonal systems on the GPU. Tridiagonal systems, with their special structure, appear frequently in scientific and engineering problems, such as alternating direction implicit methods, fluid simulation, and Poisson's equation. This chapter presents the parallelization of the ABCD method for solving tridiagonal systems on the GPU. Among other aspects, it investigates a boundary padding technique that eliminates execution branches on the GPU. Various performance optimization techniques, such as memory coalescing, are also incorporated to further enhance performance. The performance evaluation shows speedups of over 24× for the GPU implementation over the traditional CPU version.
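
For reference, a tridiagonal system Ax = d couples each unknown only to its immediate neighbors:

```latex
a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i, \qquad i = 1, \dots, n,
\qquad (a_1 = c_n = 0)
```

This narrow coupling is what block-partitioning schemes such as ABCD exploit to decompose the system across parallel processors.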

Chapter 10 discusses linear and mixed-integer programming methods, showing that complex problems from the operations research (OR) community can benefit greatly from GPUs. The authors also survey the main contributions to the field of GPU computing applied to linear and mixed-integer programming, highlighting how different authors overcame the implementation difficulties.

Chapter 11 considers accelerated shortest path computation for planar graphs. Three algorithms and their associated GPU implementations are described for two types of shortest path problems. The first algorithm solves the All-Pairs Shortest Path problem, while the second trades this ability for better parallel scaling properties and memory access behavior. The third algorithm solves the Single-Pair Shortest Path query problem. The implementations exploit the computational power of 256 GPUs simultaneously, improving on existing parallel approaches by an order of magnitude.

Chapter 12 discusses sorting algorithms on GPUs. After a brief introduction to CUDA programming, memory and computation models, and generic GPU programming strategies, current GPU implementations of common and popular parallel sorting algorithms, including bitonic, radix, and merge sorts, are studied in detail, while other algorithms, including quicksort and warp sort, are briefly discussed.
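
As an example of how naturally such sorting networks map to GPUs, one compare-and-exchange stage of a bitonic network can be written as follows (a textbook formulation, not code from the chapter; n must be a power of two):

```cuda
// One compare-and-exchange stage of a bitonic sorting network. The host
// launches this kernel once for every pair (k, j): k = 2, 4, ..., n and
// j = k/2, k/4, ..., 1, with one thread per element.
__global__ void bitonic_step(float *data, int j, int k) {
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ixj = i ^ j;                 // partner element for this stage
    if (ixj > i) {                            // handle each pair exactly once
        bool up = ((i & k) == 0);             // sorting direction of this run
        if (( up && data[i] > data[ixj]) ||
            (!up && data[i] < data[ixj])) {
            float t = data[i]; data[i] = data[ixj]; data[ixj] = t;
        }
    }
}
```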

In Chapter 13, Massively Parallel Compression (MPC), an effective floating-point compression algorithm for GPUs, is introduced. The chapter considers lossless compression algorithms for single- and double-precision floating-point values. The authors compare existing algorithms to identify the best ones, explain why they compress efficiently, and derive the MPC algorithm from them. The resulting algorithm requires little internal state and roughly matches the best CPU-based algorithms in compression ratio, while outperforming them by one to two orders of magnitude in throughput and energy efficiency.
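
One building block common to compressors of this kind can be sketched as follows (a greatly simplified illustration of value prediction with residuals, not the actual MPC pipeline):

```cuda
#include <cstdint>

// Predict each value from its predecessor at distance 'dim' (e.g., the
// number of interleaved data dimensions) and emit the XOR residual.
// Well-predicted streams yield residuals that are mostly zero bits, which
// a subsequent stage can pack compactly. Trivially parallel and reversible.
__global__ void xor_residuals(const uint64_t *in, uint64_t *out,
                              int n, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uint64_t pred = (i >= dim) ? in[i - dim] : 0;   // last-value prediction
    out[i] = in[i] ^ pred;                          // residual; decode by XOR
}
```

Decompression reverses the transform by computing in[i] = out[i] ^ in[i - dim] along each chain.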

Chapter 14 discusses sparse matrix-vector multiplication (SpMV), a fundamental computational kernel used in scientific and engineering applications. The nonzero elements of sparse matrices are represented in various formats, and no single representation suits all sparse matrices with their differing sparsity patterns. The chapter introduces an adaptive representation approach for SpMV: based on the configuration and characteristics of the GPU cores, the proposed adaptive GPU-based SpMV scheme chooses the best representation for the given input matrix. The effects of various parameters and settings on the performance of SpMV applications under different sparse formats are studied. Compared to the state-of-the-art sparse library and the latest GPU SpMV method, the proposed adaptive scheme improves SpMV performance by 2.1× for single-precision and 1.6× for double-precision formats, on average.
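
For context, the simplest CSR-based kernel assigns one thread per row (a textbook sketch, not the chapter's adaptive scheme); its sensitivity to row-length variation is exactly what motivates adaptive format selection:

```cuda
// Scalar CSR SpMV: one thread per row computes y = A*x. Simple and compact,
// but slow on matrices with long or highly variable rows, since threads of
// a warp then do unequal work and make irregular memory accesses.
__global__ void spmv_csr(const int *row_ptr, const int *col_idx,
                         const float *val, const float *x, float *y,
                         int rows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float sum = 0.0f;
    for (int e = row_ptr[r]; e < row_ptr[r + 1]; ++e)
        sum += val[e] * x[col_idx[e]];            // irregular gather from x
    y[r] = sum;
}
```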

Part 3 focuses on architecture and performance issues and includes four chapters: 15–18. In Chapter 15, a framework is introduced to accelerate the bottlenecks in GPU applications. The difficulty is that applications have heterogeneous requirements and therefore encounter different bottlenecks during execution, creating imbalances in the utilization of GPU resources. The chapter introduces the Core-Assisted Bottleneck Acceleration (CABA) framework, which employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms for automatically generating "assist warps" that execute on GPU cores to perform specific tasks. In other words, it enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, thereby improving GPU performance and efficiency. The CABA architecture is then discussed and evaluated thoroughly for performing effective and flexible data compression in the GPU memory hierarchy. Results show that using CABA for data compression provides an average performance improvement of 41.7% (up to 2.6×) across a wide range of memory-bandwidth-sensitive GPGPU applications.

Chapter 16 considers accelerating GPUs through neural algorithmic transformation. The high performance of GPUs comes from exploiting large degrees of data-level parallelism under the single instruction multiple thread (SIMT) execution model. Many application domains benefit from GPUs, including recognition, gaming, data analytics, weather prediction, and multimedia. Most of these applications are also good candidates for approximate computing, which can further improve GPU performance and energy efficiency. Among approximation techniques, neural accelerators yield significant performance and efficiency gains. This chapter describes a neurally accelerated GPU (NGPU) architecture that embeds neural acceleration within GPU accelerators without hurting their SIMT execution.

In Chapter 17, the authors introduce a heterogeneous photonic network-on-chip with dynamic bandwidth allocation for GPUs. Future multicore chips are predicted to contain hundreds of heterogeneous components, including processing nodes, distributed memory, custom logic, GPU units, and programmable fabrics. Since future chips are expected to run multiple, varied parallel workloads simultaneously, different communicating cores will require different bandwidths, making a heterogeneous Network-on-Chip (NoC) architecture a must. Recent research has shown that photonic interconnects can achieve high-bandwidth, energy-efficient on-chip data transfer. This chapter discusses a dynamic heterogeneous photonic NoC (d-HetPNOC) architecture with dynamic bandwidth allocation that achieves better performance and energy efficiency than a homogeneous photonic NoC architecture.

Chapter 18 presents models for GPGPU frequency scaling, including CRISP, the first runtime analytical model of performance under variable frequency in a GPGPU. Prior models, not having targeted GPGPUs, failed to account for important characteristics of GPGPU execution, such as the high degree of overlap between memory access and computation and the frequency of store-related stalls. CRISP is far more accurate than prior runtime performance models, staying within 4% on average when scaling the frequency by up to 7×. The results reveal a 10.7% improvement in energy-delay product, versus 6.2% for the best prior method.
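
For reference, the energy-delay product weighs energy against performance. With E the energy consumed and D the execution time (delay),

```latex
\mathrm{EDP} = E \times D
```

Lowering the frequency typically reduces E while increasing D, so a runtime model such as CRISP is needed to pick operating points that improve the product.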

Part 4 considers power consumption and reliability issues and includes five chapters: 19–23. Chapter 19 addresses the energy and power issues of GPUs. The discussion covers different algorithm implementations, program inputs, core and memory clock frequencies, and the effect of source-code optimizations on the energy and power consumption, and the performance, of a modern compute GPU. Importantly, the authors distinguish between compute-intensive and memory-intensive codes, as well as between regular and irregular codes, focusing on the behavioral differences between these classes of programs. Examples of software approaches for modifying the energy, power, and runtime of GPU kernels are examined, along with how they can be used to improve energy efficiency.

Chapter 20 introduces an STT-RAM-based L2 cache architecture for GPUs. The key to achieving high performance is the large number of threads that lets GPUs hide memory access latency through maximum thread-level parallelism (TLP). The downside, however, is that increasing TLP and the number of cores does not necessarily increase performance, due to contention among threads for shared resources such as the last-level cache. For this reason, future GPUs should have a larger last-level cache (L2 in GPUs); yet current trends in VLSI technology and GPU architecture favor increasing the number of processing cores instead. This chapter presents an efficient L2 cache architecture for GPUs based on STT-RAM technology. The strengths of this technology, high density and low power consumption, make STT-RAM a good candidate given the limited area for on-chip memory banks; its downsides are the high energy and latency of write operations. Low-retention (LR) STT-RAM cells can reduce the energy and delay of writes, and the chapter's investigations show that a two-part STT-RAM-based L2 cache can be designed with LR and high-retention (HR) parts. The proposed cache architecture improves IPC by up to 171% (20% on average) and reduces the average power consumption by 28.9% compared with a conventional L2 cache architecture of equal on-chip area.

Chapter 21 highlights power management in mobile GPUs. GPUs in mobile devices have come a long way in recent years in their ability to accelerate various applications, from graphics-specific applications, such as games, to traditional general-purpose applications. The downside is significant power consumption when executing such a wide range of applications, necessitating sophisticated power management techniques for GPUs in mobile devices, which typically have a limited power budget and battery capacity. This chapter first discusses the pros and cons of state-of-the-art power management techniques for both graphics-specific and general-purpose applications running on GPUs. In particular, it shows that existing approaches lack coordination in power management between the CPU and the GPU residing on the same silicon die in mobile application processors. The authors then present a recently proposed power management solution that addresses these shortcomings through synergistic CPU-GPU execution and power management.

Chapter 22 addresses recent advances in GPU reliability research. While the popularity of GPUs as massively parallel coprocessors has increased dramatically over the last 10 years, improvements in GPU reliability have lagged behind adoption. GPUs were first designed for gaming applications, which demand little attention to high reliability. They are now used to accelerate applications in medical imaging, nuclear science, materials science, social science, finance, and other fields, all of which require a high level of reliability. GPUs are also a popular choice for high-node-count data centers and many of the fastest supercomputers in the world, where the sheer number of devices exacerbates system failure rates.

Chapter 23 discusses the hardware reliability challenges in GPGPUs. On the one hand, graphics applications effectively mask errors and have relaxed requirements on computational correctness, which is why error detection and fault tolerance have received little attention in GPUs. On the other hand, HPC applications place hard requirements on execution correctness, making reliability a growing concern in GPGPU architecture design. Addressing this concern requires characterizing reliability in GPGPU design. There are two important hardware reliability challenges in GPGPUs: soft errors induced by particle strikes, and manufacturing process variations. This chapter explores a set of mechanisms to effectively improve the reliability of GPGPUs.
