Chapter 22

Advances in GPU reliability research

J. Wadden; K. Skadron    University of Virginia, Charlottesville, VA, United States

Abstract

While the popularity of using graphics processing units (GPUs) as massively parallel co-processors has increased dramatically over the last 10 years, improvements in GPU reliability have lagged behind adoption. GPUs are first and foremost gaming products, and therefore usually do not demand high reliability. However, GPUs are now being used to accelerate applications in medical imaging, nuclear science, materials science, social science, finance, and more that all require extremely high reliability. GPUs have even made their way into high-node-count data centers and many of the fastest supercomputers in the world, exacerbating system failure rates. Thus GPU reliability in these high-node-count environments has become an urgent, first-class design constraint. This chapter surveys the state-of-the-art research in GPU reliability in the context of the large, existing body of research on CPU reliability techniques.

Keywords

GPGPU; GPU reliability; Resiliency; Redundancy; Transient fault; Soft error

1 Introduction

When writing any type of software, programmers must rely on a set contract that describes the instructions and the exact instruction semantics for a particular architecture. This contract can be as low-level as the instruction set architecture (ISA) for a particular processor, or as high-level as a programming language such as C or Compute Unified Device Architecture (CUDA). Whatever the contract, it must be consulted to ensure correct implementation of an algorithm on the underlying machine model.

While hardware/software contracts may vary between different architectures and programming languages, we generally assume one underlying premise: executed instructions will behave according to the contract. In other words, the contract is never broken by the hardware. This is a simple and perhaps obvious assumption that is often overlooked by programmers. However, in practice, this assumption is impossible to guarantee. For example, an ISA might include an addition instruction that says 1 + 1 will return 2 when executed, but manufacturing defects, voltage noise, malicious attacks, or acts of nature might cause the processor to return 3! Note that some errors may occur from limitations of precision (e.g., in floating point) or from undefined behavior (e.g., x86 integer overflow), but these errors are known properties of the system and thus considered a part of the contract. In computing environments where breaches of the hardware/software contract are becoming increasingly hard to prevent, we must acknowledge in both hardware and software architecture some nontrivial chance of unexpected, incorrect behavior.

In this chapter, we will motivate GPU reliability as an important, if often overlooked problem, and discuss the current state-of-the-art in GPU reliability research. Unsurprisingly, GPUs are designed and sold primarily to satisfy graphics applications, which traditionally require a low level of reliability. However, computing’s recent adoption of GPUs as high-throughput, data-parallel coprocessors has elevated their reliability requirements almost to that of server processors. Unfortunately, for both economic and practical reasons, architectural enhancements for improving GPU reliability have lagged significantly behind this new use case, motivating intense research focus in this area.

Section 1.1 will first present common reliability terminology, how GPUs fail in real systems, and the difficulty in preventing radiation-induced soft errors. Section 1.2 will present a brief history of GPUs as coprocessors, their economics, and will frame the current issues with GPU reliability. Section 1.3 will present results from studies on real distributed systems and supercomputers showing alarming rates of soft errors, and motivating reliability as a first-class design constraint for future GPUs. The rest of the sections in this chapter will discuss different proposed methods of evaluating and improving GPU reliability. Section 2 will introduce studies of architectural vulnerability factor (AVF) and program vulnerability factor (PVF) modeling (Section 2.1), fault-injection (Section 2.2), and accelerated beam testing (Section 2.3) as methods for evaluating and predicting reliability of real systems. Section 3 will survey state-of-the-art approaches to improving reliability via hardware enhancements to a GPU’s micro-architecture. Section 4 will then survey state-of-the-art approaches to improving reliability via software enhancements to kernel code running on the GPU.

1.1 Radiation-Induced Soft Errors

While we often assume that combinational logic and storage elements on CPUs and GPUs always work correctly, any processor constructed using a traditional complementary metal oxide semiconductor manufacturing process is vulnerable to both permanent (hard) and transient (soft) failures or faults. Permanent faults are failures that manifest as stuck-at bits in the architecture, that is, lines that always carry the logical signal “0” or “1” as the result of a short or open circuit. These faults can be caused by many different physical processes from normal operation, for example, thermal stress, electromigration, hot carrier injection, gate oxide wear-out, and negative bias temperature instability (see Shubu Mukherjee’s book on architecture design for soft errors [1] for a more detailed discussion of these failure modes). While permanent faults may cause undesirable behavior, they are relatively easy to identify and eliminate because their incorrect output is consistent and they are isolated to a particular piece of hardware. To reduce the likelihood of permanent faults, techniques such as “burn-in” can be used to stress test hardware and identify failures before a processor enters its useful lifetime [1]. Periodic stress and check tests can also be run on the system in question to determine if permanent faults have developed over time [2]. Not only are permanent faults fairly easy to identify, they are also fairly easy to eliminate, because faulty hardware can simply be fixed or replaced. Because of this, researchers generally focus their efforts on protecting CPUs and GPUs against the more insidious transient fault. We therefore focus this chapter on state-of-the-art research into mitigating the effects of transient faults.

As the name suggests, transient faults are not permanent and may cause only a transistor or gate to malfunction for a short period of time. Transient faults can be caused by crosstalk between wires, voltage noise, and other types of electromagnetic interference. However, we typically associate transient faults with random high-energy particle strikes [3]. Particles that strike a chip have the potential to inject charge into the silicon, and if enough charge is injected, the transistor or gate may drive the wrong value temporarily. We call these temporary faults single-event upsets (SEUs) or, if more than one gate is affected, single-event multibit upsets (SEMUs) [1]. SEUs may be harmless if they are not permanently recorded; however, if an SEU occurs directly within a storage cell, such as a flip-flop on-chip, an incorrect value may be captured; if an SEU occurs in combinational logic, it may propagate incorrect values to storage cells. SEUs that are captured and committed to program output are referred to as silent data corruptions (SDC).

While the words fault and error are often used interchangeably, we use the definitions in Shubu Mukherjee’s book on architecture design for soft errors [1] to distinguish between the two. A fault is any failure of the device, for example, an SEU or SEMU; an error is a user-identified fault within a particular protection domain. A fault that goes undetected but ultimately changes user-visible program output is an SDC. A fault that does not ultimately affect program output is said to be masked. Errors that are known, but cannot be recovered from, are referred to as detected unrecoverable errors (DUEs). At first glance, an SDC and DUE may seem equally terrible; in both situations our data have been corrupted! However, it is hard to overstate how much more insidious SDCs are than DUEs. Not knowing whether an output is correct or incorrect means that your results may be wrong. For important calculations with high correctness requirements, this is simply unacceptable. SDCs represent a breach in the software/hardware contract that software developers may not be able to tolerate. If we could at least guarantee that all SDCs be converted to DUEs, a system could guide a user to conclude that some portion of their results are useless and should be discarded, or to engage an appropriate recovery mechanism such as reverting to the last checkpoint or restarting the computation. Our goal is therefore to practically eliminate the occurrence of SDCs, and preserve the integrity of the software/hardware contract, while keeping DUEs (that might require time-consuming recovery computation) to a minimum.

Fortunately, transient faults and soft errors are rare occurrences, and may happen on the order of once in tens of thousands of operating hours per device. We refer to this as the soft-error rate (SER) of a device, and it is a closely guarded secret of any silicon device vendor. Because the expected time between soft errors is typically longer than the useful lifetime of many GPUs, and because GPUs are generally used for low-stakes gaming or graphics applications, GPU reliability has traditionally been ignored [4]. In the next section, we will look at evidence to support why GPU reliability has been considered a low priority, the economics hindering GPU vendors from adopting more aggressive soft error protection, and the pitfalls of using unreliable hardware in large-node high-performance computing (HPC) systems.

1.2 GPUs as HPC Coprocessors

Even in high-stakes computation, a single SDC every 5 years might be a low enough SER to ignore. However, in high-node-count data centers or HPC supercomputers, with tens of thousands of nodes, an application using the whole system might experience an SDC on the order of hours or even minutes! This means that any product that is used in HPC (including GPUs) must be extremely robust. Because of this, GPUs were not even considered for high-node-count clusters until after they first gained error correcting code (ECC) support. NVIDIA’s Fermi microarchitecture was the first GPU to include ECC on all off-chip GDDR RAM and on-chip static RAM (SRAM) structures in 2009 [5]. Fermi was immediately adopted for HPC use and powered three out of the top five supercomputers in the world in 2010 [6]. Titan, Oak Ridge National Labs’ supercomputer, the second largest supercomputer in the world as of the writing of this chapter, originally had 18,688 NVIDIA Fermi GPUs, but was upgraded to house newer Kepler GPUs in 2013. Adding features to GPU architectures (such as more resilient cell design or ECC on even more SRAM regions) to gain parity with server CPU reliability standards is a logical next step beyond ECC support; however, this is prohibitively expensive for the vendor because HPC does not represent a large enough proportion of total revenue to justify the cost.

GPGPU-specific products like NVIDIA’s Tesla line are a growing part of overall GPU vendor revenue, but GPUs are and will continue to be sold primarily for gaming and graphics applications in the near future. NVIDIA does not publish the proportion of GPU cards sold for GPGPU applications versus gaming, but the company has mentioned yearly revenue from GPGPU products being on the order of “several hundred million dollars” [7]. Given that NVIDIA’s yearly revenue is almost 5 billion dollars, the GPGPU market probably accounts for between 5% and 8% of total revenue. While this is a significant and quickly growing segment of NVIDIA’s total revenue, it is not enough to justify architectural changes to improve reliability if those changes do not also positively impact the graphics business. In the future we may see scientific computing and HPC become healthy enough businesses to justify their own derivative products, but for now GPUs will remain focused on increasing gaming and graphics performance.

1.3 Evidence of Soft Errors in HPC Systems

Because device SERs are so closely guarded by CPU, GPU, and DRAM vendors, identifying even rough SER estimates for large-node supercomputers is challenging, if not impossible. However, many CPUs and GPUs contain special registers that log parity and ECC events as well as other pipeline exceptions. By querying these counters in the field, researchers can get a reasonable approximation of the SER of individual components. In this section we present the state-of-the-art field studies analyzing GPU failures and SERs in real systems. Following sections will present other methods for evaluating GPU SERs, but we present field studies on real HPC systems here to motivate GPU reliability as an important problem and to motivate research into practical GPU reliability solutions.

Field studies examining failures in DRAM because of hard and soft errors were first conducted on large distributed systems by Li et al. [8] and Schroeder et al. [9]. This work showed that hard faults could be distinguished from soft faults by looking at the temporal locality of errors. The main insight is that it is extremely unlikely that frequent errors in the same memory location are caused by random and infrequent particle strikes; thus recurring errors in the same location are most likely hard errors. Results showed failures were dominated by hard errors in DRAM modules, but the study ultimately could not reliably distinguish between hard and soft errors without root access to machines. Sridharan et al. conducted a larger and more precise study of DRAM failures on Los Alamos National Lab’s Jaguar supercomputer [10]. Jaguar was an ideal choice for examining hard and soft error rates for two reasons: (1) it had an extremely high node count (18,688 CPUs) with every node containing identical hardware, and (2) it resides at a high altitude (7500 ft) where cosmic radiation is measurably more intense than at sea level. Results showed that approximately 30% of all DRAM failures could be attributed to soft faults, and that the entire system experienced about 900 faults per month, or a little more than 1 fault per hour! Newer studies suggest that soft faults are a smaller proportion of overall faults, and may be as low as 1.5% when accounting for faults in the entire memory path (including memory controllers, busses, and channels), rather than just memory modules [11].

While DRAM is an important piece of a system, the preceding studies did not look at on-die SRAM or logic failure rates in CPUs and GPUs. Modern studies have taken care to distinguish between DRAM and SRAM failures and have found that SRAM faults are dominated by transient faults. Sridharan et al. conducted an experiment on the Cielo and Jaguar supercomputers that showed 98% of faults in L2 and L3 SRAM structures were transient faults [12]. Interestingly, the same study also showed a correlation between error rates and position of the processor in a server rack. The higher the processor, the more likely it was to experience a soft fault, indicating that CPUs in lower racks may be shielded from cosmic rays by the upper racks.

The first large-scale field study of GPU failure rates was conducted on the Blue Waters supercomputer at the University of Illinois [13]. The study showed that GPU GDDR5 memory had 10× more soft errors than the DDR3 main memory and that GPU cards were by far the least reliable component in the system, but the authors did not study on-chip SRAM failures. Haque et al. [2] conducted a study on the Folding@Home network of distributed computers looking at DRAM memory errors in consumer-grade GPU cards. Controlled experiments found no errors, but examination of consumer-grade GPUs from the large-scale, 20,000-GPU distributed system revealed patterned hard errors in two-thirds of the cards. These errors were highly correlated with GPU architecture, indicating that the memory systems of more modern GPU architectures and server-grade compute GPUs such as NVIDIA’s Tesla line were indeed more reliable than gaming products—although the reason for this difference was not identified. More recently, a study looked at GPU DRAM and on-chip SRAM error rates in nodes on the Titan supercomputer at Oak Ridge National Labs and the Moonlight cluster at Los Alamos National Labs. Experiments on Titan showed that over 18 months, 899 of the 18,688 GPUs experienced at least 1 error, an average of 1.66 errors per day across the entire system [14].

Ultimately, because SDCs must be identified by comparing program output to a gold standard, the SERs identified in these studies underestimate the total number of data corruptions. While SDC rates and SERs are hard to verify experimentally, the next sections discuss how to evaluate how vulnerable GPUs are to soft errors using simulated vulnerability analysis, fault-injection, and accelerated particle beam testing.

2 Evaluating GPU Reliability

Field studies give us a very broad, blunt tool to examine both hard and soft faults on systems, but the practical difficulty of conducting such experiments and the imprecise nature of their results and conclusions lead researchers to examine artificially injected faults in architectural simulations as an alternative tool to assess processor and program reliability. Such simulations allow architects to glean information about the reliability of structures in an architecture and the relative susceptibility of programs to soft errors, without the need for large and expensive experiments on real hardware. In the extreme case, researchers will use particle accelerators to greatly increase the SER of a single chip in a real system. “Beam tests” can quickly isolate the effects of hard errors from soft errors and isolate failures in a single piece of silicon from failures in the rest of the system, without the need for an arduous field study. The following sections present current efforts in GPU architectural vulnerability analysis in simulation, fault-injection tools for simulating faults in real hardware, and current, state-of-the-art beam testing practice.

2.1 Architectural and Program Vulnerability Analysis

When adding any features to a processor, simulation is often used to perform a cost-benefit analysis. Does the feature increase instructions per cycle (IPC) in benchmark suites enough to justify the transistor real estate and engineering effort necessary to include it in the next iteration? This analysis is relatively straightforward in simulation because one can easily measure improvements in IPC, memory system performance, and utilization, but how would we quantify increased reliability? Architectural vulnerability analysis attempts this feat by tracking every bit in an architecture during the run of a program and calculating how likely it is to affect the program output.

Consider a bit stored in any sequential logic or memory cell in a processor; if we change the value of that bit, will it cause incorrect program output? Naively we might assume “of course!” but there are many structures in CPUs and GPUs where state is either stored solely for performance enhancement, or never read at all. Take, for instance, a branch target buffer where predicted addresses are stored in an SRAM structure on-chip. If we flip a bit in a target address, how will this affect the correctness of the program if that target address is read as a prediction? If the address is a correct prediction before the SEU and the bit-flip creates an incorrect prediction, then we have simply wasted a performance opportunity and forced a mispredict. In the off chance it was an incorrect prediction, and the bit-flip created a correct prediction, then the fault accidentally increased performance. In either case, the performance of the program may have been affected, but the correctness was never compromised. In this sense, the bits in a branch target buffer are invulnerable, in that they are not required for architecturally correct execution (ACE). We refer to these invulnerable bits as un-ACE, and any bits required for correct execution as ACE. All idle or invalid state, misspeculated state, and predictor structures are un-ACE. State that is never read is un-ACE, and instructions that produce state that is never read are un-ACE! We refer to these instructions that never affect program output as dynamically dead. We refer to instructions whose results are used only by dynamically dead instructions as transitively dynamically dead, as they are also not required for correct execution.
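As a concrete illustration (a contrived CUDA C++ fragment with invented variable names, which an optimizing compiler would in practice simply remove), the values below fall into the categories just described:

    __global__ void ace_example(const float* in, float* out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float a = in[tid] * 2.0f;  // ACE: this value reaches user-visible output below
        float b = in[tid] + 1.0f;  // transitively dynamically dead: read only by the dead value c
        float c = b * 3.0f;        // dynamically dead: never read again
        out[tid] = a;              // only a affects program output
    }

A transient fault in the register holding a is likely to corrupt the output, while a fault in the registers holding b or c is masked by construction; during the cycles those registers hold b or c, their bits are un-ACE.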

An example of a structure where bits are ACE most of the time is the CPU’s program counter. If we flip a bit in the program counter, will it affect correct execution? Probably! Jumping to random locations in a program would almost certainly cause incorrect output or a program exception. However, some bits, even those in the program counter, are not ACE all the time. Consider the program counter just before a new instruction address is written to the register. The program counter could literally have any value before being written to, and therefore the bits are un-ACE until immediately after the write. An entire instruction becomes dynamically dead if its results are never read or never used for ACE computation, for example, if the program halts.

Because the ACEness of a bit or instruction may depend on dynamic behavior, we use the proportion of cycles a bit is ACE over the total number of program cycles as a measure of vulnerability to soft errors. This ratio is called the architectural vulnerability factor or AVF and is a measure of the likelihood a change in any individual bit will result in incorrect program output. Because practically calculating the AVF of every individual bit in an architecture requires fast and accurate register transfer language (RTL) models of processors, researchers often use Little’s Law [1] to calculate the average AVF of bits in a structure. Little’s Law states that the AVF of a bit in a structure is the bandwidth of ACE bits entering the structure per cycle, times the average number of cycles a bit resides in the structure, divided by the total number of bits in the structure. This method of AVF calculation allows for relatively easy integration into existing performance simulators. Once estimated, AVFs can be used to navigate the cost/benefit of enhancing the reliability of structures in a CPU or GPU. For instance, why add expensive ECC to the entries of the branch target buffer if it has an AVF near zero and is rarely required for correct execution?
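As a minimal sketch of this calculation (the structure size, ACE-bit bandwidth, and residency below are invented for illustration, not measured values), the following host-side C++ mirrors the Little’s Law estimate:

    #include <cstdio>

    // Little's Law estimate of a structure's average AVF:
    //   AVF = (ACE bits entering per cycle) x (average residency in cycles) / (total bits)
    double structure_avf(double ace_bits_per_cycle, double avg_residency_cycles, double total_bits)
    {
        return (ace_bits_per_cycle * avg_residency_cycles) / total_bits;
    }

    int main()
    {
        double total_bits = 256.0 * 1024.0 * 8.0;  // a hypothetical 256 KB register file
        double ace_bw     = 512.0;                 // ACE bits written per cycle (assumed)
        double residency  = 400.0;                 // average cycles an ACE bit stays live (assumed)
        printf("Estimated AVF = %.3f\n", structure_avf(ace_bw, residency, total_bits));
        // 512 * 400 / 2,097,152 is roughly 0.098, i.e., an average AVF of about 10%.
        return 0;
    }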

Because AVF is derived from runs of many different programs, and ignores program specific reliability, Sridharan et al. introduced the concept of PVF [15]. PVF is a micro-architecturally independent measurement of architectural program reliability. Sridharan adapted ACE analysis such that the reliability of software resources can be measured, just as traditional ACE analysis can identify the AVF of hardware resources of a micro-architecture. PVF works by calculating the ACEness of instruction-level resources, such as a register, during the lifetime of that instruction, until it is committed. In this way, PVF calculation of architectural program resources (such as registers) can approximate the vulnerability of the program across any micro-architecture.

CPU AVF analysis has been well researched since it was introduced by Mukherjee et al. in 2003 [16]. Results from their simulations of an Itanium2 processor found that only 46% of instructions were ACE for combined integer and floating-point benchmarks [1]. Today, AVF continues to be an important part of reliability, availability, and serviceability (RAS) architecture strategy.

GPU AVF analysis was only recently motivated after GPU inclusion into HPC systems. Previously, single node reliability was never important enough to justify fine-grained analysis. When first included in HPC systems, GPU architects simply added single error correct, double error detect (SEC-DED) ECC on all of their large SRAM structures such as the instruction and data L1 caches, L2 cache, and register files. This is a reasonable but conservative and expensive strategy for reliability, and does not address the reliability of every part of a GPU’s pipeline. Ideally AVF analysis could identify which high-priority structures require the most protection, and conversely which presumed high-priority structures are overprotected.

Tan et al. [17] and Farazmand et al. [18] were the first to study the AVFs of GPU structures. Their studies showed that GPU AVFs varied highly depending on the studied software and the software’s relative utilization of different hardware structures on the GPU, where higher utilization of a structure corresponded with a higher AVF for that structure. Farazmand et al. were the first to compare GPU AVFs to those of CPUs and found that the AVFs of relatively large GPU structures, like the GPU’s massive vector register files, were usually much lower than those of corresponding structures in CPUs. These low AVFs can be attributed to low thread occupancy and low utilization of the register file and the GPU’s scratchpad memory. Intuitively, this makes sense; if a GPU streaming multiprocessor (SM) contains a massive 256 KB register file, but threads are only allocated 25 KB of registers, the AVF for that structure should be approximately 10× lower than that of a fully utilized structure. In contrast, a CPU’s register file state is generally much smaller, on the order of 1 KB, but usually has much higher utilization than a GPU’s register file.

Jeon et al. [19] performed the first AVF analysis of an integrated CPU-GPU processor (called an accelerated processing unit or APU in Advanced Micro Devices (AMD) terminology) and showed how the AVFs of structures in the CPU and integrated GPU varied over time. The motivation for this study lies in the uniqueness of a truly heterogeneous integrated system with a shared memory hierarchy, where both CPU and GPU AVF must be accounted for holistically. This work presented two interesting conclusions that highlight the complexity of GPU AVF analysis in contrast to CPU AVF analysis.

(1) The AVF of massive GPU structures is generally low, but the corresponding SERs can be extremely high. Similar to prior studies, GPU register file AVF was found to be very low (around 0–1.5% depending on the application) because of low utilization of the structure, that is, any individual bit-flip has a low likelihood of affecting program output. However, even though the per-bit AVF of the register file may be low, because a GPU’s register file is so large, the overall failure rate of the structure is extremely high. To calculate the SER of a structure, per-bit AVF must be multiplied by the size of the structure and the transistor-process-dependent raw failure rate (see the sketch following this list). Jeon et al. found that after adjusting for these factors, the GPU’s register file SER was up to three orders of magnitude larger than the CPU’s, due primarily to the fact that a CPU’s register file is hundreds of times smaller than the register file of even a single GPU SM or compute unit (CU). This finding motivates higher levels of protection for the GPU’s register file, rather than the lower levels that the raw AVF alone would suggest. As an architect, this is frustrating. We would like our ECC to have a large bang for its buck, but in the case of the massive register files in GPUs, ECC is necessary on every bit yet rarely taken advantage of because of GPU register file underutilization.

(2) AVF is highly time dependent in an integrated CPU-GPU architecture. If the CPU is doing computation, and not offloading computation to the GPU, the AVF of a GPU’s structures is implicitly zero. On the other hand, when the GPU is doing computation, the AVF or SER of structures can be extremely high. Just as the size of GPU structures decreases the “bang for the buck” of ECC, this “bursty” nature of GPU vulnerability also reduces the “bang for the buck” of ECC protection. While average AVF may be low over the runtime of an application across GPUs and CPUs, some GPU structures intermittently require very high reliability and protection. Thus ECC is a required addition to the architecture, but is rarely used on average.
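To make the scaling in point (1) concrete, the short C++ sketch below multiplies an assumed raw per-bit failure rate by structure size and AVF. Every number in it is invented for illustration, and the FIT values should not be read as measurements of any real device:

    #include <cstdio>

    // Scale a raw per-bit failure rate by structure size and measured AVF.
    // FIT = failures per billion device-hours; the raw rate below is invented.
    double structure_ser_fit(double per_bit_raw_fit, double bits, double avf)
    {
        return per_bit_raw_fit * bits * avf;
    }

    int main()
    {
        double raw_fit_per_bit = 1.0e-4;  // assumed process-dependent raw rate (FIT/bit)

        // Hypothetical CPU register file: 1 KB, relatively high AVF.
        double cpu = structure_ser_fit(raw_fit_per_bit, 1.0 * 1024 * 8, 0.30);

        // Hypothetical GPU register file: 256 KB per SM, 15 SMs, very low AVF.
        double gpu = structure_ser_fit(raw_fit_per_bit, 256.0 * 1024 * 8 * 15, 0.015);

        printf("CPU register file SER: %.2f FIT\n", cpu);  // ~0.25 FIT
        printf("GPU register file SER: %.2f FIT\n", gpu);  // ~47 FIT, roughly 190x higher here
        return 0;
    }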

2.2 Fault-Injection

AVF analysis is extremely useful for doing fast design-space exploration and for setting early upper bounds on the vulnerability of processor structures. However, it is fundamentally a conservative analysis, as it assumes that a bit is ACE until it is proven to be otherwise. Both a lack of micro-architectural detail in performance simulators and a difficulty in adding such detail (either because of a lack of proprietary knowledge or prohibitive engineering effort) contribute to underestimating the practical reliability of the architecture [20]. Furthermore, even if one has access to a representative cycle-accurate simulator, or RTL description, AVF analysis fails to account for fault-induced changes in program behavior that may not actually affect program output. AVF analysis usually does not consider a program’s final result and instead analytically arrives at a probability of incorrect execution, whether or not that incorrect execution causes a user-visible error.

A much more straightforward, but more reactive, approach to examining reliability of architectures is fault-injection. Fault-injection randomly induces a bit-flip into a running program, either architecturally in the ISA, micro-architecturally into a structure such as a register file, or into a storage bit in an architecture at the RTL model level. Once a fault is injected, the fault-injection tool looks for changes in user-visible output by comparing to a precomputed golden run; if a change is exhibited, the fault is manifested as an SDC, and is recorded. By looking at injected faults and how they affect program output, architects can get a broad sense for how reliable structures are, at least in relation to each other.

Fault-injection can be done either presilicon (e.g., on an RTL-level processor simulator) or on real hardware to evaluate the reliability of production GPUs. In simulation, fault-injection is useful for evaluating the reliability gained from proposed reliability-enhancing structures such as ECC, redundant execution units, or redundantly executed code, things which AVF and ACE analysis do not consider. In our upcoming discussion, we will present tools capable of performing fault-injection on real GPUs. Simulated fault-injection studies and methodologies will be discussed as a part of the proposed hardware-based reliability research.

Performing fault-injection is a fairly simple task: stop the simulation or processor, pick a structure in the GPU in which to inject a fault in a systematic and controlled manner, modify that structure to reflect a hypothetical transient fault, let the program run, and record whether or not the run produced an SDC. While simple, fault-injection can be extremely time consuming. There are an extremely large number of configurations that the GPU could be in at any one time. On every cycle, any one of literally thousands to hundreds of thousands of GPU thread contexts could be affected. Therefore, fault-injection experiments on both CPUs and GPUs suffer from a dilemma. We can either perform a fast experiment with a low number of injected faults, but at the risk of not properly characterizing the reliability of the GPU, or we can spend a large amount of computational power, over tens of thousands of trials and many benchmarks, properly characterizing the reliability of a GPU. For the purposes of properly executed science, researchers are, begrudgingly, forced to choose the latter option.
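The flow above can be sketched on real hardware with nothing more than the CUDA runtime. The harness below uses a trivial SAXPY kernel as the application under test, flips one randomly chosen bit of one input value to emulate a transient fault in architectural state, and classifies the outcome against a golden run; the kernel, tolerance, and injection site are all invented for illustration:

    #include <cstdio>
    #include <cstdlib>
    #include <cmath>
    #include <cuda_runtime.h>

    #define N 1024

    __global__ void saxpy(float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) y[i] = a * x[i] + y[i];
    }

    // Flip one bit of one element, emulating a transient fault in stored state.
    __global__ void inject_bit_flip(float* y, int elem, int bit)
    {
        unsigned int* word = reinterpret_cast<unsigned int*>(&y[elem]);
        *word ^= (1u << bit);
    }

    int main()
    {
        float x[N], y_init[N], golden[N], faulty[N];
        for (int i = 0; i < N; ++i) { x[i] = i * 0.5f; y_init[i] = 1.0f; }

        float *dx, *dy;
        cudaMalloc(&dx, sizeof(x));
        cudaMalloc(&dy, sizeof(y_init));
        cudaMemcpy(dx, x, sizeof(x), cudaMemcpyHostToDevice);

        // Golden run: no fault injected.
        cudaMemcpy(dy, y_init, sizeof(y_init), cudaMemcpyHostToDevice);
        saxpy<<<(N + 255) / 256, 256>>>(2.0f, dx, dy);
        cudaMemcpy(golden, dy, sizeof(golden), cudaMemcpyDeviceToHost);

        // Faulty run: corrupt one architectural value, then let the program run.
        cudaMemcpy(dy, y_init, sizeof(y_init), cudaMemcpyHostToDevice);
        inject_bit_flip<<<1, 1>>>(dy, rand() % N, rand() % 32);
        saxpy<<<(N + 255) / 256, 256>>>(2.0f, dx, dy);
        cudaMemcpy(faulty, dy, sizeof(faulty), cudaMemcpyDeviceToHost);

        // Output comparison: deviations within a small tolerance are counted as
        // masked/benign; anything larger is recorded as an SDC. Crashes and hangs
        // would be caught separately via CUDA error codes and timeouts.
        bool sdc = false;
        for (int i = 0; i < N; ++i)
            if (fabsf(golden[i] - faulty[i]) > 1e-3f) { sdc = true; break; }
        printf("%s\n", sdc ? "SDC: output differs from golden run" : "fault was masked or benign");

        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }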

To address this issue, researchers have decided to inject faults at a higher level of abstraction than at the RTL model layer. Yim et al. developed SWIFI [21], a fault-injection framework that uses a source-to-source translator to modify programs before they are run. SWIFI was able to perform 10,000 fault-injection experiments per tested application. Unfortunately, SWIFI’s source-level software abstraction is too coarse-grained to accurately reflect the impact of faults at the micro-architectural level or the assembly instruction layer.

GPU-Qin [22], developed by Fang et al., attempts to solve this problem by injecting faults at the GPU architectural layer, using the debugging tool cuda-gdb [23]. Cuda-gdb can be used as an external hook into a running CUDA program to selectively trace and modify values in the CUDA program’s code or data. At runtime, GPU-Qin first profiles the CUDA application and classifies threads into groups with identical program traces. For each fault-injection run, GPU-Qin then selects a thread from each group and identifies an instruction at which to halt the program and inject a fault. Faults are injected in proportion to the size of the groups to reflect the overall proportion of threads executing on the GPU. Instructions are chosen uniformly over time to represent the uniformly random nature of radiation-induced transient faults. To simulate a fault in an arithmetic instruction, faults are injected into the destination operand of the instruction. To simulate a fault in a memory instruction, GPU-Qin injects faults into the destination operand or the address operand of load/store instructions, respectively. For control flow instructions, cuda-gdb does not allow for the modification of predicate registers, which control the behavior of individual threads within a CUDA single-instruction multiple-data (SIMD) instruction or “warp”; therefore GPU-Qin injects a fault into the control flow instructions of an entire warp. Once a fault is injected, the outcome of the execution (crash, program hang, SDC, or normal) is recorded.

SASSI [24] is a compiler-based GPU assembly instrumentation framework, which also leverages cuda-gdb. SASSIFI [25] is a fault-injection framework built around SASSI, which has similar functionality to GPU-Qin. SASSIFI randomly injects single bit-flips into the destination values of all executing instructions. Unlike GPU-Qin, SASSIFI is unbiased, meaning that each instruction is given equal weight, rather than weighting injections based on the number of times an instruction is dynamically executed like GPU-Qin.

All of these tools are extremely helpful for characterizing the reliability of existing architectures and identifying particular properties of programs that make them more unreliable. However, current GPU fault-injection tools are in their infancy and still need extensive validation and correlation with physical fault models. In the next section, we will take fault-injection to its extreme and look at studies that physically induce transient faults into real hardware using accelerated neutron or alpha particle beams.

2.3 Accelerated Particle Beam Testing

While AVF analysis is useful for bounding reliability of structures early in the design process and fault-injection in hardware is useful for analyzing how faults affect program output, both of these techniques ignore the physical rate and manifestation of actual soft errors. Ideally, just as in field studies of large node supercomputers, we want to design experiments to identify both the rate and effects of transient faults on processors, rather than just the vulnerability of a particular architecture. Field studies of soft errors in real systems (previously presented) can roughly identify how vulnerable individual nodes are, but these field studies are extremely difficult to do, requiring thousands of nodes and years of observation to acquire data that often does not correspond to processor reliability. Also, because field studies are done on real systems, this limits the experiments that can be performed, as they may interfere with higher priority user-level programs and experiments. To overcome these issues, researchers turned to radioactive particle sources and particle accelerators to approximate the effects of soft errors from natural sources in real systems.

By placing silicon in the path of a beam of alpha particles or neutrons, researchers can get a sense for the effects of soft errors in nature in a fraction of the time. Dubbed “beam tests,” these experiments can isolate the effects of radiation-induced faults on a single silicon die [26]. Accelerated alpha particle beam tests are accomplished by exposing silicon to radioactive sources that emit alpha particles (such as thorium-232); the source must be placed in close proximity to the silicon itself because alpha particles are easily absorbed by packaging. Accelerated neutron beam tests are accomplished by placing silicon directly in the path of a high-energy neutron source. To approximate the effects of atmospheric radiation on silicon, the energy profile of the neutrons in the beam should at least approximately match that experienced on the Earth’s surface. Such beams are called “white neutron sources” and exist at several US national laboratories. The neutron beam at the Los Alamos Neutron Science Center (LANSCE) is often used for such tests.

Beam tests are extremely useful as they allow a real-world evaluation of current silicon reliability, of trends in silicon reliability over generations of products, and of the efficacy of software and hardware reliability mechanisms, but they also have major drawbacks. Firstly, they are extremely expensive and inconvenient. While not nearly as expensive and time consuming as collecting field data from performance counters on large supercomputers, access to neutron beam time is still costly and competitive. Secondly, unlike in fault-injection, where the source and rate of faults that manifest as errors are exactly known, beam tests still require root-cause analysis. We can probably assume that a user-visible error during a beam test was most likely caused by a transient fault induced by the neutron source, but exactly in what structure and on which cycle the transient fault occurred might be unknowable. Error causes and locations can be approximated by checking special GPU exception registers. SDC rates can be approximated by comparing program output to a gold standard. In the remainder of this section, we will look at the only published study of accelerated beam testing on GPUs, and examine the implications of this study for GPU architecture.

The first, and only, published study on accelerated neutron beam testing of GPUs was done by Tiwari et al. using the white (i.e., approximating the energy distribution of cosmic rays) neutron beam at LANSCE [14]. Because errors in systems might be caused by many different sources, such as application or driver bugs, Tiwari et al. focused on gathering data reported by internal exception registers in NVIDIA GPUs. Two systems were set up with two different NVIDIA GPUs, one Fermi-based C2050 and one Kepler-based K20. Each GPU was suspended in the line-of-fire of the beam and connected to a motherboard using a PCIe extender cable. To measure the raw sensitivity of the SRAM structures on-chip to transient faults, ECC support on internal SRAMs was disabled, patterns of 0s and 1s were stored to these structures, and the chip was exposed to the neutron beam. The researchers then calculated the cross-section of the structures, which is the number of errors observed in the structure divided by the neutron fluence (the beam’s flux integrated over time, i.e., particles per unit area). Per-bit cross-sections were calculated using patterns of all 0s and all 1s and reported for each GPU’s L2 cache and register file; a short sketch of how a cross-section translates into a field failure rate follows the list below. Three interesting results were highlighted from this experiment.

(1) Per-bit failure rates for the older 40 nm C2050 Fermi architecture were 2× to 3× higher than for the newer 28 nm Kepler-based K20. These failure rates contradict the common assumption that reliability automatically decreases as transistor feature sizes shrink, and were instead attributed to better bit-cell design, although the exact details of the improvement are proprietary. When normalized for the larger sizes of the L2 and register file in Kepler, the K20 was still found to be more resilient to errors.

(2) Bits in the L2 that are set to 0 are 40% more likely to be corrupted than bits set to 1. This trend was not seen for the register file, and thus indicates that asymmetric design of the cells in the cache (vs the symmetric bit-cell design in the register file) is what caused the discrepancy.
Both of these results highlight the need for attention to bit-cell design to increase base GPU resiliency. Unfortunately, such changes are one-time adjustments, and it is unlikely that cell-level reliability can continue to be improved cheaply. These results also suggest that fault-injection tools should account for this discrepancy in future studies.

(3) NVIDIA’s K20 GPU is six times more likely than the C2050 to experience a double-bit error (DBE) from a particle strike. This is hypothesized to be due to the fact that the K20 uses a 28 nm process, while the C2050 uses a 40 nm process. Because transistors in the K20 are smaller and have a lower critical charge, it is easier for a neutron to interact with multiple bit cells. Overall, the DBE rate was relatively low in both GPUs, 1% for the C2050 and 6% for the K20, and no triple-bit errors were identified. Thus SEC-DED ECC for SRAM protection is effective for current process node GPUs. However, as GPUs are implemented in 16 nm and smaller process nodes, the DBE rate and the occurrence of triple-bit errors will most likely rise if cell design stays constant. Even though SEC-DED ECC is sufficient protection for SRAMs on current GPUs, it will most likely not be sufficient protection in the future.
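Referring back to the cross-section definition above, the short sketch below shows how an accelerated measurement is typically projected to a terrestrial failure rate. The error count, fluence, and structure size are invented, and the sea-level flux of roughly 13 neutrons/cm2/h (for energies above 10 MeV) is the commonly cited reference value rather than a number from this study:

    #include <cstdio>

    int main()
    {
        // Accelerated beam measurement (all values invented for illustration).
        double errors       = 120.0;        // errors observed during the beam run
        double fluence      = 1.0e11;       // neutrons per cm^2 delivered by the beam
        double bits_exposed = 12.0e6 * 8.0; // bits in the structure under test

        // Cross-section: errors per unit fluence; dividing by bits gives cm^2 per bit.
        double xsec_per_bit = errors / (fluence * bits_exposed);

        // Project to the field by multiplying by the natural neutron flux and
        // expressing the result in FIT (failures per 10^9 device-hours).
        double field_flux  = 13.0;                               // n/cm^2/h at sea level (assumed)
        double fit_per_bit = xsec_per_bit * field_flux * 1.0e9;

        printf("cross-section = %.3e cm^2/bit, projected %.3e FIT/bit\n",
               xsec_per_bit, fit_per_bit);
        return 0;
    }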

The preceding experiment was only designed to test the sensitivity of writable SRAM structures. To evaluate the susceptibility of the entire architecture to transient faults, Tiwari et al. ran a set of benchmarks while exposing the GPU to the neutron beam. Two failure modes were observed: crashes, where the GPU program either failed or timed out, and incorrect output (SDC), where the result of the computation failed to match that of a precomputed golden output. Results showed a wide range of failure rates that ultimately depend on the resource utilization and AVF of a particular application, and the accelerated data supports the crash rates seen in the field data mentioned previously. Three interesting observations were made by the authors about these data:

(1) ECC reduces the SDC rate by up to 10×, but increases crashes because of ECC exceptions from benign DBEs that cannot be corrected. Because faults may not occur in ACE bits, ECC is fundamentally a conservative mechanism. It does not distinguish between benign DBE faults (false positives) and faults that will propagate to user-visible output. Therefore, even with higher protection from SDCs (the priority), crash rates can increase. This is not a bad thing. An order of magnitude decrease in SDC rate is a welcome (and perhaps a required) improvement for usability, even at the cost of an increased crash rate.

(2) Pipeline logic, queues, flip-flops, and schedulers are all unprotected. ECC is implemented only on caches and the register files in modern GPUs, leaving the rest of the architecture vulnerable to the same kinds of upsets. Tiwari et al. showed that single-bit SRAM faults account for up to ∼90% of all SDCs, meaning that on the order of 10% of SDCs, a nonnegligible amount, were due to faults in other structures.

(3) Compute-bound kernels such as dgemm have higher crash rates. Of the two matrix multiplication kernels (dgemm and MxM), dgemm, which is compute-bound, had a 2× higher crash rate than the memory-bound MxM. The authors hypothesize that this is because compute-bound applications have higher utilization (and therefore AVF) of unprotected structures such as the instruction scheduler and instruction queues, leading to a higher rate of control-flow errors and crashes.

While the preceding experiments show ECC is most likely good enough for protecting SRAM structures in current generation GPUs, it does not come without its drawbacks. Firstly, ECC is inherently eager in its assumption that an error will be user-visible, and therefore leads to false positives. Secondly, ECC is not able to cover pipeline logic or other combinational circuits. Thirdly, ECC is not always required; architects pay an expensive area and power cost for protection that may be rarely exercised or even needed. Researchers have attempted to find solutions to these drawbacks, identifying cheaper solutions that provide higher reliability for GPUs. In the following sections, we will examine research into both hardware enhancements and purely software techniques that improve reliability over current SEC-DED ECC methods.

3 Hardware Reliability Enhancements

The previous sections surveyed the current state of the art in evaluation of GPU reliability and showed that while GPUs themselves are not that much different from CPUs physically, their organization (e.g., the size of the register file, SIMD execution) makes addressing their reliability needs a little more complex. In the next few sections, we will discuss currently known techniques for enhancing GPU reliability using modifications to the hardware and new research on low-cost architectural and circuit techniques for improved protection.

3.1 Protecting SRAM Structures

We mentioned ECC in previous sections as a solution for protecting memory cells, such as a GPU’s cache and register file, from soft errors. Instead of rehashing how ECC works [1], we briefly give an overview of ECC functionality, types of ECC and their overhead and protection, and research on how to make error coding schemes cheaper and more effective. ECC is normally implemented as a SEC-DED code, meaning that values with single bit-flips can be recovered (i.e., the corrupted value can be restored to its original value and execution can continue), while double bit-flips can be at least detected (i.e., an SDC can be converted to a DUE) and cause an exception, usually leading to a program crash. SEC-DED ECC accomplishes this by using a unique code word based on the data and a parity check matrix shared among all code words. When multiplied together, the parity check matrix and a code word identify the location of a single flipped bit in the data. Once identified, correction is simple: flip the bit again! SEC-DED ECC can detect whether a DBE occurred, but it cannot identify the locations of the two bits, and therefore cannot correct the error. The size of the code word determines how much change can be detected and corrected. Double-error correct, triple-error detect (DEC-TED) ECC and any other granularity of ECC is implementable, but with a harsh penalty in the state required for each code word. For a 64-bit word, we need 8 check bits to implement SEC-DED ECC (a 12.5% overhead) and roughly 14 check bits to implement DEC-TED ECC (around a 22% overhead) [1]. Many other error correction coding techniques exist that are able to correct more bits, with varying performance and state trade-offs, but we omit discussion of these codes and refer the reader to Shubu Mukherjee’s book on architecture design for soft errors [1] for further study.
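The encode, syndrome, and correct-or-detect steps described above can be illustrated with a toy SEC-DED code. The C++ sketch below protects only a 4-bit value with a Hamming(7,4) code plus an overall parity bit so that the parity relations stay readable; it is not the code used in any real GPU, where 32- or 64-bit words and wider parity-check matrices are used instead:

    #include <cstdio>
    #include <cstdint>

    static int bit(uint8_t v, int i) { return (v >> i) & 1; }

    // Encode 4 data bits into an 8-bit code word.
    // Code bit i holds Hamming position i for i = 1..7; bit 0 is the overall parity.
    uint8_t secded_encode(uint8_t data)
    {
        int d0 = bit(data, 0), d1 = bit(data, 1), d2 = bit(data, 2), d3 = bit(data, 3);
        uint8_t cw = 0;
        cw |= (d0 << 3) | (d1 << 5) | (d2 << 6) | (d3 << 7);   // data at positions 3, 5, 6, 7
        cw |= (d0 ^ d1 ^ d3) << 1;                             // p1 covers positions 1, 3, 5, 7
        cw |= (d0 ^ d2 ^ d3) << 2;                             // p2 covers positions 2, 3, 6, 7
        cw |= (d1 ^ d2 ^ d3) << 4;                             // p4 covers positions 4, 5, 6, 7
        int overall = 0;
        for (int i = 1; i <= 7; ++i) overall ^= bit(cw, i);
        cw |= overall;                                         // overall parity in bit 0
        return cw;
    }

    // Decode: returns 0 = clean, 1 = corrected single-bit error, 2 = detected DUE.
    int secded_decode(uint8_t cw, uint8_t* data_out)
    {
        int s1 = bit(cw, 1) ^ bit(cw, 3) ^ bit(cw, 5) ^ bit(cw, 7);
        int s2 = bit(cw, 2) ^ bit(cw, 3) ^ bit(cw, 6) ^ bit(cw, 7);
        int s4 = bit(cw, 4) ^ bit(cw, 5) ^ bit(cw, 6) ^ bit(cw, 7);
        int syndrome = s1 | (s2 << 1) | (s4 << 2);             // points at the flipped position
        int overall = 0;
        for (int i = 0; i <= 7; ++i) overall ^= bit(cw, i);    // 0 if the word is intact

        int status = 0;
        if (syndrome != 0 && overall != 0) { cw ^= (1 << syndrome); status = 1; } // SEC
        else if (syndrome == 0 && overall != 0) { cw ^= 1; status = 1; }          // parity bit hit
        else if (syndrome != 0 && overall == 0) { status = 2; }                   // DED -> DUE

        *data_out = bit(cw, 3) | (bit(cw, 5) << 1) | (bit(cw, 6) << 2) | (bit(cw, 7) << 3);
        return status;
    }

    int main()
    {
        uint8_t cw = secded_encode(0xB);                // data = 1011
        cw ^= (1 << 6);                                 // flip one stored bit (simulated SEU)
        uint8_t data;
        int status = secded_decode(cw, &data);
        printf("status=%d data=0x%X\n", status, data);  // prints status=1 data=0xB
        return 0;
    }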

NVIDIA’s Fermi architecture was the first NVIDIA GPU microarchitecture to support SEC-DED ECC protection on its SRAMs. Although exact details of the implementation are unavailable, assuming conservatively that all ECC protection is at double-word granularity (64 bits) for all caches, scratchpads, and register files, this translates to an 11% overhead in area! NVIDIA’s Kepler architecture most likely did not drastically change the ECC organization for the register file, scratchpad, L1, or L2, but did include a clever optimization to reduce the need for ECC on its read-only data cache. Because the contents of an inclusive cache are duplicated at a higher level of the memory hierarchy, lower-level inclusive caches do not need ECC, and only need parity to detect whether an error has occurred. If an error is detected, the line in the inclusive cache can simply be invalidated and refetched. Unfortunately, because of the relaxed memory consistency model of GPUs, both the L1 cache designs and the scratchpads of current generations of AMD and NVIDIA GPUs are exclusive. However, the read-only data cache in Kepler is inclusive, and therefore requires only parity protection: a detected single-bit error is corrected by invalidating the line and refetching it [27].
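The parity-plus-invalidate idea can be sketched in a few lines of C++ (using the GCC/Clang __builtin_parityll intrinsic). The structure and the fetch_from_next_level stand-in below are hypothetical and exist only to show why detection alone suffices when a clean copy exists elsewhere:

    #include <cstdint>
    #include <cstdio>

    struct CacheLine {
        uint64_t data;
        uint8_t  parity;   // even parity over the data bits
        bool     valid;
    };

    static uint8_t parity64(uint64_t v) { return (uint8_t)__builtin_parityll(v); }

    // Stand-in for the next cache level, which always holds a clean copy
    // because the protected cache is read-only/inclusive.
    static uint64_t fetch_from_next_level(uint64_t addr) { return addr * 2654435761u; }

    uint64_t read_line(CacheLine& line, uint64_t addr)
    {
        if (!line.valid || parity64(line.data) != line.parity) {
            // Parity mismatch: the local copy is corrupt but never dirty,
            // so "correction" is simply an invalidate-and-refetch.
            line.data   = fetch_from_next_level(addr);
            line.parity = parity64(line.data);
            line.valid  = true;
        }
        return line.data;
    }

    int main()
    {
        CacheLine line = { fetch_from_next_level(0x40), 0, true };
        line.parity = parity64(line.data);
        line.data  ^= (1ull << 17);     // simulated SEU in the stored line
        printf("read returns 0x%llx\n", (unsigned long long)read_line(line, 0x40));
        return 0;
    }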

Tan et al. [28] observed via AVF analysis that the register files and instruction buffers of GPUs carry the most architectural state and are essentially reliability “hot-spots” in the GPU. To avoid the high cost associated with adding ECC protection to a GPU’s large register files, Tan proposed using the GPU’s often underutilized scratchpad memory to store ECCs for the register file. Dubbed SHARP (SHAred memory to Register Protection), the technique adds hardware structures for ECC generation and recording and a checker structure to identify possible errors in the register file. SHARP reduces the normalized AVF of the register file by an average of 41%, supplying protection similar to ECC with an estimated 18% reduction in power. To avoid the cost of adding ECC protection to the GPU’s instruction buffers, Tan observed that many instructions in a GPU’s streaming architecture share the same PC and can therefore share ECCs. Tan proposed a technique called SAWP (Similarity-AWare Protection) to take advantage of this phenomenon. SAWP adds a small ECC table to the hardware as well as an ECC generator, a checker, and hardware to support identifying different SIMD instructions that share the same PC. SAWP reduced the AVF of the instruction buffer by 68% on average, and has an estimated 17% reduction in power over a full SEC-DED ECC implementation.

3.2 Protecting Pipeline Structures

ECC has proved to be an extremely powerful, successful protection mechanism and has become a necessary requirement for any architecture expecting to satisfy consumers demanding reliable computation. However, ECC suffers from two main drawbacks: it is limited in the number of errors it can detect and correct, and it only protects SRAM structures on-chip and does not protect pipeline logic or other processor structures. In this section we will discuss new research on proposed hardware mechanisms that enhance the reliability of nonstorage logic in GPUs.

Sheaffer et al. [29] proposed some purely architectural-level techniques to address the inherent unreliability of shader arithmetic logic units (ALUs) and raster operations units in early GPGPU pipelines. By replicating ALUs and placing a comparator on the critical path, they built in redundant shader computation with more than 2× area overhead on the many shaders in a GPU. To protect the raster hardware, fragments are issued to the raster pipeline twice, essentially hardware-assisted redundant instructions. To support reissue of incorrect computation, Sheaffer et al. add a domain buffer to the data path to store inputs. If an error is detected in the raster operations unit, the fragments are reissued into the pipeline. Redundant hardware is expensive, though, and must be paid for by casual users such as gamers who do not demand the strictest reliability. Sheaffer et al. reported a ∼50% performance degradation because of reissued operations, but noted that this was much less than the 100% overhead expected from naive redundant execution. This was due to increased memory locality and thus more ideal cache behavior for the second of the two redundant instructions.

3.3 Taking Advantage of Underutilization

Because most modern GPUs use an SIMD model for computation, GPUs suffer from the same classic disadvantages of SIMD processors. One particular inefficiency that arises with lock-stepped execution of threads in SIMD architectures is that the SIMD units are hard to utilize to their full potential, and are often underutilized if the program has irregular parallelism. Software may have dynamically changing needs for parallelism, but the GPU’s SIMD units, load/store units, and special functional units (SFUs) are obviously fixed resources.

Underutilization of GPU SM resources in NVIDIA GPUs happens for two main reasons: (1) underutilization within SIMD units because of mismatches between thread-block size and SIMD width, thread divergence, memory stalls, and so on, and (2) underutilization of entire SIMD units, load/store units, or SFUs because of a lack of ready instructions from the instruction scheduler. Jeon et al. [30] attempt to take advantage of both of these sources of underutilization to improve GPU reliability by opportunistically executing redundant computation in idle SIMD lanes or execution units. Dubbed Warped-DMR (DMR stands for dual-modular redundancy), the technique offers two main operating modes: intrawarp and interwarp DMR. Intrawarp DMR takes advantage of underutilization within SIMD instructions or “warps.” If any of the threads in a warp are inactive (i.e., masked from execution because of SIMD divergence), intrawarp DMR uses the inactive threads (empty SIMD lanes) to verify the execution of the active threads. If all of the threads in a warp are active, and intrawarp DMR cannot execute, interwarp DMR reexecutes the SIMD instruction in an idle execution unit. Warped-DMR cannot be done efficiently in pure software; therefore, Jeon et al. introduce architectural support to facilitate efficient redundant execution (a purely software flavor of the intrawarp idea is sketched after this paragraph for illustration). For intrawarp DMR, registers from active threads must be forwarded to the inactive threads doing redundant computation, thus a register forwarding unit is added to the architecture, which funnels correct data to redundant lanes. Furthermore, comparators are included to compare the results of redundant execution. Interwarp DMR requires the addition of a replay queue to buffer executed instructions for reexecution in idle execution units. In their work, Jeon et al. [30] set the size of the replay queue to be 5 KB for each SM; while this is not extremely large, it is not an insignificant addition in area and power to the GPU SM architecture. A replay checker also needs to be added to the SM to coordinate redundant execution and validation of correct execution. Error coverage is reported as the percentage of instructions that execute redundantly. Warped-DMR has 96% instruction coverage, with only 16% performance overhead (mainly from interwarp DMR) on average. The lower the utilization of the SIMD units, the lower the overhead of Warped-DMR.
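As noted above, Warped-DMR itself relies on hardware support to steer work into lanes that are idle because of divergence. The CUDA C++ sketch below conveys only the flavor of intrawarp checking in pure software: lanes are paired with warp shuffles, each lane redundantly recomputes its partner’s result, and a mismatch raises a detection flag. Every lane here does double duty (roughly 2× compute overhead), the kernel assumes n is a multiple of the warp size so every partner lane is active, and none of the names below come from Jeon et al.’s design:

    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ float work(float x) { return x * x + 1.0f; }

    __global__ void checked_kernel(const float* in, float* out, int* error_flag, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        unsigned mask = __activemask();
        float my_in   = in[i];
        float mine    = work(my_in);

        // Exchange inputs with the XOR-partner lane, recompute the partner's
        // result redundantly, and ship that result back for comparison.
        float partner_in     = __shfl_xor_sync(mask, my_in, 1);
        float partner_redund = work(partner_in);
        float check          = __shfl_xor_sync(mask, partner_redund, 1);

        if (check != mine) atomicExch(error_flag, 1);  // mismatch: flag a detected error (DUE)
        out[i] = mine;
    }

    int main()
    {
        const int n = 256;
        float h_in[n], h_out[n];
        for (int i = 0; i < n; ++i) h_in[i] = (float)i;

        float *d_in, *d_out; int *d_err;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMalloc(&d_err, sizeof(int));
        cudaMemset(d_err, 0, sizeof(int));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        checked_kernel<<<n / 128, 128>>>(d_in, d_out, d_err, n);

        int h_err = 0;
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaMemcpy(&h_err, d_err, sizeof(int), cudaMemcpyDeviceToHost);
        printf("error detected: %s\n", h_err ? "yes" : "no");

        cudaFree(d_in); cudaFree(d_out); cudaFree(d_err);
        return 0;
    }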

Tan et al. [31] propose a similar approach to Jeon et al. called RISE (Recycling the streaming processors Idle time for Soft-Error detection). Full-RISE builds on the intuition that warps often stall to wait on long-latency operations such as memory accesses. Request pending aware Full-RISE (RP-FRISE) takes advantage of this latency by predicting long-latency operations. Once predicted, RP-FRISE inserts extra redundant computation after the long-latency operation. Full-RISE requires additions to the GPU architecture to support redundant execution, including a storage array and an ALU for keeping track of and calculating the latency of different memory requests. Once this latency is calculated, it is sent to the SM along with the requested data, and an appropriate number of redundant instructions are executed. Tan et al. take this methodology one step further and propose idle-SM aware Full-RISE (IS-FRISE). IS-FRISE looks at imbalances in how SMs on a GPU are processing thread-blocks. If an SM is known to be underutilized (because an SM can handle more thread-blocks than were scheduled to it), the work of a thread-block in that SM is duplicated. The hardware overhead required to support RP-FRISE and IS-FRISE is small, calculated to be only 3% additional area in each memory controller and 1% extra area in each SM.

Tan et al. also propose Partial-RISE, which takes advantage of empty SIMD lanes due to thread divergence to execute redundant computation with little or no extra cost. In contrast to intrawarp DMR, which replicates computation within a warp, Partial-RISE works by looking for SIMD instructions that diverge and then picking a ready SIMD instruction with the same PC to execute. Threads from the ready instruction are replicated into the empty lanes of the divergent warp, replicating their computation. Once the first warp is executed, results from the redundant threads are stored in a buffer and compared to the results of the second warp once it finishes its execution. To support Partial-RISE, comparators for comparing the active masks of two warps with the same PC, as well as buffers to hold temporary results, must be added to the SM pipeline. This extra hardware leads to a 1% area overhead per SM. Both performance degradation and AVF were calculated for the Full-RISE and Partial-RISE techniques. When combined, the Full-RISE and Partial-RISE techniques cause a minuscule 4% performance degradation because of extra computation, and reduce the AVF of a particular SIMD processor by 43%.

Argus-G [32] is an extension of Argus [33], a hardware error-detection scheme developed for CPUs, retargeted specifically for GPUs. Argus makes the observation that von Neumann architectures perform four basic operations: (1) control flow, (2) computation, (3) data flow, and (4) memory. If these four operations can be checked, Argus can detect all errors in the core. Argus-G does this by extending an SM architecture with control flow, data flow, and computation checkers. Data flow checking is accomplished by comparing the dynamic data flow graph of each thread to a precomputed static data flow graph; if the graphs differ, an error occurred in the monitored thread's execution. This analysis implicitly checks control flow within basic blocks, but does not account for control flow between basic blocks. Control flow between basic blocks is checked by calculating both a data flow and a control flow signature. A check of the signature for each legal predecessor block is then included in the code of each basic block just before the branch or exit instruction; if the signature does not match, a control flow error was encountered. Computation is checked using the same method as in Argus, by simply replicating functional units in hardware, although each checker is instantiated many more times to account for the wider SIMD functional units in GPU hardware. (More efficient checkers can be implemented, and this remains an open and active area of research.) Both Argus and Argus-G claim 100% coverage over the architecture and can provably catch all transient and permanent errors in a core; however, in later experiments, the same authors reported that 7% of injected faults ended up as SDCs and that Argus-G was able to detect only 66% of the injected faults that would have become SDCs without it. Performance of GPU applications utilizing Argus-G was mainly impacted by the insertion of signature-checking code at the end of basic blocks. For some applications, this overhead was not insignificant (close to 20%), but for most, overhead was negligible; on average, runtime increased by only 4% over all tested benchmarks. Applications with smaller basic blocks have higher performance penalties because of an increased ratio of signature code to useful code.
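To make the signature scheme concrete, the following is a minimal, hand-written CUDA sketch of basic-block signature checking for a toy kernel with a single if/else; the signature constants, the err_flag output, and the kernel itself are assumptions for illustration. Argus-G performs the equivalent checks in dedicated hardware rather than in inserted code, and real schemes must also handle blocks with multiple legal predecessors.

```cuda
// Illustrative software analogue of basic-block signature checking. Each
// basic block has a static signature; a running signature G is updated on
// entry with the XOR difference from the (single) legal predecessor and then
// compared against the block's own signature. Blocks with multiple legal
// predecessors need a run-time adjusting signature, omitted here for brevity.
__global__ void abs_double_with_cfc(const int* in, int* out, int n, int* err_flag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned G = 0xAu;                          // entry block A: signature 0xA
    int v = in[i];

    if (v >= 0) {
        G ^= (0xAu ^ 0xBu);                     // block B: legal predecessor is A
        if (G != 0xBu) atomicExch(err_flag, 1); // signature mismatch: illegal transfer
        v = v * 2;
    } else {
        G ^= (0xAu ^ 0xCu);                     // block C: legal predecessor is A
        if (G != 0xCu) atomicExch(err_flag, 1);
        v = -v * 2;
    }
    out[i] = v;
}
```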

All of these techniques provide fairly impressive improvements in reliability with relatively low overheads. It is reasonable to think that some customers would sacrifice 10% performance on average for large improvements in reliability, especially as data-center and HPC GPU use increases. However, in practical terms, most if not all of these research efforts are unlikely to be implemented in mainstream GPUs. The economics of GPU organization and design are driven mainly by consumers who do not require reliable computation: gamers. In the heated competition between AMD and NVIDIA for the gaming and professional graphics market, there is little to no room to spend engineering dollars on ultra-reliable hardware. Having said that, GPUs are being used in supercomputers! This motivates the need for configurable software reliability solutions that can be turned on and paid for by those who desire them, but do not increase the cost of GPU design and manufacture. The next section deals exclusively with software reliability solutions and the attempt by researchers to identify cheap and practical alternatives to hardware reliability solutions.

4 Software Reliability Enhancements

Perhaps the most obvious method for providing cheap, hardware-agnostic GPU reliability is to run a GPU application twice. If the outputs of the two runs differ, then an error must have occurred somewhere in the computation of one of the program runs, and users know that they must run their application again. If the outputs of the two program runs agree, an error most likely did not occur, and a single copy can be committed as program output. As previously mentioned, this technique is referred to as dual-modular redundancy or DMR. DMR can detect nearly 100% of errors, but note that DMR cannot correct errors. If an error is detected via DMR, the application in question must be reexecuted from a recoverable checkpoint, as we have no evidence of which thread encountered the error! For DMR techniques, any structures that are used in redundant execution are said to be within the Sphere of Replication or SoR, and are generally assumed to have 100% coverage. This is not entirely true, as a permanent fault may cause the same error to manifest in both redundant threads (called the hidden error problem), but as discussed previously, we are focused on the detection and mitigation of soft faults and assume the hidden error problem for transient faults is statistically negligible. Later on we will see how the hidden error problem can emerge when considering soft faults in SIMD architectures. Values that enter the SoR must be replicated, called “input replication,” and redundant value pairs that leave the SoR must be shared and compared before a single value is allowed to exit, called “output comparison.”
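As a point of reference, the following is a minimal sketch of this run-twice-and-compare approach at the application level, assuming a hypothetical kernel compute and an output of n floats; a real deployment would add error handling and checkpoint/rollback on a mismatch.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical kernel; any deterministic GPU computation works here.
__global__ void compute(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f + 1.0f;     // placeholder computation
}

// Run the kernel twice into separate buffers and compare on the host.
// A mismatch only tells us *that* an error occurred, not where, so the
// caller must re-execute from a known-good checkpoint.
bool run_with_dmr(const float* h_in, float* h_out, int n) {
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out0, *d_out1;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out0, bytes);
    cudaMalloc(&d_out1, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    compute<<<blocks, threads>>>(d_in, d_out0, n);   // original run
    compute<<<blocks, threads>>>(d_in, d_out1, n);   // redundant run

    // Output comparison: copy both results back and compare bit-for-bit.
    float* h_shadow = new float[n];
    cudaMemcpy(h_out, d_out0, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_shadow, d_out1, bytes, cudaMemcpyDeviceToHost);
    bool ok = (std::memcmp(h_out, h_shadow, bytes) == 0);

    delete[] h_shadow;
    cudaFree(d_in); cudaFree(d_out0); cudaFree(d_out1);
    return ok;   // false: detected error, rerun from a checkpoint
}
```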

While DMR provides nearly 100% coverage of all faults on every structure in the SoR, it is expensive. Naively, full redundant execution requires either twice as much time for computation or twice as much hardware. Furthermore, the sharing and comparison of program output between redundant threads can be expensive, adding to this assumed 2× overhead. Even with this high cost, program reexecution is used as a practical, common-sense approach for verifying program runs with high-reliability needs. A 2× or even 3× slowdown is expensive, but may be a quick-and-dirty option for scientists or engineers who must guarantee correctness but do not have access to other hardware or software options. This high overhead can generally be improved upon by executing the redundant computation at the same time, amortizing the cost of data and instruction fetch over two computations. In the next section, we will look at research techniques for software reliability enhancements on CPUs and how these concepts have been adapted for GPU kernels and architectures. In the section after that, we will look at attempts to sacrifice the high coverage provided by software replication in exchange for performance improvements by using opportunistic and probabilistic protection.

4.1 Redundant Multithreading

Redundant multithreading (RMT) is a DMR concept that replicates computation at the granularity of a software context. Two thread contexts are launched and execute redundantly until a result must be committed outside of the SoR, for example, on a store to main memory, or an I/O event. When a result needs to be committed, the threads share this value, compare the two copies, and commit the result outside the SoR if they both agree on the value, called “output comparison.” Hardware extensions to support efficient RMT have been evaluated on CPUs for simultaneous multithreaded (SMT) cores [34, 35], identifying fairly reasonable average overheads of 32% [34] and 10–30% [35], with nearly 100% fault detection coverage. However, both techniques add special hardware structures to allow the architecture to efficiently coordinate redundant execution and output comparison.

Wang et al. [36] were the first to automate pure-software redundant multithreading in a compiler and conduct experiments on real hardware. They showed an impressive 19% overhead for redundant execution in simulation, but included the benefits of simulated low-latency intercore communication queues that do not exist in real hardware. When implemented on real SMT architectures using software-only communication queues, their redundant multithreading technique saw a ∼500% increase in runtimes on average. Wang et al. found that frequent communication of values between threads for output comparison was the main bottleneck. Because redundant threads executed on separate cores, they had to use the cache hierarchy to communicate values, often passing values through a shared last-level cache. If SMT threads are instead used as redundant thread pairs, values can be passed via the L1 cache, but performance of both threads suffers because of contention for shared intracore resources such as execution units, the L1 cache, and memory bandwidth. This is an interesting result and lesson: redundant threads that execute in separate cores are generally bottlenecked by communication latency, whereas redundant threads that execute in the same core are generally bottlenecked by computation resources designed for a single thread. Unlike CPUs, GPUs have a much different design and execution philosophy (and, as we have seen in previous sections, their SIMD units are often underutilized), and thus may allow for much more efficient software RMT.

Dimitrov et al. [37] were the first to evaluate the performance impact of software redundancy on GPUs. They investigated three ways of implementing software redundancy: (1) R-Naive, where a GPU kernel, including data copies to and from the host CPU, is simply executed twice; (2) R-Scatter, where unused instruction-level parallelism in under-packed VLIW instructions is used for redundant instructions; and (3) R-Thread, where CUDA thread blocks are duplicated in a single kernel call. R-Naive works by first allocating twice as much memory in the GPU's main memory and duplicating the memory copies from the host CPU's main memory to the GPU. This places the GPU's main memory inside the SoR at the cost of twice as much space. That cost is acceptable as long as the original kernel uses no more than half of the GPU's main memory, but may be an untenable requirement for kernels with high memory utilization. Dimitrov et al. reported that R-Naive increased runtimes by an average (with low variance) of 99%, essentially the same cost as running the kernels serially. This indicates that coarse-grained software replication may be unable to take advantage of fine-grained architectural underutilization, motivating finer-grained replication.

R-Scatter is more complex and relies on underutilization in the VLIW-based architectures of older Array Technologies Inc. (ATI) and AMD graphics cards. Dimitrov et al. analyzed this technique on an RV670, a 5-way VLIW GPU. R-Scatter works by duplicating variable declarations, definitions, and all computations in the original code, essentially replicating every line of code in the original program with an identical, redundant version. In this way, there are always twice as many independent instructions available to be scheduled into any VLIW instruction. Just as in R-Naive, memory allocations and data transfers are duplicated (input replication), and redundant threads are guided to read from and write to these redundant structures. Dimitrov et al. reported that R-Scatter increased runtimes by an average of 93%, indicating that there was perhaps not much underutilization in the VLIW schedule to begin with.
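The flavor of this transformation is sketched below as a CUDA kernel rather than the original ATI VLIW code; the kernel, the duplicated "_r" input arrays, and the err_flag output are illustrative assumptions, not Dimitrov et al.'s actual implementation.

```cuda
// Illustrative R-Scatter-style duplication: every definition and computation
// is replicated into a shadow ("_r") version that reads from a duplicated
// copy of the input (input replication); the two results are compared before
// the store leaves the sphere of replication (output comparison).
__global__ void saxpy_rscatter(const float* x,   const float* y,
                               const float* x_r, const float* y_r,
                               float* out, float a, int n, int* err_flag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v   = a * x[i]   + y[i];      // original computation
    float v_r = a * x_r[i] + y_r[i];    // redundant computation

    if (v != v_r) atomicExch(err_flag, 1);  // symptom of a fault somewhere upstream
    else          out[i] = v;               // commit only the agreed value
}
```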

R-Thread works by launching a single GPU kernel call, but with twice as many thread-blocks. Just as in R-Scatter and R-Naive, memory allocations and copies are duplicated, and the redundant threads in the redundant thread-blocks are guided to do their computation using these separate values. R-Thread was reported to increase runtimes by a disappointing 97%, again indicating that thread-blocks were too coarse-grained to take advantage of unused thread-level parallelism, if it existed at all in the benchmarks on that architecture.
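A rough CUDA analogue of this block-level replication might look like the following, where the grid is launched with twice as many thread-blocks and the upper half of the grid repeats the work of the lower half into a shadow output buffer that the host later compares; the kernel and buffer names are assumptions for illustration.

```cuda
// Illustrative R-Thread-style replication: the second half of the grid redoes
// the work of the first half into a shadow buffer. The host (not shown) then
// compares out and out_shadow before trusting the result.
__global__ void compute_rthread(const float* in, const float* in_shadow,
                                float* out, float* out_shadow, int n) {
    int half_blocks = gridDim.x / 2;                // grid was launched 2x as large
    bool redundant  = (blockIdx.x >= half_blocks);  // upper half repeats the work
    int  logical_block = redundant ? blockIdx.x - half_blocks : blockIdx.x;

    int i = logical_block * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float* src = redundant ? in_shadow : in;  // duplicated input (input replication)
    float*       dst = redundant ? out_shadow : out;
    dst[i] = src[i] * src[i];                       // placeholder computation
}

// Host side (sketch): compute_rthread<<<2 * blocks, threads>>>(d_in, d_in_shadow,
//                                                              d_out, d_out_shadow, n);
```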

Based on this study alone, it seems that GPUs respond poorly to pure software redundancy, but there still may be room for replication at a finer granularity. R-Thread replicated computation at the thread-block level, and thus was unable to take advantage of SIMD-level underutilization. This study indicated that GPU SMs are usually highly utilized, and so thread-block-level redundancy may not be as cheap as researchers had hoped. However, the hardware reliability enhancements presented in the previous section indicate that SIMD underutilization is ripe for the taking, and thus GPU RMT might still have a place in practical reliability solutions if it is properly tuned.

Another issue with Dimitrov's kernel-level replication is the duplication of memory allocations and copies. Modern-day applications most likely will not be able to tolerate a 2× memory overhead (both 2× memory capacity and 2× memory bandwidth), especially over the PCIe bus. ECC is also already fairly standard on server-class GPU DRAM and on-chip caches, so increasing the strength of this ECC may be a better approach than duplicating allocations and data transfers. Further complicating matters is the danger in allowing any errors to exist in memory, even if they are backed up by redundant state. Extreme care must be taken to ensure that a fault cannot propagate to other accelerators, or to storage via paging or DMAs, before being checked. In many cases, this is impossible to guarantee, motivating a protection domain and output comparison that stay on-chip. Wang et al. attempted this method, requiring comparisons before commit for every shared memory operation, but found that intercore communication for output comparison was a large bottleneck.

Wadden et al. [38] reexamined GPU RMT on modern SIMD GPU architectures. Dimitrov et al. attempted to exploit fine-grained parallelism in VLIW architectures; however, most modern GPU designs (at least AMD and NVIDIA architectures) have settled on a combined SIMD/vector model instead of VLIW, and thus Wadden et al. focused exclusively on this type of execution model. Wadden et al. took an approach similar to that of Wang et al. by implementing GPU RMT directly in AMD's OpenCL compiler stack, thus maintaining transparency to the software developer while also excluding the memory hierarchy from the SoR. They explored two granularities of GPU thread replication: (1) intergroup RMT, similar to Dimitrov's R-Thread, where kernels are launched with twice as many thread-blocks (work-groups), and (2) intragroup RMT, where each OpenCL work-group is inflated so that twice as many single-instruction, multiple-thread (SIMT) threads execute per work-group. While these two techniques may seem very similar, they have extremely different overheads and different fault-detection coverage. Both techniques and their differences in protection are described in the remainder of the section.

Intragroup RMT works by first modifying the host program to double the size of each work-group being launched in the kernel. Once this is accomplished, the kernel code intermediate representation (IR) is transformed during compilation to guide threads in adjacent SIMD lanes to do the same computation by forcing them to query the same thread IDs. Because the only thing that distinguishes computation between threads is their thread IDs, two threads that read the same IDs will have identical behavior, albeit while using different registers! When a redundant thread pair encounters a store operation, and a single correct value must be committed outside the SoR to the memory hierarchy, one thread passes its value to the other thread via a communication buffer allocated in the GPU's scratchpad memory. The receiving thread then compares its private value to the passed value and executes the store instruction only if the values agree. In this way, all replicated structures that are used to execute these two threads are within the SoR; however, structures that are shared between threads in an SIMD instruction, such as the L1 cache, instruction fetch and decode logic, and the scalar instruction unit, are not protected at all by this technique. This leads to a first interesting observation.

(1) Intragroup RMT does not protect against faults in structures shared by threads in one SIMD instruction. This is an example of the hidden error problem. Hidden errors are usually defined as faults encountered by both redundant CPU threads, such as redundant threads using the same faulty ALU. In this context, hidden errors arise because of the SIMD programming model. A single fault in an SIMD instruction in an SM’s instruction queue will cause all threads in that instruction to misbehave in the same way! Therefore the adjacent redundant threads in one SIMD instruction will not be able to detect the fault. As previously mentioned, instruction and data caches, the GPU’s scalar support unit, branch unit, instruction fetch, decode, and scheduling logic are all left unprotected. However, the large register files and SIMD ALUs are protected.
Because modern GPU SMs include managed scratchpad memories that are not part of the global memory hierarchy, this structure can be either included in (covered by) or excluded from (not covered by) the SoR. If the scratchpad memory is included, its allocations and memory traffic must be duplicated. If the application cannot tolerate a doubling of its scratchpad usage, the scratchpad memory can be excluded from the SoR. In this case, the scratchpad must be treated just like main memory; every time a redundant thread pair wants to store a value to the scratchpad, their values must first be shared and compared via an interthread communication buffer in the scratchpad, resulting in an increase in scratchpad traffic. The results reveal a few more interesting conclusions.

(2) Overheads of DMR are either high (∼100%) or low (0–10%) depending on how memory-bound the application is. Because memory instructions are not replicated, redundant computation and communication of redundant thread pairs can be hidden behind these long latency instructions, especially if the kernel is bottlenecked by memory operations. This is a very interesting result, as anecdotally, many real-world HPC benchmarks are memory-bound, and so this type of fine-grained RMT may be a viable solution for fault coverage in real systems. Applications that were already compute-bound had predictable ∼2× increases in runtime when run with twice as many threads.

(3) Interthread communication via the scratchpad is very expensive. While the latency of communication traffic through the scratchpad can be hidden behind long-latency stores to main memory, it cannot be hidden behind stores to the scratchpad itself, which have the same latency as the communication traffic. Thus excluding the scratchpad from the SoR, and forcing communication through a buffer in that same structure, has a large impact on performance if scratchpad traffic was high in the original kernel.

(4) Doubling the size of the work-groups is expensive when capacity (registers, scratchpad) is a scarce resource. This is a specific example of a broader lesson: if you have to double usage of a resource that bottlenecks your application, expect a ∼100% (or greater, because of overheads!) increase in runtime.

Because intragroup RMT guarantees that redundant thread pairs execute in lockstep in adjacent SIMD lanes, Wadden et al. also experimented with using register swizzle operations to accomplish interthread communication without the need for explicit scratchpad buffers. This optimization had a large impact on performance when interthread communication was expensive, decreasing communication overhead to nearly zero and decreasing kernel runtime overheads by as much as 90%.
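The sketch below is a hand-written CUDA analogue of intragroup RMT with swizzle-based output comparison; Wadden et al. generate this transformation automatically inside AMD's OpenCL compiler, so the kernel, names, and launch assumptions here are purely illustrative.

```cuda
// Hand-written analogue of intragroup RMT with swizzle-based output comparison.
// The work-group is launched with twice as many threads; adjacent lanes form a
// redundant pair by deriving the same logical thread ID. Assumes blockDim.x is
// a multiple of 32 (e.g., 256) so that full warps execute the shuffle.
__global__ void scale_intragroup_rmt(const float* in, float* out,
                                     float a, int n, int* err_flag) {
    int  physical = blockIdx.x * blockDim.x + threadIdx.x;
    int  logical  = physical / 2;               // both lanes of a pair read this ID
    bool master   = (physical & 1) == 0;        // even lane commits the store
    bool live     = (logical < n);

    float v = live ? a * in[logical] : 0.0f;    // identical work in both lanes

    // Output comparison via a register swizzle rather than a scratchpad buffer:
    // exchange values between the two lanes of the pair and compare.
    float partner = __shfl_xor_sync(0xffffffffu, v, 1);

    if (live) {
        if (v != partner) atomicExch(err_flag, 1);  // pair disagrees: fault detected
        else if (master)  out[logical] = v;         // single agreed copy leaves the SoR
    }
}

// Host side (sketch): launch with 2*n threads in total, e.g.,
// scale_intragroup_rmt<<<(2 * n + 255) / 256, 256>>>(d_in, d_out, a, n, d_err);
```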

Intergroup RMT works similarly to Dimitrov's R-Thread thread-block-level replication of computation. Instead of increasing the size of a kernel's work-groups, intergroup RMT doubles the number of work-groups. Redundant computation is therefore accomplished by a redundant work-group pair, and redundant thread pairs are in separate work-groups, possibly executing on entirely separate SMs. Unlike intragroup RMT threads, which are forced to execute in lockstep in adjacent lanes of an SIMD instruction, intergroup threads execute in different SIMD instructions, and therefore require explicit synchronization. Furthermore, because threads in different work-groups cannot be guaranteed to execute on the same SM, they cannot communicate through an SM's scratchpad memory and must communicate via a buffer in main memory. Intergroup RMT has higher coverage than intragroup RMT because threads executing in different work-groups use completely redundant instruction streams; therefore the instruction fetch, decode, and scheduling logic, as well as the branch and scalar units of the AMD GPUs, are protected by this technique. Unsurprisingly, intergroup RMT greatly increases kernel runtimes. The work presented these interesting conclusions.

1 Memory-bound applications are bottlenecked by the extra memory traffic needed to synchronize and communicate redundant values. Some applications experience devastating slowdowns of 3.4× to 9.5×, similar in scale to those experienced by Wang et al. for CPU threads. As concluded by Wang et al., this motivates high-speed communication buffers between SMs, but perhaps indicates finer-grained RMT is a better target for practical software-based RMT on GPUs.

2 SM underutilization can allow for low-cost execution of redundant work-groups, but underutilization at this granularity is most likely not a realistic scenario. GPUs gain their advantage on extremely large, embarrassingly parallel problems; thus a GPU is rarely starved for work-groups waiting to execute. Some applications with dynamic parallelism, like a reduction, may experience SM underutilization in later stages of computation, but this is not the dominant use of GPUs.

3 Compute-bound applications see the predicted ∼2× slowdowns. If the memory system is underutilized, it can absorb the extra traffic introduced by synchronization and communication code, and kernels run with twice as many work-groups experience the expected ∼2× slowdowns.

4.2 Symptom-Based Fault Detection and Dynamic Reliability Management

As we saw in the previous sections, hardware protection provides high error detection coverage, but requires (most likely) economically infeasible additions to the hardware. Software-redundant multithreading can be accomplished cheaply, but can be accompanied by extremely high performance overhead, depending on the level of protection required and the dynamic behavior of the application. To address these issues, researchers have looked toward probabilistic protection, sacrificing identification of some percentage of SDCs within a protection domain for a decrease in overhead. In this section we look at a few techniques that attempt to implement this trade-off.

Yim et al. propose Hauberk [21], an error-detection framework that selectively protects sections of code with software checkers. Instead of using DMR and duplicating every instruction within a vulnerable region, Hauberk selectively places different types of software checkers that look for symptoms that a transient fault has affected program output. This is a subtle but important difference in reliability philosophy between symptom-based detection and DMR. Redundancy checks for faults via direct comparison of values within an SoR, while symptom-based techniques such as Hauberk check for symptoms of faults in a possibly nonredundant domain using checkers. Full redundancy offers stronger reliability guarantees, as it can detect 100% of faults within the SoR, but it also raises alarms for faults that would never have manifested as SDCs or DUEs in program output (false positives). Symptom-based detection ignores most false positives by looking for evidence of errors, but may let through some SDCs that would have been caught by full replication. Hauberk attempts to optimize this trade-off by using relatively expensive checksum-based live-range value checkers to protect important code that is executed infrequently, and low-overhead, accumulator-based software checkers to selectively protect values in frequently executed loop bodies in GPU kernel code.

Yim et al. first do an in-depth analysis of failures due to faults in GPU HPC codes. Using the SWIFI fault-injection tool discussed previously, Yim et al. describe two important observations that affect the design of their reliability scheme:

1 Because integer and pointer data are more likely to be used for important data values and program control, they are also more likely to cause user-visible errors or program crashes. However, floating point data are just as likely to cause SDCs (as opposed to crashes) as integer data. Therefore control data in loop bodies is most likely less important to protect because it is more likely to cause a user-visible crash rather than the more insidious SDC.

2 87% of kernel execution time was spent in loop code (five out of seven applications spent > 98% of their time in looping code!). Therefore a small added overhead to a loop may have a large impact on overall performance, and software checkers in loops should be as lightweight as possible.

Using these observations, Hauberk first profiles a GPU kernel to determine which lines of code have the highest impact on performance, placing strong error detection around nonlooping code and lightweight checking code inside loop bodies. Strong detection for nonlooping code involves duplicating the definitions (assignments) of all virtual variables. Once this computation is duplicated (and checked to make sure that the duplicated variables match in value), the value is XORed with a checksum variable shared among all duplicated variables. At the last use of the variable (the end of its live range), the variable is XORed with the checksum again. Therefore, if every defined value is XORed into the checksum twice, the checksum must be zero at the end of the kernel; if it is not, the kernel has experienced a fault and an error can be reported. For lightweight loop-based detectors, Hauberk uses accumulators inserted into loop bodies that accumulate the value of a variable. After the execution of the loop, Hauberk compares the accumulator values against values gathered from profiling runs. Dubbed “value-accumulation-based range checking,” this check simply tests whether an accumulator value is within a reasonable range of the stored profile value, based on how many iterations were executed. If the value is suspiciously far from the profiled results, it is flagged as a potential error.
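A compressed sketch of both checker styles in a single toy CUDA kernel is shown below; the duplicated definition, the XOR checksum, the loop accumulator, and the range threshold are illustrative assumptions (a real implementation like Hauberk inserts these checks at the IR level so the compiler cannot optimize the duplicates away).

```cuda
// Compressed sketch of Hauberk-style checking. Non-loop values get a duplicated
// definition folded into an XOR checksum at definition and again at last use;
// the hot loop gets only a cheap accumulator compared against a profiled value.
__global__ void kernel_hauberk_style(const float* in, float* out, int n,
                                     int* err_flag, float profiled_avg) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned checksum = 0u;

    // Strong protection for infrequently executed (non-loop) code.
    int base   = i * 4;                           // original definition
    int base_d = i * 4;                           // duplicated definition
    if (base != base_d) atomicExch(err_flag, 1);  // duplicates must agree
    checksum ^= (unsigned)base;                   // XOR in at definition

    // Lightweight accumulator detector inside the hot loop.
    float sum = 0.0f;
    float acc = 0.0f;
    for (int k = 0; k < 4; ++k) {
        float v = in[(base + k) % n];
        sum += v;                                 // useful work
        acc += v;                                 // detector: accumulate the live value
    }
    out[i] = sum;

    checksum ^= (unsigned)base;                   // XOR in again at last use of base
    if (checksum != 0u) atomicExch(err_flag, 1);  // live-range check failed

    // Range check: flag accumulations suspiciously far from the profiled average.
    if (fabsf(acc / 4.0f - profiled_avg) > 10.0f * fabsf(profiled_avg))
        atomicExch(err_flag, 1);                  // potential error; diagnose later
}
```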

Potential errors are then distinguished from false positives using a diagnosis and tolerance algorithm. Experiments conducted on real hardware revealed that Hauberk was able to capture 86.8% of injected faults on average, leaving 13.2% of injected faults to lead to SDCs. It should be noted that this type of coverage metric does not provide any reliability guarantee for any part of the system, and so should again be distinguished from DMR: probabilistic protection provides an average probability of fault detection over the entire architecture, whereas DMR provides 100% fault detection within an SoR that may only cover portions of the architecture. Performance overheads were 15% on average, ranging from very low (1.9%) to very high (∼52%) depending on the time spent by a GPU kernel in loop versus nonloop code.

Li et al. also examined the overheads of probabilistic, symptom-based protection on real GPU hardware [39]. They used the Lynx dynamic instrumentation tool for NVIDIA GPU kernels [40] to instrument the PTX IR of real GPU kernels with error-checking code. Li et al. implement three different software checkers: (1) alignment checkers for errors in memory address operands, (2) array bounds checkers for detecting errors in array addresses, and (3) control flow checkers for errors in branch addresses. Alignment checkers verify that the target address of each memory instruction is a multiple of the target data size (i.e., properly aligned relative to the base address); if the check fails, error-logging code inserted after the check writes error data out to a memory log so that the CPU can take appropriate action (e.g., rerun the GPU kernel). Array bounds checkers test whether the address operands of memory instructions fall within known global allocations; a table of these global allocations is computed at compile time and queried by the checker on every single memory access. Li et al.'s control flow checkers use the same control-flow checking technique implemented for CPUs by Oh et al. [41] and are inserted into each GPU basic block. Control flow checkers place a unique signature at the beginning of each basic block corresponding to its legal basic-block predecessors. If the signature calculated on entry to the block does not match the unique signature determined by its legal predecessors, the entry into the block was not a legal one, and an error is presumed. Probabilistic error coverage was not calculated for these techniques via ACE analysis or fault-injection. Total overheads for all evaluated benchmarks ranged from 1% to 737%, with the most expensive overheads contributed by the alignment and array bounds checkers.
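The sketch below shows the flavor of an alignment check paired with a device-side error log, written by hand in CUDA rather than injected into PTX by Lynx; the ErrorLog layout and the kernel are assumptions for illustration.

```cuda
#include <cstdint>

// Hypothetical device-side error log: a counter plus a small array of records
// that the host inspects after the kernel (e.g., to decide on a re-run).
struct ErrorLog {
    unsigned count;
    uint64_t bad_addr[64];
};

// Hand-written analogue of an alignment checker. By construction the address
// should always be float-aligned, so a misaligned address is a symptom that
// the address computation was corrupted in flight.
__global__ void gather_with_alignment_check(const float* __restrict__ data,
                                            const int* __restrict__ idx,
                                            float* out, int n, ErrorLog* log) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float* p = data + idx[i];                     // address to be checked
    if (((uintptr_t)p) % sizeof(float) != 0) {          // misaligned: likely fault
        unsigned slot = atomicAdd(&log->count, 1u);     // reserve a log entry
        if (slot < 64) log->bad_addr[slot] = (uint64_t)(uintptr_t)p;
        out[i] = 0.0f;                                  // do not consume the bad value
        return;
    }
    out[i] = *p;                                        // normal path
}
```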

In separate but related research with AMD Research, Li et al. proposed combining parts of Hauberk and the software checkers from their previous work into a single introspective reliability runtime framework [42]. Dubbed dynamic reliability management, or DRM, the framework monitors a GPU kernel at runtime and selectively applies different software reliability enhancements (SREs) via just-in-time (JIT) compilation depending on the vulnerability of the running kernel. To characterize the vulnerability of the kernel, Li et al. calculate the kernel's PVF, discussed previously. PVF can be calculated with a static data-flow analysis, and thus can be computed offline on a kernel's IR without affecting runtime performance. The overhead of each SRE is then estimated using an SRE-specific performance model, and given the PVF and the estimated overhead, DRM decides whether or not to apply the SRE. Once the kernel executes, profile information can be gathered and used to update the SRE-specific performance model. The SREs that can be applied to the kernel are (1) Hauberk's checksum-based value live-range checks, (2) instruction duplication, and (3) control flow checkers as previously described. Li et al. measured the overhead of individual SREs applied to matrix multiplication and showed that instruction duplication increased runtimes by 1.10×, 1.34×, and 1.37×, depending on whether the duplicated computation was integer, floating point, or a memory load, respectively. Control flow checking added only 1% to the runtime of the matrix multiplication kernel. Live-range checking, however, increased the kernel runtime by 2.51×.
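A toy sketch of the decision step in such a framework is shown below; the SRE list, the coverage and overhead fields, and the thresholds are assumptions for illustration and do not reproduce the paper's actual models.

```cuda
// Toy sketch of the decision step of an introspective reliability runtime.
enum class SRE { None, ValueLiveRangeCheck, InstructionDuplication, ControlFlowCheck };

struct SreModel {
    SRE    technique;
    double predicted_overhead;   // e.g., 1.10x .. 2.51x, refined by later profiles
    double coverage_gain;        // estimated fraction of vulnerability removed
};

// Pick the cheapest SRE whose estimated residual vulnerability fits the budget;
// the runtime would then JIT-compile the kernel with that SRE applied.
SRE choose_sre(double kernel_pvf, double vulnerability_budget,
               const SreModel* models, int n_models) {
    SRE    best = SRE::None;
    double best_overhead = 1e9;
    for (int m = 0; m < n_models; ++m) {
        double residual = kernel_pvf * (1.0 - models[m].coverage_gain);
        if (residual <= vulnerability_budget &&
            models[m].predicted_overhead < best_overhead) {
            best = models[m].technique;
            best_overhead = models[m].predicted_overhead;
        }
    }
    return best;
}
```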

5 Summary

This chapter surveyed the current state-of-the-art research into GPU reliability. We first described some mechanisms for failures in GPUs, the most insidious of which is the radiation-induced transient fault. Transient or “soft” faults are caused by cosmic radiation that can inject charge into silicon, flipping a bit to an incorrect value. These types of faults are more difficult to detect than permanent hardware failures because they are rare and nondeterministic. If a fault silently propagates to program output, we call this a silent data corruption, or SDC. SDCs are the worst possible outcome of a fault: when a fault is detected, even if it is unrecoverable, the known error can at least motivate some sort of recovery mechanism, whereas an SDC goes entirely unnoticed.

We then framed GPU reliability as an important problem. High-node-count data centers and supercomputers experience transient faults much more often than single GPUs, and field data gathered from real supercomputing systems supports this claim. Therefore, although GPUs are primarily gaming products, they must offer reliability guarantees similar to those of server-class CPUs. This problem is made more difficult by the fact that the economics driving changes to GPU architecture and design are still dominated by the gaming and graphics markets, rather than scientific or HPC use cases.

We then looked at the state-of-the-art research in hardware reliability enhancements for GPUs, showcasing various hardware structures that are much less expensive than full hardware redundancy while greatly increasing protection against SDCs. One technique used underutilized GPU scratchpad memory to provide ECC-like protection without the large area and power overhead of dedicated ECC hardware. While reducing the cost of ECC is a net benefit, ECC and SRAM protection techniques do not protect pipeline logic. One technique addressed this reliability hole with special redundant ALUs and buffers; others attempted to take advantage of underutilization in SIMD hardware to execute redundant instructions at low cost.

Finally, we surveyed current attempts to provide reliability for GPU hardware with purely software reliability enhancements. Hardware reliability enhancements generally protect against a large percentage of possible SDCs, but cannot be applied to legacy hardware and are extremely expensive (in dollars, engineering hours, GPU performance, power, etc.) to include in future hardware. Therefore software reliability enhancements are proposed as a cheap, practical alternative, as long as the added reliability and the additional performance and power overheads are acceptable. An application can simply be run twice, gaining ∼100% protection against SDCs, but this generally incurs a ∼100% performance overhead. More fine-grained replication can take advantage of the same SIMD underutilization previously mentioned, but this often comes at the cost of reduced coverage: redundant threads in the same SIMD instruction do not protect against faults in shared structures such as a GPU's SIMD instruction buffer. While redundant threads protect a subset of a GPU from ∼100% of SDCs, probabilistic software reliability techniques provide only an average probability of catching an SDC over the runtime of the program. This compromise generally leads to lower overhead, but is harder to reason about, as no reliability guarantees are provided anywhere by the hardware or software. Probabilistic coverage is implemented as special checker instructions, selective reexecution, and duplication of instructions and data. Such selective techniques can be employed in an introspective manner, so that profiling or runtime information can guide the system to employ the best reliability scheme with the lowest performance penalty.

Whereas CPU reliability research spans decades, GPU reliability research is a relatively young field. And while GPUs face challenges similar to those of CPUs, a GPU's unique architecture means that potential solutions to these problems may offer very different protection and carry very different performance and power overheads. Thus practical solutions to GPU reliability may look different from those used to help protect server-class CPUs, and may also be much more costly. One possible path forward is to codify unreliable behavior into an architecture's ISA. This line of thinking asserts that if we cannot uphold the hardware/software contract provided by the ISA, then we should notify programmers and force them to account for the consequences. Though it is unclear whether such fast-but-approximate computing will ever be accepted by programmers, future research should consider these techniques and practical implementations of them.

References

[1] Mukherjee S. Architecture Design for Soft Errors. Morgan Kaufmann Publishers Inc.; 2008.

[2] Haque I.S., Pande V.S. Hard data on soft errors: a large-scale assessment of real-world error rates in GPGPU. In: 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), May 17–20. 2010.

[3] Constantinescu C. Trends and challenges in VLSI circuit reliability. IEEE Micro. 2003;23(4):14–19.

[4] Sheaffer J.W., Luebke D.P., Skadron K. The visual vulnerability spectrum: characterizing architectural vulnerability for graphics hardware. In: Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware. Vienna, Austria: ACM; 2006:9–16.

[5] NVIDIA. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi. 2009.

[6] Strohmaier E., Dongarra J., Simon H., Meuer M. TOP500 Lists, November. 2010. http://www.top500.org/lists/2010/11/.

[7] Morgan T.P. Tesla Compute Drives NVIDIA Upwards. 2015.

[8] Li X., Shen K., Huang M.C., Chu L. A memory soft error measurement on production systems. In: 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference. Santa Clara, CA: USENIX Association; 2007:1–6.

[9] Schroeder B., Pinheiro E., Weber W.D. DRAM errors in the wild: a large-scale field study. In: Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems. Seattle, WA, USA: ACM; 2009:193–204.

[10] Sridharan V., Liberty D. A study of DRAM failures in the field. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. Salt Lake City, UT: IEEE Computer Society Press; 2012:1–11.

[11] Siddiqua T., Papathanasiou A.E., Biswas A., Gurumurthi S. Analysis and modeling of memory errors from large-scale field data collection. In: Workshop on Silicon Errors in Logic-System Effects (SELSE). 2013.

[12] Sridharan V., Stearley J., DeBardeleben N., Blanchard S., Gurumurthi S. Feng Shui of supercomputer memory: positional effects in DRAM and SRAM faults. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. Denver, CO: ACM; 2013:1–11.

[13] Di Martino C., Kalbarczyk Z., Iyer R.K., Baccanico F., Fullop J., Kramer W. Lessons learned from the analysis of system failures at petascale: the case of blue waters. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), June 23–26. 2014.

[14] Tiwari D., Gupta S., Rogers J., Maxwell D., Rech P., Vazhkudai S., Oliveira D., Londo D., DeBardeleben N., Navaux P., Carro L., Bland A. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), February 7–11. 2015.

[15] Sridharan V., Kaeli D.R. Quantifying software vulnerability. In: Proceedings of the 2008 Workshop on Radiation effects and Fault Tolerance in Nanometer Technologies. Ischia, Italy: ACM; 2008:323–328.

[16] Mukherjee S.S., Weaver C., Emer J., Reinhardt S.K., Austin T. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In: Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003 (MICRO-36), December 3–5. 2003.

[17] Tan J., Goswami N., Li T., Fu X. Analyzing soft-error vulnerability on GPGPU microarchitecture. In: 2011 IEEE International Symposium on Workload Characterization (IISWC), November 6–8. 2011.

[18] Farazmand N., Ubal R., Kaeli D. Statistical fault injection-based AVF analysis of a GPU architecture. In: Workshop on Silicon Errors in Logic-System Effects (SELSE). 2012.

[19] Jeon H., Wilkening M., Sridharan V., Gurumurthi S., Loh G. Architectural vulnerability modeling and analysis of integrated graphics processors. In: Workshop on Silicon Errors in Logic-System Effects (SELSE). 2013.

[20] Wang N.J., Mahesri A., Patel S.J. Examining ACE analysis reliability estimates using fault-injection. In: Proceedings of the 34th Annual International Symposium on Computer Architecture. San Diego, CA, USA: ACM; 2007:460–469.

[21] Yim K.S., Pham C., Saleheen M., Kalbarczyk Z., Iyer R. Hauberk: lightweight silent data corruption error detector for GPGPU. In: 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 16–20. 2011.

[22] Fang B., Pattabiraman K., Ripeanu M., Gurumurthi S. GPU-Qin: a methodology for evaluating the error resilience of GPGPU applications. In: 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), March 23–25. 2014.

[23] NVIDIA. CUDA-GDB. 2015. http://docs.nvidia.com/cuda/cuda-gdb/.

[24] Stephenson M., Hari S.K.S., Lee Y., Ebrahimi E., Johnson D.R., Nellans D., O’Connor M., Keckler S.W. Flexible software profiling of GPU architectures. In: Proceedings of the 42nd Annual International Symposium on Computer Architecture. Portland, OR: ACM; 2015:185–197.

[25] Hari S.K.S., Tsai T., Stephenson M., Keckler S., Emer J. SASSIFI: evaluating resilience of GPU applications. In: 11th Workshop on Silicon Errors in Logic-System Effects (SELSE), Austin, TX. 2015.

[26] Rech P., Carro L., Wang N., Tsai T., Hari S.K.S., Keckler S.W. Measuring the radiation reliability of SRAM structures in GPUs designed for HPC. In: Workshop on Silicon Errors in Logic-System Effects (SELSE). 2014.

[27] NVIDIA. NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110. 2012. https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.

[28] Tan J., Li Z., Fu X. Cost-effective soft-error protection for SRAM-based structures in GPGPUs. In: Proceedings of the ACM International Conference on Computing Frontiers. Ischia, Italy: ACM; 2013:1–10.

[29] Sheaffer J.W., Luebke D.P., Skadron K. A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors. In: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware. San Diego, CA: Eurographics Association; 2007:55–64.

[30] Jeon H., Annavaram M. Warped-DMR: light-weight error detection for GPGPU. In: Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society; 2012:37–47.

[31] Tan J., Fu X. RISE: improving the streaming processors reliability against soft errors in GPGPUs. In: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. Minneapolis, MN, USA: ACM; 2012:191–200.

[32] Nathan R., Sorin D.J. Argus-G: comprehensive, low-cost error detection for GPGPU cores. Comput. Archit. Lett. 2015;14(1):13–16.

[33] Meixner A., Bauer M.E., Sorin D.J. Argus: low-cost, comprehensive error detection in simple cores. In: 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007 (MICRO 2007), December 1–5. 2007.

[34] Mukherjee S.S., Kontz M., Reinhardt S.K. Detailed design and evaluation of redundant multi-threading alternatives. In: Proceedings of the 29th Annual International Symposium on Computer Architecture. 2002.

[35] Rotenberg E. AR-SMT: a microarchitectural approach to fault tolerance in microprocessors. In: Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, 1999, Digest of Papers, June 15–18. 1999.

[36] Wang C., Kim H.S., Wu Y., Ying V. Compiler-managed software-based redundant multi-threading for transient fault detection. In: Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society; 2007:244–258.

[37] Dimitrov M., Mantor M., Zhou H. Understanding software approaches for GPGPU reliability. In: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, Washington, DC, USA. ACM; 2009:94–104.

[38] Wadden J., Lyashevsky A., Gurumurthi S., Sridharan V., Skadron K. Real-world design and evaluation of compiler-managed GPU redundant multithreading. In: Proceeding of the 41st Annual International Symposium on Computer Architecture. Minneapolis, MN, USA: IEEE Press; 2014:73–84.

[39] Li S., Farooqui N., Yalamanchili S. Software reliability enhancements for GPU applications. In: IEEE Workshop on Silicon Errors in Logic-System Effects (SELSE). 2013.

[40] Farooqui N., Kerr A., Eisenhauer G., Schwan K., Yalamanchili S. Lynx: a dynamic instrumentation system for data-parallel applications on GPGPU architectures. In: 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 1–3. 2012.

[41] Oh N., Shirvani P.P., McCluskey E.J. Control-flow checking by software signatures. IEEE Trans. Reliab. 2002;51(1):111–122.

[42] Li S., Sridharan V., Gurumurthi S., Yalamanchili S. Software-based dynamic reliability management for GPU applications. In: IEEE Workshop on Silicon Errors in Logic-System Effects (SELSE), April. 2015.
