

Accelerated alpha particle tests, 62–63
Accelerated neutron tests, 63–66
monoenergetic neutron beam, 64
proton beam, 65
white neutron beam, 63
Active load address buffer (ALAB), 234–235
Alpha ISA, 152
Alpha particle, 20
accelerated tests, 62–63
architectural fault models for, 30–31
contamination, 2
impact on circuit elements, 45
interaction with silicon crystals, 26
soft errors due to, 63
Alpha radiation, 21
AMD’s Opteron processor, 133, 187
AN codes, 182–183, 315–316
Application-level recovery, 315
Architectural ACE bits, 90
Architectural ACE versus un-ACE paths, 91
Architectural derating factor, 80
Architecturally correct execution:
instruction per cycle (IPC) of, 101
principles, 90
types of, 90
Architecturally correct execution analysis:
and fault injection, comparison of, 147–149
using point-of-strike fault model, 106
using propagated fault model, 114–117
Architectural un-ACE bits:
dynamically dead instructions, 95
logical masking, 96
NOP instructions, 94
performance-enhancing operations, 94
predicated false instructions, 95
Architectural vulnerability factor:
algorithm, data structures for, 106
basics, 80
of bit, 81
of branch commit table, 100
of CAM arrays, 135, 143–144
DUE and SDC, 86
of hardware structure, 96–97
of Itanium®2 execution unit, 113–114
of Itanium®2 instruction queue, 109–113
of latches, 148
of RAM arrays, 123, 141–142
from SoftArch’s evaluation, 114–117
using ACE analysis, 104–105
using Little’s law, 98–101
using performance model, 101–105
using SFI, 146
Arithmetic codes. See also AN codes; Residue codes
AR-SMT, 239
Assertion checkers, 299


Backward error recovery, 256
checkpoint-based schemes, 256–257, 319
with fault detection after I/O commit, 292
with fault detection before I/O commit, 283
with fault detection before memory commit, 277
with fault detection before register commit, 263, 270
granularity of fault detection in, 257–258
incremental and periodic checkpointing, 278
log-based schemes, 317
output and input commit problems, 256, 264
using global checkpoints. See ReVive
using local checkpoints. See SafetyNet
BCH codes, 177
Binary translation, 306–307
Black’s law, 15
Blech effect, 16
Bohr model of atom, 21
Boron-10 isotopes, 3
Boro-phospho-silicate glass (BPSG), 3
Bragg peak, 27
Branch outcome queue, 237, 240
Branch predictors, faults in, 88, 214–215
Buffer control element (BCE), 220
Burn-in, 8
Burst errors, 178
Burst generation rate (BGR) method, 47


Cache-coherent shared-memory multiprocessors, 288
Cached load data, 231, 233
input replication of, 234
C-Element, 74
Checking store buffer (CSB), 310
Checkpoint-based backward error recovery:
compile- and run-time methods in, 320
for shared-memory programs, 319
Checkpoints, 256–257, 286–287
Chip-external fault detection, 281
Chip-level redundantly threaded processor with recovery (CRTR), 269–270
Chip-level redundant threading (CRT) processors, 240–241
Chip multiprocessor (CMP). See Multicore processor
Circuit-level SERs, modeling of, 44
Clock circuits, vulnerability of, 59–60
Clock jitter, 59–60
Code bits, 163, 166–168
Code words, 163
Hamming distance of, 164–165
Combinatorial logic gates:
masking effects in, 52
SER of, 56
Compiler-assisted fault tolerance (CRAFT), 310
evaluation of, 311
versus SWIFT, 310
Complementary metal oxide semiconductor transistors:
field funneling effect in, 49
permanent faults in, 14
radiation-induced transient faults in, 20
structure of, 17
switching speed of, 17, 68
Configurable transient fault detection, 306
Content-addressable memory arrays:
AVF of, 135
best-estimate SDC AVFs of, 145–146
bit flip in, 122
of data translation buffer, 143
DUE AVF of, 146
false-negative matches in, 137
false-positive matches in, 135–137
Hamming-distance-one match in, 137
lifetime analysis of, 134
mechanics of, 122
of store buffer, 143
of write-through and write-back cache, 144
Cosmic radiation, 3
2048-CPU server system, 3
Critical charge (Qcrit), 3, 29
computation of, 46
to FIT, semiempirical mapping of, 46
Cycle-by-cycle lockstepping, 212. See also Lockstepping
Cyclic redundancy check codes, 178–181, 300
encoding and decoding process, 179–180
generator polynomials, 181
principle of, 179


Database logs, 317
log anchor, 318
log files, 318
log manager, 319
structure of, 318
Database systems, 317
Datapath latches, 114
Data translation buffer, 128, 139
CAM array, 143
RAM array, 142
Deadlocks, for synchronization primitives, 321
Dead man timer, 294
Decoding process. See Encoding and decoding process
Delay buffer, 239
Dependability models, 11–14
availability, 13
maintainability, 13
performability, 14
reliability, 12
safety, 14
Dependence-based checking elision (DBCE), 245
Detected unrecoverable error, 3
AVF of bit, 86
budgets, 34
definitions, 32–34
false events, 82, 86, 132–133, 195–197, 214
FIT of bit, 83
FIT of chip, 84–85
process-kill versus system-kill events, 89
tolerance in application servers, 35
true events, 33, 82
Distributed parity, 289
Double bit errors:
detection of, 174
kinds of, 189
Double-bit faults, 176
Double-error correct triple-error detect code, 176–178
Hamming distance of, 177
parity check matrix for, 177
syndrome, 178
Dual-in-line packages, 63
Dual-interlocked cell (DICE), 71–72
Dual in-line memory module (DIMM), 37
Dual modular redundancy (DMR) system, 208, 259–260
Dynamically dead instructions, 95, 196, 246
Dynamically scheduled superscalar pipeline, 152–153
masking effects of injected faults in, 152
transient faults in, 154–157
Dynamic implementation verification architecture (DIVA), 241–242
CHKcomm pipeline, 243
CHKcomp pipeline, 242
trade-offs in, 243
Dynamic logic gate:
evaluating NAND function, 57–58
evaluating NOR function, 59
masking effects in, 57–59
Dynamic random access memory:
FIT/bit of, 62
scaling trends, 37–38


Edge effects, 138
Edge-triggered flip-flop, 50. See also Flip-flop
Edge-triggered latch, 55
Electrical masking, 37, 53
modeling of, 55
Electromigration (EM), 15–16
Electron–hole pairs, 18, 27–29
Electrons, 21
Emitter-coupled logic (ECL), 220
Encoding and decoding process, 162–163, 179–180
Endurance 4000, 223–224
Error, 7–9
isolation of, 203
recording information about, 203
Error codes, 161
Error coding:
area overhead of, 189–190
basics of, 162
Error correction codes, 2, 5
for state bits, 162
overheads of, 187–190
Error detection:
for execution units, 181
overheads of, 187–190
using parity codes, 168–169
Error detection by duplicated instructions (EDDI), 303
evaluation of, 304
transformation, 303–304
Error information, propagation of, 197
Error recovery mechanism, 254
EverRun servers, 223, 297, 315
Exponential failure law, 12
External interrupts, 233
Extrinsic faults, 14


Fail-over systems, 258–259
Failure in time, 9–10
of bit-level DUE, 83
of bit-level SDC, 83
of chip-level DUE, 84–85
of chip-level SDC, 84–85
mapping of Qcrit to, 46
Failure in time/bit:
of DRAM, 62
of SRAM cell, 61
Failures, 8
False errors, 201
on conditional branches, 196
detection of, 194
on dynamically dead instructions, 196, 199
in narrow values, 196, 200
on neutral instruction types, 198
and true errors, difference between, 197–198
on uncommitted instructions, 198
Fault detection, 5
after I/O commit, 292
C-element for, 74
granularity of, 257–258
before I/O commit, 283
before memory commit, 277
before register commit, 263
in SRT-Memory sphere, 286
using binary translation, 306
using cycle-by-cycle lockstepping, 212
using redundant execution, 208
using RMT, 222
Fault-free checkpoint, 278, 281
Fault isolation, 313
Fault propagation, 116
Faults, 6–7
in branch predictors, 88, 214–215
in logic gates, 53
in silicon chips, 6
Fault screeners:
natural versus induced perturbations, 274–276
versus parity code, 273–274
research in, 276–277
Fault screening, with pipeline squash and re-execution, 173
Fault secureness, 182
Fault-tolerant computer system, 212, 216, 218, 259
Faulty bits:
in microprocessor, 81
outcomes of, 32–33
Fetch throttling, 271
Field data collection, 62
Field funneling, 49
Field-replaceable units (FRUs), 203
Fingerprinting, 278, 280
chip-external fault detection using, 281
First-level dynamically dead (FDD) instructions, 95, 107, 196
Fixed-interval scrubbing, 193–194
Flip-chip packages, 63
Flip-flop:
timing diagram of, 50–51
TVF of, 50
Forward error recovery, 255
DMR systems, 259
fail-over systems, 258–259
pair-and-spare systems, 262
triple modular redundancy system, 260–262
using triplication and arithmetic codes, 315
Fujitsu SPARC64 V processor:
error checkers in, 265
parity with retry, 264–265
Full adder, logic diagram of, 54
Full-state comparison bandwidths, 281–282


Galactic particles, 22
Gate oxide failure modes, 17
Gate oxide insulation, 17
Gate oxide wearout, 18
Geomagnetic rigidity (GR), 25
Global checkpoints, 288, 290, 321
Global recovery point, 291


Hamming code, 172
Hamming distance:
of code word, 164–165
of DECTED code, 177
of parity code, 168
of SEC codes, 173
of SECDED codes, 174
Hamming-distance-one analysis, 122, 135, 137
Hard errors, 8
Hardware assertions, 200–202
Hardware error recovery schemes, 254
Hazucha and Svensson model, 46
Hewlett-Packard NonStop Himalaya architecture, lockstepping in, 218–219
Hewlett-Packard NSAA. See NonStop® Advanced Architecture
High-k materials, 17
High-performance microprocessor, 70, 102
History buffer:
adding entries to, 279
freeing up entries in, 279
recovery using, 279
structure of, 279
Hot carrier injection (HCI), 18
Hybrid RMT implementation, 310
“Hydrogen-release” model, 19
Hypervisors, 313


IA64, 95, 107, 109
IBM G5’s Lockstepped processor architecture, 220–222
IBM Z-series processors:
lockstepping in, 220
lockstepping with retry, 265
ICount policy, 228, 237
Incremental checkpoint, using history buffer. See History buffer
Inelastic collisions, 26
In-line error detection, 187
Instruction fetch buffer, 312
Instruction queue, 98, 101, 112, 197, 270
pipeline squash for, benefits of, 272–273
Instruction reuse buffer, 246
Integer register file, 312
Interleaving, 168–169, 190
Intermittent errors, 8
Intermittent faults, 6
Intrinsic faults, 14
Itanium® 2 execution unit, 108
AVF analysis for, 113–114
Itanium® 2 instruction queue, 108–109
ACE and un-ACE breakdown of, 109–110
AVF analysis for, 109–113
Itanium® architecture, 195
Itanium® processor, 1, 66, 159
Itanium® 2 performance model:
evaluation methodology, 107
program-level decomposition, 108


Joint Electron Device Engineering Council (JEDEC) standard, 23, 63


Latches, 30–31
addition of capacitors to, 70
AVF of, 148
fault injection in, 154–157
in performance simulator, 148
scaling trends, 36
SERs of, 37
vulnerability of, 155
Latch-window masking, 54–56
Lifetime analysis:
of ACE and un-ACE components, 124
of CAM arrays, 134
cooldown in, effect of, 138–140
of RAM arrays, 123
Linear particle accelerators, 76
Little’s law, 181
AVF breakdown for instruction queue with, 112
for AVF computation, 98–101
Load/store queue (LSQ), 228
Load value queue, 235–236, 240, 268, 283, 311
logging loads using, 287
in SRT processor, 236, 284
Lockstep failure, 214
Lockstepped checkers, 87–89
Lockstepping, 87, 211
advantages of, 213
disadvantages of, 213–216, 225
in HP NonStop Himalaya architecture, 218–219
in IBM Z-series processors, 220, 265
in software, 301
in Stratus ftServer, 216–218
Lockstep processors, 214–215
Log-based error recovery, 283
in database systems, 317
in piecewise deterministic system, 283
Logical masking, 53, 96
logic-level simulation for, 57
modeling of, 54
Logical synchronization unit (LSU), 226
Logic derating factor, 80, 118
Logic gates:
faults in, 53
SER of, 52
technology scaling on, 57
Log sequence numbers (LSNs), 318
Loose lockstepping, 212, 225. See also Lockstepping
Los Alamos Neutron Science Center (LANSCE), 47


Machine check architecture, 202–203
Marathon InterConnect (MIC) card, 223
Mean instructions to failure (MITF), 11, 271–272
Mean time between failures (MTBF), 10
Mean time to failure (MTTF), 5, 9, 103, 271
computation of, 114–116
of microprocessors, 5, 66
of temporal double-bit error, 191
Mean time to repair (MTTR), 10
Mean work to failure (MWTF), 11, 312–313
Median time to failure (MeTTF), 9
Memory cells, 31, 44, 179
Metal failure modes:
electromigration, 15–16
metal stress voiding, 16
Metal lines, voids in, 15
Metal stress voiding (MSV), 16
Metrics, 9–11
Microarchitectural ACE bits, 90
Microarchitectural un-ACE bits:
ex-ACE state, 93
idle or invalid state, 93
misspeculated state, 93
predictor structures, 93
Microprocessor, 4, 30, 53, 102
false DUE events in, 195–197
faulty bit in, 81
instruction queue in, 98
MTTF of, 5, 66
predictor structures of, 93
SER of, 43
validation of, 214
Mitigation techniques:
circuit enhancements, 68–74
device enhancements, 67–68
Monoenergetic neutron beam, 64
Multibit errors, 31–32
Multibit faults, 31
Multicore architecture, RMT in, 240
Multicore processor, 240, 269


Negative bias temperature instability (NBTI), 19
Neutron, 21, 23
accelerated tests, 63–66
impact on circuit elements, 45
interaction with silicon crystals, 26
Neutron beam, 65–66
Neutron cross-section (NSC) method, 48
Neutron flux, 23–25
Neutron-induced SER, 62–63
Neutron strike:
architectural fault models for, 30–31
on storage device, 31
nMOS transistors, 17
Nonrecovery mode, handling faults in, 286
NonStop® Advanced Architecture, 211, 225–227
reintegration in, 261–262
NonStop kernel, 218, 262
NonStop servers, 225
NOP instructions, 94


Odd-weight column SECDED code, 175, 187. See also Single-error correct double-error detect code
Online transaction processing (OLTP) workload, 281
OpenMP library, 320
OS-level recovery, 299, 322
Out-of-band error decoding and correction, 189


Pair-and-spare systems, 262
Paravirtualization, 313
Parity bits, 170, 289
Parity check matrix:
of DECTED code, 177
properties of, 176
of SEC code, 170–172
of SECDED code, 174–176
Parity codes, 168–169
Parity prediction circuits:
for addition operation, 185
for multipliers, 186
Partial RMT techniques, 245–246
π bit, 197
on caches and memory, 200
for every register, 199
Perceptual vulnerability factor, 81
Periodic checkpoint, 278
with fingerprinting, 280
Permanent errors, 8
Permanent faults, 6
in CMOS transistors, 14
Pin dynamic instrumentation framework, 306
Pions and muons, 23, 29
pMOS latch, 71
pMOS transistors, 17
Point-of-strike fault model, 106
versus propagated fault model, 91–92
Polynomial division, 179
potentialCheckpoint() call, 319
Predicated false instructions, 95
Predicate register file, 312
Primary cosmic rays, 22
Process-kill DUE events, 89
Process pair, 262
Product codes, 170
Program’s execution, fault-free and faulty flow of, 105
Propagated fault model, 114–117
Propagation delay, 51–53
Proton beam, 65
Protons, 21
Pseudo-device driver (PDD) software layer, 323


Radiation exposure reduction:
with pipeline squash, 270
triggers and actions, 271
Radiation-hardened cells:
DICE latch, 72
DICE memory cell, 72
pMOS latch, 71
Radiation-hardening, 70
Radiation-induced transient faults, 2
in CMOS transistors, 20
Radioactive contamination, 3
Radioactive isotopes, 62
Random access memory (RAM) arrays:
AVF of, 123
best-estimate SDC AVFs of, 142–145
of data translation buffer, 142
DUE AVF of, 131–134, 146
fault injection in, 154–157
of store buffer, 142
of write-through and write-back cache, 141
Random access memory arrays, lifetime analysis of:
basics, 123
of bit, 124
effect of cooldown in, 125
granularity of, 130
of one-bit cache, 126
structural differences in, 125
working set size for, 129
Reboot, 255
Recovery mode, handling faults during, 287
Redundant execution schemes, 207
Redundant multithreading (RMT), 219, 222
enhancements in, 244
in Hewlett-Packard NSAA, 225–227
implementation in software. See Software RMT implementation
in Marathon Endurance server, 223–225
in multicore architecture, 240
performance degradation reduction, 244
relaxed input replication, 244
relaxed output comparison, 245
in single-processor core, 227
using specialized checker processor, 241
Redundant virtual machine (RVM), 299, 313–315
Register check buffer, 231–232
Register name authentication (RNA), 201
Register transfer language (RTL), 102, 148
Register update unit (RUU), 228
Register value queue (RVQ), 268
Reliability and Security Engine (RSE), 201
Rendezvous point, 226
Residue codes, 183–185
for addition, 183
for integer operations, 183
for multiplication, 183
ReVive, 284, 288
distributed parity, 289
global checkpoint creation, 290
logging writes, 289
“R Unit,” 248–249


SafetyNet, 284
checkpoint coordination in, 291
global recovery point, 291–292
local checkpoint creation, 290
Scrubbing, 134, 176, 190–194
Secondary cosmic rays, 23
ServerNet, 219
Shared-memory parallel program:
deadlock scenarios for barrier and locks, 321–322
with potentialCheckpoint() call, 319
saving state, 321
Signature checkers, 299–300
Signatured instruction streams (SIS), 299–300
Silent data corruption, 3. See also Detected unrecoverable error
AVF of bit, 86
budgets, 34
definitions, 32–34
FIT of bit, 83
FIT of chip, 84–85
tolerance in application servers, 35
Silicon chips, 4
faults in, 6
lifetime of, 19
Silicon-on-insulator (SOI) technology, 67–68
Simultaneous and redundantly threaded processor with recovery (SRTR) processor, 266–268
active list and shadow active list, 268
commit vectors, 269
load value queue, 268
prediction queue (predQ), 268
register value queue, 268
Simultaneous and redundantly threaded (SRT)-memory:
fault detection in, 286
input replication in, 232
output comparison in, 230–231
Simultaneous and redundantly threaded (SRT) processor:
asynchronous interrupts in, 288
checkpointing in, 286
input replication in, 232
instruction replication in, 232
load value queue (LVQ)-based recovery in, 236, 284
logging in, 286
output comparison in, 230
performance evaluation of, 236, 238
redundant threads in, 229
sphere of replication in, 229–230
Simultaneous and redundantly threaded (SRT)-register:
input replication in, 233
output comparison in, 231–232
Simultaneous multithreaded (SMT) processor, 228–229
Single-bit error, 32, 165, 167, 170, 174
Single-bit faults, 32, 135, 137, 162–163
Single-error correct double-error detect code, 132, 165, 174–176
Hamming distance of, 174
parity check matrix of, 174–176
syndrome, 174
Single-error correction, 170–173
encoder and decoder, 187–188
Hamming distance of, 173
overhead of, 166, 168
parity check matrix for, 170–172
syndrome, 172
Slack fetch mechanism, 237
SlicK, 246
SoftArch, 114–117
Soft error rates, 5, 11, 30
of CMOS chips, 24
of combinatorial logic gates, 56
of latches, 37
of logic gates, 52
measurements of, 60
of SRAM cells, 36
Soft errors:
accelerated measurements of, 62–63
cost-effective solutions to, 4–6
due to alpha particles, 63
evidence of, 2–3
field data on, 62
protection schemes, 5
scaling trends, 36–38
sensitivity, 80
types of, 3–4
Software assertions, 299
Software bugs, 4, 259
Software checkers, 299–300
Software error recovery, 299, 315
Software fault detection, 299
limitations of, 309
using hybrid RMT, 309
using RVMs, 313
using signatured instruction streams, 299–300
using software RMT, 301
Software fault-tolerance, 297
implementation options for, 298
Software-implemented fault tolerance (SWIFT), 305–306
Software RMT implementation, 298, 303
fault detection using, 301
sphere of replication of, 302
using binary translation, 306
Solar cycle, 22
Solar particles, 22
Spallation reaction, 64
SPEC CPU 2000 benchmarks, 95, 281
SPEC CPU 2000 floating-point (SPEC CFP), 282
SPEC CPU 2000 integer (SPEC CINT), 282
SPECWeb workload, 281
Sphere of replication, 208, 223
components of, 208–209
in Endurance machine, 223
in G5 microprocessor, 220
inputs to, 232
in NSAA, 226
output comparison and input replication, 211
size of, 209–211
in SRT processor, 229
Spot, 306
evaluation of, 307
performance-reliability trade-off, 308–309
S390 Servers, 248
Static random access memory, 3
addition of capacitance to, 69–70
alpha particle impact on, 45
FIT/bit of, 61
scaling trends, 36
Statistical fault injection (SFI), 102, 148
architectural and microarchitectural state comparison in, 151
AVF computation using, 146
in latches and RAM cells, 156–157
random sampling in, 149–150
into RTL model, 148, 151
Statistical fault injection (SFI) study, at Illinois:
logic blocks in, 156–157
methodology, 152–154
processor model in, 152
Stopping power, 26–29
Store buffer, 128, 132
CAM array, 143
RAM array, 142
Store value prediction, 246
Stratus ftServer, 259, 261
DMR configuration, 216
fault detection and isolation, 216–217
lockstepping in, 216–218
TMR configuration, 217
SWIFT-R triplication and validation, 316
Symmetric multiprocessors (SMP), 216
Symptomatic fault detection, 273
Syndrome, 172, 174, 178
System-kill DUE events, 89
System-wide checkpoints, 283


Temporal double-bit error:
DUE FIT of, 191–194
with fixed-interval scrubbing, 193–194
MTTF of, 191–193
without scrubbing, 191–192
Terrestrial cosmic rays, 24
Terrestrial differential neutron flux, 25
Thorium-232, 62
Timestamp-based assertion checking (TAC), 201
Time to failure (TTF), 9, 115
Timing vulnerability factor (TVF), 50–52
Transient faults, 2, 6, 154, 156, 182
Transistors per chip, 1
Transitive dynamically dead (TDD) instructions, 95, 107, 196
Translation lookaside buffer (TLB).
Transmission lines, 178–179, 204
Triple-bit faults, 177
Triple-modular redundancy (TMR) system, 4–5, 208, 260–262
Triple-well technology, 67
Triply redundant system, 255, 257


UltraSPARC-II-based servers, 3
un-ACE bits, 90–92
Uncached load data, 231, 233
Uranium, 2, 62
User-visible errors, 6, 80, 123, 148, 152. See also Soft errors


Verilog, 102, 152
Virtualization layer, 298
Virtual machine monitor (VMM), 313
VMM-level recovery, 322
Voluntary rendezvous opportunity (VRO), 227


Watch-dog processor, 300
Weapons Neutron Research (WNR), 47, 64–65
White neutron beam, 63–64
Windows Hardware Quality Labs (WHQL) tests, 218
Windows NT® reboots, 218
Wirebond-type packages, 63
Write-back cache, 127, 131–132
CAM array, 144
RAM array, 141
Write-through cache, 127, 131
CAM array, 144
RAM array, 141


z6 architecture, 220
z990 architecture, 220
