Numerics
24-hour rule 195-196
A
action items 205-207
follow up on 207
ownership of 205-206
actionable, alerts as 98
after-action reports 192
Agile method 241
alert criteria 97-103
noisy alerts 101-103
thresholds 99-100
alert fatigue. See on-call rotation
@all alias 236
ALTER TABLE statement 175, 210
approvals
automating 22-25
capturing purpose of 18-19
process for 20-22
acceptable risk of change 21-22
necessary people are informed 21
no conflicting actions 21
work is in appropriate state 20-21
areas graph 56
artifact deployment 157
artifacts, defined 82
attributes 173
auth_users table 209-210
automation 6, 11, 17-153
approach for 141-152
automating complex tasks 152
automating complicated tasks 150-152
automating simple tasks 148-150
complexity in tasks 145-147
designing for safety 143-145
ranking tasks 147
safety in tasks 142-143
approval process 20-25
acceptable risk of change 21-22
capturing purpose of approvals 18-19
necessary people are informed 21
no conflicting actions 21
work is in appropriate state 20-21
business impact to 119-120
defining goals 132-137
automation as requirement in all tools 132-133
building automation into time estimates 136
prioritizing automation 133-135
reflecting automation as priority with staff 135
scheduled time for automation 137
time for training and learning 135-136
deployment pipeline 188-190
error handling 27-28
fixing cultural problems 126-131
cost of manual work 128-131
stop accepting manual solutions 126
supporting 126-128
improvements made by 117-119
frequency of performance 118
queue time 117-118
time to perform 118
variance in performance 118-119
logging process 25-26
notification process 26-27
prioritizing 131
setting as cultural priority 121-123
priority 122-123
time 122
urgency 123
skill-set gap 137-141
building new skill set 140-141
reducing friction around support 139-140
staffing for 123-125
teams with monolithic skill sets 124
victims of environment 124-125
B
backlog, defined 282
bar graphs 56-57
base class 23
BetaAlgorithm 169
blogging 233-234
blue/green deployments 171-174
broadcastmessage.sh 150
build artifact 82, 86
business stakeholders, inviting to postmortems 198-199
buy-in from everyone 139
C
CAMS (culture, automation, metrics, and sharing) 5-6
change management policy 12
@channel alias 236
chaotic tasks 146
chat 234-237
chatbots 236-237
benefits of 236-237
shared responsibility with 237
company etiquette 234-236
limiting use of all channel notifications 236
short, live-topic focused channels 235
threading functionality 235-236
updating status regularly 236
checkout processing time 100
CLASSPATH variable 179
color points 222
commitments 274
complexity in tasks 145-147
automating
complex tasks 152
complicated tasks 150-152
simple tasks 148-150
complex tasks 146-147
complicated tasks 146
ranking tasks 147
simple tasks 146
composite alerting 100
concurrency 163
confidence interval 128
configuration files 184-187
configuration management modification 186
dynamic configuration through key/value stores 185-186
linking 186-187
configuration management 173
consumer_daemon 197-198, 202-204, 206
context 46
continuous integration (CI) 87, 181
continuous integration/continuous deployment (CI/CD) pipeline 135
core group of points 222
counters 34
crushing change control 156
cultural norms 240
culture 268
changing 244-255
creating rituals 251-253
culture chiefs 247-249
examining company values 249-251
sharing culture 244-247
using rituals and language to change cultural norms 253-255
defining 240-243
cultural rituals 241-242
cultural values 240
underlying assumptions 242-243
fixing cultural problems 126-131
cost of manual work 128-131
stop accepting manual solutions 126
supporting 126-128
influence on behavior 243-244
setting automation as cultural priority 121-123
priority 122-123
time 122
urgency 123
talent recruitment and retention 255-267
evaluating candidates 265-266
interviewing candidates 260-264
mindset 255-257
number of candidates to interview 266-267
obsession with senior engineers 257-260
culture chiefs 247, 274
culture pillar 5
Cynefin framework 145
D
dashboards 54, 64
naming 63-64
organizing 61-63
leading the reader 62-63
rows 61-62
starting with user 53-54
widgets 54-58
bar graphs 56-57
gauges 58
giving context to 58-61
line graphs 56
database deployment 157
database-level rollbacks 175-178
rules for database changes 175-178
versioning databases 178
DEBUG level 46
departmental goals 272
-depends flag 183
deployment artifacts 179
configuration files in packages 187
configuration file linking 186-187
configuration management modification 186
dynamic configuration through key/value stores 185-186
deployment artifact rollbacks 174-175
package management 179-184
deployment pipelines 82
detection score 40, 42
DevOps 7
history of 2-4
motivation for book 7
pillars of 5-6
what DevOps isn’t 4
DevOps Days conference 3
DevSecOps 3, 89-90
direct_delivery_items query 210
documenting postmortems 207-210
action items 210
cognitive and process issues 210
incident details 208
incident summary 208
incident walk-through 208-210
Donut plugin 257
DROP COLUMN statement 177
E
Eisenhower decision matrix 276
end-to-end tests
limiting number of 79-80
overview 73-76
error counts 34
error handling 19, 27-28
ERROR level 46
error rates 34
F
facts 173
false value 83-84
FATAL level 46
feature deployment 157
feature flagging 83-84
rollbacks 168-169
when to toggle off 169-170
feedback loop 165
fleet deployment 157
fleet rollbacks 171-174
FMEA (failure mode and effects analysis) 40-43
example process 41-43
metrics from incidents and failures 43
scope 40-41
team 40-41
FPM (Effing Package Management) 180
frequency of execution 120
frequency of performance 118
G
gatekeeping and gatekeepers
evaluating gatekeeper behavior 220
example of 13
problems created by 13-16
gauges 34, 58
goals
overview 270
tiers of 270-274
departmental goals 272
getting goals 273-274
organizational goals 271-272
team goals 272-273
grains 173
H
hash functions 143
hashed database 143
@here alias 236
human resources, inviting to postmortems 199
I
important box 276
incident reports 192
INFO level 46
information hoarding
chat 234-237
chatbots 236-237
company etiquette 234-236
how it happens 213-214
making knowledge discoverable 223-234
learning rituals 229-234
structuring knowledge stores 223-228
structuring communication effectively 221-222
defining audience 222
defining topic 221
outlining key points 222
presenting calls to action 222
unintentional hoarders 214-220
abstraction vs. obfuscation 217-219
access restrictions 219-220
documentation not valued by 215-217
evaluating gatekeeper behavior 220
install.sh script 179
integration tests 67, 72-73
intended audience section 63
intentional hoarding 214
is_approved method 23
iterations, defined 280
J
JSON (JavaScript Object Notation) 44
K
key performance indicators (KPIs) 113
key/value stores 185-186
knowledge discovery 223
knowledge retrieval 223
knowledge stores 223-234
learning rituals 229-234
blogging 233-234
hosting external events 232-233
lightning talks 231-232
lunch-and-learns 229-231
structuring 223-228
common lexicon 223-224
document hierarchy 224-228
structuring around topics 228
knowledge transfer 229
KPIs (key performance indicators) 113
L
language, sharing culture through 244
latency 34
leading the reader 62
learning rituals 229-234
blogging 233-234
hosting external events 232-233
lightning talks 231-232
lunch-and-learns 229-231
lightning talks 231-232
line graphs 54-56
linting 253
log aggregation 44-51
arguments for spending money 48-49
building vs. buying 49-51
hurdles of 48-51
identifying what to log 46-48
log message context 46-48
overview 44-46
logging process 19, 25-26
logrotate command 102
long-term objectives 205
lunch-and-learns 229-231
M
MappableEntityUpdateConsumer 202
mental models 194
messages.processed.count 38
messages.published.count 36
messages.queue.new_orders.size 37
messages.queue.size 37
metrics 6, 11, 33, 40
custom metrics 34-35
defining healthy metrics 39-40
failure mode and effects analysis 40-43
example process 41-43
metrics from incidents and failures 43
scope 40-41
team 40-41
vanity metrics 80-81
mortgage_calc function 71
N
noisy alerts 101-103
not important box 276
not urgent box 276
note widgets 62
notification process 19, 26-27
NULL column 176
O
occurrence factor 40, 42
off-hour deployments 190
automating deployment pipeline 188-190
deployment artifacts 179-187
configuration files in packages 184-187
package management 179-184
example of 154-156
layers of deployment 156-158
making deployments routine 159-164
accurate preproduction environments 159-160
staging environment 162-164
reducing fear by reducing risk 167
reducing fear through frequency 164-166
rollbacks 168-178
database-level rollbacks 175-178
deployment artifact rollbacks 174-175
feature flagging 168-169
fleet rollbacks 171-174
on-call compensation talks 108
on-call PTO 108
on-call rotation
alert criteria 97-103
noisy alerts 101-103
thresholds 99-100
compensating for 109
increased work-from-home flexibility 109
monetary compensation 107
time off 108
defining 95-97
time to acknowledge 96
time to begin 97
time to resolve 97
on-call support projects 112-113
performance reporting 113-114
purpose of 94-95
staffing 104-106
tracking on-call happiness 109-112
delivery method 111
individual being alerted 110
level of urgency 110-111
timing 111-112
operational automation 117
operational blindness 51
changing scope of development and operations 31-32
example of 30-31
log aggregation 44-51
arguments for spending money 48-49
building vs. buying 49-51
hurdles of 48-51
identifying what to log 46-48
log message context 46-48
overview 44-46
operational visibility 33-43
custom metrics 34-35
deciding what to measure 35-39
defining healthy metrics 39-40
failure mode and effects analysis 40-43
understanding product 32-33
opportunity cost 155, 216
optimal maximum size 106
organizational goals 271-272
P
package management systems 174
packages 187
configuration files in 184-187
configuration file linking 186-187
configuration management modification 186
dynamic configuration through key/value stores 185-186
package management 179-184
pair programming 247
Pareto principle 228
paternalist syndrome 28
automation 17-28
approval process 20-22
automating approvals 22-25
capturing purpose of approvals 18-19
error handling 27-28
logging process 25-26
notification process 26-27
creating barriers instead of safeguards 9-12
ensuring continuous improvement 28
gatekeeping and gatekeepers 12
example of 13
problems created by 13-16
PeerApproval class 23
perceived usefulness 77
performance reporting 113-114
pipeline executors 85
postmortems 43-211
action items 205, 207
follow up on 207
ownership of 206
choosing whom to invite to 198-199
business stakeholders 198-199
human resources 199
project managers 198
24-hour rule 195-196
mental models 194
defined 192
documenting 207-210
action items 210
cognitive and process issues 210
incident details 208
incident summary 208
incident walk-through 208-210
example incident 197-198
sharing 210-211
detailing each event in 199-200
primary on-call person 96
prioritization
consciousness 274-279
Eisenhower decision matrix 276
priority vs. urgency and importance 274-276
saying no to commitment 277-279
structuring team’s work 280-283
populating iteration 281-283
time-slicing 280-281
unplanned work 283-289
controlling 283-286
dealing with 286-289
process rituals 251
project managers, inviting to postmortems 198
properly prioritized alerts 98
publishers of systems 36
pulling data 57
pushing data 57
Q
QA (quality assurance) 65
quality 65-91
continuous deployment vs. continuous delivery 81-83
DevSecOps 89-90
feature flagging 83-84
pipeline execution 84-87
test suite 76-81
avoiding vanity metrics 80-81
failing immediately after encountering failure 77
isolating test suite 78-79
limiting number of end-to-end tests 79-80
not tolerating flaky tests 78
testing infrastructure management 88-89
testing pyramid 69-76
end-to-end tests 73-76
integration tests 72-73
overview 66-69
unit tests 69-71
quality as a condiment antipattern 66, 89
queue latency 37
queue time 117-118, 133
queueing systems 35
R
READ LOCK 209-210
read replica 53
release cramming 156
releases 82
retrospectives 192
risk priority number (RPN) 40, 42
rituals
cultural rituals
creating 251-253
defining 241-242
embracing failure with 252-253
sharing culture through 246-247
using rituals and language to change cultural norms 253-255
learning rituals 229-234
blogging 233-234
hosting external events 232-233
lightning talks 231-232
lunch-and-learns 229-231
rm -rf * command 142
rollbacks 168-178
database-level rollbacks 175-178
rules for database changes 175-178
versioning databases 178
deployment artifact rollbacks 174-175
feature flagging 168-170
fleet rollbacks 171-174
runbooks 98
rushed features 156
S
S3 (Simple Storage Service) 102
SaaS (software-as-a-service) 44
safety in tasks
designing for 143-145
acquiring operator’s perspective 144
avoiding unexpected side effects 145
confirming risky actions 145
never assuming user’s knowledge 144
overview 142-143
scorecards 127
SELECT query 209
SELECT/INSERT statement 177
senior engineers, obsession with 257-260
hiring junior engineers 259-260
removing years of experience 258-259
service-level objectives (SLOs) 96
severity 40, 42
sharing 6
culture 244-247
through language 244-245
through ritual 246-247
through story 245-246
postmortems 210-211
problems through conversation 256-257
short-term objectives 205
Sidekiq node 209
simple tasks
automating 148-150
overview 146
skill-set gap 137-141
building new skill set 140-141
reducing friction around support 139-140
skin-in-the-game concept 139
social rituals 251
sprints 280
stacked line graph 56
stand-up meetings 241
standard change 21
stated values 249
structured logs 44
subject-matter experts (SMEs) 41, 229
synthetic transactions 162
T
talent recruitment and retention 255-267
evaluating candidates 265-266
interviewing candidates 260-264
identifying passion 263
interview panel 261
structuring interview questions 261-262
technical interview questions 263-264
mindset 255-257
number of candidates to interview 266-267
obsession with senior engineers 257-260
hiring junior engineers 259-260
removing years of experience 258-259
team goals 272-273
test coverage 80
testing
test suite 76-81
avoiding vanity metrics 80-81
failing immediately after encountering failure 77
isolating test suite 78-79
limiting number of end-to-end tests 79-80
not tolerating flaky tests 78
testing infrastructure management 88-89
testing pyramid 69-76
end-to-end tests 73-76
integration tests 72-73
overview 66-69
unit tests 69-71
TEXT field 175
thresholds
alert criteria 99-100
giving context to widgets through threshold lines 59-60
throughput 33
time to acknowledge 96
time to begin 97
time to perform 118
time to resolve 97
time-slicing 280-281
timely alerts 98
TLS (Transport Layer Security) 159
tool chain 132
true value 83-84
U
UI tests 73
unintentional hoarding 214
unit tests 69-71
determining what to test 71
determining what to unit test 71
structure of 70-71
unplanned work 283-289
controlling 283-286
coworker unplanned work 284-285
doing it vs. deferring it 286
system unplanned work 285-286
dealing with 286-289
update-orders.sh 148-149
urgent box 276
V
values
defining 240
examining company values 249-251
vanity metrics, avoiding 80-81
VARCHAR(10) 175
variability 164
variance in performance 118-120
W
WARN level 46
waterfall model 281
widgets 54-58
bar graphs 56-57
gauges 58
giving context to 58-61
through color 58-59
through threshold lines 59-60
through time comparisons 60-61
line graphs 54-56
work queue systems 280
Y
yum downgrade command 189
yum install httpd command 142
yum update funco-webserver 188