7 The empty toolbox

This chapter covers

  • Compounding problems of a lack of automation
  • Leveraging other parts of the organization to build on your automation
  • Prioritizing automation
  • Evaluating the tasks to automate

The focus on products and features that the organization can sell often dominates our minds as technologists. But what many people fail to realize is that the tools you use to build these products and features are just as important. A carpenter without the right tools will always have a piece of wood too long, a nail protruding out, and a corner that just doesn’t look right.

In the world of technology, these imperfections caused by a lack of good tooling are often buried in a series of small tasks that take too long, are difficult to reproduce, and are error-prone. But no one sees how the sauce is made. No one realizes that the image on the front page of the website has to be resized manually and is prone to being a few pixels off. No one knows that the configuration file reload is done by an engineer connecting to each server and running a command to reload it. And no one knows that out of 50 servers, sometimes a few get missed.

I call this lack of investment the empty toolbox. It creates additional drag on teams and their ability to address repeat problems quickly. Instead of reaching for the hammer when a nail is loose, everyone wastes time looking for a suitable hard, portable surface they can use to pound down the nail. It’s slower, not as effective, and doesn’t produce a result nearly as consistent.

You must think beyond just the product or feature being built when it comes to tooling and automation. You must think of the entire system around the product. People, support systems, database servers, and networks are all part of the solution. How you manage and deal with those systems is equally important.

When entire systems aren’t designed with automation in mind, a cascading series of decisions can lead to reinforcing that lack of automation. Imagine that one of the support systems for the platform requires a graphical user interface (GUI) to install software. This leads to a manual installation task. But if the software installation is manual, that becomes an assumption during deployments of the application. If the deployments are manual and the base installation is manual, that supports the idea of the actual server creation being manual. Before long, you have a long chain of manual events. That initial choice of a GUI-based installation leads to a continuing disinvestment in your automation strategy.

One of Gloria’s tasks in the morning is to reset the status of orders that failed because of a credit card transaction dispute. One day while performing this typically mundane task, Gloria makes a mistake. While cycling through her command history for her normal query, she inadvertently modifies a variation of the query that she had made while troubleshooting another issue. Instead of updating every order in a list of orders, her query updates every order that is not in the list of orders: a catastrophic error that leads to a sizable recovery effort to get the data back into a correct state.
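The danger in a mistake like Gloria’s is how small the textual difference is. A quick sketch in plain Python (with made-up order IDs standing in for the SQL) shows how a single negation flips a three-row update into a 47-row one:

```python
# Illustrating the inverted-query mistake with Python sets rather than SQL.
# The order IDs here are invented for the example.
disputed = {101, 102, 103}           # orders that failed on a card dispute
all_orders = set(range(100, 150))    # every order in the system

routine = {o for o in all_orders if o in disputed}       # the 3 intended rows
inverted = {o for o in all_orders if o not in disputed}  # the other 47 rows
```

One keyword of difference, and the blast radius goes from the handful of rows you meant to touch to everything else in the table.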

A manual process, no matter how well-documented, always offers the opportunity for failure. Humans are just not built for these repetitive tasks that computers excel at. This makes it a bit surprising how often organizations fall back onto these solutions, especially in the face of temporary problems. But if you’ve been in the industry for any length of time, you know that the word “temporary” has a huge range, from hours to years.

Sometimes the decision to rely on a manual process is a conscious decision after weighing the pros and cons of automation. But often the decision is not a decision at all. It’s a fallback into old, familiar patterns of solving problems. The implications of choosing a manual process over automation are not weighed or considered. Instead of critically evaluating the options and making a decision with eyes wide open, the inertia of the status quo prevails, and you fall back into the ease of a manual step.

Throughout this chapter, I’m going to discuss automation. Your mind may go specifically to IT operations, but I want to broaden the definition a bit. Operations is all the tasks and activities required to build and maintain your product. That might be monitoring servers, maintaining the testing pipeline, or setting up a local development environment. These are all tasks that must be performed to add value to the business. When these tasks are repeated often, standardizing the way they’re performed is key.

The focus of this chapter is to identify the problems of not investing in automation of the tools that support your ability to deliver your product. I’ll go through various strategies for approaching automation, as well as ways to engage and work with other subject-matter experts within your organization to help jumpstart a culture of automation. But before I do all of that, I’d like to make the case for why everyone should be concerned about operational automation.

DEFINITION Operational automation is the development of tools to support and automate the activities necessary to build, maintain, and support the system or product. Operational automation must be designed as a feature of the product.

7.1 Why internal tools and automation matter

Imagine you work at a construction company. The company is working on assembling a new building downtown. You’ve been assigned to the team, and the first thing they give you is a document with step-by-step instructions for building your hammer, your socket wrench, and your safety harness belt. Chances are that every person on the construction team is working with a slightly different set of tools in terms of quality and accuracy. It also means that a new construction worker is essentially useless for the first week or so after being hired. And as if that wasn’t bad enough, you probably don’t feel super comfortable about the rest of the process for constructing the building if that’s the process for getting the tools that are being used to build it.

Internal tools and automation are the foundation for keeping the day-to-day activities involved with maintaining and building software easy and less time-consuming. The more time you spend on the basic needs of the craft, the less time you spend adding value to the product. Skipping tools and automation is often rationalized as a time-saver, but it actually costs your teams more time, not less. It’s optimizing for the now at the cost of every moment in the future. Manual work balloons into small, discrete activities that individually may seem like minor time sinks but collectively account for an enormous amount of wasted time.

And the ways this time gets wasted can be insidious. People tend to limit the idea of wasted time to “It took me 15 minutes to do this thing that should have taken only 5.” That’s one form of waste, but the biggest form of waste doesn’t involve working. It involves waiting.

7.1.1 Improvements made by automation

I don’t want to assume that automation and tooling are an obvious goal, especially since so many organizations don’t have a clear strategy for approaching them. There are four key areas where tooling and automation become important, the first being queue time.

Queue time

If you measure a process that is manual and requires more than one person to perform it, the common enemy that needs to be battled against is queue time. Queue time is the amount of time a task or activity spends waiting for someone to perform an action on it. You can almost be certain that anytime a task must pass from one person to another, queue time will be involved.

Think of all the times you had to submit a service desk ticket for something that should take only five minutes, but you spent two or three days waiting for it to be looked at and addressed. Queue time is the silent killer in many manual processes because it isn’t wasted effort, just wasted time, the most precious of resources. Tooling and automation can help you eliminate or reduce queue time by reducing the number of handoffs in a process.

Time to perform

Computers are just demonstrably better at performing repetitive tasks than any human could ever hope to be. If you must type three commands in succession, the computer will probably complete the entire series before you finish typing the first command. A simple task that takes two minutes begins to add up quickly over the course of a year. A two-minute task you perform five times a week is a little over 8.5 hours a year. An entire working day a year is spent performing a boring two-minute task.
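The arithmetic behind that claim is worth making concrete. A quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope math for the two-minute task.
minutes_per_task = 2
times_per_week = 5
weeks_per_year = 52

minutes_per_year = minutes_per_task * times_per_week * weeks_per_year  # 520
hours_per_year = minutes_per_year / 60                                 # ~8.7
```

520 minutes is roughly 8.7 hours, which is where the “entire working day a year” figure comes from.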

And not only is that two-minute task faster for the computer to perform, but the problem of automating the task is usually way more interesting and satisfying than performing the task itself. In addition to being more interesting, the work of automating the task demonstrates a capacity for both improvement and use of your automation skill set. Consider which of these sounds better in an interview: “I performed a SQL query successfully 520 times in the past year” or “I automated a task that was being performed 520 times so that no human had to be involved. We used the extra time freed up to automate other tasks so the team could do more valuable work.”

Frequency of performance

Like the preceding example, repeated tasks can be disruptive to an engineer’s ability to focus. When a task needs to be performed frequently and with urgency, it can be difficult for a person to stop what they’re doing, move to the new task, perform it, and then switch back to what they were doing previously. This act of switching between tasks, referred to as a context switch, can be incredibly burdensome if you have to do it a lot: each switch costs valuable time while the engineer adjusts to or from the task they were working on.

Another issue with frequency of execution is the pure limitation it puts on how often a task can be run if it depends on humans. No matter how important the task, there is just no way you’ll have a human run the task every hour, unless it’s a critical part of their job. Automation allows you to break free and perform tasks at a much greater frequency if necessary and doesn’t create an outsized burden on anyone on the team.

Variance in performance

If I asked you to draw a square four times, there’s a high probability that none of the squares will be identical. As a human, a bit of variance always exists in the things that you do, whether it be drawing a square or following a set of well-documented instructions. It doesn’t matter how small the task may be; slight variations will almost always occur in the way things are done. There are all types of small variances that might make a difference one day in the future.

Imagine that you execute a process manually every day at 11 a.m. Or did you start at 11:02 a.m.? Maybe it’s 11:10 a.m. today because you had some traffic coming into the office and you’re a bit behind schedule. The people who are waiting for the results of the process don’t know whether the process is running slower than usual or you forgot to run the script. Is it a problem, and should it be escalated, or should the team just patiently wait? When looking at the logs of the script execution, they notice some days the task takes an extra 15 minutes. Is something wrong with the script on Tuesdays and Thursdays? Or is the human executing it called away to participate in the team morning meeting for a few minutes before returning and finishing things up?

7.1.2 Business impact of automation

These little examples highlight the type of variance that can happen in even the simplest of processes. Figure 7.1 highlights the extra time that is consumed and wasted in these areas by manual tasks.

Figure 7.1 A process flow that’s dominated by queue time and repeat work

This doesn’t even account for things like steps being followed accurately or data being entered correctly. But these four areas describe some of the key benefits that automation gives you. If you understand what these areas mean, you can turn them into key business outputs.

Reduced queue time means tasks and activities move through the pipeline faster, resulting in quicker business outcomes. The ability to perform a task quickly, without waiting on handoffs, means the pipeline of tasks moves more fluidly. Faster time to perform means engineers can take on work that was previously tied up in manual effort; that means higher productivity. Greater frequency of execution means some things can be done more often, so you get the value of those extra executions. Building on chapter 5, automating the test pipeline means you can execute test cases many times per day. Reduced variance in performance means you can be certain that the same task is executed the exact same way every time. This is language that auditors love: your automation and tooling can ensure that it’s taking audit-safe inputs and producing audit-ready reporting.

Working on internal tools and automation will drive business efficiencies across these four areas, not to mention empower more staff to perform tasks that previously were saddled with requirements around access and knowledge transfer. I may not know your data model well enough to roll back payment information. But I can be trained to type rollback_payments.sh -transaction-id 115. This type of codified expertise allows you as an organization to push down the types of tasks that were previously done by extremely expensive staff members in the organization. If only a developer has the knowledge to perform a task, that’s something that is taking away from the developer’s time writing code. But if the task becomes automated, it can get pushed further down the line.
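To show what codified expertise can look like, here’s a hypothetical Python equivalent of a tool like rollback_payments.sh. The table and column names are invented for the example; the point is that the expert’s SQL is captured once, parameterized, and guarded, so a trained operator can run it without knowing the data model:

```python
import argparse

# Hypothetical sketch: the developer's hard-won SQL captured in a tool.
# Table and column names are illustrative, not from any real system.
ROLLBACK_SQL = (
    "UPDATE payments "
    "SET status = 'ROLLED_BACK' "
    "WHERE transaction_id = %s AND status = 'FAILED'"  # guard: failed rows only
)

def build_rollback(transaction_id: int):
    """Return the statement and parameters for a DB driver to execute.

    Parameterizing the id (rather than pasting it into the SQL) is part
    of what makes the tool safe to hand to tier 1 support.
    """
    return ROLLBACK_SQL, (transaction_id,)

parser = argparse.ArgumentParser(description="Roll back a failed payment")
parser.add_argument("--transaction-id", type=int, required=True)
args = parser.parse_args(["--transaction-id", "115"])  # simulating the operator
sql, params = build_rollback(args.transaction_id)
```

A real tool would hand `sql` and `params` to a database driver; the sketch stops short of that because the connection details depend on your environment.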

Imagine a typical workflow. Someone has an issue with the application and creates a support ticket. In many organizations, that ticket gets routed to IT operations. The operations department looks at it but knows the problem is with the data, so it gets routed to a developer. The developer figures out the SQL that will put the data into a state usable by all applications. They pass this knowledge down to operations, who then pushes it down to tier 1 customer support. This flow is necessary because of where the knowledge to solve the problem lives.

But once that knowledge has been identified, converting it into a tool or automation allows the developer to push the fix down to operations. Once it’s in the hands of operations, working out the hurdles of access and security for performing the task might allow them to pass it further down to tier 1 support. Figure 7.2 highlights what the process could look like in an automated world.

Figure 7.2 Automating the proposed process

Now the customer is being serviced by tier 1 support immediately, instead of waiting for a series of handoffs, queue time delays, and context switches. Automation and tooling empower this process at every phase and highlight why they’re such a crucial part of the DevOps journey. Automation encompasses empowerment, speed, and repeatability.

7.2 Why organizations don’t automate more

Considering that automation has such obvious benefits, it’s amazing that teams don’t automate more. The reason is never as simple as a senior leader declaring that the organization isn’t going to focus on automation right now. Nobody ever says that. The driving force behind automation is more about having a clear understanding of what your values are and expressing those values through action.

If you asked most people if they think working out and eating healthy is important, they’d say yes. But if you ask them why they don’t do it, it’s because a bunch of other things have been prioritized ahead of it, which results in poor eating habits and a lack of exercise. I would love to wake up every morning and exercise. But I stay up late helping the kids with homework, then I want to prioritize time with my wife, and then I want to prioritize time for myself. These choices result in me going to bed at 11:45 p.m. and hitting the snooze button repeatedly until 6:30 a.m. Now my time to work out has been missed. Never did I consciously say, “I’m not going to work out this morning,” but through my actions, I created an environment where working out just wasn’t feasible. I deprioritized working out by prioritizing other things.

Organizations do this all the time with automation and tooling. This failure in prioritization happens in two key areas: organizational culture and hiring.

7.2.1 Setting automation as a cultural priority

When you discuss an organization’s values, you’re describing what the organization has decided is a priority. If automation and tooling aren’t in the conversation, they simply will lack the traction necessary to take hold in the organization. The organization builds a culture around what it will and won’t tolerate.

In some organizations, drinking a beer at your desk at 3 p.m. on a Friday is a perfectly normal sight. At other organizations, having alcohol in the building at all is grounds for firing. These are two extremes, but they exist, and the organizational culture enforces those norms, not just through rules and policies, but by the behavior of other people in the organization. It’s not going to be HR that yells at you for cracking open an ice-cold lager at your desk; it’s going to be the people around you. They’ve been conditioned by what’s acceptable in the organization and what isn’t. (I’ll talk a lot more about this in subsequent chapters.)

Imagine if your organization had the same behavior around automation. Imagine if implementing a manual process was tantamount to smoking a cigar in the change approval board meeting. Building that sort of culture is key to making sure tooling and automation is treated as a priority. Without this mindset, you’ll continue to encounter excuses as to why tooling and automation can wait until later.

Not enough time

There’s never enough time to automate something. In many cases, this is because the lack of automation hurts at the team level but is solved at the individual level. For example, imagine that a task gets performed once a week, every week, like clockwork. The task takes about an hour to complete, and the team rotates who performs it. Individually, it’s an hour a month; but as a team, it’s four hours a month of wasted energy. Automating the task and ensuring that it’s working might take a solid eight hours. In the heat of the moment, it’s always more advantageous to pay the hour cost, do the task manually, and move on. Investing your personal time and energy is a tough proposition when so many things are competing for your attention.

The lie you often tell yourself is that “next time” you’re going to automate it. Or maybe when things calm down, as if there’s ever a time when everything isn’t on fire. The job of automating is always on the back burner, waiting for the perfect moment when work is quiet and you have eight contiguous hours to work on a task uninterrupted. This magic time never comes, and even though the manual route seems like the smarter choice for an individual, as a team you could have those eight hours of work pay off in two months’ time. This is where the organization can make a huge difference in setting the tone of expectations.
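The team-level payback in that example is simple arithmetic:

```python
# Payback math for the weekly rotated task from the example above.
manual_hours_per_month = 4    # one-hour task, weekly, rotated across the team
automation_cost_hours = 8     # a solid eight hours to automate and verify

payback_months = automation_cost_hours / manual_hours_per_month      # 2.0
first_year_savings = manual_hours_per_month * 12 - automation_cost_hours  # 40
```

Two months to break even, and roughly 40 hours of team time recovered in the first year alone. Framed as an individual’s hour here and there, the investment never looks worthwhile; framed as a team cost, it almost always does.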

Never prioritized

A sibling of the not-enough-time excuse is the “never prioritized” situation. Automation and tooling never become important enough to be put down on the road map as an organizational priority. Instead, they’re always relegated to free time or skunk-works projects. In fact, many times the teams that develop automation and tooling aren’t even an official team, but just a small band of engineers who have tired of the current way of doing things.

For automation to become a reality, you need to get it on the list of priorities for the organization or the department. With this recognition come resources, budget, and time, the most critical resource there is. But the prioritization doesn’t just occur for fixing old processes. It also has to become a priority in new work as well. Projects should have automation and support tools for the product defined as requirements and deliverables, as part of the project timeline. This sort of prioritization pairs the cultural value of automation with demonstrated action.

Urgency trumps correctness

Stop me if you’ve heard this one: “Well, what if we do it like this real quick for now, and then in V2 we’ll fix it all up?” In so many scenarios, our urgency to get a thing done prevents us from doing things the way they should be done, with proper automation and tooling. The hardest bad practice to stop is one with inertia. Once a bad practice is out in the wild, it becomes harder and harder to stop unless it results in a complete disaster.

Inefficiencies in organizations don’t get the share of blame that they probably should. If a task is cumbersome, takes a long time, and wastes human effort, but works, it’s often difficult to convince people to pull resources away from fixing something that’s broken in order to improve something that technically functions. The cost of doing it wrong now is often seen as a trade-off that works in favor of the business, but it almost never does.

It’s one thing if you’re doing it to seize a magical opportunity that has a short time window and a huge payoff. But typically, it’s so that someone can meet a date that was randomly picked and then placed in a spreadsheet. The cost of doing it wrong ripples forward through time, saddling future actors with the burden of a bad choice made in a vacuum.

Each of these areas describes how automation is overruled by other priorities. The pressure from these forces can prevent you from doing what you know is the right thing with regard to automation. But sometimes the lack of automation is about more than just prioritization.

7.2.2 Staffing for automation and tooling

When a company is initially formed, it usually hires based on its priorities. No company starts with the founder, a manager of HR, and a VP of Legal Affairs. These are all important roles in an organization, but companies grow into those roles as they align with their priorities. If a company is involved with a lot of litigation as part of its business model, a VP of Legal Affairs could make for a smart early hire. But if you’re building a new location services platform, a VP of Legal Affairs is probably further down your list of hiring.

The same thing occurs on a smaller scale. Most companies don’t start with a dedicated operations team until their support needs grow beyond what the development staff can handle. Dedicated security teams are often a later hire as well; an incredibly secure product that nobody uses won’t be on the market for long. You hire based on your needs and on your priorities. The same is true for automation. You may not necessarily need a dedicated automation expert, but if you’re going to make automation and tooling a priority, you need to make sure that your hiring practices reflect this.

Teams with monolithic skill sets

When you’re hiring for a team, you tend to focus on getting more of what you already have. After all, existing employees are the blueprint for what it takes to succeed in the job. But this thinking leads to a monolith of skills among team members. People tend to hire people who are like them, especially in interviews: if I’m hiring for the skills that I already have, I feel I can evaluate candidates more effectively.

When an operations team hires a new systems engineer, the team thinks of all the things they do in their current day-to-day work and evaluates engineers on those criteria. In addition, they have a bias toward the skills that they themselves use, and evaluate candidates through that lens. The result is a team with the same set of skills and the same set of gaps. This happens on almost all teams.

If automation and tooling are your goal, you’ll have to hire people with that skill set, especially if you don’t have the skill in-house. A focus on this particular skill set may mean candidates are weaker in other areas you might usually hire for. If you have a bunch of frontend engineers, your candidate for internal tooling and automation may not have as much frontend experience but will probably have a wealth of backend experience. This needs to be weighed when you evaluate candidates: think about the skills you need now versus the skills you’ve already built a ton of expertise around in recent years.

This will probably feel a tad unnatural at first because of the industry’s unrealistic expectations around hiring. You’re probably not going to find that unicorn developer who has 30 years of iOS experience developing frontend, backend, virtual reality, Docker, Kubernetes, and Windows. You need to diversify the team’s skill set so that, as a unit, you can deliver what needs to be delivered.

Victims of the environment

The design of software impacts the ability to support it and automate it. Let’s use a simple example. Imagine an application that requires the user to upload a bunch of files to the system via their GUI-based application. An enterprising user might decide that it’s more valuable to script this upload and run it from the command line for every file in a given folder. But because the application was designed with only a GUI, the system doesn’t expose any other way to upload files.

Now instead of having a simple series of commands to upload files, the user is forced to create an application that simulates the mouse clicks and drags of a regular user. This process is far more brittle than the command-line option, and prone to errors, time-outs, modal windows stealing the focus, and a host of other things that can occur in a GUI-based session.
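For contrast, here’s roughly the script that enterprising user wanted to write, assuming the application had exposed any programmatic entry point at all. The `upload_file` callable is a stand-in for that hypothetical entry point; with a GUI-only application, nothing like it exists, and this whole sketch is impossible:

```python
from pathlib import Path

# The upload loop the user wanted: walk a folder, upload each file.
# `upload_file` is a placeholder for a hypothetical API call or CLI
# the application could have provided.
def upload_folder(folder: str, upload_file) -> int:
    """Upload every regular file in `folder`; return the count uploaded."""
    count = 0
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            upload_file(path)    # one call per file; trivially scriptable
            count += 1
    return count
```

A dozen lines, deterministic, and easy to run on a schedule. That’s the gap between a system designed for automation and one that forces you to simulate mouse clicks.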

This mindset of supportability falls into the same category as stability, performance, and security. Like those items, supportability is a feature of a system or product and must be handled in the same manner. If it isn’t treated as another feature of the system, it will consistently be treated as a late add-on that must accommodate the already existing design constraints of the system.

Another by-product of this feeds into the skill-set gap. As the environment continues to grow with tools that don’t have the ability to be automated, the skills needed for performing automation begin to atrophy in the organization and the team. If there isn’t a lot of room for automation in the environment, the requirement for it begins to fade in the hiring process. People who have had the skill no longer exercise that muscle. Before long, the environment becomes something that enforces the lack of automation among the team.

The slower adoption of automation in Windows-based environments

I had lunch with a hiring manager for a consulting company. The company focused on bringing DevOps principles and practices to large-scale enterprises through consulting services. The problem this hiring manager had was finding Windows engineers with DevOps skill sets. He asked me to lunch to try to figure out why he was seeing this gap in skills between Windows engineers and Linux/UNIX system engineers.

A lot of factors go into this, but as someone who started my career supporting Windows systems, one of the major differences I noticed was the way the systems were designed to support automation. In Linux, everything from a configuration perspective was done either via a simple text file or via a combination of command-line utilities. This created an environment in which automation was not only possible, but straightforward to achieve. A wide variety of tools could help you manipulate text files, and anything that you could execute on the command line could be put into a text file and executed via a script.

In those days, Windows was a different story. With earlier versions of server-based Windows (NT, 2000, 2003), much of your configuration was wrapped in a pretty GUI interface. Users didn’t type commands; they navigated through menus and option trees and found the right panel for whatever they were trying to achieve. This made administration of Windows systems much easier: instead of memorizing hundreds of commands, even the most basic administrator, with just a vague idea of where a menu option might live, could find their way around the system.

The downside to this was that there was no need to build a skill set around automation, because there was no easy, reliable way to automate things. As a result, the automation skill set in the Windows community didn’t grow as fast as it did in the Linux/UNIX communities.

Microsoft has made a lot of changes to try to embrace the command line over the years. With the introduction of PowerShell in 2006, more and more management capabilities became accessible via the command line. And as a result, the industry is seeing an increase in the automation skills of Windows administrators. But this is a classic example of how your system design can influence not just the capabilities of supporting the platform, but also the skill set of the team responsible for supporting it.

7.3 Fixing your cultural automation problems

To have sustained success in creating tooling and automation in your environment, you must have a plan for changing your culture around automation. No matter how much energy you put into a big automation push, if you haven’t changed your culture, your existing tools will rot, and new processes will fall victim to the same issues and excuses that got you to this point in the first place. Like almost everything in DevOps, automation starts with culture.

7.3.1 No manual tasks allowed

This is going to sound incredibly obvious, but it’s one of the simplest things you can do to begin changing the culture: stop accepting manual solutions on your team. You’d be surprised at how much power the word “no” can have. When someone comes to you with a half-baked process that isn’t supportable, say no and then state what you’d like to see as part of the transition process.

You’re probably coming up with a thousand reasons why saying no is impractical. You might be thinking that some tasks are presented as top-priority requests, or that you won’t be able to get support or buy-in from your leaders by just saying no. If your leadership has expressed that automation and tooling are a priority, you can highlight that as a reason for refusing a manual task: “I don’t think this process is complete, and I’d prefer to refuse incomplete work.” Point out to the requesters the areas that you think need to be improved or scripted. You cannot have “no” be the final answer and leave it at that.

7.3.2 Supporting “no” as an answer

Be sure to have a ready list of things you’d like to see changed about the process. You can even model your complaints around the impact to the four areas discussed earlier in the chapter: queue time, time to perform, frequency of performance, and variation in performance.

Using these as a guideline, make the case for how your workload will be impacted. When looking at the process, pay special attention to parts of the process with a reasonable degree of uncertainty or chance for error. If you were to make a mistake, what is the impact on the recovery time if you had to run the process again? How many handoffs are required between teams or departments for the process? How frequently will the task need to be performed? What’s the driver for the frequency? This is a big one to think about because in some situations growth becomes the driver for frequency. If a task grows in correlation with another metric, like number of active users, you should keep that in mind because it gives you an idea of just how painful the process could end up being.

Also bear in mind the pattern of the task: don’t get caught up in the specificity of the task, but rather the generic problem it solves. Running an ad hoc SQL query might be a one-time requirement for that specific SQL query. But before long, running a one-time SQL query becomes the pattern for how certain types of problems are solved. “Oh, if you need to reset that user’s login count, just log in to the database and run this SQL query.” These patterns become a drain on the team, are error-prone, and usually are evidence of an unspoken nonfunctional requirement. Evaluate the pattern of the task, not just the specific task itself.

Once you understand the overall effort of performing the activity manually, use that to understand the overall effort of automating the same task. Keep in mind that the effort involved with a manual task is usually continual. Seldom do you have a single execution of a task, which means the cost of a manual task is continually paid over time. When you detail all these areas, you have a reason for the “no” response. Later in the chapter, I’ll discuss putting an actual dollar value to the manual process versus the effort of automation.

A compromise between manual and automation

Sometimes saying “no” just isn’t in the cards. A task may be too important to wait, or the automation too onerous to build right away. This is where I opt to keep a scorecard.

A scorecard lists all the manual tasks that you or your team have to perform, but you’d prefer were automated somehow. Each task has a score associated with it: 1, 3, or 9. The score reflects how burdensome the manual task is; the more burdensome the task, the higher the score.

Then the scorecard should be given a maximum value. This might be on an individual basis, a team basis, or an organizational basis. It ultimately depends on the level of the organization at which you’re tracking this.

Now that you have a scorecard, when new manual tasks are being requested of you or your team, score it based on the burden it will place on your team. Once it’s scored, see if the additional score will bring your team above the maximum value of the scorecard. If the maximum value on the scorecard has been reached, no new manual tasks can be given to the group without removing some of the tasks via automation.

This is not a per-item ratio but a total score. If your maximum value is 20, you’re already at 20, and you want to add a task valued at 3, you can’t simply automate a task valued at 1 to make room. You’d have to automate three tasks with a score of 1, one task with a score of 3, or one task with a score of 9. Table 7.1 shows what a sample scorecard could look like.

Table 7.1 An accounting of all the manual tasks the team must perform, with score rating

Task                                                      Score
Uploading of financial reporting data                       1
Export of tier1 customer orders from the data warehouse     9
Cleanup of cancelled orders                                 9
Creation of feature groups by customer                      3
Total (maximum value set at 20)                            22

This may sound like a basic process, but it creates a simple way for teams to track the amount of manual work they have in the organization and the pressure it’s putting on you or the team. If I asked you right now to list all the tasks that you’d like automated, you probably wouldn’t be able to do it completely. And even if you did, you’d probably have a hard time reasoning about the difficulty of each one. I’m sure the big one that you consider a pain will come right off the top of your head, but big problems usually meet resistance to automation because of the effort involved. Meanwhile, the smaller nuisance issues are also a steady drain on your time. This scorecard approach gives you a means to catalog and track the difficulty of these tasks.

In a lot of organizations, this scorecard will always have items on it. In fact, it will probably be at its maximum value a lot of times. But it’s a way to keep your manual processes in check. If you let manual processes run rampant, you won’t realize it until everything you do is crippled. Then it can be hard to claw your way back from the brink. With the scorecard approach, you’re able to quickly assess where you’re at and whether you can make an exception to support an important piece of work. It’s also a great place to look to when you have time and want to automate work for yourself or the team.
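The scorecard mechanics are simple enough to sketch in a few lines of code. The following is an illustrative sketch, not a prescription; the class and method names are my own invention, and the 1/3/9 scoring and the maximum value come from the process described above.

```python
# Sketch of the manual-task scorecard: each task carries a burden score
# of 1, 3, or 9, and the team refuses new manual work that would push
# the total past an agreed maximum.

VALID_SCORES = {1, 3, 9}

class Scorecard:
    def __init__(self, max_value):
        self.max_value = max_value
        self.tasks = {}  # task name -> burden score

    @property
    def total(self):
        return sum(self.tasks.values())

    def can_accept(self, score):
        """A new manual task fits only if it keeps the total at or under the max."""
        return self.total + score <= self.max_value

    def add_task(self, name, score):
        if score not in VALID_SCORES:
            raise ValueError("score must be 1, 3, or 9")
        if not self.can_accept(score):
            raise RuntimeError(
                f"scorecard full ({self.total}/{self.max_value}): "
                "automate something away before taking on more manual work")
        self.tasks[name] = score

    def automate(self, name):
        """Automating a task removes it, freeing up scorecard room."""
        return self.tasks.pop(name)
```

Note that because it’s the total that’s capped, automating a score-1 task doesn’t necessarily make room for a score-3 request, which matches the rule described above.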

7.3.3 The cost of manual work

Every organization has to deal with money and financials. Even if you’re in a nonprofit, how money is spent, and the benefit received from that expenditure, is the lifeblood of every organization. Technologists generally do a poor job of translating the trade-offs that are made into the language of finance. Numbers, whether they are arbitrary scorecards or budgetary dollars, help people make sense of the world.

If you’re shopping for a car, comparing comfort, features, and reliability is a lot more difficult than comparing $5,000 versus $20,000. Even if you care about those characteristics, assigning a dollar value to them gives you a much clearer idea of just how much you value them. All things being equal, I’m sure most people would prefer a fully loaded BMW to an economy Kia. But the minute you start adding dollars to the mix, the choice becomes more nuanced. You need to do the same thing when evaluating manual work versus automation.

Understanding the process

The first thing to do for both the automated work and the manual work is to decompose the process into the high-level steps that consume time and resources. Once you understand the process, you can begin assigning time estimates to it. Assigning estimates is arguably the hardest portion of this, because you may not have data, depending on whether it’s a new or existing process.

For this example, I’m going to assume that you have no data beyond your own experience and intuition. For each step in the process, estimate how long it would take, but instead of giving a single number, give a range. The range should have a lower bound and an upper bound, and it should have a 90% chance of containing the correct answer. This is known as a confidence interval in statistics; you’ll borrow it here to represent both the reality of your uncertainty and the variability in task execution.

For example, if I wanted to estimate how much time a ticket spent in the queue, I would estimate that 90% of the time, a ticket is in the queue waiting to be performed between 2 hours and 96 hours (four days). Now you need to be careful about your range for this to be a useful exercise. It’s easy to say, “I believe the number will be between five seconds and 365 days.” That’s not helpful to anyone.

A good way to think of it is like this: if you sampled every performance of the task, 90% of the executions, but only 90%, should fall within your range. Now imagine there’s a bet attached. If roughly 90% of executions land in your range, you win $1,000. If 93% or more land in the range, you have to pay out $1,000. In other words, if you make the range too wide, you’ll be “right” more often, which increases your chance of having to pay out. This calibration trick has been popularized by Douglas W. Hubbard and is a quick, easy way to keep yourself honest about your estimates. Figure 7.3 shows the workflow, annotated with time estimates instead of absolute values.

Figure 7.3 The time ranges give a more robust detailing of the variance in the execution.

Putting the estimates to work

Now that you have a series of time estimates, you can tally up the lower bounds of all the steps and the upper bounds of all the steps to get an idea of what the total process confidence interval can look like. (Again, these are all based on estimates, so there’s room for error.)

With this information, you also must think about how often this task will run. The frequency of the task is important because that compounds the amount of time being spent on it. If it runs weekly, that’s obviously much worse than if it runs annually. Tally up the number of estimated executions over a given time period.

I like to think of it in chunks of six months. Say you estimate that this task will run 3 to 10 times over a six-month period. Multiply those lower and upper bounds by your total lower- and upper-bound estimates for the task. You’ve now got roughly the number of hours that could be spent on the process every six months. It’s a range, so you’re admitting that this is variable, but at the same time, the range is far better than the information you had before, which was basically nothing.
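The tallying described above is just sums and multiplications, so a short sketch makes it concrete. The step names and numbers mirror the ticket example; all values are estimates, not measurements.

```python
# Sum the per-step 90% confidence intervals, then scale by the
# estimated execution-count range to get a six-month total.

# Each step: (name, lower bound in hours, upper bound in hours)
steps = [
    ("submit ticket",    5 / 60,  15 / 60),
    ("waiting in queue", 2.0,     48.0),
    ("perform task",     10 / 60, 65 / 60),
    ("report results",   1 / 60,  10 / 60),
]

per_run_low = sum(low for _, low, _ in steps)
per_run_high = sum(high for _, _, high in steps)

# Estimated executions over six months, also expressed as a range.
runs_low, runs_high = 3, 10

total_low = per_run_low * runs_low
total_high = per_run_high * runs_high

print(f"per run: {per_run_low:.2f} to {per_run_high:.2f} hours")
print(f"per 6 months: {total_low:.1f} to {total_high:.1f} hours")
```

Running this gives roughly 2.27 to 49.5 hours per execution, which lines up with the totals shown later in table 7.2.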

You’ll need to do the same thing for the automation work, with two major differences. For starters, you’ll need to include a maintenance time for the automation. Very seldom do you write code and never have to touch it again. You’ll need to estimate how much time you think will need to be spent ensuring that the code performs as expected. If you’re not sure, an hour a month is a good rule of thumb. There will be many months when you don’t use up your hour, and then some months when you spend four hours supporting it. The second major difference is that you won’t multiply the initial automation work. Once you’ve made that initial investment, that energy isn’t repeated as part of normal operations. This is where the big savings come in.

With these estimates in place, you’ll have a range of values for both the manual work and the automated work so you can make comparisons. Note that this is simply time spent; it’s beneficial to keep the model simple at first. Some folks might argue that it doesn’t account for how expensive different people’s time might be. That’s a valid concern, but the simple model is enough to get started.

Table 7.2 shows what the calculations might look like. To keep the example simple, I did the calculation based on only three executions.

Table 7.2 Steps to perform a process

Task               90% confidence interval     Time spent per 6 months
                   for time spent              (3 executions)
Submit ticket      5-15 minutes                15-45 minutes
Waiting in queue   2-48 hours                  6-144 hours
Perform task       10-65 minutes               30-195 minutes
Report results     1-10 minutes                3-30 minutes
Totals             2 hours, 16 minutes to      6 hours, 48 minutes to
                   49 hours, 30 minutes        148 hours, 30 minutes

Table 7.3 is a similar table put together for the effort to automate the task.

Table 7.3 Effort required to automate the same process

Task                     90% confidence interval     Time spent per 6 months
                         for time spent
Requirements gathering   4-18 hours                  4-18 hours
Development effort       2-40 hours                  2-40 hours
Testing effort           4-20 hours                  4-20 hours
Maintenance effort       1-4 hours                   3-12 hours
Total                    11-82 hours                 13-90 hours

The results for our sample are pretty interesting. With three executions, you see that the time spent executing the task manually and the time spent automating that task overlap each other. Someone might make the case that automating this task might not be worthwhile after all. But here are a few things to keep in mind:

  • The estimate is for only 3 executions. If you increase executions to 10, the numbers become much higher for time spent in the manual process.

  • The estimate window is only six months. If you plan on performing this task for longer than six months, much of the development is a one-time expenditure; only the maintenance cost continues. This makes automation look even more attractive in the long run.

This is a quick and easy way to start comparing automatic and manual tasks. Some people might complain that it’s not robust enough to model all the complexities involved. They’d be right, and if you’re willing to put the energy in to do a more formal cost-benefit analysis, it might be worth it. But in most cases, that analysis is never done, and this quick method gives you an idea of whether the automation of the task is even in the ballpark of valuable actions to take.

In the preceding example, I think automation makes a lot of sense, given I modeled only the best-case scenario. And if our imaginary process needs to continue beyond six months, then it’s a pretty open-and-shut argument: automation is well worth the effort.
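The two bullet points above are easy to check numerically. The following rough sketch uses the totals from tables 7.2 and 7.3; the manual cost scales with the number of executions, while the automation cost is mostly a one-time investment plus maintenance.

```python
# Compare manual effort (scales per execution) against automation
# effort (mostly fixed) over a six-month window. Numbers are the
# chapter's estimates, in hours.

manual_per_run = (2 + 16 / 60, 49.5)   # per-execution range (table 7.2)
automation_6mo = (13.0, 90.0)          # build + maintenance (table 7.3)

def compare(executions):
    """Return (manual_low, manual_high), (auto_low, auto_high) in hours."""
    lo, hi = manual_per_run
    return (lo * executions, hi * executions), automation_6mo

for n in (3, 10):
    (m_lo, m_hi), (a_lo, a_hi) = compare(n)
    print(f"{n} executions: manual {m_lo:.1f}-{m_hi:.1f} h "
          f"vs automation {a_lo:.0f}-{a_hi:.0f} h")
```

At 3 executions the ranges overlap, as the text notes, but at 10 executions the manual upper bound balloons while the automation cost barely moves.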

7.4 Prioritizing automation

The idea of automation isn’t new or particularly novel. It has existed in one fashion or another for as long as computers have been around. The need for automation has grown out of necessity as the number of machines, services, and technologies has exploded, along with the complexity of the services they deliver. Twenty years ago, operations teams were staffed based on a server-to-admin ratio. That strategy in today’s world would bankrupt most organizations. Automation is no longer optional. If you’re going to be successful, automation must become a priority in your project, your work, and your tools.

7.5 Defining your automation goals

Earlier in the chapter, I discussed how automation efforts can center around four areas: queue time, speed of the task, frequency of execution, and variance in execution. These four areas are also great pillars to begin thinking about your goals for automation. By defining these goals, you’ll have a better understanding of which automation tools should be at your disposal.

Imagine that your company is excited about a new product with a high potential for positively impacting revenue. But you find out that the API is incredibly limited. Instead of being able to filter out the specific data you need, you’ll be forced to download a much larger dataset and to read through thousands of records you don’t need to get the handful of records you do need. This isn’t an ideal automation scenario, but what are your goals?

If your automation goals are around the frequency of execution, this might not matter so much. The task might take longer than it probably should, but it’s still freeing someone up from filtering the data via the user interface and performing whatever transactions are necessary. Even if the process takes an extra 25 minutes per execution, it’s still worth the automation effort. You also get some of the other benefits, like reduced variance in execution, because now the task runs in a script that executes the same way every time. Now let’s say that your major driver for this same feature was queue time. If the only way to execute the automation is by being logged into the console of the server, that hinders your ability to integrate it into the ticketing system for self-service execution.

When you start to look at automation, by starting with the four areas of queue time, speed to perform, variance in performance, and frequency of performance, you can begin to understand which of these factors you want to drive down with automation. There might be other reasons you want to automate a task, but often those reasons will roll up to one of these four concerns.

7.5.1 Automation as a requirement in all your tools

Seldom does a single tool deliver all the value necessary for your application. Sometimes you’ll piece together various components of software. For example, you might have GitHub for a code repository, Jira for ticket and bug tracking, and Jenkins for continuous integration and testing. This collection of applications, scripts, and programming languages in your environment is known as your tool chain.

When you’re evaluating tools, if automation is a priority, then it must be a feature of the tool chains that you choose. No family of six chooses a Toyota Prius as the family car; it doesn’t meet the family’s requirements. If automation is a requirement, choosing tools that make it an impossibility is just as silly. I know that a ton of other constraints go into making a technology decision, and it’s easy to blame those constraints for a lack of automation. But how often have you presented the need for automation as a hard requirement? Have you expressed the cost in engineering effort of managing a system that can’t be scripted?

When it comes to evaluating the automation capabilities of a tool, the following are a few questions you can ask in order to determine how the capability supports your automation goals:

  • Does the application have a software development kit (SDK) that provides API access?

  • Does the application have a command-line interface?

  • Does the application have an HTTP-based API?

  • Does the application have integration with other services?

  • Does the GUI provide features that are not accessible programmatically?

When you’re evaluating a tool, be sure that you’re considering these requirements. Explicitly state your automation needs in your list of requirements. If your evaluation team is giving more weight to some requirements, make sure automation is properly weighted in that evaluation. You don’t want automation potentially discounted because other requirements are considered more important. If you don’t ensure that the tools you use are up to your automation goals, then no matter how much energy you spend on the next section, you’ll never achieve your goals.

7.5.2 Prioritizing automation in your work

Prioritizing automation in your work is the only way to build a culture of automation. It can be easy to become overwhelmed by the list of tasks that you need to do on a regular basis. Sometimes it feels more productive to choose the path of least resistance to get a task done.

Sometimes the dopamine hit of checking an item off your to-do list will drive you to make short-term decisions. But this short-term thinking has long-term implications as the number of manual tasks quickly piles up in the name of the “quick win.” These quick tasks are often the low-hanging fruit of automation, and it’s in these trenches that automation can make the most difference. It’s important that you keep a vigilant mindset around automation as a core value and prioritize these types of tasks when applicable.

Many organizations use a ticket-based workflow. Work is represented by a card in a system, and that card moves through the workflow. In most ticket-based workflows, the amount of time it takes for a ticket to be completed is usually dominated by queue time. As I’ve said earlier, queue time is the amount of time a ticket spends waiting for someone to work on it.

For every minute you’re on your lunch break or home sleeping or working on another problem, that’s queue time that another ticket is accumulating. If you’re the requestor of that ticket, it can be extremely frustrating to be stuck waiting on someone to get around to running a simple command. A task that takes two minutes to complete might spend four hours in a queue. Would you wait on hold for four hours in order to ask a question that will take only two minutes to answer? Probably not.

Everyone loses in this scenario. It’s why so many companies are trying to push commonly asked questions to things like automated interactive voice response (IVR) systems or directing customers to a company website where their questions might be answered. The only sustainable way to eliminate queueing time is to reduce the number of items that must enter the queue. This is where automation comes in.

For these sorts of commonly requested tasks, putting some automation in place and creating a method for self-service can free up valuable time for you and your team. But to do this, you must make it a priority. There’s no easy way around this. You must forgo doing one thing so that you can do another.

But how do you evaluate when to put the effort into automating a process instead of just doing what’s necessary to get the request closed and out the door? This is what usually holds up most automation efforts. A request comes in to execute the same old manual process. A ticket is created to automate the manual process, but it gets forever banished to the backlog with a wishful whisper of “someday . . .” on the lips of its creator.

The failure is that the automation ticket needs to be executed while the pain of the request is still fresh in the minds of its executors. Given too much time, the monotony of the request fades from memory. Seize the moment and start the work to automate it sooner rather than later!

For starters, don’t let the ticket get into the backlog. This advice is especially pertinent for teams that do lots of interruption-driven work, like operations teams. Get it prioritized as soon as possible. Don’t wait for it to be evaluated with the rest of the work. Make a case for why it should be prioritized in the current iteration of work or in the very next iteration of work. Once it gets into the sea of other requests, it stands no chance of seeing the light of day again. This is where techniques like letting the on-call staff member prioritize their own work can become a huge win (discussed in more detail in chapter 6).

Once you have the ticket properly prioritized, think about how to evaluate whether it’s a good automation candidate. The first thing to consider is the frequency with which this request is made. One of the problems with automating tasks that are done infrequently is that the time between executions can be so large that many of the underlying assumptions made by the automation may have changed. For example, if you have a process for maintenance windows that you’d like to automate, that task may run only once or twice a year.

But what in the underlying infrastructure has changed since you made that initial script? Has the type of worker nodes that read from the database changed? Did you remember to update the maintenance script to support the new type of load balancer you’re using? Is the method for checking that a service has stopped correctly still valid? A lot of things can change in six months.

If you’re in this scenario of infrequent execution, ask yourself how much automated testing you can create around this script. Is it possible that some tests could be run on a regular basis to ensure it’s working? In the case of our maintenance script, maybe you can schedule it to run regularly against the staging environment. Maybe you can do it as part of your continuous integration/continuous deployment (CI/CD) pipeline.

Whatever you do, the root of the solution is the same: the script needs to be run more regularly, whether in a testing environment or in real life. If you can’t come up with a good solution for scheduling the task more regularly, it might not be a great candidate for automation at this time.
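One way to run a script more regularly is a scheduled smoke test. The sketch below is hypothetical: the script path, the `--dry-run` flag, and the staging target are all assumptions about your environment, so substitute whatever your maintenance script actually supports.

```python
# Hypothetical smoke test for an infrequently used maintenance script:
# run it in dry-run mode against staging on a schedule (cron, CI job)
# so environment drift is caught long before the real maintenance window.
import subprocess
import sys

def smoke_test_maintenance(script="./maintenance.sh", target="staging"):
    """Run the maintenance script in dry-run mode; return True on success."""
    result = subprocess.run(
        [script, "--dry-run", "--target", target],
        capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure where your scheduler or CI can alert on it.
        print(result.stderr, file=sys.stderr)
        return False
    return True
```

A weekly scheduled run of something like this turns "the script silently rotted" into a routine test failure you can fix on your own schedule.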

The primary fear is that when it comes time to execute the script, the changes in the environment prevent it from running successfully. This failure breeds a sense of distrust against the automation suite. And when distrust begins to form, it can be incredibly difficult to earn it back, especially for particularly sensitive automation.

7.5.3 Reflecting automation as a priority with your staff

An organization that wants to make automation and tooling a priority also needs to distill those priorities into actionable items with regard to staff. Automation isn’t lacking just because staff members are lazy. More than likely, there’s a structural component to how your teams are built and how you manage skill-set improvements in the organization.

7.5.4 Providing time for training and learning

Whenever organizations talk about training, it’s always assumed that a large expenditure gets you a week’s worth of instruction and then magically leaves your staff with all the skills necessary to achieve the goals and tasks at hand. In my experience, paid training seldom pays off the way it was originally intended.

Usually the training is too early, meaning that the training class is far ahead of any practical, day-to-day usage of the skills acquired. You get trained on the new NoSQL database technology, only to have a two-month gap between training and implementation of the technology. You know the old saying: you use it or lose it. The flip side is that the training is too late, and you’re stuck with a buffet of bad decisions made in ignorance. It feels like a no-win situation. The fix for that is to build a culture of continual learning.

When you depend too heavily on structured training classes, you’re subconsciously treating learning as an event. Learning becomes codified into a structured and rigid practice of consuming knowledge from experts. But with the advent of online training, Safari books, conference talks, and an endless stream of YouTube videos, learning doesn’t have to be a completely structured affair. You just have to make time for it.

And in our busy schedules, with all of our various projects and goals, seldom do you see time dedicated to this idea of frequent learning. This lack of learning will shackle your team to the current way of doing things without a lens into what’s possible. Skill advancement and augmentation are already a fact of life in modern engineering organizations. Some of the top tools in use these days have been on the market for less than 10 years. Whether you’re looking at learning a new language to empower staff members to automate their workflows or at a tool that shifts the paradigm and brings automation along with it, you’ll need a plan to make sure you and your team members are always learning. All the solutions are a variation on the same concept: you must make time for it.

In my organizations, I treat learning like any other unit of work. I have team members create a ticket that gets tracked in our work management system, and that work gets scheduled and prioritized appropriately. If someone is trying to read a book, they might take the book and break groups of chapters up into tickets. Chapters 1-3 are a ticket, chapters 4-6 another ticket, and so on. This work gets scheduled and put into the work queue, and when the team members get to that ticket, they get up, walk to a quiet corner, and read. That’s what investment in these things could look like. If you’re saying continual learning is part of the job (and if you’re in tech and you’re not saying that, you should have your head examined), then it should be done during work hours. Requiring staff members to do all of the necessary learning on their own is not only unfair, but a quick path to burning out your employees.

No matter how you track your work, you need to find a mechanism to include learning in that system. This not only helps you give it the time it needs, but also makes that work visual and known.

Building automation into your time estimates

Another way to reflect this investment in automation with your staff is to build the time needed for automating tasks into every project, every feature, every initiative. It’s common for the automation or tooling of a process to be considered out of scope for the very project that’s bringing the work to fruition. As an extension of not accepting manual work, all project estimates should include the time necessary to automate the task.

This is important because it signifies that this automation isn’t just nice to have, but is part of the project and the deliverable. If you went online and looked for a recipe to make a cake, you might find that the recipe takes 10 minutes to make, but to your surprise, that doesn’t include the time it takes to frost the cake. Would you feel that 10 minutes was an accurate representation? Of course not--because a crucial (and arguably the best) part of the cake-making process wasn’t included. This is how you should feel about automation and tooling! If you wouldn’t hand a janky, half-baked process to a customer, why is it OK to give it to the people who make and support the product?

When you’re building your automation or tooling estimates, consider using the tools from the very beginning of the process. Don’t go through the entire project doing the tasks manually and then, when it’s complete, build the automation around it. As an example, if you need to run a SQL statement to change the configuration of an application and you intend to automate that, don’t wait for the production run to build that automation! When you’re doing your initial testing where the task will be needed, build it then. And then continue to iterate on it until it works. Then when you do it in lower environments, you’re reusing the tool you’ve already built and testing it to ensure that it will be solid by the time you move to production. This ensures that the tool gets built and is tested, and also speeds up all the other executions of that task during the testing cycle. You might be eliminating or reducing time spent in those four key areas, meaning that the project is able to move faster.

Scheduled time for automation

Many technical organizations have a period where they focus on bad technical decisions and attempt to fix them. This focus on technical debt sometimes doesn’t include things like automating old workflows. The focus is on things that don’t work at all, so things that are merely inefficient can sometimes get lost in the conversation.

As an organization, you should propose dedicated time at a regular interval for teams to focus on creating automation around tasks. You’ll always have some manual work that falls through the cracks. But if you have a dedicated slice of time for focusing on improvements, it can make a huge difference. Start with one week per quarter, specifically targeting at least one or two work items for automation. Increase the frequency as you start to see the benefits of automating that work. If a week per quarter is too aggressive, find what works best for you. But once you’ve identified the interval, stick to it and make good use of the time.

7.6 Filling the skill-set gap

The plan for automation sounds great, but beyond the hurdle of prioritizing automation, some organizations will have to deal with the reality of a skill-set gap. For many reasons, the teams that need to leverage automation may not necessarily have the skills to create it. This needlessly stalls many automation efforts, mainly because of pride and a lack of optimization through the entire workflow.

Not having the skills to perform a task within a team is not a new concept. In fact, it’s the reason entire IT organizations exist. You wouldn’t expect human resources staff to roll up their sleeves and launch their text editor every time they need a new bit of data processed in their applicant-tracking software. Instead, they contact the technology department, where those skills and expertise live inside the organization.

That same interaction occurs inside the technical department. The expertise of systems management and support lives within the operations organization. But when an internal technology team has a need for their own day-to-day work, the common logic is that the team should be able to support all those activities themselves. The expertise lives in the department, but not in that specific team unit.

Is there value in having those stark walls in place? Most of the motivation for these walls centers around how the teams are incentivized, which I’ll discuss a bit later in the book. Each team is so structured and focused on achieving their own immediate goals that they fail to realize how the poor performance of other teams might impact their own goals. I’ll give an example.

The development team needs to deliver a feature by Thursday. Because the deployment process is so cumbersome, the release life cycle requires that the software teams have their work submitted by Monday, so the operations team can perform the deployment in the staging environment by Tuesday. With a production deployment needing to be done on Thursday, that leaves the teams with only Tuesday and Wednesday to perform any testing and break fixes. This might seem like an operations issue (why do deployments take so long?), but it’s really an organizational issue because of the impact. The issues may not even be solvable by the operations department alone. Maybe the application isn’t packaged in a way that makes a faster deployment possible. Maybe the workflow for the release is broken and puts undue burden on the Ops team.

The cause doesn’t matter; the problem is still that the company’s deployment process is slowing the cadence of deployments. From this viewpoint, the idea of collaboration between development and operations is a no-brainer. The walls of responsibility are just an organizational detail. You want more effort placed on the problem, not on who owns what. In addition, making the deployments faster would afford the development team more time to test. Now they’re not beholden to having the release code ready a day ahead of time to get it deployed. Reducing the cycle time of deploys has benefits for the development team as well.

All this is to say that the skills you need exist in the organization, and engaging the appropriate teams that can help you deliver on your needs isn't just a good idea; it's crucial to the DevOps philosophy. Use the in-house expertise to help you and your team build your own expertise. No one starts with a blank canvas and paints a masterpiece. They study existing works and learn from others. Technology is no different.

When pride gets in the way

If the wins surrounding automation are so obvious, why do so many teams struggle with gaining traction on the automation mindset? Well, a sense of complete ownership exists among teams. The feeling is that production is owned by the operations team. If operations owns it, they want to feel comfortable with all the tooling necessary to support it. A bit of ego is most certainly at play here.

This isn’t a judgment or a condemnation. I suffer from the same ego and imposter syndrome issues as the next person. But there’s something incredibly humbling about asking someone for help in an area where you’re supposed to be the foremost expert.

Take that feeling and bury it. The field of technology is simply too vast for you to know everything about every area. Thanks to the internet, you’re constantly bombarded with videos of technical experts showing just how multifaceted and talented they are, leaving you to wonder whether you’re even qualified for your job. You are. You may not be a Brendan Gregg or an Aaron Patterson, but few people are. Don’t let that discourage you or your confidence. And don’t think that asking for help in one specific area diminishes your value as an engineer. Asking for help is not an admission of incompetence.

7.6.1 But if I build it, I own it

The fear of being perpetually stuck owning something that you have only tangential involvement in is real. I'm not trying to discount that this reality exists for a lot of people. But in many scenarios, the fear is driven by the idea that your automation scripts aren't considered a true part of the system and the overall solution. As changes to the system happen, the automation isn't considered part of the testing strategy, but its breakage has a major impact on everyone who uses it.

This turns into an incident that requires the supporting engineer to immediately context switch from their existing work to firefighting mode. And everyone hates firefighting. There’s no getting around this risk, but you can minimize it with a few simple tricks.

Reducing friction around support

The first way to reduce this friction is to ensure that the automation being worked on addresses both groups (meaning the group that is asking for help on the automation and the group that's actually performing the automation). If you're asking someone to help you build something that has absolutely nothing to do with their day-to-day activities, human behavior is going to limit that person's willingness to contribute. But if that assistance is going to make both your lives easier, the path forward becomes a lot smoother. This is the skin-in-the-game concept.

If both sides have skin in the game, the effort needed to support it is much easier to muster. Because when it breaks, both teams are impacted. Development can’t move as fast, and operations will be stuck with using the manual process, taking cycles away from more value-added work. I’ll discuss this concept of shared incentives in greater detail later in the book.

Second to the skin-in-the-game concept is the idea of buy-in from everyone involved, ensuring that both sides are in active agreement on the approach to the problem and what the solution looks like. For developers, it's key to understand that the other group probably has an idea of what the solution should look like, even if they falter on the specific implementation details. Removing them completely from the solution will not only produce a bit of animosity, but also create disinvestment on the operations side in the overall solution. Similarly, it's important that operations listen to the development team about the scope of possibilities. The development team might take issue with an approach because of its long-term implementation and support implications, its poor design, or its failure to meet the immediate need.

These are specific examples, but the underlying point is that collaboration is the most important aspect when engaging another group for help. Focus on solving the problem, not advocating for your specific solution. When teams collaboratively come up with a solution, the issue of long-term support suddenly seems to fade into the background.

Leave your baggage at the door

A bit of history from the perspective of operations team members might be in order. Operations groups have historically been the last stop on the release train into production. Unfortunately, in many organizations they're not involved in the process at all until it's time to release the software. In many cases, it's too late for operations to effect change in the way they would like to. This leads to a kind of mental disassociation from the solution, because it's not theirs. Their fingerprints aren't on it. The solution happened "to them," not "with them," so it gets treated as this sort of other thing.

Likewise, development has had a history of being treated like inept children by operations staff members. The barrier between them and production can feel a little artificial to them, and as a result, residual resentment can build. Their frustration is understandable. They can be trusted to build a system, but when it comes to operating it, they’re looked upon as completely incapable and dangerous.

These are broad characterizations, but I've seen them ring true for many engineers. It's important to empathize with your collaborators so that you understand where feelings, objections, and viewpoints are coming from. Even if your organization doesn't match this scenario, I assure you that some underlying hard feelings probably exist among the different engineers in your organization. Some of it might even be baggage from a previous job.

Owning part of the solution is a concern for many, but I challenge you to examine why you’re concerned with owning it. It often boils down to a lack of trust in the quality of the solution as well as in the influence that you can exert on it. When a tool adds value and is done well, supportability is rarely a concern.

7.6.2 Building the new skill set

Borrowing skills from another team is not a good long-term solution. It should be done only as a stopgap to allow time for the appropriate team to build up the skills necessary to take over support. This usually boils down to a lot of cross-training and mentorship. Once the initial implementation is built, encourage staff members to handle bug fixes and new features, while offering engagement opportunities with the original team members responsible for the initial development.

A series of lunch-and-learns can also go a long way, offering a group setting to allow people to ask questions, walk through the code together, and apply those lessons learned to new problems. Develop small coding challenges for the team to help them build their skill set. The misconception is that everything written needs to be deployed and used in a production environment. But if you can build your skill set by writing little utilities for your personal environment, you can quickly begin to gain confidence and competence in what you’re doing.

As an example, at a previous job an engineer who was learning Python wrote a script that read from the Twitter API and produced a ranked list of the most commonly used words. The resulting code has absolutely no value in the workplace, but the knowledge the engineer gained is invaluable. The Twitter project offered a specific problem to solve with no risk of real failure, while building on concepts that will be useful in real work scenarios.
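A minimal sketch of that kind of low-stakes exercise, in shell rather than Python. The filename and sample data here are stand-ins for whatever the exercise actually pulls down:

```shell
# Hypothetical learning exercise: rank the most common words in a text file.
# tweets.txt is generated sample data standing in for real downloaded tweets.
printf 'the cat sat on the mat and the cat slept\n' > tweets.txt

top_words() {
  # Split on non-letters, lowercase, then count and rank the words
  tr -cs '[:alpha:]' '\n' < "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn | head -5
}

top_words tweets.txt
```

The point isn't the pipeline itself; it's that a throwaway problem like this builds real fluency with tools the engineer will use in production work later.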

If you’re in a leadership position, you might consider making these sorts of assignments part of the body of work that’s planned for the week, allowing people to dedicate time during the work week to level up their skills. Learning on nights and weekends is exhausting and sometimes not possible for a lot of people because it competes with the rest of life. But doing it during the work week and reporting on your progress during team meetings drives the importance that you, as a leader, place on this skill.

Another option is to change the way the team is built out in the future. Invert the requirements a bit if your team already has ample talent in one area. For example, if an operations team is already heavily staffed with systems people, maybe you change the focus of the next hire to someone with heavy development experience, but less systems experience, letting them learn that on the job. Diversification of skills is important for any team, so knowing where you’re weak and where you’re strong will help you fill the gaps.

The skill-set gap is a real problem with the automation portion of a lot of DevOps organizations. But with a little creativity and shared problem solving, your organization has all the talent you need; you just need to tap into it. Once you’ve tapped into that talent, you need to get that work prioritized by highlighting its importance to the organization. Automation is in fashion now, so it doesn’t take the same level of effort to sell leadership on it as in years past. But making the case for why it’s important is still worthwhile. You can always highlight the following three points:

  • What you’ll be able to do faster and how that impacts other teams

  • What you’ll be able to do more consistently and repeatedly

  • What additional work you’ll be able to do with the automated task mostly removed from your workload

7.7 Approaching automation

When you talk about automation, many people have different and sometimes conflicting ideas about what that means. Automation can range from a series of individual scripted actions to a self-adjusting system that monitors, evaluates, and decides programmatically on each course of action.

DEFINITION Automation is the process of taking individual tasks and converting those tasks into a program or script for easier execution. The resultant program can then be used in a standalone fashion or encompassed into a larger piece of automation.

When making a DevOps transformation, you should always start thinking about a task with automation in mind. If DevOps is a highway, manual processes are a one-lane tollbooth on that highway. Long-term manual processes should be avoided at all costs. The price is too high and only continues to grow. A case can be made that sometimes a short, time-bound manual process is more effective than an automated process. But be wary of “short-term” fixes. They have a habit of becoming long-term as the necessary work to remove them continues to get reprioritized. If you’re considering a short-term manual process, make sure all parties involved have incentives to eliminate it, or you’ll end up with a short-term process five years later.

The level of automation that exists in your organization will depend greatly on the technical maturity and capabilities of the team responsible for implementing it. But regardless of your team’s skill set, I assure you that every organization can execute and benefit from some level of automation. The automation may not only reduce the amount of work and potential errors that can happen in a system, but also create a sense of safety among the team members who use it if it’s designed correctly.

7.7.1 Safety in tasks

Have you ever run a command that had a surprising side effect? Or maybe you've run an rm -rf * command in Linux and double-checked what directory you were in about 20 times before you felt comfortable pressing Enter. Your comfort with running a task is directly related to the potential outcome if things go bad.

Safety in tasks is the idea that the outcome of the task, if done incorrectly, won't produce dangerous results. I'm going to continue using my fear of cooking as an example. I'm always paranoid about baking chicken and not having it cooked thoroughly. The consequence of eating undercooked chicken could be dangerous! On the same note, I have no fear of cooking chicken tenders in the oven. Chicken tenders are typically precooked and frozen, so the consequence of undercooking chicken tenders is much safer than the consequence of undercooking raw chicken. They also have precise instructions with little to no variability. I don't need to make any adjustments based on variables, like the size of the chicken; I just follow the instructions. As you evaluate a task, you're attempting to gauge its risks, and how easy or hard it is, based on what you know about it.

Why is safety important, though? Because when you begin automating tasks, you’ll need to think about the potential side effects of each task your automation is now executing. To a certain degree, the user has less control over the actions than they did previously.

Imagine you’re automating the installation of a program. Previously, the person performing the installation had complete control of every command being performed. They knew what flags were being passed and what specific commands were being performed. In a world where these steps have been wrapped into a single command, they lose that fine-grained control in exchange for simpler execution. Previously, the user owned the responsibility of safety, but now because of automation, that responsibility has passed to you, the developer.

I want you to think about safety as you’re automating work and to treat that responsibility with respect, whether the target users for your automation are external clients, internal customers, or even yourself. Good automation does this all the time. Think of something like the command to install packages on a Linux system. If you type in yum install httpd, the command doesn’t just automatically install the package. It confirms with you what package it found and is going to install, along with all the dependencies that will come with that. If your package conflicts with another package, instead of saying, “Well, the user said install it, so let’s just break everything,” it instead errors out with a failure warning. There are command-line flags you can specify to force the installation if you really mean it, but forcing you to specify that additional flag acts as a safety feature to ensure that you know what you’re really asking for.
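You can build the same safety pattern into your own tooling. Here's a minimal sketch; `install_pkg` and `conflict_detected` are hypothetical names, and the conflict check is a stub standing in for whatever real validation applies:

```shell
# Hypothetical sketch of a yum-style safety check in your own tooling.
# conflict_detected is a stub; a real tool would query the package database.
conflict_detected() {
  [ "$1" = "conflicting-pkg" ]
}

install_pkg() {
  local pkg="$1" flag="${2:-}"
  if conflict_detected "$pkg" && [ "$flag" != "--force" ]; then
    # Refuse the risky path by default; make the operator opt in explicitly
    echo "ERROR: ${pkg} conflicts with an installed package." >&2
    echo "Rerun with --force if you really mean it." >&2
    return 1
  fi
  echo "Installing ${pkg}..."
}

install_pkg safe-pkg                  # proceeds normally
install_pkg conflicting-pkg           # refuses and explains why
install_pkg conflicting-pkg --force   # operator explicitly accepted the risk
```

The extra flag costs the operator one word, but it converts "the tool broke everything" into "the operator explicitly accepted the risk."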

You can create various levels of safety around a task. The amount of effort you put into safety is usually directly proportional to the amount of risk if things go wrong and the likelihood that they will go wrong. With a mental simulation of the tasks to be performed and some understanding of how complexity works, you can begin looking at your processes and thinking about where safety can be improved. You should start with looking at the automation of tasks.

7.7.2 Designing for safety

When you’re developing applications, a lot of energy is put into ensuring that the end user is well understood. Entire disciplines have been created around the user experience and the user-interface design to ensure that as part of the application’s design, you have an accurate understanding of who the most likely user is, what their experience and expectations are, and the type of behaviors they exhibit and ultimately inflict on the system. The thought process is that if it’s dangerous and the system allows the user to do it, then it is the fault of the system, not the user. For end users, this is a great proposition, and applications from Facebook to Microsoft Word are stronger because of this discipline.

But you don’t take a lot of these things into account when you’re developing systems to be run in production. Tooling to support the application is usually bare or nonexistent. Critical tasks that must be performed are often left to the lowest level of technology possible. It’s not uncommon for something as simple as a password reset for an admin account to be relegated to a handcrafted SQL query that everyone stores in their notes app with all the other esoteric commands that are needed to keep the system functional.

This is not only undesirable, but also dangerous when common tasks require steps to be taken outside the defined parameters of the application. Using the password as an example, someone might easily execute the command UPDATE users SET PASSWORD = 'secret_value' where email = '[email protected]'. The code looks pretty straightforward, and it’s easy to reason that this SQL should work. That is, until you realize the security implications: the password fields in the database are hashed.

NOTE Hash functions allow a user to map a data value of any size to another data value of fixed size. Cryptographic hash functions are often used to map a user's password to a hashed value for storage. Because hashing functions are one-way, knowing the hashed value isn't useful for determining what input created it. Most password systems take the user's submitted password, hash it, and then compare the computed hash with the hashed value that's stored for the user.
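A quick command-line illustration of those properties, using sha256sum as a generic stand-in (real password storage uses dedicated password-hashing algorithms, not a bare SHA-256):

```shell
# Hashing is deterministic and one-way: the same input always yields the
# same digest, and a tiny change in input yields a completely different one.
hash_value() {
  printf '%s' "$1" | sha256sum | awk '{print $1}'
}

hash_value 'secret_value'
hash_value 'secret_valu3'   # one character changed, entirely different digest
```

This is why the raw SQL update fails: the application compares digests, and the literal string you stored will never match any computed digest.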

After running the SQL statement, the password still doesn’t work. You could hash the password yourself, but now you’d need to know what hashing algorithm was used, among other things. In addition, many applications keep track of audit changes inside the database. But because this is an application function, the changes that you’re making via SQL probably aren’t being logged. As a result, the audit trail doesn’t show who changed the password or when they changed it. You also can’t tell whether the password change was nefarious in purpose or not. If the password changes are typically done in this fashion, a legitimate change and a criminal change look exactly the same. If the change was done through the application, you would expect to see an audit trail leading you to the person who performed the action. If you don’t see the audit trail, yet you confirm that the password was changed, it leads you down a different path of investigation.

The process for creating safe support actions for operators of the system isn’t radically different from what you would do in soliciting requirements and expectations from an end user. The person who runs the system is really just an end user with a different perspective on the system.

Never assume the user’s knowledge

It's easy to assume that the person operating the system will have knowledge as complete and comprehensive as that of the person designing or developing it. This is almost never the case. As systems grow and become more complex, you must presume that the operator is working with a limited set of knowledge about the system. Assuming the user doesn't have complete information will influence how you alert, inform, and interact with the user.

Probably most important, you shouldn’t anticipate that when presented with a decision point, the user will choose the most logical course of action for the situation. This holds true for written documentation or for complex self-regulating automation systems. This isn’t to say that your system should be designed so that anyone in the company can operate it, but the minimum expectations of the user should be well understood when the process is being created.

Acquire the operator’s perspective

UX engineers spend a lot of time with potential users of the system. One reason is to gain insight into how users see and interpret the application they're interacting with.

The same is true of operators of the system. Your perspective shapes how you view and interpret all the data that’s coming from the system. A log message that might be incredibly useful to a developer who is testing the software on their local workstation might be totally irrelevant and dismissed by an operator of the system in a production environment. The best way to get that insight is by sitting with operators and talking about the design of your process or automation. Get their feedback on how things should be handled. I assure you that their perspective will be valuable, and the way they view the same problem you’ve been staring at for weeks will surprise you.

Always confirm risky actions

The greatest fear I have in life is accidentally deleting my entire Linux system. The operating system allows you to make fatal mistakes without the benefit of a confirmation prompt. Want to delete every file on the system, including important system files? The system will let you do it!

Don't let your processes do that. Whenever possible, if a step requires you to take some sort of destructive action, you should confirm with the user before performing it. If it's automation, a simple TYPE YES TO CONTINUE prompt should be sufficient. If it's a manual process that's being done via checklist, be sure to highlight on the checklist that the user is about to perform a dangerous step and to double-check the inputs before continuing.
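A minimal sketch of such a prompt in bash; the function name and the destructive step it guards are hypothetical:

```shell
# Hypothetical guard: require the operator to type YES before continuing.
confirm_or_abort() {
  local answer
  read -r -p "$1 TYPE YES TO CONTINUE: " answer
  if [ "$answer" != "YES" ]; then
    echo "Aborting: confirmation not received." >&2
    return 1
  fi
}

# Usage: gate a destructive step behind the prompt. The herestring below
# simulates operator input for this demo; normally it comes from the terminal.
if confirm_or_abort "About to delete archived orders." <<< "YES"; then
  echo "Confirmed; proceeding."
fi
```

Requiring the literal word YES (rather than just pressing y or Enter) forces a deliberate action, which is exactly the friction you want in front of a destructive step.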

Avoid unexpected side effects

When designing automation, you want to avoid performing actions that are outside the immediate scope of what an operator might expect from the system. For example, say a user is executing a backup script for the application server, but the script requires the application to be shut down first. It might seem intuitive to just have the script perform the application shutdown, but ask yourself whether this requirement is something that the average operator would know and be aware of. Informing the user that the application must be shut down first, and then exiting, goes a long way toward preventing a surprise for the user.
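That guard might look something like the following sketch; the application name and the pgrep-based check are assumptions standing in for however you'd detect your real process:

```shell
# Hypothetical precondition check: tell the operator what to do
# instead of silently shutting the application down for them.
app_is_running() {
  pgrep -x "$1" > /dev/null
}

require_app_stopped() {
  if app_is_running "$1"; then
    echo "ERROR: $1 is still running. Shut it down, then rerun this backup." >&2
    return 1
  fi
}

if require_app_stopped "appserver"; then
  echo "Safe to back up."
fi
```

The script fails loudly with instructions rather than taking the shutdown action itself, keeping the side effect in the operator's hands.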

Now that you have an eye out for ensuring that you don't surprise the user with unintended behavior, you can begin to look at how complexity in the tasks you're automating leads to different approaches and different sets of problems.

7.7.3 Complexity in tasks

All problems and their subsequent tasks have varying levels of complexity. Cooking is a perfect example. I’m a lousy cook. But depending on what I’m cooking, various levels of complexity may exist. Cooking chicken tenders is much simpler than cooking raw chicken.

Being able to order and categorize the complexity of a task has value because it gives us a starting point to think about how the task is approached. You don’t need a ton of prep work to cook chicken tenders, but you might add a little extra time when cooking raw chicken for the first time.

To understand these issues a bit better, it often helps to use a framework of some sort to understand and give language to concepts. For this case, I’ll borrow from David Snowden’s Cynefin framework.1

NOTE The Cynefin framework, used as a decision-making tool, offers various domains. Snowden's definitions of these domains apply to this complexity conversation.

The Cynefin framework allows you to place the complexity of the problem into one of four contexts. The names for these contexts are simple, complicated, complex, and chaotic. For teaching purposes, I’m going to limit my usage to the first three contexts because addressing the chaotic is probably worth its own book and not something that general automation tips could reliably reason about.

Simple tasks

Simple tasks are those that have a handful of variables, but those variables are well-known and well understood. The way the values of the variables impact the necessary steps is also well understood.

An example might be installing a new piece of software. The variables might be your operating system type and the version of the operating system you’re running. Both variables, when changed, might impact the steps that you need to take to get the software installed. But because these values and their impact are well understood, they can be enumerated and documented ahead of time to help someone install the software for all supported operating system types.

As an example, the steps to download database software differ based on your operating system. If you’re using the RedHat-based operating system, you might need to download and install the software via an RPM package. If you’re using Windows Server, you might download an MSI installer. Despite these being two different methods to get the software installed, their steps are still well understood and can be detailed ahead of time.
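Because the variables and their impact are enumerable, the branching can be written down ahead of time. A sketch of that dispatch in shell; the file checks are common conventions, and the package names are placeholders:

```shell
# Hypothetical sketch: choose install steps based on OS family.
detect_os_family() {
  if [ -f /etc/redhat-release ]; then
    echo "redhat"
  elif [ -f /etc/debian_version ]; then
    echo "debian"
  elif [ "$(uname -s)" = "Darwin" ]; then
    echo "macos"
  else
    echo "unknown"
  fi
}

case "$(detect_os_family)" in
  redhat) echo "Install via RPM: rpm -i dbserver.rpm" ;;
  debian) echo "Install via dpkg: dpkg -i dbserver.deb" ;;
  *)      echo "See the manual installation documentation." ;;
esac
```

Every branch is known and documented up front, which is what makes this a simple task in Cynefin terms.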

Complicated tasks

Complicated tasks have numerous steps that are not easy or straightforward. They require various levels of expertise, but once you’ve done it, the task is often repeatable. An example might be manually promoting a database server from a secondary to the primary. Several steps require gathering more information as input to later tasks, so distilling the steps into simple tasks can be somewhat difficult.

Using the database server promotion example, you might have several decision points. If the slave database server you intend to promote is not fully in sync with the master, you might need to take a series of actions that then alter the steps necessary to perform the database promotion. With the appropriate level of expertise, you might be able to break these complicated tasks into a series of simple subtasks, but the execution of subtasks will most likely be done based on different decision points throughout the process. For example, if the slave database server is in sync, begin subtask X, but if it is not, begin subtask Y.
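That decision point can be sketched as a simple branch; the lag value is a stand-in for however replication state is actually measured in your database:

```shell
# Hypothetical sketch of one decision point in a promotion runbook.
# lag_seconds stands in for a real replication-lag measurement.
choose_promotion_path() {
  local lag_seconds="$1"
  if [ "$lag_seconds" -eq 0 ]; then
    echo "subtask X: replica is in sync; promote it directly"
  else
    echo "subtask Y: replay remaining events, then re-check before promoting"
  fi
}

choose_promotion_path 0
choose_promotion_path 42
```

Each branch is itself a simple, documentable subtask; what makes the overall promotion complicated is knowing which branch to take, and when.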

Complex tasks

Complex tasks typically involve many variables, with each variable potentially impacting the effects of the other variables. Fine-tuning a database is a perfect example of a complex task.

You cannot just set up a series of generic options and expect great performance. You have to consider your database workload, available resources on the server, traffic patterns, and trade-offs between speed and recoverability. If you’ve ever been on Stack Overflow, you’ve probably seen questions that are answered with, “Well, it depends.” That simple phrase is a signal that the forthcoming answer requires some expertise and some nuance in understanding. Complex tasks require a level of expertise in order to understand how different aspects of the task relate to and affect each other.

7.7.4 How to rank tasks

Something I've learned about over the years is "the curse of the expert." It can be difficult to separate what you know from what the intended performer of the task knows. An important thing to keep in mind when you're ranking these tasks is that the complexity of the task should be considered from the perspective of the person executing it, not from the perspective of your own expertise.

This can get tricky and might require cross-functional collaboration. If you were writing instructions for how to deploy software, the level of detail you use should change based on your target audience. If the instructions are for an operations group that has experience deploying similar code, your level of detail would differ from what you'd provide for someone with no context of the environment. Similarly, when you're implementing a task to restore the database, the complexity from the perspective of implementing the task is different from the complexity faced by the person executing it.

Understanding the complexity level of a task allows you to think about how to approach executing the task through the viewpoint of safety. The complexity of the task doesn’t map directly to its potential negative impact. If you were making a cake, you would consider the task of “preheat the oven to 350 degrees” a simple task. But accidentally setting the oven to 450 degrees could be disastrous for the outcome. (As stated previously, I’m a horrible cook, so this might not be true at all.)

When you evaluate a task for automation purposes, you might instantly judge your ability to automate it based on the complexity of the task. But don't be afraid to take on complex tasks that have low risk if they fail or are incorrect. If the outcome of the automation is low risk, learning about the complexity of the task through trial and error might not be that bad. Think of an unattended installation. The worst thing that could happen is that you must start the installation over again. Tackling this task, which on its face might seem complicated, is pretty risk free. You can iterate on it over time and get the process just right. Automation that deletes data in the database, however, should probably have a much higher degree of certainty.

You want the operator to be able to execute a task, knowing that different levels of safety are built into the task to protect against disastrous consequences if they make an error. Think about the anxiety level you might have if I gave you a complex task but told you that if you got it wrong, you’d get an error message and could try again. Now imagine that exact same task, but now I tell you that if you get the command wrong, you could shut down the entire system. That shift in user anxiety is why you want to consider levels of safety in your task creation.

Do you rank the task or the overall problem?

When looking to automate a task, you might be confused about whether you should be attempting to rank the complexity of the individual, discrete tasks or the complexity of the overall problem itself. For example, do I rank the task of cooking chicken? Or do I rank the four individual tasks that result in my overall solution of cooking chicken?

I would focus on ranking the individual tasks. Generally, the problem will be categorized based on the most complex task within it. As an example, if you have a problem that has four underlying tasks, and you ranked three of the tasks as simple and the final task as complex, the overall problem would be considered complex.

7.7.5 Automating simple tasks

Automating simple tasks is a great place to start to introduce safety into your processes and systems. Simple tasks are typically orderly and easy to codify into a scripting language. If this is your first attempt at automating a task, the advice is to always start small and simple. Also focus on tasks that are executed frequently; this way, you can continue to get feedback on your automation. Having your first piece of automation be something you run once a quarter doesn’t give you a lot of opportunity to learn and tweak the process.

The simplest way to start is to focus on small objectives of the automation and to slowly build on them. If you look at the final vision for your automation to start, you could quickly become overwhelmed with potential problems and hurdles. A simple goal might be just getting a multistep task executed via a single command-line utility. For an example, going back to Gloria from the beginning of the chapter, let’s say her process looks something like the following:

  1. Get a list of all orders in the failed state.

  2. Verify that the orders are both failed orders and orders that have been cancelled by the payment processor.

  3. Update the orders to the new state.

  4. Verify that those orders have moved to the new state.

  5. Display the results to the user.

These steps seem easy and straightforward, but they can still cause issues if a step is missed. What if the operator issues the wrong SQL command in step 1? What if the operator updates the wrong orders to the new state? These steps can easily be codified into a simple shell script as a first attempt at automation. Take a look at the following code I’ll call update-orders.sh:

#!/bin/bash
# Show how many orders are currently in the failed state and list them
echo "INFO: Querying for failed orders"
psql -c "select * from orders
         where state = 'failed'
         and payment_state = 'cancelled'"
# Update the orders in the failed state
echo "INFO: Updating orders"
psql -c "update orders set state = 'cancelled'
         where order_id in
           (select order_id from orders
            where state = 'failed'
            and payment_state = 'cancelled')"
# Redisplay those orders to the user; the count should now be zero
echo "Orders updated. There should be no more orders in this state"
psql -c "select count(*) from orders
         where state = 'failed'
         and payment_state = 'cancelled'"

Note that in SQL, string literals take single quotes; double quotes denote identifiers. The shell quoting is arranged so the SQL strings stay single-quoted, and the subquery selects order_id rather than all columns so the IN clause compares like with like.

This seems incredibly simplistic, but it provides a level of consistency that allows an operator to feel comfortable and safe in its execution. It also gives you repeatability. Every time someone executes this process, you can feel confident of the steps taken and the outcome. Even with a task this simple, executing update-orders.sh as a single command is easier and preferable to the five-step process detailed previously.

Astute readers may ask themselves, “How do I know one of these steps doesn’t fail?” It’s a fair question, but what if the step failed when you were executing it manually? You’d probably have the operator stop and perform some sort of initial troubleshooting. You can do the same with your automation, prompting the user whether to continue after each step.

The modified script follows. It’s a bit longer now, but it gives you an opportunity at key parts of the script to exit if the operator sees something that doesn’t seem correct in one of the previous command outputs. Again, remember that you’re focused on getting started, not on the automation winning a hackathon contest.

#!/bin/bash
echo "INFO: Querying for failed orders"
psql -c "select * from orders
         where state = 'failed'
         and payment_state = 'cancelled'"
# Give the operator context for the action taking place, and the
# opportunity to abort; loop until we get an explicit Y or N
response=""
while [ "$response" != "Y" ] && [ "$response" != "N" ]; do
    echo "Does this order count look correct? (Y or N)"
    read -r response
done
if [ "$response" == "Y" ]; then
    echo "INFO: Updating orders"
    psql -c "update orders set state = 'cancelled'
             where order_id in
               (select order_id from orders
                where state = 'failed'
                and payment_state = 'cancelled')"
    echo "Orders updated. There should be no more orders in this state"
    psql -c "select count(*) from orders
             where state = 'failed'
             and payment_state = 'cancelled'"
else
    echo "Aborting command"
    exit 1
fi

A detail worth noting: the confirmation loop uses two separate test commands joined with &&, because the single-bracket test command doesn't support || or && inside it, and the condition must be "not Y and not N" for the loop to terminate correctly.

In a more advanced script, you might add error handling to get a sense of how you might recover from a failure condition automatically. But for your first take at automation, this is more than enough. It gives the operator the safety to do the job with a consistency that makes the script's execution more predictable than if you left every step to the operator. The prompts are consistent, the queries are consistent, and the order of execution is consistent.

As you become more comfortable with the possible points of deviation, you can begin to introduce more error handling and less user confirmation. If the script runs 500 times without any failure in the update portion of update-orders.sh, you might decide that the confirmation prompt isn't providing much value in comparison with how much it costs you to verify each step. The key is to not assume that your automation is static, no matter its level of sophistication. You can continue to iterate to make it better, more robust, less user-dependent, and generally more helpful. Next steps in this automation's life might be as follows:

  • Recording the start and stop time of the entire process

  • Logging how many records were updated to an audit system

  • Logging the dollar value of failed orders for accounting purposes
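
As a sketch of those next steps, one way to record start and stop times and route them to an audit trail is to wrap each step in a small helper. The log path, the audit format, and the run_step helper name are all assumptions for illustration, not the book's implementation:

```shell
#!/bin/bash
# Illustrative sketch: wrap each step so its start/stop times land in an
# append-only audit log. AUDIT_LOG's default path is an assumption.
AUDIT_LOG=${AUDIT_LOG:-/var/log/update-orders-audit.log}

audit() {
    # Timestamped audit entries
    echo "$(date -u +%FT%TZ) $*" >> "$AUDIT_LOG"
}

run_step() {
    local name=$1
    shift
    audit "START ${name}"
    "$@"                      # run the actual step (e.g., a psql command)
    local rc=$?
    audit "STOP ${name} rc=${rc}"
    return "$rc"
}
```

You might then call, say, `run_step update-orders psql -c "..."` so every execution leaves a trace of when it ran and whether it succeeded.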

You can continue to improve your automation as you become more comfortable with the task and its possible outcomes. Don’t let perfect be the enemy of good. Automate the steps that you feel comfortable with, and as you gain more expertise in the task, let the automation evolve. Continued maintenance is something that you must consider, because your environment and process will be constantly evolving.

7.7.6 Automating complicated tasks

Simple tasks have a straightforward path and are easy to get started with. But complicated tasks can be a bit more cumbersome, because often they must retrieve information from one source to use as input for another. These tasks may have unknowns at the time of execution that make determining responses or input values less straightforward than usual, but the values of those responses are still knowable from a limited set of variables.

For example, if you operate in a cloud infrastructure, simple tasks such as restarting a service can become complicated. Because of the dynamic nature of the cloud, you may not know at design time which servers are running the service you need to restart. There might be one; there might be a hundred. Before you can even begin the task of automating the restart, you must be able to identify where the work needs to happen.

This is another example of how developing and iterating over time might be a beneficial approach to reaping safety benefits, while at the same time moving the team forward. Maybe you leave the discovery of where to run the restarts to human effort, with your script merely prompting for a list of IP addresses to execute the script against. As you become more skilled and more comfortable with your automation, you may move to having your script query the cloud provider’s API to find the list of IP addresses.
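
That iteration path might look like the following sketch. The discover_hosts function and the HOST_LIST override are assumed names for illustration: the first version just asks the operator for IP addresses, and a later version could replace the function body with a query against your cloud provider's CLI or API.

```shell
#!/bin/bash
# Illustrative sketch: start with human-supplied hosts, leaving room to
# swap in cloud-provider discovery later. HOST_LIST and discover_hosts
# are assumptions, not from the book.
discover_hosts() {
    if [ -n "${HOST_LIST:-}" ]; then
        echo "$HOST_LIST"    # injected list (or, later, a cloud API query)
    else
        echo "Enter space-separated IP addresses:" >&2
        read -r hosts
        echo "$hosts"
    fi
}

restart_all() {
    for host in $(discover_hosts); do
        echo "INFO: Restarting service on ${host}"
        # ssh "$host" 'sudo systemctl restart myservice'   # hypothetical
    done
}
```

Because discovery is isolated in one function, upgrading from "prompt the operator" to "query the cloud provider" doesn't change the rest of the script.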

When you’re working with complicated tasks for automation, you want to take a basic approach to dealing with the task. Always keep in the back of your mind that when it comes to automation, the safety of execution is always something to design for. You should evaluate two areas: the negative impact or consequence of getting a step wrong, and the ease of performing the success or failure of the step.

As an example, imagine you're restarting a process from a last known good position. Once you restart the process, it will begin processing data from the position you specify. If you get the position wrong, you could end up either skipping data by moving too far forward in the log sequence, or duplicating data by moving too far back and reprocessing records. You'd rank the consequences of getting this step wrong as high. Next, you must get the log sequence number. If you're lucky, there might be a system command that gives you the log sequence number. Figure 7.4 shows how easy it is to retrieve the log sequence number.

Figure 7.4 Output from the log sequence retrieve script

The likelihood of getting this wrong is low. This looks like a good candidate to automate in its entirety. But if this output were buried within a bunch of other output, requiring some arcane, unreadable regular-expression matching, that might change your assessment of the task's difficulty, moving it to medium. The accuracy of the task being medium, in combination with the high consequence of getting it wrong, might push it outside your comfort zone for automation to start with. You could instead rely on an operator to get the log sequence number manually and feed that value into your automation.
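
Handing that one step to a human might look like the sketch below: the operator supplies the log sequence number, and the script only validates the input before the rest of the automation proceeds. The function name and the numbers-only validation are assumptions for illustration:

```shell
#!/bin/bash
# Illustrative sketch: let the operator supply the risky value, but
# validate it before the automation continues.
read_lsn() {
    local lsn=""
    # Re-prompt until we get a purely numeric value
    while ! [[ "$lsn" =~ ^[0-9]+$ ]]; do
        echo "Enter the last known good log sequence number:" >&2
        read -r lsn
    done
    echo "$lsn"
}
```

The high-consequence judgment stays with the human; the automation contributes consistency by refusing obviously malformed input.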

The process for identifying potential automation levels follows:

  1. Break the complicated tasks into a series of simple tasks.

  2. Evaluate each simple task to understand the negative consequences of getting that task wrong. Rank it from 1 to 10.

  3. Rank the difficulty of performing that task with certainty via automation or scripts. Rank the confidence on retrieving an accurate value from 1 to 10.

  4. Plot these two rankings on a quadrant graph to evaluate the overall level of difficulty.

  5. Decide whether the task should be performed manually (prompted) or rolled into the automation based on your comfort level of where it falls in the quadrant.
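
As a sketch of step 5, the two rankings can be reduced to a recommendation with a small helper. The thresholds here (7 and 4) are arbitrary assumptions, not from the book; tune them to your own comfort level:

```shell
#!/bin/bash
# Illustrative sketch: combine the two 1-10 rankings into a recommendation.
# The threshold values are assumptions for illustration.
classify_task() {
    local consequence=$1   # 1-10: how bad is getting this step wrong?
    local confidence=$2    # 1-10: how confident are we in an accurate value?
    if [ "$confidence" -ge 7 ] && [ "$consequence" -le 4 ]; then
        echo "automate"                   # low risk, high confidence
    elif [ "$confidence" -ge 7 ]; then
        echo "automate-with-confirmation" # high risk, but reliable to verify
    else
        echo "prompt-operator"            # leave this step to a human
    fi
}
```

The log sequence example above would score a high consequence and, with the arcane output, a low confidence, so this helper would leave it to the operator.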

These steps may become overkill once you can quickly intuit whether a task carries particular risks. But if you're just getting started, they can give you some guidelines on how to proceed with your automation. Until you get comfortable, you can use the quadrant graph shown in figure 7.5 to plot your rankings. The quadrant that a simple task lands in should help you decide how difficult or easy it will be to automate, and the type of automation interaction you should use.

Figure 7.5 Plotting the risk involved with automation

7.7.7 Automating complex tasks

The automation of complex tasks is no small feat. It takes an experienced team dedicated to the automation systems being created. Depending on the size and experience level of the team, the effort could also create silos of knowledge, where only a select few are entrusted with certain tasks. Though automating complex tasks is possible, I'll reserve that conversation for later in the book, once you've gone through a few other topics.

Summary

  • Operational automation is a feature of the system and should be considered as early in the design process as possible.

  • Prioritize automation not just in the work you do, but also in the tools you choose.

  • Identify good candidates for automation by examining the task’s complexity and basing the automation approach on that complexity level.

  • Address skill-set gaps by leveraging other parts of the organization for assistance.


1.For more information on the Cynefin framework, see David J. Snowden and Mary E. Boone (2007), “A Leader’s Framework for Decision Making,” Harvard Business Review (https://hbr.org/2007/11/a-leaders-framework-for-decision-making).
