Chapter 10

Automation

Automation applied to an inefficient operation will magnify the inefficiency.

Bill Gates

The ability to deploy, configure, and troubleshoot production services quickly and easily through tools and automation is compelling. For the business side it promises the ability to rapidly experiment with different market offerings while also improving the organization’s ability to respond quickly to new or changing customer demands. For the technical side it can help increase the amount of work the team can complete while simultaneously making that work far less complex to perform and its results more predictable.

However, what many organizations miss is that really gaining the benefits of automation takes far more than simply installing a bunch of tools. For one, the best automation approach for one organization is likely to be considerably different from the best approach for another. In fact, ignoring these differences by using tools that do not match the conditions of your ecosystem can actually lead to more work and less delivery predictability.

Finding the optimal automation approach for your organization starts with first understanding the dynamics and conditions within your delivery ecosystem, then crafting tooling decisions that best fit them. As you will see, through this process you can build a positive feedback loop that both improves your ecosystem awareness and makes it more predictable and less labor intensive.

This chapter begins with an overview of the ways environmental conditions can impact automation effectiveness. From there, we explore how concepts such as 5S from Lean can improve awareness while reducing unnecessary variability. We also introduce the role of Tools & Automation Engineering and how such a group can dramatically improve your automation strategy, and ultimately expand the capabilities of your organization.

Tooling and Ecosystem Conditions

Companies want to automate for a variety of reasons. Some want to deliver and scale faster. Others want to reduce costs. There are even companies that really don’t care about any of it but want to be seen as having some sort of automation/cloud/DevOps strategy.

Those organizations that have some passion for automation often describe what they want as some mix of the cloud capabilities of Amazon, the service capabilities of Netflix, the resource fluidity of Spotify, the data mining capabilities of Google, and the continuous delivery capabilities of Etsy. They might not always be clear on all the details, but they usually want automation now, think that it is straightforward to get, and want to know what tools they can buy or download to make it happen immediately.

Companies that desire such automation sophistication are usually well intentioned but misguided in where to start. The biggest problem lies in the chasm between the conditions we see and those that need to be in place to deploy tools with any chance of achieving the desired objectives.

Mismatching tooling with ecosystem conditions is like using the wrong type of race car for a race. For example, a Formula 1 series race is designed as a contest of speed and agility. Cars that are driven in it are optimized for those dynamics, but as a result require impeccable track hygiene. One stray piece of garbage or a poorly surfaced track and someone is likely to get hurt. On the other end of the spectrum is off-road racing. While it is still a car race, it is a contest of endurance through rough and unpredictable conditions. Off-road vehicles have to balance ruggedness and speed, making them far slower but more durable than their Formula 1 counterparts.


Figure 10.1
Bill’s misjudgment of race cars cut short his off-road racing career.

Automation works in much the same way. Some of the time the mismatch means that, like the off-road racer in a Formula 1 race, the tool ends up not being particularly useful. This happens, for example, when build and code repository practices aren’t aligned with the automated build and continuous integration (CI) setups that have been installed. Putting automation around a clunky, slow, or fragile build process tends to only add complication. The same is true for CI if developers check in code infrequently, there is excessive code branching, you are only building and integrating small parts instead of everything together, or the delivery process is heavily phase-gated. The automation is supposed to provide rapid feedback on the state of the entire code base. This is difficult to do if you are only occasionally getting feedback on a small percentage of it.

Mismatches can likewise cause frustration or even additional wasted effort. One of the most common places this occurs is with monitoring and data analytics tools, where it takes little for a mismatch to have an outsized effect. On the service side, mismatches tend to take the form of inconsistent, inadequately documented, or poorly maintained interfaces and data, any of which can make it difficult to correlate and understand what is happening within and across systems. Proprietary vendor-supplied software and services can be so impenetrable that heavy reliance on them creates a large analytic black hole in your ecosystem.

Often, mismatches happen because the software or services themselves were simply not designed with analytics or instrumentation in mind, leaving poor-quality data artifacts that contain far too much noise to use for surveillance and understanding. Other times, the problem is actually with the choice of monitoring or analytics tools and their configuration. I regularly see such tools become blind or errantly spew garbage because either they have been deployed in an environment they were not designed for or there has been some subtle change to the inner workings of a process that causes the metrics or data the tools once looked for to no longer be relevant.
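
One lightweight guard against this kind of silent blindness is to periodically verify that the metrics your alerting and dashboards depend on are still actually being emitted. The sketch below is illustrative only; the rule names, metric names, and the fetch_emitted_metric_names helper are hypothetical stand-ins for whatever your monitoring stack provides.

```python
# A minimal sketch of guarding against "silently blind" monitoring: before
# trusting an alert rule, verify the metrics it depends on are still present
# in what the systems actually emit. All names here are hypothetical.

ALERT_RULES = {
    "checkout_latency_high": ["checkout.request.latency_ms"],
    "payment_error_rate": ["payments.errors.count", "payments.requests.count"],
}

def fetch_emitted_metric_names() -> set[str]:
    # Stand-in for querying your metrics backend for names seen recently;
    # faked here with a static sample for illustration.
    return {"checkout.request.latency_ms", "payments.requests.count"}

def find_blind_rules(rules: dict[str, list[str]], emitted: set[str]) -> dict[str, list[str]]:
    """Return rules whose required metrics are no longer being emitted."""
    blind = {}
    for rule, required in rules.items():
        missing = [m for m in required if m not in emitted]
        if missing:
            blind[rule] = missing
    return blind

if __name__ == "__main__":
    for rule, missing in find_blind_rules(ALERT_RULES, fetch_emitted_metric_names()).items():
        print(f"Alert rule '{rule}' is blind: missing metrics {missing}")
```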

Mismatches can be extremely hazardous. Just as Formula 1 cars need heavily controlled track conditions, some tools are extremely sensitive to mismatches. But unlike in Formula 1, where the car and driver pay the price, here it is usually the ecosystem that suffers.

This is particularly the case with automated deployment tools. Overly tight service or data coupling can make automated deployment difficult. However, the bigger problem comes from simply not having a clear and sufficiently consistent model of your deployment ecosystem. If deployment targets aren’t maintained with high configuration hygiene and minimal unknown or unmanaged state drift, automating deployment greatly increases the chances of unforeseeable, uncontrollable, and potentially irreversible deployment and configuration problems.
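
One way to reduce that risk, sketched below under assumed data structures, is to compare the desired state recorded in your configuration repository with what is actually on a target, and refuse to deploy until the drift is reconciled. The DESIRED and ACTUAL mappings are invented for illustration; in practice they would come from version control and from inspecting the host.

```python
# A minimal sketch, not any specific product's API, of checking for unmanaged
# configuration drift before letting an automated deployment proceed.

DESIRED = {
    "apache.version": "2.4.58",
    "app.release": "build-1412",
    "ssl.protocols": "TLSv1.2 TLSv1.3",
}

ACTUAL = {
    "apache.version": "2.4.58",
    "app.release": "build-1398",   # someone hot-fixed this host by hand
    "ssl.protocols": "TLSv1.2 TLSv1.3",
    "debug.mode": "on",            # unmanaged key not in desired state
}

def detect_drift(desired: dict, actual: dict) -> list[str]:
    """Report values that differ from, or were never defined in, the desired state."""
    findings = []
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            findings.append(f"{key}: desired {want!r}, actual {have!r}")
    for key in actual.keys() - desired.keys():
        findings.append(f"{key}: present on host but unmanaged")
    return findings

if __name__ == "__main__":
    drift = detect_drift(DESIRED, ACTUAL)
    if drift:
        print("Refusing automated deploy; reconcile drift first:")
        for line in drift:
            print("  -", line)
    else:
        print("No drift detected; safe to deploy.")
```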

Not all mismatches are necessarily technical in nature. Some of the most difficult mismatches to deal with affect what people work on and how they go about that work. Automated deployment and public cloud solutions can deeply affect established procedures and have very real legal and regulatory consequences. Adapting people and processes to new delivery approaches can cause real business headaches for organizations. If the new technologies and processes lack sufficiently clear and trackable governance controls and protections, any perceived shortfall can even limit which markets you can enter and who can become a customer. If automation choices are not made carefully, they can also make the jobs of staff far more cumbersome and error-prone, leading to pushback that can undermine the viability and effectiveness of the automation initiative itself.

Building Sustainable Conditions

So how do you go about finding and eliminating mismatches? Your first instinct might be to get an idea of what the conditions are within your ecosystem. But to do that you first need to know where to look.

Chapter 6, “Situational Awareness,” discusses how gaining sufficient context takes far more than just looking around. Everything from mental model biases to poor or misleading information flows can lead you astray. A much more effective way is to take a page out of Lean Manufacturing. Lean relies upon two desires that are commonly held by staff. The first is to be proud of your workplace and accomplishments within it. The second is that the work ecosystem is set up such that performing work the correct way is easier than performing work the wrong way.

How Lean accomplishes this is through an approach called 5S.

5S

Like IT automation, factory tools and automation only work well when the ecosystem is known and clean and the tools match what workers need to turn materials into the desired end product. The Lean method for organizing and maintaining an orderly workplace is called 5S. The traditional 5S pillars are sort (seiri), set in order (seiton), shine or clean (seiso), standardize (seiketsu), and sustain (shitsuke). They fit together like Russian matryoshka nesting dolls, with each subsequent pillar building upon and reinforcing those before it.

On the surface the names of these pillars might sound like irrelevant janitorial tasks to many in IT. What on earth gets “shined” in IT anyway? Don’t concern yourself with the terms themselves, as they are no more than a mere shorthand for something much bigger. Even manufacturers who only focus on the terms see little benefit. For both manufacturing and IT alike, it is the intent behind each term that is relevant. Table 10.1 provides a guide to help.

Table 10.1
The 5S Pillars

Pillar   | Meaning      | IT Equivalent
Seiri    | Sort         | Know and classify the artifacts within your ecosystem and remove those that are unnecessary
Seiton   | Set in Order | Organize artifacts and workspace sensibly to maximize effectiveness and minimize uncapturable change
Seiso    | Shine/Clean  | Maintain/improve artifacts, tools, and workspace to ensure their utility
Seiketsu | Standardize  | Build/maintain/improve common shared frameworks and processes that support end-to-end maintenance and transparency across the ecosystem while encouraging continuous improvement
Shitsuke | Sustain      | Build/maintain structural and managerial support to sustain continuous improvement

Let’s take an in-depth look at each of these 5S pillars to see how they can help us on our journey.

Seiri: Know What Is There

Figure 10.2
“It looks like we have all the packages ready for the next release!”

The first and most central pillar of 5S is sort (seiri). The purpose of seiri is to get a handle on all the bits that make up your environment, from the code, infrastructure, and other building blocks of the service stack to the tools and processes within it. It is not a simple inventory audit or discovery exercise. Those only provide a snapshot of what you might have in your environment. Seiri is necessarily continuous to maintain awareness of what you actually need and why you need it.

Finding out which components exist in your ecosystem, what they do, who uses them, who owns them, where they live, and how they enter, move around, and leave the ecosystem is obviously useful. It improves your awareness of what is going on, helping you spot risks and potential mismatches that could complicate automation efforts.

Trying to identify the purpose of everything also helps you to spot those things that are unnecessary, no longer needed, or really shouldn’t be where they are. In manufacturing, seiri is seen as a useful mechanism to tackle waste. Organizations habitually collect all sorts of clutter. Often little time is devoted to removing it because either people are too busy or the effort simply doesn’t seem worthwhile. Some might even think they are being clever by leaving it around. Seiri emphasizes why ignoring or keeping “clutter” is the wrong approach.

By getting rid of everything that is not directly needed for delivery, you eliminate many of the places where things can hide as well as the noise that obscures your understanding of what is going on. A good analogy is clearing your yard of leaves or dry brush. A cleared yard makes it easier to avoid tripping on your garden hose or stepping on a snake that has taken up residence there. Removal also has the added benefit of eliminating things, like a piece of dead code or an errant configuration, that can cause catastrophic problems.

Seiri is a good way to get everyone thinking about the purpose of everything in the ecosystem, which not only improves everyone’s understanding but also inevitably encourages people to think of ways to improve. Seiri also greatly reduces the number of ownerless orphans that can lead to potentially disastrous surprises down the line.

Seiton: Organize So It Is Useful

Figure 10.3
Mary found that keeping her work environment organized sparked joy.

Knowing what you have and getting rid of what you do not need is a great way to avoid becoming the next Knight Capital. But knowing you have something is only useful if you know where it is and that it is available to use when you need it. That is where the second pillar of 5S (seiton) comes in. Seiton is about organizing the various parts of your ecosystem in a sensible order. In a factory this means creating a well-laid-out workbench with each item easily locatable in the most sensible and convenient place for it. Organizing everything logically improves flow and accuracy and keeps users of these resources aware when something is amiss.

For those who have never spent any significant time in a well-organized environment, this probably seems more like a nice-to-have than an important requirement. It might be a little frustrating to have to hunt for things from time to time, but, the thinking goes, organizing takes time away from other, more important tasks that actually make money.

The reason these beliefs endure is that this same lack of organization obscures its actual cost. A useful way to demonstrate the cost is to imagine the same disorganization and lack of care in a setting where the dangers are far more obvious.

Imagine for a moment that you have to go to the hospital for minor surgery. You are directed to a room and told to get up on the operating table. As you enter the room, you find it strange that there is a pile of surgical tools in various states of repair heaped in a pile on the floor in the corner of the room. A short time later the doctor comes in with a nurse behind him. The doctor asks you to lie down as he starts to pick through the pile.

After some time, the doctor reaches into his pockets, pulls out a couple of old spoons and a steak knife, and says “I’m glad it was soup and steak day in the cafeteria! I can’t find the right surgical equipment, but these should be good enough for the job.” During this time the nurse has been filling a syringe from one of the many vials on the cart. As he sticks it in your arm he says “All the drugs look the same. This one is closest to me so it will just have to do!”

Such a scenario would likely have you desperately trying to run out the door. The risk to your well-being is both high and obvious, and one you would likely only tolerate in a disaster situation where your life was on the line and there were no other options.

Yet hospital environments do not naturally organize themselves. While you might not find surgical equipment on the floor, keeping equipment and medicines organized in a way that minimizes mistakes is still an ongoing challenge. In fact, the Institute of Medicine’s first Quality Chasm report stated that medication-related errors accounted for 1 out of every 131 outpatient deaths and 1 of 854 inpatient deaths, amounting to more than 7,000 deaths annually.1

1. Institute of Medicine. To Err Is Human: Building a Safer Health System. Washington, DC: National Academy Press; 1999, p. 27.

Organizational problems are one of the most common causes of medical errors in hospitals. Many medications have look-alike/sound-alike names and are stored in hospital pharmacies in identical containers. This is why hospitals increasingly use seiton and other 5S techniques, such as varying the color and shape of medication containers, organizing medical carts and cabinets to keep similar items separated and reduce the likelihood of mistakes, and using barcodes and other secondary checks to flag errors before they cause life-threatening problems.

While not always so obviously life-threatening, disorganization in IT can be just as dangerous. When under pressure to deliver and code is difficult to work with, tools are new or cumbersome, or establishing a working environment takes too long, people will inevitably cut corners. Just like the doctor using cafeteria utensils, the sloppily copied code, poorly managed dependencies, or repurposing of some random or less suitable deployment target might seem “good enough” to do the job.

While it may seem recklessly careless to outsiders, this mentality exposes the organization to unnecessary variability and risk. It makes code harder to read and work with and problems harder to troubleshoot, often wasting more time and effort than the original shortcuts saved. For automation, such variability creates additional edge cases that those deploying automation tools will inevitably need to uncover and deal with to ensure their efforts are both useful and work reliably. When tools or even large sections of the supply chain are of unknown quality or provenance, as occurred in the SolarWinds and Codecov security breaches, these unknowns can even introduce outright dangers to your business.

Just like hospitals, seiton efforts in IT should be designed to make it both easier and more worthwhile to do things the right way than to use shortcuts. Often this means exposing and fixing the root causes for dysfunction. For instance, poorly laid out code repositories and excessive branching often lead to check-in mistakes and painful merges that increase mistakes and consume time to sort out. These problems are often caused by unfamiliarity with version control tools and their best practices; poorly designed and implemented code; and inadequately defined and managed work. Sloppy coding and poor dependency management are often a result of a combination of misperceptions about the problem space; unrealistic time and resource constraints; and hidden or insufficiently understood existing technical debt.

The same is true on the operational side. Slow and cumbersome provisioning processes, poor software packaging, and inconsistent or manual installation and configuration are not only frustrating but encourage people to develop workarounds that create yet more variability to be dealt with. You can break the loop that causes the problems in the first place by organizing your ecosystem through improved packaging, configuration, and provisioning, along with measures that reduce the need to log into deployed instances.

Seiso: Ensure and Maintain Utility

Figure 10.4
Proactive maintenance keeps everything in optimal working order.

The third pillar of 5S, seiso, is about creating mechanisms and a culture that maintains and improves the organization, health, and utility of the artifacts and processes in the ecosystem. Seiso reinforces seiri by helping build the mechanisms that keep unnecessary clutter away, and reinforces seiton by keeping those organization efforts effective.

Maintaining the health of your delivery ecosystem seems like a reasonable thing to want to do. We know that everything from our car and our computer to our house and even our own body needs some care and attention from time to time to stay in reasonable working order. Sadly, this is hardly the case in IT. Everything from code and systems to the processes that deliver and operate them is habitually neglected. Bugs and outdated or suboptimal processes are frequently deprioritized or ignored, allowing them to linger until they cause a major event that makes addressing them absolutely necessary. The same goes for systems. Once they are in place, they tend to be rebuilt from the ground up only when they absolutely have to be, and then often grudgingly.

Some argue that the main culprit is the sense that maintaining and improving is throwing more money away on what should be an already solved problem. It is easy to blame managers and modern business practices like zero-based budgeting that tend to favor new features and products with seemingly clear yet unproven returns on investment over trying to tackle existing buggy or unreliable setups with far more opaque benefits. But as anyone who has had to argue to prioritize a bug fix or tried to encourage team members to proactively refactor code or rebuild systems knows, development and operational teams are just as likely to prefer working on the new and shiny over caring for what is already there.

The rapid and continuous evolution of IT ecosystems makes it particularly important to tune and maintain the systems, software, and processes that they are composed of. This is true even in relatively stable and simple environments with few changes in requirements. Technology shifts still regularly make technology stacks obsolete. Hardware and software alike become increasingly expensive to maintain as vendors declare older versions to be end-of-life (EoL). But that is hardly the toughest problem. Many banks and large businesses with legacy IT systems can tell you of the high expense and difficulty involved in finding people who both know and are willing to support long-obsolete technologies. No one likes to try to take on supporting a system that is old and cumbersome, especially when those who built it are long gone and the organization has no interest in improving or modernizing what is there.

There is also the problem that long-lived legacy systems tend to collect so much clutter that staying sufficiently aware of what’s on them and how it all works becomes extremely difficult. This creates the worst of all scenarios: a component that is progressively more difficult to support and increasingly risky to improve or replace without unknowingly breaking something important.

Implementing seiso doesn’t mean constantly chasing the flavor of the month, but it does require staying within the mainstream by establishing a culture that takes ownership and pride in the health and well-being of the ecosystem. Encourage people to point out problems and try to improve the ecosystem in a way that both tackles the problems and is aligned with the outcomes the organization is trying to achieve. This is where the learning kata and kaizen concepts (introduced in Chapter 7, “Learning”) come in handy. They help people look at problems and, rather than take on and solve a huge problem all at once, try to incrementally improve and learn in a more sustainable way.

To get there, however, you need more than excited people and a process. This is where the last two 5S pillars come in.

Seiketsu: Transparency and Improvement Through Standardization

Even though the entire ecosystem benefits from the first three 5S pillars, their primary focus tends to be on organizing and improving things at the function level. Seiketsu’s primary purpose is to remove the variability that gets in the way of understanding the state of the ecosystem well enough to be able to confidently deliver predictable and reliable results. This improves cross-ecosystem transparency, while also building up a level of complexity-cutting awareness and familiarity to help different functions work more effectively with each other.

It is important to note that these standards are not the traditional “best practice” rules that are created and enforced from the top down. Instead, they are developed and agreed upon by those in the trenches to reduce the amount of variability and “noise” that makes it more difficult to deliver the target outcomes.

There are many excellent places to implement seiketsu in the delivery ecosystem. On the development side, seiketsu is usually implemented through standards and tools that enable any developer to work effectively on any piece of code and in any project or team with little ramp-up time. These include coding style standards, standard logging and data structure formats, observability instrumentation, and build and development environment standards. While not always implemented perfectly, these are fairly common in very experienced enterprise software development organizations.
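
As one illustration of what such a shared standard can look like, the sketch below shows a team-agreed structured log format in which every service emits the same machine-parseable fields so records can be correlated across teams. The specific field names are assumptions for the example, not a standard taken from this chapter.

```python
# A minimal sketch of a shared structured-logging standard: every service
# emits the same JSON fields, so any team's tooling can parse any team's logs.
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render every log record in the team-agreed structured format."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": round(time.time(), 3),
            "service": getattr(record, "service", "unknown"),
            "level": record.levelname,
            "event": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Any service logging this way produces records other teams' tooling can correlate.
log.info("order accepted", extra={"service": "orders", "trace_id": "abc123"})
```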

On the delivery and operational side, the most important places to implement seiketsu are in software packaging, deployment, and environment configuration management. Having clean and predictable configurations and flows that are authoritatively known and easily reproducible eliminates many of the variability problems that delivery and support teams face. Clean configuration management also makes it far clearer when and where problems crop up by eliminating the noise where they can otherwise hide.

More controversially, development and operational teams should also try to use the same repository and task management tools, if not the same repositories and task queues (with, of course, an appropriate permissions scheme in place to manage who can do what where), when it makes sense to do so. When teams have vastly diverging needs, this might not be possible. Often, however, teams choose different tools and repositories out of personal preference rather than any real need, unknowingly introducing unnecessary friction and risk into the delivery ecosystem as a result. Such teams inevitably begin to struggle to manage work or collaborate whenever code and tasks need to move back and forth between teams and their different systems. Not only does it take more effort to keep the systems in sync, it also increases the risk that tasks and code might be mishandled or distorted in ways that damage situational awareness.

The alignment that seiketsu creates can streamline work considerably, providing two benefits:

  • It discourages doing something the wrong way because it takes more effort than doing it the right way. For instance, creating a deployment-ready atomic software package means less effort to manually install, troubleshoot, and remove the software on its target than a more irreversible “spray and pray” approach.

  • Streamlined standardization can enable locally specific elements, such as new automated test harnesses and business intelligence tools, to be easily added to preexisting pipelines across the delivery lifecycle.

Processes can also follow this approach. Teams can agree to provide standard interfaces to improve information/code sharing, collaboration, and timing in ways that still allow internal flexibility. This is the intent of both the Queue Master and Service Engineering Lead positions. In both cases, this approach improves predictability and reduces friction in the flow.

It takes only a little forethought to standardize in a way that lets you begin linking the more critical tools and processes together. Linkages also do not all need to happen at once, or everywhere. They should happen in those places where they can provide a lot of value, in the form of either new insights or smoother and less error-prone delivery.

Shitsuke: Building a Support System to Ensure Success

Shitsuke, the final pillar of 5S, focuses on building a culture that maintains the structural and managerial support needed to sustain the previous four pillars. Shitsuke is the recognition that work and improvement initiatives are not sustainable when done in total isolation. They need a structure that encourages awareness, cross-organizational cooperation, a sense of progress and belonging, and learning. While everyone plays some role, it is management’s commitment that is most crucial. By understanding and supporting each of the 5S pillars, management can create the conditions that allow those on the ground to sustainably communicate, collaborate, and continuously strive toward ever better ways of delivering the goals of the organization.

Managers are uniquely positioned to shape how information flows, is understood, and is acted upon. Through their peers and superiors they have the potential to see and help with the coordination of activities across several teams. Management can facilitate information flow and cross-organizational alignment, which helps with everything from knowing the state of the ecosystem to organizing, improving, and creating sensible standards to help meet target outcomes.

Managers also tend to either be part of or have more access to executive leadership, enabling them to play a role in shaping strategies that define organizational objectives and target outcomes. They can do this by improving executive awareness of conditions on the ground, as well as by ensuring that they convey the intent behind outcomes accurately to those in the trenches.

All of this only works if these same managers create and nurture the conditions that encourage people to set up the mechanisms that allow information to flow. This means cultivating a culture of safety and trust. People at all levels need to feel comfortable speaking up without fear of retribution, be willing to challenge assumptions, and try new things in pursuit of the target outcome. This is true even if they later prove to be wrong, as long as the lessons learned from the experience are shared and accepted.

Micromanaging how work needs to be performed, encouraging zero-sum rivalries between teams, and punishing failure can certainly quickly destroy any sense of safety and trust. But so can more-innocent actions like rolling out a new tool, technology, or process without at least trying to include those who have to use it in the decision process somehow. This isn’t to say that everyone needs to agree. However, if you are unable to convince many who are affected (skeptical or otherwise) how this 5S approach creates conditions that help them better achieve organizational target outcomes in some demonstrable way, trust will break and the solution will never achieve its potential.

Seeing Automation 5S in Action

Now that we have walked through each element that makes up 5S, how does it all come together to help with IT automation? There are lots of examples from every part of the service delivery lifecycle that can be drawn upon. I have chosen two examples from my own experience. One is an on-demand service start-up. The other is a large Internet services company.

The Start-up

Start-ups can rarely afford the luxury of hiring large numbers of staff before they have proven that their offerings have found a profitable market. This is especially true when you are an on-demand online service start-up, where demand can rise and fall quickly. Such companies are often the perfect place for all sorts of automation. Not only does automation reduce the number of staff needed to deliver and scale services up and down quickly, but when it is done well, customers are also far less likely to suffer service failures caused by provider staff burnout.

It is relatively easy for a new business starting from scratch to automate everything from the start. However, without taking care to implement 5S principles, it is easy to damage your ability to remain quick and lean once services are live and in use by active customers.

I was brought in to run a major part of delivery and operations at a start-up that had managed to work itself into a situation that made service delivery automation difficult to maintain and grow. Not only had the software become extremely complex, with tight coupling between components throughout the service stack, but in the pursuit of new business customers had also been allowed to heavily customize their instances. For some, this customization went all the way down to the hardware and operating system. The software customizations also went deep into the code, to the point where some customers had their own code branches.

All of this customization was made worse by the fact that customer software stack versions were not kept in a tight, consistent band. A new customer would often get the latest version of the software stack, while older ones would have to wait until there was both agreement from the customer and capacity by the delivery team to perform an upgrade.

Delivering upgrades was also difficult. The complexity and variability meant that there was little possibility of consistency between development/test and production environments. It was so bad that the software, when deployed, rarely worked as expected, if it worked at all. This led to constant firefighting and manual hacking in production, which made it nearly impossible to know how anything was configured, let alone reproduce it.

Putting in tooling that would be usable and deliver predictable results meant straightening out all of this mess. The only way to make such a clean-up both work and stick was to make it far easier for everyone to take the same organized approach than to continue with the freewheeling one they were currently using.

To find a path out of the situation the start-up had found itself in, my team and I decided first to compile a generic stack configuration that could be used as the most basic starting recipe (a form of seiton). If we were clever, we could change the way we organized our environment, separating out elements into generic, consistent building blocks. Any customer-specific differences would then be added as needed in a trackable way at assembly time. For instance, we knew that a web server was made up of a Linux OS instance, an Apache web server, and some modules. The modules might differ depending upon the specific customer, but these could be added or subtracted from the build manifest at install time. Configurations were also typically files that could be templated with customer- or instance-specific differences being defined and added as necessary.
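
A minimal sketch of this assembly-time approach appears below, with hypothetical package and customer names. The point is that the generic base stays identical for everyone, while customer-specific differences are added as tracked entries in the build manifest rather than baked into one-off builds.

```python
# A minimal sketch (names are hypothetical) of composing a build manifest from
# a generic, consistent base plus trackable customer-specific additions.

BASE_WEB_SERVER = ["linux-os", "apache", "mod_ssl"]

CUSTOMER_EXTRAS = {
    "acme":    ["mod_rewrite"],
    "initech": ["mod_proxy", "mod_headers"],
}

def build_manifest(customer: str) -> list[str]:
    """Assemble the install manifest: shared base first, tracked extras after."""
    return BASE_WEB_SERVER + CUSTOMER_EXTRAS.get(customer, [])

if __name__ == "__main__":
    for customer in ("acme", "initech", "new-customer"):
        print(customer, "->", build_manifest(customer))
```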

We then built standardized versioned packages of that generic base and automated the deployment and configuration of it. Because we did this before such tools as Terraform, Puppet, Ansible, or Chef were available, we used a souped-up version of Kickstart that pulled from repositories that held the packages and configuration details.

We then set about to convince the rest of the engineering team to organize and standardize their ecosystem.

We started small by building an automated continuous integration system with which we regularly built and packaged some of our tools. One of our early tools was the ability for a developer or tester to “check out” a fresh base instance built from our automated deployment tool.

Up to this point building and refreshing an instance to work with was slow and laborious. A tool that could build out a standard instance quickly and reliably was a huge timesaver. Developers quickly wanted to have more of the stack included in the build in order to further speed things up. What we demanded in return was the following:

  • All the software needed to be cleanly packaged so that it could be installed, configured, and removed atomically (seiso).

  • Each software configuration file that was to be automatically installed had to have any customizable values tokenized so that the tool could fill them in at install time.

  • All customized values would be generated based upon the rules of the roles that an instance was a member of. So, a development Linux web server instance in Denver would be a member of a development role, a Linux role, a web server role, and a Denver role, and would acquire its software and have its configurations generated from the rules associated with those roles (see the sketch following this list).
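
To make the tokenization and role rules above concrete, here is a small illustrative sketch. The template text, role names, and rule values are assumptions for the example; in practice the rules lived in a versioned repository and the deployment tool filled in the tokens at install time.

```python
# A minimal sketch of tokenized configuration filled from role-based rules.
from string import Template

CONFIG_TEMPLATE = Template(
    "ServerName ${hostname}\n"
    "LogLevel ${log_level}\n"
    "MaxClients ${max_clients}\n"
)

# Each role contributes values; later roles override earlier, more generic ones.
ROLE_RULES = {
    "linux":       {"log_level": "warn"},
    "web-server":  {"max_clients": "256"},
    "development": {"log_level": "debug", "max_clients": "32"},
    "denver":      {"hostname": "web01.den.example.internal"},
}

def render_config(roles: list[str]) -> str:
    """Merge rule values in role order, then fill the tokens in the template."""
    values: dict[str, str] = {}
    for role in roles:  # order defines precedence
        values.update(ROLE_RULES.get(role, {}))
    return CONFIG_TEMPLATE.substitute(values)

if __name__ == "__main__":
    print(render_config(["linux", "web-server", "development", "denver"]))
```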

It did not take long for most of the stack to be onboarded. As developers were also very keen to reproduce customer configurations quickly, they and their managers soon became willing to spend the time cleaning up and standardizing those configurations as well.

As customer instances were steadily cleaned up and standardized, we onboarded them into our management tooling. Difficult hardware and software configurations steadily melted away. In most cases customers were eager to make the trade for the benefit of being on instances that could be built, upgraded, and scaled up seamlessly in little time. Occasionally, a customer resistant to being onboarded might be ditched if company leadership decided that it was more cost-effective to do so.

Because of the way we had built the automation, the configuration repository was, in effect, an authoritative source of truth for the configuration of anything it managed. This meant that we could run a “diff” and see the configuration differences between environments. We could also trace dependencies, and use this information to drive monitoring and other supporting mechanisms. Because everything was versioned, we could also see who made which changes to which configurations, when, and for what reason. This could be used for everything from auditing to tracking down the causes of perceived performance or behavior changes in services over time.
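
As a rough illustration of that “diff” capability, the sketch below compares the configuration values recorded for two environments and reports anything that differs or exists in only one of them. The environment data shown is invented for the example; in our case it would have been generated from the role definitions in the repository.

```python
# A minimal sketch of diffing environment configurations drawn from an
# authoritative configuration repository (data here is made up).

STAGING = {"app.release": "build-1412", "db.pool_size": "20", "feature.new_checkout": "on"}
PROD    = {"app.release": "build-1398", "db.pool_size": "20"}

def diff_environments(a: dict, b: dict) -> list[str]:
    """List keys that differ, or that exist in one environment only."""
    out = []
    for key in sorted(a.keys() | b.keys()):
        va, vb = a.get(key, "<absent>"), b.get(key, "<absent>")
        if va != vb:
            out.append(f"{key}: staging={va} prod={vb}")
    return out

if __name__ == "__main__":
    for line in diff_environments(STAGING, PROD):
        print(line)
```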

The power of having an organized representation of our ecosystem soon became obvious, as did the problems with any sloppiness in definitions or packaging. Once we were able to describe everything in the ecosystem as a member of so many roles, it was like a curtain had been pulled back. We could suddenly see and understand the state of everything. We started to eliminate unnecessary customizations and configuration drift. We could also use the repository to drive the entire service delivery lifecycle.

The Large Internet Services Company

A couple of us who had worked at the start-up later moved to a much larger, established Internet services company. While the company had been extremely successful, it had grown organically for years, adding services by hand. There were various automation tools around, and even a centralized tooling team responsible for building out additional ones. Because there were so many unknowns and so much disorganization, however, the company found automation difficult to implement and its results unpredictable.

Shortly after we joined, the company faced a crisis. They had signed some large deals with stringent service level agreements that they had little idea how to meet. We saw this as an opportunity and got to work.

Our earlier experience had taught us that we needed to get a handle on what was there (seiri), organize it (seiton), and standardize (seiketsu) where we could. Helper mechanisms like the development environment building tool we had put in at the start-up, along with encouragement from management (shitsuke), would help attract people to use, maintain, and improve upon the mechanisms we were putting in place (seiso). This would allow us not just to improve our ability to meet service levels but also to offer better automation tools.

Even though there was a lot of pressure to progress, we still knew that we needed to start out small to succeed. We picked some services from one business unit to begin with. One person looked at ways of incorporating the learning we had from the start-up to build a workable configuration repository that could drive automated operating system installations and software deployments, while another looked at retrofitting existing automated installation and deployment tools, as well as making improvements to software packaging.

While this was happening, a second group sought to identify what was in place. This included more-knowable information, such as what hardware was in place, as well as useful but less accurate information, like what software and services were on the hardware, how they were configured, and who used them. Interestingly, it was this last part that proved to be the most difficult. There were times that we had to disable network ports and wait for someone to complain before we knew what something was.

Once we had some ability to build, deploy, and track in a far more organized way, we moved all new deployments to the new setup. With the information that we had collected, we also could begin to look for opportunities to rebuild what was in place. While there was some initial resistance, much of this faded when development and support teams saw the benefits that were gained by putting in some additional effort.

Once we had a demonstrably working model in one business unit, we started to reach out to others to transform. This continued until we had brought over most of the organization. Along the way we added new automation tools and capabilities such as full CI setups, automated deployment and test frameworks, and continuous delivery pipelines and reporting tools. All the while, any remaining legacy unknowns became increasingly isolated from the systems and services that had been brought into the new management framework.

Even with all this success, we still encountered some teams that simply refused to use the tool no matter what the benefits were or how much help was offered to make the transition. This was recognized as far from ideal by those at the highest levels of the organization. Eventually, the company decided to make adopting the tool a requirement for being in a production datacenter. They did this by opening new datacenters and allowing only those who were using the tool to migrate there. They then announced one by one the closure of the old datacenters and gave the resistant teams one last chance. It did not take long after the first one closed before the remaining teams onboarded.

Tools & Automation Engineering

Building sustainable conditions is critical for establishing a sufficiently sound foundation for any automation to both remain useful and have predictable results. As in our previous examples, making conditions sustainable might even form much of the initial direction for any automation efforts. But getting automation that is optimally fit for helping the organization both build sufficiently sustainable conditions and successfully pursue its target objectives does not come magically on its own. It needs someone sufficiently skilled to deliver it. That is where the role of Tools & Automation Engineering comes in.


Figure 10.6
As a Tools & Automation Engineer, Dan is dedicated to making the tools that are critical for helping the rest of the organization.

The concept of Tools & Automation Engineering comes out of the recognition that even when people have both the desire and all the requisite technical skills to deliver automation, they might not always have enough time or objective mental distance to do it. Having one or more people dedicated to crafting and maintaining the right tools for the organization helps overcome this challenge. Unless you have a relatively large organization, particularly one that is also geographically distributed, the number of people required for this role is never very high. Even one or two people is often sufficient for most organizations.

The best Tools & Automation Engineers tend to be talented scripters and programmers with sizeable exposure to the operations world, including operating systems, networking challenges, and troubleshooting live services. These are often the same sorts of people who naturally believe in the concept of “delivery as code.” They do not necessarily have to be pure operations people. A fair number have a release engineering, backend engineering, or whitebox testing background that has straddled the development and operational worlds through their work.

This deep exposure to the operational side is very important. It not only enables Tools & Automation Engineers to grasp the concepts necessary to understand and automate the service ecosystem effectively, it also positions them to create solutions that integrate with operational and infrastructure tooling in ways that are more resilient to operational failure conditions than those built by more conventional programmers tend to be.

Organizational Details

Tools & Automation Engineers are often the most effective when they are organizationally situated with those responsible for operationally maintaining the “ilities” of the service stack. The reason for this is that the operational environment forms the foundation of the customer’s service experience. By staying intimately aware of operational service conditions, Tools & Automation Engineers can build in the mechanisms that support the 5S pillars at the point where they matter most, and then work backward through the delivery lifecycle to improve the delivery team’s ability to build and maintain the desired service “ilities.”

In organizations that separate development and operations, having Tools & Automation Engineering within the operations team helps Tools & Automation Engineers stay attuned to operational conditions. It also gives operations staff a means to get help to create capabilities they need to automate and support the production ecosystem but either do not have the skills or the time to deliver themselves. Being both aware of operational conditions and well versed in software development, Tools & Automation Engineers can also aid Service Engineering Leads in communicating and bridging between development and operations.

In organizations where delivery teams build and run software, Tools & Automation Engineers provide cloud and CI/CD expertise for the delivery team and ensure that there is dedicated capacity to build and support delivery and operational automation. This is important for preventing tooling designed to support 5S pillars from being deprioritized in favor of new service feature functionality.

Workflow and Sync Points

While Tools & Automation Engineers participate as part of the team responsible for the production service, they are intentionally not part of any on-call, Queue Master, or Service Engineering Lead rotation. This is done to ensure that they can respond to any threats to the 5S pillars in the delivery ecosystem. However, despite this difference, they are still an integral part of team workflow and sync point mechanisms.

As you will see in Chapter 14, “Cycles and Sync Points,” being an important part of helping improve and stabilize Service Engineering maturity means using the delivery team’s workflow and participating in its synchronization and improvement mechanisms. This might seem a bit odd at first, especially as tools and automation work tends to look a lot more like typical development work than operational work; however, besides improving shared awareness, Tools & Automation Engineering plays a very important role in energizing improvement efforts.

Most of the tools that Tools & Automation Engineers end up building are heavily influenced by improvement needs that arise from the team’s retrospectives and strategic review sessions. The reasons for this quickly become apparent. For instance, it doesn’t make sense to automate the deployment of more services if the ones that you have already automated aren’t working reliably. The same goes for instrumentation and troubleshooting tools, build frameworks, and repositories. If any of these is too brittle or unreliable, then creating more of the same will only make things worse.

By participating in retrospectives and strategic review sessions, Tools & Automation Engineers can learn more about the pain points experienced in the operational environment with much deeper context. They can use the sessions to ask questions and propose further exploratory work to surface root causes and come up with tooling solutions to help the team overcome them.

Being part of retrospectives and the workflow also allows Tools & Automation Engineers to spot friction and risk points in work the team is receiving and performing that might be reducible through new or augmented tooling. For instance, a developer or operational person might get used to the packaging and deployment of a particular service being slow and cumbersome. A tooling person can look at the problem with a fresh set of eyes and potentially find a way to fix its root cause.

For Tools & Automation Engineering, work tends to behave quite similarly to a development Kanban. Most work is usually known going into the planning meeting, where it can be prioritized and ordered in the Ready column for tools engineers to pick up throughout the week. Work items can come into the workflow in an expedited way for tooling engineers, though just like expedited Service Engineering work, the same general post-expedited reviews should still take place to see if there are ways to minimize the need for such actions in the future.

While there is some flexibility to allow work items to be somewhat larger in size than is allowed for other Service Engineering work, in general most of it should still be sized to be no more than one or two days’ worth of effort. This both helps improve exposure of what is actually going on and keeps items moving across the board.

Summary

Delivering effective automation is an increasingly critical element needed to ensure the success of modern IT organizations. If done well and in collaboration with the delivery organization and management, it can help improve the organization and maintainability of the infrastructure, software, and services within it. But simply putting tools in place is neither a safe nor sustainable approach. The 5S pillars from Lean of seiri (sort), seiton (set in order), seiso (maintain), seiketsu (standardize), and shitsuke (sustain) allow you to build the situational awareness and predictable structures that successful automation efforts require.

As the delivery team improves its knowledge and management of the delivery ecosystem, it should consider creating a dedicated Tools & Automation Engineering function that sits within the operationally oriented Service Engineering area in order to provide tooling and automation solutions that help the organization meet customer target outcomes within their operational context.
