Chapter 10

Automation

Automation applied to an inefficient operation will magnify the inefficiency.

Bill Gates

The ability to deploy, configure, and troubleshoot production services quickly and easily through tools and automation is compelling. For the business side it promises the ability to rapidly experiment with different market offerings while also improving the organization’s ability to respond quickly to new or changing customer demands. For the technical side it can help increase the amount of work the team can complete while simultaneously making that work far less complex to perform and its results more predictable.

However, what many organizations miss is that really gaining the benefits of automation takes far more than simply installing a bunch of tools. For one, the best automation approach for one organization is likely to be considerably different from the best approach for another. In fact, ignoring these differences by using tools that do not match the conditions of your ecosystem can actually lead to more work and less delivery predictability.

Finding the optimal automation approach for your organization starts with first understanding the dynamics and conditions within your delivery ecosystem, then crafting tooling decisions that best fit them. As you will see, through this process you can build a positive feedback loop that both improves your ecosystem awareness and makes it more predictable and less labor intensive.

This chapter begins with an overview of the ways environmental conditions can impact automation effectiveness. From there, we explore how concepts such as 5S from Lean can improve awareness while reducing unnecessary variability. We also introduce the role of Tools & Automation Engineering and how such a group can dramatically improve your automation strategy, and ultimately expand the capabilities of your organization.

Tooling and Ecosystem Conditions

Companies want to automate for a variety of reasons. Some want to deliver and scale faster. Others want to reduce costs. There are even companies that really don’t care about any of it but want to be seen as having some sort of automation/cloud/DevOps strategy.

Those organizations that have some passion for automation often describe what they want as some mix of the cloud capabilities of Amazon, the service capabilities of Netflix, the resource fluidity of Spotify, the data mining capabilities of Google, and the continuous delivery capabilities of Etsy. They might not always be clear on all the details, but they usually want automation now, think that it is straightforward to get, and want to know what tools they can buy or download to make it happen immediately.

Companies that desire such automation sophistication are usually well intentioned but misguided in where to start. The biggest problem lies in the chasm between the conditions we see and those that need to be in place to deploy tools with any chance of achieving the desired objectives.

Mismatching tooling with ecosystem conditions is like using the wrong type of race car for a race. For example, a Formula 1 series race is designed as a contest of speed and agility. Cars that are driven in it are optimized for those dynamics, but as a result require impeccable track hygiene. One stray piece of garbage or a poorly surfaced track and someone is likely to get hurt. On the other end of the spectrum is off-road racing. While it is still a car race, it is a contest of endurance through rough and unpredictable conditions. Off-road vehicles have to balance ruggedness and speed, making them far slower but more durable than their Formula 1 counterparts.


Figure 10.1
Bill’s misjudgment of race cars cut short his off-road racing career.

Automation works in much the same way. Some of the time the mismatch means that, like the off-road racer in a Formula 1 race, the tool ends up not being particularly useful. This happens, for example, when build and code repository practices aren’t aligned with the automated build and continuous integration (CI) setups that have been installed. Putting automation around a clunky, slow, or fragile build process tends to only add complication. The same is true for CI if developers check in code infrequently, there is excessive code branching, you are only building and integrating small parts instead of everything together, or the delivery process is heavily phase-gated. The automation is supposed to provide rapid feedback on the state of the entire code base. This is difficult to do if you are only occasionally getting feedback on a small percentage of it.

Mismatches can likewise cause frustration or even additional wasted effort. One of the most common places this occurs is with monitoring and data analytics tools, where it takes little for a mismatch to have an outsized effect. On the service side, mismatches tend to take the form of inconsistent, inadequately documented, or poorly maintained interfaces and data, any of which can make it difficult to correlate and understand what is happening within and across systems. Proprietary vendor-supplied software and services can be so impenetrable that heavy reliance on them creates a large analytic black hole in your ecosystem.

Often, mismatches happen because the software or services themselves were simply not designed with analytics or instrumentation in mind, leaving poor-quality data artifacts that contain far too much noise to use for surveillance and understanding. Other times, the problem is actually with the choice of monitoring or analytics tools and their configuration. I regularly see such tools become blind or errantly spew garbage because either they have been deployed in an environment they were not designed for or there has been some subtle change to the inner workings of a process that causes the metrics or data the tools once looked for to no longer be relevant.
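
One lightweight guard against this kind of silent blindness is to periodically verify that the metrics your alerting and dashboards depend on are still actually being emitted. The sketch below is illustrative only; the rule names, metric names, and the fetch_emitted_metric_names helper are hypothetical stand-ins for whatever your monitoring stack provides.

```python
# A minimal sketch of guarding against "silently blind" monitoring: before
# trusting an alert rule, verify the metrics it depends on are still present
# in what the systems actually emit. All names here are hypothetical.

ALERT_RULES = {
    "checkout_latency_high": ["checkout.request.latency_ms"],
    "payment_error_rate": ["payments.errors.count", "payments.requests.count"],
}

def fetch_emitted_metric_names() -> set[str]:
    # Stand-in for querying your metrics backend for names seen recently;
    # faked here with a static sample for illustration.
    return {"checkout.request.latency_ms", "payments.requests.count"}

def find_blind_rules(rules: dict[str, list[str]], emitted: set[str]) -> dict[str, list[str]]:
    """Return rules whose required metrics are no longer being emitted."""
    blind = {}
    for rule, required in rules.items():
        missing = [m for m in required if m not in emitted]
        if missing:
            blind[rule] = missing
    return blind

if __name__ == "__main__":
    for rule, missing in find_blind_rules(ALERT_RULES, fetch_emitted_metric_names()).items():
        print(f"Alert rule '{rule}' is blind: missing metrics {missing}")
```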

Mismatches can be extremely hazardous. Just as Formula 1 cars need heavily controlled track conditions, some tools are extremely sensitive to mismatches. But unlike in Formula 1, where the car and driver pay the price, here it is usually the ecosystem that suffers.

This is particularly the case with automated deployment tools. Overly tight service or data coupling can make automated deployment difficult. However, the bigger problem comes from simply not having a clear and sufficiently consistent model of your deployment ecosystem. If deployment targets aren’t maintained with high configuration hygiene and minimal unknown or unmanaged state drift, automating deployment greatly increases the chances of unforeseeable, uncontrollable, and potentially irreversible deployment and configuration problems.
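
One way to reduce that risk, sketched below under assumed data structures, is to compare the desired state recorded in your configuration repository with what is actually on a target, and refuse to deploy until the drift is reconciled. The DESIRED and ACTUAL mappings are invented for illustration; in practice they would come from version control and from inspecting the host.

```python
# A minimal sketch, not any specific product's API, of checking for unmanaged
# configuration drift before letting an automated deployment proceed.

DESIRED = {
    "apache.version": "2.4.58",
    "app.release": "build-1412",
    "ssl.protocols": "TLSv1.2 TLSv1.3",
}

ACTUAL = {
    "apache.version": "2.4.58",
    "app.release": "build-1398",   # someone hot-fixed this host by hand
    "ssl.protocols": "TLSv1.2 TLSv1.3",
    "debug.mode": "on",            # unmanaged key not in desired state
}

def detect_drift(desired: dict, actual: dict) -> list[str]:
    """Report values that differ from, or were never defined in, the desired state."""
    findings = []
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            findings.append(f"{key}: desired {want!r}, actual {have!r}")
    for key in actual.keys() - desired.keys():
        findings.append(f"{key}: present on host but unmanaged")
    return findings

if __name__ == "__main__":
    drift = detect_drift(DESIRED, ACTUAL)
    if drift:
        print("Refusing automated deploy; reconcile drift first:")
        for line in drift:
            print("  -", line)
    else:
        print("No drift detected; safe to deploy.")
```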

Not all mismatches are necessarily technical in nature. Some of the most difficult mismatches to deal with affect what people work on and how they go about that work. Automated deployment and public cloud solutions can deeply affect established procedures and have very real legal and regulatory consequences. Adapting people and processes to new delivery approaches can cause real business headaches for organizations. If the new technologies and processes lack sufficiently clear and trackable governance controls and protections, any perceived shortfall can even limit which markets you can enter and who can become a customer. If automation choices are not made carefully, they can also make the jobs of staff far more cumbersome and error-prone, leading to pushback that can undermine the viability and effectiveness of the automation initiative itself.

Building Sustainable Conditions

So how do you go about finding and eliminating mismatches? Your first instinct might be to get an idea of what the conditions are within your ecosystem. But to do that you first need to know where to look.

Chapter 6, “Situational Awareness,” discusses how gaining sufficient context takes far more than just looking around. Everything from mental model biases to poor or misleading information flows can lead you astray. A much more effective way is to take a page out of Lean Manufacturing. Lean relies upon two desires that are commonly held by staff. The first is to be proud of your workplace and accomplishments within it. The second is that the work ecosystem is set up such that performing work the correct way is easier than performing work the wrong way.

How Lean accomplishes this is through an approach called 5S.

5S

Like IT automation, factory tools and automation only work well when the ecosystem is known and clean and the tools match what workers need to turn materials into the desired end product. The Lean method for organizing and maintaining an orderly workplace is called 5S. The traditional 5S pillars are sort (seiri), set in order (seiton), shine or clean (seiso), standardize (seiketsu), and sustain (shitsuke). They fit together like Russian matryoshka nesting dolls, with each subsequent pillar building upon and reinforcing those before it.

On the surface the names of these pillars might sound like irrelevant janitorial tasks to many in IT. What on earth gets “shined” in IT anyway? Don’t concern yourself with the terms themselves, as they are no more than a mere shorthand for something much bigger. Even manufacturers who only focus on the terms see little benefit. For both manufacturing and IT alike, it is the intent behind each term that is relevant. Table 10.1 provides a guide to help.

Table 10.1
The 5S Pillars

Pillar   | Meaning      | IT Equivalent
Seiri    | Sort         | Know and classify the artifacts within your ecosystem and remove those that are unnecessary
Seiton   | Set in Order | Organize artifacts and workspace sensibly to maximize effectiveness and minimize uncapturable change
Seiso    | Shine/Clean  | Maintain/improve artifacts, tools, and workspace to ensure their utility
Seiketsu | Standardize  | Build/maintain/improve common shared frameworks and processes that support end-to-end maintenance and transparency across the ecosystem while encouraging continuous improvement
Shitsuke | Sustain      | Build/maintain structural and managerial support to sustain continuous improvement

Let’s take an in-depth look at each of these 5S pillars to see how they can help us on our journey.

Seiri: Know What Is There

Figure 10.2
“It looks like we have all the packages ready for the next release!”

The first and most central pillar of 5S is sort (seiri). The purpose of seiri is to get a handle on all the bits that make up your environment, from the code, infrastructure, and other building blocks of the service stack to the tools and processes within it. It is not a simple inventory audit or discovery exercise. Those only provide a snapshot of what you might have in your environment. Seiri is necessarily continuous to maintain awareness of what you actually need and why you need it.

Finding out which components exist in your ecosystem, what they do, who uses them, who owns them, where they live, and how they enter, move around, and leave the ecosystem is obviously useful. It improves your awareness of what is going on, helping you spot risks and potential mismatches that could complicate automation efforts.

Trying to identify the purpose of everything also helps you to spot those things that are unnecessary, no longer needed, or really shouldn’t be where they are. In manufacturing, seiri is seen as a useful mechanism to tackle waste. Organizations habitually collect all sorts of clutter. Often little time is devoted to removing it because either people are too busy or the effort simply doesn’t seem worthwhile. Some might even think they are being clever by leaving it around. Seiri emphasizes why ignoring or keeping “clutter” is the wrong approach.

By getting rid of everything that is not directly needed for delivery, you eliminate many of the places where things can hide as well as the noise that obscures your understanding of what is going on. A good analogy is clearing your yard of leaves or dry brush. A cleared yard makes it easier to avoid tripping on your garden hose or stepping on a snake that has taken up residence there. Removal also has the added benefit of eliminating things, like a piece of dead code or an errant configuration, that can cause catastrophic problems.

Seiri is a good way to get everyone thinking about the purpose of everything in the ecosystem, which not only improves everyone’s understanding but also inevitably encourages people to think of ways to improve. Seiri also greatly reduces the number of ownerless orphans that can lead to potentially disastrous surprises down the line.

Seiton: Organize So It Is Useful

Figure 10.3
Mary found that keeping her work environment organized sparked joy.

Knowing what you have and getting rid of what you do not need is a great way to avoid becoming the next Knight Capital. But knowing you have something is only useful if you know where it is and that it is available to use when you need it. That is where the second pillar of 5S (seiton) comes in. Seiton is about organizing the various parts of your ecosystem in a sensible order. In a factory this means creating a well-laid-out workbench with each item easily locatable in the most sensible and convenient place for it. Organizing everything logically improves flow and accuracy and keeps users of these resources aware when something is amiss.

For those who have never spent any significant time in a well-organized environment, this probably seems more like a nice-to-have than an important requirement. It might be a little frustrating to have to hunt for things from time to time, but, the thinking goes, organizing takes time away from other, more important tasks that actually make money.

The reason these beliefs endure is that this same lack of organization obscures its actual cost. A useful way to demonstrate the cost is to imagine the same disorganization and lack of care in a setting where the dangers are far more obvious.

Imagine for a moment that you have to go to the hospital for minor surgery. You are directed to a room and told to get up on the operating table. As you enter the room, you find it strange that there is a pile of surgical tools in various states of repair heaped in a pile on the floor in the corner of the room. A short time later the doctor comes in with a nurse behind him. The doctor asks you to lie down as he starts to pick through the pile.

After some time, the doctor reaches into his pockets, pulls out a couple of old spoons and a steak knife, and says “I’m glad it was soup and steak day in the cafeteria! I can’t find the right surgical equipment, but these should be good enough for the job.” During this time the nurse has been filling a syringe from one of the many vials on the cart. As he sticks it in your arm he says “All the drugs look the same. This one is closest to me so it will just have to do!”

Such a scenario would likely have you desperately trying to run out the door. The risk to your well-being is both high and obvious, and one you would likely only tolerate in a disaster situation where your life was on the line and there were no other options.

Yet hospital environments do not naturally organize themselves. While you might not find surgical equipment on the floor, keeping equipment and medicines organized in a way that minimizes mistakes is still an ongoing challenge. In fact, the Institute of Medicine’s first Quality Chasm report stated that medication-related errors accounted for 1 out of every 131 outpatient deaths and 1 of 854 inpatient deaths, amounting to more than 7,000 deaths annually.1

1. Institute of Medicine. To Err Is Human: Building a Safer Health System. Washington, DC: National Academy Press; 1999, p. 27.

Organizational problems are one of the most common causes of medical errors in hospitals. Many medications have look-alike/sound-alike names and are stored in hospital pharmacies in identical containers. This is why hospitals increasingly use seiton and other 5S techniques, such as varying the color and shape of medication containers, organizing medical carts and cabinets to keep similar items separated and reduce the likelihood of mistakes, and using barcodes and other secondary checks to flag errors before they cause life-threatening problems.

While not always so obviously life-threatening, disorganization in IT can be just as dangerous. When under pressure to deliver and code is difficult to work with, tools are new or cumbersome, or establishing a working environment takes too long, people will inevitably cut corners. Just like the doctor using cafeteria utensils, the sloppily copied code, poorly managed dependencies, or repurposing of some random or less suitable deployment target might seem “good enough” to do the job.

While it may seem recklessly careless to outsiders, this mentality exposes the organization to unnecessary variability and risk. It makes code harder to read and work with and problems harder to troubleshoot, often wasting more time and effort than the original shortcuts saved. For automation, such variability creates additional edge cases that those deploying automation tools will inevitably need to uncover and deal with to ensure their efforts are both useful and work reliably. When tools or even large sections of the supply chain are of unknown quality or provenance, as occurred in the SolarWinds and Codecov security breaches, these unknowns can even introduce outright dangers to your business.

Just like hospitals, seiton efforts in IT should be designed to make it both easier and more worthwhile to do things the right way than to use shortcuts. Often this means exposing and fixing the root causes for dysfunction. For instance, poorly laid out code repositories and excessive branching often lead to check-in mistakes and painful merges that increase mistakes and consume time to sort out. These problems are often caused by unfamiliarity with version control tools and their best practices; poorly designed and implemented code; and inadequately defined and managed work. Sloppy coding and poor dependency management are often a result of a combination of misperceptions about the problem space; unrealistic time and resource constraints; and hidden or insufficiently understood existing technical debt.

The same is true on the operational side. Slow and cumbersome provisioning processes, poor software packaging, and inconsistent or manual installation and configuration are not only frustrating but encourage people to develop workarounds that create yet more variability to be dealt with. You can break the loop that causes the problems in the first place by organizing your ecosystem through improved packaging, configuration, and provisioning, along with measures that reduce the need to log into deployed instances.

Seiso: Ensure and Maintain Utility

Figure 10.4
Proactive maintenance keeps everything in optimal working order.

The third pillar of 5S, seiso, is about creating mechanisms and a culture that maintains and improves the organization, health, and utility of the artifacts and processes in the ecosystem. Seiso reinforces seiri by helping build the mechanisms that keep unnecessary clutter away, and reinforces seiton by keeping those organization efforts effective.

Maintaining the health of your delivery ecosystem seems like a reasonable thing to want to do. We know that everything from our car and our computer to our house and even our own body needs some care and attention from time to time to stay in reasonable working order. Sadly, this is hardly the case in IT. Everything from code and systems to the processes that deliver and operate them is habitually neglected. Bugs and outdated or suboptimal processes are frequently deprioritized or ignored, allowing them to linger until they cause a major event that makes addressing them absolutely necessary. The same goes for systems. Once they are in place, they tend to be rebuilt from the ground up only when they absolutely have to be, and then often grudgingly.

Some argue that the main culprit is the sense that maintaining and improving is throwing more money away on what should be an already solved problem. It is easy to blame managers and modern business practices like zero-based budgeting that tend to favor new features and products with seemingly clear yet unproven returns on investment over trying to tackle existing buggy or unreliable setups with far more opaque benefits. But as anyone who has had to argue to prioritize a bug fix or tried to encourage team members to proactively refactor code or rebuild systems knows, development and operational teams are just as likely to prefer working on the new and shiny over caring for what is already there.

The rapid and continuous evolution of IT ecosystems makes it particularly important to tune and maintain the systems, software, and processes that they are composed of. This is true even in relatively stable and simple environments with few changes in requirements. Technology shifts still regularly make technology stacks obsolete. Hardware and software alike become increasingly expensive to maintain as vendors declare older versions to be end-of-life (EoL). But that is hardly the toughest problem. Many banks and large businesses with legacy IT systems can tell you of the high expense and difficulty involved in finding people who both know and are willing to support long-obsolete technologies. No one likes to try to take on supporting a system that is old and cumbersome, especially when those who built it are long gone and the organization has no interest in improving or modernizing what is there.

There is also the problem that long-lived legacy systems tend to collect so much clutter that staying sufficiently aware of what’s on them and how it all works becomes extremely difficult. This creates the worst of all scenarios: a component that is progressively more difficult to support and increasingly risky to improve or replace without unknowingly breaking something important.

Implementing seiso doesn’t mean constantly chasing the flavor of the month, but it does require staying within the mainstream by establishing a culture that takes ownership and pride in the health and well-being of the ecosystem. Encourage people to point out problems and try to improve the ecosystem in a way that both tackles the problems and is aligned with the outcomes the organization is trying to achieve. This is where the learning kata and kaizen concepts (introduced in Chapter 7, “Learning”) come in handy. They help people look at problems and, rather than take on and solve a huge problem all at once, try to incrementally improve and learn in a more sustainable way.

To get there, however, you need more than excited people and a process. This is where the last two 5S pillars come in.

Seiketsu: Transparency and Improvement Through Standardization

Even though the entire ecosystem benefits from the first three 5S pillars, their primary focus tends to be on organizing and improving things at the function level. Seiketsu’s primary purpose is to remove the variability that gets in the way of understanding the state of the ecosystem well enough to be able to confidently deliver predictable and reliable results. This improves cross-ecosystem transparency, while also building up a level of complexity-cutting awareness and familiarity to help different functions work more effectively with each other.

It is important to note that these standards are not the traditional “best practice” rules that are created and enforced from the top down. Instead, they are developed and agreed upon by those in the trenches to reduce the amount of variability and “noise” that makes it more difficult to deliver the target outcomes.

There are many excellent places to implement seiketsu in the delivery ecosystem. On the development side, seiketsu is usually implemented through standards and tools that enable any developer to work effectively on any piece of code and in any project or team with little ramp-up time. These include coding style standards, standard logging and data structure formats, observability instrumentation, and build and development environment standards. While not always implemented perfectly, these are fairly common in very experienced enterprise software development organizations.
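
As one illustration of what such a shared standard can look like, the sketch below shows a team-agreed structured log format in which every service emits the same machine-parseable fields so records can be correlated across teams. The specific field names are assumptions for the example, not a standard taken from this chapter.

```python
# A minimal sketch of a shared structured-logging standard: every service
# emits the same JSON fields, so any team's tooling can parse any team's logs.
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render every log record in the team-agreed structured format."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": round(time.time(), 3),
            "service": getattr(record, "service", "unknown"),
            "level": record.levelname,
            "event": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Any service logging this way produces records other teams' tooling can correlate.
log.info("order accepted", extra={"service": "orders", "trace_id": "abc123"})
```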

On the delivery and operational side, the most important places to implement seiketsu are in software packaging, deployment, and environment configuration management. Having clean and predictable configurations and flows that are authoritatively known and easily reproducible eliminates many of the variability problems that delivery and support teams face. Clean configuration management also makes it far clearer when and where problems crop up by eliminating the noise where they can otherwise hide.

More controversially, development and operational teams should also try to use the same repository and task management tools, if not the same repositories and task queues (with, of course, an appropriate permissions scheme in place to manage who can do what where), when it makes sense to do so. When teams have vastly diverging needs, this might not be possible. Often, however, teams choose different tools and repositories out of personal preference rather than any real need, unknowingly introducing unnecessary friction and risk into the delivery ecosystem as a result. Such teams inevitably begin to struggle to manage work or collaborate whenever code and tasks need to move back and forth between teams and their different systems. Not only does it take more effort to keep the systems in sync, it also increases the risk that tasks and code might be mishandled or distorted in ways that damage situational awareness.

The alignment that seiketsu creates can streamline work considerably, providing two benefits:

  • It discourages doing something the wrong way because it takes more effort than doing it the right way. For instance, creating a deployment-ready atomic software package means less effort to manually install, troubleshoot, and remove the software on its target than a more irreversible “spray and pray” approach.

  • Streamlined standardization can enable locally specific elements, such as new automated test harnesses and business intelligence tools, to be easily added to preexisting pipelines across the delivery lifecycle.

Processes can also follow this approach. Teams can agree to provide standard interfaces to improve information/code sharing, collaboration, and timing in ways that still allow internal flexibility. This is the intent of both the Queue Master and Service Engineering Lead positions. In both cases, this approach improves predictability and reduces friction in the flow.

It takes only a little forethought to standardize in a way that lets you begin linking the more critical tools and processes together. Linkages also do not all need to happen at once, or everywhere. They should happen in those places where they can provide a lot of value, in the form of either new insights or smoother and less error-prone delivery.

Shitsuke: Building a Support System to Ensure Success

Shitsuke, the final pillar of 5S, focuses on building a culture that maintains the structural and managerial support needed to sustain the previous four pillars. Shitsuke is the recognition that work and improvement initiatives are not sustainable when done in total isolation. They need a structure that encourages awareness, cross-organizational cooperation, a sense of progress and belonging, and learning. While everyone plays some role, it is management’s commitment that is most crucial. By understanding and supporting each of the 5S pillars, management can create the conditions that allow those on the ground to sustainably communicate, collaborate, and continuously strive toward ever better ways of delivering the goals of the organization.

Managers are uniquely positioned to shape how information flows, is understood, and is acted upon. Through their peers and superiors they have the potential to see and help with the coordination of activities across several teams. Management can facilitate information flow and cross-organizational alignment, which helps with everything from knowing the state of the ecosystem to organizing, improving, and creating sensible standards to help meet target outcomes.

Managers also tend to either be part of or have more access to executive leadership, enabling them to play a role in shaping strategies that define organizational objectives and target outcomes. They can do this by improving executive awareness of conditions on the ground, as well as by ensuring that they convey the intent behind outcomes accurately to those in the trenches.

All of this only works if these same managers create and nurture the conditions that encourage people to set up the mechanisms that allow information to flow. This means cultivating a culture of safety and trust. People at all levels need to feel comfortable speaking up without fear of retribution, be willing to challenge assumptions, and try new things in pursuit of the target outcome. This is true even if they later prove to be wrong, as long as the lessons learned from the experience are shared and accepted.

Micromanaging how work needs to be performed, encouraging zero-sum rivalries between teams, and punishing failure can certainly quickly destroy any sense of safety and trust. But so can more-innocent actions like rolling out a new tool, technology, or process without at least trying to include those who have to use it in the decision process somehow. This isn’t to say that everyone needs to agree. However, if you are unable to convince many who are affected (skeptical or otherwise) how this 5S approach creates conditions that help them better achieve organizational target outcomes in some demonstrable way, trust will break and the solution will never achieve its potential.

Seeing Automation 5S in Action

Now that we have walked through each element that makes up 5S, how does it all come together to help with IT automation? There are lots of examples from every part of the service delivery lifecycle that can be drawn upon. I have chosen two examples from my own experience. One is an on-demand service start-up. The other is a large Internet services company.

The Start-up

Start-ups can rarely afford the luxury of hiring large numbers of staff before they have proven that their offerings have found a profitable market. This is especially true when you are an on-demand online service start-up, where demand can rise and fall quickly. Such companies are often the perfect place for all sorts of automation. Not only does automation reduce the number of staff needed to deliver and scale services up and down quickly, but when it is done well, customers are also far less likely to suffer service failures caused by provider staff burnout.

It is relatively easy for a new business starting from scratch to automate everything from the start. However, without taking care to implement 5S principles, it is easy to damage your ability to remain quick and lean once services are live and in use by active customers.

I was brought in to run a major part of delivery and operations at a start-up that had managed to work itself into a situation that made service delivery automation difficult to maintain and grow. Not only had the software become extremely complex, with tight coupling between components throughout the service stack, but in the pursuit of new business customers had also been allowed to heavily customize their instances. For some, this customization went all the way down to the hardware and operating system. The software customizations also went deep into the code, to the point where some customers had their own code branches.

All of this customization was made worse by the fact that customer software stack versions were not kept in a tight, consistent band. A new customer would often get the latest version of the software stack, while older ones would have to wait until there was both agreement from the customer and capacity by the delivery team to perform an upgrade.

Delivering upgrades was also difficult. The complexity and variability meant that there was little possibility of consistency between development/test and production environments. It was so bad that the software, when deployed, rarely worked as expected, if it worked at all. This led to constant firefighting and manual hacking in production, which made it nearly impossible to know how anything was configured, let alone reproduce it.

Putting in tooling that would be usable and deliver predictable results meant straightening out all of this mess. The only way to make such a clean-up both work and stick was to make it far easier for everyone to take the same organized approach than to continue with the freewheeling one they were currently using.

To find a path out of the situation the start-up had found itself in, my team and I decided first to compile a generic stack configuration that could be used as the most basic starting recipe (a form of seiton). If we were clever, we could change the way we organized our environment, separating out elements into generic, consistent building blocks. Any customer-specific differences would then be added as needed in a trackable way at assembly time. For instance, we knew that a web server was made up of a Linux OS instance, an Apache web server, and some modules. The modules might differ depending upon the specific customer, but these could be added or subtracted from the build manifest at install time. Configurations were also typically files that could be templated with customer- or instance-specific differences being defined and added as necessary.
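
A minimal sketch of this assembly-time approach appears below, with hypothetical package and customer names. The point is that the generic base stays identical for everyone, while customer-specific differences are added as tracked entries in the build manifest rather than baked into one-off builds.

```python
# A minimal sketch (names are hypothetical) of composing a build manifest from
# a generic, consistent base plus trackable customer-specific additions.

BASE_WEB_SERVER = ["linux-os", "apache", "mod_ssl"]

CUSTOMER_EXTRAS = {
    "acme":    ["mod_rewrite"],
    "initech": ["mod_proxy", "mod_headers"],
}

def build_manifest(customer: str) -> list[str]:
    """Assemble the install manifest: shared base first, tracked extras after."""
    return BASE_WEB_SERVER + CUSTOMER_EXTRAS.get(customer, [])

if __name__ == "__main__":
    for customer in ("acme", "initech", "new-customer"):
        print(customer, "->", build_manifest(customer))
```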

We then built standardized versioned packages of that generic base and automated the deployment and configuration of it. Because we did this before such tools as Terraform, Puppet, Ansible, or Chef were available, we used a souped-up version of Kickstart that pulled from repositories that held the packages and configuration details.

We then set about to convince the rest of the engineering team to organize and standardize their ecosystem.

We started small by building an automated continuous integration system with which we regularly built and packaged some of our tools. One of our early tools was the ability for a developer or tester to “check out” a fresh base instance built from our automated deployment tool.

Up to this point building and refreshing an instance to work with was slow and laborious. A tool that could build out a standard instance quickly and reliably was a huge timesaver. Developers quickly wanted to have more of the stack included in the build in order to further speed things up. What we demanded in return was the following:

  • All the software needed to be cleanly packaged so that it could be installed, configured, and removed atomically (seiso).

  • Each software configuration file that was to be automatically installed had to have any customizable values tokenized so that the tool could fill them in at install time.

  • All customized values would be generated based upon the rules of the roles that an instance was a member of. So, a development Linux web server instance in Denver would be a member of a development role, a Linux role, a web server role, and a Denver role, and would acquire its software and have its configurations generated from the rules associated with those roles (see the sketch following this list).
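
To make the tokenization and role rules above concrete, here is a small illustrative sketch. The template text, role names, and rule values are assumptions for the example; in practice the rules lived in a versioned repository and the deployment tool filled in the tokens at install time.

```python
# A minimal sketch of tokenized configuration filled from role-based rules.
from string import Template

CONFIG_TEMPLATE = Template(
    "ServerName ${hostname}\n"
    "LogLevel ${log_level}\n"
    "MaxClients ${max_clients}\n"
)

# Each role contributes values; later roles override earlier, more generic ones.
ROLE_RULES = {
    "linux":       {"log_level": "warn"},
    "web-server":  {"max_clients": "256"},
    "development": {"log_level": "debug", "max_clients": "32"},
    "denver":      {"hostname": "web01.den.example.internal"},
}

def render_config(roles: list[str]) -> str:
    """Merge rule values in role order, then fill the tokens in the template."""
    values: dict[str, str] = {}
    for role in roles:  # order defines precedence
        values.update(ROLE_RULES.get(role, {}))
    return CONFIG_TEMPLATE.substitute(values)

if __name__ == "__main__":
    print(render_config(["linux", "web-server", "development", "denver"]))
```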

It did not take long for most of the stack to be onboarded. As developers were also very keen to reproduce customer configurations quickly, they and their managers soon became willing to spend the time cleaning up and standardizing those configurations as well.

As customer instances were steadily cleaned up and standardized, we onboarded them into our management tooling. Difficult hardware and software configurations steadily melted away. In most cases customers were eager to make the trade for the benefit of being on instances that could be built, upgraded, and scaled up seamlessly in little time. Occasionally, a customer resistant to being onboarded might be ditched if company leadership decided that it was more cost-effective to do so.

Because of the way we had built the automation, the configuration repository was, in effect, an authoritative source of truth for the configuration of anything it managed. This meant that we could run a “diff” and see the configuration differences between environments. We could also trace dependencies, and use this information to drive monitoring and other supporting mechanisms. Because everything was versioned, we could also see who made which changes to which configurations, when, and for what reason. This could be used for everything from auditing to tracking down the causes of perceived performance or behavior changes in services over time.
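
As a rough illustration of that “diff” capability, the sketch below compares the configuration values recorded for two environments and reports anything that differs or exists in only one of them. The environment data shown is invented for the example; in our case it would have been generated from the role definitions in the repository.

```python
# A minimal sketch of diffing environment configurations drawn from an
# authoritative configuration repository (data here is made up).

STAGING = {"app.release": "build-1412", "db.pool_size": "20", "feature.new_checkout": "on"}
PROD    = {"app.release": "build-1398", "db.pool_size": "20"}

def diff_environments(a: dict, b: dict) -> list[str]:
    """List keys that differ, or that exist in one environment only."""
    out = []
    for key in sorted(a.keys() | b.keys()):
        va, vb = a.get(key, "<absent>"), b.get(key, "<absent>")
        if va != vb:
            out.append(f"{key}: staging={va} prod={vb}")
    return out

if __name__ == "__main__":
    for line in diff_environments(STAGING, PROD):
        print(line)
```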

The power of having an organized representation of our ecosystem soon became obvious, as did the problems with any sloppiness in definitions or packaging. Once we were able to describe everything in the ecosystem as a member of so many roles, it was like a curtain had been pulled back. We could suddenly see and understand the state of everything. We started to eliminate unnecessary customizations and configuration drift. We could also use the repository to drive the entire service delivery lifecycle.

The Large Internet Services Company

A couple of us who had worked at the start-up later moved to a much larger, established Internet services company. While the company had been extremely successful, it had grown organically for years, adding services by hand. There were various automation tools around, and even a centralized tooling team responsible for building out additional ones. Because there were so many unknowns and so much disorganization, however, the company found automation difficult to implement and its results unpredictable.

Shortly after we joined, the company faced a crisis. They had signed some large deals with stringent service level agreements that they had little idea how to meet. We saw this as an opportunity and got to work.

Our earlier experience had taught us that we needed to get a handle on what was there (seiri), organize it (seiton), and standardize (seiketsu) where we could. Helper mechanisms like the development environment building tool we had put in at the start-up, along with encouragement from management (shitsuke), would help attract people to use, maintain, and improve upon the mechanisms we were putting in place (seiso). This would allow us not just to improve our ability to meet service levels but also to offer better automation tools.

Even though there was a lot of pressure to progress, we still knew that we needed to start out small to succeed. We picked some services from one business unit to begin with. One person looked at ways of incorporating the learning we had from the start-up to build a workable configuration repository that could drive automated operating system installations and software deployments, while another looked at retrofitting existing automated installation and deployment tools, as well as making improvements to software packaging.

While this was happening, a second group sought to identify what was in place. This included more-knowable information, such as what hardware was in place, as well as useful but less accurate information, like what software and services were on the hardware, how they were configured, and who used them. Interestingly, it was this last part that proved to be the most difficult. There were times that we had to disable network ports and wait for someone to complain before we knew what something was.

Once we had some ability to build, deploy, and track in a far more organized way, we moved all new deployments to the new setup. With the information that we had collected, we also could begin to look for opportunities to rebuild what was in place. While there was some initial resistance, much of this faded when development and support teams saw the benefits that were gained by putting in some additional effort.

Once we had a demonstrably working model in one business unit, we started to reach out to others to transform. This continued until we had brought over most of the organization. Along the way we added new automation tools and capabilities such as full CI setups, automated deployment and test frameworks, and continuous delivery pipelines and reporting tools. All the while, any remaining legacy unknowns became increasingly isolated from the systems and services that had been brought into the new management framework.

Even with all this success, we still encountered some teams that simply refused to use the tool no matter what the benefits were or how much help was offered to make the transition. This was recognized as far from ideal by those at the highest levels of the organization. Eventually, the company decided to make adopting the tool a requirement for being in a production datacenter. They did this by opening new datacenters and allowing only those who were using the tool to migrate there. They then announced one by one the closure of the old datacenters and gave the resistant teams one last chance. It did not take long after the first one closed before the remaining teams onboarded.

Tools & Automation Engineering

Building sustainable conditions is critical for establishing a sufficiently sound foundation for any automation to both remain useful and have predictable results. As in our previous examples, making conditions sustainable might even form much of the initial direction for any automation efforts. But getting automation that is optimally fit for helping the organization both build sufficiently sustainable conditions and successfully pursue its target objectives does not come magically on its own. It needs someone sufficiently skilled to deliver it. That is where the role of Tools & Automation Engineering comes in.


Figure 10.6
As a Tools & Automation Engineer, Dan is dedicated to making the tools that are critical for helping the rest of the organization.

The concept of Tools & Automation Engineering comes out of the recognition that even when people have both the desire and all the requisite technical skills to deliver automation, they might not always have enough time or objective mental distance to do it. Having one or more people dedicated to crafting and maintaining the right tools for the organization helps overcome this challenge. Unless you have a relatively large organization, particularly one that is also geographically distributed, the number of people required for this role is never very high. Even one or two people is often sufficient for most organizations.

The best Tools & Automation Engineers tend to be talented scripters and programmers with sizeable exposure to the operations world, including operating systems, networking challenges, and troubleshooting live services. These are often the same sorts of people who naturally believe in the concept of “delivery as code.” They do not necessarily have to be pure operations people. A fair number have a release engineering, backend engineering, or whitebox testing background that has straddled the development and operational worlds through their work.

This deep exposure to the operational side is very important. It not only enables Tools & Automation Engineers to grasp the concepts necessary to understand and automate the service ecosystem effectively, it also positions them to create solutions that integrate with operational and infrastructure tooling in ways that are more resilient to operational failure conditions than those built by more conventional programmers tend to be.

Organizational Details

Tools & Automation Engineers are often the most effective when they are organizationally situated with those responsible for operationally maintaining the “ilities” of the service stack. The reason for this is that the operational environment forms the foundation of the customer’s service experience. By staying intimately aware of operational service conditions, Tools & Automation Engineers can build in the mechanisms that support the 5S pillars at the point where they matter most, and then work backward through the delivery lifecycle to improve the delivery team’s ability to build and maintain the desired service “ilities.”

In organizations that separate development and operations, having Tools & Automation Engineering within the operations team helps Tools & Automation Engineers stay attuned to operational conditions. It also gives operations staff a means to get help to create capabilities they need to automate and support the production ecosystem but either do not have the skills or the time to deliver themselves. Being both aware of operational conditions and well versed in software development, Tools & Automation Engineers can also aid Service Engineering Leads in communicating and bridging between development and operations.

In organizations where delivery teams build and run software, Tools & Automation Engineers provide cloud and CI/CD expertise for the delivery team and ensure that there is dedicated capacity to build and support delivery and operational automation. This is important for preventing tooling designed to support 5S pillars from being deprioritized in favor of new service feature functionality.

Workflow and Sync Points

While Tools & Automation Engineers participate as part of the team responsible for the production service, they are intentionally not part of any on-call, Queue Master, or Service Engineering Lead rotation. This is done to ensure that they can respond to any threats to the 5S pillars in the delivery ecosystem. However, despite this difference, they are still an integral part of team workflow and sync point mechanisms.

As you will see in Chapter 14, “Cycles and Sync Points,” being an important part of helping improve and stabilize Service Engineering maturity means using the delivery team’s workflow and participating in its synchronization and improvement mechanisms. This might seem a bit odd at first, especially as tools and automation work tends to look a lot more like typical development work than operational work; however, besides improving shared awareness, Tools & Automation Engineering plays a very important role in energizing improvement efforts.

Most of the tools that Tools & Automation Engineers end up building are heavily influenced by improvement needs that arise from the team’s retrospectives and strategic review sessions. The reasons for this quickly become apparent. For instance, it doesn’t make sense to automate the deployment of more services if the ones that you have already automated aren’t working reliably. The same goes for instrumentation and troubleshooting tools, build frameworks, and repositories. If any of these is too brittle or unreliable, then creating more of the same will only make things worse.

By participating in retrospectives and strategic review sessions, Tools & Automation Engineers can learn more about the pain points experienced in the operational environment with much deeper context. They can use the sessions to ask questions and propose further exploratory work to surface root causes and come up with tooling solutions to help the team overcome them.

Being part of retrospectives and the workflow also allows Tools & Automation Engineers to spot friction and risk points in work the team is receiving and performing that might be reducible through new or augmented tooling. For instance, a developer or operational person might get used to the packaging and deployment of a particular service being slow and cumbersome. A tooling person can look at the problem with a fresh set of eyes and potentially find a way to fix its root cause.

For Tools & Automation Engineering, work tends to behave quite similarly to a development Kanban. Most work is usually known going into the planning meeting, where it can be prioritized and ordered in the Ready column for tools engineers to pick up throughout the week. Work items can come into the workflow in an expedited way for tooling engineers, though just like expedited Service Engineering work, the same general post-expedited reviews should still take place to see if there are ways to minimize the need for such actions in the future.

While there is some flexibility to allow work items to be somewhat larger in size than is allowed for other Service Engineering work, in general most of it should still be sized to be no more than one or two days’ worth of effort. This both helps improve exposure of what is actually going on and keeps items moving across the board.

Summary

Delivering effective automation is an increasingly critical element needed to ensure the success of modern IT organizations. If done well and in collaboration with the delivery organization and management, it can help improve the organization and maintainability of the infrastructure, software, and services within it. But simply putting tools in place is neither a safe nor sustainable approach. The 5S pillars from Lean of seiri (sort), seiton (set in order), seiso (maintain), seiketsu (standardize), and shitsuke (sustain) allow you to build the situational awareness and predictable structures that successful automation efforts require.

As the delivery team improves its knowledge and management of the delivery ecosystem, it should consider creating a dedicated Tools & Automation Engineering function that sits within the operationally oriented Service Engineering area in order to provide tooling and automation solutions that help the organization meet customer target outcomes within their operational context.
