Chapter 8

Embarking on the DevOps Journey

Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.

Melvin Conway

The IT industry is full of people with strong opinions about how best to organize IT service delivery teams. Some insist that separate development and operations teams are necessary to maintain team focus and attract people with the right skills, as well as to meet any separation of duties requirements. Others believe that modern cloud and service delivery tools are good enough that there is no need for special operations skills, let alone having a dedicated team to handle operations work. Then there are those who believe that delivering services is so unique that it necessitates creating a dedicated service-focused “DevOps” team separate from any others to handle it.

Regardless of your personal preference, the factors that really determine the optimal delivery and support team structure for your organization are the dynamics and constraints of your delivery ecosystem. This is why blindly adopting an organizational structure used by another company probably won’t meet your needs. Not only that, but as ecosystem conditions change, so can the friction and risk you and your team encounter, as the mechanisms used to maintain situational awareness, learning, and improvement start to lose their effectiveness. This deterioration can sometimes be so dramatic that team structures and ways of working must be adjusted to remain effective. Such complexity is why the topic of organizational structures is weighty enough to deserve its own book.1

1. For those interested in such a book, I highly recommend Team Topologies: Organizing Business and Technology Teams for Fast Flow by Matthew Skelton and Manuel Pais (IT Revolution Press; September 17, 2019), which does an excellent job of analyzing the suitability of various DevOps organizational structures.

Responding to change by constantly reorganizing teams also doesn’t work. Not only is it disruptive, but the stress it places on team members, along with their misreading of ecosystem conditions, can cause teams to become less effective.

A better approach is to understand the conditions that affect team effectiveness. From there, team members and management can work together to build in stable mechanisms that regularly measure their health. This allows deterioration to be spotted, its root cause to be found, and countermeasures to be put in place to restore and even improve team effectiveness.

You know from the earlier chapters that effectiveness depends on striking the right balance between friction, risk, situational awareness, and learning so that you can make and execute the best set of decisions at the right time to make progress toward the customer’s target outcome. While the importance of each factor can vary over time, the conditions that determine that balance fortunately stay relatively consistent.

The first of these factors is the flow of information necessary for team members to make the most appropriate decisions at the right time to meet customer needs. What makes this tricky is that not all information is “necessary.” To be necessary, information needs to be accurate, available, and relevant enough to ensure the best decision to help the organization’s customers in their pursuit of their target outcomes. It also needs to enable the decision to be made in a manner that leads to timely action. No one benefits from a perfect decision that is executed well after it is necessary or relevant.

The second factor is that there needs to be enough information flow across the organization to maintain sufficient cross-organizational alignment. What makes this different from the first factor is that the information needed for alignment can differ between teams and even team members. Success means ensuring there is sufficient shared situational awareness of any activities being performed to troubleshoot, change, or improve ecosystem conditions and capabilities, in order to avoid potentially dangerous “decision collisions” between teams and team members.

Finding the right balance of what information needs to flow where, at what rate, and at what time takes more than measuring information flow rates and coverage and adjusting to close any gaps that form. In fact, it is better to approach the challenge from the bottom up by designing the organizational structure and any incentive models in a way that fosters a strong sense of ownership and pride among team members for the whole of the services and capabilities they are delivering rather than just the piece they work on. Feeling a sense of ownership and pride not only makes work more fulfilling, but also encourages team members to use their knowledge and skills to collaboratively deliver a whole set of healthy and well-maintained services that sustainably create the conditions necessary to achieve the target outcomes.

Achieving these conditions is far more involved than acquiring a bunch of tools, moving to the cloud, or placing operational responsibilities with a team and calling it “DevOps” or “Site Reliability Engineering” (SRE). In order to build a sense of ownership and pride, the team first needs to know and share in the value of the intent behind what you are trying to accomplish. This is why it is important that the structure aids in communicating a clear and consistent business vision that details the problems the target outcomes are trying to solve, along with any constraints that need to be considered when delivering the services. Mechanisms within the structure also need to regularly measure the amount of ordered “knowability” of your delivery domain. This helps everyone understand the areas and levels of “fog” that might hinder information flow.

These next few chapters are meant to be a practical, high-level overview of these challenges, with some tips and models of ways of working that you might find helpful as you embark on your DevOps journey. They are based on patterns that I have personally found helpful in my own experience, and they integrate much of the thinking espoused in Chapters 3 through 7.

The Service Delivery Challenge

It might not be apparent at first, but both BigCo and FastCo face much the same problem. Both have failed to establish the level of situational awareness within their delivery ecosystems necessary to ensure their services efficiently and effectively meet their customers’ needs. What is more interesting is that the root causes of this failure arise from shared origins.

Despite having very different models, both organizations still manage to retain many old habits and outdated mental models inherited from traditional IT. Rather than trying to determine how best to help the customer reach their target outcomes within the overall business vision, this mental framing causes those delivering and operating the services to focus on aspects of delivery that more closely align with their areas of expertise.

At first glance this narrowing of focus might seem to be a great way of avoiding unnecessary distraction. It was also one that could work when software was handed to a customer to run on their own infrastructure. The split in roles created enough plausible deniability that anything beyond one side’s immediate area of responsibility could easily be deflected.

On-demand IT service delivery invalidates plausible deniability by eliminating this division. Customers rightly believe the entire ownership of the service lies with the service provider. From the customer’s perspective it is the service provider’s responsibility to make sure not only that everything works as expected but that the provider does whatever it takes to know what those customer expectations are.

With the obligation to build and operate the service entirely in the service provider’s remit, there simply is no room for the sort of ambiguity and finger-pointing that can cover for poor communication flow across the service ecosystem. As a result, service providers must find ways to understand what customers need from the services and then establish strong feedback, reflection, and improvement mechanisms necessary to deliver them to meet those obligations.

Traditional Delivery Fog in the Service World


Figure 8.4
Delivery fog.

Let’s look back at BigCo and FastCo to see exactly what is causing both of them to fail to establish the situational awareness and improvement loops essential to succeed.

There are few surprises in the failures at BigCo. They have simply replaced the customer’s IT operations with their own. Despite possibly having more advanced datacenters, tools, and dedicated staff, all the old split-responsibility habits that shaped team ways of thinking over so many years as a traditional software company still linger. Many of the same organizational structures remain in place, fragmenting situational awareness and any chance of developing shared responsibility for ensuring the service stays attuned to what the customer desires.

As if this were not bad enough, the fragmentation caused by the organizational structure encourages the organization to assess and reward teams based upon more traditional and locally focused metrics. This creates little incentive to close any awareness gaps, let alone collaborate on work outside the immediate remit of the team that is unlikely to be rewarded by the team performance assessment mechanism.

While FastCo has no legacy organizational structures to disrupt information flow, there are still signs that some traditional mental models endure. For starters, developers are often assessed on delivery speed and output, not on understanding and achieving customer outcomes. Nobody intentionally tries to build anything to meet outcomes they are not incentivized to achieve.

The problems do not end there. Few developers are motivated to learn how to manage the operational side of a service environment any more than the minimum absolutely necessary. What many fail to realize is that knowing your own code and having significant technical knowledge does not automatically make you the best person to both operationalize and operate it. This lack of operational knowledge and available expert guidance can lead to poorly crafted or dangerously unsuitable software being released with little more than a view that it seemed to work fine in the development environment.

The lack of operational awareness not only creates challenges with service design and coding but also can have a real impact on the effectiveness of service hosting configuration design, instrumentation, and operational management mechanisms. Many mistakenly believe that operational details, and even the whole of Technical Operations, can be easily made to go away by pulling in enough operational tools and solutions from outside cloud platform and infrastructure providers. But such thinking is as flawed as believing that you can become as talented as a famous basketball player by wearing the same brand of shoes as them.

For starters, having great tools does not necessarily mean you know how to use them effectively. In fact, the increased depth of complexity and potential lack of transparency they can create in the service stack can make it harder to understand the nuances and interactions within and across the stack, even when you are actively trying to look for them. This increases the risk that some element critical for service health will be missed.

You can also introduce otherwise avoidable problems by simply trusting that external providers who deliver operational functions will automatically match the service quality levels your customer expects. FastCo discovered this the hard way by committing to service level guarantees far higher than those of their upstream provider with little forethought on how to close the gap.
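
To see why such a gap matters, it helps to work through a rough, purely illustrative calculation. When every customer request depends on an upstream provider, availabilities compound multiplicatively, so your overall availability can never exceed the provider’s; any guarantee above it has to be closed with redundancy or other engineering work. The sketch below uses made-up numbers, not FastCo’s actual figures:

    # Illustrative only: serial availability compounds multiplicatively.
    upstream_availability = 0.999   # provider guarantees "three nines"
    own_availability = 0.9995       # even with very reliable components of your own

    overall = upstream_availability * own_availability
    print(f"Overall availability: {overall:.4%}")   # roughly 99.85%

    minutes_per_year = 365 * 24 * 60
    expected_downtime = (1 - overall) * minutes_per_year
    print(f"Expected downtime: about {expected_downtime:.0f} minutes per year")  # ~788

    # A 99.95% commitment allows only about 263 minutes of downtime per year,
    # so the gap cannot be closed without redundancy or a renegotiated guarantee.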

Uptime and reliability are far from the only potential mismatches. There can also be others in the performance, scalability, defect remediation, and security offered by upstream providers that, without sufficient operational awareness, can remain hidden until it is painfully expensive, disruptive, or time consuming to mitigate. This can lead to unexpected behavior and embarrassing incidents that can put your relationship with your customer base at risk.

To deliver services that meet the objectives and expectations of your customer, you not only need to overcome the various obstacles that can get in the way of your awareness of service ecosystem dynamics, but you also need to be sure your service aligns with the “ilities” customers require.

The Challenge of the “ilities”

It is far too easy for service delivery teams to become unclear about or pay too little attention to the specific service qualities that the customer expects to be in place to help them achieve their target outcomes. Customers often do not help matters. Some focus on the solution or functional/procedural details for how it is delivered (as discussed in Chapter 2, “How We Make Decisions”) rather than on how various service qualities might affect them. Still, it doesn’t take a genius to know that not delivering what the customer expects is not a strategy for success.

There can be a lot of service qualities that a customer might find important. As most of these qualities end in “ility,” I prefer to refer to them in the same shorthand way as Tom Gilb, a well-known systems engineer and early Agilist, by the term ilities. Ilities tend to fall into two categories:

  • Service performance requirements (such as availability, reliability, and scalability) to meet customer needs.

  • Risk mitigation (things like securability, recoverability, data integrity, upgradability, auditability, and any other necessary legal compliance capabilities) to meet customer, legal, and regulatory requirements.
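
One lightweight way to keep such qualities from staying vague is to record each one as a measurable target along with how it will be verified. The sketch below is hypothetical; the field names and values are assumptions meant only to show the idea, not a prescribed format:

    # Hypothetical sketch: capturing ilities as explicit, measurable targets.
    ility_targets = [
        {
            "ility": "availability",
            "category": "service performance",
            "target": "99.9% of requests succeed, measured monthly",
            "verified_by": "synthetic checks and load-balancer metrics",
        },
        {
            "ility": "recoverability",
            "category": "risk mitigation",
            "target": "restore service within 60 minutes of losing a region",
            "verified_by": "quarterly restore exercise against real backups",
        },
    ]

    for t in ility_targets:
        print(f'{t["ility"]}: {t["target"]} (checked via {t["verified_by"]})')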

Ilities are often overlooked for a number of reasons. One of the larger challenges is that when they do surface, they are often labeled as “nonfunctional requirements.” The problem with this term is that it is easy to believe that developers deliver functional requirements, and that any “nonfunctional” ones are either already naturally present or are delivered by someone else. Proving that a nonfunctional requirement is not present is often left to the testing team, who must then demonstrate unequivocally both that it is missing and that remedying the situation requires otherwise unplanned work. Not only is it tough to prove something doesn’t exist, especially when there can be enough differences between test and production environments to create plausible deniability, but anyone arguing that work must be done to satisfy the nonfunctional requirement before delivery must also fight the bias against unplanned work that can delay the overall release.

Delivering the ilities that your customers need also faces the problem that their importance and how they are perceived can differ depending on the nature of the domain where the customer uses the service.

Just as air requirements differ between a runner and a deep sea diver, the ilities needed for a service used by an investment bank trading desk are very different from those needed for a free consumer email service, and both are entirely different from those required for a service that manages hospital medical records. Even within a service type, the ilities might vary widely depending upon service hosting and consumption locations (United States vs. Europe vs. China) and customer type (consumer vs. business vs. government).

Such variables can make what is suitable for one customer (like a US hospital group) completely unfit for another (such as a Chinese government-run hospital). As a result, service providers who know little about an industry, market, or customer can easily either fall foul of customer needs or find themselves limited in what/where/how they can deliver or otherwise uneconomically exposed.

Another challenge with ilities is that they require understanding what delivering them effectively means within your service ecosystem. What elements contribute to delivering them? Are they built into the service, or do they require specific actions to maintain? What is your ability to track how well you are delivering them and to counteract any events or conditions that might put their suitability at risk? Do the means required to deliver them limit your ability or the options available to respond to new problems or opportunities?

All of these can be surprisingly difficult to do, even in less complex, ordered operational domains. Ilities like availability, scalability, security, and data integrity usually depend upon multiple delivery aspects working together in a known and complementary way. Any one problem or misunderstanding, whether from a technical limitation, broken interaction, or even an environmental hygiene issue, can create a cascade of failures that can both hide the cause and obfuscate the severity of the real and damaging impact on the customer.

With such poor information flow, missing skills, and poorly understood ilities, how can an organization build up enough ecosystem and operational awareness to confidently deliver services that will meet the needs and expectations of customers?

The Path to Eliminating Service Delivery Fog


Figure 8.5
Eliminating service delivery fog takes more than having a fan.

Eliminating service delivery fog begins with first knowing the outcomes you are trying to achieve and the intent behind trying to attain them. This allows you to then identify the service ilities that contribute to achieving those outcomes. With that information, you can then start the journey to identify the relative importance of various sources of information, where it needs to flow, and at what rate.

There are a number of ways teams can try to close this awareness gap, from reaching out to customers to instrumenting services and the dynamics of the surrounding ecosystem to better understand how they are used and how well they align with customer expectations.

However, in order to cut through any customer biases and ensure various teams come to the same aligned conclusions, teams first need manager-level leadership to convey an overall business vision. Rather than telling team members what to do, managers need to instead work with each one to help them identify how their work ought to contribute to that vision. As you will soon see, such an approach helps team members find the right signal in the noise and take ownership of delivering the right solution.

The Role of Managers in Eliminating Service Delivery Fog

Many mistake the role of manager as one geared toward telling people what to do. Some even go so far as thinking it is the manager’s responsibility to tell staff how to perform their work.

Anyone who has been effective at management knows the reality is necessarily quite a bit different. For one, managers are not all-knowing entities. Unless the delivery ecosystem is in the predictably ordered Cynefin Clear domain (as described in Chapter 5, “Risk”), they are far less likely than their staff to have both more detailed subject matter expertise and more up-to-date knowledge of the service technology stack. Even if a manager somehow had both the breadth of awareness and the depth of expertise, not only would they be incapable of scaling sufficiently to handle a team of reasonable size and responsibility, but they would also create an unstimulating working environment that needlessly wasted the capabilities of their own staff while limiting everyone’s ability to grow and improve.

Such a combination is, at best, a recipe for unhappiness and failure.

While managers might not know everything, their position does provide them with three important and unique capabilities:

  • They are part of the organization’s leadership structure, which gives them a seat at the table that can be used as a conduit to capture, convey, and provide clarity around the intent of the business vision.

  • They act as their team’s chief advocate, helping to articulate any capability gaps or obstacles that the team needs leadership’s help to overcome.

  • They can help leadership map delivery work to customer ilities, and communicate any important team findings that might affect the larger strategy.

There are a number of ways that managers can do this well, one of which is the commander’s intent pattern, as detailed in Chapter 3, “Mission Command.”

Another unique capability of managers is their ability to get the attention of, and interact with, peers across the organization. This enables the manager to see across the organization, facilitating information flow by helping remove communication barriers that might distort context or limit team members’ ability to learn. The manager can also provide outreach and foster collaboration between teams.

Finally, managers are well placed to see the overall dynamics of their own team. This allows them to act as a mentor and coach to help the team and its members learn and improve. Mentoring is one important way of doing this, aspects of which are discussed in the “Coaching Practice” section in Chapter 7, “Learning.” Part of this mentoring also includes giving team members the space and air cover to step back from the dynamics of the delivery ecosystem from time to time to experiment and fail in a safe and informative way.

Allowing for failure may sound anathema to management. That might be true if carelessly high risks are allowed. But managers can help define and defend experimentation boundaries that allow teams to safely try new technologies and approaches, and allow team members to better understand the delivery ecosystem, innovate, and ultimately improve. When failures are tolerated, everyone can openly discuss what happened and why something did not work so that they can come up with a potentially better approach in the future.

Identifying What You Can or Cannot Know

Even though they play an important role in the process, managers obviously cannot singlehandedly eliminate all service delivery fog. The key elements in your service delivery ecosystem that contribute to those ilities that are meaningful or important to your customer not only will span systems, software, people, skills, and suppliers, they will also be affected by the interplay between them. Some, from the software and systems themselves to configuration, access control subsystems, network services, and the data itself, contribute directly. Others, from troubleshooting and recovery tools to skilled staff and instrumentation systems, are supporting mechanisms that aid in finding and fixing ilities that start to drift outside acceptable parameters.

Understanding what these elements are, the relative importance of the ilities they deliver, and how they contribute to them will help you build a map that you can use to learn and improve your ability to deliver them. What is particularly important about building this map is that you will soon realize how much control you have over the ilities these elements provide. Elements will fall into three distinct categories: the known “manageables,” the known “unmanageables,” and the unknowns.
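
A simple way to start building such a map is to record, for each element, the ilities it contributes to and which of the three categories it currently falls into. The sketch below is hypothetical; the elements and their classifications are placeholders you would replace with your own findings:

    # Hypothetical element map. Categories: "known manageable",
    # "known unmanageable", and "unknown".
    element_map = {
        "order database": {
            "contributes_to": ["availability", "data integrity"],
            "category": "known manageable",
        },
        "cloud object storage": {
            "contributes_to": ["recoverability", "scalability"],
            "category": "known unmanageable",   # operated by the provider
        },
        "legacy billing batch job": {
            "contributes_to": ["data integrity"],
            "category": "unknown",              # nobody can explain how it behaves
        },
    }

    # Unknowns are where unmanaged risk hides, so surface them first.
    unknowns = [name for name, e in element_map.items() if e["category"] == "unknown"]
    print("Investigate first:", unknowns)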

Let’s look at how each category can help you and your team gauge and ultimately shape your organization to best manage your operational risk.

What You Know and Can Manage

Known “manageables” consist of all the elements in your service ecosystem that directly contribute to the operational ilities of your service and that you know you can demonstrably control. This is the category where, in a perfect world, all the elements necessary for all critical or important ilities would live.


Figure 8.6
The operational ilities of an On/Off switch should all be known and manageable.

However, being known and “manageable” is only useful if you are able to monitor, respond to, and recover from any events that jeopardize the delivery of those ilities in a timely and effective way. Let’s take the example of transaction responsiveness. Most customers will have expectations regarding what length of delay is acceptable for a transaction. Having known “manageable” elements that you can adjust or change to shorten growing delays is not terribly useful if you are not able to make those adjustments at a speed or in a way customers can tolerate.

Similarly, it is not enough to simply have an element that is directly and knowingly “manageable.” The supporting mechanisms that find any ility delivery problems with that element and correct them also need to be responsive enough. This is one of the most common ways that organizations trying to manage ilities fail. Some will put in place elaborate tools and technologies for managing such ilities as scalability and recoverability. Yet I have seen them either lack sufficient in-house skills to use those tools effectively, or split responsibilities across the organization among teams that are unable or unwilling to contribute effectively to meeting customer needs.

It is important to know such elements and how they contribute to the success of the service, know and test their thresholds, track their current state in production, and ensure that any groups that must act when thresholds are approached are capable of doing so in an acceptable timeframe with an acceptable level of accuracy.
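
As a minimal illustration of that last point, a known “manageable” usually pairs a tested threshold with instrumentation and a clearly defined responder. The sketch below assumes made-up thresholds and a placeholder alerting function; it is meant only to show the shape of the loop, not any particular tool’s API:

    # Minimal sketch: compare observed state against tested thresholds and
    # alert whichever group is responsible for acting before the ility is breached.
    THRESHOLDS = {
        "p95_response_ms": 800,   # tested limit customers will tolerate
        "error_rate_pct": 1.0,
    }

    def notify_responders(metric, value, limit):
        # Placeholder for a real paging or alerting integration.
        print(f"ALERT: {metric}={value} is at or beyond the limit of {limit}")

    def check_thresholds(observed):
        for metric, limit in THRESHOLDS.items():
            value = observed.get(metric)
            if value is not None and value >= limit:
                notify_responders(metric, value, limit)

    # Example reading pulled from whatever instrumentation you already have.
    check_thresholds({"p95_response_ms": 950, "error_rate_pct": 0.4})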

The Known Unmanageables

Figure 8.7
It took a while, but Sam’s team finally found a way of demonstrating a known unmanageable to him.

Known “unmanageables” are all the elements in your service ecosystem that you know contribute to the operational ilities of your service but are not under your direct control. The most common of these today are cloud and PaaS services, such as AWS and Azure, that are relied upon for critical infrastructure and platform services. Just as common as these services but often overlooked are components delivered by commercial hardware providers (such as providers of network, compute, or storage equipment) or software providers (such as database, payment, or security COTS packages), smaller but often just as important caching and gateway service providers, as well as providers of externally run build, test, and deployment services.

Knowing their importance and what they contribute is extremely helpful. But the big thing with known “unmanageables” is to minimize their risk to your ecosystem. With systems and software, this might involve building robust monitoring, troubleshooting, and recovery procedures, along with using more standard or proven versions and configurations in places where they provide critical capabilities.

Another even more effective approach is to engineer and operate your services in a way that is intentionally resilient to any ility issues those “unmanageables” might create. This is very much the thinking behind Netflix’s Simian Army and Chaos Engineering. Netflix knew that hosting significant components of a streaming service on AWS instances could expose them to unpredictable performance characteristics that might cause lags and frame drops that could hurt the customer’s viewing experience. They also knew that engineers have a tendency to ignore or downplay the threat of ecosystem problems, creating software that struggles to deal with the unexpected.

By using tools like Chaos Monkey and Latency Monkey, engineers knew that “once in a blue moon” sorts of problems would definitely occur, forcing them to engineer their software to deal with them resiliently. This results in lower risk to the business and, ultimately, more predictable delivery of the ilities the customer expects.
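
You do not need Netflix’s tooling to apply the same idea. The sketch below, with purely hypothetical probabilities and delays, occasionally injects latency in front of a downstream call so that timeout and retry handling gets exercised long before a real provider hiccup does it for you:

    # Hypothetical latency-injection wrapper in the spirit of Latency Monkey.
    import random
    import time

    def with_injected_latency(call, probability=0.05, max_delay_s=2.0):
        """Occasionally delay a call so timeout and retry handling gets exercised."""
        def wrapped(*args, **kwargs):
            if random.random() < probability:
                time.sleep(random.uniform(0.1, max_delay_s))
            return call(*args, **kwargs)
        return wrapped

    def fetch_recommendations(user_id):
        # Stand-in for a real call to an "unmanageable" upstream dependency.
        return ["title-1", "title-2"]

    fetch_recommendations = with_injected_latency(fetch_recommendations)
    print(fetch_recommendations("user-42"))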

The Unknowns

Figure 8.8
Billy could never be sure what, if any, dangerous unknowns were out there.

Even if you have the ability to manage an ility, it is going to be of little help if you do not know its constituent elements, their state, or how to best operate them to deliver what the customer needs. Ilities of this sort are the unknowns. These unknowns introduce a level of unmanaged risk that can damage or even destroy your business.

Unfortunately, given the poor information flow and weak grasp of the service side of delivery described earlier, being an unknown is the default state for most ilities. It is also where most service delivery organizations find themselves at the beginning of their service engineering journey. Even if you solve the problem of determining what customers need, you still have a lot of work to do to close the awareness gap within your delivery ecosystem so you can deliver it.

Ways the Team Can Eliminate Service Delivery Fog

There are several other ways that teams can, regardless of organizational structure, enhance their shared situational awareness, all while continuing to learn and improve. Organizing work to make it clearer what is going on (covered in Chapter 12, “Workflow”), having good but lightweight ways of regularly sharing information and ideas (covered in Chapter 14, “Cycles and Sync Points”), and having intelligent and ility-meaningful instrumentation (covered in Chapter 11, “Instrumentation”) are obvious ways to help. There are also ways to restructure governance processes (covered in Chapter 15, “Governance”) and automation practices (covered in Chapter 10, “Automation”) so that they enhance rather than hinder awareness.

Before getting into specific patterns and practices, it is good for you and your team to capture and close any maturity gaps that can hinder information flow and effective delivery. This is discussed in the next chapter, along with the first of the “duty” roles that a team can use to facilitate operational information flow and spot improvement areas within and between teams. I have used the pattern myself in a number of organizations and have found it a great way of breaking down silos while simultaneously improving cross-team technical coordination and expertise.

The second “duty” role, the Queue Master, is far more critical. It is covered in Chapter 13, “Queue Master,” and is a way to give each team member a chance to step back from their normal work to improve team situational awareness and to spot learning and improvement areas by monitoring the work flowing through the team.

Summary

Effective service delivery requires that teams have the right amount of information flow to know the intent behind what they are trying to accomplish, the ilities that customers need to be present, and the elements within the ecosystem that can contribute to delivering them. By restructuring how people look at their roles and the larger ecosystem, teams can reduce the service delivery fog that so often obscures their path to success.
