Chapter 8

Embarking on the DevOps Journey

Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.

Melvin Conway

The IT industry is full of people with strong opinions about how best to organize IT service delivery teams. Some insist that separate development and operations teams are necessary to maintain team focus and attract people with the right skills, as well as to meet any separation of duties requirements. Others believe that modern cloud and service delivery tools are good enough that there is no need for special operations skills, let alone having a dedicated team to handle operations work. Then there are those who believe that delivering services is so unique that it necessitates creating a dedicated service-focused “DevOps” team separate from any others to handle it.

Regardless of your personal preference, the factors that really determine the optimal delivery and support team structure for your organization are the dynamics and constraints of your delivery ecosystem. This is why blindly adopting an organizational structure used by another company probably won’t meet your needs. Not only that, but as ecosystem conditions change, so can the friction and risk you and your team encounter, as the mechanisms used to maintain situational awareness, learning, and improvement start to lose their effectiveness. This deterioration can sometimes be so dramatic that team structures and ways of working must be adjusted to remain effective. Such complexity is why the topic of organizational structures is weighty enough to deserve its own book.1

1. For those interested in such a book, I highly recommend Team Topologies: Organizing Business and Technology Teams for Fast Flow by Matthew Skelton and Manuel Pais (IT Revolution Press; September 17, 2019), which does an excellent job of analyzing the suitability of various DevOps organizational structures.

Responding to change by constantly reorganizing teams also doesn’t work. Not only is it disruptive, but the stress it places on team members, along with their misreading of ecosystem conditions, can cause teams to become less effective.

A better approach is to understand the conditions that affect team effectiveness. From there, team members and management can work together to build in stable mechanisms that regularly measure their health. This allows deterioration to be spotted, its root cause to be found, and countermeasures to be put in place to restore and even improve team effectiveness.

You know from the earlier chapters that effectiveness depends on striking the right balance between friction, risk, situational awareness, and learning so that you can make and execute the best set of decisions at the right time to make progress toward the customer’s target outcome. While the importance of each factor can vary over time, the conditions that determine that balance fortunately stay relatively consistent.

The first of these factors is the flow of information necessary for team members to make the most appropriate decisions at the right time to meet customer needs. What makes this tricky is that not all information is “necessary.” To be necessary, information needs to be accurate, available, and relevant enough to ensure the best decision to help the organization’s customers in their pursuit of their target outcomes. It also needs to enable the decision to be made in a manner that leads to timely action. No one benefits from a perfect decision that is executed well after it is necessary or relevant.

The second factor is that there needs to be enough information flow across the organization to maintain sufficient cross-organizational alignment. What makes this different from the first factor is that the information needed for alignment can differ between teams and even team members. Success means ensuring there is sufficient shared situational awareness of any activities being performed to troubleshoot, change, or improve ecosystem conditions and capabilities, in order to avoid potentially dangerous “decision collisions” between teams and team members.

Finding the right balance of what information needs to flow where, at what rate, and at what time takes more than measuring information flow rates and coverage and adjusting to close any gaps that form. In fact, it is better to approach the challenge from the bottom up by designing the organizational structure and any incentive models in a way that fosters a strong sense of ownership and pride among team members for the whole of the services and capabilities they are delivering rather than just the piece they work on. Feeling a sense of ownership and pride not only makes work more fulfilling, but also encourages team members to use their knowledge and skills to collaboratively deliver a whole set of healthy and well-maintained services that sustainably create the conditions necessary to achieve the target outcomes.

Achieving these conditions is far more involved than acquiring a bunch of tools, moving to the cloud, or placing operational responsibilities with a team and calling it “DevOps” or “Site Reliability Engineering” (SRE). In order to build a sense of ownership and pride, the team first needs to know and share in the value of the intent behind what you are trying to accomplish. This is why it is important that the structure aids in communicating a clear and consistent business vision that details the problems the target outcomes are trying to solve, along with any constraints that need to be considered when delivering the services. Mechanisms within the structure also need to regularly measure the amount of ordered “knowability” of your delivery domain. This helps everyone understand the areas and levels of “fog” that might hinder information flow.

These next few chapters are meant to be a practical, high-level overview of these challenges, with some tips and models of ways of working that you might find helpful as you embark on your DevOps journey. They are based on patterns that I have personally found helpful in my own experience, and they integrate much of the thinking espoused in Chapters 3 through 7.

The Service Delivery Challenge

It might not be apparent at first, but both BigCo and FastCo face much the same problem. Both have failed to establish the level of situational awareness within their delivery ecosystems necessary to ensure their services efficiently and effectively meet their customers’ needs. What is more interesting is that the root causes of this failure arise from shared origins.

Despite having very different models, both organizations still manage to retain many old habits and outdated mental models inherited from traditional IT. Rather than trying to determine how best to help the customer reach their target outcomes within the overall business vision, this mental framing causes those delivering and operating the services to focus on aspects of delivery that more closely align with their areas of expertise.

At first glance this narrowing of focus might seem to be a great way of avoiding unnecessary distraction. It was also one that could work when software was handed to a customer to run on their own infrastructure. The split in roles created enough plausible deniability that anything beyond one side’s immediate area of responsibility could easily be deflected.

On-demand IT service delivery invalidates plausible deniability by eliminating this division. Customers rightly believe the entire ownership of the service lies with the service provider. From the customer’s perspective it is the service provider’s responsibility to make sure not only that everything works as expected but that the provider does whatever it takes to know what those customer expectations are.

With the obligation to build and operate the service entirely in the service provider’s remit, there simply is no room for the sort of ambiguity and finger-pointing that can cover for poor communication flow across the service ecosystem. As a result, service providers must find ways to understand what customers need from the services and then establish strong feedback, reflection, and improvement mechanisms necessary to deliver them to meet those obligations.

Traditional Delivery Fog in the Service World


Figure 8.4
Delivery fog.

Let’s look back at BigCo and FastCo to see exactly what is causing both of them to fail to establish the situational awareness and improvement loops essential to succeed.

There are few surprises in the failures at BigCo. They have simply replaced the customer’s IT operations with their own. Despite possibly having more advanced datacenters, tools, and dedicated staff, all the old split-responsibility habits that shaped team ways of thinking over so many years as a traditional software company still linger. Many of the same organizational structures remain in place, fragmenting situational awareness and any chance of developing shared responsibility for ensuring the service stays attuned to what the customer desires.

As if this were not bad enough, the fragmentation caused by the organizational structure encourages the organization to assess and reward teams based upon more traditional and locally focused metrics. This creates little incentive to close any awareness gaps, let alone collaborate on work outside the immediate remit of the team that is unlikely to be rewarded by the team performance assessment mechanism.

While FastCo has no legacy organizational structures to disrupt information flow, there are still signs that some traditional mental models endure. For starters, developers are often assessed on delivery speed and output, not on understanding and achieving customer outcomes. Nobody intentionally tries to build anything to meet outcomes they are not incentivized to achieve.

The problems do not end there. Few developers are motivated to learn how to manage the operational side of a service environment any more than the minimum absolutely necessary. What many fail to realize is that knowing your own code and having significant technical knowledge does not automatically make you the best person to both operationalize and operate it. This lack of operational knowledge and available expert guidance can lead to poorly crafted or dangerously unsuitable software being released with little more than a view that it seemed to work fine in the development environment.

The lack of operational awareness not only creates challenges with service design and coding but also can have a real impact on the effectiveness of service hosting configuration design, instrumentation, and operational management mechanisms. Many mistakenly believe that operational details, and even the whole of Technical Operations, can be easily made to go away by pulling in enough operational tools and solutions from outside cloud platform and infrastructure providers. But such thinking is as flawed as believing that you can become as talented as a famous basketball player by wearing the same brand of shoes as them.

For starters, having great tools does not necessarily mean you know how to use them effectively. In fact, the increased depth of complexity and potential lack of transparency they can create in the service stack can make it harder to understand the nuances and interactions within and across the stack, even when you are actively trying to look for them. This increases the risk that some element critical for service health will be missed.

You can also introduce otherwise avoidable problems by simply trusting that external providers who deliver operational functions will automatically match the service quality levels your customer expects. FastCo discovered this the hard way by committing to service level guarantees far higher than those of their upstream provider with little forethought on how to close the gap.
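
To see why such a gap matters, it helps to work through a rough, purely illustrative calculation. When every customer request depends on an upstream provider, availabilities compound multiplicatively, so your overall availability can never exceed the provider’s; any guarantee above it has to be closed with redundancy or other engineering work. The sketch below uses made-up numbers, not FastCo’s actual figures:

    # Illustrative only: serial availability compounds multiplicatively.
    upstream_availability = 0.999   # provider guarantees "three nines"
    own_availability = 0.9995       # even with very reliable components of your own

    overall = upstream_availability * own_availability
    print(f"Overall availability: {overall:.4%}")   # roughly 99.85%

    minutes_per_year = 365 * 24 * 60
    expected_downtime = (1 - overall) * minutes_per_year
    print(f"Expected downtime: about {expected_downtime:.0f} minutes per year")  # ~788

    # A 99.95% commitment allows only about 263 minutes of downtime per year,
    # so the gap cannot be closed without redundancy or a renegotiated guarantee.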

Uptime and reliability are far from the only potential mismatches. There can also be others in the performance, scalability, defect remediation, and security offered by upstream providers that, without sufficient operational awareness, can remain hidden until it is painfully expensive, disruptive, or time consuming to mitigate. This can lead to unexpected behavior and embarrassing incidents that can put your relationship with your customer base at risk.

To deliver services that meet the objectives and expectations of your customer, you not only need to overcome the various obstacles that can get in the way of your awareness of service ecosystem dynamics, but you also need to be sure your service aligns with the “ilities” customers require.

The Challenge of the “ilities”

It is far too easy for service delivery teams to become unclear about or pay too little attention to the specific service qualities that the customer expects to be in place to help them achieve their target outcomes. Customers often do not help matters. Some focus on the solution or functional/procedural details for how it is delivered (as discussed in Chapter 2, “How We Make Decisions”) rather than on how various service qualities might affect them. Still, it doesn’t take a genius to know that not delivering what the customer expects is not a strategy for success.

There can be a lot of service qualities that a customer might find important. As most of these qualities end in “ility,” I prefer to refer to them in the same shorthand way as Tom Gilb, a well-known systems engineer and early Agilist, by the term ilities. Ilities tend to fall into two categories:

  • Service performance requirements (such as availability, reliability, and scalability) to meet customer needs.

  • Risk mitigation (things like securability, recoverability, data integrity, upgradability, auditability, and any other necessary legal compliance capabilities) to meet customer, legal, and regulatory requirements.
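
One lightweight way to keep such qualities from staying vague is to record each one as a measurable target along with how it will be verified. The sketch below is hypothetical; the field names and values are assumptions meant only to show the idea, not a prescribed format:

    # Hypothetical sketch: capturing ilities as explicit, measurable targets.
    ility_targets = [
        {
            "ility": "availability",
            "category": "service performance",
            "target": "99.9% of requests succeed, measured monthly",
            "verified_by": "synthetic checks and load-balancer metrics",
        },
        {
            "ility": "recoverability",
            "category": "risk mitigation",
            "target": "restore service within 60 minutes of losing a region",
            "verified_by": "quarterly restore exercise against real backups",
        },
    ]

    for t in ility_targets:
        print(f'{t["ility"]}: {t["target"]} (checked via {t["verified_by"]})')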

Ilities are often overlooked for a number of reasons. One of the larger challenges is that when they do surface, they are often labeled as “nonfunctional requirements.” The problem with this term is that it is easy to believe that developers deliver functional requirements, and that any “nonfunctional” ones are either already naturally present or are delivered by someone else. Proving that a nonfunctional requirement is not present is often left to the testing team, who must then demonstrate unequivocally both that it is missing and that remedying the situation requires otherwise unplanned work. Not only is it tough to prove something doesn’t exist, especially when there can be enough differences between test and production environments to create plausible deniability, but anyone arguing that work must be done to satisfy the nonfunctional requirement before delivery must also fight the bias against unplanned work that can delay the overall release.

Delivering the ilities that your customers need also faces the problem that their importance and how they are perceived can differ depending on the nature of the domain where the customer uses the service.

Just as air requirements differ between a runner and a deep sea diver, the ilities needed for a service used by an investment bank trading desk are very different from those needed for a free consumer email service, and both are entirely different from those required for a service that manages hospital medical records. Even within a service type, the ilities might vary widely depending upon service hosting and consumption locations (United States vs. Europe vs. China) and customer type (consumer vs. business vs. government).

Such variables can make what is suitable for one customer (like a US hospital group) completely unfit for another (such as a Chinese government-run hospital). As a result, service providers who know little about an industry, market, or customer can easily either fall foul of customer needs or find themselves limited in what/where/how they can deliver or otherwise uneconomically exposed.

Another challenge with ilities is that they require understanding what delivering them effectively means within your service ecosystem. What elements contribute to delivering them? Are they built into the service, or do they require specific actions to maintain? What is your ability to track how well you are delivering them and to counteract any events or conditions that might put their suitability at risk? Do the means required to deliver them limit your ability or the options available to respond to new problems or opportunities?

All of these can be surprisingly difficult to do, even in less complex, ordered operational domains. Ilities like availability, scalability, security, and data integrity usually depend upon multiple delivery aspects working together in a known and complementary way. Any one problem or misunderstanding, whether from a technical limitation, broken interaction, or even an environmental hygiene issue, can create a cascade of failures that can both hide the cause and obfuscate the severity of the real and damaging impact on the customer.

With such poor information flow, missing skills, and poorly understood ilities, how can an organization build up enough ecosystem and operational awareness to confidently deliver services that will meet the needs and expectations of customers?

The Path to Eliminating Service Delivery Fog


Figure 8.5
Eliminating service delivery fog takes more than having a fan.

Eliminating service delivery fog begins with first knowing the outcomes you are trying to achieve and the intent behind trying to attain them. This allows you to then identify the service ilities that contribute to achieving those outcomes. With that information, you can then start the journey to identify the relative importance of various sources of information, where it needs to flow, and at what rate.

There are a number of ways teams can try to close this awareness gap, from reaching out to customers to instrumenting services and the dynamics of the surrounding ecosystem to better understand how they are used and how well they align with customer expectations.

However, in order to cut through any customer biases and ensure various teams come to the same aligned conclusions, teams first need manager-level leadership to convey an overall business vision. Rather than telling team members what to do, managers need to instead work with each one to help them identify how their work ought to contribute to that vision. As you will soon see, such an approach helps team members find the right signal in the noise and take ownership of delivering the right solution.

The Role of Managers in Eliminating Service Delivery Fog

Many mistake the role of manager as one geared toward telling people what to do. Some even go so far as thinking it is the manager’s responsibility to tell staff how to perform their work.

Anyone who has been effective at management knows the reality is necessarily quite a bit different. For one, managers are not all-knowing entities. Unless the delivery ecosystem is in the predictably ordered Cynefin Clear domain (as described in Chapter 5, “Risk”), they are far less likely than their staff to have both more detailed subject matter expertise and more up-to-date knowledge of the service technology stack. Even if a manager somehow had both the breadth of awareness and the depth of expertise, not only would they be incapable of scaling sufficiently to handle a team of reasonable size and responsibility, but they would also create an unstimulating working environment that needlessly wasted the capabilities of their own staff while limiting everyone’s ability to grow and improve.

Such a combination is, at best, a recipe for unhappiness and failure.

While managers might not know everything, their position does provide them with three important and unique capabilities:

  • They are part of the organization’s leadership structure, which gives them a seat at the table that can be used as a conduit to capture, convey, and provide clarity around the intent of the business vision.

  • They act as their team’s chief advocate, helping to articulate any capability gaps or obstacles that the team needs leadership’s help to overcome.

  • They can help leadership map delivery work to customer ilities, and communicate any important team findings that might affect the larger strategy.

There are a number of ways that managers can do this well, one of which is the commander’s intent pattern, as detailed in Chapter 3, “Mission Command.”

Another unique capability of managers is their ability to get the attention of, and interact with, peers across the organization. This enables the manager to see across the organization, facilitating information flow by helping remove communication barriers that might distort context or limit team members’ ability to learn. The manager can also provide outreach and foster collaboration between teams.

Finally, managers are well placed to see the overall dynamics of their own team. This allows them to act as a mentor and coach to help the team and its members learn and improve. Mentoring is one important way of doing this, aspects of which are discussed in the “Coaching Practice” section in Chapter 7, “Learning.” Part of this mentoring also includes giving team members the space and air cover to step back from the dynamics of the delivery ecosystem from time to time to experiment and fail in a safe and informative way.

Allowing for failure may sound anathema to management. That might be true if carelessly high risks are allowed. But managers can help define and defend experimentation boundaries that allow teams to safely try new technologies and approaches, and allow team members to better understand the delivery ecosystem, innovate, and ultimately improve. When failures are tolerated, everyone can openly discuss what happened and why something did not work so that they can come up with a potentially better approach in the future.

Identifying What You Can or Cannot Know

Even though they play an important role in the process, managers obviously cannot singlehandedly eliminate all service delivery fog. The key elements in your service delivery ecosystem that contribute to those ilities that are meaningful or important to your customer not only will span systems, software, people, skills, and suppliers, they will also be affected by the interplay between them. Some, from the software and systems themselves to configuration, access control subsystems, network services, and the data itself, contribute directly. Others, from troubleshooting and recovery tools to skilled staff and instrumentation systems, are supporting mechanisms that aid in finding and fixing ilities that start to drift outside acceptable parameters.

Understanding what these elements are, the relative importance of the ilities they deliver, and how they contribute to them will help you build a map that you can use to learn and improve your ability to deliver them. What is particularly important about building this map is that you will soon realize how much control you have over the ilities these elements provide. Elements will fall into three distinct categories: the known “manageables,” the known “unmanageables,” and the unknowns.
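
A simple way to start building such a map is to record, for each element, the ilities it contributes to and which of the three categories it currently falls into. The sketch below is hypothetical; the elements and their classifications are placeholders you would replace with your own findings:

    # Hypothetical element map. Categories: "known manageable",
    # "known unmanageable", and "unknown".
    element_map = {
        "order database": {
            "contributes_to": ["availability", "data integrity"],
            "category": "known manageable",
        },
        "cloud object storage": {
            "contributes_to": ["recoverability", "scalability"],
            "category": "known unmanageable",   # operated by the provider
        },
        "legacy billing batch job": {
            "contributes_to": ["data integrity"],
            "category": "unknown",              # nobody can explain how it behaves
        },
    }

    # Unknowns are where unmanaged risk hides, so surface them first.
    unknowns = [name for name, e in element_map.items() if e["category"] == "unknown"]
    print("Investigate first:", unknowns)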

Let’s look at how each category can help you and your team gauge and ultimately shape your organization to best manage your operational risk.

What You Know and Can Manage

Known “manageables” consist of all the elements in your service ecosystem that directly contribute to the operational ilities of your service and that you know you can demonstrably control. This is the category where, in a perfect world, all the elements necessary for all critical or important ilities would live.


Figure 8.6
The operational ilities of an On/Off switch should all be known and manageable.

However, being known and “manageable” is only useful if you are able to monitor, respond to, and recover from any events that jeopardize the delivery of those ilities in a timely and effective way. Let’s take the example of transaction responsiveness. Most customers will have expectations regarding what length of delay is acceptable for a transaction. Having known “manageable” elements that you can adjust or change to shorten growing delays is not terribly useful if you are not able to make those adjustments at a speed or in a way customers can tolerate.

Similarly, it is not enough to simply have an element that is directly and knowingly “manageable.” The supporting mechanisms that find any ility delivery problems with that element and correct them also need to be responsive enough. This is one of the most common ways that organizations trying to manage ilities fail. Some will put in place elaborate tools and technologies for managing such ilities as scalability and recoverability. Yet I have seen them either lack sufficient in-house skills to use those tools effectively, or split responsibilities across the organization among teams that are unable or unwilling to contribute effectively to meeting customer needs.

It is important to know such elements and how they contribute to the success of the service, know and test their thresholds, track their current state in production, and ensure that any groups that must act when thresholds are approached are capable of doing so in an acceptable timeframe with an acceptable level of accuracy.
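
As a minimal illustration of that last point, a known “manageable” usually pairs a tested threshold with instrumentation and a clearly defined responder. The sketch below assumes made-up thresholds and a placeholder alerting function; it is meant only to show the shape of the loop, not any particular tool’s API:

    # Minimal sketch: compare observed state against tested thresholds and
    # alert whichever group is responsible for acting before the ility is breached.
    THRESHOLDS = {
        "p95_response_ms": 800,   # tested limit customers will tolerate
        "error_rate_pct": 1.0,
    }

    def notify_responders(metric, value, limit):
        # Placeholder for a real paging or alerting integration.
        print(f"ALERT: {metric}={value} is at or beyond the limit of {limit}")

    def check_thresholds(observed):
        for metric, limit in THRESHOLDS.items():
            value = observed.get(metric)
            if value is not None and value >= limit:
                notify_responders(metric, value, limit)

    # Example reading pulled from whatever instrumentation you already have.
    check_thresholds({"p95_response_ms": 950, "error_rate_pct": 0.4})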

The Known Unmanageables

Figure 8.7
It took a while, but Sam’s team finally found a way of demonstrating a known unmanageable to him.

Known “unmanageables” are all the elements in your service ecosystem that you know contribute to the operational ilities of your service but are not under your direct control. The most common of these today are cloud and PaaS services, such as AWS and Azure, that are relied upon for critical infrastructure and platform services. Just as common as these services but often overlooked are components delivered by commercial hardware providers (such as providers of network, compute, or storage equipment) or software providers (such as database, payment, or security COTS packages), smaller but often just as important caching and gateway service providers, as well as providers of externally run build, test, and deployment services.

Knowing their importance and what they contribute is extremely helpful. But the big thing with known “unmanageables” is to minimize their risk to your ecosystem. With systems and software, this might involve building robust monitoring, troubleshooting, and recovery procedures, along with using more standard or proven versions and configurations in places where they provide critical capabilities.

Another even more effective approach is to engineer and operate your services in a way that is intentionally resilient to any ility issues those “unmanageables” might create. This is very much the thinking behind Netflix’s Simian Army and Chaos Engineering. Netflix knew that hosting significant components of a streaming service on AWS instances could expose them to unpredictable performance characteristics that might cause lags and frame drops that could hurt the customer’s viewing experience. They also knew that engineers have a tendency to ignore or downplay the threat of ecosystem problems, creating software that struggles to deal with the unexpected.

By using tools like Chaos Monkey and Latency Monkey, engineers knew that “once in a blue moon” sorts of problems would definitely occur, forcing them to engineer their software to deal with them resiliently. This results in lower risk to the business and, ultimately, more predictable delivery of the ilities the customer expects.
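
You do not need Netflix’s tooling to apply the same idea. The sketch below, with purely hypothetical probabilities and delays, occasionally injects latency in front of a downstream call so that timeout and retry handling gets exercised long before a real provider hiccup does it for you:

    # Hypothetical latency-injection wrapper in the spirit of Latency Monkey.
    import random
    import time

    def with_injected_latency(call, probability=0.05, max_delay_s=2.0):
        """Occasionally delay a call so timeout and retry handling gets exercised."""
        def wrapped(*args, **kwargs):
            if random.random() < probability:
                time.sleep(random.uniform(0.1, max_delay_s))
            return call(*args, **kwargs)
        return wrapped

    def fetch_recommendations(user_id):
        # Stand-in for a real call to an "unmanageable" upstream dependency.
        return ["title-1", "title-2"]

    fetch_recommendations = with_injected_latency(fetch_recommendations)
    print(fetch_recommendations("user-42"))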

The Unknowns

Figure 8.8
Billy could never be sure what, if any, dangerous unknowns were out there.

Even if you have the ability to manage an ility, it is going to be of little help if you do not know its constituent elements, their state, or how to best operate them to deliver what the customer needs. Ilities of this sort are the unknowns. These unknowns introduce a level of unmanaged risk that can damage or even destroy your business.

Unfortunately, given the poor information flow and weak grasp of the service side of delivery described earlier, being an unknown is the default state for most ilities. It is also where most service delivery organizations find themselves at the beginning of their service engineering journey. Even if you solve the problem of determining what customers need, you still have a lot of work to do to close the awareness gap within your delivery ecosystem so you can deliver it.

Ways the Team Can Eliminate Service Delivery Fog

There are several other ways that teams can, regardless of organizational structure, enhance their shared situational awareness, all while continuing to learn and improve. Organizing work to make it clearer what is going on (covered in Chapter 12, “Workflow”), having good but lightweight ways of regularly sharing information and ideas (covered in Chapter 14, “Cycles and Sync Points”), and having intelligent and ility-meaningful instrumentation (covered in Chapter 11, “Instrumentation”) are obvious ways to help. There are also ways to restructure governance processes (covered in Chapter 15, “Governance”) and automation practices (covered in Chapter 10, “Automation”) so that they enhance rather than hinder awareness.

Before getting into specific patterns and practices, it is good for you and your team to capture and close any maturity gaps that can hinder information flow and effective delivery. This is discussed in the next chapter, along with the first of the “duty” roles that a team can use to facilitate operational information flow and spot improvement areas within and between teams. I have used the pattern myself in a number of organizations and have found it a great way of breaking down silos while simultaneously improving cross-team technical coordination and expertise.

The second “duty” role, the Queue Master, is far more critical. It is covered in Chapter 13, “Queue Master,” and is a way to give each team member a chance to step back from their normal work to improve team situational awareness and to spot learning and improvement areas by monitoring the work flowing through the team.

Summary

Effective service delivery requires that teams have the right amount of information flow to know the intent behind what they are trying to accomplish, the ilities that customers need to be present, and the elements within the ecosystem that can contribute to delivering them. By restructuring how people look at their roles and the larger ecosystem, teams can reduce the service delivery fog that so often obscures their path to success.
