Chapter 16. SLO Advocacy

They always say time changes things, but you actually have to change them yourself.

Andy Warhol

Previous chapters have explored how to get buy-in from your organization when adopting SLOs, and the importance of building an SLO culture. By now, you understand why your organization needs SLOs and how much they will impact your engineering processes and your users. You can’t wait to start advocating for SLOs across your organization!

But wait—there are three questions you need to answer before beginning:

  • Do you have leadership buy-in on implementing SLOs in your organization? It can be helpful to have an executive sponsor with a vested interest in SLO implementation who will be able to support you and unblock you in the case of conflicts of interest.

  • Does your management chain agree on expectations and the time investment required to drive SLO adoption? Being an SLO Advocate will probably be a full-time job for you for the first few months (or even longer if you’re in a large organization).

  • Are you ready to have a horizontal role that impacts an entire organization? Such a role will require skills in multiple domains, from communicating with senior leadership and stakeholders, to writing documentation, to analyzing data and building reporting, to reviewing monitoring implementation, to delivering training, and more.

If you answered yes to all three questions, congratulations: you’re ready to be an SLO Advocate. But what does that mean?

Your role is to help your organization successfully implement SLOs by doing all of the following:

  • Cultivating a deep understanding of SLO implementation for different types of services (see Chapter 4)

  • Understanding your monitoring platform, statistics in general, and your organization’s data visualization platform (to do instrumentation and produce dashboards and other outputs for your metric signals)

  • Most importantly, motivating and inspiring people to reach beyond their current role and scope

Your people and leadership skills will be critical during this journey: you will need to convince others of your vision, teach them what they need to know, and generate positive energy to inspire them and drive successful SLO adoption.

It will help a lot if you have experience designing training materials and delivering technical training. If you don’t, you might want to team up with someone who does. If teaming up isn’t an option, however, don’t worry: this chapter gives you some tips on how to do all of this successfully. It focuses on activities, artifacts, and processes for SLO adoption that have worked in other organizations.

Note

In sum, becoming an SLO Advocate is an opportunity to improve your leadership, engineering, and project management skills, while creating positive change in your organization.

As with many things in life, and especially engineering projects, it’s best to start small and iterate as you go. We can break this journey into three phases: Crawl, Walk, and Run.

Throughout each phase, make sure you are soliciting feedback to understand whether what you’re doing is working or you need to change your approach. You’ll face some challenges (we’ll talk about those in more detail in “Learn How to Handle Challenges”), and at times you may even think you have failed. But remember: you learn more from failure than you do from success.

Crawl

The Crawl phase of your SLO advocacy journey is where you’ll build the foundation of your program. You’ll educate yourself, create artifacts to help you spread your message, start to connect with leaders and teams in your organization, and run your first few training sessions.

Do Your Research

First, you need to become an SLO expert yourself. The fact that you’re reading this book shows that you’re on the right path! You should also read the chapters on SLOs in Site Reliability Engineering and The Site Reliability Workbook.

There are many online resources you can use to deepen your knowledge, from technical conferences to written works to online courses. Choose what works best for you. Define your learning process in advance and track it as you would any other activity. Learning without a timeline and without clear outcomes may become demotivating.

We also recommend creating a working example of a service with SLOs while you’re learning. Having a concrete, small example to apply each new piece of knowledge to will keep you focused and help you retain what you learn.

Prepare Your Sales Pitch

It usually takes me more than three weeks to prepare a good impromptu speech.

Mark Twain

Imagine you meet the CEO of your company in the elevator, and they ask you what you do. With only a few seconds of their attention, what would you say?

Spend some time preparing your SLO “elevator pitch,” and remember to adapt it to different audiences. You should be able to articulate the value of SLOs and why others should care about them, and you should be able to do this when speaking to people with all different perspectives.

What do your engineers care about?

Engineers usually appreciate understanding if their service is working well enough, so shape your conversations with this goal in mind. You could talk about real-time measures of service reliability and the ability to get insights into the health of service dependencies; about correlating different signals and using that data to detect service degradations; and about embedding SLI metrics into your CI/CD pipeline to detect regressions and perform automatic rollbacks. Don’t forget to mention how SLOs and error budgets help with assigning the best priorities to service incidents and improve alerting efficiency, both of which are especially critical to on-call engineers.
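
When you pitch error budgets to engineers, a small worked example can make the idea concrete. The following is a minimal sketch in Python, assuming a request-based availability SLO; the function name `error_budget_remaining` and the sample numbers are illustrative and do not come from any particular monitoring platform.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the SLO as a fraction (0.999 means 99.9%).
    A negative result means the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        # No traffic yet, so none of the budget has been spent.
        return 1.0
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

With a 99.9% target, 1,000,000 requests, and 250 failed requests, roughly three quarters of the budget remains. A number like this gives on-call engineers a shared basis for deciding how urgently an incident needs attention.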

What do your company executives and business partners care about?

Here you can talk about real-time measures of user experience and satisfaction, or about a data-driven approach to service reliability and the ability to prioritize efforts that will improve user satisfaction in the areas where it really matters. Mention that SLOs will allow them to identify the right investment opportunities, because “reliability is the most important feature of any system” and the only way to gain your users’ trust is to provide reliable and secure systems.

Note

Feel free to pull from this book in your sales pitch. Identify the language that will resonate with your organization, and borrow it. A strong message that works for both audiences is this quote by Peter Drucker: “If you can’t measure it, you can’t improve it.” Above all, SLOs are new and better measurements of your service from your users’ point of view.

Create Your Supporting Artifacts

Having done your research, you have a good understanding of SLOs and are confident about defining them for a simple service. You’re also prepared to talk about the value of SLOs with anyone you encounter. Next, you can focus on creating artifacts to support your engineering organization as it adopts SLOs. These artifacts fall into two main categories: documentation and training materials.

Don’t forget to define where all your artifacts will live—for example, a wiki paired with a code repository—and make sure they’re discoverable and easy to navigate to. The biggest mistakes we see across engineering organizations are not taking the time to create well-structured and discoverable technical documentation, and not demanding that documentation undergo the same quality review process as code. Don’t underestimate the power of documentation to support you and your organization during SLO implementation.

Note

You might be tempted to use this book as your documentation, and just ask everyone to read it. Some people in your organization may do that, but many will not. Even among those who do read it, people will take away different things from the book. It’s more effective to create your own documentation, tailored to your organization’s needs, than to expect everyone to read an entire book. See the next section to learn what types of documentation you need to increase your chances of successful SLO adoption.

Documentation

Your goal is to break down SLO creation into three phases: define the SLO, collect SLIs, and, later, use the SLO. Here is a list of the documentation we recommend at this stage:

One-page strategy document
The one-page strategy document will be the most important document in the Crawl phase. What are you trying to accomplish? Why? How will you do it? This will be the very first document you share with people when they ask, “What is this effort all about?” You should make it short enough for anyone to read in less than 10 minutes. It’s critical that you get this document right. Use this book as a resource to help you articulate why your organization needs SLOs: what it will get out of creating SLOs, and how SLOs will improve service reliability for your users and help your engineering teams. Make sure you review this document with your leadership and have their sign-off and total support for the strategy you plan to communicate across your organization.
Two pages defining SLOs (high level)
Next, you’ll need a more detailed (but still brief) document that explains what an SLO is, gives examples of good SLOs, and tells the reader how they can get started. You don’t want to scare your readers by asking them to read an entire book about SLOs just to understand what an SLO is. Make it easy for engineers to get an idea of how to implement SLO-based approaches and try to build their interest.
FAQ
Collect a list of the questions you expect people to ask as they begin their own SLO journeys, and compile them into an FAQ document. To start with, you might include questions like:

  • What if my “user” is another service? Do I still need to care about SLOs?

  • What if my service’s dependencies don’t have SLOs?

  • How many SLOs should a service have? How many SLIs?

Defining SLOs for your service, step by step
You’ll need a document that explains, step by step, how someone in your organization can define an SLO (the first phase of the SLO creation process). Don’t talk about instrumentation and metrics collection here; focus on the high-level process. You might want to share an SLO definition template that teams can use.
Instrumenting your service to collect SLIs
As a follow-on to the previous document, this document will give step-by-step guidance, with examples, on how to instrument a service to collect SLIs (phase two). You can be very specific here and look at the monitoring platform your organization uses to give examples of SLI instrumentation for different types of services. For example, how would you collect latency data and translate your metrics into SLIs using percentiles? How would you instrument a pipeline service to collect SLIs? Give as many examples as you can and provide ready-to-use code snippets, making it easy for engineers to move forward with the monitoring instrumentation step of the journey.
Use case
If you’ve already implemented SLOs for any of your services (or for the example service you developed while doing research), write up the details in a use case document to give your SLO early adopters a concrete example of how this is done.
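
As one illustration of the kind of ready-to-use snippet the instrumentation document could contain, here is a minimal sketch in Python of turning raw latency samples into a percentile and a threshold-based SLI. It assumes the latencies are already available as a list of milliseconds. The names `LatencySLO`, `percentile`, `latency_sli`, and `meets_slo` are hypothetical; in practice you would replace them with whatever your monitoring platform provides.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical SLO definition: `target` fraction of requests
    must complete within `threshold_ms`."""
    name: str
    threshold_ms: float
    target: float  # 0.95 means 95% of requests


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (0 < pct <= 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_sli(samples, threshold_ms):
    """Fraction of requests that finished at or under the threshold."""
    if not samples:
        raise ValueError("samples must not be empty")
    good = sum(1 for s in samples if s <= threshold_ms)
    return good / len(samples)


def meets_slo(samples, slo):
    """True when the measured SLI is at or above the SLO target."""
    return latency_sli(samples, slo.threshold_ms) >= slo.target
```

The percentile answers "how slow is the slowest typical request?", while the threshold-based SLI answers "what share of requests were fast enough?". The second form is usually easier to compare directly against an SLO target.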

Training

To supplement your documentation, we recommend developing a few training sessions that you can run at this stage. Here are some ideas:

30-minute overview of what SLOs are
Capture the interest of the audience, and use this time to inspire and motivate them to want to learn more. You can also record this training session and distribute it across the organization so that people can watch it on their own time as a first step of SLO onboarding. The overview should mostly focus on defining concepts and outlining the value of implementing SLOs.
Hands-on workshop on defining SLOs for an example service
Choose one service that will be familiar to all attendees, such as a common dependency, or construct a hypothetical system or service that’s easy to understand, like a mobile game or an online store. Provide your students with the high-level service architecture and data flow diagrams, describe the service’s main functions, and give examples of usage patterns for that service. Organize sessions where small groups of four to five people define SLOs for the service and then explain their approach to the other groups. This type of workshop is more suited to being held in person (or at least live over video chat) as a collaborative learning experience, rather than recorded.
Hands-on workshop on instrumenting a service to collect SLIs
As a follow-on, walk your audience through your sample service and the instrumentation you implemented to collect SLIs. Building a hands-on lab for your audience may be time-consuming, but depending on the complexity of your platform, building it now may save you time in the future by establishing consistency in how to instrument SLIs.

Tip

Make sure to give your attendees a break between hands-on workshops.

Collaboration-based training

People enjoy real-time, collaboration-based training more than watching recordings, and they get more out of it. In a large enough organization, you may be able to train a few other teachers to help with this. Adding new teachers also provides advantages such as being able to cover more time zones, if your company is distributed, and helping your colleagues in their career growth. But even with multiple teachers, if you need to reach thousands of people it won’t be feasible to train everyone in person, so you’ll need to scale your training by leveraging recordings and online collaboration.

Usually, if the training is lecture-based (that is, for sharing definitions and concepts), a recorded session works great. But if you need to teach people how to think in a specific way and to solve a problem, a collaborative environment will give you a much faster outcome. The 3 hours of collaborative SLO workshops that we attended at a technical conference gave us the same amount of (or even more) experience defining SLOs as 40 hours of watching recordings and reading on our own. In person or online, you can build a collaborative environment by organizing your students in small groups of five or six people, assigning a mentor to each group, and defining the rules of engagement and the desired discussion outcomes. If you are doing this online, consider using a digital whiteboard and video chat to encourage more effective collaboration between group members.

Run Your First Training and Workshop

Tell me and I forget. Teach me and I remember. Involve me and I learn.

Benjamin Franklin

Most likely, your first training and workshop will not be perfect. This is to be expected: remember that one of the foundations of an SLO-based approach is acknowledging that nothing is ever perfect, and this extends to your SLO advocacy efforts as well.

Make sure you create a survey to collect feedback from attendees and learn how you can improve the training. It’s also a good idea to deliver your first couple of training sessions to a friendly group of people who already have an idea of what SLOs are, and who can give you candid feedback on what’s missing from your training. As with everything in life, you may get it wrong at first. Listen to your attendees, iterate, and improve. Either online or in person, ensure that you create a collaborative environment among your students so they can practice discussing SLO-based approaches with each other while going through the exercises.

Note

In the first SLO workshop we ran, the hands-on exercise was to define meaningful SLOs for a simple request and response service. After a few workshop sessions, we got clear feedback: the service type we were looking at was too easy. Attendees wanted exercises on working with more complex types of services, such as storage services, pipelines, continuous compute services, network services, and serverless services. They also asked for a hands-on lab on instrumentation for SLIs, with examples of each service type for them to pick from. We took this feedback on board and added more and more service examples to the workshop over time.

Implement an SLO Pilot with a Single Service

To bring SLO adoption to the next level, you need an example of a real service that has implemented SLOs and can show how much those SLOs have impacted service reliability.

Choose one of the smaller services in your organization and develop SLOs for it. Request and response services without many dependencies are a good option. Define an SLO for the service, instrument it to collect SLIs, and build visualizations for the SLIs that demonstrate the value of SLIs and SLOs in improving service reliability. Work with the engineering team that owns the service to gather their feedback, make any necessary adjustments, and help them start to use SLIs and SLOs in their engineering practices. Then, crucially, document the pilot as a case study and add it to your documentation, so that other teams can read about the experience.

Spread Your Message

The single biggest problem in communication is the illusion that it has taken place.

George Bernard Shaw

Your next goal is to ensure that your organization is aware of the work you’re doing, the push toward SLO implementation, and any new content you’re building.

Talk at internal meetups or conferences. Organize engineering review sessions to deep-dive into SLO implementation examples. Use every opportunity to talk about SLOs at your internal community events. You need to make sure people know what your role is as SLO Advocate, and how they can find you.

Publish the schedule for your next training sessions, outline who the experts available to help are, and share how people across your organization can get in touch. Understand what channels your organization uses to share information. Make sure you have an SLO landing page with an easy-to-read shortlink and share it over and over, until everyone knows where to find information on SLOs.

Here are some ideas for what information to publish on the landing page:

  • Your training schedule

  • An email distribution list that can be used to ask SLO experts questions

  • A list of Slack or Teams channels dedicated to SLOs

  • An SLO newsletter, and a distribution list people can subscribe to

  • Your office hours schedule

Learn How to Handle Challenges

As an SLO Advocate, you are the agent of change. You may encounter some challenges in this journey, and facing them with an open mind and positivity is critical to your program’s success. Remember, we learn more from failures than from successes! Here are a few of the issues you may run into, and our suggestions for dealing with them.

First of all, your role may be misunderstood. Some teams will expect you to do SLO implementation for them. You can overcome this obstacle by setting the right expectations, up front and clearly, when you engage with teams. You might also encounter people who simply don’t know what you’re doing at all. You can add clarity to your role by making sure you have a clear backlog and you are tracking SLO advocacy work in the same way you would track your engineering work, breaking it down by artifacts, activities, and other deliverables. Tracking your work will also help you do retrospectives for this program and communicate your deliverables and timeline clearly.

Second, you may encounter some resistance while trying to implement changes to the processes and practices used by your partner teams. People are naturally resistant to change, and you should be prepared for pushback as a result of this. Breaking the changes into smaller iterations and making them easy to implement will help you to overcome change aversion. Try to use your earliest successes to build confidence in what you are doing and turn those teams into SLO evangelists—it may help their career growth and will help your organization to adopt an SLO-based approach faster.

Third, as we mentioned before, you may encounter teams that are overloaded with work or that have well-defined priorities that are not SLOs. In these situations, you can work with leadership to see about prioritizing SLO work against the team’s other responsibilities.

Walk

In the Crawl phase of your SLO advocacy journey, you laid the foundation for SLO implementation in your engineering organization and had some initial success. Now, in the Walk phase, you’ll expand your work to other teams and continue building a library of examples. You’ll also make sure your feedback loops and internal communication methods are working well, expand your training program, and revisit how much time you spend working with each team.

Work with Early Adopters to Implement SLOs for More Services

By now, you already have one or two services piloting SLO implementation and working on incorporating SLOs into their engineering practices, aiming to improve service reliability. You need to move further, but you can’t tackle all the services in your organization at once. (We’re assuming you have more than three services in your organization; if you don’t, congratulations, you may be very close to completion!)

Next, choose a small number of services that you can give white-glove assistance to as they implement SLOs. It’s a good idea to pick a few services of different types (request/response, pipeline, continuous compute, etc.). It may be tempting to keep looking at request and response services, but you need to build a body of real-life SLO implementation examples for different service types.

Variety is not the only criterion you should use when choosing which services will receive white-glove assistance. Other things to consider are:

  • Level of complexity (look for variation here too)

  • Amenability of the teams

  • Criticality of the service to the system

  • Closeness to human users

Schedule weekly meetings with the owners of each service to assist them on their SLO implementation journey. Define a timeline for completing the different SLO implementation phases, and keep each team accountable to ensure that this work doesn’t get deprioritized.

Celebrate Achievements and Build Confidence

We can’t stress this enough: celebrating achievements will bring positive energy to your mission and accelerate SLO adoption. If teams are not moving forward fast enough (or not moving at all), try to understand what challenges they are facing. It’s not always aversion to change or conflicting priorities; teams may find that in order to implement SLOs, they first need to make fundamental changes to the way their service is built.

To help build confidence about SLO implementation for those services, you could try some of the following:

  • Start with something as simple as possible, even if that means the SLO isn’t as useful as intended—for example, a single, easy endpoint or a subset of the user flow.

  • Try something like measuring through synthetics or a dedicated client instead of the service itself.

  • If the issue is with some other piece of knowledge that is missing from the team, try to get someone knowledgeable in that area to pair with the team. Gaps in knowledge could be in middleware patterns, particular frameworks or tooling, and so forth.
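
The synthetic-measurement suggestion above can be sketched as a small probe that exercises the service from the outside and records the outcome. This is a minimal illustration in Python, not a production prober: `ProbeResult`, `run_probe`, and `availability` are hypothetical names, and the `check` callable stands in for whatever request your dedicated client would actually send.

```python
import time
from typing import Callable, List, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool          # did the synthetic check succeed?
    latency_s: float  # how long the check took, in seconds


def run_probe(check: Callable[[], bool],
              clock: Callable[[], float] = time.monotonic) -> ProbeResult:
    """Run one synthetic check and record success and elapsed time."""
    start = clock()
    try:
        ok = bool(check())
    except Exception:
        # Any exception raised by the check counts as a failed probe.
        ok = False
    return ProbeResult(ok=ok, latency_s=clock() - start)


def availability(results: List[ProbeResult]) -> float:
    """Fraction of probes that succeeded: a simple availability SLI."""
    if not results:
        raise ValueError("no probe results to summarize")
    return sum(1 for r in results if r.ok) / len(results)
```

Because the probe only needs a callable, a team can start measuring without changing the service itself, then move to in-service instrumentation later.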

Create a Library of Case Studies

By now, you hopefully have multiple teams interested in implementing SLOs that are trying to get your assistance with that work. While continuing to work with your handful of early adopters, you need to make sure all these other teams have a good level of support, without overloading yourself. Your best friends in this phase will be well-structured documentation and a set of internal case studies teams can use to learn more. Being able to follow the example of your early adopters will help other engineering teams with their own SLO implementation, making it easier for you to scale this program.

If you find yourself receiving a lot of requests from teams to meet with you to ask questions about SLOs, define your boundaries. You can’t meet with every team individually; ask them to read the documentation and attend your office hours. Often, these teams will have one or two questions that can be answered in a few minutes in office hours, and there’s no need to schedule a separate meeting with them. The most frequent question will likely be, “My service looks like this; do you have any examples of SLO implementation for this service type?” Having case studies for different service types from your early adopters will help many other teams.

Scale Your Training Program by Adding More Trainers

By now, if your organization is small, you might have trained everyone. But if you’re dealing with a medium or large enterprise, with teams distributed around the country or globe, it’s time to scale your training program. Train other people to deliver the training sessions and workshops you’ve created. If your organization has an internal training team, they might be able to support you in scaling your training. You might even consider handing off training to that team completely, and focusing on other aspects of SLO advocacy. Otherwise, seek out passionate individuals in the organizations you’ve been working with who understand SLOs well, and engage them to scale your training program. Handing off your training work to others will allow you to focus on the other important tasks in this phase.

Scale Your Communications

Earlier, we mentioned that as your advocacy work ramps up you will no longer be able to spend as much time working with every single team. You need to scale how you engage and communicate with teams. Some activities and artifacts you might consider at this stage include:

Office hours
If you haven’t already, schedule a periodic meeting (say once a week, for one hour) where you and a group of SLO experts from your early adopter teams will be available for anyone to pop by to ask questions. Make sure the meeting invite is broadly distributed, that people can join remotely, and that everyone knows the purpose of the meeting.
FAQ document
Capture all the questions you’re getting from teams and consolidate them in a frequently asked questions document. Then, when a repeat question comes up, you can simply send teams the link to that document.

Not everything can be scaled, unfortunately, so there will be exceptions when you may need to provide 1:1 consultancy to a team with in-person engagement. We recommend limiting this to large or core services that carry significant complexity, with multiple upstream dependencies.

To deal with time zone challenges and maintain your work/life balance, make sure you have at least one SLO expert per region who can provide support in local time. Leverage your early adopter teams to support other teams working on SLO implementation, too.

Communicate, communicate, communicate. Keep yourself accountable, and keep teams implementing SLOs accountable as well. These teams should capture their SLO work in the same work-tracking platform that they use for engineering work (for example, Jira or Bugzilla). Build dashboards reporting on SLO adoption progress. Communicate as much as you can!
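
The data behind an adoption dashboard can be very simple. The sketch below, in Python, assumes you track one status per service that mirrors the three SLO creation phases described earlier (define, collect SLIs, use), plus a "not started" state. The phase labels, the `adoption_summary` function, and the service names in the usage note are all made up for illustration.

```python
from collections import Counter

# Hypothetical status labels, one per stage of SLO creation.
PHASES = ("not_started", "defined", "instrumented", "in_use")


def adoption_summary(service_phases: dict) -> dict:
    """Map each phase to the fraction of services currently in it."""
    counts = Counter(service_phases.values())
    unknown = set(counts) - set(PHASES)
    if unknown:
        raise ValueError(f"unknown phases: {sorted(unknown)}")
    total = len(service_phases)
    return {
        phase: (counts[phase] / total if total else 0.0)
        for phase in PHASES
    }
```

For instance, passing `{"checkout": "in_use", "search": "defined", "billing": "defined", "ingest": "not_started"}` yields a summary in which half of the services have defined SLOs and a quarter are actively using them.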

Run

When you get to the Run phase, SLO implementation is going viral and everyone is at some stage of SLO maturity. (If this isn’t the case, you might want to consider going back to the Crawl or Walk phases and continuing to iterate on them until you achieve enough momentum to move to the Run phase.)

In the Run phase, your role is to use what you’ve learned so far and keep improving, by sharing your library of case studies, creating a community of SLO experts, driving platform improvements, and improving your advocacy process. Remember that defining and implementing SLOs is just a first step toward improving reliability. The game changer is actually using SLOs as part of your engineering practices to drive service quality and operational excellence.

Share Your Library of SLO Case Studies

Continue the work you began in the Walk phase, adding more case studies to your library. Pick the most successful services. Having concrete examples of success will help convince those who still don’t fully believe in SLOs that they, too, should care about them.

Don’t focus only on documenting the use case. Make sure you also reference code and give specific examples of reliability improvements that the team saw over time. Some improvements may be purely operational: for example, noise reduction or toil reduction. Other improvements may be less data-driven; for example, “Before SLOs, we didn’t know how users perceived service reliability, but now we do” or “Before SLOs, we didn’t have enough data to prioritize work that will have a greater impact on service reliability, but now we do.” (Other chapters of this book will help you determine what to focus on when writing these case studies.)

Create a Community of SLO Experts

Build an internal community of SLO experts, pulling from your early adopters, your other trainers, and anyone else who is passionate about SLOs. SLO experts can support engineering teams across your organization by answering questions and helping with hands-on SLO implementation. Create an email distribution list for these SLO experts, or use other internal communication channels (for example, chat) to give teams an easy way to reach them.

Continuously Improve

As you go on, continue to improve your platform, review your SLOs, and update your documentation.

Based on what you’ve learned so far, you may have discovered that you need to make some changes at the platform level, or that you need to rethink your observability strategy or reporting toolset. Work with your internal partners on defining those platform improvements.

Even if you’ve had some initial success implementing SLOs for a service, you should review them again a month or two later. Remember, SLOs are a process, not a project. Services evolve and platforms change. What worked well before may no longer be relevant. Use your SLO maturity framework to review SLOs for specific services periodically.

Other ways you can keep improving include:

SLO reviews
Pick a service and review its SLOs. Suggest improvements, and document the team’s implementation efforts as a case study.
Quality of service reviews
Assuming your team has periodic service health reviews, make sure SLOs are one of the discussion topics.
SLO deep dives
Periodically do a deep dive into a team’s SLO implementation process, inviting other teams to observe.
SLOs driving discussions
Review how SLOs are being used to drive critical discussions to improve service reliability.

Lastly, review your existing documentation periodically to make sure it’s up to date. You might even try defining a “freshness” SLO for your documentation and making sure you maintain it above a certain level!
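
A documentation "freshness" SLO could be measured along these lines. This is only a sketch in Python: it assumes you can export a last-reviewed date for each document, and both the 90-day window and the `doc_freshness_sli` name are arbitrary choices for illustration.

```python
from datetime import date, timedelta


def doc_freshness_sli(last_reviewed: dict,
                      today: date,
                      max_age_days: int = 90) -> float:
    """Fraction of documents reviewed within the last `max_age_days`.

    last_reviewed maps a document name to the date of its last review.
    """
    if not last_reviewed:
        raise ValueError("no documents to evaluate")
    cutoff = today - timedelta(days=max_age_days)
    # A document reviewed exactly on the cutoff date still counts as fresh.
    fresh = sum(1 for reviewed in last_reviewed.values() if reviewed >= cutoff)
    return fresh / len(last_reviewed)
```

You could then set a target, such as 90% of documents reviewed in the last 90 days, and treat a drop below it as a signal to schedule a documentation review.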

Summary

Progress is impossible without change, and those who cannot change their minds cannot change anything.

George Bernard Shaw

This chapter looked at the different phases of the SLO advocacy journey, and the recommended goals and tasks for each one. Your role as an agent of change, seeking not just to implement SLOs in your organization but also to build an SLO culture, is one of the most challenging roles. To help ensure your success, make sure you have executive support and surround yourself with people who believe in your mission and who will keep you accountable. Iterate on everything, overcommunicate, and don’t forget to celebrate successes, no matter how small.
