Chapter 7. SRE Without SRE: The Spotify Case Study

Many people are surprised that Spotify does not actually have an SRE organization. We don’t have a central SRE team or even SRE-only teams, yet our ability to scale over time has been dependent on our ability to apply SRE principles in everything we do. Given this unusual setup, other companies have approached us to learn how our model (“Ops-in-Squads”) works. Some have adopted a similar model. Let us tell you a bit about how we came to this model and how it works for us so that you can see if a similar idea would work for you.

First, we need to share a little context about our engineering culture: at Spotify, we organize into small, autonomous teams. The idea is that every team owns a certain feature or user experience from front to back. In practice, this means a single engineering team consists of a cross-functional set of developers—from designer to backend developer to data scientist—working together on the various Spotify clients, backend services, and data pipelines.

To support our feature teams, we created groups centered around infrastructure. These infrastructure teams in turn also became small, cross-functional, and autonomous to provide self-service infrastructure products. Some examples of these are continuous integration, deployment, monitoring, software frameworks, and guidelines. The vast majority of the SREs at Spotify work in these teams, using their skills and experience to make the production environment reliable, scalable, and easy to work with for our feature teams.

However, some concerns of SRE are cross-cutting and can be addressed only from a central perspective. These include handling outages caused by large cascading failures and teaching best practices for deployments, incident management, and postmortems. Our SREs organize themselves into working groups across the company, but these working groups are not exclusively engineers with the SRE title. Only half of the engineers on our central escalation on-call rotation (internally referred to as Incident Manager On Call, or IMOC), for instance, are SREs; the rest are engineers with various roles.

Why have we organized ourselves in such a way? What are the benefits of doing so, and what are the trade-offs? In the following sections, we discuss how Spotify built its SRE organization from a tiny startup with servers in a Stockholm apartment to the large global company that it is today. We highlight how Spotify has made operations the default among all engineers by providing a frictionless development environment and a culture of trust and knowledge sharing.

Tabula Rasa: 2006–2007

Prelude

We’ll talk a bit about how we began to incorporate an operations focus in our early history, including:

Ops by default

Unintentionally bringing an operations focus in from the beginning shaped our engineering culture and proved beneficial as we grew.

Learning to iterate over failure

Although we might have had the foresight with operations, we were not impervious to the common traps that startups fall into.

One of the curious aspects of the Spotify operations and SRE story is that the initial staffing of the six-person company included an operations engineer.

Many startups add an operations-minded person only after the first customers begin using the service. The unfortunate ops engineer might then discover a backend hobbled by undocumented scripts; services running inside of screen sessions; lack of backups; single points of failure; and layers of unfinished, good intentions. From this point on, the ops engineer might constantly be playing catch-up, trying to fight fires while keeping abreast of new developments.

Including an ops engineer from the beginning ensured that operational soundness was treated as an equal partner in discussions, not left until the last minute. From the get-go, dev and ops worked side by side toward our common vision of streaming music to the entire world. This initial way of working led to a culture of collaboration and trust, which continues to thrive today.

Our backend, as originally envisioned, consists of many small services—a pattern now called microservices—which together provide the content for the Spotify clients. The microservices themselves are programs that do one thing and do it well.

During these first two years, work was distributed between dev and ops as follows: developers took care of the business logic in the form of client features or backend services, whereas ops owned everything else, including detecting problems in production as well as taking care of office technology.

This was a startup, and as you might imagine, almost everything was done manually, like deployment of backend services and desktop client rollouts. As Spotify grew in the beginning, tools were added to lessen the load, but these tools only helped with manual tasks; humans still needed to make all the critical decisions.

Key Learnings

Some of our key learnings from this early period were:

  • Including an ops engineer from the get-go influenced the engineering culture and proved highly beneficial as we grew.

  • You should introduce an operational mindset into the engineering culture as early as possible.

  • Aside from the business-logic architecture, ensure that an infrastructural architecture is in place as early as possible.

Beta and Release: 2008–2009

Prelude

In this section, we’ll talk about how our “Ops by default” philosophy shifted and how that links up with our core engineering values:

Ops by default

Bringing operations into regular discussions of scalability and reliability with developers from the start was pivotal to creating the foundation of our engineering approach. Operations was a cornerstone of every engineer’s work, whether in developing and maintaining services or in the intrinsic urge to help improve our infrastructure.

Core engineering values

Innately trusting engineers set the foundation for one of the most prevalent principles we continue to have in the tech organization: autonomy.

Spotify went into “invite-only beta” in May 2007, and a premium (and “freemium”) option was released in 2008. During this period, Spotify experienced its first real dose of the dreaded problems that come with rapid user growth. At scale, pain points that are rare and mostly theoretical become visible. They applied not only to the backend services but extended all the way down the stack. Here are a few examples:

  • The backend services had capacity-related outages during peak times.

  • There were disk I/O performance issues when the battery of a RAID controller failed.

  • A ZFS bug caused the servers responsible for critical data to become unresponsive under high utilization.

  • Racking and stacking servers was painfully slow.

Adding to the problem, the number and complexity of the backend services rose as Spotify developed new features. With time and a lot of effort, each of these technical issues was solved, and the backend continued to scale with the number of users.

However, the way in which ops engineers initially worked did not scale as well as Spotify did. There were three ops engineers at the time who owned all of the following:

  • Keeping an eye on all the services that were handed over to ops from dev

  • Maintaining the underlying Linux server configuration

  • Responding to service disruptions

  • Ensuring incident remediations were prioritized

  • Keeping up with system security

  • Maintaining networking equipment and configuration

  • Releasing new Spotify desktop clients and monitoring for deployment failures

  • Data center management including procurement, vendor relationships, and provisioning

  • Maintaining and expanding storage

  • Maintaining our data infrastructure and pipelines

  • Owning office IT (network, firewall, staff computers and peripherals, Google apps, LDAP, Samba, Kerberos, AFS, wiki, printers, etc.)

  • Serving as a support center to all of our colleagues; for example, helping with laptop setup, explaining networking configurations and how TCP/IP works, or helping our peers work with system monitoring graphs

There was quite a lot to do, but, at least initially, operations still had a little time left over for deep discussions, like the future of IPv6, or for bickering over the principles and pragmatics of the different free software licenses. However, as the various scalability issues hit us in increasingly large waves, it became clear that operations would fail spectacularly at keeping Spotify running if something wasn’t done.

Bringing Scalability and Reliability to the Forefront

Until this point, there were many people at Spotify who wanted to help with scalability and reliability-type problems, but only the ops staff was ultimately accountable for the work. But the ops team had too many responsibilities to juggle and needed to distribute those to the rest of the organization.

A weekly meeting was introduced during which all relevant backend developers and at least one ops engineer discussed services that were dealing with scalability issues. Every week, one or more services were discussed, and developer stories to scale up a system would usually trump feature work. Through this meeting, the ops engineers got some indication of which types of servers (heavy on disk space, CPU, or memory) were needed for the next procurement cycle.

The flavor of this meeting changed over time as dev and ops worked together. A single service’s scalability and reliability issues involved more and more dependencies on other services, so looking at each service in isolation was not enough; the focus of the meetings began to shift to the backend ecosystem as a whole.

Although we didn’t know this at the time, a transformation was taking place in our organization. Gradually, the sense of responsibility for ensuring system health was moving from being an ops-only job to something shared by all engineers working in the backend, regardless of role.

At this point in the company’s history, some of the original backend developers had root access. Seeing how useful this was, we gave root access to all developers who needed it. This level of access was unheard of in most traditional companies. Since the advent of DevOps, this is now common practice, but it wasn’t at the time. Yet it never occurred to us to do it differently; we innately trusted our engineers. The company was young, and we were united in the belief that we would take over the world of music.

By the time Spotify went out of beta, a new service would go from idea to production in the following way:

  1. A developer would write and locally test the service.

  2. The developer would then ask ops for one or more production servers.

  3. After the servers were provisioned, the developer would log in to the host and configure the service and its dependencies.

An unintended benefit of this new flow was that developers could now understand how their services behaved in the wild better than before. The entire problem of “Well, it worked on my computer” was bypassed. Developers could now see how their code behaved in production, view logs as they were produced, trace running processes, and even modify the operating system if necessary.

Looking back, we were unintentionally decoupling the specialized role of the SRE from the individuals in ops, turning it into a set of skills and responsibilities that could be transferred to anyone capable and motivated. Over the years to come, we reused this strategy multiple times with increasing scope and depth.

And this worked—for a time.

Key Learnings

Our key learnings from this period of high growth were:

  • Make operations part of the solution life cycle. Scalability, reliability, maintainability, and other operational qualities should be discussed and addressed as early as possible.

  • Granting privileged access (e.g., root) to everyone who needed it removed friction, unblocking and speeding up our iteration cycles. Reliability is about achieving trust: start by trusting your own.

The Curse of Success: 2010

Prelude

In this section, we’ll talk a bit about how we had to shift our operations mentality as we grew:

Ops by default

Introducing the roles of dev and ops owner helped us perpetuate our natural inclination to be mindful of operations in the context of feature development.

Iterating over failure

We could not scale ourselves as an operations team fast enough, and therefore had to divest some of our responsibilities.

Spotify continued to grow in popularity during 2010. This was reflected externally in an increasing number of registered users and concurrent active users, and internally in the number of features and corresponding backend services as well as more Spotify staff.

This twin growth conundrum—a sharp upturn in both users and staff—hit the operations team very hard. Despite growing from three to five people, ops was facing a perfect storm. The increase in users led to more pressure on the backend, magnifying the risk of edge-case failures, and ultimately leading to more incidents and cross-service failures. At this time, the increase in Spotify staff was evenly split between technical and nontechnical, resulting in a higher demand for quality support with shorter lead times.

In addition to these factors, with more skilled developers, features were being churned out much more quickly. In 2009 and 2010, Spotify released five more clients: our users could now access Spotify on an iPhone, Android, BlackBerry, S60 Nokia phones, and libspotify (a library enabling third-party developers to access the Spotify backend through their own applications). The backend saw more feature work as well, but the primary focus of 2010 was scalability and stability. Every system needed some sort of revamp, be it modification to enable easy horizontal scaling, a caching layer, or even complete rewrites.

Each frontend and backend change to the Spotify ecosystem introduced new bugs, revealed new bottlenecks, and changed how the system as a whole behaved. Adding a new feature in the client caused unmodified and historically stable services to experience sudden pressure due to changing user behavior patterns. This in turn often led yet other systems to topple in a domino-like fashion. Complex systems are unpredictable and difficult to manage, and, in our case, only a handful of people knew how the entire ecosystem fit together.

Responding to unforeseen incidents became more and more of a full-time job. When Spotify experienced periods of backend instability, developers would often show up at the desk of an ops person, needing critical support. This prevented ops from creating tools to simplify and automate work. In addition, our model of having backend developers handle maintenance wasn’t working as well as it should have. Although developers continued to maintain their systems, pressure to deliver features was steadily building, and the informal, best-effort agreement to keep up their services suffered, often resulting in out-of-date components, bitrot, and security issues.

In the meantime, our developer colleagues were dealing with their own organizational problems. What was originally a small team of developers with a shared understanding on how to collaborate to deliver code grew to become multiple development teams. In this new setting, developers found that collaboration between teams was difficult without agreeing on things like sprint commitments. As a result, teams felt a greater sense of urgency to deliver what they committed to. An unfortunate side effect of this was that many developers became frustrated by not being able to fulfill their informal agreement with us to maintain their services.

At this point, operations became increasingly aware of how the success of the product translated into countless nights by the laptop, manually catching up with failing systems struggling with the load. All of this led to experimenting and, ultimately, adapting to a new reality.

A New Ownership Model

We needed to clarify responsibilities across our operations and developer roles.

The dev owner role

In early 2010, we formalized the unofficial, best-effort agreement between ops and dev by introducing the dev owner role. Each service had an owner who worked in the feature team itself. Here’s what the dev owner’s responsibilities were:

  • Keep the operating system up to date with the latest patches

  • Think about the scalability of the service within our ecosystem

  • Ensure that development was scheduled to keep up with growth

Feature developers were also allocated one day during a sprint—a “system owner day”—when they were given dedicated time to maintain, upgrade, and generally nurture their services. The dev owner for each service was expected to call on us for help if needed. This eased some of the pressure on ops. This was not a controversial change, because many backend developers were already motivated to maintain their own services; this now ensured that they had the dedicated time to do so.

The ops owner role

The dev owner role helped us to spread the responsibilities of basic maintenance onto the largest group of engineers at the company, but if something went seriously wrong, ops was still accountable.

However, there were many systems and still only five people in ops. Because some of the more critical services needed more attention than others, we assigned an owner from ops to each of these important services. This way, even though every service had a dev owner, the significant ones had both a dev and an ops owner, enabling two people to focus on the services’ health.

Within the ops team, everyone was an owner of at least one service. Ops was expected to know the idiosyncrasies and common failure modes of any of our services and be ready to be called at any point, day or night, to restore the service to health. When redesigning a service, the dev and ops owners would often work together.

Formalizing Core Services

Because our backend services were deeply interconnected with one another, most incidents affected multiple services; as a consequence, often one system needed to be sacrificed to save another. One of the hardest things to do was make a decision in the middle of the night, alone and sleep deprived, that might adversely impact the entire backend ecosystem. Each of us asked ourselves multiple times every week: “Am I making the right decision? What if I don’t have the complete picture?” This made work during the night very draining.

When we finally identified this as a problem, a hierarchy of services was defined: those that were critical, those that were important, and those that could wait for free cycles. Out of this process came the “core services” concept. Because Spotify’s purpose was to enable people to listen to music, a core service was defined as any backend service that was in the critical path from a user logging in to client-side audio playback. This amounted to a handful of services, residing on a small cluster of servers. In addition to the core user-facing services, the infrastructural systems, such as our border routers, backend network switching, firewall, and DNS, were deemed critical.

By making these designations, we lessened the burden of nighttime work, and could deal with decision-making in a healthier way.

Blessed Deployment Time Slots

Because ops had yet to formalize an on-call procedure, we were careful about making deployments during off hours. The norm was to make sensitive deployments during work hours, when many people were available to help out if and when something went terribly wrong. This applied to all of our deploys with the exception of changes to network infrastructure, which were made when most of our users and employees were asleep.

Feature teams were encouraged to avoid any deployments on Fridays. This allowed us to have an honest chance of resting during the weekends.

On-Call and Alerting

Up to this point, everyone and no one was on call. Ops had no system in place to automatically send alerts on system failure. If something went down in the middle of the night, it was left broken until the morning when someone checked the graphs and reacted. In the event of an outage we hadn’t noticed ourselves, our colleagues would call one of us and we would start the work of identifying the problem.

As a result, going to sleep was difficult because we didn’t know if we would be called. The stress affected our decision-making abilities at night as well as our capacity during work hours to operate well—or at all.

When ops grew to five people in early 2010, we could finally do what we should have done long before but couldn’t because of the workload: install a proper alerting system. This solved one of our key issues: the detection of system anomalies and outages was no longer left to chance.

We could now proceed with the next step in our plan, starting with defining a weekly on-call rotation. With one person on call for all backend service disruptions, the rest of us in ops were finally able to sleep, with the caveat that we might be looped in if the incident involved a core system for which we were the ops owner. Any incident concerning a noncore system was not a priority, and there was no need to fix the problem in the middle of the night. Instead, we did some basic troubleshooting to understand the problem; then, we let the service be until the following workday to be dealt with by the appropriate people. Finally, the on-call person had the mandate to call the CTO, who in turn could call anyone in the company should the need arise.

Not completely pain-free

Even though an on-call rotation was defined and alerting set up, every day was full of surprises, with unplanned work, issues, and outages. Firefighting and troubleshooting took precedence over any other work. Being on call meant carrying the pager for the whole of Spotify, which alerted whenever some metric went over or under statically defined thresholds. If we weren’t paged while on call, it could only mean one of two things: either we had run out of SMS credits, or our cell phone provider had blocked our texts due to the high rate of alerts we were sending.
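To make “statically defined thresholds” concrete, here is a minimal sketch of the kind of check that paged us back then; the metric names, thresholds, and SMS gateway address are illustrative assumptions, not our actual setup.

    # Minimal sketch of static-threshold alerting; all names and numbers
    # here are made-up examples, not Spotify's real configuration.
    import smtplib
    from email.message import EmailMessage

    THRESHOLDS = {                                  # metric -> (min, max)
        "login.requests_per_s": (50, 20_000),
        "storage.disk_used_pct": (0, 90),
    }
    ONCALL_SMS_GATEWAY = "oncall-pager@sms.example.net"   # hypothetical email-to-SMS gateway

    def check(metrics: dict) -> list:
        """Return an alert string for every metric outside its static bounds."""
        alerts = []
        for name, (low, high) in THRESHOLDS.items():
            value = metrics.get(name)
            if value is None or not (low <= value <= high):
                alerts.append(f"{name}={value} outside [{low}, {high}]")
        return alerts

    def page(alerts: list) -> None:
        """One SMS per breach -- hence the burned-through SMS credits."""
        for text in alerts:
            msg = EmailMessage()
            msg["From"], msg["To"], msg["Subject"] = "alerts@example.net", ONCALL_SMS_GATEWAY, text
            msg.set_content(text)
            with smtplib.SMTP("localhost") as smtp:
                smtp.send_message(msg)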

On-call fatigue was constant, but was alleviated by the company culture and camaraderie: anyone was willing to step in if the on-call engineer was tired and needed time off after hours of firefighting. It was also fun to troubleshoot outages happening in the most unpredictable ways. Learning through failure is an essential part of a healthy engineering culture.

Spawning Off Internal Office Support

As mentioned earlier, a large part of our workload in ops until this point had been internal office IT systems and support. Due to our growth, some weeks we would have four new colleagues, each eager to get started, who needed a computer, a phone, LDAP credentials, email, an introduction to our wiki, and the usual password security lecture. On top of this, our office network and shared filesystems needed maintenance and polishing. Supporting nontechnical colleagues took time and patience and pulled focus away from our never-ending efforts at keeping backend entropy at bay.

To better support our use cases, Spotify finally split the ops team into two: production operations and internal IT operations. This enabled each team to concentrate on its respective work.

Addressing the Remaining Top Concerns

We’ll talk a bit about some of the issues we faced at this stage and how we improved.

Long lead times

Despite splitting internal IT support off into another team, we still had a large number of requests every week, including helping developers with the maintenance of their services, addressing network issues, coordinating with external parties on how to integrate with the Spotify backend, and maintaining communication with data center and hardware vendors.

All of these issues were dealt with in a best-effort way, which led, unsurprisingly, to very long lead times, and dissatisfied colleagues and customers.

Unintentional specialization and misalignment

We unwittingly became domain specialists. If one of us solved a specific problem, the person who asked for help would inevitably return to the same person the next time they encountered a similar situation. Others would hear about this, and when they had questions on the topic, they would turn to this same person. Not only did we become knowledge silos, but because our respective approaches were not aligned with others’ outside of SRE, our solution space became siloed as well. For example, we had multiple deployment tools that were being used outside of SRE; the authors of one set of tools didn’t always know about the others’ work.

Interruptions

Finally, there was a constant flow of interruptions, making our day-to-day work—sometimes requiring extended periods of time for analysis, planning, and implementation—very difficult.

Although we were becoming a larger group of engineers, the close-knit relationship between the dev teams and what was now production operations was still alive and kicking. Many tasks and requests were handled as if we were one small team: someone looking for help, seeking guidance, or needing work done would walk over to our desks and ask for anyone around. Often, this would result in more than one person interrupting their current work, listening in, and joining the discussion. It was a fantastic way of collaborating, but the number of context switches meant that we were much slower to make progress on our improvement work than desired.

Introducing the goalie role

Our approach to solving these three problems was to introduce a new role: the goalie. The goalie rotated weekly and served as the designated lightning rod for all requests coming into the ops team during office hours. If the incoming request queue was particularly low, the goalie would attempt to solve all issues on their own, occasionally asking for help from the other ops engineers. If the queue was overflowing, the goalie would only triage, dropping some requests and passing the remaining ones to the appropriate people on the team. Rotating the goalie role minimized knowledge siloing because everyone was exposed to the most common problems.

Creating Detectives

One of the more creative and rewarding aspects of our work as SREs is when we roll up our sleeves and do forensics during and after a system anomaly. This detective work requires knowledge of every service in the ecosystem and how they fit together. A typical investigation could begin with peculiar behavior in one of our Spotify clients, leading us to study a flailing backend service, which, when stabilized, proved to be innocent. We’d then find a downstream system, which initially had seemed to be a model citizen, but in reality had triggered a chain of failures.

Even though we gained more and more knowledge and experience from jumping down rabbit holes and fighting complex and interesting incidents, we couldn’t keep up with the ever-growing backlog of remediations, leaving us exposed to repeating the same incidents again and again. To make matters worse, the influx of new developers and services was adding to the rate at which incidents occurred. And though we very much wanted to ask for help from our developer counterparts, given that they were experts on their respective services, most of them lacked the big picture.

There were only a handful of SRE detectives, and an increasing number of incidents to study and resolve. We needed more detectives.

A solution emerged: we could make more detectives by teaching developers how the backend system worked. This insight sparked a rather popular lecture that came to be known as “Click-to-Play.”

Initially, we explained the backend and how it worked to a few interested developers. We noticed that it was easier both for us to teach and for them to learn if we followed a scenario that touched on all the backend systems involved, starting when a user logged in and ending when a song began to play. Eventually, this became a standard component of the onboarding process for engineers at Spotify, and a shorter, nontechnical version is now taught to all Spotifiers around the globe.

Key Learnings

Our key takeaways from this period of high growth were:

  • Deployment windows work to protect ops, so leverage them wisely.

  • Alerting and on-call need to have processes and expectations in place. Fail fast and learn through failure.

  • Teaching skills and responsibilities is an essential part of operations. Hold talks that teach how the entire system works.

  • It’s tempting to have a single team handle both production and IT in the early stages, but to be a productive SRE team, you should split them.

  • A “goalie” role or another formal way to handle interruptive work helps the team focus on proactive work.

Pets and Cattle, and Agile: 2011

Prelude

In this section, we talk about how we needed to become more agile in the way we approached operations and how our values informed that growth:

Agile ops

We needed to move from a mindset of treating servers like “pets” to one in which hardware clusters were “cattle”; this radically changed how we approached tooling and operations processes.

Core engineering values

Our inclination toward autonomy and trust from the start informed our ways of working. However, as we became comfortable with our operational processes, a larger shift was underway in the tech organization, which would cause us to again reevaluate our approach.

In the past, when we talked about services, we often talked about individual servers: “Server X’s disks are full”; “We need to add another CPU-heavy server to lighten the load of the other login servers”; and so on. This was important to us because each server had its own character and personality, and only by knowing this could we really optimize the usage of these servers. Furthermore, each healthy server was someone’s pet. You could tell which services were healthy by looking at how well the server was taken care of: for example, whether the /home directory was regularly cleaned or the service-specific logs were nice and ordered.

Our world was server-centric rather than service-centric, and this showed in our conversations, our prioritizations, and our tools.

Forming Bad Habits

The tools initially built and used at Spotify to manage our nascent fleet assumed that a server, after it was installed and live, would continue to work in that corner of the backend until it was decommissioned. It was also assumed that provisioning a server itself would be an unusual event in the grand scheme of things—something done only a few times a month. Therefore, it was acceptable that the tool that installed a server took a few minutes to run, with frequent manual interventions. From beginning to end, the installation of a batch of servers could take anywhere between an hour to a full workday.

By 2011, we were installing new servers far more often than anticipated. Repurposing servers was a known process, but there were often gotchas that forced operations to step in and fix things manually.

This wasn’t strange or wrong for a small company with a handful of servers, but at this point in our journey, Spotify already had two data centers with live traffic and a third on the way. With far more than a handful of servers, this was becoming increasingly difficult to do.

Breaking Those Bad Habits

There was a paradigm shift in the air, and though some of us saw it coming, none of us really knew what this would entail for us and the availability of our services. This paradigm shift, when it finally hit us, was simple and obvious: instead of adapting our intent to the hardware, we needed to adapt our hardware to our intent.

This shift was difficult for many of us to relate to because we were too comfortable in our routines. In fact, it took several years for this mindset to change completely, probably because it took that long for us to create the tools to make this shift in mindset possible in the first place.

Toward the end of 2011, another largely unnoticed shift began: feature developers were beginning to organize differently. Over lunches, we heard the devs talk about autonomous self-organizing teams and about which chapter they belonged to. We could still talk with them about services and users, but conversations now included a focus on products and stakeholders. The entire dev department was slowly making the transition into a scalable Agile organization, and we in ops observed this from the sidelines.

From the perspective of those of us in ops, what was most jarring about this organizational change was the ancillary effect it had on our role. The shift in nomenclature from “dev teams” to “squads” with the added emphasis on autonomy and self-organizing challenged the centrality to which we’d become accustomed. Squads were in turn grouped into so-called “tribes,” which also had explicit intent in being autonomous and self-organizing.

Instead of one organization to work with, ops was now faced with multiple tribes, each structured slightly differently, with worryingly few indications that they would stay homogeneous.

Key Learnings

Some of our key learnings from this period were:

  • Switch the mindset from server-centric to service-centric and make the tooling reflect that.

  • The introduction of the Agile matrix organizational model forced us to rethink operations.

A System That Didn’t Scale: 2012

Prelude

In this section we’ll talk about our challenges scaling the ops team with the organization:

Iterating over failure

Scaling effectively continued to prove difficult, and divesting our responsibilities was not enough. We needed to revisit what operations at Spotify meant.

Ops by default

A central ops team doing most of the operational work doesn’t scale. We needed to make ops truly default by completely shifting the operational responsibilities closer to developers.

It was now 2012 and the Spotify user base continued to grow, introducing new scalability and stability problems for us.

The ops team was composed of fewer than 10 SREs (it was, in fact, during this year that the term “SRE” was adopted), unevenly split between Stockholm and New York. As an ops owner, each SRE was operationally responsible for dozens of backend services: handling deployments of new versions, capacity planning, system design reviews, configuration management or code reviews, and maintaining operational handbooks, among several other day-to-day operational duties.

Our backend was now running in three data centers, with a fourth on the horizon, which we had to operate and maintain. This meant that we were responsible for configuring rack switches, ordering hardware, remote hands cabling, server ingestion, host provisioning, packaging services, configuration management, and deployment—the whole chain from physical space through the application environment. As a result of this widening responsibility, we formed another team to develop tooling automation, which worked closely with ops. One of the early products developed by this team was a configuration management database (CMDB) for hardware inventory and capacity provisioning.

With tailored configurations and nonuniformity, predictability was hard. An ops owner worked closely with the dev owner of a service in efforts to improve quality, follow production readiness practices, and run through a now-formalized deployment checklist to ensure that operational standards were present even during the initial service design. Concerns we regularly brought up included the following:

  • Is the service packaged and built on our build system?

  • Does the service produce logs?

  • Is there graphing, monitoring, and alerting?

  • Are backups set up and restore tests defined?

  • Was there a security review?

  • Who owns this service? What are its dependencies?

  • Any potential scalability concerns? Are there any single points of failure?

Manual Work Hits a Cliff

Even though we deployed constantly, either manually or via configuration management, continuous delivery was not in place. We were still delivering at a good pace, continuously, but at the expense of a lot of work done by hand.

Service discovery consisted of static DNS records with manually edited and maintained zone files. DNS changes were usually reviewed and deployed by ops. We achieved mutual exclusion during DNS deploys by shouting “DNS DEPLOY!” on IRC and manually executing a script.

The ops owner and dev owner regularly went through a capacity planning spreadsheet to ensure that the service had enough capacity to sustain the current increase in usage. Access patterns and resource utilization were collected and used to predict capacity needs according to the current growth. For an ops owner of dozens of services, that meant doing a lot of capacity planning.
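As a rough illustration of the arithmetic those spreadsheets encoded, the sketch below projects peak load forward and sizes a cluster with some headroom; the growth rate, headroom, and per-host throughput are invented numbers, not Spotify’s.

    # Back-of-the-envelope capacity projection of the sort those spreadsheets
    # captured; all figures here are illustrative, not real Spotify numbers.
    import math

    def hosts_needed(peak_rps: float,
                     monthly_growth: float,
                     months_ahead: int,
                     rps_per_host: float,
                     headroom: float = 0.3) -> int:
        """Project peak load forward and size the cluster with spare headroom."""
        projected_peak = peak_rps * (1 + monthly_growth) ** months_ahead
        usable_per_host = rps_per_host * (1 - headroom)   # keep ~30% spare
        return math.ceil(projected_peak / usable_per_host)

    # Example: 12,000 req/s today, 15% growth per month, plan 6 months out,
    # and one host comfortably serves 800 req/s.
    print(hosts_needed(12_000, 0.15, 6, 800))   # -> 50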

During the second half of 2012, we hit one million concurrent users. That meant a million people listening to music connecting directly to one of our three data centers. A pretty huge feat.

Looking back, that achievement should be attributed to early decisions that paid off: a Spotify backend architecture designed to scale, early client/protocol/backend optimizations, and solid implementation. Also, with no multitenancy, most failures could be easily pinpointed and isolated. Every backend service did one thing and did it right. We kept it simple.

As the number of users streaming music continued to increase exponentially, the ops team couldn’t do the same. We found that the centralized ops SRE team was a system that could not scale.

Key Learnings

Our key learnings from this period were:

  • Make ops truly default from the get-go; if you build it, you run and operate it.

  • Shift operational responsibilities closer to the know-how: the developers.

  • As your service ecosystem scales, make sure to revisit how operational work is scaling. What worked well yesterday might not work well today.

Introducing Ops-in-Squads: 2013–2015

Prelude

In this section we talk about how a new approach to operations reduced bottlenecks and allowed the tech organization to grow more rapidly:

Iterating over failure

Adopting the new model of Ops-in-Squads freed us up to focus on minimizing the amount of manual work.

Core engineering values

Squads owning their own operations unwittingly helped us to maintain our values in autonomy and trust. Engineers had the potential to inflict widespread damage but were given the tools and processes to avoid it.

At this point, the engineering organization had become too large to operate as a single team. The Infrastructure and Operations (IO) tribe was formed, the home of teams focusing on delivering infrastructure for our backend developers and tackling the problems that came with operating at scale. One part of this tribe was called Service Availability (SA), mostly consisting of our engineers previously working in production operations. By 2013, SA consisted of four squads: security, monitoring, and two squads working on any other infrastructure tooling needed to provision and run servers. The rate at which we hired new developers and started new dev teams was too high for those four SA squads to keep up with. More and more, the ability to deliver new or improved features slowed down due to the pace at which we could buy and rack new servers, review and merge Puppet changes, add DNS records, or set up alerting for a service. Dozens of feature teams were impatiently waiting to get their changes out to our users but had to wait on us for code reviews or for hardware to be provisioned for their services.

We were also still on call for core services, and the operational quality and stability of those services suffered as we tried to keep up with our accumulation of tasks. The global “operations backlog” just kept on growing and growing.

Lightening the manual load

After looking at data on how our backend engineers worked and where they were most often blocked, we decided to focus on removing those blockers wherever possible. It was time to bring back the small-startup collaboration with mutual trust between dev and ops and enable our feature teams to iterate fast while still keeping production reliable.

One of the first areas to improve was the provisioning of servers. When servers were available in our data centers, we still had to bootstrap the operating system and configure the basic environment. We had some basic automation, but to kick off that process, the backend engineers had to create a ticket for the ops team, specifying which data center they needed, how many servers, and for what service. The goalie would then pick this up and use our set of duct-taped databases and command-line tools to kick off the provisioning process. After about 40 minutes of automated tasks, the servers were often ready for backend developers.

Our first attempt at improving this process was a tool called provgun, short for “provisioning gun.” Instead of filing a ticket for the ops team to act on or hunting down an operations person, teams could now open a JIRA ticket that kicked off the provisioning themselves. A cronjob would periodically scan those open tickets, automatically run all the steps we had previously done manually, and report back to the ticket when the servers were successfully provisioned.
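To give an idea of the shape of that flow, here is a hypothetical sketch of a cron-driven script in the spirit of provgun; the JIRA project, custom field names, and provision() helper are stand-ins, not provgun’s actual implementation.

    # Hypothetical sketch of a provgun-style cron job: scan open provisioning
    # tickets, run the formerly manual steps, and report back on the ticket.
    # The JQL filter, custom fields, and provision() are stand-ins.
    import requests

    JIRA = "https://jira.example.net"
    AUTH = ("provgun-bot", "secret")                # hypothetical service account

    def open_requests() -> list:
        resp = requests.get(f"{JIRA}/rest/api/2/search",
                            params={"jql": "project = PROV AND status = Open"},
                            auth=AUTH)
        resp.raise_for_status()
        return resp.json()["issues"]

    def provision(datacenter: str, count: int, service: str) -> list:
        """Stand-in for the previously manual provisioning steps."""
        return [f"{service}{i}.{datacenter}.example.net" for i in range(1, count + 1)]

    def comment(issue_key: str, text: str) -> None:
        requests.post(f"{JIRA}/rest/api/2/issue/{issue_key}/comment",
                      json={"body": text}, auth=AUTH).raise_for_status()

    def main() -> None:
        for issue in open_requests():
            fields = issue["fields"]
            hosts = provision(fields["customfield_datacenter"],   # hypothetical fields
                              int(fields["customfield_count"]),
                              fields["customfield_service"])
            comment(issue["key"], "Provisioned: " + ", ".join(hosts))

    if __name__ == "__main__":
        main()                                      # invoked periodically from cron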

Automating provisioning freed up time for the ops team to focus on other things, like the next iteration of this system: a custom web interface that exposed hardware configuration options and available stock. This web interface showed the queue of outstanding requests, the locality of servers, rack diversity, and the current stock of available servers in every cage of the data centers, giving developers more choice and a better understanding of how we built our hardware ecosystem.

The next thing we tackled was the DNS infrastructure. At the time, DNS records for servers were manually added and deployed by the ops team. The first step was to use a CMDB as the source of truth for determining which server records should be automatically added to the zone files. This helped reduce the number of mistakes, like forgetting to add the trailing dot. When enough confidence had been built that these zone files were accurate, they were automatically deployed to the authoritative DNS servers. This, again, freed up much of our time, and developer satisfaction increased.
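The generation step can be pictured roughly like the following sketch, which renders A records from a CMDB inventory and never forgets the trailing dot; the inventory format and zone layout are simplified assumptions, not our actual tooling.

    # Simplified illustration of generating A records from a CMDB inventory;
    # the inventory format and zone layout are made up for the example.
    def a_records(inventory: list) -> str:
        """Render one A record per host, always with the trailing dot that
        humans kept forgetting."""
        lines = []
        for host in inventory:
            fqdn = f"{host['name']}.{host['site']}.example.net."   # note the trailing dot
            lines.append(f"{fqdn}\t3600\tIN\tA\t{host['ip']}")
        return "\n".join(lines) + "\n"

    inventory = [
        {"name": "playlist1", "site": "lon", "ip": "10.0.1.17"},
        {"name": "playlist2", "site": "lon", "ip": "10.0.1.18"},
    ]
    print(a_records(inventory))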

Services in our backend discover one another using DNS SRV records. These, plus any user-friendly CNAMEs, had to be manually added to the zone files—a tedious and error-prone task that still required an operations engineer to review and deploy the changes.

To remove this bottleneck, we considered ways in which we could automate the review and deploy process. A basic testing framework was introduced in which one could express the checks a reviewer would otherwise look for, such as “is the playlist service discoverable in our data center?” We also created a bot that allowed anyone to merge a change as long as the tests passed and it had been peer reviewed.
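As a rough idea of what such a check might look like, the sketch below uses the dnspython library to ask whether a service is discoverable via an SRV record; the record naming scheme and domain are assumptions for the example, not our actual framework.

    # Rough sketch of an "is the playlist service discoverable?" review check
    # using dnspython; the SRV naming scheme and domain are hypothetical.
    import dns.resolver   # third-party dnspython package

    def is_discoverable(service: str, site: str, domain: str = "example.net") -> bool:
        """True if at least one SRV record resolves for the service in a site."""
        qname = f"_{service}._tcp.{site}.{domain}"
        try:
            answers = dns.resolver.resolve(qname, "SRV")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return False
        return any(record.port for record in answers)

    # An example review check:
    assert is_discoverable("spotify-playlist", "lon"), \
        "playlist service is not discoverable in the lon data center"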

Letting anyone merge was scary: a simple change in one file could impact an entire data center. There were some initial incidents in which a bad push by a team took down much more than expected. But our backend developers quickly learned about the power and responsibilities that came with these changes; soon the number of mistakes dropped to almost zero. Again, this reduced the time they had to wait for the ops team from days to minutes.

After the success with DNS and server provisioning, the next piece of the puzzle was Puppet, the configuration management system that installed all the software on our servers and deployed our applications. The ops team had long accepted patches to the Puppet codebase but reviewed every commit before merging anything. For larger commits or complex systems, this meant changes could be stuck in review limbo for days until someone found enough time to review them.

We tried the approach taken with DNS: merge your own change as long as your patch had a positive review from someone else. For the first few weeks, we were all pretty anxious and monitored almost every commit that was merged. We soon found that we had little reason to. Giving backend developers a full feedback cycle—make a change, merge, deploy, find a problem, and go back and do it all over again—gave them that much more insight into many of the common problems we had and overall led to improved code quality and fewer mistakes.

These first few steps in making the teams operationally responsible were a success. They could now get servers, add DNS records for those servers, and deploy configuration and applications to the newly provisioned servers by themselves. Not only did we remove major friction points and idle time for our developers, but the stability of the services increased as a result of this first shift from “build it” to “run it.”

Building on Trust

However, as we removed these “sanity checks” previously done by the ops team, we faced a new problem. Without the guidance of the ops team, the entropy in our backend exploded. Poorly tested services and simple experimentations made it into the production environment, sometimes causing outages and paging our ops on-call engineers who now lacked the required understanding of what was running in production. We needed to figure out how to get the knowledge and operational responsibilities back into one place, paging the right people to troubleshoot an incident.

The traditional ops-owner approach worked well when Spotify was a small organization, but it had complexity and scalability issues, both technical and organizational. It became impossible to rely on a single systems owner, or a pair of them, to figure it all out. We also realized that acting as a gatekeeper to operations meant we kept a monopoly on operational learning opportunities.

Handing this responsibility over to the backend teams seemed like the logical next step. The next question was how to proceed. We needed to actively engage the teams and figure out a way forward with the rest of the tech organization.

We started rolling out a new way of working—dubbed Ops-in-Squads—which in essence included everything needed to hand over on-call and operational responsibility for services to the teams that developed them. We needed to write tooling to do many of the things that were done by hand before; we needed better documentation and training so that developers would be able to troubleshoot production issues; and we needed buy-in from developers to drive this change.

A guild was created for developers and ops people to share knowledge and open a bidirectional communication channel; it helped define some key things we needed:

  • A standardized way of defining Service-Level Agreements (SLAs; we did not use the Service-Level Indicator [SLI]/Service-Level Objective [SLO] concepts back then), as sketched after this list

  • A how-to crash course on being on call, handling incidents, performing root-cause analysis, and holding postmortems

  • Guidance on capacity planning

  • Best practices for setting up monitoring and alerting as well as instruction on interpreting monitoring data

  • Training in troubleshooting, system interactions, and infrastructure tooling
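For the first item in the list above, a standardized SLA definition might have looked something like the following sketch; the fields and targets are illustrative, not the actual template we used.

    # Illustrative sketch of a standardized SLA definition; fields and targets
    # are examples, not the template Spotify used at the time.
    from dataclasses import dataclass

    @dataclass
    class ServiceSLA:
        service: str
        owner_squad: str
        availability_target: float      # fraction of successful requests
        latency_ms_p99: int             # 99th-percentile response time
        paging_hours: str               # when breaches page the owning squad

    playlist_sla = ServiceSLA(
        service="playlist",
        owner_squad="playlist-squad",
        availability_target=0.999,
        latency_ms_p99=250,
        paging_hours="24x7",
    )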

Handing all of this responsibility over to the teams would of course be a big undertaking over a long period of time, and the infrastructure team still remained to support and build upon the core platform. We struggled with defining exactly what fell into the core platform; some things were obvious, like networking, monitoring, and provisioning, but other systems were harder to classify. Critical parts of our infrastructure, like the user login service, the playlist system, or the song encryption key system, needed to be evaluated: are they core infrastructure, or should they be handled like any other backend system?

We wanted to make a deal: we would remove gatekeeping and friction points in exchange for shifting operational responsibilities into the feature teams. If a team needed to deploy a change on a Friday, it shouldn’t be prevented from doing so. We assumed that the knowledge needed to operate a service is held closest to where the service is developed.

This was not met with cheers and optimism from everyone in the organization; there were concerns about negatively impacting feature teams’ speed of iteration. How would teams have sufficient bandwidth to deliver on features and growth if they also handled the maintenance and operational duties of their systems? The teams consisted of developers, not SREs; how much of their time would they have to devote to learning and practicing operations? Preparation, process, and tooling would be essential to convince people to take on this responsibility.

We continued investing heavily in picking blessed tooling, writing and promoting frameworks, and writing documentation to help make the transition as smooth as possible for most teams.

Driving the Paradigm Shift

We aim to make mistakes faster than anyone else.

Daniel Ek, Spotify founder and CEO

To reach all the teams with the information and tools they needed, we took several different approaches. We held presentations for developers, discussing how the systems in our backend fit together and how to use the shared infrastructure. Our principal architect held internal postmortem sessions for dozens of developers, walking them through a large incident that had resulted from a cascading failure. We made it easier for teams to find documentation and ownership information for infrastructure systems by providing a central entry point. We developed an Ops-in-Squads handbook that became the standard document to point new developers to. It contained most of the information that teams needed to get started: pointers on things to do, where to read more, and checklists of processes and technical features to implement. We embedded in backend teams for short periods to transition the host team to being on call for their services, helped teach operations, and worked directly on improving systems, deployment procedures, on-call handbooks, and so on for teams that requested or needed it. Again, we made a deal: we would work with the team for a few sprints and “clean house” before handing over the on-call responsibilities to them.

In 2014, this approach—embedding engineers in teams—was expanded into an Ops-in-Squads tour where some parts of our operations team went week by week to different teams, engaging with their daily work and trying to help them on anything with which they struggled.

During these embeds, we reviewed the architecture of the host teams and looked at alerting and monitoring, helping to improve these when needed. We also discussed how to do on-call and shared best practices around schedule rotation, how to escalate problems, and how to hold postmortems. Postmortems, in particular, had a reputation of being time-sink ceremonies. Teams that had a lot of incidents often skipped postmortems because the work to establish timelines, find root causes, and define remediations for 10 or 20 incidents per week seemed like too much overhead. Finding different ways of approaching postmortems proved useful in these cases, like clustering incidents by theme or doing short postmortems for multiple incidents at once. Often, many incidents had similar root causes; exact timelines weren’t as important as identifying the top remediations that would reduce the likelihood of that type of incident by 90% in the future. Throughout this shift, we retained the principle of blameless postmortems, emphasizing the importance of learning from our mistakes and ensuring that no one was thrown under the bus.

Key Learnings

Some of our key learnings from this period were:

  • Strive to automate everything. Removing manual steps, friction, and blockers improves your ability to iterate on product.

  • Ensure that self-service operational tooling has adequate protection and safety nets in place.

  • Teaching ops through collaboration is vital to making ops “default” in the organization.

Autonomy Versus Consistency: 2015–2017

Prelude

In this section we’ll talk a bit about how we tried to balance autonomy for squads with consistency in the tech stack:

Iterating over failure

Our first approach to introducing consistency in the technology stack caused unintentional further fragmentation. Although we continued to iterate and address these newly exposed concerns, we found ourselves blocking teams once again.

Core engineering values

Focusing on unblocking feature teams allowed us to maintain squad independence and freedom while introducing much-needed infrastructure consistency.

In moving forward with the Ops-in-Squads model, we shifted our focus to standardizing our technology stack. With this decentralized operations model, we needed to reduce the cost of operations for teams by providing consistency across our infrastructure. High entropy meant expensive context switches and unnecessary overhead. However, striking a balance between consistency and our penchant for autonomy took thoughtfulness and foresight.

Bringing uniformity to Spotify’s stack came in a few forms. Although teams were now in charge of operating their services, we needed to get out of their way by building abstraction layers in the right places. The first abstraction layers were tools for ourselves, iterating on our early successes in removing friction points with provgun and DNS. Out of this work came a number of tools: moob, for out-of-band management of hardware; neep, a job dispatcher to install, recycle, and restart hardware; and zonextgen, a batch job to create DNS records for all servers in use, to name a few.

Building on the earlier legwork, from 2015 through 2016 we concentrated on creating and iterating on self-service tooling and products for feature developers around capacity management, Dockerized deployments, monitoring, and SLA definitions. No longer did developers need to plan capacity with spreadsheets or deploy by SSH’ing into servers to run an apt-get install. With a few clicks of a button, squads got the capacity they needed with their service deployed.

In 2015, we established a blessed stack that became the “Golden Path” for how a backend service is developed, deployed, and monitored at Spotify. We use the term Golden Path to describe a series of steps that is explicitly supported, maintained, and optimized by our infrastructure team. Instead of dictating the blessed stack as a mandatory solution, we wanted to make the Golden Path so easy to use that there was almost no reason to use anything else.

The Golden Path consisted of a step-by-step guide for our developers. It covered how to set up their environment; create a simple, Dockerized service; manage secrets; add storage; prepare for on-call; and, finally, properly deprecate a service. We built Apollo, our Java microservice framework, which gave developers many features for free, like metrics instrumentation, logging, and service discovery. Heroic and Alien, our time series database (TSDB) and its frontend, allowed engineers to create prepackaged dashboards and alerts for their instrumented-by-default Apollo services. Pairing with Helios then provided a supported way to roll out Docker deployments in a controlled manner with zero downtime.

Benefits

Squads were now able to focus even more on building features because operational tasks were continuously being minimized. But developer speed was not the only benefit of improving the consistency of our infrastructure. It also allowed our migration from physical data centers to Google Cloud Platform to be fairly seamless. Most of the work happened behind the scenes from the developer point of view: creating capacity in Google Cloud Platform was no different than on bare metal, and neither was deploying or monitoring a service. Given a consistent set of tools to work with, moving to a new squad or doing a temporary embed also became frictionless for developers.

Another benefit of a consistent, blessed stack is that it keeps us from falling into the traps of trendy technologies that come and go. We also gain insight into what engineers are using, which informs how we improve our products. There is less operational complexity and fragmentation, which can be a nightmare during incidents. We’re able to prescribe best practices for operating a service and provide ongoing support. As an infrastructure team, we decided to take on fewer responsibilities, but strove to handle them well.

Trade-Offs

It was not all smooth sailing, though. Standardizing a technology stack still had its drawbacks. If it was too restrictive, we risked losing squad autonomy and experimentation, hindering development when not all use cases were addressed. We defined the Golden Path to be an Apollo service, but hadn’t yet provided explicit support for the legacy and internal Python services, frontend applications, or data pipelines and analytics. We provided the ability to easily create and destroy capacity, but we didn’t yet generally support autoscaling. These unaddressed use cases indirectly contributed to the fragmentation we were confronting, as some teams created their own bespoke tools as workarounds.

In some cases, supported tools were not used. JetBrains’ TeamCity used to be the only continuous integration (CI) system in use. It became too cumbersome for backend service developers, so squads spun up their own Jenkins instances. As evidence of team autonomy and experimentation, the use of Jenkins spread so quickly among squads that it soon became the de facto CI tool for backend services. That made sense with the Ops-in-Squads model we had adopted, except that teams lacked good habits for maintaining their own Jenkins setups, leaving those instances out of date, vulnerable, and inadequately secured. This forced us to rethink our build pipeline support, and we ultimately developed an explicitly supported, managed Jenkins service. Similarly, we found developers were not maintaining their Cassandra clusters, despite the tools our team provided. Fearing potential data loss and realizing that maintenance was too much overhead for a feature team, we brought this, too, back into our operational ownership by offering a managed Cassandra service and related support.

Even though we had many self-service tools to unblock feature teams, the IO tribe still got in the way. In 2017, our focus shifted again to strengthening our overall infrastructure by prioritizing ephemerality, security, and reliability. We built tools for initiating controlled rolling reboots of our entire fleet, which encouraged developers to write resilient services. We automated regional failover exercises, bringing to light inconsistencies in teams’ service capacity across regions. And as we take advantage of more of the products and services that Google Cloud Platform offers, teams now need to take time away from feature development to, for instance, migrate to a cloud-native storage solution such as Bigtable, but they reap the benefits when they no longer need to maintain that infrastructure themselves.

Although we succeeded in shifting to the Ops-in-Squads model and struck a balance between autonomy and consistency, we are now focusing on removing friction from operations—but we still have a long road ahead.

Key Learnings

Some of our key learnings from this period were:

  • Golden paths provide a low-friction way of getting from code to production fast.

  • Support one blessed stack and support it well.

  • Giving clear incentives (e.g., operational monitoring, continuous deployment pipelines) is essential for blessed-stack adoption.

The Future: Speed at Scale, Safely

When thinking about the future, we can imagine a landscape where the operational burden for feature teams has been safely reduced to nearly zero. The infrastructure supports continuous deployment and rapid experimentation across hundreds of teams, and there is little to no cost for the majority of developers in operating their services at scale. This is the dream, but we’re not fully there yet.

There are a number of technology shifts being made as part of this zero ops dream. The first part of this strategy for us is the migration to the cloud. Instead of spending time on tasks that don’t give us a competitive advantage, like data center management and hardware configuration, we can shift that problem to cloud providers and benefit from their economies of scale.

The second part is around adopting cloud primitives, shifting from bespoke solutions to open source products with vibrant communities. An example of this is our planned move from our homegrown container orchestration system, Helios, to a managed Kubernetes service (Google Kubernetes Engine). In adopting Kubernetes instead of further investing in our own container orchestration system, we can benefit from the many contributions of the open source community. Making these shifts allows the ops teams to focus on higher-level problems facing the organization, thereby delivering more value.

Even with the abstractions of the cloud, ops teams still own the uptime of the platform. Toward this end, we are adopting a mantra, speed at scale, safely, or s3 for short. We want to enable Spotify to iterate as fast as possible but to do so in a way that is reliable and secure. Our move to the cloud is consistent with this message, but as infrastructure and operations engineers, we also face a more nuanced problem space. Initially, services, data centers, network, and hardware were all architected, provisioned, and managed by us; we understood the intricacies of operating and supporting these systems. With the cloud and, moreover, our ever-increasing scale, we need better insights, automation, and communication channels to ensure that we can meet our internal availability SLO of 99.95%. Therefore, we’ve invested in reliability as a product, which includes domains ranging from chaos engineering to black-box monitoring of services.
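
To put that availability target in perspective (this is simply the arithmetic the SLO implies, not a description of our error budget policy), 99.95% availability leaves an error budget of 0.05% of the time:

    (1 − 0.9995) × 30 days × 24 h × 60 min ≈ 21.6 minutes of downtime per month
    (1 − 0.9995) × 365 days × 24 h ≈ 4.4 hours of downtime per year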

Another part of the challenge of speed at scale, safely involves how we guide feature teams through the myriad technology product offerings. We can’t serve as roadblocks to innovation, but we also need to ensure the reliability of the platform, which means we need to guarantee consistency in those cases for which reliability is a core concern. This involves building the right tooling, conformity engineering, a robust developer platform that engineers use to find the right product for their needs, and a concerted teaching and advocacy effort.

Toward this end, we will also need to reevaluate our Ops-in-Squads model. Although this structure has served us well, we’ll need to consider how we can further reduce the operational burden that feature squads face today. In the few years since this change was put in place, we have grown and learned as a company, and the same metrics for success that applied in the past might not be as applicable in the future. We don’t know exactly what this will look like, but we know that we’ll be agile and continue to experiment to find the best solution. For example, consider our incident management process. A team’s on-call engineer will triage incidents affecting its services and, if needed, escalate to the IMOC for assistance with overall incident coordination and communication during high-impact situations. Although the IMOC structure has been an important mechanism to quickly swarm on high-impact issues, with our increasing workforce spread across the globe, it’s not always well understood when to page IMOC and, more generally, how to handle team on-call. In such cases, we will need to improve our teaching of operational best practices throughout the organization, and how we do so (how we advocate for operational quality) might require adjustments to our original thinking behind Ops-in-Squads.

Finally, as the technology landscape changes, so do our site reliability needs. One area that we’re exploring is machine learning. With thousands of virtual machine instances and systems, and more than a hundred cross-functional teams, we might find machine learning an effective way to make sense of an ever-growing, complex ecosystem of microservices. We also have the opportunity to revolutionize what has largely been manual tuning in the past, whether it be “right-sizing” our cloud capacity needs or detecting the next big incident before it occurs; the opportunities here are manifold. We might not only predict incidents, but also mitigate them with self-healing services or by providing automatic failbacks to a trusted state.

There are other infrastructural trends to consider; for instance, adopting serverless patterns imposes a set of challenges for our service infrastructure, from how we monitor to how we deploy and operate. We also see a shift that’s both organizational and technical; our developers are increasingly working across the landscape of backend, data, mobile, and machine learning systems. To support their end-to-end delivery, we need touchpoints that are seamless across these areas. Reliability can no longer be focused on the robustness of backend-only systems but must also consider the network of data and machine learning engines that provide increasing value to our ecosystem.

We hope you’ve enjoyed learning about our successes and failures in building a global operations and infrastructure organization. Although we’ve experimented with many different flavors of SRE and now function in this Ops-in-Squads model, we believe that what has made us successful is not the model itself but our willingness to embrace change while keeping core SRE principles in mind. As the technology landscape changes, we’ll also need to keep evolving to ensure that the music never stops streaming.

Contributor Bios

Daniel Prata Almeida is an infrastructure and operations product manager for reliability at Spotify. In a previous life as an SRE, he carried the pager and nurtured complex distributed systems. He’s addicted to uptime.

Saunak Jai Chakrabarti leads infrastructure and operations in the US for Spotify. He’s passionate about distributed systems and loves to study all the different ways they break or work unexpectedly.

Jeff Eklund, formerly at Spotify, is an SRE and technology historian with a passion for the arcane and quirky. He believes every computer problem can be solved by tenacity, camaraderie, and ramen.

David Poblador i Garcia, ex-SRE and product lead, now engineering lead for the technology platform at Spotify, is rumored to be the friendly face of deep technological knowledge and strategy.

Niklas Gustavsson is the former owner of Spotify’s worst-behaving service and currently serves as Spotify’s Principal Architect.

Mattias Jansson (Spotify) is an ex-{Dev,Teacher,SRE,Manager}, current Agile coach. He’s a history geek with a passion for systems—both silicon- and carbon-based.

Drew Michel is an SRE at Spotify trying to keep the lights on in the face of chaos. In his spare time, he enjoys running long distances and walking his Australian Shepherd dog.

Lynn Root is an SRE at Spotify with historical issues of using her last name as her username, and the resident FOSS evangelist. She is also a global leader of PyLadies and former Vice Chair of the Python Software Foundation Board of Directors. When her hands are not on a keyboard, they are usually holding a bass guitar.

Johannes Russek, officially former SRE and technical product manager at Spotify, unofficially dropped-ball finder and software systems archeologist. Favorite tool is a whiteboard.
