Chapter 13. Production Engineering at Facebook

David: What’s production engineering?

Pedro: Philosophically, production engineering stems from the belief that operational problems should be solved through software solutions and that the engineers who are actually building the software are the best people to operate that software in production.

In the early days of software, a developer who wrote the code also debugged and fixed it. Sometimes, they even had to dive into hardware issues. Over the years, with the advent of remote software systems, the internet, and large data centers, this practice changed dramatically. Today, it’s still common to see software engineers writing and developing applications, then handing off their code to a QA team for testing, and then handing that off to another team for deploying and debugging. In some environments, a release engineering team is responsible for deploying code and an operations team ensures the system is stable and responds to alerts. This works fairly well when QA and operations have the knowledge required to fix problems, and when the feedback loops between the teams are healthy. When this isn’t the case, fixing and/or debugging production issues needs to work its way back to the software engineers, and this workflow can significantly delay fixes. At Facebook, our production engineering [PE] team is simply bringing back the concept of integrating software engineering [SWE] and operations.

We started the PE model a few years ago to focus on building a more collaborative culture between the software engineering and operations teams. Our goal is to ensure that Facebook’s infrastructure is healthy and our robust community of users can access the platform at any hour. The PE team is a critical component in accomplishing this through automation, new tools that make operations easier for everyone, performance analysis, hardcore systems debugging, firefighting when necessary, and teaching others how to run their systems themselves. Facebook engineering has built common infrastructure that everyone uses to build and deploy software. Facebook’s infrastructure has grown organically over the years, and while I’m confident that we will solve many of the operational problems we have today, we haven’t solved them all yet. Production engineers help bridge this gap so that teams can get back to solving the hard software problems we face and spend as little time as possible on operational problems.

The PE team not only writes code to minimize operational complexity, but also debugs hard problems in live production that impact billions of people around the world—from backend services like Facebook’s Hadoop data warehouses, to frontend services like News Feed, to infrastructure components like caching, load balancing, and deployment systems. PE, working side by side with SWE, keeps Facebook running. The team also helps software engineers understand how their software interacts with its environment.

Think of production engineering as the intersection of large-scale manufacturing (hardware, automotive, industrial, etc.), expert engineering, and good operational management. A production engineer typically has wide knowledge of engineering practices and is aware of the challenges related to production operations. The goal of PE is to ensure production runs as smoothly as possible. A great analogy to our role is that of a production line for manufacturing automobiles. A team of designers creates the car, a team of engineers builds the hardware, and another team is responsible for the automation that puts it all together. When this process breaks, a production engineer steps in with knowledge of the whole flow of the production line, including everything upstream and downstream. They understand how the automobile was designed, how it’s supposed to function, and what the software used to build it is supposed to do. Armed with this knowledge, production engineers can troubleshoot, diagnose, and fix the issue if they need to, and they also work with the entire team to prevent it from happening in the future.

At Facebook, production engineers are not line operators, but they do know how everything in the line actually operates. For example, when software is responding to user traffic, or even when it’s failing, production engineers are often the ones who best understand how the code interacts with its environment, how to fix and improve it, and how to make it performant over time.

David: Can you say a little bit more about the origin story of PE?

Pedro: During its early years, Facebook applied the then-industry-accepted approach to operating its production website and services via a dedicated operations team. The operations team consisted of separate Site Reliability Engineering [SRE] and Application Operations [AppOps] teams.

The SRE team had more in common with a traditional communications provider’s network operations center [NOC] than it did with solving operations problems through software. At the time, a team of fewer than 20 monitored the systems infrastructure for issues, reacted to alerts, and triaged problems using a three-tier escalation process with the support of AppOps. SRE worked in three shifts to provide 24/7 coverage. AppOps, on the other hand, was a small set of individuals already embedded within Software Engineering [SWE] teams. The model back then was effectively one AppOps engineer per service (for example, newsfeed, ads, web chat, search, data infrastructure), with a ratio of production engineers to software engineers of approximately 1:10 that could range as high as 1:40. Sometimes, there was one AppOps engineer assigned to multiple services. For example, data warehouse and centralized logging systems shared one embedded AppOps engineer. Early on, AppOps was able to understand the full application stack well enough to react to and resolve outages quickly.

Even though the SRE team was supposed to be engaged with SWE teams developing software, SRE was often absent in the early development process and regularly found out about new software changes or new services only as they were being deployed. The relationship between SRE, AppOps, and SWE wasn’t strong. The SRE team also had to juggle the additional responsibility of lighting up new data centers and ensuring capacity needs were met during peak traffic hours. SRE ensured capacity needs were met by shifting user load across multiple data centers using the web interfaces of enterprise-level load balancers. There was some automation, through ad hoc shell scripts and lightweight tooling that drove the APIs of these devices or in-house code bases.

While the SRE, AppOps, and SWE teams focused on scaling the website and infrastructure for web-based services, users started transitioning from desktop to mobile, and the whole company was reprioritized to follow a mobile-first strategy. This significantly increased service complexity and accelerated the need to provision an ever-increasing amount of infrastructure. The scaling of Facebook’s services in a mobile world, combined with hyper-scale customer growth, overwhelmed the SRE and AppOps teams. The majority of the AppOps engineers had to be on call 24/7, and firefighting was the norm. As user growth and complexity increased, both SRE and AppOps were unable to focus on recruiting and couldn’t hire additional staff for months, pushing the teams further underwater.

Adding to an already challenging situation, infrastructure capacity was falling behind due to a lack of automated data center provisioning from the SRE team and the constant need to fix individual servers in production. Without the necessary automation to keep up with server failures and adding new clusters, we ended up in a capacity crunch. People were overwhelmed and experiencing burnout. A better approach was clearly needed. Our management team recognized that the operations team had not kept pace with changes in the business and that the current operating model was no longer adequate.

The main challenges we faced were unclear expectations of the roles of SRE, AppOps, and SWE; an inability to balance firefighting, server failures, and turning up new capacity; and weak credibility with the SWE teams when asking for changes in architecture.

We first focused our efforts on clarifying the roles for SRE, AppOps, and SWE. We understood that we needed to be involved earlier in the software process and that the embedded model had a higher likelihood of effecting change. We also needed to establish stronger working relationships and credibility between the software and operations teams and ensure that the SWE teams had stronger ownership of their services.

We executed a multistep reorganization and hired different types of engineers. To continue managing the growing infrastructure needs, we decided to split the SRE staff into two. We would retire the SRE moniker and stand up a new Site Reliability Operations [SRO] team, and we would expand the existing AppOps team with some members from the former SRE team. By moving key individuals from the former SRE team we were able to expand AppOps’ operational knowledge and staffing quickly and double the team’s size. Retiring the SRE moniker also helped change expectations from the SWE teams over time.

SRO’s remit was to focus on two areas. The first was to continue the needed and noble effort of firefighting outages. The second was to build software and automation to reduce human involvement in these efforts. We also moved some of the reorganized SREs into an SRO function focused on turning up new capacity.

The discussions around expectations of these roles brought to light the fact that many SWEs didn’t have empathy for the role of these operational teams. Operations was a bit of a black box to many software engineers, and we needed to change this perception. The embedded model would help to make sure that the SWEs understood what it takes to operate a service at scale, and, later, as we embraced shared on-call, we would make space for SWEs to gain more empathy. However, we didn’t always have the credibility needed to influence the SWE teams to change the architecture to be more stable. Part of the problem was that SWE, SRO, and AppOps didn’t speak the same language of algorithms, concurrency, scalability, and efficiency that goes into building a distributed system. There were a few engineers in SRO and AppOps that had this knowledge, but it became clear that we needed additional individuals who were more like software engineers and also understood operations to help make this transition. As a first step, we actively recruited and hired several technical leaders with deep experience in operations and infrastructure, focusing heavily on candidates who demonstrated a strong cultural fit in addition to technical acumen. These new hires needed both algorithmic and practical coding skills in addition to an understanding of low-level systems and building distributed services. We also desired supportive communication and influence skills. We were successful in finding this type of engineer and adding them to both SRO and AppOps.

Changing SWEs’ expectations of SRO and AppOps was challenging. Engineers who had been at Facebook when SRE and AppOps were established expected these teams to merely do firefighting and operational work for the SWE teams. A typical comment overheard in meetings when discussing remediation of issues during outages and ownership of stability would have been: “Well, that’s the AppOps job. I shouldn’t be doing this as a SWE.” Even though there were some SWE teams that had a culture of strong ownership of stability and did firefighting or worked on solving operational problems, many did not. We needed to change the expectation of what the operations teams did, and we needed the SWEs to take ownership of their stability and operational issues in production.

It wasn’t enough that we had changed our hiring practices to focus more on software engineering skills, in addition to network fundamentals, systems internals, and large-scale systems design. In order to reset the expectation bit, we needed to rebrand. We decided to remove the word “Ops” from the “AppOps” name, and the team became Production Engineering.

Removing “Ops” from the name had two main outcomes:

  • It communicated to the rest of the company that we were an engineering team building software to solve operational problems, not just traditional operators of systems.

  • It helped reset the notion that SWEs could just have their services run by some other ops team.

We still had one major hurdle to overcome related to ownership of stability and operational issues. As more SWEs, and now production engineers, worked on scaling services, SRO continued to be an operational backstop during service outages and responded to alerts 24/7. SRO embarked on building an automation framework we named FBAR [FaceBook Auto Remediation] that solved one of the main pain points of running a service in production at the time: removing failing servers and replacing them with healthy ones. SRO continued to grow while this system matured, with an eye toward eventually handing this operational work over to automation written by PEs and SWEs.
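To make that idea concrete, here is a minimal sketch, in Python, of the kind of detect-and-replace loop a system like FBAR automates: drain a failing host, promote a spare, and only page a human when automation can’t cope. This is not Facebook’s code; the classes and helper names are assumptions made purely for illustration.

```python
# Illustrative sketch of an FBAR-style remediation pass. All names and classes
# here are hypothetical; the point is the shape of the loop, not the real API.

import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("remediation")


@dataclass
class Host:
    name: str
    healthy: bool = True
    in_service: bool = True


@dataclass
class Fleet:
    hosts: list = field(default_factory=list)
    spares: list = field(default_factory=list)

    def failed_hosts(self):
        return [h for h in self.hosts if h.in_service and not h.healthy]


def remediate(fleet: Fleet, host: Host) -> None:
    """Drain the failing host, promote a spare, and flag the broken box for repair."""
    host.in_service = False                      # stop sending it traffic
    if not fleet.spares:
        raise RuntimeError("no spares left")     # automation gives up here
    spare = fleet.spares.pop()
    spare.in_service = True
    fleet.hosts.append(spare)
    log.info("replaced %s with %s; %s sent to repair", host.name, spare.name, host.name)


def remediation_pass(fleet: Fleet) -> None:
    """One pass of the loop; a real system would run this continuously."""
    for host in fleet.failed_hosts():
        try:
            remediate(fleet, host)
        except Exception:
            log.exception("automation could not fix %s; escalating to on-call", host.name)


if __name__ == "__main__":
    fleet = Fleet(
        hosts=[Host("web001"), Host("web002", healthy=False)],
        spares=[Host("spare001", in_service=False)],
    )
    remediation_pass(fleet)   # web002 is drained and spare001 takes its place
```

The payoff of this shape is exactly what Pedro describes: repetitive single-server failures are handled without a page, and humans only see the cases the loop can’t resolve.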

After a few years of building up PE teams and hiring more SROs, the complexity of our infrastructure grew so much that our centralized SRO team was no longer able to comprehend how every service worked. Responding to outages became difficult, and SROs became worried that instead of fixing an issue, they might cause a new outage. We realized we needed to dissolve the SRO team and hand complete 24/7 operational ownership to the PE and SWE teams. SRO spent the next 18 months working with PEs and SWEs to write code that ran on FBAR to handle many of the server issues that had required an SRO response. Where we had PEs paired with SWEs, they created a shared on-call rotation. Where SWEs didn’t work with PEs, they took on-call ownership. At the end of these 18 months, the centralized SRO team was dissolved, and its members moved over to production engineering. It took about four years to complete the transition from SRE and AppOps to all production engineering.

David: Let’s talk a little bit about structures. Some SRE models have this notion that software teams get an SRE after a certain amount of work is completed by the SWEs. SREs are sourced from a completely separate group that is not part of the product group or the service group. They live in their own org. How do you think of the organization around PE? How do production engineers relate to the larger org tree?

Pedro: Our model borrows from the separate organizational structure of some companies and the embedded nature of some SRE and operations teams, where they report into the product group or the business unit. We have a centralized reporting structure and a decentralized seating structure.

The operations function needs to ensure that operations is running smoothly across all of the engineering functions. Sometimes, this may be hindered by competing priorities among the people who report to the highest C-level executive (for example, the CEO). In many companies, the highest-level operational executive (Head of Ops) usually wants to report to the CEO, and I believe this causes more problems than it solves. I believe the operational executive should instead report to the most senior software engineering executive (Head of Eng). This effectively makes the Head of Ops a peer to other software engineering groups instead of a peer to the Head of Eng. This also makes the Head of Eng responsible for the success of operations. When performance assessments are taken seriously, if the operations team isn’t succeeding, then the engineering executive isn’t succeeding either. This, in my opinion, fixes a lot of the dynamics of “Dev versus Ops” because Ops is now part of Dev.

I recognize that a lot of larger companies out there have splintered their operations teams, with each reporting into the head of a business unit. I’m sure in some cases this works, but it’s very hard to make it work well. What I’ve seen happen more often is that the head of a business unit forces the operational team to do their bidding, and the escalation path is very muddy when it comes time to deal with a disagreement.

So, to prevent some of these dynamics, we kept a centralized reporting structure for these main reasons:

Flexibility

Our production engineers can work on hardware design, UI, backend software, and everything in between. For example, a production engineering team designed and built out a Faraday cage for wireless mobile phone testing. We developed a specialized rack with mobile devices where engineers can run their software on various types of mobile hardware and do on-device testing. The centralized reporting structure gives us the flexibility to set our own goals and decide what kind of work we “should do” versus being told what work needs to be done by the lead of a specific business unit. The headcount for people doing PE work is managed separately from the broader engineering headcount, so PE leadership can ensure this work won’t get reprioritized by a business lead who may have different priorities.

Motivation

PE managers are able to motivate their teams to get work done based on the problem they’re trying to solve, as opposed to focusing on shipping a product or service. There’s a qualitative and a quantitative aspect to evaluating production engineers, and certain things that motivate individuals in production engineering are not necessarily what motivates software engineers. The PE management team is able to guide people toward the kind of context switching that operations-minded folks are really good at when it’s needed, and toward solving software engineering problems when it’s time to focus on them.

I’ve found that there are those, like me, who run toward a problem instead of away from it. I’m pretty sure one of the reasons I chose not to go the pure software route was that I enjoyed the context switching. I have found that even though the folks who like PE-type work may complain about context switching, they actually enjoy it. There’s also a certain adrenaline-like rush they get when they finally figure out that one little detail in the system that unblocked the problem, restored the service, and allowed everyone to breathe easily and get back to work. I’ve always found that the best way to hire and manage a PE is to figure out what excites them about this kind of work and hire managers who can understand this aspect of their personality. Having performed the same (or similar) type of role in the past more often than not increases the likelihood that these managers will be able to successfully motivate them. I’ve found, conversely, that many of the managers who really just want to think about algorithms and write pure software are challenged by how to manage and motivate PEs.

Shared accountability with slight tension

We hire external software engineers that may come from a traditional Dev-to-QA-to-Ops model and believe operations should do the things software engineers don’t want to do. A separate organizational structure creates a buffer for the production engineering leadership team between operational stability and features. The SWE and PE teams need to come together to build a stable, reliable, secure, efficient, and feature-rich service. If a software engineering manager and a production engineering manager have a disagreement, we need to make sure that they are working together to solve the problem. The software engineering manager can’t say to the PE team, “You should do all my operations work for me.” The production engineering manager also can’t say to the SWE team, “You should only work on stability and reliability.” There needs to be a balance between these two competing priorities. Our structure provides that healthy tension and shared accountability for doing what’s needed, not what might be defined by a single manager’s responsibility.

At Facebook, we have a saying: “code wins arguments.” We’ve applied a similar model to keep teams accountable when there’s disagreement between operational load and new features. When this happens, we bring the managers and tech leads of both SWE and PE together to discuss their views with senior leaders in their respective organizations. These managers and tech leads provide operational metrics that help us all understand what is not working in a system/service and what the team is going to prioritize to improve these metrics. The discussion also needs to touch on what features may be delayed based on this prioritization. This way, all the leaders understand the potential impact to the business and can make an informed decision. Ultimately, I and my counterparts on the SWE teams are held accountable by our manager to build systems that are feature-rich, stable, and operable. Since the org tree for all of us meets at the same level, if we don’t do our jobs, our performance suffers. This, in my opinion, gives us the best of both worlds.

The decentralized, embedded model gives us greater abilities to influence how services are built. PEs sit right next to their SWE peers. They go to their meetings and offsites; they’re available for hallway conversations and ad hoc discussions about architecture. Both the SWE and PE managers work on what the road map should look like for a service, what will make the service more reliable, and also what features are needed to enable growth. They compromise between features and stability.

With this embedded structure, the software team also gets the constant vigilance of the production engineers as well as the interaction of pointing out problems that are actively happening and need to be solved. When problems surface in production, both SWEs and PEs huddle together, shoulder to shoulder, solving the problem. When the software engineer is on call, the production engineer is sitting there with them, and the software engineer can also just lean over and ask, “Hey, I don’t know how to do this operation in production right now. Can you jump in with me, help me, and teach me how to do that so I can be more effective in the future?”

I’m sure that this isn’t the only way that works and there are likely other organizational models that can perform in the same way and don’t need to have the same reporting structure with colocated people. This is the way that we’ve found works the best for us at Facebook.

David: You mentioned on-call earlier in the PE organization’s origin story, but I thought you said that the software engineers were on call, and the PEs were not on call. Is that always the case?

Pedro: No, not always. A phrase I often use is: “if you write code and release it to production, congratulations, you own it.” This meant we needed to get SWEs to take responsibility for keeping their services up in production and also to take primary on-call. PE is not primary on-call for services we don’t own or build outright. We have a shared on-call model when we’re embedded with an SWE team, and how that on-call rotation manifests itself is situational. In most cases, it’s a weekly rotation. There’s a software engineer on call for one week, then a PE on call the next week, and so forth. In some cases, due to cognitive load or because the infrastructure is currently harder to manage, some teams do shorter on-call rotations of a few days. We have only two scenarios where solely software engineers or production engineers are on call:

  • A software engineering team that doesn’t have embedded production engineers. In this scenario, they have no choice because they have to be on call for their service.

  • Places in our environment where production engineers build everything end to end.

David: Like infrastructure, the DNS team, that sort of thing?

Pedro: Yes, exactly—PE actually owns a couple of pieces of infrastructure, like the software used to provision servers or manage server replacements in production (FBAR). FBAR is an automated system that handles repetitive issues so that engineers (SWE and PE) can focus on solving and preventing larger, more complex site disruptions. We also built a system that automates service migrations during maintenance and another that turns up new clusters from scratch. We own and built the L4 load balancer that sits in between our L7 load balancer and our web servers. For the Faraday cage-like rack I mentioned earlier, we built Augmented Traffic Control [ATC], which allows developers to test their mobile application across varying network conditions, easily emulating high-speed, mobile-carrier, and even severely impaired networks.

In these last few examples (and there are others), we are on call 100% of the time, because we’re the ones who built these systems. It follows the same model I described earlier where the team that built and deploys the software into production owns it, and the team that built it has the accountability to fix it when it breaks.
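To give a flavor of what “emulating an impaired network” involves: shaping of this kind typically leans on standard Linux traffic control (tc with the netem qdisc). The sketch below is not ATC’s actual implementation; the profiles, values, and wrapper are illustrative assumptions.

```python
# Rough sketch of network impairment using standard Linux tc/netem.
# Requires root on the gateway or test host; profiles are made-up examples.

import subprocess

# Hypothetical link profiles: (delay, jitter, packet loss)
PROFILES = {
    "3g":       ("150ms", "50ms", "1%"),
    "edge":     ("400ms", "100ms", "3%"),
    "impaired": ("800ms", "200ms", "10%"),
}


def apply_profile(interface: str, name: str) -> None:
    """Attach a netem qdisc that adds latency, jitter, and loss to the interface."""
    delay, jitter, loss = PROFILES[name]
    subprocess.run(
        ["tc", "qdisc", "replace", "dev", interface, "root", "netem",
         "delay", delay, jitter, "loss", loss],
        check=True,
    )


def clear_profile(interface: str) -> None:
    """Remove the shaping and restore the default qdisc."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)


# Example: apply_profile("eth0", "edge") on the test gateway, exercise the app
# against that link, then clear_profile("eth0") when the test is done.
```

The value of a tool like ATC is wrapping this kind of shaping in something developers can self-serve per device, rather than hand-running tc on a shared gateway.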

David: Given this structure, how do you manage the relationships between the PE org and other teams?

Pedro: Company-wide, we send out what we call Pulse Surveys every six months. One of the questions asked relates to how well a team collaborates with their partner teams. Facebook has built a lot of common software that is used by other teams and is critical to its operation. So teams need to collaborate well together, and this includes PEs embedded in SWE teams. The survey outputs a general favorability score that tells us whether the PE team thinks it collaborates well with other teams. Often there have been signs along the way, and the survey now gives us specific data we can use to narrow down the problem. We start by asking a bunch of questions. For example: “Is the relationship working well? Does the PE team feel like they have a voice? Are the PEs listening to the SWE team’s needs to lessen operational load and building tools to solve these problems? Do the PEs feel like they’re being treated as equals? Do the SWEs understand what work PE should be doing, and vice versa?”

If we find that the relationship isn’t healthy, we talk to the PE and SWE managers and tech leads to gather more feedback. If the feedback points to confusion around how to work with PE, or vice versa, we educate everyone on what a successful PE and SWE team engagement looks like and what a successful, healthy partnership feels like. We also discuss what shared ownership means, how SWEs need to care about the stability of their system, and why having PEs do more and more firefighting isn’t sustainable. We make sure the PEs aren’t being obstructionists and preventing the SWE team from innovating, if, for example, they are constantly saying “No” to changes. People and relationships are challenging and sometimes value systems are just not aligned, but we can’t ignore these relationships and let them become toxic.

Ultimately, if we can’t come to an agreement on how to work together, we need to make a change. If the relationship problems stem from the PEs, we work on removing the individuals causing these problems, talk to them and their manager about our collaboration expectations, and begin the work to rebuild the team if necessary. If the relationship problems stem from SWEs not wanting to own their services in the way we’ve described or constantly dismissing the work needed to make a system more stable, we will also talk to their manager and tech leads. We’ll revisit the situation sometime in the future after we’ve given everyone time to work things out. If the SWE team’s expectations of PE continue to focus on handing off operational work, then we will gladly redeploy PEs into other SWE teams. There are plenty of places where the PE skill set and discipline will be valued, treated equally, and able to effect change.

Removing an embedded PE team is a lever we pull as a last resort and after we’ve exhausted all our methods to build a strong relationship. I have only done it a small number of times because we can definitely lose credibility with the SWE team and it will make it much harder to build trust in the future. That being said, I would rather not burn out PEs trying to force something. In the few cases where this has happened, the SWE teams have come back some months later asking to try to work together again. The reality is that some software teams don’t have people who have an operational mindset and can quickly get overwhelmed with work due to a skills gap or the mounting operational debt. Sometimes, learning a lesson the hard way is the best way for everyone to start fresh.

David: Let’s continue down this path of organizational structure for one more moment. Does every project get a PE? What do you do when you show up?

Pedro: You noted earlier that in some companies, SREs don’t arrive until a service hits some level of maturity or operational stability. At Facebook, our M.O. has been to ruthlessly prioritize. Not every SWE team gets to work with an embedded production engineering team; it’s situational. It has to do with the service itself, its stage of development, and the maturity of the service and the team. We would ideally like to enter in the nascent phase of a software team, because they’re building something new and might not know exactly what it’s going to become. By embedding ourselves early on, we could get some of the operational work accomplished early and quickly. Sometimes that happens, and sometimes it doesn’t.

Some software engineering teams end up building services in the beginning without production engineers. This is typically a new service that doesn’t quite have a well-defined use case. It could be someone’s innovative idea to solve a problem, but it will take a while to develop. Once that service is established, and the team realizes they’re hitting the stage where scaling is critical, they come to us looking for help. We keep a running list of these services and teams, and as we hire more production engineers, we use that list to prioritize what is critical and what needs the most help.

In some cases, like when we built out live video and our generalized videos infrastructure, we knew this was going to be a legitimate use case and so we were involved from the beginning. Unfortunately, we had to pull valuable members from other teams to stand that team up, and there’s always a conversation about the trade-offs we’re making.

When we engage in the scaling stage of a service, we need to figure out what work the team should tackle first, and that’s sometimes hard to pin down. The service may be suffering from any combination of reliability problems, capacity constraints, deployment issues, or monitoring gaps. Andrew Ryan, a production engineer on our team who often helps us with organizational design, came up with a “Service Pyramid” that is loosely based on Maslow’s Hierarchy of Needs. I presented this hierarchy of needs in an SRECon talk about production engineering. I later found out that Mikey Dickerson had presented a similar service reliability hierarchy of needs at an O’Reilly conference. It was nice to see that this concept for how to approach work was shared across a few other teams.

We use this service hierarchy to prioritize the type of work production engineers do when they’re first engaged with a team. The bottom layer of the pyramid is focused on ensuring that the service is integrated well with standard Facebook tooling to deal with the life cycle of a server (provisioning, monitoring, replacement, migration, decommissioning) and service deployments (new deployments, integration tests, canarying, etc.). Once you have the basic needs of your server and service met, you can move up the layers of the pyramid to working on higher-level components like performance tuning and efficiency, disaster recovery, anomaly detection, and failure modeling.

Only then can you efficiently work at the top of the pyramid on “weird stuff.” These are things that may not happen at smaller scale but happen in our environment due to the influx of data and the amount of work done on backend systems. Every PE and SWE has an obligation to investigate these weird issues, but if the basic needs of the service aren’t met, they might find themselves chasing issues that should have already been dealt with automatically instead of real scalability problems.
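As one concrete example of the deployment basics at the bottom of the pyramid, canarying boils down to comparing a small slice of hosts running the new code against the rest of the tier before pushing wider. The sketch below is a hypothetical illustration of that comparison; the thresholds and metric source are assumptions, not Facebook’s deployment tooling.

```python
# Minimal, illustrative canary check: promote a release only if the canary's
# error rate isn't meaningfully worse than the baseline's. Thresholds are
# made-up defaults, not real policy.

from statistics import mean


def canary_is_healthy(canary_errors, baseline_errors,
                      max_ratio=1.5, min_samples=50):
    """Return True if the canary error rate is within max_ratio of the baseline."""
    if len(canary_errors) < min_samples:
        return False  # not enough data to judge; don't promote yet
    canary_rate = mean(canary_errors)
    baseline_rate = mean(baseline_errors)
    if baseline_rate == 0:
        return canary_rate == 0
    return canary_rate / baseline_rate <= max_ratio


# Example: per-minute error rates sampled from the canary vs. the rest of the tier.
if canary_is_healthy([0.01] * 60, [0.012] * 60):
    print("promote the release to the next deployment stage")
else:
    print("halt the push and roll the canary back")
```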

David: What do you mean by stages of development?

Pedro: I see teams and services go through three phases in this order:

Bootstrap phase

Do anything and everything that you need to get the service up and running. That might mean a lot of firefighting and manual intervention to fix things. It might mean quick, iterative deployments that fail fast. It might mean just allowing a trickle of traffic at first. You build up these operational muscles and figure out what the failure modes are and how they affect other systems.

Scale phase

Once you’re out of the bootstrap phase, you start to move into the scale phase. This might include deploying the service into multiple regions and getting the service used by millions or billions of people, depending on the type of service. The team gets much more mature at operating the service and its feature set, understanding the dependencies on other systems, and making the architectural changes that may need to occur over time.

Awesomize phase

Now the service needs to become really, really awesome. Do the last 10 to 20% of work needed to optimize the service to be more efficient and more performant. I call it “awesomize” because when I try to ask people to optimize something, nobody really wants to do that. But everybody wants to make something awesome, so I call it the awesomize phase.

The people required for each of these phases might be different. There is a certain set of people on both the software and production engineering teams who really love to do the bootstrapping work. There are also those who really love the middle of this continuum. The bootstrapping is done, and they want to scale the service: make it better, more resilient, and bigger; take on more users; and deal with the consistency and concurrency problems and the big disaster recovery problems that will arrive with a higher level of maturity. There are even others that want to make something performant, efficient, and rock-solid. Some people will evolve over time and grow with each of these phases, but my experience has been that most do not stay with the service through these three phases. They’ll find their sweet spot, and they’ll move around in the organization to find the work that plays to their strengths. Ultimately, we want everybody to do the awesomization work, but the reality is that not everybody does and that’s OK.

David: Given these phases and people’s inclinations toward them, how do you create teams?

Pedro: I’ve seen a tendency to try to stretch people to do everything and become a jack-of-all-trades that can work up and down the stack, from lower-level hardware issues, to middle-layer protocol problems, to UI programming. This model is useful and needed in many startup environments but does not work well at scale. There is no way that one human can do that type of work and make themselves sustainable over time and not burn out, so we focus more on matching individuals to technologies. For example, on the cache team, knowledge about network protocols and debugging is needed, but understanding how the overall system works is the ultimate goal.

When we are starting up a new team, we look for four factors and ask these questions:

  • Is there enough work for at least three people for the next 18 to 24 months? I came up with three because that number, to me, really defines a team. If there are only two people and one is sick or wants to take vacation, the other person has to take the entire workload. When a third person is added, then at least people can pair up on projects, define shared responsibilities, etc. It’s simple team dynamics.

  • Does the service fit our prioritization model? We need to understand how the service is solving a business need and that it will be something that’s used and not just a prototype that may never see production scale. Is it the right time to prioritize this team over another? This one is tricky because it’s much more subjective.

  • Do we have a manager who can work with the production engineers and build out a larger team? The manager is a critical component, making sure that engineers are focused and getting things done. It’s important that everyone is developing and growing over time. We need to ensure the team is getting the right level of context for their work and that they’re learning from other people.

  • Is there a local SWE team to work with on this service? This is primarily in the case where we’re not the ones building the software. We need to make sure that there is an SWE team that can engage in shoulder-to-shoulder debugging and in-person discussions around architecture and problems.

These four ingredients have to come together. That filters out a bunch of nascent projects that may take up valuable people. Although it would be nice to have production engineers on every type of team, it doesn’t make sense based on our prioritization model.

In order to establish a new location with production engineering teams, the four ingredients above need to exist, and we add another constraint: the new location has to be able to sustain three different teams of at least three people each, working with SWEs locally, for 18 to 24 months. This means we establish PE teams later in a site’s maturity.

David: It does, but what about the small stuff? The things that have to be done but aren’t going to take that long? Is there a catch-all team?

Pedro: No, there’s no catch-all team that does that kind of work. In general, it is the software engineering team’s responsibility to manage their technical and operational debt. They do this for as long as they can, but eventually if their service needs to be prioritized and it makes sense to build a team, then we do. Oftentimes, PEs that have an affinity for certain work might see something on a team without PEs and spend a couple of weeks on it to make it better, and then come back to their original team. We think this is valuable overall and where we can, we encourage it because it can help the SWE team gain some quick operational efficiency and knowledge.

In infrastructure, we generally focus on making operational things like bootstrapping go away. We have built a lot of services that give engineers “more for free.” They can use our containerization service for deployments. They get general server health monitoring for free. We have a centralized monitoring system with built-in graphing, anomaly detection, and alerting. The service gets basic remediation for free through things like FBAR (originally built by SRO, then significantly augmented by PE). All the basics are handled for you so you can focus more on the higher-level software problems. This allows our software engineers to do rapid prototyping and figure out whether there’s something worth building, instead of having to focus on the small stuff first. This kind of “more for free” stuff gets you through some of the bootstrapping phase I discussed earlier without needing too much initial help because it’s all self-service.
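To give a sense of what that “for free” monitoring can do, here is a toy anomaly detector of the simplest kind: flag points that deviate sharply from a trailing baseline. The window size and threshold are illustrative assumptions; the actual centralized system is far more sophisticated than this sketch.

```python
# Toy sketch of the kind of anomaly detection a centralized monitoring system
# can give every service "for free": flag points that deviate sharply from the
# recent baseline. Window and threshold values are illustrative assumptions.

from collections import deque
from statistics import mean, stdev


def detect_anomalies(series, window=60, threshold=4.0):
    """Yield (index, value) for points far outside the trailing window."""
    history = deque(maxlen=window)
    for i, value in enumerate(series):
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma == 0:
                anomalous = value != mu          # flat baseline: any change stands out
            else:
                anomalous = abs(value - mu) / sigma > threshold
            if anomalous:
                yield i, value
        history.append(value)


# Example: a flat request-rate series with one spike gets flagged.
qps = [1000.0] * 120
qps[90] = 4000.0
print(list(detect_anomalies(qps)))   # -> [(90, 4000.0)]
```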

David: We’ve talked a little bit about how a PE gets involved with a team and the product or service. How does a PE leave the team?

Pedro: Mobility is actually a core tenet of ours.

We like to hire generalists. In addition to the core practical and algorithmic programming skills, we also look for other traits. We expect PEs we hire or train to understand network protocols and how to debug them. They need to have lower-level systems knowledge and understand how software interacts with the kernel, the hardware, and the network layer. If they are further in their careers, they need to understand how to build distributed systems. These are the general skills we look for when hiring PEs. When they join a team, they may not be experts in every one of these dimensions, but, over time, they’ll gain this knowledge and experience and they’ll become even stronger engineers. This knowledge allows them to move around more easily within PE.

They will also learn how to use Facebook-built tools and services. Many of these mimic services outside of Facebook, like a containerization service. Whether the team is Cache, Messaging, Ads, Newsfeed, or anything else, if the service uses our in-house containerization system, it’s still the same containerization service. The inner workings of the system they’re working with and the problems that come up—concurrency, consistency, disaster recovery, for example—will vary depending on the service. That is what a PE needs to learn when they land on a team, but the general skills of managing systems in our environment and using the tools we’ve built appropriately are portable. PEs can take all of this knowledge and move to any team at Facebook as long as they are colocated with the SWE team building that service. It’s a lot easier to collaborate that way than having the operations team and software engineering team located in different areas or even different time zones.

So, to answer the original question, we influence our managers to ask at 18 to 24 months into the PE’s time on the team whether they have thought about moving to another team. Generally, the answer to that question is, “No. I love the job that I’m doing. The service that I’m building still has to mature. I like my team and the work. Go away and talk to me later.” This is fine and it introduces the concept in their mind and lets them know that it’s OK to consider moving at some point and that we value mobility.

We approach the question again at 24 to 36 months and we start looking for things that would complement their current knowledge. For example, if Jane, a PE on the storage team, has been there for a long time, we might ask her if she’s ever thought of joining an in-memory cache team. The conversation goes something like this: “Hey, Jane, you’ve been on the storage team for a while, and I’d like to make sure you become a more well-rounded engineer. Have you considered moving to the cache team? They need a senior engineer like you and you have a lot of experience scaling systems rapidly. Sure, it’s cache and not storage, but you should go talk to Joan about this and see what’s going on.” Generally, her answer is, “That sounds interesting. Let me go talk to her.” Or Jane will come back and say, “You know, I’ve got three or four more months of work that I want to do. Let me finish this project, and then I’ll consider a transfer to Joan’s team or do a hack-a-month.”

Hack-a-months are something that spun out of hack-a-thons when we realized that in all of our engineering teams, we needed a better way to give people the chance to learn something new. A hack-a-month serves two main purposes. One: encourage any engineer who has been on the same project for more than a year to leave their team for a month to work on something completely different. Many use these for a little break from their normal routine. Two: to find a new team and figure out if they want to move. In either case, the team has to be able to handle the person being out, so the manager needs to ensure their staffing is at a good level or needs to work on finding someone else to take their place.

To best evaluate folks in these hack-a-months, the two managers need to be in sync on performance during that period of time. In the case of someone learning something new, there’s usually a well-defined project that has an end state that can be objectively measured. In the case of moving to a new team, we give people the room to ramp up into the new space and we take that into consideration in their performance evaluation.

After 36 months on the team, we more directly talk to the engineer about moving to another team. We do this because I believe that when people get stuck in a rut, they can slow down the progress of their team. When a new PE (or SWE) joins the team and proposes a new idea, they may be shot down by the engineer who is comfortable with the way the system works. The established engineer might reject this new idea because it changes their mental model and it changes their comfort with the system. This might stifle innovation. We’ve actually experienced this, and so we’re much more prescriptive about moving engineers to new teams when they have been in the same team for three or more years. As we expand into new geographies, engineers that are hitting this three-year mark should have enough mobility to move into other teams.

We have quite a few senior engineers who have done this over and over again. And they have their own thoughts behind this, such as how managers influence them, how they influence themselves, how they talk to each other, whether it is easy to move, and how they deal with impostor syndrome if it sets in again. We encourage these engineers to share their stories candidly because it gives others insight into this process from a nonmanager’s perspective. If an engineer experiences impostor syndrome but knows their manager has their back, it makes it easier. If the organization built around them provides this mechanism to try something new and not worry too much about performance, then mobility becomes a more fluid process.

David: What are the things required to be a successful production engineer?

Pedro: I’ll try to list a few of the key traits that come to mind:

A focus on getting stuff done

Production engineers need to have a bias for action. When we’re looking at a problem, building a system, or creating a team, there has to be a tangible problem we can work on and fix for the long term. There’s definitely a reactive portion to the role and it’s needed because systems break all the time. PEs need to be building sustainable solutions to these problems so there needs to be a focus on proactively addressing things. If we’re stuck in one mode of operation, turning the crank on the same thing over and over again, we’re not succeeding in the role.

Supportive communication and influence skills

You can’t be a jerk. Jerks generally do not interact well with people. Those of us who have been in the industry for a long time remember the image of someone in operations being associated with the BOFH [Bastard Operator from Hell]. Dealing with outages and fixing things on a regular basis can wear people down and potentially turn them into angry curmudgeons, so we need to ensure we hire folks who understand when to be direct and still be civil in the way they talk to people. It really irks me when someone wants to show off their knowledge and shoot somebody down who might not have as much experience as they do. Production engineers need to be good at communicating to be successful in the role. The “no asshole” rule applies here, and when you find these folks, you should coach them to communicate differently, and if they don’t, you should aggressively manage them out because their toxicity can easily permeate the team.

Technical knowledge and skills

Production engineers not only need to speak the same language as software engineers, they also need to be able to speak the language of others. For example: network engineers, capacity engineers, data center engineers, and project managers. This means that they need to have knowledge about these different disciplines and be able to jump into any problem in the stack, whether it be a hardware problem or a UI issue. For example, in order to find a problem in the rendering layer, PEs need to be able to look at the code and understand how it renders data. Production engineers don’t necessarily need to be experts at all of these disciplines, but they can’t shy away from them. This is why our interview looks for a variety of technical skills and we also train engineers over time to gain more knowledge. Finding the unicorn that knows everything isn’t a goal, so we’re careful about expecting deep understanding of everything.

Flexibility

PEs play different roles at different times and need to understand when certain skills will be used. For example, on some teams, PEs need to be the communicator or the liaison. In other instances, they might need to be the problem solver, the debugger, or the fire fighter. Sometimes, they may need to code a critical component of a service. The role of a production engineer isn’t neatly defined in a box, and this is by design. We specifically chose to have a broad definition because we engage at different stages in the service life cycle that I described earlier. The team’s composition is a key factor in ensuring success in the role and we need to draw on each other’s strengths in different areas to solve the problem. We really stress not falling into the “not my job” mentality, and PEs need to be open to doing things that aren’t well defined in a job description.

Collaboration and compromise

Our model is that we work with the software teams, not for the software teams, and vice versa. On the one hand, we want SWEs to care about reliability, stability, and operations. On the other, we as PEs need to ensure we’re not always using the stability hammer and need to care about features and delivering new services. We all need to work together to do what’s right for the business, the service, and the team. Sometimes, that means that we’ll need to compromise on operational issues, and sometimes that means SWE will need to compromise on features. One dimension shouldn’t be the one that wins all the time. If there isn’t compromise when we work together, then this could lead to unhealthy working relationships. The PE leadership often talks about this concept with PE teams because it’s easy to get stuck in one mode of operating without taking the time to self-reflect and make sure that we’re not actually the ones causing the rift between Dev and Ops.

Willingness to teach and not be a SPOF [Single Point of Failure]

None of the roles we play are sustainable in the long term. PEs need to be focused on building software, establishing processes, and evolving over time so that they are no longer needed. PEs need to be careful not to always be the one the team calls on to solve a specific type of problem or to own a specific domain on their own. PEs need to ensure they’re building tools that will allow them to replace themselves.

Production engineering kind of resembles a trade within the engineering discipline, and understanding how to operate systems isn’t something that’s taught in school or university. Since it’s something that you learn on the job and through experience, it means that we have to teach others how to do what we know how to do. I have found that in the beginning of their careers, few software engineers are constantly thinking about failure or the “what if?” scenarios and building systems that will be resilient to these failures. There is definitely a mindset that comes with being a successful PE, but I strongly believe this can be learned behavior. It’s our responsibility to infuse those around us with this mindset. The more practiced software engineers know how to do this well, but the reality is that we’re adding more and more new software engineers into our ranks and I believe PE helps them gain the knowledge of how to build more resilient software faster. This also means you need to make room for this kind of training and development and not just expect quantitative output from production engineers on the team.

David: Can we talk a bit about how you train new production engineers? Facebook is famous for its onboarding Bootcamp; do production engineers go through a Bootcamp? Is there a PE curriculum that is part of Bootcamp?

Pedro: How we gained a stronger position in Bootcamp is the part of the PE origin story that I didn’t cover earlier. When I got to Facebook, operations wasn’t allowed to join the existing software engineering Bootcamp. I had read about Bootcamp and seen the videos before joining and was excited. However, when I joined, I was told: “You’re in operations, why would you ever want to commit code?” I was pretty irate. I spent the first year or so of my career at Facebook working hard to change this perception and prove folks wrong. I came from a computer science background, and I had written code my whole career. It felt unfair to be penalized for choosing operations, a choice I made because I like to solve broad operational problems through software.

The first few weeks of an engineer’s time at Facebook are where they get taught the fundamentals of operating as an engineer: how to commit code, learn about Facebook’s code quality standards, work with our tools, deploy stuff into production, learn how to monitor it and add instrumentation, and more. This seemed like very relevant material that we also needed to understand, but since we weren’t allowed to attend Bootcamp, we initially had to build our own version. In parallel, some of us spent more time with the Bootcamp leaders, and through many conversations and by hiring more people with stronger software backgrounds, we were eventually able to gain entry.

Bootcamp has now been part of the onboarding process of all production engineers for a long time. We have influenced a lot of the classes taught to both software and production engineers, so that everyone gets some fundamental operational knowledge about systems before they land on a team.

We follow the same model for team selection that software engineering does here at Facebook. PEs spend three or four weeks getting fundamental technical knowledge in Bootcamp, and then spend two-plus weeks on team selection. We hold a career fair, for lack of a better term, where all the teams that are looking for people get together with all the Bootcampers who are looking for teams. We try to find a match based on the needs of the team and the skill set and desires of the individual. Some people have an affinity for security, for example, while others have an affinity for product-facing services or backend systems. For the most part this works fine, but the team selection process doesn’t work for everyone. Some people we hire want to know the team they’ll be working on before they join Facebook. So we spend a bit more time up front learning about them and then narrowing down the set of teams to make it a little easier.

Over time, we found that there’s still some type of work that doesn’t necessarily apply to every software engineer joining Facebook. So, we created another mini Bootcamp that we call PE Fundamentals. That curriculum is geared specifically toward production engineers, network engineers, and other operational-like teams. In Bootcamp, we’re trying to pack a lot of information into a short period of time. I think it’s overall very successful, but the content isn’t as meaningful until the engineer has landed on their team and has spent a little more time understanding the infrastructure. About four weeks after PEs exit Bootcamp and have been on their team, they come back for a more hands-on set of classes that explains the nuances of our tools. Now armed with the context of the system they’re working on, they can make better connections in their mind about how our tools apply to their work.

We also do more cultural onboarding in PE Fundamentals. For example, we cover how to apply some healthy tension without going overboard. When we’re talking to engineers about their solutions, we need to be careful about always being the naysayer because something might not be perfect. We can’t always say “No.” In these onboarding presentations, other PEs share their war stories so new folks can build examples in their mind for how to deal with potential disagreements on their team. PEs learn about being part of a culture that enables others to solve their problems as opposed to being a blocker to change.

David: Let’s go to the wider picture around production engineering as a model. We’ve talked about the uniqueness of the PE organization at Facebook. Do you think that a PE organization could be implemented outside Facebook and still feel like a PE organization to you?

Pedro: Yes, I think a similar PE organization can be created elsewhere. I believe that Facebook’s secret sauce is actually the people we hire and how they get things done. There are a lot of cultural factors that might need to be there, but I’d be wrong to assume that we’re the only company that has a culture of autonomy, independence, empowerment, and healthy debate, among others. I do worry that other companies may adopt the monikers without adopting the cultural implications that come with them. I’ve talked with a lot of teams outside some of the bigger companies that will build an SRE team, and in reality, it’s just a rebranded SysAdmin team with a completely separate Ops role and no expectation of automation, collaboration, and equality. I believe it’s because they want to be able to attract people externally, but the work doesn’t necessarily change. The engagement model with the software engineers doesn’t change. The ability to influence change isn’t different. The relationship at the leadership levels and the shared accountability don’t exist.

I strongly believe that software teams should ultimately be accountable for their operations, but I also recognize that PEs can help ensure this isn’t too hectic. I want to emphasize again that PEs should not be doing operations for SWEs; they should be doing it with them. If this construct of working with others instead of for others exists, then I think this model can work in other places.

David: So how do you know whether an org is a PE org in the same way you define it?

Pedro: I want to be clear that the way we’ve built and run our org isn’t the only way. Many companies are trying to figure out how to best run operations in their environment and it’s great if they seek other models to reference—we should all be learning more from each other. However, this model isn’t one-size-fits-all. When companies are trying to build their organizations, they should pick and choose the things that work for them and the things that can apply to their environment. If they choose to implement some of the concepts I’ve talked about, modify them, and make them their own, that’s great. If, however, they try to follow some set of strict rules, I can almost guarantee that it will fail because everyone is different, every company is different, and every infrastructure challenge is different.

With that being said, if I had to quickly summarize the ways to evaluate the PE model, here are the main things to consider:

  • Shared on-call with an embedded and collaborative model

  • Strong relationships with technical credibility

  • Balance between operations and features

  • Leadership accountable to delivering a feature-rich and stable system

Let’s break those down into a bit more detail.

When I want to gauge how close another team is to our PE model, the first question I ask is: “Who is on call? When things go down, who is responding to a service outage?”

If they respond that it’s SRE, production engineering, DevOps, or whatever, as opposed to software engineering (or a shared on-call rotation between SWE and the operational team), then I know they’re not building the PE model in the way we’ve defined it. In my opinion, the ultimate accountability for a service’s stability needs to fall on those who are building the service. If the primary builders are SWEs, then that’s who needs to be on call.

Another factor I look for in determining how closely someone’s implementation resembles our model is related to equality and perception. In my experience, there are software engineering teams that look down on operational work and, conversely, operational teams that don’t respect pure software work. Internally, I use a framework that highlights that Different != Bad.

Often, software engineers spend their mental time on algorithms and features instead of thinking about operational complexity. This isn’t inherently bad, but it can lead to perception issues from operations. PEs, on the other hand, spend mental time and energy on other things that aren’t purely software. Focusing on availability, scalability, operability, failure modes, security, deployments, reliability, and monitoring instead of software is also not inherently bad. The mental time spent is just on different things. It’s everyone’s responsibility to ensure there’s shared context with respect to decisions and actions being taken by everyone on the team. I look for this type of shared understanding in my evaluation of whether someone else is implementing a similar model.

My typical follow-up question is about prioritization of features over operational stability. “When faced with prioritizing between features and operational stability, how often does operational stability lose?” When a team wants to build a service and repeatedly chooses features over the stability work even though they understand that the operational debt is going to be high, then the team is not succeeding at this model.

This leads me to another set of questions related to when the operational team is engaged in discussions regarding the architecture of a system. If the software team views the PEs, SREs, etc. as equal contributors, then discussions related to architecture will happen with both groups of people in the room. If they aren’t seen as equal, and the operational team is consulted after the fact, then the implementation isn’t going so well.

I don’t think that any system or team will ever be perfect, but diversity of thought needs to exist in the team to build something that’s going to be feature-rich and resilient under pressure. This last factor has a lot more to do with the relationship than it does with the level of technical knowledge. If the relationship is adversarial or if there’s a large technical gap between the groups, the result will be a weaker system. If there’s trust between the groups, shared technical understanding, and constructive conversations, then a better, stronger system will be built.

When it comes to features versus stability, I think everybody should win a little and lose a little. The software team should sometimes deprioritize features to ensure stability. The PE team should sometimes deprioritize some gains in stability to ensure features are being built. Stability shouldn’t regress, but sometimes it might be OK to hold the line. The software teams need to continue to innovate, and PE’s role is to help enable this innovation by reducing the operational load and solving operational problems through software. PE also needs to simultaneously work with the software team to reduce operational complexity. If PE is 100% focused on stability and reliability, then the software teams may dismiss them and the critical work won’t happen. The work needs to be balanced over a longer time horizon. For example, some months may be heavily skewed toward features and the following months may be heavily skewed toward operational stability. As long as this balances out, the implementation is working fine.

The last component is related to accountability. If the system isn’t feature-rich and stable, are the leaders in software and operations held accountable together? When it comes to promotions and performance assessments, are these held to equal standards or are they different? At Facebook, for example, when we assess the performance of senior leaders in SWE and PE, we talk about them in the same set of discussions. Their impact and their ability to execute, to work cross-functionally, and to build healthy organizations are held to the same standard.

As I said earlier, I do think that everyone’s implementation might be slightly different and they should pick and choose the things that work well in their environment. I’ve talked to a few companies, and while they aren’t doing the exact same things we are, overall, I think they’re being successful in their implementation.

Contributor Bio

With over 20 years of experience in software design, architecture, and operating robust services at scale, Pedro Canahuati is the Vice President of Production Engineering and Security at Facebook. In this capacity, Canahuati is responsible for ensuring Facebook’s infrastructure is stable and that the data of its over two billion users is secure. Throughout Canahuati’s career, he has built and managed global engineering teams with a focus on operationally scaling companies to provide users with the best experience.
