Chapter 2. Understanding the SRE Role

Culture/Capabilities/Configuration

In Innovation Prowess (Wharton Digital Press), George S. Day, a business professor at Wharton, identified a framework for the underlying components in highly innovative companies. He classified them into the “three C’s”:

  • Culture. An organisation’s shared values and beliefs, defining appropriate and inappropriate behaviours. It is often summed up simply as “the way we do things around here.”

  • Capabilities. The combination of skills, technology, and knowledge that allows the firm to execute specific activities and innovation processes.

  • Configuration. The structure of the organisation, including how resources are allocated, who bears responsibility for achieving targets, and how success is measured.

Day explores the ways in which these components support each other to enable ongoing innovation practices that persist through both success and failure. While Day was focused on the aspects that lead to innovation, these same dynamics apply to reliability. Adapting his points:

It takes sustained leadership and the long-run commitment of finance and human resources to build [reliability] prowess. Success begets success: Prowess improves the more it is applied….

[The] elements are mutually reinforcing. They don’t simply add together; instead they are multiplicative, as a weakness in one afflicts the others….

Culture underlies and infuses everything an organization does….Culture and capabilities have a symbiotic relationship—one can’t function without the other. They also have to be closely aligned to get superior results.

Culture

As covered earlier, clear and unambiguous support for site reliability is an absolute necessity for a successful SRE implementation. Sufficiently senior executive support can be demonstrated through organizational structures, but it is a culture that recognizes and rewards reliability-enhancing work that provides the environment in which SRE can be viable. When things are going badly, the organization must demonstrate its commitment to reliability by reallocating engineering resources across the board to address the deficiencies. Other important cultural components include:

  • Fostering a learning mindset [1]

  • Always looking for continuous improvement [2]

  • Establishing psychological safety [3] to enable truth telling

Capabilities

For SREs, the ideal team member has a broad understanding of computer system dynamics—especially distributed systems. Effective SREs are able to zoom out to deal with system interrelationships and to zoom in, as needed, to debug the bit-level intricacies of networking or memory usage patterns.

They are adept at applying the leverage of coding and automation for scale. SREs need to have the ability to marshal data and present it to their partner teams in ways that are understandable within the context of the work priorities of the feature development teams.

The empathy and compassion discussed in Effective DevOps are also important skills for an SRE because of the highly collaborative nature of the role. To minimize the impact of outages, SREs should be able to function effectively under the pressure of failing sociotechnical systems, and to both identify and implement methods that improve future responses.

While many individuals may exhibit some or many of these capabilities, the supportive underlying culture reinforces a learning mindset and supports actively practicing the skills. Active skill practice develops organizational “muscle memory,” leading to greater capacity.

Configuration

Finally, in the configuration space, a strong SRE practice requires reporting structures that allow SREs to be evaluated and rewarded according to the distinct measures of performance that matter the most to them (not just how quickly features get shipped). Note that these reporting structures may be local distinctions, matrix-based, or a fully independent organization within engineering and still provide effective evaluation and recognition incentives. SRE success can be tracked across five areas that contribute to reliability:

  • Providing useful monitoring frameworks that empower system understanding

  • Characterizing, measuring, and improving availability and performance

  • Accurately forecasting capacity requirements and improving efficiencies without undue impact on feature deployment velocity

  • Improving velocity through reduction of toil and manual exception handling [4]

  • Effectively handling and learning from incidents

Distinguishing SRE from Other Operational Models

SRE is the latest in a historical progression of operational models, so let’s look at how it differs from previous approaches. [5] Just as earlier approaches were products of their time and context, so is SRE.

SRE Versus “Classical” SysAdmin

The system administrator (SysAdmin) role initially developed within the context of academic and research computing. In that context, SysAdmins built deep systems knowledge and, out of necessity, learned to figure things out on their own when something went wrong.

Many SREs come from prior experience as SysAdmins. The troubleshooting skills and systems knowledge that they obtained from that background are highly valuable contributions to their SRE teams, but the focus of an SRE is more narrowly scoped than that of a SysAdmin.

SREs mainly focus on the operational characteristics of the applications that they participate in designing and supporting. Deep-level systems knowledge may be called upon to achieve the goal of service reliability or in troubleshooting aberrant application behavior.

SRE Versus “Classical” Ops

As computing was adopted into bureaucratic enterprise contexts that retained elements of Taylorist management, SysAdmins morphed into “ops” people. Ops is charged with keeping things running (stable) but not with doing the important “engineering” work.

Classical ops is divided by a mostly impermeable wall from the feature development teams. Depending on the age and size of the organization, the barrier between the teams may be larger or smaller, but it tends to increase with the growth of specialization in the company’s engineering organization. Usually, this divide between the two groups is accompanied by some degree of dismissiveness toward the “other” group. When this dismissiveness is carried up the management chain, the numerically smaller group (usually ops) ends up as the loser.

The SRE model restores the importance of engineering work that was lost with the ops phase and relies on close engagement with the feature development teams. Reliability is enabled by involvement throughout the full life cycle of a service, from conception through full production and on to retirement. Effective engagement with the dev teams is fostered by ensuring that incentives are aligned. All parties need to be constantly aware of the performance and reliability of the services. Any measures that insulate developers from the full costs (especially noneconomic costs) of keeping the service running in production end up building defensive and counterproductive walls between teams.

SRE for Internal Services

While most people may initially think of a company’s products or services as mainly external-facing, SRE practices can be equally beneficial when applied to internal-facing platforms.

SRE for Backend or Platform Services

SRE is possibly even more relevant when everything becomes a platform than it is for standalone functionality. (See Dave Rensin’s 2017 SREcon talk for a discussion of why all online services are evolving toward being platforms that other services use for their own needs and service levels.) This “platformization” is more likely to be explicitly designed in the case of many internal services, and reliability practices such as defining SLIs and SLOs provide a common language for setting expectations between teams.
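To make that common language concrete, here is a minimal Python sketch of how a platform team and its internal consumers might write down an availability SLO and check a measured SLI against it. The names (`Slo`, `availability_sli`, `meets_slo`), the service name, the 99.9% target, and the request counts are all illustrative assumptions, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """An agreed target for one indicator of one service (hypothetical shape)."""
    service: str
    indicator: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully over the measurement window."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as meeting the objective
    return good_requests / total_requests


def meets_slo(sli: float, slo: Slo) -> bool:
    """True when the measured indicator is at or above the agreed target."""
    return sli >= slo.target


# Illustrative numbers only.
slo = Slo(service="internal-auth", indicator="availability", target=0.999)
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"{slo.service}: SLI={sli:.4%}, target={slo.target:.1%}, met={meets_slo(sli, slo)}")
```

The value of writing the objective down as data is that both teams can point at the same number when they discuss whether the service is behaving as expected.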

Without well-established service expectations and management of outages, a microservice architecture or a system built on a large number of outsourced subservices turns into a fragile house of cards. User-facing services can’t truly provide higher reliability than that provided by their critical backend dependencies, and those dependencies—particularly ones that may be several layers removed—may not be directly visible to or accessible by developers. Adequate feedback loops help to avoid cascading problems and the so-called “dark” debt within distributed systems.
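The ceiling that dependencies place on a user-facing service can be shown with simple arithmetic. The sketch below assumes that every dependency is critical (a request fails if any one of them fails) and that failures are independent. Both are simplifying assumptions, and the availability figures are made up for illustration.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Best-case availability of a request that needs every listed component.

    Assumes independent failures and no redundancy or fallback paths.
    """
    return prod(availabilities)


# Hypothetical figures: a frontend that is itself 99.99% available,
# calling three critical backends that are each 99.9% available.
frontend = 0.9999
backends = [0.999, 0.999, 0.999]

overall = serial_availability([frontend, *backends])
print(f"best-case user-facing availability: {overall:.4%}")

# The result can never exceed the weakest critical dependency.
assert overall <= min(backends)
```

Under these assumptions the best case is roughly 99.69%, already below any single backend's 99.9%, and each additional critical dependency lowers it further.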

SRE for Databases (DBRE)

In Database Reliability Engineering (O’Reilly), Laine Campbell and Charity Majors make the case that the collaborative principles of reliability for online services also apply to databases. In the case of a database, the prime directive is slightly adjusted from “site up” to “protect the data.” See Table 2-1 for a comparison of the typical database administrator (DBA) approach with that of database reliability engineering (DBRE).

Table 2-1. Comparing classic DBA with DBRE

Classic DBA | DBRE
Strict separation of duties | Collaborative work, sharing the data protection responsibilities
Expensive and specialized software and storage hardware | Open source software, commodity hardware with durability requirements guiding the selection of each
Extensive change control processes | Automated procedures with rollback and impact mitigation components

SRE for Security

Just as an online service is pointless if it is not indeed “online” and accessible for its users, it is possibly even worse (harmful) if the service is not properly secured to protect itself and its users from attacks. For example, at LinkedIn the goal of what SREs do is “site up and secure.”

There are a lot of people on the internet with nothing better to do than break into services or cause havoc, and as increasingly valuable information becomes accessible online, theft and destruction have become economically attractive activities. Because of this, the reliability of security measures is even more important than that of basic features. Authentication and authorization are fundamental tasks within the realm of security, but they have often been approached on a more or less manual basis. [6] Developing automation, metrics, and monitoring in order to support “continuous security” can fall into the domain of SREs.
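One small instance of such automation is watching for TLS certificates that are about to expire, so that renewal never depends on someone remembering to do it. The following is a minimal sketch using only Python's standard library. The function names and the 30-day threshold are assumptions chosen for illustration, and a real setup would also need to trigger the renewal itself.

```python
import socket
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (seconds since the epoch) and a certificate's
    notAfter timestamp, e.g. "Jan  5 09:34:43 2018 GMT"."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86_400


def needs_renewal(not_after: str, now: float, threshold_days: float = 30) -> bool:
    """True when the certificate expires within the (assumed) renewal threshold."""
    return days_until_expiry(not_after, now) < threshold_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field of the certificate a server presents.

    Uses the default verifying context, so an already-expired certificate
    raises ssl.SSLCertVerificationError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `needs_renewal(fetch_not_after(host), now)` for each host it is responsible for, with `now` taken from `time.time()`, and raise an alert or start a renewal whenever the result is true.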

SRE for Internal IT?

The typical corporate IT team is lampooned, underresourced, and overcommitted. Its plight is illustrated by Scott Adams’s character Mordac, the Preventer of Information Services.

The importance of information technology to the daily work of practically everyone in a company is huge: “When the email [or substitute any other ubiquitous internal service] stops, everyone goes home.”

Corporate IT has been populated by lots of specially built, one-off systems for unique purposes (the dreaded “snowflake” systems). Applying SRE principles can help to improve the delivery of value for internal business services, just as it helps with user-facing services. The SRE approach would bring such a rabble of differently configured systems into the regularity of automated systems configuration. This regularity can help offset the lack of investment through the efficiency gains of automation and repeatability.

1 See “Carol Dweck: A Summary of the Two Mindsets and the Power of Believing That You Can Improve” and her book Mindset: The New Psychology of Success (Ballantine Books). Also see Peter Senge’s The Fifth Discipline: The Art and Practice of The Learning Organization (Doubleday).

2 See Chapter 11 of Accelerate, by Nicole Forsgren, Gene Kim, and Jez Humble (IT Revolution Press).

3 See Project Aristotle and Chapter 27 of Seeking SRE by David Blank-Edelman (O’Reilly).

4 The antipattern version of this is referred to as “feeding the machine with the effort and toil of humans.” Working in that way is not only inhumane but does not effectively scale, because you simply can’t hire enough people to keep up with the demand—and it would be prohibitively expensive.

5 These characterizations are necessarily simplified in the interest of being succinct. There are many variations and a range of overlapping practices across all of the described roles, which can make it difficult to distinguish one from another.

6 As a recent example, during the US government shutdown that ran from December 2018 to January 2019, many government websites had TLS certificates that expired because no one was working. This caused a denial of service for anyone unwilling or unable to override the expired-certificate warnings in their browsers. SREs would automate certificate handling so that no manual work is needed to keep the certificates current.
