Chapter 13. Making the Case for DBRE

Throughout this book, we have attempted to show how the landscape of database engineering has shifted over the years. Within the context of that landscape, we have enumerated the operational and developmental disciplines that the database reliability engineer (DBRE) must be involved with and how to begin to do so. Finally, we attempted to lay out the current ecosystems of storage, replication, datastores, and architectures, or at least a reasonable subset, to broaden your minds and knowledge.

We feel there is a real need to emphasize reliability, not just in the job title of the DBRE but in everything they do, because the database is a place where risk and chaos simply have no place. A lot of what is now commonplace in our day-to-day work (virtualization, infrastructure as code, containers, serverless computing, and distributed systems) came about from taking risks in areas of computing where risk could be tolerated. Now that they are ubiquitous, it is up to the stewards of one of the organization’s most precious resources, the data, to find paths to bring databases into these paradigms.

A lot of this work is still aspirational. There is only so much risk that can be tolerated within any organization when data comes into play. Thus, how we introduce these concepts to the rest of the organization, or how we respond to others doing so, becomes an actual discipline and job function for us. It is not enough to have the vision and the intent; we must also find ways to introduce this vision so that it succeeds.

In this chapter, we show you how to shepherd a culture of database reliability into your organizations, now and into the future. You will get some ideas on ways to become involved in a wide range of functions within the organization while representing and speaking up for database reliability engineering.

A Culture of Database Reliability

What does a culture of database reliability look like and how can you promote it? There are many items that people think of when they think of reliability culture that are not specific to the database world:

  • Blameless post-mortems

  • Automating away repetitive work

  • Structured and rational decision making

This all makes sense, and everyone within an operations or site reliability engineer (SRE) organization should constantly be working toward this. But what should we, as DBREs or people who want to become DBREs, foster in our own environments? Here, we go over some approaches to become involved, to inject reliability culture, and to bring our expertise as DBREs to the rest of the organization.

Breaking Down Barriers

The DBA who stays isolated from the other teams that interact with the datastore will simply not be successful. To be effective, we absolutely need to be active team members and partners at a much higher layer of abstraction than the one in which we have traditionally functioned. There is an inherent challenge in this because the database role is not one that is generally highly populated. Throughout the book, we have stressed that the database role simply cannot scale to the number of developers and operations engineers who will be working with datastores.

There are areas in which the DBRE can prove to be a highly effective member of cross-functional teams and work. Any time the DBRE becomes involved in cross-functional work, there should be a goal of contributing subject matter expertise, eliminating constraints on DBRE resources, or learning more about other functions to improve our own ability to function within the organization.

The architectural process

It goes without saying that those with deep database expertise should be involved more at all layers of the architectural process, particularly the design phase. The DBRE can provide incredibly valuable data on choosing the right datastore, preferably one that has already been significantly tested and proven in production. As we discussed in Chapters 7 and 11, the DBRE’s job is to truly vet the datastores that might be brought into service in their organization.

In larger organizations that require self-service in order to build and deploy services, the DBRE has a particular power in determining which storage services are put into the self-service catalog. By working with the rest of the technical organization, the DBRE can help support the mandate of using these approved services that have been tested thoroughly for edge cases, scale, reliability, and data integrity. Sometimes, this can also come with the caveat that all organizations can build and deploy services outside of the catalog, or at different tiers in the catalog, but that they must accept different Service-Level Agreements (SLAs) in order to do so. For example:

Tier 1 Storage

Tested in core services in production. Use of self-service patterns means that Ops and DBRE can provide 15-minute Sev1 response SLAs for escalation, and will guarantee highest SLAs on availability, latency, throughput, and durability.

Tier 2 Storage

Tested in production on non-critical services. Use of self-service patterns means that Ops and DBRE can provide 30-minute Sev1 response SLAs for escalation, and will guarantee reduced Service Level Objectives (SLOs) on availability, latency, throughput, and durability.

Tier 3 Storage

Not tested in production. Escalation from Ops and DBRE is on best effort, and no SLOs can be guaranteed. Must be fully supported by software engineering (SWE) teams.
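The tiers above could be captured as data in a self-service catalog. What follows is only a minimal sketch: the field names, the `DATASTORE_TIERS` mapping, and the datastore names are hypothetical, and the response times are simply the ones from the example tiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StorageTier:
    name: str
    production_status: str
    sev1_response_minutes: Optional[int]  # None means best-effort escalation
    slo_level: str
    supported_by: str


# Values mirror the three example tiers; the structure itself is illustrative.
CATALOG = {
    1: StorageTier("Tier 1", "tested in core services", 15, "highest", "Ops and DBRE"),
    2: StorageTier("Tier 2", "tested in non-critical services", 30, "reduced", "Ops and DBRE"),
    3: StorageTier("Tier 3", "not tested in production", None, "none", "SWE teams"),
}

# Hypothetical assignment of datastores to tiers.
DATASTORE_TIERS = {"datastore_a": 1, "datastore_b": 2}


def tier_for(datastore: str) -> StorageTier:
    """Return the catalog tier; anything not in the catalog falls to Tier 3."""
    return CATALOG[DATASTORE_TIERS.get(datastore, 3)]


def describe_support(datastore: str) -> str:
    """Summarize what a team can expect when it picks this datastore."""
    tier = tier_for(datastore)
    if tier.sev1_response_minutes is None:
        response = "best-effort escalation"
    else:
        response = f"{tier.sev1_response_minutes}-minute Sev1 response"
    return f"{datastore}: {tier.name}, {response}, supported by {tier.supported_by}"
```

Defaulting unknown datastores to the lowest tier keeps the catalog permissive while making the support trade-off explicit to the team making the choice.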

If your organization doesn’t support such self-service platforms, the DBRE must work harder on ensuring that every team is aware of the methods the DBRE uses in evaluating datastores and architectures and the value that doing so provides. Although premature optimization is always a danger, one of the most important values provided by DBREs is helping to guarantee that architectural datastore decisions do not hamstring a service in the future as it hits scaling inflection points.

In many organizations, this begins as a mandatory checklist item that requires a DB review for any technical project before it goes into production. However, it is easy for this review to happen too late in the life of a project, beyond the point at which changes are practical. This is where the art of politics and ambassadorship comes into play. Finding time to evaluate datastores and publish best practices, trade-offs, and patterns for the most common or upcoming datastores in the organization is an important way of showing people the value of DBREs. It is also a great chance to work with SWEs and architects to build this data and publish it to gain more mindshare. With this coming out regularly, when a new datastore that you have not evaluated comes into an architectural conversation, people will be more inclined to sponsor an evaluation as part of their project and resources, with you as a matrixed mentor and advisor. The key is getting people to see value and not constraint in using the DBRE.

There are metrics that you can use to measure how well DBREs and the organization are doing in this regard. Here are a few examples:

  • How many architectural projects used DBREs or their approved templates? (post mortem)

  • What storage was used and deployed? (post mortem)

  • How many hours of DBRE work occurred during each phase or user story? (post mortem)

  • Availability, throughput, and latency metrics grouped by storage tiers and engines to provide proof of reliability.

Database development

Much like architecture, DBREs being involved in database development early in the development cycle can be a force multiplier in the success of a project. We discussed a lot of this opportunity and value in Chapter 8. One of the biggest barriers to this is SWEs forgetting to discuss their designs with DBREs. Other times, SWEs might feel they do not need such guidance. One of the biggest wins in this struggle is to assist the SWE teams in seeing the value in the work that they do with DBREs.

Embedding DBREs, whether full-time or just at certain times in a project’s life cycle, and pairing them with software engineers is a great opportunity to help those SWEs see the value of DBRE collaboration. Even if the DBRE is not particularly strong with coding, her input on data modeling, database access, and use of features can prove invaluable, while also building relationships between organizations. Pairing SWEs with DBREs who are performing reviews, implementations, or on-call to support data-driven applications can similarly foster relationships, empathy, and cross-team knowledge that will accelerate development.

We also discussed the importance of providing best practices and patterns for the functions that each SWE performs in integration with the data layer. For instance, you can implement checklists for models, queries, or feature usage that require the SWE to indicate whether he used patterns or not. These checklists can raise flags on stories or features that might need review prior to pushing into production.

There are metrics that you can use to measure how well DBREs and the organization are doing in this regard. Here are some examples:

  • Development pairing hours between SWEs and DBREs

  • User stories that used DBRE

  • Feature metrics, such as latency and durability, mapped to use of DBRE or not

  • On-call shifts with an SRE paired

Production migrations

Everyone wants more error-free migrations. There is no doubt about that. But often the rate of deployment rapidly outstrips the time DBREs have to support those deployments. Backlogs end up being bundled into large, fragile changesets that can introduce dramatic risk, or migrations are done without significant DBRE review. As we discussed in Chapter 8, an effective way to manage this is to build process and tools to enable SWEs to make better choices about what can be implemented through normal deployment mechanisms, what DBREs should implement, and what should be reviewed by DBREs when ambiguity is present.

The easiest first step is to incrementally create a library of heuristics that can indicate whether the changes are safe or dangerous. Even though the bottleneck is not removed immediately, creating a mandate to build this library together with DBREs will begin to build traction over time. Having a review board periodically go through the most recent changes, the resulting heuristics and guidance, and the success or failure of those changes can prove to be an effective watchdog on this process. You can do this as part of regular post-mortems of changes, both successful and unsuccessful ones.
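A heuristics library can start very small. The sketch below assumes changes arrive as single SQL statements and uses a handful of hypothetical regular-expression rules; the rule names, patterns, and verdict labels are illustrative starting points rather than a vetted rule set, and "safe" here means only that no rule matched.

```python
import re
from dataclasses import dataclass
from typing import List, Pattern, Tuple


@dataclass(frozen=True)
class Heuristic:
    name: str
    pattern: Pattern[str]
    verdict: str  # "dangerous" or "needs_review"


FLAGS = re.IGNORECASE | re.DOTALL

# Hypothetical starter rules; a real library would grow out of post-mortems.
HEURISTICS = [
    Heuristic("drop_or_truncate", re.compile(r"\b(DROP\s+TABLE|TRUNCATE)\b", FLAGS), "dangerous"),
    Heuristic("drop_column", re.compile(r"\bALTER\s+TABLE\b.*\bDROP\s+COLUMN\b", FLAGS), "dangerous"),
    # UPDATE or DELETE with no WHERE clause anywhere in the statement.
    Heuristic("unbounded_write", re.compile(r"^\s*(UPDATE|DELETE)\b(?!.*\bWHERE\b)", FLAGS), "dangerous"),
    Heuristic("alter_table", re.compile(r"^\s*ALTER\s+TABLE\b", FLAGS), "needs_review"),
]


def classify(statement: str) -> Tuple[str, List[str]]:
    """Return (verdict, matched rule names) for one SQL statement."""
    matched = [h for h in HEURISTICS if h.pattern.search(statement)]
    if any(h.verdict == "dangerous" for h in matched):
        verdict = "dangerous"
    elif matched:
        verdict = "needs_review"
    else:
        verdict = "safe"  # no rule matched; not a guarantee of safety
    return verdict, [h.name for h in matched]
```

Returning the matched rule names alongside the verdict gives the review board something concrete to audit when it revisits which heuristics helped and which misfired.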

Another way to continue to incrementally enable the SWE organization to be as autonomous as possible is to build a database of migration patterns that can be applied heuristically to upcoming changes. By doing this with SWEs over time as you pair together with changes, you can build a living document that the SWEs not only use, but also feel enabled to build upon themselves over time. Again, doing post-mortems and reviews to validate the success of these patterns is crucial.

You can build further upon this by providing guardrails to implementations that heuristics and migration patterns indicate can be performed by SWEs. These guardrails give confidence to everyone—SWEs, operations staff, DBREs, and leadership. The more you show success and build trust, the further this can go. By enabling the autonomy of SWEs, you can find that your relationships with those teams become much healthier as you prove to be more of value than of hindrance. Continuing to impart the depth of your expertise in database storage and access will continue to optimize their velocity and the relationship. You can do this individually via pairing or via more educational approaches, such as workshops, knowledge shares, and documents.

No amount of enablement will change the fact that there are some migrations that the DBRE must take on. Even at this phase, there are approaches you can take to educate and enable others. Again, pairing with the engineers during the migration planning and execution can be an excellent approach. Pairing with operations engineers as well can be highly valuable, as the more people who can assist and eventually own complex production implementations, the better.

Even without significant automation, there are plenty of ways for you and the DBRE team to continue to drive more reliable, error-free change in a way that does not cause the development pipeline to stagnate. The addition of technology, tools, and code can take this even further after trust in, and repeatability of, manual processes have been established.

There are metrics that you can use to measure how well DBREs and the organization are doing in this regard. Here are a few examples:

  • Migration pairing hours between SWEs/Ops staff and DBREs

  • Count of migrations requiring DBREs versus all migrations

  • Failure or success of migrations and impact
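These migration metrics are simple ratios once each migration is recorded. The sketch below assumes a hypothetical `Migration` record with the fields shown; where such records come from (a ticketing system, a deploy log) is left open.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Migration:
    migration_id: str
    required_dbre: bool
    succeeded: bool
    pairing_hours: float = 0.0  # hours SWEs/Ops staff spent pairing with a DBRE


def summarize(migrations: Iterable[Migration]) -> Dict[str, float]:
    """Compute the share needing DBREs, the success rate, and total pairing hours."""
    items = list(migrations)
    total = len(items)
    if total == 0:
        return {"total": 0, "dbre_required_ratio": 0.0, "success_rate": 0.0, "pairing_hours": 0.0}
    return {
        "total": total,
        "dbre_required_ratio": sum(m.required_dbre for m in items) / total,
        "success_rate": sum(m.succeeded for m in items) / total,
        "pairing_hours": sum(m.pairing_hours for m in items),
    }
```

Tracked over time, a falling `dbre_required_ratio` alongside a steady `success_rate` is one way to show that SWE autonomy is growing without added risk.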

Infrastructure design and deployment

In the section on architecture, we discussed working with engineering to choose tested and trustworthy datastores. Similarly, you must work constantly with operations and infrastructure staff to ensure that they have everything necessary not only to host those datastores, but also to deploy and maintain them. In Chapter 5, we discussed the various parts of this function in detail, and in Chapter 6, we discussed the software and tools needed to manage those infrastructures at scale. But we are still in the early stages of doing this for datastores, particularly distributed ones.

As with production implementations and giving software engineers more autonomy, so much of introducing this into the organization is about building trust through incremental steps. The first steps that can provide significant value to the DBRE team and the organization are using the same code repositories and versioning systems to manage your scripts, configuration files, and documentation. Then, you can work with the operations team to begin configuring and deploying empty datastores via configuration management and orchestration. This will still require you to finalize those datastores with the actual data, but it is an incremental step forward.

Throughout this, by pairing with operations, you can do testing for proper configurations, security testing, load testing, and even more advanced tests for data integrity and replication. The more you familiarize the entire team with how your databases work and how they break, the better. Availability and failure testing is another critical area in which to involve other organizations.

Finally, you can begin to give primary on-calls to operations staff and even senior developers managing their own infrastructures. With you and your team mirroring and pairing with them, they can rapidly gain confidence in working with these infrastructures while minimizing risk. It is only when the team truly feels confident knowing the ins and outs of maintaining all of this that you can begin to automate the riskier components, such as data loading, replication reworking, and primary node failovers.

There are metrics that you can use to measure how well DBREs and the organization are doing in this regard:

  • Count of infrastructure components that are managed via configuration management

  • Count of infrastructure components that are integrated into orchestration platforms

  • Count of successful and failed provisioning

  • Metrics on resource consumption—all subsystems used by the datastores

  • On-call shifts managed by non-DBREs

  • Incidents managed by non-DBREs and the Mean Time to Restore (MTTR)

  • Escalation counts to DBREs
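The incident-related metrics in this list can be derived from basic incident records. The sketch below assumes a hypothetical `Incident` record with start and restore timestamps; the field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Incident:
    started: datetime
    restored: datetime
    handled_by_dbre: bool
    escalated_to_dbre: bool = False


def mean_time_to_restore(incidents: List[Incident]) -> Optional[timedelta]:
    """Average of (restored - started); None when there are no incidents."""
    durations = [i.restored - i.started for i in incidents]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def non_dbre_summary(incidents: Iterable[Incident]) -> Dict[str, object]:
    """Incident count, MTTR, and escalations for incidents managed by non-DBREs."""
    non_dbre = [i for i in incidents if not i.handled_by_dbre]
    return {
        "incidents": len(non_dbre),
        "mttr": mean_time_to_restore(non_dbre),
        "escalations": sum(i.escalated_to_dbre for i in non_dbre),
    }
```

Comparing this MTTR with the figure for DBRE-managed incidents gives a rough signal of how far on-call confidence has spread beyond the DBRE team.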

So much of the success of this work is in relationships, empathy, trust, and shared knowledge. We know that many DBAs are used to functioning in isolation, but with these steps, you and your team can bring database work into the sunshine. No longer should it be a murky, scary function that only the bravest or most foolish engineers are willing to tackle. The key to this is repetitive exposure, constant incremental trust building, and pairing with others.

Data-Driven Decision Making

Trust cannot be built without excellent data on the impacts of changes. The Deming Cycle of plan, do, check, and act requires the observability we reviewed in Chapter 4. For every change, the key is to define clear, appropriate metrics for determining success, gather baselines, and then take the time to analyze the results with a skeptical eye.

Using your knowledge of the organization’s SLOs, as discussed in Chapter 2 and Chapter 3, is crucial to understanding the changes you must make. It also tells you which metrics and results you need in order to show the rest of the organization both the potential value that will drive a change and the resulting value of the effort to get there.

Hopefully you find yourself in an organization that has already seen the value of data-driven decision making and thus has already implemented the platform for observing, the processes for analyzing, and the discipline to consistently execute. Similarly, we hope that you are in an organization that has already defined clear, useful SLOs to drive your decision making. If not, you will need to begin with these practices to be able to drive deeper and potentially more far-reaching changes.

Data Integrity and Recoverability

We discussed the criticality of data integrity and the ability to recover from loss or corruption in Chapter 7. Too often organizations view this as the responsibility of the DBRE, but we know that this is an impossible task for the DBRE organization alone. Being the champion of a data integrity program will often fall upon the DBRE organization. Convincing the SWE organization of the importance of allocating resources for data validation pipelines and recovery APIs is a constant responsibility. Serendipitously, if you are breaking down the barriers in architecture and software development and addressing the lack of DBRE involvement in early phases of work, you will find yourself with the relationships and the trust to incrementally build the shared code and the knowledge required to implement an effective data integrity pipeline.

This is not an easy sell. Our experience with this is that most SWEs feel data integrity is the domain of the DBRE only. And constrained organizations will balk at the development of validation pipelines and recovery APIs. Thus, you will find yourself having to make the case while implementing “poor man’s” solutions that can gather the data around data-integrity issues. Similarly, tracking the efforts taken for manual recovery of data can go a long way toward convincing leadership to commit resources for recovery APIs and validation pipelines.

As you can see, the successful evolution of database reliability requires incremental and comprehensive organizational shifts. Choosing the areas that consume the most of your time, and that create the largest constraints on other organizations, is a skill you would be smart to practice. Then, building incremental points of change to build trust and create improvements will build momentum. But this all takes time, trust, and a lot of experimentation regarding what works for your organization’s risk levels, and what doesn’t.
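One example of a low-cost check that gathers such data is a periodic query for child rows whose parent no longer exists. The sketch below uses Python's built-in sqlite3 module only so that it is self-contained; the function name and its parameters are hypothetical, and the same query shape would need adapting to your own datastore.

```python
import sqlite3
from typing import List


def find_orphaned_rows(
    conn: sqlite3.Connection,
    child_table: str,
    fk_column: str,
    parent_table: str,
    parent_key: str = "id",
) -> List[int]:
    """Return rowids of child rows that reference a missing parent row.

    Table and column names are interpolated into the SQL, so pass only
    trusted identifiers (for example, from a reviewed configuration file).
    """
    query = (
        f"SELECT c.rowid FROM {child_table} AS c "
        f"LEFT JOIN {parent_table} AS p ON c.{fk_column} = p.{parent_key} "
        f"WHERE c.{fk_column} IS NOT NULL AND p.{parent_key} IS NULL "
        f"ORDER BY c.rowid"
    )
    return [row[0] for row in conn.execute(query)]
```

Logging the count of orphaned rows on every run, together with the hours spent on each manual recovery, produces the kind of trend data that can support a request for a proper validation pipeline.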

Wrapping Up

We’d like to thank you for taking the time to read through this book. Both of us are so passionate about evolving one of the most burdensome and byzantine of technical careers. Although a good portion of this book is aspirational, or still being proven in the wild, we believe that the DBRE movement is one that can drive so much value to data-driven services and organizations.

Our hope is that you are inspired to explore these shifts in your organization and that you are eager to learn more. We have tried to give further reading and exploration options throughout, as this framework is flexible. But, most important, we hope we’ve helped you see that there is opportunity to bring the time-honored role of DBA into the modern world and into the future. The role of DBA isn’t going away, and whether you are new to this career or a tried-and-scarred veteran, we want you to have a long career ahead of you as you drive value to every organization you are a part of.
