Chapter 8. Release Management

As we automate and ease the burdens of infrastructure management, the database reliability engineer (DBRE) is able to devote more time to the highly valuable parts of the job. One of these high-value activities is working with software engineers to build, test, and deploy application features. Traditionally, the database administrator (DBA) was a gatekeeper to production. They would expect to see each database migration, database object definition, and piece of code accessing the database to ensure that it was done correctly. When satisfied, the DBA would plan an appropriate hand-crafted change and shepherd it into production.

You might already be thinking that this is not a sustainable process for environments experiencing a high volume of deployments and database structure changes. In fact, if you've been part of one of these processes, you are already keenly aware of how quickly a DBA can go from gatekeeper to bottleneck, leading to burnout on the DBA's end and frustration in software engineering.

Our goal in this chapter is to look at how DBREs can effectively utilize their time, skills, and experience to support a software engineering process that utilizes continuous integration (CI) and even continuous deployment (CD) without becoming a bottleneck.

Education and Collaboration

One of the first steps the DBRE must take is educating the developer population about the datastores with which they are working. If the SWEs can make better choices about their data structures, their SQL, and their overall interaction strategies, there will be less need for direct intervention by the DBRE. By taking on the role of conduit of database knowledge to the SWE teams, you can have quite a significant impact on the continuous learning processes of your peers. This also fosters better relationships, trust, and communication, all of which are critical to the success of the technical organization.

To be clear, we are not advocating that the DBRE adopt a hands-off relationship with the software engineering team. Instead, we are suggesting an interaction wherein the DBRE uses regular touchpoints and strategic efforts to create a knowledgeable team that has access to resources and can function autonomously for the majority of its day-to-day database decisions.

Remember to keep everything that you do specific, measurable, and actionable. Define key metrics for your team’s success in this, and as you implement strategies and changes, see how they help the team. Some key metrics to consider in this process are:

  • Number of database stories that require DBRE interaction

  • Success/failure rates of DB story deployments

  • Feature velocity: how quickly can a SWE get a story into production?

  • Downtime caused by DB changes

Agile methodologies and DevOps cultures require people of different backgrounds, skill levels, and, of course, professional contexts to collaborate closely across functions. Education and collaboration are a huge part of this process and are great opportunities for you as the DBRE to shift out of the legacy "DBA" mode and become an integrated part of your technical organization.

Become a Funnel

You will undoubtedly find yourself following the blogs, Twitter feeds, and social accounts of people and organizations that you consider to be exceptional in the world of data and databases. In doing so, you will find articles, Q&A sessions, podcasts, and projects that have relevance and value to what you and your teams are doing. Curate these and share them. Create a regular newsletter, a forum, or even a channel in chat where you can post relevant information and bring it up for discussion. Show the engineering team that you and the other DBREs are invested in their success and continued development.

Foster Conversations

The next step is to create active dialogue and interaction with software engineers. This is where you and the team begin to dig into relevant content that you have shared to generate ideas, learn to apply the information, and even improve upon it by identifying gaps and teaming up for further study and experimentation. There are multiple ways to do this, and the right approach will depend significantly on the culture of learning and collaboration in your environment. Here are just a few examples:

  • Weekly tech talks

  • Brown bag lunches

  • Online AMAs (“ask me anythings”)

  • Chat channel focused on knowledge sharing

Similarly, you can hold open office hours, where people are encouraged to ask you questions, interact on specific topics, and explore things together.

Domain-Specific Knowledge

Although the previous components provide a general foundation of knowledge relevant to the datastores and architectures in use at your organization, there is still a need for knowledge transfer specifically related to your organization's domains.

Architecture

We are not fans of documentation that is static and untied to the processes that actually build and deploy our architectures. With configuration management and orchestration systems, you get a lot of documentation for free that is always up to date. Putting tools on top of these to allow for easy discovery, borrowing, and annotation with notes and comments creates a living, breathing document for teams.

On top of this comes the ability to understand context and history. There is a reason that certain datastore, configuration, and topology choices were made. Your job includes helping engineers find out what architecture is in place, why it is in place, where to find documentation on how to interact with it, and, finally, what trade-offs and compromises were made to get to where we are now.

As the DBRE, it is your job to make this knowledge, context, and history available to engineers who are making decisions daily while working on features without your oversight. Building a knowledge base of design documents creates the structure necessary to build context and history around the architecture. These documents can apply to full projects that require new architectural components, or they might relate to smaller incremental changes or subprojects. For instance, you would definitely need a design document to show the process of moving from statement- to row-based replication, but it would not necessarily have the same requirements as the first Kafka installation to support building a distributed log for event-driven architectures.

Creating and disseminating templates for these documents is a team exercise. It is critical, however, that certain pieces of information are included:

Executive summary

For those looking for the basics.

Goals and anti-goals

What was expected out of this project? What was out of scope?

Background

Context a future reader might need.

Design

From the high level down to the quite detailed, this is where you should find diagrams, sample configurations, and algorithms.

Constraints

What did you need to keep in mind and work around, such as PCI compliance, IaaS-specific needs, or staffing?

Alternatives

Did you evaluate other options? What methodology did you use, and why were the alternatives discarded?

Launch details

How was it rolled out? What problems arose, and how were they managed? Scripts, processes, and notes go here as well.

As you can see, these documents can potentially grow quite large. For some projects, that is OK. Distributed systems and multitier services are complex, and there is a lot of information and context that must be absorbed. Remember that a big goal here is giving that context to engineers without requiring more time from you than necessary.

Data model

Just as important as the architecture, data flow, and physical pipelines is information about the kind of data that is being stored. Letting software engineers know what kind of data is already stored, and where they can find it, can eliminate a significant amount of redundancy and investigative time from the development process. Additionally, this allows you to share how the same data should be represented in various paradigms: relational, key-value, or document oriented. This is also the opportunity to share best practices about which datastores are inappropriate for certain kinds of data.

Best Practices and Standards

Giving engineers standards for the activities they engage in regularly is another effective method for optimizing the amount of value you are able to generate. You can do this incrementally as you help engineers and make decisions. Some examples of this include the following:

  • Datatype standards

  • Indexing

  • Metadata attributes

  • Datastores to use

  • Metrics to expose

  • Design patterns

  • Migration and DB change patterns

Publishing these as you work with engineers creates a self-service knowledge base, accessible at any time, rather than forcing teams to bottleneck on you.

Tools

Giving software engineers effective tools for their development process is the ultimate enabler. You might be helping them with benchmarking tools and scripts, data consistency evaluators, templates, or even configurators for new data stores. Ultimately, what you are doing is enabling greater velocity in the development process while simultaneously freeing up your time for higher-value efforts.

Following are some excellent examples of tools:

  • Etsy’s Schemanator

  • Percona Toolkit, particularly online schema changes

  • SQL tuning and optimization suites

  • SeveralNines Cluster Configurator

  • Checked-in change plan templates and examples

  • Checked-in migration scripts and pattern examples

  • Benchmark suites for easy testing, visualization, and analysis

Treat software engineering teams as your customers, and practice lean product development. Get them minimum viable tooling for doing their jobs, and consistently interview, monitor, and measure their successes, failures, pain points, and wishes. This will guide you toward the tools that will give them the most benefit.

Collaboration

If you’re regularly educating, creating tools, and empowering engineers, good relationships will naturally be created. This is critical, because they lead to ongoing collaboration. Any software engineer should be empowered to reach out to the DBRE team to ask for information or for the chance to pair while they work. This gives great value bidirectionally, as the software engineers (SWEs) learn more about how the DBRE team works and what they look for, and the DBRE team learns more about the software development process.

DBREs can facilitate this further by proactively reaching out to engineers. Some stories obviously have a large dependency on DB development and refactoring. This is where DBREs should be focusing their efforts to help guarantee success and efficiency. Ask to pair or to be part of a team on these stories. Similarly, keeping an eye on migrations being committed into mainline will help the DBRE team cherry-pick where it needs to perform reviews.

It should go without saying, but making sure that DBREs are not segregated into their own cave, deep underground and away from where code is being built, will help ensure that this collaboration can actually occur. Rotating DBREs and SWEs into each other's projects and work can also help.

Throughout this section, we have been discussing ways in which you can help software engineers in the development process to be as self-sufficient as possible. As development teams grow, you need to utilize effective education, standards, and tooling to ensure that your teams are making good decisions without needing your direct intervention. Simultaneously, you are able to educate engineers regarding when they do need you to review upcoming changes and solutions so that you can assist at the appropriate times.

Next, we discuss how DBREs can effectively support the various components of the delivery pipeline. Although continuous delivery (CD) is not a new concept by any means, organizations have struggled to incorporate databases into the process. In each of the following sections, we discuss how to effectively introduce the database layers into the full delivery cycle.

Integration

Frequent integration of database changes allows for smaller, more manageable changesets and creates quicker feedback loops by identifying breaking changes as soon as possible. Many organizations strive for Continuous Integration (CI), enabling automatic integration of all checked-in changes. A large portion of the value of CI is the automated tests that prove that the database is meeting all expectations for the application. These tests are applied any time that code is committed.

Throughout the software development life cycle, any change to the database code or components should trigger a fresh build, followed by integration and its tests. You and the software engineering team are responsible for establishing the working definition of the database. Integration continues to verify that the database remains in a working state while the software engineers refactor the data model, introduce new datasets, and find new and interesting ways to query the database.

Doing CI for the database tier proves to be very challenging. In addition to the functional aspects of any applications utilizing database objects, there are operational requirements around availability, consistency, latency, and security. Changes to objects can affect stored code (functions, triggers, and views, among others) and even queries from other parts of the application. Additionally, advanced database features, such as events, can create more fragility. Beyond testing for functionality, there are numerous potential edge cases involving data integrity. Even when constraints can enforce integrity, those rules must be tested. Even more concerning are environments in which no constraints exist.

Prerequisites

To establish CI at the database level, there are five requirements that you must satisfy. Let's take a look at each one.

Version control system

Just as with infrastructure code and configurations, all database migrations must be checked in to the same version control system (VCS) as the rest of the application. It is critical that you are able to build from the latest configurations in order to understand how a recent database configuration change could potentially disrupt and break your application builds in new and interesting ways.

At the risk of being redundant: everything must be checked in as code. This includes the following:

  • DB object migrations

  • Triggers

  • Procedures and functions

  • Views

  • Configurations

  • Sample datasets for functionality

  • Data cleanup scripts

This provides a lot of useful things outside of CI:

  • You can easily find all related items in one place.

  • It supports all of the automated builds necessary for automated deployments (see Chapter 6).

  • You can find all history and versions of the database, which helps for recoveries, forensics, and troubleshooting.

  • You know that the application and database versions will be synchronized, at least in an ideal world.

As you continue to run integrations, validating checked-in code and infrastructure changes against a known working state, software engineers can apply the latest database versions to their development environments.

Database build automation

Assuming that you’re utilizing the configuration management and automation techniques discussed in the Chapter 6, you should be able to automatically build databases for integration. This includes applying the latest data definition language (DDL) scripts and loading representative datasets for testing. This can prove more challenging than you might expect because production data often must be cleaned up or scrubbed to ensure that no compliance issues occur with exposing customer data.

Test data

Empty databases almost always perform extremely well. Small datasets often do the same. You will need three different sets of data. First is all of the metadata needed for lookup tables. This is where you find IDs for customer types, location IDs, workflows, and internal tables. These datasets are generally small and crucial for the application to work correctly.

Next, you need a working set of functional data, such as customers or orders. This is generally just enough to let those quick tests that run in the early stages of integration succeed before investing time in the more intensive tests.

Finally, you need large datasets to help understand what things look like under production load. These usually need to be built from production sets and scrubbed to ensure that you don’t expose customer data, accidentally send emails to thousands of users, or other interesting and exciting opportunities for customer and legal interaction.

Metadata and test datasets should be versioned, checked in, and applied as part of the builds. Larger datasets often come from production, and the scripts needed to restore and cleanse that data should be versioned and checked in to ensure that there is synchronization between the application and persistence layers.

Database migrations and packaging

All of this presupposes that database changes are applied as migrations: incremental, coded changes. Each set of changes, such as a table alteration, a metadata addition, or a new column family, is checked in and given a sequence number. Because all changes are applied sequentially, you have a version number for the database at any time, based on the most recently applied migration.
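
To make this concrete, here is a minimal sketch of a migration runner under assumed conventions: a migrations/ directory of sequentially numbered SQL files and a schema_version tracking table. None of these names come from a specific tool.

    # Hypothetical migration runner. Assumes zero-padded filenames such as
    # migrations/0455_add_orders_index.sql so lexical sort equals numeric order,
    # and a schema_version table recording each applied sequence number.
    import pathlib
    import sqlite3  # stand-in for your actual database driver

    def apply_pending_migrations(conn, migrations_dir="migrations"):
        conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
        current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
        for path in sorted(pathlib.Path(migrations_dir).glob("*.sql")):
            version = int(path.name.split("_")[0])  # leading sequence number
            if version <= current:
                continue  # already applied
            conn.executescript(path.read_text())
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
            conn.commit()

    conn = sqlite3.connect("integration.db")
    apply_pending_migrations(conn)

Because the version lives in the database itself, any environment can report exactly which migrations it has received.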

Traditionally, DBAs would either get a list of changes from developers or do a schema diff between development and production to get the information needed for them to apply the necessary changes for a release. The benefit of this is that large, potentially highly impactful changes can be managed very carefully by the DB specialists. This can minimize potential downtime and impacts during expensive migrations.

The negative side of this traditional approach, however, is that it can be challenging to see which changes map to which features. If something must be rolled back, it can be challenging to identify the incremental database change related to the particular feature. Similarly, if a database change fails, all features waiting on those changes will be delayed, affecting time for stories to get to production.

The incremental approach allows for all of the things we want from agile approaches: rapid time to market, incremental and small changes, clear visibility, and fast feedback loops. But this means that SWEs must be more knowledgeable about creating safe migrations, and about when they should bring the DBRE team in to help them. Additionally, there is a risk that migrations might conflict. If two SWEs are modifying the same objects, their migrations would run serially, which could cause two alters instead of one. If the object has a lot of data in it, this could greatly increase migration time. You must consider the trade-offs in these cases, which means that SWEs must be aware that they are potentially stepping on one another's toes.

CI server and test framework

We assume that your software integration process already utilizes these components. A good CI system will provide all of the necessary functionality for integration. Testing frameworks will provide both system-level tests and code component tests.

At the system level, frameworks such as Pester for Windows or Robot for Linux are available. Additionally, you can utilize Jepsen, a distributed systems testing framework specifically built to validate data consistency and safety in distributed storage.

With these prerequisites, you can begin the work of using your company’s CI platform for database migrations. As the name implies, continuous integration means that anytime a database change is committed, integration is performed automatically. For this to happen and for the engineering team to be confident that the changes will not adversely affect the application’s functionality and service-level expectations, testing becomes the key tool.

Testing

So, you have all engineers checking their database changes into the VCS. The CI server is able to trigger automated database builds synchronized with the application releases, and you have a testing framework. What's next? We need to verify that integration works and what kind of effects it will have on the next phase: deployment.

Unfortunately, we’re here to tell you that this stuff is hard! Database changes are notorious for affecting huge amounts of code and functionality. That being said, there are ways to build applications that can make this easier.

Test-Friendly Development Practices

When designing development processes, you can make things easier for testing with any number of choices. We’ve included two here as examples.

Abstraction and encapsulation

There are numerous ways in which you can abstract database access away from SWEs. Why would you do this? Centralizing database access code creates a standard, easily understood way of implementing new objects and accessing existing ones. It also means that you don't have to hunt for code all over the codebase in order to make a database change. This simplifies testing and integration tremendously. There are a few different ways to do this abstraction:

  • Data access objects (DAOs)

  • APIs or web services

  • Stored procedures

  • Frameworks meant for this

With these in place, your integration can focus on testing the primitives around accessing and updating data first, to see if changes have affected the ability to use them. As with any testing, you want high-impact, quick-execution tests first, and centralized data access code makes this much easier to accomplish.

Being efficient

Often you might find engineers using a "select *", or retrieving an entire row of an object to work with. This is done to "future proof" the code, or to ensure that whatever might need data gets it. Perhaps they want to be sure that if an attribute is added to the object, they automatically retrieve it. This is dangerous and, like any "future proofing," wasteful, and it puts applications at risk during changes. A "select *" will retrieve all columns, and if code is not ready to handle that, it will break. All of the data retrieved must also be shipped over the network, which requires more bandwidth when you are retrieving multiple rows and begin overfilling your TCP packets. Being selective about what you retrieve is crucial. You can modify object access code when the right time comes, and you'll be prepared for it when it does.
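
For example, in a hypothetical data access function (the table and column names are purely illustrative):

    # Resilient: name exactly the columns this code path uses. A "select *"
    # here would break consumers when columns are added and would ship bytes
    # that nothing reads.
    def load_order_summaries(conn, customer_id):
        return conn.execute(
            "SELECT id, status, total FROM orders WHERE customer_id = ?",
            (customer_id,),
        ).fetchall()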

Post-Commit Testing

The goal of post-commit testing is to validate that changes apply successfully and that the application is not broken. Additionally, impact analysis and rules-based validation for security and compliance can occur at this level. After code has been committed, the build server should immediately build an integration datastore, apply changes, and begin a series of tests that are quick enough to run that the feedback loop to engineers is as tight as possible. This means a quick database build using a checked-in minimal dataset containing all necessary metadata, user accounts, and test data necessary to exercise the appropriate functions on all data access objects. This allows for a fast turnaround to engineers to see if they’ve broken the build.

Early in an organization’s life, much of this might be done manually. As rules come into play, tools and automation can be applied to make these processes faster and more bulletproof.

Pre-build

Prior to applying changes, the following validations against established rules for impact analysis and compliance can be performed (a sketch of such a rules-based check follows the list):

  • Validation that SQL is formed correctly

  • Validation of the number of rows potentially affected by changes

  • Validation of index creation for new columns

  • Validation that defaults are not applied to new columns on tables with existing data

  • Validation of impact to stored code and referential constraints

  • Report that sensitive database objects and attributes are being updated

  • Report when rules required for compliance are being violated
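
A minimal sketch of such a rules check, assuming migrations arrive as SQL text; the rules, regexes, and names here are illustrative, and production implementations usually lean on a real SQL parser rather than regular expressions:

    # Hypothetical pre-build rules check. Each rule inspects a changeset's SQL
    # and returns a warning string, or None when the rule passes.
    import re

    def missing_where(sql):
        if re.search(r"\b(UPDATE|DELETE)\b", sql, re.I) and not re.search(r"\bWHERE\b", sql, re.I):
            return "UPDATE/DELETE without a WHERE clause to filter rows"

    def default_on_new_column(sql):
        if re.search(r"ADD\s+COLUMN\b.*\bDEFAULT\b", sql, re.I | re.S):
            return "default applied to a new column; existing rows must be rewritten"

    RULES = [missing_where, default_on_new_column]

    def prebuild_report(changeset_sql):
        """Return the list of rule violations for one changeset."""
        return [msg for rule in RULES if (msg := rule(changeset_sql)) is not None]

Reports like these can surface in the build output so that the engineer sees violations in the same feedback loop as test failures.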

Build

When the build runs, validation of SQL occurs again; in this case, it is based on the actual application of the changes rather than on rules-based analysis.

Post-build

After the changes have been applied to the build, you can run functional test suites. You can also create reports that show analysis of impact and any rules violations that occurred in the change.

Full Dataset Testing

There is always the potential that, when the application runs against a full production dataset, the service will no longer meet service-level expectations. This means that the test suite should also be run against production datasets at appropriate loads. This requires more preparation and resources, so this test suite can be scheduled asynchronously from the standard commit integration tests. Depending on the frequency of integration and code pushes, you might find that a weekly or even daily schedule makes the most sense for these tests.

The steps taken for this extensive testing vary but will generally follow a blueprint such as the following:

  • Provision datastore and application instances

  • Deploy code

  • Recover full dataset

  • Anonymize data

  • Hook up metrics collection

  • Apply changes to the datastore

  • Launch functional, quick tests

  • Perform load tests, ramping up concurrency

  • Deprovision instances

  • Post-test analysis

Some things that you will want to look at in these tests include the following:

  • Latency changes for tests compared to previous runs on smaller datasets

  • Database access path changes in optimizers that could affect latency or resource utilization

  • Database metrics that indicate potential performance or functionality impacts (locking, errors, waits on resources)

  • Changes in resource utilization from previous runs

You can automate some analysis, such as the registration of queries into a centralized datastore and the comparison of historical plan changes. Other analysis, such as metrics review, requires an operator to determine whether any changes are acceptable.
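
For the automatable portion, here is a sketch of a run-over-run latency comparison; the data shapes and threshold are assumptions for illustration:

    # Hypothetical regression check: compare per-query p95 latency against the
    # previous full-dataset run and flag anything that got meaningfully worse.
    TOLERANCE = 1.25  # example: flag regressions beyond 25%

    def latency_regressions(previous, current):
        """previous/current: dicts of query fingerprint -> p95 latency in ms."""
        flagged = {}
        for query, p95 in current.items():
            baseline = previous.get(query)
            if baseline is not None and p95 > baseline * TOLERANCE:
                flagged[query] = (baseline, p95)
        return flagged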

If any red flags come up, automated or not, the DBREs are able to narrow down the changes requiring analysis by reviewing the changes applied since the last test run. Although this does not allow for immediate flagging of a specific committed change, it does allow for much quicker identification.

In addition to fast and slow analysis of applications, there are additional tests that must periodically be performed on a rapidly evolving datastore. These tests ensure that the evolving database will continue to be a good citizen in the overarching ecosystem. These are downstream tests and operational tests.

Downstream Tests

Downstream tests are used to ensure that any data pipelines and consumers of the datastore are not adversely affected by any changes applied as part of the migrations. Like full dataset testing, downstream tests are best done asynchronously from the commit process. Here are some examples of downstream tests:

  • Validating event workflows triggered by data in the database

  • Validation of extraction, transform, and loading of data into analytics datastores

  • Validation of batch and scheduled jobs that directly interact with the database

  • Validation that job times have not increased significantly, potentially affecting delivery at specific times for required delivery or downstream processes

As with full-dataset testing, these tests are often much more extensive and require larger datasets for consumption. By running them asynchronously but regularly, it is easier to identify the changes that affected the downstream processes flagged in testing. If tests fail, pushes to production can be stopped, and tickets can be put in the DBREs' queue automatically when rules are violated.

Operational Tests

As datasets increase and schemas evolve, there are opportunities for operational processes to run longer and potentially fail. These process tests include the following:

  • Backup and recovery processes

  • Failover and cluster processes

  • Infrastructure configuration and orchestration

  • Security tests

  • Capacity tests

These tests should regularly perform automated builds from production datasets, with all pending and committed changes applied before the tests are run. Failed tests can advise the build server that there is a problem that must be evaluated and resolved before changes can be pushed to production. Although it is rare for database changes to affect these processes, the impact can be severe at the service level, and thus a high level of diligence is required.

With a combination of continuous, lightweight builds and tests, and strategically scheduled more-intensive tests, you can foster a greater degree of confidence in reliability engineering, software engineering, operations, and management that database changes can be safely introduced into production without direct intervention by the DBRE team.

These integration processes are a perfect example of DBREs providing high amounts of value through process, knowledge sharing, and automation to empower software engineers without bottlenecking them. In the next section, we will discuss the largest elephant in the room: deployment. Recognizing that a database change is safe is the first step, but safely getting those changes into a production environment is just as important.

Deployment

In the previous section on integration, we touched on the concept of database migrations and some of their pros and cons. Because these changes can be so significant, it makes sense to decompose data migrations in such a way that SWEs can easily and incrementally modify environments safely, or at least in as safe a way as possible.

In an ideal world, our goals should be to empower SWEs to recognize when their database changes require analysis and management by DBREs in order to be effectively introduced into production. Additionally, we would be able to give those engineers the tools to safely and reliably introduce most changes into production themselves. Finally, we would give SWEs the ability to push their changes to production at any time, rather than during restrictive maintenance windows. Creating a reasonable approximation of this world is what we will review in this section.

Migrations and Versioning

As discussed in "Prerequisites", each changeset that is applied to the database should be given a numeric version. This is generally done with incrementing integers that are stored in the database after a changeset is applied. This way, your deployment system can easily look at the database and discover the current version, which allows for easy application of changes when preparing to push code. If a code deployment is certified for database version 456, and the current database version is 455, the deployment team knows that it must apply the changeset for 456 prior to pushing code.
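
A deployment-time check might look like this sketch, reusing the hypothetical schema_version table from earlier; the helper names are illustrative:

    # Hypothetical pre-deploy gate: bring the database up to the version the
    # code deployment was certified against before any code is pushed.
    def database_version(conn):
        return conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0

    def ensure_version(conn, certified_version, apply_changeset):
        current = database_version(conn)
        if current > certified_version:
            raise RuntimeError("database is ahead of the code being deployed")
        for version in range(current + 1, certified_version + 1):
            apply_changeset(conn, version)  # e.g., changeset 456 before its code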

So, an SWE has committed changeset 456 into the code base and integration has been successfully run with no breaking changes. What comes next?

Impact Analysis

We discussed impact analysis in the previous section under post-commit testing. Some impacts, such as the invalidation of stored code in the database, or violation of security controls, are gates that cannot be passed. The SWE must go back and modify her changes until these impacts have been mitigated.

In this section, we discuss the impact of performing the database migration on production database servers. Database changes can affect a production service in multiple ways.

Locking of objects

Many changes can cause a table, or even a group of tables, to be inaccessible for writes, reads, or both. In these cases, the amount of time the objects are inaccessible should be estimated and determined to be acceptable or not. What counts as acceptable locking will be part of the service-level objectives (SLOs) and business needs, and thus is subjective. Previous changes on these objects can be recorded with specific metrics around the time it took to run the change. This provides some objective data for estimating impact time, even though the time it takes to change objects will lengthen as dataset size and activity increase.

If unacceptable, the DBRE should work with the deployment team to determine a plan to either reduce the time to a point of acceptability, or to redirect traffic until such time as the change has successfully finished.

Saturation of resources

A change can also utilize significant amounts of Input/Output (I/O), which can increase latency for all other transactions utilizing the datastore. This can cause service levels to be violated, and eventually cause processes to back up to a point at which the application becomes unusable and other resources are also saturated. This can easily cause a cascading failure.

Data integrity issues

As part of these changes, there are often transitional periods during which constraints might be relaxed or deferred. Similarly, locking and invalidation can cause data to not be stored in the way SWEs would expect.

Replication stalls

Database changes can also cause increased activity and lagged replication. This can affect the usefulness of replicas and even put failover at risk.

It is these impacts that we as DBREs must help SWEs proactively identify and avoid.

Migration Patterns

After impact analysis, the SWE should be able to make a decision on the appropriate way to deploy the migration. For many migrations, there is no reason to go through a lot of incremental changes and extensive review work to execute. New objects, data inserts, and other operations can be easily pushed through to production.

After data is in the system, however, changes to or removal of existing data, and modification or removal of objects with data in them, create opportunities for your migration to affect service levels, as discussed earlier. It is at this time that the SWE should bring the DBRE in. Luckily, there is a relatively finite set of changes for which you can plan. As you work with SWEs to plan and execute these migrations, you can build a repository of patterns for database changes. At some point, if they happen frequently and painlessly enough, you can automate them.

For example, you can set up deployment gates in integration and testing that utilize rules-based analysis and testing results to determine whether migrations are safe to deploy. Some flagged operations could include the following (a sketch of such a gate follows the list):

  • Updates and deletes without a WHERE clause to filter rows

  • Number of rows impacted is greater than N

  • Alters on tables with a certain dataset size

  • Alters on tables stored in metadata as too busy to have live alters on them

  • New columns with defaults

  • Certain datatypes in create/alter statements, such as BLOB (Binary Large Object)

  • Foreign keys without indexes

  • Operations on particularly sensitive tables
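
Wired together, such a gate can be as simple as this sketch; the thresholds, the "busy tables" metadata, and the migration object's helper methods are all assumptions for illustration:

    # Hypothetical deployment gate combining rule flags with table statistics.
    MAX_AFFECTED_ROWS = 100_000          # example threshold
    BUSY_TABLES = {"orders", "events"}   # illustrative "too busy to alter" metadata

    def gate(migration, stats):
        flags = []
        if migration.estimated_rows(stats) > MAX_AFFECTED_ROWS:
            flags.append("affects more than %d rows" % MAX_AFFECTED_ROWS)
        if migration.altered_tables() & BUSY_TABLES:
            flags.append("alters a table marked too busy for live alters")
        if flags:
            return "needs-dbre-review", flags
        return "auto-deploy", []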

The more flags and safeguards you put in place to enable safety in production for everyone, the more confidence you create in all teams. This translates directly into development velocity. Now, let's assume that our intrepid SWE who checked in changeset 456 has had his change flagged due to an alter that is deemed to be impactful. At this point, he can use a migration pattern for that operation if one has already been applied and documented. Otherwise, he should create one in collaboration with the DBRE team.

Pattern: locking operations

Adding a column is a very typical operation in most database environments. Depending on the DBMS that you are using, these operations can be quick and simple, without locking the table. Other DBMSs will require a re-creation of the table. When adding the column, you might want to put a default value in it as well. This will definitely create a significant impact, because the value must be entered into each existing row of the table before the change is completed and the lock is released.

One way to avoid some locking operations is to utilize application code; for example (a sketch follows the list):

  • Add empty column

  • Perform regression tests

  • Utilize conditional code in the select statement at access time to determine if a row needs updating rather than performing it as a batch statement

  • Set up a watcher to advise when the attribute is fully populated and conditional code can be removed
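
As a sketch of the conditional-code step, under assumed table and column names:

    # Hypothetical lazy backfill: the column was added empty (cheap), and rows
    # are populated at access time instead of by one locking batch UPDATE.
    def compute_loyalty_tier(row):
        return "standard"  # illustrative placeholder derivation

    def get_customer(conn, customer_id):
        row = conn.execute(
            "SELECT id, name, loyalty_tier FROM customers WHERE id = ?",
            (customer_id,),
        ).fetchone()
        if row is not None and row[2] is None:  # not yet backfilled
            tier = compute_loyalty_tier(row)
            conn.execute(
                "UPDATE customers SET loyalty_tier = ? WHERE id = ?",
                (tier, customer_id),
            )
            conn.commit()
            row = (row[0], row[1], tier)
        return row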

For some operations, locking of the object is unavoidable. In these cases, you must give a pattern, automatic or manual, to engineers. This might be a tool for performing online changes via triggers and table renames. Or, it might be a rolling migration utilizing proxies and failovers to apply the changes node by node on out-of-service nodes.

It can be tempting to have two processes: one that is lightweight and one that has more steps. This way, you only roll out the complex pattern for major impacting changes. However, this can cause you to rely too heavily on one process, leaving the other underpracticed and perhaps buggy. It is best to be consistent with the process that works most effectively for all locking operations.

Pattern: high resource utilization operations

There are multiple patterns that you can utilize here, depending on the operations that are being executed.

For data modification, throttling by performing the updates in batches is a simple pattern to give engineers performing bulk operations. For larger environments, it often makes more sense to utilize code to do lazy updating, upon the login of a user or the querying of a row, for example.

For data removal, you can encourage SWEs to utilize soft deletes in their code. A soft delete flags a row as deleted, which means it can be filtered out of queries in the application and removed at will. You can then throttle the deletes, removing the rows asynchronously. As with bulk updates, for large datasets this might prove to be impossible. If deletes are regularly performed on ranges, such as dates or ID groupings, you can utilize partitioning features to drop partitions instead. By dropping a table or partition, you do not create undo I/O, which reduces resource consumption.
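
A throttled purge of soft-deleted rows might look like this sketch; the MySQL-style DELETE ... LIMIT, the batch size, and the pause are all illustrative assumptions:

    # Hypothetical throttled purge: remove soft-deleted rows in small batches,
    # pausing between batches so I/O and replication can keep up.
    import time

    BATCH_SIZE = 1000     # example batch size
    PAUSE_SECONDS = 0.5   # example throttle between batches

    def purge_soft_deleted(conn):
        while True:
            cur = conn.execute(
                "DELETE FROM orders WHERE is_deleted = 1 LIMIT %d" % BATCH_SIZE
            )
            conn.commit()
            if cur.rowcount < BATCH_SIZE:
                break  # backlog cleared
            time.sleep(PAUSE_SECONDS)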

Should you find that DDL operations such as table alters create enough I/O that latency is affected, you should consider this a red flag that capacity might be reaching its limits. Ideally, you would work with operations to add more capacity to the data stores. However, if this is not possible or is delayed, these DDL operations can be treated like blocking operations, with the appropriate pattern being applied.

Pattern: rolling migrations

As discussed in the previous sections, it often makes sense to give engineers the ability to apply changes incrementally across each node in a cluster. This is often called a rolling upgrade, because you are rolling the change through each node. You accomplish this somewhat differently depending on whether the cluster can be written to via any node or whether only one node can function as the target write node.

In a write-anywhere cluster, such as Galera, you can take one node out of service by removing it from a service directory or proxy configuration. After draining the traffic from the node, you can apply the change. The node is then reintroduced into the appropriate configuration or directory to put it back in service.

In a write-leader cluster, where writes go to one node, you would take the followers in the replication chain out of service individually, as in the write-anywhere cluster. However, after this has been done for all nodes but the leader, there must be a failover to a node with the changeset already applied.
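
Sketched as orchestration logic, every helper here, from service discovery to failover, is an assumption; real environments wire these very differently:

    # Hypothetical rolling migration: drain each node, apply the changeset,
    # return the node to service; in a write-leader topology, fail the leader
    # over to an already-upgraded follower before altering it.
    def rolling_migration(cluster, changeset):
        for node in cluster.followers():
            cluster.remove_from_service(node)  # stop routing traffic to it
            node.wait_for_drain()              # let in-flight queries finish
            node.apply(changeset)
            cluster.return_to_service(node)
        leader = cluster.leader()
        if leader is not None:                 # write-leader clusters only
            cluster.failover(to=cluster.any_upgraded_follower())
            cluster.remove_from_service(leader)
            leader.wait_for_drain()
            leader.apply(changeset)
            cluster.return_to_service(leader)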

Obviously, both of these choices require a lot of orchestration. Understanding which operations are expensive and might require rolling upgrades is a critical part of choosing a datastore. It is also why many people are exploring database solutions that allow for only low-impact schema evolution.

Migration testing

Even though it might seem obvious, it is imperative to recognize that if a changeset's implementation details are modified, the revised migration must be committed and fully integrated before it is deployed to post-integration environments, including production.

Rollback testing

In addition to testing migrations and their impact, the DBRE and her supported teams must consider the failure of migrations/deploys and the rolling back of partial or full changesets. Rollback scripts should be checked in at the same time as migrations. There can be autogenerated defaults for some migrations, such as table creations, but there must be an accounting for data that has come in since the change. Therefore, we don't recommend reverting by simply dropping an object. Renaming tables allows them to remain accessible in case data was written and must be recovered.
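
For example, a rollback for a newly created table can rename rather than drop, as in this sketch (identifiers cannot be bound as parameters, so they are interpolated here; names are illustrative):

    # Hypothetical rollback: rename the table out of the way instead of
    # dropping it, so data written since the migration remains recoverable.
    def rollback_new_table(conn, table, version):
        conn.execute(
            "ALTER TABLE %s RENAME TO %s_rollback_v%d" % (table, table, version)
        )
        conn.execute("DELETE FROM schema_version WHERE version = ?", (version,))
        conn.commit()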

Migration patterns also ease the process of defining rollbacks. The lack of an effective rollback script can be a gating factor in the integration and deployment process. To validate that the scripts work, you can use the following deployment and testing pattern (a sketch follows the list):

  • Apply changeset

  • Quick integration tests

  • Apply rollback changeset

  • Quick integration tests

  • Apply changeset

  • Quick integration tests

  • Longer and periodic testing
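
This cycle automates naturally; here is a sketch, with the changeset helpers and the test runner as assumed interfaces:

    # Hypothetical apply-rollback-apply verification: proves both the migration
    # and its rollback script before the changeset is eligible for deployment.
    def verify_changeset(conn, changeset, run_quick_integration_tests):
        changeset.apply(conn)
        run_quick_integration_tests(conn)
        changeset.rollback(conn)
        run_quick_integration_tests(conn)
        changeset.apply(conn)              # leave the build at the new version
        run_quick_integration_tests(conn)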

Much like testing recoveries, testing fallbacks is critical, and you must incorporate it into every build and deploy process.

Manual or Automated

Another advantage of using migration patterns is that a migration pattern can allow for automatic approval and deployment rather than waiting for review and execution by the DBRE team. It also means that certain patterns can automatically flag DBREs for implementation outside of the automated process.

There is no reason to rush the path to automation, particularly when dealing with critical data. Community best practices posit that anything run frequently should be automated if at all possible, but this must be weighed against the impact of failed automation. If you have built an environment that has tested and reliable fallbacks, rapid and practiced recovery processes, and mature engineers, you can begin to take migration patterns and automate the application of those changes. But even just moving toward standardized models, push-button deploys and fallbacks, and guardrails/flags provides significant progress toward our goals.

Wrapping Up

In this chapter, we covered the ways in which a DBRE team provides outsized value to software engineering teams through the development, integration, testing, and deployment phases of software development. We can't emphasize enough how much of this relies on collaboration and very close relationships between the DBRE, operations, and SWE teams. As the DBRE in this equation, you must consider yourself teacher, diplomat, negotiator, and student. The more you invest in education and relationships, the greater the dividends that will be paid as you apply the knowledge shared in this chapter.

A natural progression from release management is that of security. Your data is one of the most significant attack vectors in the infrastructure. Every change and feature potentially creates vulnerabilities that must be planned for and mitigated. In Chapter 9, we discuss how to bring value to security planning and processes.
