Chapter 18. Barrier Conditions and Rollback

He will conquer who has learned the artifice of deviation. Such is the art of maneuvering.

—Sun Tzu

Whether your development methodology is Agile, waterfall, or some hybrid, implementing the right processes for deploying updates into your production environment can protect you from significant failures. Conversely, poor processes are likely to cause painful and persistent problems. Checkpoints and barrier conditions within your product development life cycle can increase quality and reduce the cost of developing your product. However, even the best teams, armed with the best processes and great technology, may make mistakes and incorrectly analyze the results of certain tests or reviews. If your platform implements a service, you need to be able to quickly roll back significant releases to keep scale-related events from creating availability incidents.

Developing effective go/no-go processes or barrier conditions and coupling them with a process and capability to roll back production changes are necessary components within any highly available and scalable service. The companies focused most intensely on cost-effectively scaling their systems while guaranteeing high availability create several checkpoints in their development processes. These checkpoints represent an attempt to guarantee the lowest probability of a scalability-related event and to minimize the impact of any such event should it occur. Companies balancing scalability and high availability also make sure that they can quickly get out of any event created through recent changes by ensuring that they can always roll back from any major change.

Barrier Conditions

You might read this heading and immediately assume that we are proposing that waterfall development cycles are the key to success within highly scalable environments. Very often, barrier conditions or entry and exit criteria are associated with the phases of waterfall development—and sometimes identified as a reason for the inflexibility of a waterfall development model. Our intent here is not to promote the waterfall methodology, but rather to discuss the need for standards and protective measures regardless of which approach you take for development.

For the purposes of this discussion, assume that a barrier condition is a standard against which you measure success or failure within your development life cycle. Ideally, you want to have these conditions or checkpoints established within your cycle to help you decide whether you are, indeed, on the right path for the product or enhancements that you are developing. Recall our discussion of goals in Chapter 4, Leadership 101, and Chapter 5, Management 101, and the need to establish and measure these goals. Barrier conditions are static goals within a development methodology that ensure your product aligns with your vision and need. A barrier condition for scalability might include checking a design against your architectural principles within an Architecture Review Board process before the design is implemented. It might include completing a code review to ensure the code is consistent with the design, or performance testing the implementation. In a continuous delivery environment, it might be executing the unit test suite before releasing into production.

Barrier Conditions and Agile Development

In our practice, we have found that many of our clients have a mistaken perception that including or defining standards, constraints, or processes in Agile processes is a violation of the Agile mindset. The very notion that process runs counter to Agile methodologies is flawed from the outset: any Agile method is itself a process. Most often, we find the Agile Manifesto quoted out of context as a reason for eschewing any process or standard.1 As a reminder, according to the Agile Manifesto, Agile methodologies value:

• Individuals and interactions over processes and tools

• Working software over comprehensive documentation

• Customer collaboration over contract negotiation

• Responding to change over following a plan

1. This information is from the Agile Manifesto, www.agilemanifesto.org.

Organizations often take the “individuals and interactions over processes and tools” edict out of context without reading the line that follows these bullets: “That is, while there is value in the items on the right, we value the items on the left more.”2 This statement clarifies that processes add value, but that people and interactions should take precedence over them where we need to make choices. We absolutely agree with this approach. As such, we prefer to inject process into Agile development most often in the form of barrier conditions to test for an appropriate level of quality, scalability, and availability, or to help ensure that engineers are properly evaluated and given guidance. Let’s examine how some key barrier conditions enhance our Agile method.

2. Ibid.

We’ll begin with the notion of valuing working software over comprehensive documentation. None of the suggestions we’ve made—from ARB and code reviews to performance testing and production measurement—violate this rule. The barrier conditions represented by the ARB and joint architecture design (JAD) processes are used within Agile methods to ensure that the product under development can scale appropriately. The ARB and JAD processes can be performed verbally in a group and with limited documentation and, therefore, are consistent with the Agile method.

The inclusion of barrier conditions and standards to help ensure that systems and products work properly in production actually supports the development of working software. We have not defined comprehensive documentation as necessary in any of our proposed activities, although it is likely that the results of these activities will be logged somewhere. Remember, we are interested in improving our processes over time. Logging performance results, for instance, will help us determine how often we are making mistakes in our development process that result in failed performance tests in QA or scalability issues within production. This is similar to the way Scrum teams measure velocity to get better at estimations.

The processes we’ve suggested also do not in any way hinder customer collaboration, nor do they favor contract negotiation over it. In fact, one might argue that they foster a better working environment with the end customer by inserting scalability barrier conditions to better serve the customer’s needs. Collaborating to develop tests and measurements that will help ensure that your product meets customer needs and then inserting those tests and measurements into your development process is a great way to take care of your customers and create shareholder value.

Finally, the inclusion of the barrier conditions we’ve suggested helps us respond to change by helping us identify when a change is occurring. The failure of a barrier condition is an early alert to issues that we need to address immediately. Identifying that a component is incapable of being scaled horizontally (“scale out, not up” from our recommended architectural principles) in an ARB session is a good indication of potential problems that might be encountered by customers. Although we may make the executive decision to launch the feature, product, or service, we had better ensure that future Agile cycles are used to fix the issue we’ve identified. However, if the need for scale is so dramatic that a failure to scale out will keep us from being successful, should we not respond immediately to that issue and fix it? Without such a process and series of checks, how would we ensure that we are meeting our customers’ needs?

We hope this discussion has convinced you that the addition of criteria against which you can evaluate the success of your scalability objectives is a good idea within your Agile implementation. If not, please remember the “board of directors” test introduced in Chapter 5, Management 101: Would you feel comfortable stating that you absolutely would not develop processes within your development life cycle to ensure that your products and services could scale? Imagine yourself saying to the board, “In no way, shape, or form will we ever implement barrier conditions or criteria to ensure that we don’t release products with scalability problems!” How long do you think you would have a job?

Barrier Conditions and Waterfall Development

The inclusion of barrier conditions within waterfall models is not a new concept. Most waterfall implementations include a concept of entry criteria and exit criteria for each phase of development. For instance, in a strict waterfall model, design may not start until the requirements phase is completed. The exit criteria for the requirements phase, in turn, may include a signoff by key stakeholders, a review of requirements by the internal customer (or an external representative), and a review by the organizations responsible for producing those requirements. In modified, overlapping, or hybrid waterfall models, requirements may need to be complete for the systems to be developed first but may not be complete for the entire product or system. If prototyping is employed, potentially those requirements may need to be mocked up in a prototype before major design starts.

For our purposes, we need to inject only the four processes we identified earlier into the existing barrier conditions. The Architecture Review Board lines up nicely as an exit criterion for the design phase of our project. Code reviews, including a review consistent with our architectural principles, might create exit criteria for our coding or implementation phase. Performance testing that specifies a maximum percentage change allowed for critical systems should be performed during the validation or testing phase. Defining and implementing production measurements should be the entry criterion for the maintenance phase. Significant increases in any measured area, if not expected, should trigger new work items to reduce the impact of the implementation or changes in architecture to allow for more cost-effective scalability.
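As a minimal sketch of the performance-testing barrier described above, the following Python function compares baseline measurements with those of a release candidate and fails the barrier when any metric grows by more than an allowed percentage. The metric names, the 5% default threshold, and the function names are illustrative assumptions, not values prescribed by this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    metric: str
    change_pct: float
    passed: bool


def check_performance_barrier(baseline, candidate, max_increase_pct=5.0):
    """Compare candidate measurements to a baseline, metric by metric.

    Both arguments map a metric name (hypothetical, e.g. "checkout_p95_ms")
    to a measurement where a larger value is worse. A metric fails the
    barrier when it grew by more than max_increase_pct percent.
    """
    results = []
    for metric, base_value in baseline.items():
        if base_value <= 0:
            raise ValueError(f"baseline for {metric} must be positive")
        new_value = candidate[metric]
        change_pct = (new_value - base_value) / base_value * 100.0
        results.append(
            GateResult(metric, change_pct, change_pct <= max_increase_pct)
        )
    return results


def go_no_go(results):
    """The release is a 'go' only if every metric passed."""
    return all(r.passed for r in results)
```

A team would typically choose a different threshold for each critical system; a single default is used here only to keep the sketch short.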

Barrier Conditions and Hybrid Models

Many companies have developed models that merge Agile and waterfall methodologies, and some continue to follow the predecessor to Agile methods known as rapid application development (RAD). For instance, some companies may be required to develop software consistent with contracts and predefined requirements, such as those that interact with governmental organizations. These companies may wish to have some of the predictability of dates associated with a waterfall model, but desire to implement chunks of functionality quickly as in Agile approaches.

The question for these models is where to place the barrier conditions for the greatest benefit. To answer that question, we need to return to the objectives when using barrier conditions. Our intent with any barrier condition is to ensure that we catch problems or issues early in our development so that we reduce the amount of rework needed to meet our objectives. It costs us less in time and work, for instance, to catch a problem in our QA organization than it does in our production environment. Similarly, it costs us less to catch an issue in the ARB review than if we allow that problem to be implemented and subsequently catch it only in a code review.

The answer to the question of where to place the barrier conditions, then, is simple: Put the barrier conditions where they add the most value and incur the least cost to our processes. Code reviews should be placed at the completion of each coding cycle or at the completion of chunks of functionality. The architectural review should occur prior to the beginning of implementation, production metrics obviously need to occur within the production environment, and performance testing should happen prior to the release of a system into the production environment.
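The ordering just described can be sketched as a simple go/no-go pipeline. In the hypothetical Python below, each barrier is a named check, and evaluation stops at the first failure so that work does not move on to a later, more expensive phase. The barrier names and the stand-in lambda checks are assumptions for illustration only; real checks would consult review records, test results, or monitoring configuration.

```python
from typing import Callable, List, Optional, Tuple

# Each barrier is a (name, check) pair; check returns True for "go".
Barrier = Tuple[str, Callable[[], bool]]


def first_failed_barrier(barriers: List[Barrier]) -> Optional[str]:
    """Run barriers in order and return the name of the first one that fails.

    Returns None when every barrier passes.
    """
    for name, check in barriers:
        if not check():
            return name
    return None


# Hypothetical barriers, listed in the order the phases occur.
barriers = [
    ("architecture review", lambda: True),
    ("code review", lambda: True),
    ("performance test", lambda: False),
    ("production metrics defined", lambda: True),
]

print(first_failed_barrier(barriers))  # prints "performance test"
```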

Rollback Capabilities

You might argue that the inclusion of an effective set of barrier conditions in the development process should obviate the need to roll back major changes within the production environment. We can’t really argue with that thought or approach—technically, it is correct. However, arguing against having the capability to perform a rollback is really an argument against having an insurance policy. You might believe, for instance, that you don’t need health insurance because you are a healthy individual. But what happens when you develop a treatable cancer and don’t have sufficient funds to cover the treatment? If you are like most people, your view of whether you need this insurance changes immediately when it would become useful. The same holds true when you find yourself in a situation where fixing forward is going to take quite a bit of time and have quite an adverse impact on your clients.

Rollback Window

Rollback windows (that is, how much time must pass after a release before you are confident that you will not need to roll back the change) differ significantly by business. The question to ask yourself when determining how to establish your specific rollback window is how you will know when you have enough information about performance to determine if you need to undo your recent changes. For many companies, the bare minimum is one weekly business-day peak utilization period; only after observing it can they have great confidence in the results of their analysis. This bare minimum may be enough for modifications to existing functionality, but when new functionality is added, it may not be enough.

New functions or features often have adoption curves that take more than one day to shuffle enough traffic through that feature to determine its true impact on system performance. The amount of data gathered over time within any new feature may also have an adverse performance impact and, as a result, may negatively impact your scalability.

Another consideration when determining your rollback window deals with the frequency of your releases and the number of releases you need to be capable of rolling back. Perhaps you have a release process that involves releasing new functionality to your site several times a day. In this case, you may need to roll back more than one release if the adoption rate of any new functionality extends into the next release cycle. If this is the case, your process needs to be slightly more robust, as you are concerned about multiple changes and multiple releases, rather than just the changes from one release to the next.
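To make the relationship between release frequency and rollback depth concrete, here is a small Python sketch. It assumes a deliberately simple model that is not stated in the chapter: a release stays "in doubt" for the full rollback window, so every release shipped during that window must remain reversible. The function name and parameters are hypothetical.

```python
import math


def releases_to_keep_reversible(rollback_window_hours, hours_between_releases):
    """Estimate how many recent releases must stay reversible.

    Simple model (an assumption): every release shipped within the rollback
    window of the oldest undecided release must still be reversible. At
    least one release is always kept reversible.
    """
    if rollback_window_hours <= 0 or hours_between_releases <= 0:
        raise ValueError("both durations must be positive")
    return max(1, math.ceil(rollback_window_hours / hours_between_releases))
```

Under this model, a one-week window (168 hours) with weekly releases requires only the latest release to be reversible, whereas the same window with daily releases requires seven.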

Rollback Technology Considerations

We often hear during our discussions around the rollback insurance policy that clients in general agree that being able to roll back would be great, but doing so is technically not feasible for them. Our response is that rollback is almost always possible; it just may not be possible with your current team, processes, or architecture.

The most commonly cited reason for an inability to perform a rollback in Web-enabled platforms and back-office IT systems is database schema incompatibility. The argument usually goes as follows: For any major development effort, there may be significant changes to the schema resulting in an incompatibility with the way old and new data are stored. This modification may result in table relationships changing, candidate keys changing, table columns changing, tables being added, tables being merged, tables being disaggregated, and tables being removed.

The key to fixing these database issues is to grow your schema over time and to keep old database relationships and entities for at least as long as you would need to roll back to them should you run into significant performance issues. In the case where you need to move data to create schemas of varying normal forms, either for functionality reasons or performance reasons, consider using either data movement programs (potentially launched by a database trigger or a data movement daemon) or third-party replication technology. This data movement can cease whenever you have met or exceeded the rollback version number limit identified in your requirements. Ideally, you can turn off such data movement systems within a week or two after implementation and validation that you do not need to roll back.

Ideally, you will limit such data movement, and instead populate new tables or columns with new data, while leaving old data in its original columns and tables. In many cases, this approach is sufficient to meet your needs. In the case where you are reorganizing data, simply move the data from the new positions to the old ones for the period of time necessary to perform the rollback. If you need to change the name of a column or its meaning within an application, you must first make the change in the application, leaving the database alone; in a future release, you can then change the database. This is an example of the general rollback principle of making the change in the application in the earlier release and making the change in the database in a later release.
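The additive approach can be illustrated with a short, self-contained Python sketch using the standard-library sqlite3 module. The table, the column names, and the two "release" functions are hypothetical. Writing the new value into both the old and new columns is one possible way to keep the old positions populated during the rollback window, shown here as an assumption rather than as the only option.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Release N schema: the old application reads and writes `name`.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Release N+1: grow the schema instead of altering it in place.
# The old column stays, so the release N code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def save_user_v2(conn, display_name):
    """Release N+1 write path: populate the new column and keep the old one
    populated for as long as rollback to release N must remain possible."""
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)",
        (display_name, display_name),
    )


def read_names_v1(conn):
    """Release N read path: knows nothing about display_name."""
    return [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]


save_user_v2(conn, "Grace Hopper")

# After a rollback to release N, the old code still sees every row.
print(read_names_v1(conn))  # prints ['Ada Lovelace', 'Grace Hopper']
```

Once the rollback window has passed, a later release could stop writing the old column and eventually drop it.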

Cost Considerations of Rollback

If you’ve gotten to this point and determined that designing and implementing a rollback insurance policy has a cost, you are absolutely right! For some releases, the cost can be significant, adding as much as 10% or 20% to the cost of the release. In most cases and for most releases, we believe that you can implement an effective rollback strategy for less than 1% of the cost or time of the release. In many cases, you are really just talking about different ways to store data within a database or other storage system. Insurance isn’t free, but it exists for a reason.

Many of our clients have implemented procedures that allow them to violate the rollback architectural principle as long as several other risk mitigation steps or processes are in place. We typically suggest that the CEO or general manager of the product or service in question sign off on the risk and review the risk mitigation plan (see Chapter 16, Determining Risk) before agreeing to violating the rollback architectural principle. In the ideal scenario, the principle will be violated only with very small, very low-risk releases where the cost of being able to roll back exceeds the value of the rollback given the size and impact of the release. Unfortunately, what typically happens is that the rollback principle is violated for very large and complex releases to satisfy time-to-market constraints. The problem with this approach is that these large complex releases are often precisely the ones for which you need the rollback capability the most.

Challenge your team whenever team members indicate that the cost or difficulty to implement a rollback strategy for a particular release is too high. Often, simple solutions, such as implementing short-lived data movement scripts, may be available to help mitigate the cost and increase the possibility of implementing the rollback strategy. Sometimes, implementing markdown logic for complex features rather than seeking to ensure that the release can be rolled back can significantly mitigate the risk of a release. In our consulting practice at AKF Partners, we have seen many team members who start by saying, “We cannot possibly roll back.” After they accept the fact that it is possible, they are then able to come up with creative solutions for almost any challenge.

Markdown Functionality: Design to Be Disabled

Another of our architectural principles from Chapter 12, Establishing Architectural Principles, is designing a feature to be disabled. This concept differs from rolling back features in at least two ways.

First, if this approach is implemented properly, it is typically faster to turn a feature off than it is to replace it with the previous version or release of the system. When done well, the application may listen to a dedicated communication channel for instructions to disallow or disable certain features. Other approaches may require the restart of the application to pick up new configuration files. Either way, it is typically much faster to disable functions causing scalability problems than it is to replace the system with the previous release.

A second way that functionality disabling differs from rolling back is that it might allow all of the other functions within any given release, both modified and new, to continue to function as normal. For example, if in a dating site we had released both a “Has he dated a friend of mine?” search and another feature that allows the rating of any given date, and the search caused a problem, we would need to disable only the search feature until the problem is fixed, rather than rolling back and in effect turning off both features. This obviously gives us an advantage in releases containing multiple fixes directed toward modified and new functionality.
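A minimal Python sketch of markdown functionality for the dating-site example follows. The class, the feature names, and the stand-in feature bodies are hypothetical; in a real system the `disable` call would be driven by whatever control channel or configuration reload the application uses, which is outside the scope of this sketch.

```python
import threading


class FeatureSwitches:
    """In-memory markdown switches; unknown features default to enabled."""

    def __init__(self):
        self._lock = threading.Lock()
        self._disabled = set()

    def disable(self, feature):
        with self._lock:
            self._disabled.add(feature)

    def enable(self, feature):
        with self._lock:
            self._disabled.discard(feature)

    def is_enabled(self, feature):
        with self._lock:
            return feature not in self._disabled


switches = FeatureSwitches()


def friend_date_search(user_id):
    """Hypothetical expensive search; returns an empty result when marked down."""
    if not switches.is_enabled("friend_date_search"):
        return []
    return [f"match-for-{user_id}"]  # stand-in for the real query


def rate_date(date_id, stars):
    """Hypothetical second feature from the same release, with its own switch."""
    if not switches.is_enabled("rate_date"):
        return False
    return 1 <= stars <= 5
```

Calling `switches.disable("friend_date_search")` turns off only the search; `rate_date` keeps working because it consults its own switch.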

Designing all features to be disabled, however, can sometimes add an even more significant cost than designing to roll any given release back. The ideal case is that the cost is low for both designing to be disabled and rolling back, and the company chooses to do both for all new and modified features. Most likely, you will identify features that are high risk, using the failure mode and effects analysis process described in Chapter 16, to determine which should have markdown functionality enabled. Code reuse or a shared service that is called asynchronously may significantly reduce the cost of implementing functions that can be disabled on demand. Implementing both rollback and feature disabling helps enable Agile methods by creating an adaptive and flexible production environment rather than one relying on predictive methods such as extensive, costly, and often low-return performance testing.

If implemented properly, designing to be disabled and designing for rollbacks can actually improve your time to market by allowing you to take some risks in production that you would not take in their absence. Although not a replacement for load and performance testing, these strategies allow you to perform such testing much more quickly, confident in the knowledge that you can easily move back from implementations once released.

Conclusion

This chapter covered topics such as barrier conditions, rollback capabilities, and markdown capabilities, all of which are intended to help companies manage the risks associated with scalability incidents and recover quickly from these events if and when they occur. Barrier conditions (i.e., go/no-go processes) focus on identifying and eliminating risks to future scalability early within a development process, thereby lowering the cost of identifying the issue and eliminating the threat of it in production. Rollback capabilities allow for the immediate removal of any scalability-related threat, thereby limiting its impact on customers and shareholders. Markdown and disabling capabilities allow features impacting scalability to be disabled on a per-feature basis, removing them as threats when they cause problems. Many other mechanisms for facilitating rollback are also available, including changing DNS records or alternating pools of virtual machines with different versions of code.

Ideally, you will consider implementing all of these measures. Sometimes, on a per-release basis, the cost of implementing either rollback or markdown capabilities is exceptionally high. In these cases, we recommend a thorough review of the risks and all of the risk mitigation steps possible to help minimize the impact on your customers and shareholders. If both markdown and rollback have high costs, consider implementing at least one unless the feature is small and not complex. Should you decide to forgo implementing both markdown and rollback, ensure that you perform adequate load and performance testing and that you have all of the necessary resources available during product launch to monitor and recover from any incidents quickly.

Key Points

• Barrier conditions (go/no-go processes) exist to isolate faults early in the development life cycle.

• Barrier conditions can work with any development life cycle. They do not need to be documented extensively, although data should be collected to learn from past mistakes.

• Architecture Review Board reviews, code reviews, performance testing, and production measurements can all be considered examples of barrier conditions if the result of a failure of one of these conditions is to rework the system in question.

• Designing the capability to roll back your application helps limit the scalability impact of any given release. Consider it an insurance policy for your business, shareholders, and customers.

• Designing to disable, or mark down, features complements designing for rollback and adds the flexibility of keeping the most recent release in production while eliminating the impact of offending features or functionality.
