We have now reached the end of our journey through the Continuous Delivery Pipeline. We started with a benefit hypothesis for delivering value to our customers and turned it into features to develop in Continuous Exploration. In Continuous Integration, we developed each feature, story by story, and committed those changes to version control, where the automation of the Continuous Delivery Pipeline built and tested each change until it was ready for the production environment. In Continuous Deployment, we propagated our changes to the production environment, keeping them hidden from the general user population until we were ready to release.
Now we are ready to release our change to customers. Releasing our change on-demand involves the following four activities:
Let’s begin by looking at the release process.
Up to this point, we have our changes in the production environment, testing them to ensure functionality, security, and reliability. Now we are ready to release. We want to release our changes to the customer for the following reasons:
Even with those reasons, we may not want to introduce our release all at once. On April 23, 1985, the Coca-Cola Company announced the first major change to the formula for its flagship soft drink. New Coke had succeeded in over 200,000 blind taste tests against Pepsi, Coke’s chief competitor. However, upon release, the reaction was swift and negative. The outcry against the new formula forced Coca-Cola to reintroduce the original formula as Coca-Cola Classic after only 79 days. Since then, companies have used progressive releases before releasing to the entire market.
If we want to approach releasing incrementally and progressively, we will use the following practices:
We examined feature flags and dark launches in the previous chapter, Chapter 12, Continuous Deployment to Production. Let’s now look at the other two practices: decoupling releases by component architecture and canary releases.
In Chapter 10, Continuous Exploration and Finding New Features, we talked about how one of the key activities in looking at new features to develop was architecting the solution. A part of that is allowing for releasability to meet the organization’s business priorities.
One way of achieving this releasability is to architect your product or solution into major decoupled components. These components can have their own separate release cadences.
In Chapter 2, Culture of Shared Responsibility, we first introduced the idea of operational and development value streams. We discussed how development value streams design, develop, test, release, and maintain a product or solution, and we identified several development value streams whose solutions the operational value stream of our video streaming service relied on.
Turning back to this example, let’s examine one of those development value streams, the one that maintains the mobile application. This value stream has several components, each with a different release cadence, as shown in the following diagram.
Figure 13.1 – Decoupled release schedule for mobile application value stream
In our mobile application value stream example, we release security updates as fixes to vulnerabilities appear, after moving them through the Continuous Delivery Pipeline.
One component is the interface and logic seen on the mobile devices themselves, otherwise known as the frontend. The development here can be released on a quick cadence, effectively at the end of every sprint.
The other component deals with the logic and processing found on the streaming service’s data centers or cloud, known as the backend. In this example, releases for that component occur every month.
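The decoupled cadences described above can be sketched as a simple configuration. This is a minimal illustration only; the component names and cadence values are hypothetical, not taken from a real system.

```python
# Hypothetical cadence map for the mobile application value stream's
# decoupled components. Each component ships on its own schedule.
RELEASE_CADENCES = {
    "security-updates": "on demand",    # fixes released as vulnerabilities appear
    "mobile-frontend": "every sprint",  # quick cadence, end of each iteration
    "backend-services": "monthly",      # slower cadence for data-center logic
}

def cadence_for(component: str) -> str:
    """Look up how often a decoupled component is released."""
    return RELEASE_CADENCES[component]
```

Because each component's cadence is independent, the frontend team can ship every sprint without waiting on the monthly backend release.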
The term canary release comes from the mining practice of carrying a canary into the coal mine. The canary would act as a warning of the presence of toxic gases: because of its small size, it would succumb to toxic gas before the miners did, and if it died, the miners knew to evacuate immediately.
In modern product development, a canary release is the release of a product or new features to a small, select group of customers to gather their feedback before releasing to the entire user population.
To set up a canary release, feature flags are used again, this time to control which users have visibility into the change in the production environment. This feature flag configuration is shown in the following diagram.
Figure 13.2 – Canary release configuration using feature flags
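One common way to implement this kind of routing is percentage-based bucketing: hashing a stable user ID so that each user lands deterministically inside or outside the canary group. The following is a minimal sketch under that assumption; the function name and rollout values are illustrative, not from any specific feature-flag product.

```python
import hashlib

def in_canary(user_id: str, feature: str, rollout_percent: float) -> bool:
    """Deterministically bucket a user for a canary rollout.

    Hashing the user ID together with the feature name gives each user a
    stable bucket in [0, 100), so the same user always sees the same variant
    across sessions.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10000) / 100.0  # 0.00 .. 99.99
    return bucket < rollout_percent

# Route roughly 5% of users to the new change; everyone else keeps the old path.
if in_canary("user-42", "new-recommendations", 5.0):
    pass  # serve the new feature
else:
    pass  # serve the existing behavior
```

Because the bucketing is deterministic, widening the rollout from 5% to 20% only adds users; no one who already saw the change is switched back.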
Another possible way to perform a canary release is in distributed production environments. If the production environments are in different geographic regions, the changes are released in one production environment to try on one set of users, while the other production environments remain on their versions. If all goes well, eventually the environments in the remaining regions are upgraded.
Canary releases also allow for A/B testing: the A group that receives the change can be measured against the B (control) group to see whether the new change produces a desired change in user behavior. Running canary releases as experiments does require the ability to measure user and system behavior as part of full-stack telemetry.
There may be situations where a canary release should not be done, including the following:
As we progress in the release from our initial canary users to the entire user population, we want to be able to ensure that our production environment remains resilient. This may require us to stabilize our solution and ensure proper operation. We will examine the steps needed for this in our next section.
Our goal is to ensure that our production environment remains stable, is resilient enough to handle the new changes, and continues to deliver value sustainably. To maintain this, we apply the following practices:
We have previously looked at testing and monitoring NFRs in Chapter 12, Continuous Deployment to Production. Let’s examine the remaining practices.
We first learned about Site Reliability Engineering (SRE) in Chapter 6, Recovering from Production Failures. In that chapter, we saw the following four practices that site reliability engineers use to maintain the production environment when high availability is required for large-scale systems:
In Chapter 6, we saw that if availability SLOs were at four-nines (99.99% availability or higher), the monthly allowable downtime would be 4 minutes, 23 seconds. To maintain that availability, SREs use the previously mentioned practices to ensure reliability and have standard incident management policies defined and rehearsed to minimize downtime when problems do occur.
Other principles and practices adopted by Google can give us a better picture of the discipline of SRE and how it contributes to the DevOps approach. To understand these additional principles and practices, it may be necessary to look at the origins of SRE at Google.
SRE was started at Google in 2004 by Ben Treynor Sloss. His original aim was to rethink how operations were performed by system administrators, approaching problems found in operations from a software development perspective. From that perspective, the following principles emerged in addition to the preceding practices:
The people Sloss enlisted for his initial SRE team would spend half of their time in development and the other half in operations to follow changes they developed from beginning to end. This allowed them to develop skills necessary for operations as well as maintain their development expertise.
Since 2004, the number of practices that have been pulled into the discipline of SRE has expanded as technology has evolved. Nevertheless, many site reliability engineers have kept to the principles previously outlined. Adopting these principles and practices may provide tangible benefits when reliability is a key nonfunctional requirement.
Murphy’s Law famously states that “anything that can go wrong, will.” In that sense, it’s not if a disaster will strike your system, but when. We’ve looked at ways of preventing disaster and using chaos engineering to simulate disasters, but are there other ways of preparing for disaster?
Disaster recovery focuses on ensuring that the technical aspects that are important to a business are restored as quickly as possible should a natural or manmade disaster strike. Disaster recovery incorporates the following elements to prepare for the worst:
The disaster recovery plan will look at the following two measurements as goals to determine the overall strategy:
Let’s explain these objectives using an example shown in the following diagram.
Figure 13.3 – Illustration of RPO and RTO
In the preceding diagram, we take a backup of our resource every four hours. We have practiced disaster recovery on this resource and have reliably restored operations in 30 minutes. When disaster strikes, if it is something familiar that has been rehearsed, our Time to Restore Service will match our RTO of 30 minutes.
For the RPO, we need to see the point of the last backup. In our example, the time difference between the last backup and the disaster could be as large as four hours. The higher the frequency of backups, the lower the RPO.
The mechanisms for disaster recovery can take many forms. Organizations may opt to use one or a combination of the following methods:
In previous stages of our Continuous Delivery Pipeline, our focus on security was prevention. We wanted to ensure that the changes we designed and developed did not introduce security vulnerabilities. So, the automated security testing we performed on the pipeline examined the code changes.
With the code released, we shift our focus from prevention to detecting threats that come from malicious actors. We look for breaches or attacks using currently unknown vulnerabilities.
The National Institute of Standards and Technology (NIST) refers to Continuous Security Monitoring (CSM) as Information Security Continuous Monitoring (ISCM). In Special Publication 800-137, published in September 2011 (https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-137.pdf), they describe it as follows:
Information security continuous monitoring (ISCM) is defined as maintaining ongoing awareness of information security, vulnerabilities, and threats to support organizational risk management decisions.
The publication further defines a process for implementing CSM that incorporates the following steps:
The assets that need to be monitored include not only ones directly maintained by the organization but may also extend to third parties and vendors. Monitoring tools may look at the following assets:
Once identified, the assets should be examined for possible threats and vulnerabilities. A sampling of common threats and vulnerabilities are the following:
Identification of these attacks may be part of an assessment created after automated monitoring. The mitigation steps and actions will outline who in the organization performs the remediation and coordinates the response.
The support activities performed in this stage of the Continuous Delivery Pipeline will have a profound effect on the architecture of the system and may even drive the future direction of the product or solution. These things may be part of the architectural decisions the system architect makes when looking at new capabilities in the Continuous Exploration stage of the Continuous Delivery Pipeline.
Decisions found in this stage that may be taken to Continuous Exploration include the following items:
The system architect thus becomes the balance point of intentional architecture, overseeing the desired architecture of the product or solution, and of emergent design, where other factors such as the environment play a role in requiring changes to the architecture.
The learning at this stage is not limited to the architectural aspects. We also have to evaluate from a product performance standpoint whether the benefit hypothesis was fulfilled by our development. For this, we need to measure our solution’s value. Let’s look at this in our next section.
Throughout the design and development journey through the Continuous Delivery Pipeline, we have subjected our changes to an array of testing. We will now look at the final test to answer the question: is our development effort bringing value to the customers to the point that this benefits both the customer and the organization?
To aid us in answering this question, we will look at the following activities:
We will first revisit innovation accounting and its source: the Lean Startup Cycle. Based on that knowledge, we will see how leading and lagging indicators prove or disprove the benefit hypothesis we created in Continuous Exploration.
We first saw innovation accounting in Chapter 5, Measuring the Process and Solution. There, drawing on The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses by Eric Ries, we saw that measures are important for gauging whether the benefit hypothesis has been proven.
Ries expands upon innovation accounting in his follow-up book, The Startup Way: How Modern Companies Use Entrepreneurial Management to Transform Culture and Drive Long-Term Growth. In his book, he gives the following definition for innovation accounting:
Innovation Accounting (IA) is a way of evaluating progress when all the metrics typically used in an established company (revenue, customers, ROI, market share) are effectively zero.
He proposes three levels of innovation accounting, each with a different set of metrics to collect. Let’s look at these levels now.
These metrics serve as a starting point. Ries advises setting up a dashboard. On this initial dashboard are customer-facing metrics the development teams think are important, based on per-customer input. Per-customer metrics for learning include the following:
Metrics like these reinforce the idea that development efforts affect outcomes the teams can observe. Keeping these metrics visible helps keep the development process aligned with customer feedback.
We proceed deeper with a different set of metrics in Level 2. Here, we try to quantify what Ries calls “leap-of-faith assumptions,” which come in the following two categories:
These assumptions are necessary for any new development; otherwise, nothing new would be developed. They are tested through the development of the MVP and other validated learning exercises.
The following metrics illustrate the assumptions and categories of assumptions:
Value assumptions look at customer behavior: they are validated by positive user behavior. Growth assumptions look for sustainable growth.
At this level, we look at performance over a long period of time. Changes come as you acquire new data and re-evaluate, comparing present data against what was forecast. You look at the long-term drivers of the product’s future performance.
The following metrics may provide guidance over the long term:
These metrics often involve more than the development teams. Finance may also be involved with the goal of shifting the focus at this point to the financial performance of the product.
Now that we’ve seen the steps taken with innovation accounting, let’s see how they play against the benefit hypothesis we created in Continuous Exploration.
In our journey through the Continuous Delivery Pipeline, we created our benefit hypothesis in Continuous Exploration. To measure the validity of our hypothesis, we may be incorporating Level 2 metrics in our dashboard. Our dashboard is acquiring the metrics automatically through the full-stack telemetry that we have designed into our system in our test, staging, and production environments.
We may get some indications from testing done at Continuous Deployment, but the real measurements will come from Release on Demand, first during any A/B testing or canary releases, but also upon general release. We want to see all correlations between the performance and our initial benefit hypothesis.
The data we collect and analyze now in this step may not be just to improve the product, but also to improve our value stream and our development process. Let’s explore our next section to see how both are accomplished.
Based on our learning, we now need to figure out the best next step, both from the perspective of our product and from the perspective of our value stream.
For our product, it’s a matter of determining the best direction: whether it’s time to pivot (change direction in our overall product strategy) or persevere (continue in the same direction).
For our value stream, this time is used to reflect on how to improve. What lessons are there that we can learn from to improve?
The following practices are used to determine future product direction as well as future direction of our value stream:
We perform this learning so we can begin again with renewed focus at the start of the Continuous Delivery Pipeline in setting up ideas to execute. We improve our value streams to improve the performance of our Continuous Delivery Pipeline.
Let’s look at these practices that allow us to improve.
In Chapter 10, Continuous Exploration and Finding New Features, we saw how Eric Ries’s Build-Measure-Learn cycle is applied in the SAFe® Lean Startup cycle, where Epics are created with a benefit hypothesis. The Epics are implemented with a Minimum Viable Product (MVP), which acts as an experiment to prove or disprove the benefit hypothesis. The MVP is evaluated through innovation accounting and the tracking of leading indicators, which are used to make the pivot-or-persevere decision.
We are now at the point of that decision. For our ART, the MVP may be a few features created at the beginning of the Continuous Delivery Pipeline. We have developed these features, tested them, and deployed them into the production environment through the Continuous Integration and Continuous Deployment stages of our pipeline. Now, in Release on Demand, we show our MVP as features to the user population to see whether the benefit hypothesis is proven. Based on the innovation accounting metrics we are collecting through our full-stack telemetry, we come to one of the following two decisions:
Note that even after a persevere decision for the MVP, our features will still be evaluated to confirm that they deliver value and that the product direction still resonates with our customers.
We first saw relentless improvement when we examined the SAFe House of Lean in Chapter 2, Culture of Shared Responsibility. We mentioned that in relentless improvement, we look for opportunities to be better because of that “hidden sense of danger.”
Throughout development, the teams and the ART have looked for such opportunities to improve the flow of value. The teams have been regularly holding retrospectives at the end of each iteration to identify problems at the team level, and the ART holds the Inspect and Adapt (I&A) event at the end of the Program Increment to look at systemic issues.
Other improvements may come to the Continuous Delivery Pipeline itself. Newer tools, additional testing, and continued maintenance allow the ART to maintain or improve how they create the experiments to validate benefit hypotheses.
Another important part of relentless improvement comes from Value Stream Management and continuous learning, ideas originally discussed in Chapter 9, Moving to the Future with Continuous Learning.
One activity we performed in our value stream mapping was not only mapping the value stream as it currently stands but also identifying an ideal future-state value stream. Improvement actions can come from small, iterative changes toward that ideal value stream.
One other step for improvement is to hold a value stream mapping session at least once a year to evaluate the value stream in its current iteration. This allows the ART to view the present bottlenecks impeding the flow of value. During this value stream mapping session, a new future-state value stream can be identified for new improvements.
In this chapter, we have reached the last stage of the Continuous Delivery Pipeline: Release on Demand. After using feature flags to allow testers to view new changes in the development environment, we can use them to incrementally release those changes to a small population of users in a canary release. We may also want to set up our architecture to allow for each component to have different release cadences. After release, we want to ensure that the changes don’t disrupt the environment and that our overall solution is stable. To do that, we adhere to the principles and primary practices of SRE, including preparing for disaster recovery.
With the changes released in a stable environment, it’s time to measure business results through full-stack telemetry. We selected our measures during Continuous Exploration, relying on innovation accounting principles. Looking at these measures on dashboards visible to everyone, we can try to determine whether our benefit hypothesis that we created in Continuous Exploration is valid.
Based on whether the benefit hypothesis is valid, we must decide to pivot (change direction) or persevere (continue developing in the same product direction). After making that decision, the ART looks for other improvement opportunities through retrospectives, I&A, and regularly mapping its value stream.
This brings Part 3 to a close. In the last chapter of the book, we will look at emerging trends as well as some tips and tricks for success in your DevOps adoption.