13

Releasing on Demand to Realize Value

We have now reached the end of our journey through the Continuous Delivery Pipeline. We started with a benefit hypothesis to deliver value to our customers and turned it into features to develop in Continuous Exploration. In Continuous Integration, we developed each feature, story by story, and committed those changes to version control, where the automation of the Continuous Delivery Pipeline built and tested each change until it was ready for the production environment. In Continuous Deployment, we propagated our changes to the production environment, keeping them hidden from the general user population until we were ready to release.

Now we are ready to release our change to customers. Releasing our change on demand involves the following four activities:

  • Releasing value to customers
  • Stabilizing and operating the solution
  • Measuring the value
  • Learning from the outcomes

Let’s begin by looking at the release process.

Releasing value to customers

Up to this point, our changes have been deployed to the production environment, where we have tested them to verify functionality, security, and reliability. Now we are ready to release. We release our changes to customers when the following conditions hold:

  • The organization believes the timing is right for customers to take advantage of the change and that market demand is high
  • We have confidence that the change will not have a negative impact on the production environment

Even with those reasons, we may not want to introduce our release all at once. On April 23, 1985, the Coca-Cola Company announced the first major change to the formula for its flagship soft drink. New Coke had succeeded in over 200,000 blind taste tests against Pepsi, Coke’s chief competitor. However, upon release, the reaction was swift and negative. The outcry against the new formula forced Coca-Cola to reintroduce the original formula as Coca-Cola Classic after only 79 days. Since that time, companies have used progressive releases before releasing to the entire market.

If we want to approach releasing incrementally and progressively, we will use the following practices:

  • Feature flags
  • Dark launches
  • Decoupling releases by component architecture
  • Canary releases

We examined feature flags and dark launches in the previous chapter, Chapter 12, Continuous Deployment to Production. Let’s take a look at the remaining practices: decoupling releases by component architecture and canary releases.

Decoupling releases by component architecture

In Chapter 10, Continuous Exploration and Finding New Features, we talked about how one of the key activities when looking at new features to develop was architecting the solution. Part of that work is architecting for releasability so that releases can align with the organization’s business priorities.

One way of achieving this releasability is to architect your product or solution into major decoupled components. These components can have their own separate release cadences.

In Chapter 2, Culture of Shared Responsibility, we first introduced the idea of operational and development value streams. We discussed how development value streams design, develop, test, release, and maintain a product or solution, and we identified several development value streams whose solutions the operational value stream of our video streaming service relied on.

Turning back to this example, let’s examine one of those development value streams: the one that maintains the mobile application. This value stream has several components, each with a different release cadence, as shown in the following diagram.

Figure 13.1 – Decoupled release schedule for mobile application value stream

In our mobile application value stream example, we release security updates as fixes to vulnerabilities appear, after moving them through the Continuous Delivery Pipeline.

One component is the interface and logic seen on the mobile devices themselves, otherwise known as the frontend. The development here can be released on a quick cadence, effectively at the end of every sprint.

The other component deals with the logic and processing found on the streaming service’s data centers or cloud, known as the backend. In this example, releases for that component occur every month.

Canary releases

The term canary release comes from the mining practice of carrying a canary into the coal mine. The canary acted as a warning of the presence of toxic gases: because of its small size and faster metabolism, it would succumb to toxic gases before the miners, so if it died, the miners knew to evacuate immediately.

In terms of modern product development, a canary release is the release of a product or new features to a small, select group of customers to gather feedback before releasing to the entire user population.

To set up a canary release, feature flags are used again, this time to control which users can see the change in the production environment. This feature flag configuration is shown in the following diagram.

Figure 13.2 – Canary release configuration using feature flags
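
The mechanics of this routing can be sketched in a few lines. The following minimal Python example, not tied to any specific feature flag product, is a hypothetical illustration: it places a configurable percentage of users into the canary group by hashing their user IDs, so each user consistently sees the same version on every request.

    # Minimal sketch: percentage-based canary routing behind a feature flag.
    import hashlib

    CANARY_PERCENTAGE = 5  # start by exposing the change to 5% of users

    def in_canary_group(user_id: str, percentage: int = CANARY_PERCENTAGE) -> bool:
        """Hash the user ID so the same user lands in the same bucket every time."""
        digest = hashlib.sha256(user_id.encode()).hexdigest()
        return int(digest, 16) % 100 < percentage

    def render_feature(user_id: str) -> str:
        if in_canary_group(user_id):
            return "new experience"      # change visible behind the feature flag
        return "current experience"      # everyone else stays on the existing version

Raising CANARY_PERCENTAGE progressively widens the release, while setting it to zero effectively rolls the change back without a redeployment.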

Another possible way to perform a canary release is in distributed production environments. If the production environments are in different geographic regions, the changes are released in one production environment for one set of users to try, while the other production environments remain on the current version. If all goes well, the environments in the remaining regions are eventually upgraded.

Canary releases offer the advantage of allowing A/B testing, where the A group that receives the change is measured against the B (control) group to see whether the new change produces the desired change in user behavior. Running canary releases as experiments does require the ability to measure user and system behavior as part of full-stack telemetry.
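
As a hedged sketch of what such a measurement might look like, the following example with made-up telemetry counts compares conversion rates between the canary (A) group and the control (B) group using a simple two-proportion z-test built from the standard library:

    # Minimal sketch: compare conversion between canary (A) and control (B) groups.
    import math

    a_users, a_conversions = 1_000, 120   # canary group (hypothetical counts)
    b_users, b_conversions = 1_000, 100   # control group (hypothetical counts)

    p_a = a_conversions / a_users
    p_b = b_conversions / b_users
    p_pool = (a_conversions + b_conversions) / (a_users + b_users)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / a_users + 1 / b_users))
    z_score = (p_a - p_b) / std_err

    print(f"A: {p_a:.1%}, B: {p_b:.1%}, z = {z_score:.2f}")

A z-score above roughly 1.96 would suggest the difference is unlikely to be due to chance at the usual 5% significance level; with these made-up numbers, the result is suggestive but not yet conclusive.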

There may be situations where canary releases should not be done, including the following:

  • If the solution is part of a mission-critical, medical, or safety system where there is low tolerance for failure
  • If the end users will react negatively to being guinea pigs or treated as beta testers
  • If the changes require backend changes, such as database schema updates, that are not compatible with the current production version

As we progress in the release from our initial canary users to the entire user population, we want to be able to ensure that our production environment remains resilient. This may require us to stabilize our solution and ensure proper operation. We will examine the steps needed for this in our next section.

Stabilizing and operating the solution

Our goal is to ensure that our production environment remains stable and resilient enough to handle the new changes, and that we continue to deliver value sustainably. To accomplish this, we apply the following practices:

  • Site reliability engineering
  • Failover and disaster recovery
  • Continuous Security Monitoring
  • Architecting for operations
  • Monitoring NFRs

We have previously looked at testing and monitoring NFRs in Chapter 12, Continuous Deployment to Production. Let’s examine the remaining practices.

Site reliability engineering

We first learned about Site Reliability Engineering (SRE) in Chapter 6, Recovering from Production Failures. In that chapter, we saw the following four practices that site reliability engineers use to maintain the production environment when high availability is required for large-scale systems:

  • Formulating an error budget using Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
  • Creating standards for release through release engineering
  • Collaborating on product launches with launch coordination engineering
  • Practicing recovery with chaos engineering and incident management procedures

In Chapter 6, we saw that if the availability SLO was at four nines (99.99% availability or higher), the monthly allowable downtime would be 4 minutes, 23 seconds. To maintain that availability, SREs use the previously mentioned practices to ensure reliability and have standard incident management policies defined and rehearsed to minimize downtime when problems do occur.
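
As a concrete illustration of the arithmetic behind an error budget, the following minimal Python sketch, which is not from the chapter, converts an availability SLO into a monthly downtime allowance; with a 99.99% SLO it yields roughly the 4 minutes, 23 seconds quoted above.

    # Minimal sketch: convert an availability SLO into a monthly error budget.
    def monthly_error_budget_seconds(slo: float, days_in_month: float = 30.44) -> float:
        """Return the allowable downtime in seconds for an average month."""
        total_seconds = days_in_month * 24 * 60 * 60
        return total_seconds * (1 - slo)

    budget = monthly_error_budget_seconds(0.9999)
    print(f"{budget // 60:.0f} minutes, {budget % 60:.0f} seconds")  # 4 minutes, 23 seconds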

Other principles and practices adopted by Google can give us a better picture of the discipline of SRE and how it contributes to the DevOps approach. To understand these additional principles and practices, it may be necessary to look at the origins of SRE at Google.

SRE was started at Google in 2004 by Ben Treynor Sloss. His original aim was to rethink how operations were performed by system administrators; he wanted to approach problems found in operations from a software development perspective. From that perspective, the following principles emerged in addition to the preceding practices:

  • Eliminating toil, that is, finding the repetitive tasks and seeing whether they could be eliminated
  • Increased use of automation as a way of cost-effectively eliminating toil
  • Monitoring every aspect of the production environment, which leads to observability

The people Sloss enlisted for his initial SRE team would spend half of their time in development and the other half in operations to follow changes they developed from beginning to end. This allowed them to develop skills necessary for operations as well as maintain their development expertise.

Since 2004, the number of practices that have been pulled into the discipline of SRE has expanded as technology has evolved. Nevertheless, many site reliability engineers have kept to the principles previously outlined. Adopting these principles and practices may provide tangible benefits when reliability is a key nonfunctional requirement.

Failover and disaster recovery

Murphy’s Law famously states that “anything that can go wrong, will.” In that sense, it’s not if a disaster will strike your system, but when. We’ve looked at ways of preventing disaster and using chaos engineering to simulate disasters, but are there other ways of preparing for disaster?

Disaster recovery focuses on ensuring that the technical aspects that are important to a business are restored as quickly as possible should a natural or manmade disaster strike. Disaster recovery incorporates the following elements to prepare for the worst:

  • Disaster recovery team: A group of individuals who are responsible for creating, implementing, and managing the disaster recovery plan. The disaster recovery plan outlines the responsibilities the team follows in an emergency, including communication with other employees and customers.
  • Risk evaluation: The disaster recovery team should identify the possible scenarios and the appropriate responses for each. For example, if a cyberattack happened, what would the steps be in the disaster recovery plan?
  • Asset identification and evaluation: The disaster recovery team should identify all systems, applications, data, and other resources. Part of this identification includes how important they are for business continuity as well as instructions for restoring them.
  • Resource backups: The disaster recovery plan should identify what resources should be backed up, the frequency of backup, where those backups are stored, and for how long the backups are kept.
  • Dress rehearsals: All parts of the disaster recovery plan should be practiced regularly. Restores of backups should be attempted to find flaws in the backup process and to determine whether the backups are sound. Any flaws found while rehearsing should be fixed to improve the disaster recovery plan. Rehearsals should examine evolving threats to see whether new measures should be added to the disaster recovery plan.

The disaster recovery plan will look at the following two measurements as goals to determine the overall strategy:

  • Recovery Point Objective (RPO) is a measure of how much data may be lost, expressed as the time between the last backup and the moment of the disaster
  • Recovery Time Objective (RTO) is a measure of the allowable downtime after a disaster

Let’s explain these objectives using an example shown in the following diagram.

Figure 13.3 – Illustration of RPO and RTO

In the preceding diagram, we take a backup of our resource every four hours. We have practiced performing disaster recovery on our resource and have reliably restored operations in 30 minutes. When disaster strikes, if it is something familiar that has been rehearsed, our time to restore service will match our RTO of 30 minutes.

For the RPO, we need to look at the point of the last backup. In our example, the time difference between the last backup and the disaster could be as large as four hours. The higher the frequency of backups, the lower the RPO.
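
To make the arithmetic explicit, here is a small, hypothetical sketch that derives the worst-case RPO from a backup interval and checks an observed restore time against the RTO; the interval and times mirror the example above but are otherwise made up.

    # Minimal sketch: worst-case RPO from a backup interval, plus an RTO check.
    from datetime import timedelta

    backup_interval = timedelta(hours=4)      # backups every four hours, as in Figure 13.3
    rto = timedelta(minutes=30)               # rehearsed restore target
    observed_restore = timedelta(minutes=30)  # time actually taken to restore service

    worst_case_rpo = backup_interval          # data written since the last backup may be lost
    print(f"Worst-case RPO: {worst_case_rpo}")        # 4:00:00
    print(f"RTO met: {observed_restore <= rto}")      # True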

The mechanisms for disaster recovery can take many forms. Organizations may opt to use one or a combination of the following methods:

  • Backups: Backups are the simplest form of disaster recovery. Note that while backups keep the data safe, they do nothing for the infrastructure.
  • Cold site: A redundant second production environment. This allows for business continuity, but no data is kept in sync, so data must be restored separately. A blue/green deployment is an example of a cold site.
  • Hot site: A redundant second production environment that has its data regularly synchronized with the active production environment.
  • Disaster Recovery as a Service (DRaaS): A vendor moves an organization’s processing capability from the organization to its own (often cloud-based) infrastructure.
  • Backup as a Service (BaaS): Backups are taken by a third-party provider and stored off-site or in a cloud infrastructure.
  • Data center disaster recovery: These are devices on the organization’s premises used to deal with disasters such as fire or power loss. Examples of these include backup power generators or fire suppression equipment.

Continuous Security Monitoring

In previous stages of our Continuous Delivery Pipeline, our focus on security was prevention. We wanted to ensure that the changes we designed and developed did not introduce security vulnerabilities, so the automated security testing we performed on the pipeline examined the code changes.

With the code released, we shift our focus from prevention to detecting threats from malicious actors. We look for breaches or attacks that exploit currently unknown vulnerabilities.

The National Institute of Standards and Technology (NIST) refers to Continuous Security Monitoring (CSM) as Information Security Continuous Monitoring. In Special Publication 800-137, published in September 2011 (https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-137.pdf), they describe Information Security Continuous Monitoring as the following:

Information security continuous monitoring (ISCM) is defined as maintaining ongoing awareness of information security, vulnerabilities, and threats to support organizational risk management decisions.

The publication further defines a process to implement CSM that incorporates the following steps:

  1. Defining a strategy by looking at risk tolerance including visibility into assets, vulnerability awareness, current threat information, and impacts on the mission or business.
  2. Establishing a program including definitions of metrics, status monitoring frequency, and the technical architecture.
  3. Implementing the program and collecting security-related information for metrics, assessments, and reporting, automating the collection, analysis, and reporting as much as possible.
  4. Analyzing the findings and reporting on them.
  5. Responding to the findings.
  6. Reviewing and updating the monitoring program.

The assets that need to be monitored include not only those directly maintained by the organization but also those belonging to third parties and vendors. Monitoring tools may look at the following assets:

  • Known assets: Assets that are part of an organization’s inventory
  • Unknown assets: Forgotten assets including development websites or old marketing sites
  • Rogue assets: Assets created by malicious actors that may impersonate the organization’s domain
  • Vendor assets: Assets owned by third-party vendors

Once identified, the assets should be examined for possible threats and vulnerabilities. A sampling of common threats and vulnerabilities includes the following:

  • Unnecessary open TCP/UDP ports: Any open port may pose a problem if the service communicating on that port is misconfigured or unpatched, potentially exposing a vulnerability.
  • Man-in-the-middle attacks: A cyberattack where the attacker positions themselves between two communicating parties. The parties believe they have a direct connection, but the attacker can listen in and even alter messages before passing them on to the other party.
  • Poor email security: This may leave your organization open to email spoofing.
  • Domain hijacking: An attacker changes the registration of an organization’s domain name without permission of the domain’s owner.
  • Cross-site scripting (XSS) vulnerabilities: Attackers can inject client-side scripts into web pages viewed by other users, potentially bypassing access controls.
  • Leaked credentials: Discovered through data breaches, these give attackers access to an organization’s systems.
  • Data leaks: Exposure of private or sensitive data.
  • Typosquatted domains: This is a form of cybersquatting where attackers claim a domain name similar to a known organization’s domain name in hopes that someone will incorrectly type a URL and enter the attacker’s site.

Identification of these attacks may be part of an assessment created after automated monitoring. The mitigation plan will outline who in the organization will perform the remediation steps and coordinate the response.
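
As one small example of the kind of automated check that can feed such an assessment, the following hedged Python sketch uses only the standard socket module to flag unexpectedly open TCP ports on a host; the host name and the allowed-port list are hypothetical.

    # Minimal sketch: flag open TCP ports that are not on an approved list.
    import socket

    HOST = "app.example.com"          # hypothetical host to audit
    ALLOWED_PORTS = {443}             # ports we expect to be open
    PORTS_TO_CHECK = [22, 80, 443, 3306, 8080]

    for port in PORTS_TO_CHECK:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(1.0)
            is_open = sock.connect_ex((HOST, port)) == 0   # 0 means the connection succeeded
        if is_open and port not in ALLOWED_PORTS:
            print(f"Unexpected open port {port} on {HOST} - investigate")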

Architecting for operations

The support activities performed in this stage of the Continuous Delivery Pipeline will have a profound effect on the architecture of the system and may even drive the future direction of the product or solution. These activities may inform the architectural decisions the system architect makes when looking at new capabilities in the Continuous Exploration stage of the Continuous Delivery Pipeline.

Findings from this stage that may be carried back to Continuous Exploration include the following:

  • Fixes and new automation created by the site reliability engineers
  • Changes resulting from flaws discovered in the disaster recovery plan that affect the configuration of the test, staging, or production environments
  • New vulnerabilities discovered by CSM that the product’s architecture must prevent from recurring

The system architect thus becomes the balance point between intentional architecture, which describes the desired architecture of the product or solution, and emergent design, where other factors, such as the operating environment, drive changes to the architecture.

The learning at this stage is not limited to the architectural aspects. We also have to evaluate from a product performance standpoint whether the benefit hypothesis was fulfilled by our development. For this, we need to measure our solution’s value. Let’s look at this in our next section.

Measuring the value

Throughout the design and development journey through the Continuous Delivery Pipeline, we have subjected our changes to an array of testing. We will now look at the final test, answering the question: is our development effort bringing enough value to customers that both the customer and the organization benefit?

To aid us in answering this question, we will look at the following activities:

  • Innovation accounting
  • Proving/disproving the benefit hypothesis

We will first revisit innovation accounting and its source: the Lean Startup Cycle. Based on that knowledge, we will see how leading and lagging indicators prove or disprove the benefit hypothesis we created in Continuous Exploration.

Innovation accounting

We first saw innovation accounting in Chapter 5, Measuring the Process and Solution. In that chapter, drawing on The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses by Eric Ries, we saw that measures are important for gauging whether the benefit hypothesis has been proven.

Ries expands upon innovation accounting in his follow-up book, The Startup Way: How Modern Companies Use Entrepreneurial Management to Transform Culture and Drive Long-Term Growth. In his book, he gives the following definition for innovation accounting:

Innovation Accounting (IA) is a way of evaluating progress when all the metrics typically used in an established company (revenue, customers, ROI, market share) are effectively zero.

He proposes three levels of innovation accounting, each with a different set of metrics to collect. Let’s look at these levels now.

Level 1: Dashboard metrics

These metrics serve as a starting point. Ries advises setting up a dashboard. On this initial dashboard are customer-facing metrics the development teams think are important. These metrics are based on per-customer data and include the following:

  • Conversion rates (usually the percentage of customers who move from a free version to a paid version of a product)
  • Revenue per customer
  • Lifetime value per customer
  • Retention rate
  • Cost per customer
  • Referral rate
  • Channel adoption

Metrics like these reinforce the idea that development teams can affect outcomes, as they observe the results of their efforts. Keeping these metrics visible helps keep the development process aligned with customer feedback.
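
To make this concrete, here is a minimal, hypothetical sketch, not taken from Ries’s book, that computes two of these per-customer metrics, conversion rate and retention rate, from a handful of made-up customer records.

    # Minimal sketch: conversion and retention rates from per-customer records.
    customers = [
        {"id": 1, "paid": True,  "active_this_month": True},
        {"id": 2, "paid": False, "active_this_month": True},
        {"id": 3, "paid": True,  "active_this_month": False},
        {"id": 4, "paid": False, "active_this_month": False},
    ]

    total = len(customers)
    conversion_rate = sum(c["paid"] for c in customers) / total
    retention_rate = sum(c["active_this_month"] for c in customers) / total

    print(f"Conversion rate: {conversion_rate:.0%}")  # 50%
    print(f"Retention rate: {retention_rate:.0%}")    # 50%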

Level 2: Business case metrics

We proceed deeper with a different set of metrics at Level 2. Here, we try to quantify the “leap of faith assumptions,” as Ries calls them. Leap of faith assumptions come in the following two categories:

  • Value assumptions: These describe the value that users receive from the product or solution
  • Growth assumptions: These describe how new users will find the product

These types of assumptions are needed for any new development; otherwise, nothing new would be developed. The assumptions are tested through the development of the MVP and other validated learning exercises.

The following metrics illustrate these categories of assumptions:

  • Retention rate (value assumption)
  • Referral rates (value assumption)
  • Word-of-mouth referrals (growth assumption)

Value assumptions look at customer behavior and are validated by positive user behavior. Growth assumptions look for signs of sustainable growth.

Level 3: Net present value

At this level, we look at performance over a long period of time. Adjustments come as we acquire new data and re-evaluate or compare the present data against what was forecast. We look at the long-term drivers of the product’s future performance.

The following metrics may provide guidance over the long term:

  • Number of website users
  • Percentage of visitors that become users
  • Percentage of paid users
  • Average price paid by a user

These metrics often involve more than the development teams. Finance may also be involved, as the focus at this point shifts to the financial performance of the product.
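
As a hedged illustration of how these long-term metrics can be rolled up, the following sketch with made-up figures computes a net present value from forecast yearly cash flows derived from paying-user counts and the average price paid.

    # Minimal sketch: net present value of forecast cash flows (made-up figures).
    def net_present_value(cash_flows, discount_rate):
        """cash_flows[t] is the cash flow received at the end of year t + 1."""
        return sum(cf / (1 + discount_rate) ** (t + 1) for t, cf in enumerate(cash_flows))

    paying_users = [10_000, 15_000, 20_000]    # forecast paid users per year
    average_price = 24.0                       # average yearly revenue per paying user
    cash_flows = [users * average_price for users in paying_users]

    print(f"NPV: ${net_present_value(cash_flows, discount_rate=0.10):,.0f}")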

Proving/disproving the benefit hypothesis

Now that we’ve seen the steps taken with innovation accounting, let’s see how they play against the benefit hypothesis we created in Continuous Exploration.

In our journey through the Continuous Delivery Pipeline, we created our benefit hypothesis in Continuous Exploration. To measure the validity of the hypothesis, we may incorporate Level 2 metrics into our dashboard. The dashboard acquires these metrics automatically through the full-stack telemetry we have designed into our system across the test, staging, and production environments.

We may get some indications from testing done during Continuous Deployment, but the real measurements come from Release on Demand, first during any A/B testing or canary releases, and then upon general release. We want to see how the measured performance correlates with our initial benefit hypothesis.

The data we collect and analyze in this step serves not only to improve the product but also to improve our value stream and our development process. Let’s see how both are accomplished in the next section.

Learning from the outcomes

Based on our learning, we now need to figure out the best next step. This is true from the perspective of our product as well as that of our value stream.

For our product, it’s a matter of determining the best direction. This means deciding whether to pivot, changing direction in our overall product strategy, or to persevere, continuing in the same direction.

For our value stream, this is the time to reflect on how to improve. What lessons can we learn to get better?

The following practices are used to determine future product direction as well as future direction of our value stream:

  • Pivot or persevere from Lean Startup
  • Relentless improvement
  • Value stream mapping sessions

We perform this learning so that we can begin again with renewed focus at the start of the Continuous Delivery Pipeline, setting up new ideas to execute. We also improve our value streams to improve the performance of the Continuous Delivery Pipeline.

Let’s look at these practices that allow us to improve.

Pivot or persevere

In Chapter 10, Continuous Exploration and Finding New Features, we saw how Eric Ries’s Build-Measure-Learn cycle is applied in the SAFe® Lean Startup cycle, where Epics are created with a benefit hypothesis. The Epics are implemented with a Minimum Viable Product (MVP), which acts as an experiment to prove or disprove the benefit hypothesis. The MVP is evaluated through innovation accounting, tracking the leading indicators that will be used to make the pivot-or-persevere decision.

We are now at the point of that decision. For our ART, the MVP may be a few features created at the beginning of the Continuous Delivery Pipeline. We have developed these features, tested them, and deployed them into the production environment through the Continuous Integration and Continuous Deployment stages of our pipeline. Now, in Release on Demand, we show our MVP as features to the user population to see whether the benefit hypothesis is proven. Based on the innovation accounting metrics we are collecting through our full-stack telemetry, we come to one of the following two decisions:

  • Pivot: Our feature didn’t meet the benefit hypothesis. It’s time to move in a different direction. This may include stopping development in that product direction.
  • Persevere: Our benefit hypothesis was validated. We should continue to develop further features to enhance our MVP.

Note that even after a persevere decision for the MVP, our features will still be evaluated to confirm that they are valuable and that the product direction still resonates with our customers.
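
A benefit hypothesis phrased with an explicit threshold makes this decision straightforward to check. The following hypothetical sketch compares a measured leading indicator against the target defined back in Continuous Exploration; both numbers are made up.

    # Minimal sketch: pivot-or-persevere check against a hypothesis threshold.
    hypothesis_target = 0.10        # e.g., "conversion rate will reach at least 10%"
    measured_conversion = 0.12      # leading indicator from full-stack telemetry

    decision = "persevere" if measured_conversion >= hypothesis_target else "pivot"
    print(f"Measured {measured_conversion:.0%} against target {hypothesis_target:.0%}: {decision}")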

Relentless improvement

We first saw relentless improvement when we examined the SAFe House of Lean in Chapter 2, Culture of Shared Responsibility. We mentioned that with relentless improvement, we look for opportunities to be better, driven by that “hidden sense of danger.”

Throughout development, the teams and the ART have looked for such opportunities to improve the flow of value. The teams regularly hold retrospectives at the end of each iteration to identify problems at the team level, and the ART holds the Inspect and Adapt (I&A) event at the end of each Program Increment to look at systemic issues.

Other improvements may apply to the Continuous Delivery Pipeline itself. Newer tools, additional testing, and continued maintenance allow the ART to maintain or improve how it creates the experiments that validate benefit hypotheses.

Additional value stream mapping sessions

Another important part of relentless improvement comes from Value Stream Management and continuous learning, ideas originally discussed in Chapter 9, Moving to the Future with Continuous Learning.

When we originally performed value stream mapping, we not only mapped our value stream as it currently stands but also identified an ideal future-state value stream. Improvement actions can come from small, iterative changes toward that ideal value stream.

One other step for improvement is to hold a value stream mapping session at least once a year to evaluate the value stream in its current iteration. This allows the ART to view the present bottlenecks impeding the flow of value. During this value stream mapping session, a new future-state value stream can be identified for new improvements.

Summary

In this chapter, we have reached the last stage of the Continuous Delivery Pipeline: Release on Demand. After using feature flags to allow testers to view new changes in the production environment, we can use them to incrementally release those changes to a small population of users in a canary release. We may also want to set up our architecture to allow each component to have its own release cadence. After release, we want to ensure that the changes don’t disrupt the environment and that our overall solution is stable. To do that, we adhere to the principles and primary practices of SRE, including preparing for disaster recovery.

With the changes released in a stable environment, it’s time to measure business results through full-stack telemetry. We selected our measures during Continuous Exploration, relying on innovation accounting principles. Looking at these measures on dashboards visible to everyone, we can try to determine whether our benefit hypothesis that we created in Continuous Exploration is valid.

Based on whether the benefit hypothesis is valid, we must decide whether to pivot, changing direction, or to persevere, continuing to develop in the same product direction. After making that decision, the ART looks at other improvement opportunities through retrospectives, I&A, and regularly mapping its value stream.

This brings us to a close on Part 3. We now will take a look at emerging trends as well as some tips and tricks for success in your DevOps adoption in our last chapter of the book.

Questions

  1. Which of these are key practices or principles of SRE? (pick 3)
    1. Use feature flags for A/B testing
    2. Reduce toil mainly through automation
    3. Run unit tests in production
    4. Know how much risk you can handle through error budgets
    5. Chaos engineering
    6. Test in production
  2. Which of these should be part of a disaster recovery plan? (pick 3)
    1. Disaster recovery team identified
    2. Resource identification
    3. Vendor contact information
    4. Server schematics
    5. Backup schedule
    6. Hard-copy versions of important files
  3. Backups are performed on the database every two hours. If the database server crashes, what is the expected RPO?
    1. 2 hours
    2. 4 hours
    3. 8 hours
    4. 16 hours
  4. What problems can CSM detect? (pick 2)
    1. Man-in-the-middle attacks
    2. License violations
    3. Cross-site Scripting (XSS) vulnerabilities
    4. Weak passwords
  5. What practices are NOT examples of relentless improvement? (pick 2)
    1. Retrospectives
    2. Code reviews
    3. Inspect and Adapt (I&A)
    4. Adding tests to the Continuous Delivery Pipeline
    5. Code comments
