Chapter 13

Testing, Automation, and Changes

Images

CERTIFICATION OBJECTIVES

13.01     Testing Techniques

13.02     Automation and Orchestration

13.03     Change and Configuration Management

Images     Two-Minute Drill

Q&A   Self Test


One of the challenges of a cloud environment is ensuring service availability, even during maintenance. When organizations adopt a cloud model instead of hosting their own infrastructure, it is important for them to know that the services and data they need to access are available whenever and wherever they need them, without experiencing undue delays. Therefore, organizations need procedures for testing the cloud environment. Testing is a proactive measure to ensure consistent performance and operations of information systems.

Because maintenance tasks associated with on-premises technology are assumed by the cloud provider, operational and specific business rules need to be employed to best leverage the cloud technology to solve business problems. Standard maintenance tasks can take a significant amount of time, time that many companies just do not have. IT resources are typically stretched thin. This is where automation and orchestration step in. Automation and orchestration are two ways to reduce time spent on tasks, increase the speed of technological development, and improve efficiencies. Automation uses scripting, scheduled tasks, and automation tools to programmatically execute workflows that were formerly performed manually. Orchestration manages automation workflows to optimize efficiencies and ensure effective execution.

Improper documentation can make troubleshooting and auditing extremely difficult. Additionally, cloud systems can become extremely complex as they integrate with more systems, so a change in one place can impact systems across the enterprise if not managed correctly. Change and configuration management address these issues by tracking change requests, establishing a process for approval that considers the risks and impacts, and tracking the changes that are actually made.

CERTIFICATION OBJECTIVE 13.01

Testing Techniques

Availability does not just mean whether services are up or down. Rather, availability is also concerned with whether services are operating at expected performance levels. Cloud consumers have many options, and it is relatively easy for them to switch to another model, so it is important for cloud providers to consistently provide the level of service users expect.

Ensuring consistent performance and operation requires vigilant testing of services such as cloud systems or servers, virtual appliances, virtual networking, bandwidth, and a host of more granular metrics. Together, this data can paint a picture of where constraints may lie and how performance changes when conditions change. In this section, you will learn about the following testing techniques:

Images   Baseline comparisons

Images   Performance testing

Images   Configuration testing

Images   Testing in the cloud landscape

Images   Validating proper functionality

Images   SLA comparisons

Images   Testing sizing changes

Images   Testing high availability

Images   Testing connectivity

Images   Verifying data integrity

Images   Evaluating replication

Images   Testing load balancing

Baseline Comparisons

The purpose of establishing a baseline is to sample the resources that are being consumed by the cloud services, servers, or virtual machines over a set time period and to provide the organization with a point-in-time performance chart of its environment. A baseline can then be compared with actual performance metrics at any later point to determine whether current activity represents the norm.

Establish a baseline by selecting a sampling interval and the objects to monitor and then collecting performance data during that interval. Continue to collect metrics at regular intervals to get a chart of how systems are consuming resources.

Cloud providers offer the ability for each cloud virtual machine to send performance metrics to a central monitoring location in the cloud, such as AWS CloudWatch. This also provides aggregated metrics for applications that consist of multiple virtual machines.
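The following is a minimal sketch of a baseline collection script in Python, assuming the third-party psutil library is installed on the monitored machine; the sampling interval, sample count, and output file are illustrative choices rather than prescribed values:

# baseline_sample.py -- record a resource sample at a fixed interval
import csv
import datetime
import time

import psutil  # third-party package: pip install psutil

INTERVAL_SECONDS = 300   # five-minute sampling interval (illustrative)
SAMPLES = 12             # one hour of samples (illustrative)

with open("baseline.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "cpu_pct", "mem_pct", "disk_pct",
                     "net_bytes_sent", "net_bytes_recv"])
    for _ in range(SAMPLES):
        net = psutil.net_io_counters()
        writer.writerow([
            datetime.datetime.now().isoformat(),
            psutil.cpu_percent(interval=1),    # CPU utilization over a 1-second window
            psutil.virtual_memory().percent,   # memory utilization
            psutil.disk_usage("/").percent,    # storage utilization
            net.bytes_sent,                    # cumulative network bytes sent
            net.bytes_recv,                    # cumulative network bytes received
        ])
        f.flush()
        time.sleep(INTERVAL_SECONDS)

Collecting the same output in both test and production makes later baseline comparisons straightforward because the metrics and sampling method are identical.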

Procedures to Confirm Results

It is important to establish procedures for evaluating performance metrics in both testing and production baselines so that the accuracy of testing baselines can be confirmed. Load testing, stress testing, and simulated user behavior rely on workload patterns that are automated and standardized, but these workloads may not match actual user activity. Therefore, baseline comparisons from load testing, stress testing, or simulations could differ significantly from production baselines. Understand how baselines were obtained so that you do not use them in the wrong context.

Decision makers use baselines to determine initial and max resource allocations, to gauge scalability, and to establish an application or service cost basis, so baseline numbers need to be accurate. Evaluate metrics for critical resources, including CPU, memory, storage, and network utilization.

CPU Utilization   CPU utilization may change as systems are moved from test to production and over time as overall utilization changes. Collect CPU metrics once systems have been moved to production and track the metrics according to system load so that numbers can be compared to testing baselines. The following list includes some of the metrics to monitor.

Images   CPU time   Shows the amount of time a process or thread spends executing on a processor core. For multiple threads, the CPU time of the threads is additive. The application CPU time is the sum of the CPU time of all the threads that run the application. If an application runs multiple processes, there will be a CPU time associated with each process, and these values will need to be added together to get the full value.

Images   Wait time   Shows the amount of time that a given thread waits to be processed.
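Because CPU time is additive across threads and processes, the CPU time for an application that spawns several processes can be approximated by summing the values of every matching process. A minimal Python sketch, again assuming the psutil package and using a hypothetical process name:

import psutil  # third-party package: pip install psutil

def app_cpu_time(process_name):
    # Sum user and system CPU time across every process with the given name.
    total = 0.0
    for proc in psutil.process_iter(["name", "cpu_times"]):
        if proc.info["name"] == process_name:
            times = proc.info["cpu_times"]
            if times is not None:
                total += times.user + times.system
    return total

print("Total CPU time (seconds):", app_cpu_time("myapp"))  # "myapp" is hypothetical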

Memory Utilization   Memory utilization is also subject to changes in test versus production. Collect memory metrics once systems have been moved to production and track the metrics according to system load so that numbers can be compared to testing baselines. Some metrics to monitor include

Images   Paged pool   Shows the amount of data that has been paged to disk. Paging occurs when there is insufficient physical memory available, and each resulting page fault lowers performance.

Images   Page faults   Shows the total number of times data was fetched from disk rather than memory since process launch. A high number of page faults could indicate that memory needs to be increased.

Images   Peak memory usage   Shows the maximum amount of memory used by a process since it was launched.

Storage Utilization   After evaluating CPU and memory performance compared to proposed resources, an organization must also test the performance of the storage system. Identifying how well the storage system is performing is critical in planning for growth and proper storage management. Collect storage utilization data in production. The performance metrics can be compared to baselines if load values can be associated with each metric for common comparison. Some metrics to monitor include

Images   Application read IOPS   Shows how much storage read I/O was performed by the application process per second. Storage read I/O is when data is pulled from disk for the application. If an application runs multiple processes, there will be a read IOPS value associated with each process and these values will need to be added together to get the full value.

Images   Application write IOPS   Shows how much storage write I/O was performed by the application process per second. Storage write I/O is when the application saves data to disk. Similar to read IOPS, if an application runs multiple processes, there will be a write IOPS value associated with each process.

Images   Read IOPS   Shows how much storage read I/O was performed by the system per second.

Images   Write IOPS   Shows how much storage write I/O was performed by the system per second.

Network Utilization   The fourth item to consider is network utilization. This also can change from test to production and as systems mature. Collect network metrics once systems have been moved to production. You can use network collection tools or collect statistics from network devices or the virtual machines themselves. Sometimes it is useful to collect network metrics at different points and then compare the results according to system load so that numbers can be compared to testing baselines. Some metrics to monitor include

Images   Physical NIC average bytes sent/received   Tracks the average amount of data in bytes that was sent and received over the physical network adapter per second.

Images   Physical NIC peak bytes sent/received   Tracks the largest values for the average amount of data in bytes that were sent and received over the physical network adapter. Peak values can show whether the adapter is getting saturated.

Images   Virtual switch average bytes sent/received   Tracks the average amount of data in bytes that were sent and received over the virtual switch per second. Track this for each virtual switch you want to monitor.

Images   Virtual switch peak bytes sent/received   Tracks the largest values for the average amount of data in bytes that were sent and received by the virtual switch. Track this for each virtual switch you want to monitor. Peak values can show whether the virtual switch is getting saturated.

Images   Virtual NIC average bytes sent/received   Tracks the average amount of data in bytes that were sent and received over the virtual NIC per second. Track this for each virtual NIC in the virtual machines you want to monitor.

Patch Version   Performance baselines can change following patch deployment. Patches can change the way the system processes data or the dependencies involved in operations. Security patches, in particular, can add overhead to common system activities, necessitating an update of the baseline on the machine.

Application Version   Performance baselines can change following application updates. Application vendors introduce new versions and patches when they want to address vulnerabilities, fix bugs, or add features. Each of these could result in performance changes to the application as the code changes. Applications may process data in a different way or require more processing to implement more secure encryption algorithms, more storage to track additional metadata, or more memory to load additional functions or feature elements into RAM.

Auditing Enabled   Enabling auditing functions can significantly impact performance baselines. Auditing logs certain actions, such as failed logins and privilege use, along with error debug data and other information useful for evaluating security or operational issues. However, tracking and storing this data increases the overhead of running processes and applications on the server and can have a big impact on the baseline.

If auditing is temporary, schedule auditing and baselining activities to occur on different dates if possible. If not, make operations aware that baseline values will be impacted due to auditing.

Management Tool Compliance   Management tools improve the collection and analysis of performance metrics and baselines gathered from testing, production, and ongoing operations by putting all the data in one place where it can be easily queried, analyzed, and reported on.

Management tools can provide a dashboard of metrics that can be tweaked for operations teams to see changes, trends, and potential resource problems easily. Management tools may need to interface with cloud APIs or have an agent running on hypervisors to collect the data.

Performance Testing

One of the most common things to test is how well a program or system performs. Cloud adoption is a combination of new systems and migrations of existing physical or virtual systems into the cloud. New systems will need to have new benchmarks defined for adequate performance. However, you can start with some generic role-based performance benchmarks until the applications have been tested under full load and actual performance data has been captured.

Existing systems should have performance metrics associated with ideal operational speed already, so implementers will be looking to meet or exceed those metrics in cloud implementations.

Performance metrics on systems without a load can show what the minimum resource requirements will be for the application, since the resources in use without activity on the machine represent the bare operational state. However, to get a realistic picture of how the system will perform in production and when under stress, systems will need to go through load testing and stress testing.

Load Testing

Load testing evaluates a system when the system is artificially forced to execute operations consistent with user activities at different levels of utilization. Load testing emulates expected system use and can be performed manually or automatically. Manual load testing consists of individuals logging into the system and performing normal user tasks on the test system, while automated testing uses workflow automation and runtimes to execute normal user functions on the system.

Stress Testing

Stress testing is a form of load testing that evaluates a system under peak loads to determine its max data or user handling capabilities. Stress testing is used to determine how the application will scale. Stress testing can also be used to determine how many standard virtual machine configurations of different types can exist on a single hypervisor. This can help in planning for ideal resource allocation and host load balancing.

Systems can be tested as a whole, or they can be tested in isolation. For example, testing of the web server in a test system may have allowed testers to identify the max load the web server can handle. However, in production, the operations team will deploy web servers in a load-balanced cluster. The data so far tells how many servers would be needed in the cluster for the expected workload, but the testing team needs to determine how many database servers would be required for the workload as well. Rather than spin up more web servers, the team can run a trace that captures all queries issued to the database server in the web server testing. They can then automate issuing those same queries multiple times over to see how many database servers would be required for the number of web servers.

Continuing the example, let’s assume that testing revealed that the web server could host up to 150 connections concurrently. The traffic from those 150 connections was captured and then replayed to the database server in multiples until the database server reached max load. The testing team might try doubling the traffic, then tripling it, and so forth until they reach the max load the database server can handle; for this example, we will say that it is five times the load, or the load of 750 end-user connections. The testing team would typically build in a buffer since it is not good to run systems at 100 percent capacity, and then they would document system scalability requirements. In this example, assuming ideal load is 80 percent, or 120 connections instead of 150, the team would document that one web server should be allocated for each 120 concurrent user connections and that one database server should be allocated for every 600 concurrent user connections.

Taking this just a little bit further, the organization could configure rules to spin up a new virtual machine from a template based on concurrent connections so that the system as a whole auto-scales.
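Carrying the example numbers forward, the scaling rule can be expressed as a short calculation; the per-server connection counts below are the hypothetical figures from the scenario above:

import math

WEB_CONNECTIONS_PER_SERVER = 120  # 80 percent of the 150-connection web server maximum
DB_CONNECTIONS_PER_SERVER = 600   # 80 percent of the 750-connection database maximum

def servers_needed(expected_concurrent_users):
    web_servers = math.ceil(expected_concurrent_users / WEB_CONNECTIONS_PER_SERVER)
    db_servers = math.ceil(expected_concurrent_users / DB_CONNECTIONS_PER_SERVER)
    return web_servers, db_servers

# 1,000 expected concurrent users -> 9 web servers and 2 database servers
print(servers_needed(1000))

An auto-scaling rule would apply the same arithmetic continuously, spinning up a new virtual machine from the template whenever the current connection count crosses the next threshold.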

Remote Transactional Monitoring

After systems are deployed, operations teams will want to know how different functions within the application are performing. End users do not typically provide the best information on what is actually causing a performance problem, but remote transactional monitoring can simulate user activity and identify how long it takes to perform each task. Cloud-based options can be deployed at different locations around the world to simulate the user experience in that region.

Remote transactional monitoring, deployed in this fashion, can be used by operations teams to isolate tasks that exceed established thresholds and determine whether thresholds differ around the globe. They can then isolate the individual system processes that are contributing to the performance issue and determine whether the performance issue is localized to the edge.

Available vs. Proposed Resources

The elasticity associated with virtualization and cloud computing can result in different resources being available to a virtual machine than were proposed in requirements. Elastic computing allows computing resources to vary dynamically to meet a variable workload. (See Chapter 6 for more details.) Operations teams typically deploy machines with fewer resources than were proposed, but with a ceiling for growth.

Compute   It is best to allocate one vCPU to a virtual machine and then monitor performance, adding vCPUs as needed. When a virtual machine attempts to use a vCPU, the hypervisor must wait for the physical CPU associated with that vCPU to become available. The virtual machine perceives its vCPUs as idle and will attempt to spread the load across them if the application is configured for multiprocessing, but this can have an adverse impact on virtual machine performance if the physical CPU has a large number of processes in the queue. Furthermore, even idle vCPUs place some load on the hypervisor from host management processes, so it is best not to provision more than will be necessary.

Where possible, monitor hypervisor metrics to determine if overcommitment bottlenecks are occurring. The most important metric to watch is CPU ready, which measures the amount of time a virtual machine has to wait for a physical CPU to become available. It is also important to monitor CPU utilization within each virtual machine and on the host. High CPU utilization within a virtual machine might indicate the need for additional vCPUs to spread the load, while high host CPU utilization could indicate that virtual machines are not properly balanced across hosts. If one host has high CPU utilization and others have available resources, it may be best to migrate some of the virtual machines to another host to relieve the burden on the overtaxed host.

Memory   When configuring dynamic memory on virtual machines, ensure that you set both a minimum and a maximum. Default configurations typically allow a virtual machine to grow to the maximum amount of memory in the host unless a maximum is set. There are hypervisor costs to memory allocation that you should be aware of. Based on the memory assigned, hypervisors will reserve some amount of overhead for the virtual machine kernel and the virtual machine. VMware has documented overhead for virtual machines in its “VM Right-Sizing Best Practice Guide.” According to VMware’s guide, one vCPU and 1GB of memory allocated to a virtual machine produces 25.90MB of overhead for the host.

There is also the burden of maintaining overly large shadow page tables. Shadow page tables are the way hypervisors map host memory to virtual machines and how the virtual machine perceives the state of memory pages. This is necessary because virtual machines cannot access memory directly or they could potentially access the memory of other virtual machines.

These constraints can put unnecessary strain on a hypervisor if resources are overallocated. For this reason, keep resources to a minimum until they are actually required.

Configuration Testing

Configuration testing allows an administrator to test and verify that the cloud environment is running at optimal performance levels. Configuration testing needs to be done on a regular basis and should be part of a weekly or monthly routine. When testing a cloud environment, a variety of aspects need to be verified.

Data Access Testing

The ability to access data that is stored in the cloud and hosted with a cloud provider is an essential function of a cloud environment. Accessing that data needs to be tested for efficiency and compliance so that an organization has confidence in the cloud computing model.

Network Testing

Testing network latency measures the amount of time between a networked device’s request for data and the response to that request. This helps an administrator determine when a network is not performing at an optimal level.

In addition to testing network latency, it is also important to test the network’s bandwidth or speed. Standard practice for measuring bandwidth is to transfer a large file from one system to another and measure the amount of time it takes to complete the transfer or to copy the file. The throughput, or the average rate of a successful message delivery over the network, is then determined by dividing the file size by the time it takes to transfer the file and is measured in megabits or kilobits per second. However, this test does not provide a maximum throughput and can be misleading because of overhead factors.

When determining bandwidth and throughput, it is important to account for overhead such as network latency and system limitations. Dedicated software (e.g., NetCPS or iPerf) can be used to get a more accurate measure of maximum throughput. Testing the bandwidth and latency of a network that supports a cloud environment is important because the applications and data stored in the cloud would not be accessible without the proper network configuration.
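As a simple illustration of the throughput calculation described above, the following sketch converts a file size and transfer time into megabits per second; the figures are placeholders:

def throughput_mbps(file_size_bytes, transfer_seconds):
    # Throughput = bits transferred / elapsed time, expressed in megabits per second.
    return (file_size_bytes * 8) / transfer_seconds / 1_000_000

# A 500 MB file copied in 42 seconds works out to roughly 95 Mbps.
print(round(throughput_mbps(500_000_000, 42), 1))

Tools such as iPerf perform a similar calculation against generated traffic rather than a file copy, which removes storage overhead from the measurement.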

Application Testing

After moving an application to the cloud or virtualizing an application server in the cloud, testing of that application or server will need to be performed at regular intervals to ensure consistent operation and performance. There are a variety of different ways to test an application: some can be done manually, and some are automated.

Containerization is extremely effective for this purpose. Application containers are portable runtime environments that contain an application along with its dependencies such as frameworks, libraries, configuration files, and binaries. These are all bundled into the container, which can run on any system with compatible container software. This allows application testing teams to deploy multiple isolated containers to the cloud for simultaneous testing.

Performance counters are used to establish an application baseline and verify that the application and the application server are performing at expected levels. Monitor performance metrics and set alerting thresholds to know when applications are nearing limits. Baselines and thresholds are discussed in more detail in the “Resource Monitoring Techniques” section of Chapter 7. Batch files or scripts can easily automate checking the availability of an application or server or collecting performance metrics.
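As a simple illustration, a script can poll an application endpoint and flag responses that fail or exceed a response-time threshold; the URL and threshold below are hypothetical:

# check_availability.py -- poll an application endpoint and flag slow or failed responses
import time
import urllib.request

URL = "https://app.example.com/health"  # hypothetical health-check endpoint
THRESHOLD_SECONDS = 2.0                 # alerting threshold (illustrative)

start = time.monotonic()
try:
    with urllib.request.urlopen(URL, timeout=10) as response:
        elapsed = time.monotonic() - start
        if response.status != 200 or elapsed > THRESHOLD_SECONDS:
            print(f"ALERT: status {response.status}, response time {elapsed:.2f}s")
        else:
            print(f"OK: responded in {elapsed:.2f}s")
except Exception as exc:
    print(f"ALERT: request failed: {exc}")

Scheduling such a script with a task scheduler or cron job turns a manual availability check into an automated one.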

Applications need to be delivered seamlessly so that the end user is unaware the application is being hosted in a cloud environment. Tracking this information can help determine just how seamless that delivery process is.

A variety of diagnostic tools can be used to collect information about how an application is performing. To test application performance, an organization needs to collect information about the application, including requests and number of connections. The organization also needs to track how often the application is being utilized as well as overall resource utilization (memory and CPU).

Performance monitoring tools are valuable in evaluating application performance. Such tools can create reports on how quickly an application loads or spins up and analyze performance data on each aspect of the application as it is being delivered to the end user.

Follow this simple process in assessing application performance:

Images   Evaluate which piece of an application or service is taking the most time to process.

Images   Measure how long it takes each part of the program to execute and how the program is allocating its memory.

Images   Test the underlying network performance, storage performance, or performance of cloud virtual infrastructure components if using IaaS or PaaS.

In addition to testing the I/O performance of its storage system, a company can use a variety of tools for conducting a load test to simulate what happens to the storage system as the load is increased. Testing the storage system allows the organization to be more proactive than reactive with its storage and helps it plan for when additional storage might be required.

Testing in the Cloud Landscape

Cloud resources can be used as a “sandbox” of sorts to test new applications or new versions of applications without affecting the performance or even the security of the production landscape. Extensive application and application server testing can be performed on applications migrated to the cloud landscape before making cloud-based applications available to the organization’s users. Testing application and application server performance in the cloud is a critical step to ensuring a successful user experience.

Testing the application from the server hosting the application to the end user’s device using the application, and everything in between, is a critical success factor in testing the cloud environment.

Validating Proper Functionality

Systems are created to solve some business problem or to enhance a key business process. It is important, therefore, to test the execution of those activities to ensure that they function as expected. Functionality can be tested with the steps shown in Figure 13-1. These steps include identifying actions the program should perform, defining test input, defining expected output, running through each test case, and comparing the actual output against the expected output.

FIGURE 13-1   Functional testing steps

Images

Quality assurance (QA) teams create a list of test cases, each with an established input that should produce a specific output. For example, if we create a user, a report of all users should display that user. If we instruct the application to display the top ten customer accounts with the largest outstanding balances, we should be able to verify that the ones displayed actually have the highest balances.
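A minimal sketch of one such test case in the pytest style, assuming a hypothetical application module that exposes create_user and list_users functions:

# test_user_functionality.py -- functional test: defined input, expected output
from myapp import create_user, list_users  # hypothetical application module

def test_created_user_appears_in_report():
    create_user("jsmith", role="analyst")  # defined test input
    report = list_users()                  # actual output
    assert "jsmith" in report              # compare actual output against expected output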

SLA Comparisons

A service level agreement (SLA) is a contract that specifies the level of uptime that will be supported by the service provider as well as expected performance metrics. SLAs include provisions for how the service provider will compensate cloud consumers if SLAs are not met. At a minimum, these include some monetary compensation for the time the system was down that may be credited towards future invoices. Sometimes the provider must pay fines or damages for lost revenue or lost customer satisfaction.

SLAs are specific to a cloud service, meaning that the SLA for cloud storage might differ from the SLA for cloud databases, cloud virtual machines, and so on.

A multilevel SLA is used when different types of cloud consumers use the same services. These are a bit more complicated to read, and cloud consumers must understand which type they are to understand their expectations for availability. Multilevel SLAs are useful for cloud providers that offer a similar solution to different customer types because the same SLA document can be given to each customer. For example, a web hosting company may have different levels of service that are specified based on the customer type.

A service-based SLA describes a single service that is provided for all customers. There is no differentiation between customer expectations in a service-based SLA, unlike a multilevel SLA. For example, an Internet service provider (ISP) might specify the same SLA terms for every non-business customer.

In comparison, a customer-based SLA is an agreement that is unique between the customer and service provider. Since business terms can vary greatly, the ISP in this example may use customer-based SLAs for business customers and the service-based SLA for home users.

It is important to understand which SLAs are in place. As a customer, you should be aware of what your availability expectations are so that you can work around downtime. As a provider, you need to understand your responsibility for ensuring a set level of availability and the consequences of not living up to that agreement. The establishment of SLAs is an important part of ensuring adequate availability of key resources so that the company can continue doing business and not suffer excessive losses.

Testing Sizing Changes

Sizing is performed for hosts and their guests. First, the host must be provisioned with sufficient resources to operate the planned virtual machines with a comfortable buffer for growth. This step is not necessary if you are purchasing individual machines from a cloud vendor, because it is the cloud vendor’s responsibility to size the hosts that run the machines it provides to you. However, this is an important step if you rent the hypervisor itself (this is called “dedicated hosting”) in the cloud, as many companies do in order to give them flexibility and control in provisioning.

One of the many benefits of virtualization is the ability to provision virtual machines on the fly as the organization’s demands grow, making the purchase of additional hardware unnecessary. If the host computer is not sized correctly, however, it is not possible to add virtual machines without adding compute resources to the host computer or purchasing additional resources from the cloud vendor.

It is a best practice to allocate fewer resources to virtual machines and then analyze performance to scale as necessary. It would seem that overallocating resources would not be a problem if the resources are available, and many administrators have fallen for this misconception. Overallocating resources results in additional overhead to the hypervisor and sometimes an inefficient use of resources.

For example, overallocating vCPUs to a virtual machine could result in the virtual machine dividing work among multiple vCPUs only to have the hypervisor queue them up because not enough physical cores are available. Similarly, overallocating memory can result in excessive page files consuming space on storage and hypervisor memory overhead associated with memory page tracking.

Test each resource independently so that you can trace the performance impact or improvement to the resource change. For example, make a change to vCPU and then test before changing memory.

Testing High Availability

High availability (HA) was introduced in Chapter 12. As HA systems are deployed, they should also be tested to ensure that they meet availability and reliability expectations. Test the failure of redundant components such as CPU, power supplies, Ethernet connections, server nodes, storage connections, and so forth. The HA system should continue to function when you simulate failure of a redundant component. It is also important to measure the performance of a system when components fail. When a drive fails in parity-based RAID (RAID 5, RAID 6, RAID 50), the RAID set must rebuild. This can have a significant performance impact on the system as a whole. Similarly, when one node in an active/active cluster fails, all applications will run on the remaining node or nodes. This can also affect performance. Ensure that scenarios such as these are tested to confirm that performance is still acceptable under the expected load when components fail.

You can test the drive rebuild for a RAID set by removing one drive from the set and then adding another drive in. Ensure that the drive you add is the same as the one removed. You can simulate the failure of one node in a cluster by powering that node down or by pausing it in the cluster.

Testing Connectivity

In addition to this end-to-end testing, an organization needs to be able to test its connectivity to the cloud service. Without connectivity to the cloud that services the organization, the organization could experience downtime and costly interruptions to its data access. It is the cloud administrator’s job to test the network for things such as network latency and replication and to make sure that an application hosted in the cloud can be delivered to the users inside the organization.

Verifying Data Integrity

Data integrity is the assurance that data is accurate and that the same data that is stored in the cloud is the data that is later retrieved; the data remains unchanged by unauthorized processes. Application data is valuable and must be protected against corruption or incorrect alteration that could damage its integrity. Data integrity testing can be combined with other testing elements to ensure that data does not change unless specified. For example, functional testing reviews each test case along with the specified input and output. Data integrity testing would detect integrity issues with the direct test case, but automated integrity checks could also be built in to verify that all other data has not changed in the process.

Similarly, when performing load testing, ensure that data does not change as the system reaches its max capacity. Performance constraints can sometimes result in a cascade failure, and you want to ensure that data is adequately protected in such a case.

Evaluating Replication

Some situations require an organization to replicate or sync data between its internal data center and a cloud provider. Replication is typically performed for fault tolerance or load balancing reasons. After testing network latency and bandwidth, it is important to check and verify that the data is replicating correctly between the internal data center and the cloud provider or between multiple cloud services.

Test by making a change on one replication partner and then confirm that the change has taken place on the other replication partners. Measure how long it takes to complete replication by creating replication metrics. Combine replication testing with load testing to ensure that replication can stay consistent even under heavy loads and to determine at what point replication performance decreases.
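A minimal sketch of this verification, writing a unique marker on one replication partner and timing how long it takes to appear on another; write_marker and read_marker are hypothetical helpers for whatever replication partners are in use:

import time
import uuid

def measure_replication_lag(write_marker, read_marker, timeout_seconds=300):
    # Write a unique marker on the source partner, then poll the replica until it appears.
    marker = str(uuid.uuid4())
    start = time.monotonic()
    write_marker(marker)                     # hypothetical: insert the marker on the source
    while time.monotonic() - start < timeout_seconds:
        if read_marker(marker):              # hypothetical: check the replica for the marker
            return time.monotonic() - start  # replication lag in seconds
        time.sleep(1)
    raise TimeoutError("Change did not replicate within the timeout")

Running the same measurement during load testing shows the point at which replication performance begins to degrade.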

Testing Load Balancing

Some services can be configured to run on multiple machines so that the work of processing requests and servicing end users can be divided among multiple servers. This process is known as load balancing.

With load balancing, a standard virtual machine can be configured to support a set number of users. When demand reaches a specific threshold, another virtual machine can be created from the standard template and joined into the cluster to balance the load.

Load balancing is also valuable when performing system maintenance as systems can be added or removed from the load balancing cluster at will. However, load balancing is different from failover clustering in that machines cannot be failed over immediately without a loss of connection with end users. Instead, load-balanced machines are drain stopped. Drain stopping a node tells the coordinating process not to send new connections to the node. The node finishes servicing all the user requests and then can be taken offline without impacting the overall availability of the system.

When hosting an application in the cloud, there may be times where an organization uses the cloud as a load balancer. As discussed in Chapter 4, load balancing with dedicated software or hardware allows for the distribution of workloads across multiple computers. Using multiple components can help to improve reliability through redundancy, with multiple devices servicing the workload. If a company uses load balancing to improve availability or responsiveness of cloud-based applications, it needs to test the effectiveness of a variety of characteristics, including TCP connections per second, HTTP/HTTPS connections per second, and traffic loads simulated to validate performance under high-traffic scenarios. Testing all aspects of load balancing helps to ensure that the computers can handle the workload and that they can respond in the event of a single server outage.
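A minimal sketch of one such measurement, driving concurrent HTTP requests at a load-balanced endpoint and reporting requests per second; the URL, request count, and concurrency are illustrative:

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://lb.example.com/"  # hypothetical load-balanced endpoint
REQUESTS = 500                   # total requests to issue (illustrative)
CONCURRENCY = 50                 # simultaneous workers (illustrative)

def fetch(_):
    with urllib.request.urlopen(URL, timeout=10) as response:
        return response.status

start = time.monotonic()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    statuses = list(pool.map(fetch, range(REQUESTS)))
elapsed = time.monotonic() - start

print(f"{REQUESTS / elapsed:.1f} requests per second, "
      f"{statuses.count(200)} of {REQUESTS} succeeded")

Repeating the run while one node is drain stopped or powered down confirms that the remaining nodes can absorb the load.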

CERTIFICATION OBJECTIVE 13.02

Automation and Orchestration

Organizations are adopting numerous cloud technologies that offer a myriad of services and that often exchange data with one another. While the cloud provider is responsible for standard maintenance tasks associated with its on-premises technology, each organization is responsible for managing the cloud services it consumes, which includes implementing operational rules and specific business rules to best leverage the cloud technology to fulfill its operational needs. Automation and orchestration are two ways for an organization’s IT staff to reduce the time spent on tasks, increase the speed of technological development, and improve efficiencies. The two concepts work well together, but they are not the same thing.

Automation uses scripting, scheduled tasks, and automation tools to programmatically execute workflows that were formerly performed manually. Orchestration manages automation workflows to optimize efficiencies and ensure effective execution.

Orchestration integrates organizational systems to provide more value to the organization. In orchestration, workflow automations are called runbooks, and each discrete step in a runbook is called an activity.

As an example of orchestration integration, an IT person working on a customer trouble ticket could generate change requests from within the ticketing system by selecting the type of change and providing relevant details. This would prompt a set of activities to get the change approved. Once approved, the runbook could be automatically executed, if one exists for the task, with the input provided by the IT person in the ticket. The output from the runbook could then be placed into the change management system and documented on employee time sheets while metrics are gathered for departmental meetings. This is the power of orchestration!

Orchestration requires integration with a variety of toolsets and a catalog of sufficiently developed runbooks to complete tasks. Orchestration often involves a cloud portal or other administrative center to provide access to the catalog of workflows along with data and metrics on workflow processes, status, errors, and operational efficiency. The orchestration portal displays useful dashboards and information for decision-making and ease of administration.

In the trouble ticket scenario described earlier, orchestration ensures that tickets are handled promptly, tasks are performed according to SOPs, associated tasks such as seeking change approval and entering time are not forgotten, and the necessary metrics for analytics and reporting are collected and made available to managers.

Interrelationships are mapped between systems and runbooks along with the requirements, variables, and dependencies for linked workflows. This facilitates the management of cloud tools and the integration of tools and processes with a variety of legacy tools. Management options improve with each new workflow that automates tasks that once would have required logging into many different web GUIs, workstations, servers, or traditional applications, or sending SSH commands to a CLI.

This section is organized into the following subsections:

Images   Event orchestration

Images   Scripting

Images   Custom programming

Images   Runbook management for single nodes

Images   Orchestration for multiple nodes and runbooks

Images   Automation activities

Event Orchestration

Event logs used to be reviewed by IT analysts who understood the servers and applications in their organization and had experience solving a variety of problems. These individuals commonly used knowledge bases of their own making or those created by communities to identify whether an event required action to be taken and what the best action was for the particular event.

The rapid expansion of the cloud and IT environments, as well as increasing complexity of technology and integrations, has made manual event log analysis a thing of the past. However, the function of event log analysis and the corresponding response to actionable events is still something companies need to perform. They accomplish this through event orchestration.

Event orchestration collects events from servers and devices such as firewalls and virtual appliances in real time. It then parses events, synchronizes time, and executes correlation rules to identify commonalities and events that, together, could form a risk. It ranks risk items, and if they exceed specific thresholds, it creates alerts on the events to notify administrators. Some events may result in runbook execution. For example, excessive logon attempts from an IP address to an FTPS server could result in runbook execution to add the offending IP address to IP block lists. Similarly, malware indicators could quarantine the machine by initiating one runbook to disable the switch port connected to the offending machine and another runbook to alert local incident response team members of the device and the need for an investigation.
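A minimal sketch of the failed-logon correlation rule described above, counting failed logons per source IP address within a sliding window and invoking a hypothetical block_ip runbook trigger when a threshold is exceeded:

from collections import defaultdict

THRESHOLD = 10        # failed logons allowed per window (illustrative)
WINDOW_SECONDS = 300  # five-minute sliding window (illustrative)

def correlate_failed_logons(events, block_ip):
    # events: iterable of (timestamp, source_ip) tuples for failed logon events, in time order
    # block_ip: hypothetical runbook trigger that adds the IP address to block lists
    history = defaultdict(list)
    for timestamp, source_ip in events:
        history[source_ip].append(timestamp)
        # Keep only the events that fall inside the sliding window.
        history[source_ip] = [t for t in history[source_ip]
                              if timestamp - t <= WINDOW_SECONDS]
        if len(history[source_ip]) >= THRESHOLD:
            block_ip(source_ip)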

Scripting

There are a wide variety of scripting languages, and some may be more applicable to certain uses. Most orchestration tools support a large number of scripting languages, so feel free to use the tools that are most effective for the task or those that you are most familiar with. The advantage of scripting languages is that they are relatively simple to learn, they can run with a small footprint, they are easy to update, and they are widely supported. Some of the most popular scripting languages include Microsoft PowerShell, JavaScript, AppleScript, Ruby, Python, SQL, Shell, Perl, PHP, Go, and R.

Custom Programming

Full-fledged programming languages can be used to create runbooks as well as hook into cloud-based APIs. Programming languages offer more capabilities than scripting languages, and there are powerful development environments that can allow for more extensive testing. Programming languages are also good to use when runbook programs are increasingly complex because programs can be organized into flexible modules. Some of the most popular languages include C++, C#, Java, Python, Scala, and PHP.

Runbook Management for Single Nodes

As previously stated, runbooks are workflows organized into a series of tasks called activities. Runbooks begin with an initiation activity and end with some activity to disconnect and clean up. In-between activities may include processing data, analyzing data, debugging systems, exchanging data, monitoring processes, collecting metrics, and applying configurations. Runbook tools such as Microsoft System Center Orchestrator include plug-in integration packs for Azure and AWS cloud management.

Single-node runbooks are those that perform all activities on a single server or device. For example, the following runbook would perform a series of maintenance activities on a database server (a condensed scripted sketch follows the list):

1.   Connect to the default database instance on the server.

2.   Enumerate all user databases.

3.   Analyze indexes for fragmentation and page count.

4.   Identify indexes to optimize.

5.   Rebuild indexes.

6.   Reorganize indexes.

7.   Update index statistics.

8.   Perform database integrity checks.

9.   Archive and remove output logs older than 30 days.

10.   Remove rows from the commandlog table older than 30 days.

11.   Disconnect from the default database instance on the server.
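A condensed sketch of a few of these activities as a scripted runbook, assuming the pyodbc package and a SQL Server default instance; the connection string is a placeholder, and the index and log maintenance steps are omitted for brevity:

# db_maintenance_runbook.py -- condensed single-node runbook sketch
import pyodbc  # third-party package: pip install pyodbc

# Activity 1: connect to the default database instance on the server.
conn = pyodbc.connect("DSN=DefaultInstance;Trusted_Connection=yes", autocommit=True)
cursor = conn.cursor()

# Activity 2: enumerate all user databases (database_id > 4 excludes the system databases).
cursor.execute("SELECT name FROM sys.databases WHERE database_id > 4")
databases = [row.name for row in cursor.fetchall()]

# Activity 8: perform database integrity checks on each user database.
for db in databases:
    cursor.execute(f"DBCC CHECKDB ([{db}]) WITH NO_INFOMSGS")

# Activity 11: disconnect from the default database instance.
cursor.close()
conn.close()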

Orchestration for Multiple Nodes and Runbooks

Multiple-node runbooks are those that interface with multiple devices or servers. For example, this runbook would patch all virtual machines on all hosts in the customer cloud:

1.   Enumerate virtual machines on each host.

2.   Create a snapshot of virtual machines.

3.   Add virtual machines to the update collection.

4.   Scan for updates.

5.   Apply updates to each machine, restarting load-balanced servers and domain controllers one at a time.

6.   Validate service states.

7.   Validate URLs.

8.   Remove snapshot after 24 hours of successful runtime.

Automation Activities

A wonderful thing about runbooks is that there is a great deal of community support for them. You can create your own runbooks—and you definitely will have to do that—but you also can take advantage of the many runbooks that others have created and made available in runbook communities. Some of these can be used as templates for new runbooks to save valuable development time. Consider contributing your own runbooks to the community to help others out as well. Microsoft maintains a gallery of runbooks that can be downloaded for implementation on the Microsoft Azure cloud computing platform. Other vendors have set up community runbook repositories as well.

Here are some common runbook activities:

Images   Snapshots Activities can be created to take snapshots of virtual machines or remove existing snapshots or snapshot chains from virtual machines. Runbooks might use this activity when making changes to a virtual machine. The snapshot could be taken first, and then the maintenance would be performed. If things ran smoothly, the snapshot could be removed. Otherwise, the runbook could execute the activity to apply the snapshot taken at the beginning of the runbook. See Chapter 6 for more information on snapshots.

Images   Cloning Activities can be created to make a clone of a virtual machine. A runbook using this activity might combine it with an activity to create a template from the clone or an activity to archive the clone to secondary storage or an activity to send the clone to a remote replication site and create a virtual machine from it. See Chapter 6 for more information on cloning.

Images   User account creation Activities can be created to create a user account based on form input. For example, the activity could create the account in Windows Active Directory and on several cloud services and then add the user account to groups based on the role provided on the form input.

Images   Permission setting Activities can be created to apply permissions to a group of files, folders, and subfolders.

Images   Resource access Activities can be created to assign resources to a device such as storage LUNs, virtual NICs, or other resources. This can be useful for runbooks that provision virtual machines or provision storage.

Images   User account management Activities can be created to reset user account passwords, disable accounts, unlock accounts, or activate accounts.

CERTIFICATION OBJECTIVE 13.03

Change and Configuration Management

The process of making changes to the cloud environment from its design phase to its operations phase in the least impactful way possible is known as change management. Configuration management ensures that the assets required to deliver services are adequately controlled, and that accurate and reliable information about those assets is available when and where it is needed.

Change management and configuration management support overall technology governance processes to ensure that cloud systems are managed appropriately. All change requests and configuration items need to be documented to make certain that the requirements documented as part of the strategy phase are fulfilled by the corresponding design phase.

Change Management

Change management is a collection of policies and procedures that are designed to mitigate risk by evaluating change, ensuring thorough testing, providing proper communication, and training both administrators and end users.

A change is defined as the addition, modification, or removal of anything that could affect cloud services. This includes modifying system configurations, adding or removing users, resetting accounts, changing permissions, and a host of other activities that are part of the ordinary course of cloud operations, as well as conducting project tasks associated with upgrades and new initiatives. It is important to note that this definition is not restricted to cloud components; it should also be applied to documentation, people, procedures, and other nontechnical items that are critical to a well-run cloud environment. The definition is also important because it debunks the notion that only “big” changes should follow a change management process. In fact, it is often the little things that cause big problems, and thus change management needs to be applied equally to both big and small changes.

Change management maximizes business value through modification of the cloud environment while reducing disruption to the business and unnecessary cloud expense due to rework. Change management helps to ensure that all proposed changes are both evaluated before their implementation and recorded for posterity. Change management allows companies to prioritize, plan, test, implement, document, and review all changes in a controlled fashion according to defined policies and procedures.

Change management optimizes overall business risk by building the evaluation of both the risks and the benefits of a proposed change into the change procedure and the organizational culture. Identified risks contribute to the decision to either approve or reject the change.

Lastly, change management acts as a control mechanism for the configuration management process by ensuring that all changes to configuration item baselines in the cloud environment are updated in the configuration management system (CMS).

A change management process can be broken down into several constituent concepts that work together to meet these objectives. These concepts are as follows:

Images   Change requests

Images   Change proposals

Images   Change approval or rejection

Images   Change scheduling

Images   Change documentation

Images   Change management integration

Change Requests

A request for change (RFC) is a formal request to make a modification that can be submitted by anyone who is involved with or has a stake in that particular item or service. IT leadership may submit changes focused on increasing the profitability of a cloud service; a systems administrator may file a change to improve system stability; and an end user may submit a change that requests additional functionality for their job role. All are valid requests for change.

Change Request Types   Change request types are used to categorize both the amount of risk and the amount of urgency each request carries. There are three types of changes: normal changes, standard changes, and emergency changes.

Normal changes are changes that are evaluated by the defined change management process to understand the benefits and risks of any given request. A standard change is a type of change that has been evaluated previously and poses little risk to the health of the cloud services. Because it is well understood and low risk, and the organization does not stand to benefit from another review, a standard change is preauthorized. For example, resetting a user’s password is a standard change. It still requires approval to ensure it is tracked and not abused, but it does not require the deliberation other changes might.

Emergency changes, as the name suggests, are used in case of an emergency and designate a higher level of urgency to move into operation. Even if the change is urgent, all steps of the process for implementing the change must be followed. However, the process can be streamlined: the review and approval of emergency changes is usually handled by a smaller group of people than is used for a normal change, to facilitate moving the requested change into operation.

Change Proposals

Change proposals are similar to RFCs but are reserved for changes that have the potential for major organizational impact or severe financial implications. The reason for a separate designation for RFCs and change proposals is to make sure that the decision-making for very strategic changes is handled by the right level of leadership within the organization.

Change proposals are managed by the CIO or higher position in an organization. They are a high-level description of the change requiring the approval of those responsible for the strategic direction associated with the change. Change proposals help IT organizations stay efficient by not wasting resources on the intensive process required by an RFC to analyze and plan the proposed change if it is not in the strategic best interest of the organization to begin with.

Change Approval or Rejection

The change manager is the individual who is directly responsible for all the activities within the change management process. The change manager is ultimately responsible for the approval or rejection of each RFC and for making sure that all RFCs follow the defined policies and procedures as a part of their submission. The change manager will evaluate the change and decide to approve the change or reject it.

Change managers cannot be expected to know everything, nor to have full knowledge of the scope and impact of every change, despite the documentation provided in the change request, because systems are highly integrated and complex. One small change to one system could have a big impact on another system. For this reason, the change manager assembles the right collection of stakeholders to help advise on the risks and benefits of a given change and to provide the input that will allow the change manager to make the right decision when he or she is unable to decide autonomously.

Change Advisory Board (CAB)   The body of stakeholders that provides input to the change manager about RFCs is known as the change advisory board (CAB). This group of stakeholders should be composed of members from all representative areas of the business as well as customers who might be affected by the change (see Figure 13-2). As part of their evaluation process for each request, the board needs to consider the following:

FIGURE 13-2   The entities represented by a change advisory board (CAB)

Images

Images   The reason for the change

Images   The benefit of implementing the change

Images   The risks associated with implementing the change

Images   The risks associated with not implementing the change

Images   The resources required to implement the change

Images   The scheduling of the implementation

Images   The impact of the projected service outage with respect to established SLAs

Images   The planned backout strategy in case of a failed change

While this may seem like a lot of people involved in and much time spent on the consideration of each change to the environment, these policies and procedures pay off in the long run. They do so by limiting the impact of unknown or unstable configurations going into a production environment.

Emergency Change Advisory Board (ECAB)   A CAB takes a good deal of planning to get all the stakeholders together. In the case of an emergency change, there may not be time to assemble the full CAB. For such situations, an emergency change advisory board (ECAB) should be formed. This emergency CAB should follow the same procedures as the standard CAB; it is just a subset of the stakeholders who would usually convene for the review. Often the ECAB is defined as a certain percentage of a standard CAB that would be required by the change manager to make sure they have all the input necessary to make an informed decision about the request.

Images

When implementing a change that requires expedited implementation approval, an emergency change advisory board (ECAB) should be convened.

Change Scheduling

Approved changes must be scheduled. There are some changes that can take place right away, but many must be planned for a specific time and date when appropriate team members are available and when stakeholders have been notified of the change.

Not all changes require downtime, but it is imperative to understand which ones do. Changes that require the system to be unavailable need to be performed during a downtime. Stakeholders, including end users, application owners, and other administrative teams, need to be consulted prior to scheduling a downtime so that business operations are not adversely impacted. They need to understand how long the downtime is anticipated to take, what value the change brings to them, and the precautions that are being taken to protect against risks. For customer-facing systems, downtime needs to be scheduled or avoided so that the company does not lose customer confidence by taking a site, application, or service down unexpectedly.

Upgrades may not require downtime, but they could still affect the performance of the virtual machine and the applications that run on top of it. For this reason, it is best to plan changes for times when the load on the system is minimal.

Enterprise systems may have a global user base. Additionally, it may be necessary to coordinate resources with cloud vendors, other third parties, or support personnel in different global regions. In such cases, time zones can be a large constraint on performing upgrades. It can be difficult to coordinate a time that works for distributed user bases and maintenance teams. For this reason, consider specifying an upgrade schedule in vendor contracts and SLAs so that you are not so gridlocked by time zone constraints that you are unable to perform an upgrade.

Working hours should also be factored in when scheduling change implementation. For example, if a change is to take three hours of one person’s time, then it must be scheduled at least three hours prior to the end of that person’s shift or the task will need to be transitioned to another team member while still incomplete. It generally takes more time to transition a task from one team member to another, so it is best to try to keep this to a minimum.

It is also important to factor in some buffer time for issues that could crop up. In this example, if the change is expected to take three hours and you schedule it exactly three hours before the employee's shift ends, there is no time left for troubleshooting or error. If problems do arise, the task must be transitioned to another team member, whose troubleshooting might require input from the first team member to avoid rework, since the second employee may not know everything that was already done.
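
To make the buffer reasoning concrete, the arithmetic can be expressed as a short script. The following Python sketch is illustrative only; the function name, the one-hour buffer, and the shift end time are assumptions, not values from any standard.

from datetime import datetime, timedelta

def latest_safe_start(shift_end: datetime, estimated_duration: timedelta,
                      buffer: timedelta) -> datetime:
    # Latest time a change can begin so that the estimated work plus a
    # troubleshooting buffer still finishes before the implementer's shift ends.
    return shift_end - estimated_duration - buffer

# Example: a three-hour change with a one-hour buffer against a 5:00 P.M. shift end.
shift_end = datetime(2025, 6, 12, 17, 0)
print(latest_safe_start(shift_end, timedelta(hours=3), timedelta(hours=1)))
# Prints 2025-06-12 13:00:00, so the change should start no later than 1:00 P.M.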

If those implementing the change run into difficulties, they should document which troubleshooting steps they performed as well. This is especially important if others will be assisting the individual in troubleshooting the issue. This can also be helpful when working with technical support. This leads us to the next step in the change management process, documentation.

Change Documentation

It is important to keep a detailed log of what changes were made. Sometimes the impact of a change is not seen right away. Issues could crop up sometime down the road, and it helps to be able to query a system or database to view all the changes related to current issues.

After every change has been completed, it must go through a defined procedure for both change review and closure. This review process is intended to evaluate whether the objectives of the change were accomplished, whether the users and customers were satisfied, and whether any new side effects were produced. The review and closure process is also intended to evaluate the resources expended in the implementation of the change, the time it took to implement, and the overall cost so that the organization can continue to improve efficiency and effectiveness of the cloud IT service management processes.

Change documentation could be as simple as logging the data in a spreadsheet, but spreadsheets offer limited investigative and analytical options. It is best to invest in a configuration management database (CMDB) to retain documentation on requests, approvals or denials, and change implementation. The CMDB is discussed in the upcoming “Configuration Management” section.

Change Management Integration

The change process may sound very bureaucratic and cumbersome, but it does not have to be. Integrate change management into your organization in a way that fits with your organizational culture. Here are some ideas.

Users could submit change requests through a web-based portal, which would send an e-mail to the change approvers. Change approvers could post requests to the change approval board on Slack or some other medium to solicit feedback, and then use that feedback to approve or reject the change in the system. Communication on the issue could include a hashtag (a feature of Slack) with the change ID so that it could be easily tracked. Another channel could be used to communicate with stakeholders to schedule the time and resources for the change.
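
As a sketch of what the Slack notification step could look like, the following Python posts a new RFC to a change approval channel through an incoming webhook. The webhook URL, change ID format, and function name are hypothetical, and any chat platform with a similar webhook feature could be substituted.

import json
import urllib.request

# Hypothetical webhook URL; an incoming webhook would be configured in Slack.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def announce_rfc(change_id: str, summary: str, requester: str) -> None:
    # Post a new request for change to the CAB's channel, tagging the message
    # with the change ID so that related discussion is easy to track.
    message = {
        "text": f"#{change_id} New RFC from {requester}: {summary}. "
                "Please review and reply with feedback for the CAB."
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

# Example usage (requires a real webhook URL):
# announce_rfc("CHG-2041", "Apply vendor patch to web tier", "dieter@example.com")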

Some workflow logic can be built into the system to help automate scheduling. You can also use tools to help capture the details of changes, such as scripts that dump firewall configurations to the configuration management database whenever changes are made, and archival tools that export the CAB discussions to the database. When changes are complete, they can be updated in the same system so that everything is tracked.
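
A minimal sketch of such a capture script follows, using a local SQLite table as a stand-in for the CMDB. The table layout, CI name, and configuration text are illustrative assumptions; a real deployment would write to whatever CMDB product the organization uses and would pull the configuration from the device's own export tooling.

import hashlib
import sqlite3
from datetime import datetime, timezone

def record_configuration(db_path: str, ci_name: str, config_text: str) -> None:
    # Store a point-in-time copy of a CI's configuration, keyed by a hash so
    # that repeated captures of an unchanged configuration are easy to detect.
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS configuration_records (
               ci_name TEXT, captured_at TEXT, config_hash TEXT, config TEXT)"""
    )
    conn.execute(
        "INSERT INTO configuration_records VALUES (?, ?, ?, ?)",
        (
            ci_name,
            datetime.now(timezone.utc).isoformat(),
            hashlib.sha256(config_text.encode("utf-8")).hexdigest(),
            config_text,
        ),
    )
    conn.commit()
    conn.close()

# Example: a placeholder firewall configuration recorded for CI fw-edge-01.
record_configuration("cmdb.db", "fw-edge-01", "interface eth0\n  permit tcp any any eq 443")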

Essentially, the entire process can be streamlined and still offer robust checks and balances. Don’t be afraid of change management. Look for ways to integrate it into your company.

Configuration Management

Change management offers value to both information technology organizations and their customers. One problem when implementing change management, however, lies in how the objects that are being modified are classified and controlled. To this end, we introduce configuration management, which deals with cloud assets and their relationships to one another.

The purpose of the configuration management process is to ensure that the assets and configuration items (CIs) required to deliver services are adequately controlled, and that accurate and reliable information about those assets and CIs is available when and where it is needed. CIs are defined as any asset or document that falls within the scope of the configuration management system. Configuration management information includes details of how the assets have been configured and the relationships between assets.

The objectives of configuration management are as follows:

Images   Identifying CIs

Images   Controlling CIs

Images   Protecting the integrity of CIs

Images   Maintaining an accurate and complete configuration management system (CMS)

Images   Maintaining information about the state of all CIs

Images   Providing accurate configuration information

The implementation of a configuration management process results in improved overall service performance. It also helps optimize both the costs and the risks that poorly managed assets can cause, such as extended service outages, fines, incorrect license fees, and failed compliance audits. Some of the specific benefits to be achieved through its implementation are the following:

Images   A better understanding on the part of cloud professionals of the configurations of the resources they support and the relationships they have with other resources, resulting in the ability to pinpoint issues and resolve incidents and problems much faster

Images   A much richer set of detailed information for change management from which to make decisions about the implementation of planned changes

Images   Greater success in the planning and delivery of scheduled releases

Images   Improved compliance with legal, financial, and regulatory obligations with less administration required to report on those obligations

Images   Better visibility into the true, fully loaded cost of delivering a specific service

Images   Ability to track both baselined configuration deviation and deviation from requirements

Images   Reduced cost and time to discover configuration information when required

Although configuration management may appear to be a straightforward process of just tracking assets and defining the relationships among them, you will find that it has the potential to become very tricky as we explore each of the activities associated with it.

At the very start of the process implementation, configuration management is responsible for defining and documenting which assets of the organization’s cloud environments should be managed as configuration items. This is a highly important decision, and careful selection at this stage of the implementation is a critical factor in its success or failure. Once the items that will be tracked as CIs have been defined, the configuration management process has many CI-associated activities that must be executed. For each CI, it must be possible to do the following:

Images   Identify the instance of that CI in the environment. A CI should have a consistent naming convention and a unique identifier associated with it to distinguish it from other CIs (see the sketch after this list).

Images   Control changes to that CI through the use of a change management process.

Images   Report on, periodically audit, and verify the attributes, statuses, and relationships of any and all CIs at any requested time.
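
For example, a naming convention and unique identifier can be enforced programmatically whenever a CI is registered. The following Python sketch shows one possible convention; the type prefixes, site codes, and identifier format are assumptions for illustration, not an industry standard.

import uuid

# Hypothetical type prefixes for one possible naming convention.
CI_TYPE_PREFIXES = {"virtual_machine": "VM", "firewall": "FW", "database": "DB"}

def register_ci(ci_type: str, site_code: str, sequence: int) -> dict:
    # Build a CI record with a human-readable name and a globally unique identifier.
    name = f"{CI_TYPE_PREFIXES[ci_type]}-{site_code}-{sequence:04d}"  # e.g., FW-NYC-0007
    return {"ci_id": str(uuid.uuid4()), "ci_name": name, "ci_type": ci_type}

print(register_ci("firewall", "NYC", 7))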

If even one of these activities is not achievable, the entire process fails for all CIs. Much of the value derived from configuration management comes from a trust that the configuration information presented by the CMS is accurate and does not need to be investigated. Any activity that undermines that trust and requires a stakeholder to investigate CI attributes, statuses, or relationships eliminates the value the service is intended to provide.

Images

An enterprise IT organization at a large manufacturing company recognized the need to implement an improved configuration management process and invested large amounts of time and money into the effort. With a well-respected professional services company leading the way and investment in best-of-breed tools, they believed they were positioned for success. After the pilot phase of the implementation, when they believed they had a good system in place to manage a subset of the IT environment, one failure in the ability to audit their CIs led to outdated data. That outdated data was used by the CAB to decide on a planned change. When the change failed because the expected configuration was different from the configuration running in the production environment, all support for configuration management eroded, and stakeholders began demanding configuration reviews before any change planning, thus crippling the value of configuration management in that environment.

Configuration Management Database (CMDB)   A CMDB is a database used to store configuration records throughout their life cycle. The configuration management system maintains one or more CMDBs, and each database stores attributes of configuration items (CIs) and relationships with other configuration items.

All the attributes of a CI should be recorded in a CMDB, which is the authority for tracking those attributes. An environment may have multiple CMDBs that are maintained under disparate authorities, and all CMDBs should be tied together as part of a larger CMS. One of the key attributes that every CI must contain is ownership. By defining an owner for each CI, organizations achieve asset accountability. This accountability imposes responsibility for keeping all attributes current, inventorying, financial reporting, safeguarding, and other controls necessary for optimal maintenance, use, and disposal of the CI. The defined owner of each asset should be a key stakeholder in any CAB that deals with a change affecting the configuration of that CI, thus providing the owner with configuration control.
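
To make the idea of attributes, relationships, and ownership concrete, a CI record might be modeled along the following lines. The field names and relationship types are assumptions for illustration; every CMDB product defines its own schema.

from dataclasses import dataclass, field

@dataclass
class ConfigurationItem:
    # One possible shape for a CI record.
    ci_id: str
    name: str
    ci_type: str
    owner: str  # asset accountability: who keeps the attributes current
    attributes: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)  # (relationship type, related CI name) pairs

web_vm = ConfigurationItem(
    ci_id="a1b2c3", name="VM-NYC-0042", ci_type="virtual_machine",
    owner="infrastructure-team@example.com",
    attributes={"vcpus": 4, "memory_gb": 16, "os": "Windows Server 2019"},
    relationships=[("hosted_on", "HV-NYC-0003"), ("member_of", "LB-POOL-WEB")],
)
print(web_vm.owner)  # The owner should sit on any CAB reviewing changes to this CI.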

CERTIFICATION SUMMARY

The first part of this chapter covered testing techniques. The ability to test the availability of a cloud deployment model allows an organization to be proactive with the services and data that it stores in the cloud.

Automation and orchestration are two ways to reduce time spent on tasks, increase the speed of technological development, and improve efficiencies. Automation uses scripting, scheduled tasks, and automation tools to programmatically execute workflows that were formerly performed manually. Orchestration manages automation workflows to optimize efficiencies and ensure effective execution. Orchestration integrates organizational systems to provide more value to the organization. In orchestration, workflow automations are called runbooks, and each discrete step in a runbook is called an activity.

The chapter ended with a discussion of change and configuration management. A change is defined as the addition, modification, or removal of anything that could affect cloud services. Configuration management, on the other hand, is concerned with controlling cloud assets and their relationships to one another through configuration items (CIs).

KEY TERMS

Use the following list to review the key terms discussed in this chapter. The definitions also can be found in the glossary.

approval process   The set of activities that present all relevant information to stakeholders and allows an informed decision to be made about a request for change.

asset accountability   The documented assignment of a configuration item to a human resource.

automation   The programmatic execution of a workflow that was formerly performed manually.

backout plan   An action plan that allows a change to be reverted to its previous baseline state.

baseline   A set of metrics collected over time to understand normal data and network behavior patterns.

change advisory board (CAB)   The body of stakeholders that provides input to the change manager about requests for change.

change management   The process of making changes to the cloud environment from its design phase to its operations phase in the least impactful way possible.

change monitoring   The process of watching the production environment for any unplanned configuration changes.

configuration control   The ability to maintain updated, accurate documentation of all CIs.

configuration item (CI)   An asset or document that falls within the scope of the configuration management system.

configuration management (CM)   The process that ensures all assets and configuration items (CIs) required to deliver services are adequately controlled, and that accurate and reliable information about them is available when and where it is needed, including details of how the assets have been configured and the relationships between assets.

configuration management database (CMDB)   The database used to store configuration records throughout their life cycle. The configuration management system maintains one or more CMDBs, and each database stores attributes of configuration items and relationships with other configuration items.

configuration standardization   Documented baseline configuration for similar configuration items (CIs).

data integrity   Assurance that data is accurate and that the same data remains unchanged in between storage and retrieval.

emergency change advisory board (ECAB)   The body of stakeholders that provides input to the change manager about requests for change in the case of an emergency when there may not be time to assemble the full CAB.

high availability (HA)   Systems that are available almost 100 percent of the time.

load balancing   Running services on multiple machines to share the burden of processing requests and servicing end users.

load testing   Testing that evaluates a system when the system is artificially forced to execute operations consistent with user activities at different levels of utilization.

orchestration   The management and optimization of automation workflows.

remote transactional monitoring   A system that simulates user activity to identify how long it takes to perform each task.

replication   Copying data between two systems so that any changes to the data are made on each node in the replica set.

request for change (RFC)   A formal request to make a modification that can be submitted by anyone who is involved with or has a stake in that particular item or service.

runbook   A workflow automation that can be used in orchestration tools.

stress testing   A form of load testing that evaluates a system under peak loads to determine its maximum data or user handling capabilities.

system logs   Files that store a variety of information about system events, including device changes, device drivers, and system changes.

testing   A proactive measure to ensure consistent performance and operations of information systems.

Images TWO-MINUTE DRILL

Testing Techniques

Images  A baseline can be compared with actual performance metrics at any point following collection of the baseline to determine if activity represents the norm.

Images  Performance testing evaluates the ability of a system to service requests in a timely manner. It uses performance metrics to track utilization of resources based on demand. Demand is simulated in load testing, stress testing, and remote transactional monitoring.

Images  Configuration testing allows an administrator to test and verify that the cloud environment is running at optimal performance levels.

Images  Testing in the cloud landscape utilizes cloud resources as a sandbox to test new applications or new versions of applications without affecting the performance or even the security of the production landscape. Testing should also confirm that the application or system can effectively perform the functions it was designed for.

Images  Testing should also confirm that SLA objectives can be met; that HA is implemented correctly; that sizing, replication, and load balancing have been correctly implemented; and that data integrity is not harmed by the application.

Automation and Orchestration

Images  Automation and orchestration are two ways to reduce time spent on tasks, increase the speed of technological development, and improve efficiencies.

Images  Automation uses scripting, scheduled tasks, and automation tools to programmatically execute workflows that were formerly performed manually.

Images  Orchestration manages automation workflows to optimize efficiencies and ensure effective execution. In orchestration, workflow automations are called runbooks, and each discrete step in a runbook is called an activity.

Change and Configuration Management

Images  Change management is the process of making changes to the cloud environment from its design phase to its operations phase in the least impactful way. Change management tracks changes to make auditing and future troubleshooting easier. It also ensures that change decisions follow due diligence and that all required tasks, such as risk analysis and stakeholder notification, are completed.

Images  Configuration management ensures that the assets and configuration items (CIs) required to deliver services are adequately controlled, and that accurate and reliable information about those assets and CIs is available when and where it is needed.

Images SELF TEST

The following questions will help you measure your understanding of the material presented in this chapter. As indicated, some questions may have more than one correct answer, so be sure to read all the answer choices carefully.

Testing Techniques

1.   Which configuration test measures the amount of time between a networked device’s request for data and the network’s response?

A.   Network bandwidth

B.   Network latency

C.   Application availability

D.   Load balancing

2.   Which of the following items is not included in a baseline?

A.   Performance

B.   Vulnerabilities

C.   Availability

D.   Capacity

Automation and Orchestration

3.   Which of the following would be used in orchestration to execute actions to automatically perform a workflow?

A.   Simulator

B.   Workplan

C.   Runbook

D.   Scheduled Task

4.   Which of the following is a scripting language?

A.   Cobol

B.   C++

C.   Java

D.   PowerShell

Change and Configuration Management

5.   Which of the following are objectives of change management? (Choose all that apply.)

A.   Maximize business value.

B.   Ensure that all proposed changes are both evaluated and recorded.

C.   Identify configuration items (CIs).

D.   Optimize overall business risk.

6.   Which of the following are objectives of configuration management? (Choose all that apply.)

A.   Protect the integrity of CIs.

B.   Evaluate performance of all CIs.

C.   Maintain information about the state of all CIs.

D.   Maintain an accurate and complete CMS.

7.   Dieter is a systems administrator in an enterprise IT organization. The servers he is responsible for have recently been the target of a malicious exploit, and the vendor has released a patch to protect against this threat. If Dieter would like to deploy this patch to his servers right away without waiting for the weekly change approval board meeting, what should he request to be convened?

A.   ECAB

B.   Maintenance window

C.   Service improvement opportunity

D.   CAB

Images SELF TEST ANSWERS

Testing Techniques

1.   Images   B. Testing network latency measures the amount of time between a networked device’s request for data and the network’s response. Testing network latency helps an administrator determine when a network is not performing at an optimal level.

Images   A, C, and D are incorrect. Network bandwidth is the measure of throughput and is impacted by latency. Application availability is something that needs to be measured to determine the uptime for the application. Load balancing allows you to distribute HTTP requests across multiple servers.

2.   Images   B. Vulnerabilities are discovered through vulnerability management, which is not a function of baselining. Organizations may track the number of vulnerabilities and the remediation of those vulnerabilities, but as a business metric, not a baseline. A baseline is used to better understand normal performance so that anomalies can be identified.

Images   A, C, and D are incorrect. Baselines commonly include performance, availability, and capacity metrics to gain an understanding on how the system normally operates.

Automation and Orchestration

3.   Images   C. A runbook is a workflow automation that can be used in orchestration tools.

Images   A, B, and D are incorrect. A workflow simulator would not actually execute the actions as the question requires; it would only simulate execution. A workplan is a document describing the work that goes into a project; it is not used to execute actions in a workflow. A scheduled task may be used to run an action, but it does not have the level of intelligence required in orchestration, nor does it have the visualization features that orchestration offers.

4.   Images   D. PowerShell is a scripting language for Windows.

Images   A, B, and C are incorrect. Cobol is a mature programming language primarily used on mainframe and legacy ERP systems. C++ is an object-oriented programming language, and Java, not to be confused with JavaScript, is a platform-independent programming language.

Change and Configuration Management

5.   Images   A, B, and D are correct. Maximizing business value, ensuring that all changes are evaluated and recorded, and optimizing business risk are all objectives of change management.

Images   C is incorrect. Identification of configuration items is an objective of the configuration management process.

6.   Images   A, C, and D are correct. The objectives of configuration management are identifying CIs, controlling CIs, protecting the integrity of CIs, maintaining an accurate and complete CMS, and providing accurate configuration information when needed.

Images   B is incorrect. Evaluation of the performance of specific CIs is the responsibility of service operations, not configuration management.

7.   Images   A. Dieter would want to convene an emergency change advisory board (ECAB). The ECAB follows the same procedures that a CAB follows in the evaluation of a change; it is just a subset of the stakeholders that would usually convene for the review. Because of the urgency for implementation, convening a smaller group assists in expediting the process.

Images   B, C, and D are incorrect. A maintenance window is an agreed upon, predefined time period during which service interruptions are least impactful to the business. The requested change may or may not take place during that time frame based on the urgency of the issue. Service improvement opportunities are suggested changes that are logged in the service improvement register to be evaluated and implemented during the next iteration of the life cycle. Life cycle iterations do not happen quickly enough for an emergency change to be considered even as a short-term service improvement item. CAB is close to the right answer, but based on the urgency of this request, Dieter likely could not wait for the next scheduled CAB meeting to take place before he needed to take action. The risk of waiting would be greater than the risk of deploying before the CAB convenes.
