Chapter 11. IT Operations


The IT operations organization is responsible for maintaining a secure and reliable production environment. In large organizations, operations often resembles a small army with so many divisions that it is hard to navigate, and it is often held responsible when things go wrong. Developers, working on the bleeding edge of technology, often regard their colleagues in operations as lacking technical skills and ability, which is true insofar as operations resources tend to focus more on the day-to-day running of the systems. In this chapter, we will discuss how to create an effective IT operations group that is aligned with your agile ALM.

11.1 Goals of IT Operations

The goal of IT operations is to ensure that your production systems are secure and reliable. Obviously, service interruptions can adversely affect the organization. If your business is to be successful, then your IT operations must be staffed with skilled resources and your processes must be able to handle the business demands while avoiding errors. We view IT operations as a strong partner with development, one that should also be aligned with agile principles and practices. Although IT operations often has its own terminology, usually based upon the ITIL v3 framework, we have found it helpful to encourage our colleagues in operations to learn agile concepts and terminology. There is much synergy between ITIL and agile and also much value in the operations group being able to share in the journey to agile development. As we have been discussing, DevOps teaches us that IT operations should embrace the goal of effective communication and collaboration with developers and other stakeholders within the organization.

11.2 Why Is IT Operations Important?

Without effective IT operations, your organization will suffer from service interruptions that will adversely affect your business and profitability. Effective IT operations will empower your organization to be able to meet and exceed your customers’ expectations while maintaining a high level of security and reliability. Too many people view IT operations as being an afterthought in the agile ALM. We view operations as being an essential part of an effective and mature agile ALM, especially in organizations that want to embrace DevOps. It is also important for you to decide which of these functions and practices are needed by your organization. You will find that many of them go onto a list to be implemented later as the project matures. Make note of that because the time will come faster than you think, and you want to be prepared to enable the structure you need when you need it.

11.3 Where Do I Start?

Industry standards (e.g., IEEE, ISO) and frameworks (e.g., ITIL, COBIT, CMMI) provide an excellent blueprint for a comprehensive and efficient approach to IT operations. As always, we like to start by assessing existing practices. Start by getting input from key stakeholders on what is being done well and what could be improved. Your assessment should compare current practices to the guidance that you find in the well-respected ITIL v3 framework. Standards from the IEEE and ISO may be important as well, depending upon the industry that your organization falls within. We always start by documenting the existing practices “as-is,” then use industry standards and frameworks to define “to-be,” and finally create the plan to improve (or establish) the IT operations processes. One other key point should be mentioned here: we always start small. By picking a few easy-to-achieve items to improve first, your team will realize that change is not only possible, but within reach. Once your team realizes that change is attainable, the organizational culture will improve, and you will likely see real progress.

There have been many dramatic incidents in the news recently demonstrating the catastrophic results of operations errors, especially in terms of failed upgrades to complex trading systems.

It Happened at Knight

The Knight Capital Group (KCG) was a financial services firm that was engaged in high-frequency trading. In August 2012, an application upgrade mistakenly caused the firm to purchase stock that it did not want, leading to a trading loss that grew to 460 million dollars. The firm was subsequently acquired by Getco, and the merged company became KCG Holdings. This was a dramatic example of how an operations upgrade error could literally lead to a company going out of business.

Operations plays a key role in identifying and mitigating sources of risk. Establishing a robust environment monitoring system can provide a much-needed early warning system that makes the difference between detecting and addressing potential problems and suffering a catastrophic outage such as what happened at Knight Capital Group.

As you read through each of these practices, consider whether you will need them now, as the project nears delivery, or when the system is fully in production. You cannot implement every one of these functions overnight. You cannot even fix them all at once, but you can get a comprehensive list of process improvement initiatives and implement them in alignment with your project and organizational needs.

11.4 Monitoring the Environment

Environment monitoring is often a function that is overlooked in many organizations. Understanding runtime dependencies and associated events is actually a key requirement to ensure that your systems do not suffer serious issues that could result in outages. Event monitoring can be an essential first step to preventing such mishaps.

11.4.1 Events

Events can be as simple as alerts associated with finite resources such as memory or as complex as application or operating system resources, often only really understood by the programmer writing the code. We have seen many situations where even senior developers did not fully understand the development frameworks within which they are working. If developers themselves struggle to understand dependencies, then obviously operations must get involved from the beginning of the development process to capture the information needed to support complex IT systems and ensure reliable service.

Hedge Fund Trading Systems

We have seen complex trading systems written using frameworks such as Microsoft .Net that have so much complexity that even the developers writing the code did not fully understand the underlying dependencies. When these systems operate under extreme situations, involving millions of transactions, it is often difficult to really understand all of the underlying dependencies. In practice, we identify what we know in advance and then continuously improve our knowledge base, sometimes as the result of an incident or problem.

We find ourselves defining which events need to be monitored based upon what we see developers using to diagnose problems. What is interesting about this knowledge management challenge is that most technology professionals cannot define these constraints up-front, but immediately point them out when troubleshooting—for example, shutting down a Tomcat web application, followed by checking processes to verify that they have in fact terminated as expected, or noting a job that normally takes 20 minutes suddenly is completing in seconds, which may indicate that the job did not actually process successfully. Obviously, identified exceptions should be monitored, but we are more often dealing with the as-yet-undefined events to be monitored. Event monitoring requires an effective DevOps approach in which the operations team works closely with the developers who wrote the code to identify exactly what can be monitored versus what should actually be reported in the form of alerts or exceptions.
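One of the troubleshooting heuristics mentioned above, a job that normally takes 20 minutes suddenly completing in seconds, can be turned into a simple monitored event. The following is a minimal sketch, not a definitive implementation; the function name, the 10 percent threshold, and the sample durations are assumptions chosen only for illustration.

```python
from statistics import median


def is_suspiciously_fast(history_seconds, latest_seconds, min_fraction=0.1):
    """Return True when the latest run took less than min_fraction of the
    median historical duration (the threshold is a hypothetical default)."""
    if not history_seconds:
        return False  # no baseline yet, so there is nothing to compare against
    baseline = median(history_seconds)
    return latest_seconds < baseline * min_fraction


# Illustrative data: a job that normally takes about 20 minutes (1200 seconds).
history = [1180, 1215, 1202, 1240, 1195]
print(is_suspiciously_fast(history, 8))     # True: finished in 8 seconds
print(is_suspiciously_fast(history, 1190))  # False: within the normal range
```

A check like this is worth reviewing with the developers who wrote the job, since they are usually the ones who know which durations are actually plausible.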

Two closely related monitoring capabilities are filtering events and correlating events across systems.

Troubleshooting Disk Space Shortage

We were troubleshooting an Atlassian system that was showing disk space errors in the logs. A quick check of the application server showed that we had plenty of space, which led us to contact the DBAs, who initially denied that there was any problem with the Oracle database. The problem turned out to be a storage volume on the Oracle database server. We would not have thought to even check on this if we had not first seen errors in the logs. Event monitoring can be a key strategy in identifying problems that must be addressed before more serious problems occur.
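The disk space story is a case of filtering events and then correlating them across systems. A minimal sketch of that idea follows, assuming events have already been collected into simple records; the `Event` class, the host names, the log messages, and the five-minute window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    host: str
    timestamp: datetime
    message: str


def correlate(events, keyword, window=timedelta(minutes=5)):
    """Filter events by keyword, then pair up matches from different hosts
    that occurred within `window` of each other."""
    matches = sorted(
        (e for e in events if keyword in e.message.lower()),
        key=lambda e: e.timestamp,
    )
    pairs = []
    for i, first in enumerate(matches):
        for second in matches[i + 1:]:
            if second.timestamp - first.timestamp > window:
                break  # sorted by time, so later events are even further away
            if second.host != first.host:
                pairs.append((first, second))
    return pairs


# Illustrative events from a hypothetical application server and database server.
t0 = datetime(2024, 1, 1, 9, 0)
events = [
    Event("app01", t0, "ERROR: no space left on device"),
    Event("db01", t0 + timedelta(minutes=2), "WARN: volume space usage at 99%"),
    Event("app01", t0 + timedelta(hours=1), "INFO: nightly job finished"),
]
for first, second in correlate(events, "space"):
    print(first.host, "<->", second.host)
```

Pairing the application error with the database server warning is what points the investigation at the storage volume rather than at the application server itself.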

Monitoring events is crucial for ensuring reliable systems and avoiding costly service interruption, but sometimes incidents do occur. What matters most is how you identify and address these challenges when they occur.

11.4.2 Incidents

Bad things happen. Organizations succeed or fail based upon their ability to recognize, assess, and respond to unexpected and unplanned outcomes. Successful organizations pull together and really show what they are made of when incidents occur and everyone on the team responds to the call for “all hands on deck.” You should always have a dedicated team responsible for organizing the response and communicating with all stakeholders when incidents inevitably occur. This team is usually called the critical incident response team (CIRT). The CIRT prepares for incident response by identifying stakeholders who should be notified and those who are on call to immediately address the response itself.

Intimidation and Incident Response

Bob participated in the incident response team for a large New York City–based trading firm. When bad things happened, a very senior manager would chair the incident response team, which consisted of representatives from each of the affected areas. This manager had a strong command-and-control demeanor, often bordering on extreme abruptness. Members of the team were afraid to speak up and offer their views for fear that they would be publicly criticized and even blamed for the outage.

Some incidents can be addressed and resolved in a timely manner. This is usually when the cause of the incident is well understood and the steps required to address the issue are easily identified and assignable to team members to complete. However, sometimes the underlying cause of a problem is not immediately evident. When root-cause analysis is required, a wider and more in-depth process, usually known as problem management, is needed.

11.4.3 Problems

Problem management is most often associated with the need for root-cause analysis. Sometimes, the immediate issue has already been addressed and systems are back online, but perhaps there is a concern that the issue could occur again. Sometimes, this is due to faulty hardware that needs to be examined by the vendor—and may even require plans for future upgrades in response to increased capacity demand.

We also see situations where problems with the application occur and there needs to be further investigation to avoid similar problems in the future.

Reboot or Not?

Although trivial, we often see situations where a system is malfunctioning and we suspect that rebooting the machine will solve the problem—and it often does. However, rebooting the machine may also make it harder to evaluate the root cause of the problem and means that we will be back on the CIRT next week again. There is always a judgment call to be made on whether we continue triaging and investigating or we “punt,” reboot the machine, and get our users back online. There are no easy answers in these situations. Make sure to capture any available logs, renaming them to avoid confusion. With vendor products, we usually have a script that we run to capture the state of the machine before the reboot, and we send that information to the vendor for evaluation. You may want to create a similar support tool for your application and systems investigation.
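A support tool of the kind described here can start very small. The sketch below only covers the log-capture step: it copies log files into a timestamped folder and renames each copy so it cannot be confused with the logs written after the restart. The function name, the `*.log` pattern, and the folder layout are assumptions; a real tool would also capture process lists, memory usage, and whatever else your vendor or developers ask for.

```python
import shutil
import time
from pathlib import Path


def snapshot_logs(log_dir, archive_root, label="pre-reboot"):
    """Copy every *.log file into a new timestamped folder and return it."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(archive_root) / f"{label}-{stamp}"
    target.mkdir(parents=True, exist_ok=True)
    for log_file in sorted(Path(log_dir).glob("*.log")):
        # Prefix each copy with the timestamp so it is not mistaken for the
        # fresh log that the application writes after the restart.
        shutil.copy2(log_file, target / f"{stamp}-{log_file.name}")
    return target
```

Calling `snapshot_logs("/var/log/myapp", "/var/tmp/incident-archive")` (both paths are placeholders) before the reboot leaves the original files untouched and gives the CIRT a single folder to hand to the vendor or the developers.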

Incident and problem support are closely related and usually involve many of the same stakeholders. In some organizations, there is a dedicated production support function which consists of developers who take responsibility for the maintenance and hands-on support for legacy applications, whereas other developers may be engaged in writing the next generation of applications.

11.5 Production Support

Some organizations have a separate function called production support, which is responsible for maintaining and supporting applications that are in use. These systems are often legacy applications requiring specialized skill sets that might be difficult to acquire. Whereas some technology professionals are glad to stay within their comfort zone, even if the legacy technology is older and not as much in demand, many developers avoid working with these older applications. Production support is often responsible for managing application patches and even system upgrades. We have worked with production support engineers who were highly skilled technology professionals with strong technical backgrounds, including software and systems development. It could be argued that production support is inherently a DevOps function because it is usually placed in the operations organization, but staffed with technology professionals who have the skills and expertise to maintain production systems.

We see production support sometimes being outsourced, or even offshored, sometimes with mixed results. Organizations need to make prudent decisions about the value of their legacy applications and the cost of a service interruption.

Who Needs Mainframe Programmers?

We recall one organization that decided there was very little mainframe development going on and therefore decided to eliminate their mainframe programmers, offshoring the entire production support function to a vendor in India. Managers congratulated themselves on their wise decision to save money by eliminating their expensive onshore support analysts who had been with the company for 20+ years. Most of these colleagues took their exit packages and “cried” all the way to the bank. But the real tears were shed the next time an outage occurred and it became evident that the offshore team lacked the skills and expertise to resolve the problems in a timely manner.

Production support should be focused on maintaining the institutionalized knowledge essential for maintaining legacy applications. This information is often highly specialized and difficult to come by. The production support function should also be focused on developing support tools, including environment monitoring, to ensure uninterrupted service. It is not uncommon for the production support team to take responsibility for creating and maintaining the deployment pipeline (which we discuss throughout this book, including Chapter 9). Production support relies heavily upon the information that they receive either in terms of enhancement requirements or reports of defects. The help desk is often the first line of defense in gathering this information and reporting problems to the production support team.

11.6 Help Desk

As mentioned, the help desk is often your first line of defense and the difference between your customers having a positive impression of your organization and becoming totally annoyed and determined to take their business elsewhere. Customers and end users can often be a tolerant group of people. Even nontechnical end users understand that sometimes systems go “bump in the night.” When bad things happen, the response from the help desk will directly preserve customer satisfaction or conversely lead to frustration and anger. Think about your own experiences contacting a help desk. When you feel that information is gathered professionally and you are being kept well informed, you are much more likely to form a positive impression and tolerate even significant service interruptions. We have seen help desks save the day with their professionalism and customer service.

It is common for service centers to have a centralized office for the help desk, and there is certainly much value in being able to train and manage your help desk team in a collocated environment. But we also see a growing trend toward virtual help desks, which may be spread across many locations, often taking a follow-the-sun approach, handing off open tickets from one group to another.

11.6.1 Virtual Help Desks

Many organizations create help desks that are spread across different locations and effectively follow the sun. Sometimes, these help desks are also organized by function, so your DBAs may be located in India, whereas your network operations team is in Europe. Virtual help desks should always provide a common interface and consistent communication, regardless of where they are physically located.

Following the Sun from Jerusalem

Bob likes to work from various locations, especially in the Middle East. On one recent trip, he discovered the advantages of being located in that time zone. Mornings were a convenient time to coordinate with both European and India-based resources, and evenings were dedicated to working with U.S.-based teams. Thanks to afternoon naps, Bob found himself well rested and able to facilitate communication more productively than when he was in the United States. More than a few people commented that it seemed as if he were online around the clock.

Closely related to virtual help desks is remote work. We see that many companies hold on to highly skilled employees by offering them flexible work-from-home arrangements.

11.6.2 Remote Work

Many technology professionals, including help desk analysts, are finding that working from home on a part-time or full-time basis helps them maintain a quality of life that would be inaccessible if they needed to be in the office every day. Working mothers (and dads), too, are among those who often appreciate flexible work arrangements. Salaries are typically lower for remote positions, and we have seen situations where highly skilled employees were given more flexibility to work from home in lieu of more compensation. One colleague sold his home and moved into his vacation home year-round. Obviously, there is some saving on commuting costs and perhaps some other incidentals. There is also, however, the risk of isolation and a lack of face-to-face interactions. There also may be a lack of upward mobility relative to that afforded to colleagues who are in the office and interacting with senior management on a daily basis. We have also heard some folks complain that working remotely on a help desk can get boring and repetitive. Although not yet a widespread practice, we have heard of some organizations experimenting with gaming and virtual worlds to address this challenge.

11.6.3 Virtual World Help Desk

We have reviewed some clever help desk interfaces that transform the monotonous help desk function into participating in a cool virtual world interface and others that simulate games. We see this approach becoming more prevalent in the coming years.

There are times when help desk resources do not have the required expertise or perhaps intimate familiarity with the application to solve the problem. At times like these, it may be necessary to have developers participate in the help desk activities.

11.6.4 Developers on the Help Desk

We will discuss help desk escalation later in this chapter. For now, we will note that sometimes help desk staff must reach out to and pull developers into the help desk to provide their intimate knowledge of the system in order to assess, evaluate, and resolve issues. Getting experts involved is important, but ensuring that their expertise becomes part of the permanent knowledge base is where IT process automation becomes a valuable asset.

11.7 IT Process Automation

IT process automation helps capture specific steps to assess, evaluate, and respond to specific help desk requests. This approach helps by capturing and automating the necessary technical steps that are typically performed by skilled specialists who may not always be available to assist with responding to help desk issues and also routine requests. IT process automation is often used to handle common access requests such as unlocking user accounts, but also is useful in dealing with more complicated challenges. Some of these requests can be handled by help desk engineers, but most often IT process automation shows its real value when issues have to be escalated to specialized experts who are not always available. Closely related is knowledge management.
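To make the idea concrete, here is a minimal sketch of a captured procedure for the account-unlock example. The `Runbook` class, the step names, and the in-memory `accounts` dictionary are hypothetical stand-ins; in practice each step would call your directory service or ticketing system.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Runbook:
    name: str
    steps: list = field(default_factory=list)  # (description, action) pairs

    def step(self, description: str, action: Callable[[dict], bool]):
        self.steps.append((description, action))
        return self  # allow chaining when the runbook is defined

    def run(self, context: dict):
        """Run steps in order, stop at the first failure, and return a log."""
        log = []
        for description, action in self.steps:
            ok = action(context)
            log.append((description, "ok" if ok else "failed"))
            if not ok:
                break
        return log


# Hypothetical in-memory directory standing in for a real account system.
accounts = {"jsmith": {"locked": True, "active": True}}


def account_exists(ctx):
    return ctx["user"] in accounts


def account_is_active(ctx):
    return accounts[ctx["user"]]["active"]


def unlock(ctx):
    accounts[ctx["user"]]["locked"] = False
    return True


unlock_account = (
    Runbook("unlock-account")
    .step("verify account exists", account_exists)
    .step("verify account is active", account_is_active)
    .step("unlock account", unlock)
)
print(unlock_account.run({"user": "jsmith"}))
```

The returned log doubles as a record of what was done, which is what allows a help desk engineer to run the procedure when the specialist who designed it is not available.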

11.7.1 Knowledge Management

With the right experts available, almost any problem can be diagnosed and addressed. The real challenge is that skilled resources may be busy with other tasks or may have gone on to other projects. Help desk personnel need to constantly capture and store the steps necessary to identify issues and then fix them.

The Accidental AIX Admin

Bob was working at an international bank when the entire AIX administration team decided to resign on the same day. AIX is IBM’s robust and complicated open-standards-based Unix operating system, and the AIX environment was rather complicated when it came to supporting the particularly complex DFS¹ storage systems. Bob had no prior experience with AIX, but was able to perform this function successfully using both the group’s internal knowledge base and IBM’s extensive knowledge resources, along with their online support services.

1. Often called the DCE Distributed File System (DFS).

IT process automation captures technical steps. Equally important is a workflow automation tool to ensure that processes are repeatable and fully traceable.

11.8 Workflow Automation

Workflow automation can be used to identify the required steps and then guide the full lifecycle of a request, ensuring that all issues are handled in a timely manner while implicitly capturing important information about how issues are evaluated, diagnosed, and addressed.

What Process Do We Need?

We come across situations where even the subject matter experts are not completely certain how the process should be specified. They may also have difficulty verbalizing the steps, including checkpoints, in a specific sequence. This is where workflow automation tools that allow you to visualize the steps and then iteratively improve the process are worth their weight in gold. When in doubt about the exact sequence required, we put up a draft process, as we have learned that many people find it easier to tell us what is wrong with the proposed process than to tell us the required steps up-front.

Workflow automation tools also help by ensuring that issues and requests do not fall through the cracks and are automatically escalated if not acted upon within a specific timeframe. Most importantly, they facilitate traceability and communicating status to all stakeholders.
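The automatic escalation rule can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular workflow tool works: the `Ticket` fields, the four-hour limit, and the three support levels (described in Section 11.10) are hypothetical defaults.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    ticket_id: str
    opened_at: datetime
    level: int = 1
    acknowledged: bool = False


def escalate_overdue(tickets, now, max_wait=timedelta(hours=4), top_level=3):
    """Raise the support level of every unacknowledged ticket that has waited
    longer than max_wait, and return the IDs that were escalated."""
    escalated = []
    for ticket in tickets:
        overdue = not ticket.acknowledged and now - ticket.opened_at > max_wait
        if overdue and ticket.level < top_level:
            ticket.level += 1
            escalated.append(ticket.ticket_id)
    return escalated


# Illustrative queue: one stale ticket, one fresh, one already acknowledged.
now = datetime(2024, 1, 1, 12, 0)
queue = [
    Ticket("T-1", now - timedelta(hours=6)),
    Ticket("T-2", now - timedelta(hours=1)),
    Ticket("T-3", now - timedelta(hours=9), acknowledged=True),
]
print(escalate_overdue(queue, now))  # ['T-1']
```

A production tool would also reset the clock after each escalation and notify the stakeholders, which is where the traceability benefit comes from.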

11.9 Communication Planning

Communications planning is a fundamental function within any organization and one that is often not handled effectively. All too often, incidents occur and then we scramble to figure out who should be notified. It is essential to create a comprehensive communications plan to notify key stakeholders affected by specific incidents.

When the Centralized Version Control System Goes Down

We are often consulted when the centralized version control system is impacted by required maintenance or actually has a serious issue that interrupts service. Planned maintenance can often be handled by communicating with project managers, who disseminate information as they see fit, along with similar notifications for other systems. But when systems affecting one or two thousand developers go down, we often have to make an immediate e-mail notification to all impacted stakeholders. Most important is notifying our colleagues as to current status, ETA for problem resolution, and any steps that they may be able to take in the meantime to minimize disruption. Communication is always essential.
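A communications plan can be captured as data so that nobody has to work out the audience in the middle of an outage. The sketch below is one possible shape; the plan contents, audience names, and message fields are hypothetical and simply mirror the status, ETA, and interim steps mentioned above.

```python
# Hypothetical plan: which audiences hear about which kind of event.
COMMUNICATION_PLAN = {
    "version-control": {
        "planned-maintenance": ["project-managers"],
        "outage": ["all-developers", "project-managers", "it-operations"],
    },
}


def build_notification(system, kind, status, eta, workaround=None):
    """Look up the audiences for this event and draft the message text."""
    recipients = COMMUNICATION_PLAN[system][kind]
    lines = [
        f"System: {system}",
        f"Current status: {status}",
        f"Estimated resolution: {eta}",
    ]
    if workaround:
        lines.append(f"In the meantime: {workaround}")
    return {"to": recipients, "body": "\n".join(lines)}


notice = build_notification(
    "version-control",
    "outage",
    status="repository server is unreachable",
    eta="14:00 UTC",
    workaround="keep working locally and hold check-ins until further notice",
)
print(notice["to"])
print(notice["body"])
```

Keeping the plan in one reviewed place means that planned maintenance still flows through project managers, while an outage reaches every affected developer directly.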

Just as effective communication is very important, poor communication can be particularly disruptive. In organizations that suffer from dysfunctional communication patterns, immediate and powerful interventions are needed to address these behaviors.

11.9.1 Silos within the Organization

The most common problem we come across is groups that have formed a dysfunctional, insular culture where they feel all information should stay within the group. Siloed behavior causes significant disruptions, builds resentment, and represents the antithesis of DevOps principles and practices. Unfortunately, siloed behavior is also very common in many organizations. The key is to understand the root cause of this dysfunctional culture.

DBAs in Secret

We have come across database administrators who act as if they are working for a separate company, refusing to communicate and collaborate with their colleagues and insisting that their databases are just fine. We dealt with a systems outage that pointed back at the DB2 database and noticed a couple of files that had mysteriously been modified. We consulted with a couple of the DBAs, who insisted that nothing had been changed. Curiously enough, soon after these discussions, the files were again modified and the database started working again. The problem was fixed as mysteriously as it had occurred. Make sure that you implement our best practices around watching and being notified of unauthorized changes. We knew who had dropped the ball, but our colleague was too embarrassed to admit his mistake.
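One simple way to watch for unauthorized changes is to record a cryptographic fingerprint of each controlled file and compare it against a known baseline. The following is a minimal sketch using SHA-256 digests; the function names are assumptions, and a real deployment would store the baseline somewhere tamper-resistant and send an alert rather than just return a list.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Map each file path to the SHA-256 digest of its current contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_files(baseline, current):
    """Return paths whose digest differs, plus any that appeared or vanished."""
    return sorted(
        path
        for path in baseline.keys() | current.keys()
        if baseline.get(path) != current.get(path)
    )
```

Taking a `fingerprint` after each approved change and running `changed_files` on a schedule would have flagged the mysteriously modified database files without relying on anyone to admit to the change.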

We come across groups that seem completely unable to communicate with anyone outside of their team. In some cases, the team is in a different location, as is common with offshore support groups. Sometimes, we find that team members are sensitive about their language skills, so they effectively shut down and do not try to communicate. Far worse, we also see teams that have stopped communicating because of an overly domineering manager who is quick to criticize and insists on micromanaging all activities of the team. In these cases, you have two choices. You can either reduce the team’s influence by having them focus only on specific well-known tasks (communicating through a workflow automation tool), or you can get the team a new manager.

Effective communication is essential, especially when situations have to be escalated to involve a wider group to help address a problem.

11.10 Escalation

When incidents and problems are not being resolved in a timely manner, then you need to have a plan to escalate the response from the team. Obviously you want to start with the first line of defense, which is typically called level 1.

11.10.1 Level 1

As mentioned, the first line of defense is normally called level 1 and consists of help desk engineers who are capable of quickly assessing challenges and coming up with quick fixes. These professionals are usually quite capable at determining what type of problem has occurred and might even know the solution (often by consulting an internal knowledge base). When a particular issue cannot be resolved immediately, then the level 1 engineer typically packages up the available information and refers it to more experienced (and specialized) engineers, who are typically known as level 2 engineers.

11.10.2 Level 2

Aside from being highly skilled, level 2 engineers typically have specialized resources available to them, including access to the developers who wrote the code, who otherwise may be kept isolated from the end user so that they can focus on developing new features and products. Level 2 engineers are typically specialists and develop a deep level of expertise within a specific area. When a level 2 engineer cannot resolve the issue with all of his or her available resources, the issue is typically escalated to the developers who wrote the code.

11.10.3 Level 3

Level 3 usually refers to the developers who wrote the code and have the most specialized knowledge. They also typically know backdoor techniques to diagnose and resolve problems, often in the form of undocumented RESTful API calls, which produce information that can help diagnose and resolve problems. If nothing else, these folks are usually empowered to identify an issue as being a defect that must be addressed in a future release.

House Calls and Level 3

Bob was an early adopter of a large-scale robust enterprise version control system. But shortly after implementation, a defect was discovered, which went through the vendor’s escalation process. Once the problem was identified, the vendor actually flew the developer out to the customer site. The developer explained the exact cause of the issue and handed over a CD-ROM with a fix. Bob was then Cc’d on all correspondence related to fixing this defect in a future release of the product. That was excellent customer service, and a day with the developer resulted in a deep dive into exactly how the product worked, which was valuable for future support. No one was angry that the product had a defect. In fact, everyone was thrilled to discover just how responsive and committed to providing excellent service the vendor was.

Escalation is, by its very nature, an excellent example of taking a DevOps approach. In fact, IT operations needs to adopt DevOps best practices even within the IT operations group itself.

11.11 DevOps

DevOps is discussed in detail in Chapter 12 and is a common theme throughout this book. That said, DevOps within IT operations deserves some specific consideration. We find that many enterprise IT operations groups are so siloed that they really need to focus specifically on improving collaboration and communication within the IT operations organization itself. We also find that operations sometimes goes off and builds infrastructure without involving the development team in the effort. It has been our experience that this usually occurs when IT operations wants to build up its own expertise before bringing in the development team, often after being criticized by the developers for not being technical enough. We recently saw this lack of communication when the operations team picked one workflow orchestration tool and the developers picked a different tool for deployment automation. Each organization spent time and money implementing a solution without involving the other. Our only recourse was to define one part of the deployment process that would be handled by the operations tool and the other part that would be handled by the tool chosen by the developers.

11.12 Continuous Process Improvement

IT operations processes must be continuously reviewed and improved. The best approach is to adopt a culture of continuous process improvement. Although many operations organizations will survey the business and end user for feedback, we rarely see them asking the developers for their input, which is most unfortunate. Developers are indeed “customers” of IT operations and should be part of the continuous process improvement initiatives.

We like to encourage assessments of existing practices before trying to improve the way that we do things. As discussed previously, we like to survey stakeholders, asking what is being done well and what could be improved. The assessment should take the form of documenting “as-is” practices and then planning for improvements or “to-be” practices based upon industry best practices. One excellent way to guide this effort is by using industry standards and frameworks.

11.13 Utilizing Standards and Frameworks

Industry standards and frameworks provide an excellent starting point for determining best practices. ITIL v3 has become well regarded by both corporations and government agencies. Depending upon your industry, the ISACA COBIT framework may also be valuable (especially for showing compliance with Section 404 of the Sarbanes-Oxley Act of 2002). Although less popular in business, there are still some areas where the CMMI is well accepted. We acknowledge our preference for using the guidance in the IEEE and ISO standards. (In full disclosure, it should be noted that Bob has had a long-time involvement with the IEEE Software Standards board.)

Who Needs Standards Anyway?

Too often, our colleagues decide that industry standards and frameworks are not practical. With a little probing, we usually find out that our friends have not spent much time really reading and understanding these documents. Although we agree that sometimes industry standards and frameworks must be tailored for specific situations, we usually find that they are a great starting point, and their use helps to avoid missing key steps that might otherwise be overlooked. They also have the powerful value of being very credible in terms of showing compliance with common regulatory and audit requirements.

The ITIL v3 framework contains comprehensive guidance on establishing all of the functions and processes required for an effective IT operations organization. In the next section we will briefly review the ITIL v3 processes that are most relevant for establishing DevOps practices for application deployment.

11.13.1 ITIL v3

The ITIL v3 release control and validation (RCV) framework describes service management processes. It begins with the need to establish an effective service strategy that leads to a detailed service design, which is described in the service design package. The ITIL v3 framework describes the following RCV processes that transition new services from design into operation:

• Change management

• Service asset and configuration management

• Release and deployment management

• Service validation and testing

• Change evaluation

• Request fulfillment

• Knowledge management

As discussed in Chapter 10, change management processes evaluate, authorize, and implement changes to services that typically occur when you implement new components or change existing components. Change management processes are designed to

• Evaluate potential downstream effects from a change

• Reduce risk

• Improve communication to all stakeholders

The change advisory board (CAB) is the governing body that reviews each proposed change. Change management drives the entire release control and validation process. Effective change management balances risk with the need to implement new changes that deliver value to a business and make it competitive. Change management relies on workflow automation and accurate, up-to-date information on assets and configuration management baselines.

Service asset and configuration management (SACM) processes manage the software assets and maintain accurate information about configuration items. These processes create and manage baselines that track the status of changes to each configuration item. SACM processes provide all of the other processes and functions, such as the CMDB and the CMS, with accurate and up-to-date information on the status of configuration baselines.

Each software asset needs an owner who is responsible for the asset and can identify the subject matter experts who can accurately assess and evaluate the potential downstream effect of a change.
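The idea of an owner identifying who must assess the downstream effect of a change can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ConfigurationItem` class, the `DEPENDENTS` map, and the team names are hypothetical, and a real organization would draw this information from its CMDB rather than from a hard-coded dictionary.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigurationItem:
    """A configuration item (CI) with a named owner."""
    name: str
    owner: str


def downstream_impact(changed, dependents):
    """Return the names of all CIs that directly or indirectly consume `changed`.

    `dependents` maps a CI name to the names of the CIs that consume it
    directly. A breadth-first search collects every reachable consumer.
    """
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for consumer in dependents.get(current, []):
            if consumer not in seen and consumer != changed:
                seen.add(consumer)
                queue.append(consumer)
    return seen


def owners_to_consult(changed, dependents, items):
    """Return the sorted owners of the changed CI and of everything downstream."""
    affected = downstream_impact(changed, dependents) | {changed}
    return sorted({items[name].owner for name in affected})


# Hypothetical inventory, for illustration only.
ITEMS = {
    "auth-lib": ConfigurationItem("auth-lib", "security-team"),
    "orders-api": ConfigurationItem("orders-api", "orders-team"),
    "billing-api": ConfigurationItem("billing-api", "billing-team"),
    "web-portal": ConfigurationItem("web-portal", "web-team"),
}
DEPENDENTS = {
    "auth-lib": ["orders-api", "billing-api"],
    "orders-api": ["web-portal"],
    "billing-api": ["web-portal"],
}

if __name__ == "__main__":
    print(sorted(downstream_impact("auth-lib", DEPENDENTS)))
    print(owners_to_consult("orders-api", DEPENDENTS, ITEMS))
```

In this sketch, a change to the shared library reaches every service built on it, so the change request would be routed to each of those owners for assessment.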

It is important to understand the interfaces between configuration items. SACM tracks changes to baselines. This tracking ability is essential for traceability. For many organizations, the tracking is required by compliance with federal regulatory and audit controls. With up-to-date information, it is much easier to operate the business, to ensure agility to implement desired changes, to conduct change planning, and to respond to incidents when they do occur. In many ways, SACM is the “glue” that holds together the other release control and validation functions. The activities in SACM include the following:

• Management and planning

• Configuration identification

• Configuration change control

• Status accounting

• Verification and audit

The software configuration management plan (SCMP) provides the strategy for how to handle all of the activities required for successful application build, package, and deployment phases of code development. The planning section can simply specify the schedule for the release iterations, or it can specify every aspect of the release and deployment process. The SCMP traditionally consists of four classic functions: configuration identification, status accounting, configuration change control, and configuration audit. Specify the following information for each of these four functions:

• Configuration identification: Specifies a naming convention for each configuration item and helps ensure that you select the correct configuration items for each release.

• Configuration change control: Includes specific procedures to manage changes to configuration items, to manage new releases, and to retire configuration items.

• Status accounting: Documents the path of a configuration item from its initial creation to end of life. Status accounting includes the status of configuration baselines and configuration items as they are developed.

• Verification and audit: Helps ensure that the correct configuration items are deployed. The audit information can be independently verified.

Organizations depend on well-defined processes to manage release and deployment. Release and deployment management (RDM) processes focus on the activities required to build, package, deploy, and test the new services. RDM creates detailed plans that help ensure that configuration items can be successfully built and tested. The plans emphasize automated procedures that are repeatable and verifiable. RDM makes it easier for stakeholders to understand what is being done and increases the likelihood that they are satisfied with the results as the service transitions from design to operations.

RDM prepares the build, including automated procedures, and helps ensure that testing is conducted and that the test environments are coordinated. RDM processes include tasks related to the initial pilot and early life support of the service that is being transitioned into operation. After RDM verifies these steps, it reviews and closes the transition effort.
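One way to make a deployment verifiable, in the sense described above, is to record a cryptographic digest of every file at build time and compare the deployed files against that record. The following Python sketch illustrates the idea; the function names and the shape of the manifest are our own assumptions for illustration and are not prescribed by ITIL.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(release_dir):
    """Record a digest for every file under the release directory.

    Keys are paths relative to the directory, using forward slashes.
    """
    root = Path(release_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_deployment(deployed_dir, manifest):
    """Compare a deployed directory with the manifest created at build time.

    Returns a dict with three sorted lists: files missing from the
    deployment, files whose contents differ, and unexpected extra files.
    """
    actual = build_manifest(deployed_dir)
    return {
        "missing": sorted(set(manifest) - set(actual)),
        "changed": sorted(
            name for name in manifest
            if name in actual and actual[name] != manifest[name]
        ),
        "unexpected": sorted(set(actual) - set(manifest)),
    }
```

A deployment whose report contains three empty lists matches the baseline exactly. Anything else is evidence that can be handed to an independent auditor.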

Service validation and testing helps ensure that the IT service is fit for purpose, in that it matches the requirements included in the design, and fit for use, in that it meets the requirements of its intended use.

The service validation and testing process also helps ensure that the service design package correctly specifies the requirements. In addition, it helps ensure that the service provides value to the business, that it performs well in production, and that it meets the target level of quality.

This process focuses on the activities required to

• Create and validate test plans

• Manage the test plans and test environments

• Conduct the test

• Verify the results

• Communicate the results to stakeholders

• Create test reports

• Evaluate the test according to exit criteria
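The last activity in the list, evaluating the test according to exit criteria, lends itself to automation. The sketch below is a minimal illustration in Python. The criteria themselves (a minimum pass rate and a ceiling on open critical defects) and the default thresholds are assumptions chosen for the example; your own exit criteria should come from the test plan.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Results of one test cycle."""
    executed: int
    passed: int
    open_critical_defects: int


@dataclass
class ExitCriteria:
    """Thresholds agreed in the test plan (values here are assumed)."""
    min_pass_rate: float = 0.95
    max_open_critical_defects: int = 0


def evaluate_exit(summary, criteria):
    """Return (met, reasons): whether the criteria are met and, if not, why."""
    reasons = []
    if summary.executed == 0:
        reasons.append("no tests were executed")
    else:
        pass_rate = summary.passed / summary.executed
        if pass_rate < criteria.min_pass_rate:
            reasons.append(
                f"pass rate {pass_rate:.0%} is below {criteria.min_pass_rate:.0%}"
            )
    if summary.open_critical_defects > criteria.max_open_critical_defects:
        reasons.append(
            f"{summary.open_critical_defects} critical defects remain open"
        )
    return (not reasons, reasons)


if __name__ == "__main__":
    met, reasons = evaluate_exit(RunSummary(100, 90, 2), ExitCriteria())
    print(met, reasons)
```

Returning the reasons alongside the verdict supports the earlier activities in the list, since the same output can be used to communicate results to stakeholders.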

After the test phase, you need to evaluate the change to ensure it delivers value to the business. Change evaluation helps us determine whether the change meets the client’s needs in terms of how the updated code is to be used and whether it delivers the expected level of service. Change evaluation ensures that you understand the intended and unintended effects of a change. Change evaluation monitors the predicted performance and manages risk.

Request fulfillment is part of operations and focuses on the process to complete routine requests. Request fulfillment streamlines the completion of routine requests to free resources to focus on more demanding and nonroutine requests.

11.13.2 Knowledge Management

The lifecycle processes covered under release control and validation provide the essential knowledge that is necessary to support reliable and effective services. The knowledge management process captures a wide array of information that can help drive the entire process. The configuration management database and the configuration management system are part of the service knowledge management system.

The ITIL v3 framework is certainly a well-defined and widely accepted framework. Many organizations are also obliged to follow the guidance in the ISACA Cobit framework.

11.13.3 ISACA Cobit

The ISACA Cobit framework consists of high-level IT controls, including change and configuration management. The control objectives provide guidance on how to establish many of the same controls as described in the ITIL v3 framework and the IEEE and ISO standards. Cobit is most closely associated with compliance with Section 404 of the Sarbanes-Oxley Act of 2002, although we have had healthcare companies tell us that the guidance is also well aligned with HIPAA and 21 CFR.

We view these well-documented industry best practices as a key starting point for defining how your IT controls should be established to guide all aspects of your IT operations organization.

11.14 Business and Product Management

IT operations must keep a constant focus on the business and product management objectives, especially in terms of understanding risk management and the need for fast and reliable service. There are many times when a discussion with the business and product management professionals can provide a specific perspective that may alter the approach we take to provide IT operations services. The most common adjustment is that the business would often rather take risks than slow down the rate of change. If the risks are well understood and communicated, we often find that our business and product colleagues are the first to encourage us to take calculated risks that can result in greater profitability.

Business and product management professionals provide valuable insights. It is also essential for IT operations to have strong technical leaders to help interface across the organization.

11.15 Technical Management

Strong technical managers can provide the glue that helps the team focus and operate effectively. We view this role as being analogous to the conductor of an orchestra. Technical managers need to provide leadership and coordination and ensure tasks are completed successfully. Closely related to, and often combined with, technical leadership is IT operations management, which is focused on service delivery and ensuring effective communication across the entire IT operations organization.

11.16 IT Operations Management

The IT operations management function helps ensure that individual teams are successfully working together to guarantee effective service delivery and especially excellent communication across the organization. IT operations should also be interfacing with managers in other parts of the organization, including development, information security, business, QA, and testing. Although IT operations oversees effective day-to-day activities, we also need to establish the rules of the road, and that is where operations controls become an important consideration.

11.17 IT Operations Controls

Controls are general guidelines or rules that are established to avoid costly mistakes, often developed in compliance with audit and regulatory requirements. We will discuss establishing IT controls further in Chapter 16, but in this section, we discuss how controls affect IT operations.

IT operations controls need to be both well understood and reasonable, or else the entire organization will look for every opportunity to bypass them. IT controls should be the guardrails that everyone is glad to see put in place, but that do not necessarily slow efforts to be successful and compete with other companies in the same space. One example of an IT control is the segregation of duties, where the person who writes the code should never be the person who compiles, packages, and deploys the code. Implemented properly, this IT control is viewed as being reasonable and effective.
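The segregation of duties control described above is simple enough to check automatically before a release is approved. The Python sketch below shows one possible form of that check. The `Release` record and its field names are hypothetical; in practice the authors would come from version control history and the builder and deployer from your build and deployment tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Who touched a release at each stage (hypothetical record)."""
    name: str
    authors: frozenset   # people who wrote code included in this release
    built_by: str        # person who compiled and packaged it
    deployed_by: str     # person who deployed it


def segregation_violations(release):
    """Return a list of messages, one per breach of segregation of duties."""
    violations = []
    if release.built_by in release.authors:
        violations.append(
            f"{release.name}: {release.built_by} wrote code and also built the release"
        )
    if release.deployed_by in release.authors:
        violations.append(
            f"{release.name}: {release.deployed_by} wrote code and also deployed the release"
        )
    return violations


if __name__ == "__main__":
    bad = Release("1.4.1", frozenset({"alice"}), "alice", "dave")
    for message in segregation_violations(bad):
        print(message)
```

A check like this keeps the control reasonable: it runs in seconds, explains exactly what was breached, and does not slow down releases that comply.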

Operations controls are essential, as is maintaining a secure and reliable physical environment.

11.17.1 Facilities Management

IT operations is also responsible for maintaining the physical environment. This consideration must include security and business continuity, among other factors.

The facilities must be maintained in a highly available and reliable way. The same consideration must be given to maintaining applications, including middleware and shared services.

11.18 Application Management

IT operations must usually organize support structures such that applications management is handled by a dedicated team, similar to (or perhaps the same as) the production support group discussed earlier in this chapter. Applications management within IT operations focuses on understanding how the application operates under full load in a production environment. We often see that the applications management team understands how the system operates in the real world in more depth than the developers who initially wrote the code.

11.18.1 Middleware Support

Understanding middleware typically requires specialized knowledge and is usually its own specialized function. Examples of middleware can include application queue management and web application servers such as Tomcat or WebSphere. We find that the complexity of these technologies requires that these support engineers have specialized skills, and the greatest challenge is often a lack of backups to the primary subject matter expert.

11.18.2 Shared Services

Shared services can include QA and testing services, which will be discussed in Chapter 20. Build and release engineering services are also typical shared services that should be provided consistently across the organization. Your organization can define other shared services that align with your business and technical requirements.

11.19 Security Operations

Security operations is a particularly important consideration from an IT operations perspective. We view security as being a full lifecycle endeavor. You cannot tack on security at the end of the application lifecycle, and we will discuss continuous security in Chapter 12. IT operations is responsible for access controls and ensuring that systems are configured to be secure and reliable. Just as we recommend standards and frameworks for establishing IT controls, so, too, do we recommend using well-respected security-related standards and frameworks.

11.19.1 Center for Internet Security

In previous engagements, we have used the consensus-based standards available from the Center for Internet Security (CIS) to guide the configuration of operating systems, including Linux. These standards provide short code examples that demonstrate exactly how to configure the operating system in a secure and reliable way. It turns out that it is very easy to use this code to create environment monitoring scripts to ensure that the configurations are maintained and are not changed, thereby compromising the security of the systems. Special considerations must be made when outsourcing IT operations.
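To illustrate the kind of environment monitoring script mentioned above, here is a minimal Python sketch that compares a configuration file against an expected baseline and reports any drift. It assumes a simple "keyword value" file format of the sort used by the OpenSSH server configuration. The baseline values in the usage example are placeholders that we chose for illustration; they are not quoted from any CIS benchmark, and you should take your actual settings from the benchmark for your platform.

```python
def parse_settings(text):
    """Parse 'keyword value' lines, ignoring blank lines and '#' comments.

    Keywords are lowercased. If a keyword appears more than once, this
    sketch keeps the first occurrence (an assumption about the format).
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        key = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        settings.setdefault(key, value)
    return settings


def find_drift(expected, actual_text):
    """Return {keyword: (expected_value, actual_value)} for every mismatch.

    A keyword that is absent from the file is reported with an actual
    value of None rather than being assumed to have a safe default.
    """
    actual = parse_settings(actual_text)
    drift = {}
    for key, want in expected.items():
        got = actual.get(key.lower())
        if got is None or got.lower() != want.lower():
            drift[key] = (want, got)
    return drift


if __name__ == "__main__":
    # Placeholder baseline and file contents, for illustration only.
    baseline = {"PermitRootLogin": "no", "PasswordAuthentication": "no"}
    sample = "# sample file\nPermitRootLogin yes\nPasswordAuthentication no\n"
    print(find_drift(baseline, sample))
```

Run on a schedule, a script like this turns the written standard into a continuous check and raises an alert as soon as a configuration is changed.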

11.19.2 Outsourcing

Outsourcing can create significant risks that need to be identified and addressed. Just because you are outsourcing work does not mean that you are no longer responsible for ensuring that proper security controls are maintained. We have been compelled to examine vendor controls, including site visits and audits. Reviewing vendor controls is also a consideration when using cloud-based resources.

11.20 Cloud-Based Operations

IT operations takes on some special considerations when working with cloud-based resources. We will discuss the agile ALM in the cloud in Chapter 17, but in this section we point out that the IT operations team must adjust its processes and functions for the special needs of the cloud. This includes understanding service-level agreements and exactly what tasks will be handled by the organization’s IT operations versus those handled by the cloud-based provider. One key strategy that we will discuss in Chapter 17 is ensuring that you are never locked into one specific provider.

11.20.1 Interfacing with Vendor Operations

Interfacing with vendor operations can be a difficult task. Unless your organization has established a very strong relationship with a vendor, you may find that your operations team does not have sufficient transparency into vendor operations to ensure that your service will not be impacted. This is a common problem. Organizations typically choose vendors to save money, although you may find, in this case, that you get what you pay for.

11.21 Service Desk

The service desk is the customer-facing function that manages requests, typically based upon a service catalog. In practice, we find many organizations struggle to maintain service desks that are able to meet their objectives without frustrating and alienating their end users. The dysfunctional relationship that we observe is sometimes due to the service desk having an approach and terminology that do not match those of the business and development organizations, who are often working on an agile transformation. We view agile principles and practices to be a perfectly appropriate approach to guiding IT operations, including those using the guidance found in the ITIL v3 framework. Our recommended approach is to have the service catalog embrace agile terminology and to put service delivery managers in place to ensure that IT operations can align with the business and development requirements.

Aligning the service desk with the organizational culture is the first step required and begins with the centralized service desk function.

11.21.1 Centralized

The centralized service desk provides an essential interface to request services from the IT operations organization. We have seen many instances where users struggle to interface with the service desk, including developers trying to request resources they need in order to meet their goals and deliverables. Whereas developers, business users, QA, and testing are all focusing on agile principles and practices, IT operations needs to evolve to have terminology that is aligned with the rest of the organization.

Centralized service desks are often virtual organizations, a situation that does present some challenges.

11.21.2 Virtual

We see service desks taking a virtual approach instead of trying to collocate their teams. This does create some challenges similar to those we discussed earlier in the chapter with regard to help desks. Similarly, we find specialized service desks also have specific requirements.

11.21.3 Specialized

It is common for IT operations organizations to have specialized service desks, especially if the technology requires specific expertise and access. Specialized service desks are also closely related to requests for vendor services.

11.21.4 Vendor Escalation

Organizations that embrace Software as a Service (SaaS) and Platform as a Service (PaaS) often have to manage service requests through the vendor. Although users may initially reach out directly to the SaaS or PaaS provider, when things do not go well, IT operations typically needs to get directly involved by escalating requests to the vendor.

11.22 Staffing the Service Desk

Staffing the service desk requires that you find resources who have the requisite technical expertise and the demeanor to interface with end-user requests. One strategy is to make time on the service desk a tour of duty that leads to other opportunities, which has the advantage of ensuring that everyone on the team understands how requests come in and the need to satisfy one’s customers, whether they be internal business colleagues or the developers trying to enhance and upgrade our systems.

IT operations also needs to handle incident and problem management.

11.23 Incidents and Problems

We discussed incident and problem management within the context of change management in Chapter 10. IT operations should have service delivery management resources to ensure that incident and problem management is handled successfully across the IT operations organization.

11.24 Knowledge Management

Capturing and managing knowledge is a key requirement in IT operations. There are many areas where specialized knowledge should be documented and reviewed with key stakeholders. Still, we find that many operations organizations struggle with capturing essential knowledge, a fact that contributes to development’s view that IT operations staff are less technically skilled than other technology professionals in the organization. We consistently apply a DevOps approach to enhancing communication and collaboration, looking to consistently capture, review, and disseminate essential technical information.

11.25 Conclusion

IT operations performs a vital service ensuring secure and reliable systems. Although we fully support making use of industry standards and frameworks to establish and evolve IT operations processes and functions, we also believe that agile and Lean principles should be embraced to align IT Ops with the culture of the rest of the organization.
