Incident Resolution and Prevention

A Service Establishment and Delivery Process Area at Maturity Level 3

Purpose

The purpose of Incident Resolution and Prevention (IRP) is to ensure timely and effective resolution of service incidents and prevention of service incidents as appropriate.

Introductory Notes

The Incident Resolution and Prevention process area involves the following activities:

• Identifying and analyzing service incidents

• Initiating specific actions to address incidents

• Monitoring and tracking the status of incidents, and escalating as necessary

• Identifying and analyzing the underlying causes of incidents

• Identifying workarounds that enable service to continue

• Initiating specific actions to either address the underlying causes of incidents or to provide workarounds

• Communicating the status of incidents to relevant stakeholders

• Validating the complete resolution of incidents with relevant stakeholders

The term “incident” is used to mean “service incident” in this process area and in other areas of the model where the context makes the meaning clear. The term “service incident” is used in the glossary and in other parts of the model to clearly differentiate this specially defined term from the everyday use of the word “incident.” (See the definition of “service incident” in the glossary.)

Incidents are events that, if not addressed, eventually may cause the service provider organization to break its service commitments. Hence, the service provider organization must address incidents in a timely and effective manner according to the terms of the service agreement.

Addressing an incident may include the following activities:

• Removing an underlying cause or causes

• Minimizing the impact of an incident

• Monitoring the condition or series of events causing the incident

• Providing a workaround

Incidents may cause or be indications of interruptions or potential interruptions to a service.

Examples of interruptions to a service include a software application that is down during normal operating hours, an elevator that is stuck, a hotel room that is double booked, and baggage that is lost in an airport.

Examples of potential interruptions to a service include a broken component in resilient equipment, a line with more than three people in it at a counter of a supermarket, and an understaffed call center.

Customer complaints are a special type of potential interruption. A complaint indicates that the customer perceives that a service does not meet his or her expectations, even if the customer is in error about what the agreement calls for. Therefore, complaints should be handled as incidents and are within the scope of the Incident Resolution and Prevention process area.

All incidents have one or more underlying causes, regardless of whether the service provider is aware of the cause. For example, each system outage has an underlying cause, whether it is a memory leak, a corrupt database, or an operator error.

An underlying cause of an incident is a condition or event that contributes to the occurrence of one or more incidents. Not all underlying causes result in incidents immediately. For example, a defect in an infrequently used part of a system may not result in an incident for a long time.

Underlying causes can be any of the following:

• Root causes that are within the service provider’s control and can and should be removed

• Positive or negative conditions of a service that may or may not be removed

• Conditions that the service provider cannot change (e.g., weather conditions)

Underlying causes and root causes (as described in the Causal Analysis and Resolution process area) are not synonymous. A root cause is a type of underlying cause that is considered to be fundamental in some sense. We don’t normally look for the cause of a root cause, and we normally expect to achieve the greatest reduction in the occurrence of incidents when we address a root cause.

Sometimes, we are unable to address a root cause for practical or budgetary reasons, and so instead we may focus on other nonroot underlying causes. It doesn’t always make business sense to remove all underlying causes either. Under some circumstances, addressing incidents with workarounds or simply resolving incidents on a case-by-case basis may be more effective.

Effective practices for incident resolution start with developing a process for addressing incidents with the customers, end users, and other stakeholders who report incidents. Organizations may have both a collection of known incidents, underlying causes of incidents, and workarounds, and separate but related activities designed to create the actions for addressing selected incidents and underlying causes. Processing all incidents and analyzing selected incidents and their underlying causes to define approaches to addressing those incidents are two reinforcing activities that may be performed in parallel or in sequence.

Thus, the Incident Resolution and Prevention process area has three specific goals. The Prepare for Incident Resolution and Prevention goal helps to ensure that an approach is established for timely resolution of incidents and effective prevention of incidents when possible. The specific practices of the goal to Identify, Control, and Address Incidents are used to treat and close incidents, often by applying workarounds or other actions defined in the goal to Define Approaches to Address Selected Incidents.

Related Process Areas

Refer to the Capacity and Availability Management process area for more information about monitoring and analyzing capacity and availability.

Refer to the Service Delivery process area for more information about establishing and maintaining service agreements.

Refer to the Causal Analysis and Resolution process area for more information about determining causes of defects and problems.

Refer to the Configuration Management process area for more information about tracking and controlling changes.

Refer to the Project Monitoring and Control process area for more information about providing an understanding of the project’s progress so that appropriate corrective actions can be taken when the project’s performance deviates significantly from the plan.

Refer to the Risk Management process area for more information about identifying, analyzing, and mitigating risks.

Specific Practices by Goal

SG 1 Prepare for Incident Resolution and Prevention

Preparation for incident resolution and prevention is conducted.

Establish and maintain an approach for ensuring timely and effective resolution and prevention of incidents to ensure that the terms of the service agreement are met.

SP 1.1 Establish an Approach to Incident Resolution and Prevention

Establish and maintain an approach to incident resolution and prevention.

The approach to incident resolution and prevention describes the organizational functions involved in incident resolution and prevention, the procedures employed, the support tools used, and the assignment of responsibility during the lifecycle of incidents. Such an approach is typically documented.

Often, the amount of time needed to fully address an incident is defined before the start of service delivery and documented in a service agreement.

In many service domains, the approach to incident resolution and prevention involves a function called a “help desk,” “service desk,” or one of many other names. This function is typically the one that communicates with the customer, accepts incidents, applies workarounds, and addresses incidents. However, this function is not present in all service domains. In addition, other functional groups are routinely included to address incidents as appropriate.

Refer to the Service Delivery process area for more information about establishing and maintaining service agreements.

Typical Work Products

1. Incident management approach

2. Incident criteria

Subpractices

1. Define criteria for determining what an incident is.

To be able to identify valid incidents, criteria must be defined that enable service providers to determine what is and what is not an incident. Typically, criteria also are defined for differentiating the severity and priority of each incident.

2. Define categories for incidents and criteria for determining which categories an incident belongs to.

The resolution of incidents is facilitated by having an established set of categories, severity levels, and other criteria for assigning types to incidents. These predetermined criteria enable prioritization, assignment, and escalation actions to be taken quickly and efficiently.

Appropriate incident categories vary according to the service. As an example, IT-related security incident categories could include the following:

• Probes or scans of internal or external systems (e.g., networks, Web applications, mail servers)

• Administrative or privileged (i.e., root) access to accounts, applications, servers, networks, etc.

• Distributed denial of service attacks, Web defacements, malicious code (e.g., viruses), etc.

• Insider attacks or other misuse of resources (e.g., password sharing)

• Loss of personally identifiable information

There must be criteria that enable service personnel to quickly and easily identify major incidents.

Examples of incident severity level approaches include the following:

• Critical, high, medium, low

• Numerical scales (e.g., 1–5, with 1 being the highest)

3. Describe how responsibility for processing incidents is assigned and transferred.

The description may include the following:

• Who is responsible for addressing underlying causes of incidents

• Who is responsible for monitoring and tracking the status of incidents

• Who is responsible for tracking the progress of actions related to incidents

• Escalation procedures

• How responsibility for all of these elements is assigned and transferred

4. Identify one or more mechanisms that customers and end users can use to report incidents.

These mechanisms must account for how groups and individuals can report incidents.

5. Define methods and secure tools to use for incident management.

6. Describe how to notify all relevant customers and end users who may be affected by a reported incident.

How to communicate with customers and end users is typically documented in the service agreement.

7. Define criteria for determining severity and priority levels and categories of actions and responses to be taken based on severity and priority levels.

Examples of responses based on severity and priority levels include immediate short-term action, retraining or documentation updates, and deferring responses until later.

8. Identify requirements on the amount of time defined for the resolution of incidents in the service agreement.

Often, the minimum and maximum amounts of time needed to resolve an incident are defined and documented in the service agreement before the start of service delivery.

Refer to the Service Delivery process area for more information about establishing and maintaining service agreements.

9. Document criteria that define when an incident should be closed.

Not all underlying causes of incidents are addressed, and not all incidents have workarounds either. Incidents should not be closed until the documented criteria are met.

Often, closure codes are used to classify each incident. These codes are useful when incident data are analyzed further.
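The severity, resolution-time, and closure criteria described in these subpractices can be captured in a small data model. The following is a minimal sketch in Python, not a prescribed implementation: the four severity names follow the example above, while the time limits, the `ClosureCheck` fields, and the function names are hypothetical placeholders. Real values would come from the service agreement.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Numerical scale in which 1 is the highest severity."""
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


# Hypothetical maximum resolution times; a real service agreement defines these.
MAX_RESOLUTION_TIME = {
    Severity.CRITICAL: timedelta(hours=4),
    Severity.HIGH: timedelta(hours=24),
    Severity.MEDIUM: timedelta(days=3),
    Severity.LOW: timedelta(days=10),
}


@dataclass
class ClosureCheck:
    """Assumed closure criteria: service restored, submitter confirmed, code assigned."""
    service_restored: bool
    submitter_confirmed: bool
    closure_code: Optional[str] = None


def is_overdue(severity: Severity, elapsed: timedelta) -> bool:
    """Return True when an incident has been open longer than its limit."""
    return elapsed > MAX_RESOLUTION_TIME[severity]


def may_close(check: ClosureCheck) -> bool:
    """An incident may be closed only when every documented criterion is met."""
    return (
        check.service_restored
        and check.submitter_confirmed
        and check.closure_code is not None
    )
```

For instance, `is_overdue(Severity.CRITICAL, timedelta(hours=5))` is true under these assumed limits, and `may_close` rejects an incident that has no closure code yet.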

SP 1.2 Establish an Incident Management System

Establish and maintain an incident management system for processing and tracking incident information.

An incident management system includes the storage media, procedures, and tools for recording and accessing incident information. These storage media, procedures, and tools may be automated but are not required to be automated. For example, storage media might be a filing system where documents are stored. Procedures may be documented on paper, and tools may be hand tools or instruments for performing work without automated help.

A collection of historical data covering addressed incidents, underlying causes of incidents, known approaches to addressing incidents, and workarounds must be available to support incident management.

Typical Work Products

1. An incident management system with controlled work products

2. Access control procedures for the incident management system

Subpractices

1. Ensure that the incident management system allows the escalation and transfer of incidents among groups.

Incidents may need to be transferred or escalated between different groups because the group that entered the incident may not be best suited for taking action to address it.

2. Ensure that the incident management system allows the storage, update, retrieval, and reporting of incident information that is useful to the resolution and prevention of incidents.

Examples of incident management systems include the following:

• Indexed physical files of customer complaints and resolutions

• Bug or issue tracking software

• Help desk software

3. Maintain the integrity of the incident management system and its contents.

Examples of maintaining the integrity of the incident management system include the following:

• Backing up and restoring incident files

• Archiving incident files

• Recovering from incident errors

• Maintaining security that prevents unauthorized access

4. Maintain the incident management system as necessary.
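Where the incident management system is automated, the storage, retrieval, transfer, and reporting capabilities named in these subpractices might look like the sketch below. It is an in-memory illustration only. The class and method names (`IncidentStore`, `record`, `transfer`, `report_by_group`) are assumptions made for this example and do not refer to any particular product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Incident:
    incident_id: int
    description: str
    assigned_group: str
    history: List[str] = field(default_factory=list)


class IncidentStore:
    """Hypothetical in-memory store supporting record, retrieve, transfer, and report."""

    def __init__(self) -> None:
        self._incidents: Dict[int, Incident] = {}
        self._next_id = 1

    def record(self, description: str, group: str) -> Incident:
        """Store a new incident and note its initial assignment."""
        incident = Incident(self._next_id, description, group)
        incident.history.append(f"recorded, assigned to {group}")
        self._incidents[incident.incident_id] = incident
        self._next_id += 1
        return incident

    def get(self, incident_id: int) -> Incident:
        """Retrieve an incident by its identifier."""
        return self._incidents[incident_id]

    def transfer(self, incident_id: int, new_group: str) -> None:
        """Transfer or escalate an incident to another group, keeping a trail."""
        incident = self._incidents[incident_id]
        incident.history.append(
            f"transferred from {incident.assigned_group} to {new_group}"
        )
        incident.assigned_group = new_group

    def report_by_group(self) -> Dict[str, int]:
        """Count stored incidents per currently assigned group."""
        counts: Dict[str, int] = {}
        for incident in self._incidents.values():
            counts[incident.assigned_group] = counts.get(incident.assigned_group, 0) + 1
        return counts
```

Keeping a history entry for every transfer is one simple way to preserve the assignment trail when an incident moves between groups.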

SG 2 Identify, Control, and Address Incidents

Incidents are identified, controlled, and addressed.

The practices that make up this goal include interaction with those who report incidents and those who are affected by them. Incident data are processed and tracked across these practices until the incident is addressed and closed.

Treatment of incidents can include collecting and analyzing data looking for potential incidents or simply waiting for incidents to be reported by end users or customers.

The specific practices of this goal may also depend on the practices in the goal to Define Approaches to Address Selected Incidents. It is often the case that the practices in that goal are used to define the approaches used to address selected incidents as called for in the goal to Identify, Control, and Address Incidents.

Often, incidents involve work products that are under configuration management.

Refer to the Configuration Management process area for more information about tracking and controlling changes.

SP 2.1 Identify and Record Incidents

Identify incidents and record information about them.

Capacity, performance, or availability issues often signal potential incidents.

Refer to the Capacity and Availability Management process area for more information about monitoring and analyzing capacity and availability.

Typical Work Products

1. Incident management record

Subpractices

1. Identify incidents that are in scope.

Examples of how incidents can be identified include the following:

• Incidents reported by the customer to a help desk by phone

• Incidents reported by the end user in a Web form

• Incidents detected by automated detection systems

• Incidents derived from the analysis of anomalies in data collected

• Monitoring and analyzing external sources of information (e.g., RSS feeds, news services, websites)

2. Record information about the incident.

When recording information about an incident, record sufficient information to properly support analysis and resolution activities.

Examples of information to record about the incident include the following:

• Name and contact information of the person who reported the incident

• Description of the incident

• Categories the incident belongs to

• Date and time of occurrence and date and time the incident was reported

• The configuration items involved in the incident

• Closure code and information

• Relevant characteristics of the situation in which the incident occurred

3. Categorize the incident.

Using the categories established in the approach to incident resolution and prevention, assign the relevant categories to the incident in the incident management system. Communicating with those who reported the incident about its status enables the service provider to confirm incident information early.
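The example information listed for an incident record maps naturally onto a simple record type. The sketch below is illustrative only: the field names are assumptions chosen to mirror the list above, and an organization would adapt them to its own approach.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class IncidentRecord:
    """Hypothetical record holding the example information listed above."""
    reporter_name: str
    reporter_contact: str
    description: str
    occurred_at: datetime
    reported_at: datetime
    categories: List[str] = field(default_factory=list)
    configuration_items: List[str] = field(default_factory=list)
    situation_notes: str = ""
    closure_code: Optional[str] = None  # Filled in only when the incident is closed.

    def reporting_delay(self) -> timedelta:
        """Time between the occurrence of the incident and its being reported."""
        return self.reported_at - self.occurred_at

    def categorize(self, category: str) -> None:
        """Assign a category established in the approach, avoiding duplicates."""
        if category not in self.categories:
            self.categories.append(category)
```

Recording both the time of occurrence and the time of reporting makes it possible to analyze reporting delays later.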

SP 2.2 Analyze Incident Data

Analyze incident data to determine the best course of action.

The best course of action may be to do nothing, address incidents on a case-by-case basis, provide workarounds for the incidents, remove underlying causes of incidents, educate end users, monitor for indicators of interference with service, or build contingency plans.

Typical Work Products

1. Major incident report

2. Incident assignment report

Subpractices

1. Analyze incident data.

For known incidents, the analysis may be done by merely selecting the type of incident. For major incidents, a separate incident resolution team may be assembled to analyze the incident.

2. Determine which group is best suited to take action to address the incident.

Which group is best suited to take action to address the incident may depend on a number of different factors, including the type of incident, locations involved, and severity.

Examples of groups that deal with different types of incidents include the following:

• A healthcare team deals with adverse medical outcomes.

• A network support group handles network connectivity incidents.

• A help desk deals with password-related incidents.

3. Determine actions that must be taken to address the incident.

Examples of actions include the following:

• Replacing a broken component

• Training customer, end user, or service delivery personnel

• Releasing an announcement (e.g., public relations release, media response, bulletin, notice to customers or other stakeholders)

4. Plan the actions to be taken.

SP 2.3 Apply Workarounds to Selected Incidents

Apply workarounds to selected incidents.

Applying workarounds can reduce the impact of incidents. Workarounds can be applied instead of addressing underlying causes of the incident. Or, the workaround can be used temporarily until underlying causes of the incident can be addressed.

It is essential to have a single repository established that contains all known workarounds. This repository can be used to quickly determine the workaround to be used for related incidents.

Typical Work Products

1. Updated incident management record

Subpractices

1. Address the incident using the workaround.

2. Manage the actions until the impact of the incident is at an acceptable level.

3. Record the actions and result.
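A single repository of known workarounds, as described for this practice, can be as simple as a lookup keyed by incident category. The following sketch assumes that incidents are already categorized. The class name `WorkaroundRepository` and its methods are hypothetical.

```python
from typing import Dict, List


class WorkaroundRepository:
    """Hypothetical single repository of known workarounds, keyed by category."""

    def __init__(self) -> None:
        self._by_category: Dict[str, List[str]] = {}

    def add(self, category: str, instructions: str) -> None:
        """Register a documented workaround for an incident category."""
        self._by_category.setdefault(category, []).append(instructions)

    def find(self, category: str) -> List[str]:
        """Return known workarounds for a category, or an empty list if none exist."""
        return list(self._by_category.get(category, []))
```

An empty result signals that no workaround is known, in which case the incident is handled case by case or its underlying causes are analyzed.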

SP 2.4 Address Underlying Causes of Selected Incidents

Address underlying causes of selected incidents.

After the underlying causes of incidents are identified and analyzed, and action proposals are created using the specific practices of the goal to Define Approaches to Address Selected Incidents, the underlying causes must be addressed.

It is essential to have a single repository established that contains all known incidents, their underlying causes, and approaches to addressing these underlying causes. This repository can be used to quickly determine the causes of related incidents.

Typical Work Products

1. Updated incident management record

Subpractices

1. Address the underlying cause using the action proposal that resulted from the analysis of the incidents’ underlying causes.

SSD Add

Refer to the Service Delivery process area for more information about maintaining the service system.

2. Manage the actions until the underlying cause is addressed.

Managing the actions may include escalating the incidents as appropriate.

Examples of escalation criteria include the following:

• When the impact of the incident on the organization or customer is large

• When addressing the underlying cause of an incident will take considerable time or effort

3. Record the actions and result.

The actions used to address the incident or its underlying cause and the results of those approaches are recorded in the incident management system to support resolving similar incidents in the future.
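The two example escalation criteria above (large impact, or considerable time or effort) can be expressed as a simple rule. This is a sketch under stated assumptions: measuring impact by the number of affected users, and the thresholds in `EscalationPolicy`, are illustrative choices rather than part of the model.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class EscalationPolicy:
    """Hypothetical thresholds; an organization defines its own."""
    max_affected_users: int
    max_effort: timedelta


def should_escalate(
    affected_users: int, estimated_effort: timedelta, policy: EscalationPolicy
) -> bool:
    """Escalate when the impact is large or the expected effort is considerable."""
    return (
        affected_users > policy.max_affected_users
        or estimated_effort > policy.max_effort
    )
```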

SP 2.5 Monitor the Status of Incidents to Closure

Monitor the status of incidents to closure and escalate if necessary.

Throughout the life of the incident, the status of the incident must be recorded, tracked, escalated as necessary, and closed.

Refer to the Project Monitoring and Control process area for more information about providing an understanding of the project’s progress so that appropriate corrective actions can be taken when the project’s performance deviates significantly from the plan.

Typical Work Products

1. Closed incident management records

Subpractices

1. Document actions and monitor and track the incidents until they meet the terms of the service agreement and satisfy the incident submitter as appropriate.

The incident should be tracked throughout its life and escalated, as necessary, to ensure its resolution. Monitor the responses to those reporting the incident and how the incident was addressed until it is resolved to the customer’s or organization’s satisfaction.

2. Review the resolution and confirm the results with relevant stakeholders.

Confirming that the underlying causes were successfully addressed may involve confirming with the person who reported the incident or others involved in analyzing the incident that the actions taken in fact resulted in the incident no longer occurring. Part of the result of addressing the incident may be the level of customer satisfaction. Now that the incident has been addressed, it must be confirmed that the service again meets the terms of the service agreement.

3. Close incidents that meet the criteria for closure.
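Tracking an incident to closure can be modeled as a small state machine that permits only defined status changes. The status names and allowed transitions below are assumptions made for illustration. The model itself does not prescribe a particular set of states.

```python
from enum import Enum


class Status(Enum):
    OPEN = "open"
    IN_PROGRESS = "in progress"
    ESCALATED = "escalated"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Hypothetical transition table: each status maps to the statuses it may move to.
ALLOWED_TRANSITIONS = {
    Status.OPEN: {Status.IN_PROGRESS, Status.ESCALATED},
    Status.IN_PROGRESS: {Status.ESCALATED, Status.RESOLVED},
    Status.ESCALATED: {Status.IN_PROGRESS, Status.RESOLVED},
    # A resolved incident is reopened if stakeholders do not confirm the result.
    Status.RESOLVED: {Status.CLOSED, Status.IN_PROGRESS},
    Status.CLOSED: set(),
}


def advance(current: Status, new: Status) -> Status:
    """Return the new status, or raise ValueError if the change is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new
```

Separating "resolved" from "closed" leaves room for the review and confirmation with relevant stakeholders that precedes closure.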

SP 2.6 Communicate the Status of Incidents

Communicate the status of incidents.

Communication is a critical factor when providing services, especially when incidents occur. Communication with the person who reported the incident and possibly those affected by it should be considered throughout the life of the incident record in the incident management system. Well-informed end users and customers are more understanding and can even be helpful in addressing the incident successfully.

Typically, the results of actions are reviewed with the person who reported the incident to verify that the actions indeed resolved the incident to the satisfaction of the submitter.

Typical Work Products

1. Records of communication with customers and end users

SG 3 Define Approaches to Address Selected Incidents

Approaches to address selected incidents are defined to prevent the future occurrence of incidents or mitigate their impact.

All incidents have one or more underlying causes that trigger their occurrence. Addressing the underlying cause of some incidents may reduce the likelihood or impact of service interference, reduce the workload on the service provider, or improve the level of service. The practices in this goal cover the analysis of incidents to define how to address them. The results of this analysis are fed back to those who control and address incidents.

Underlying causes can be identified for incidents and possible future incidents.

Examples include analyzing the cause of a delivery error or system outage and monitoring use of software memory to detect memory leaks as soon as possible.

The root cause of an incident is often different from the immediate underlying cause. For example, an incident may be caused by a faulty system component (the underlying cause), while the root cause of the incident is a suboptimal supplier selection process. This process area uses the term “underlying cause” flexibly, ranging from immediate causes or conditions to deeper root causes, to allow for a range of possible responses from workarounds to prevention of a class of related incidents.

Refer to the Causal Analysis and Resolution process area for more information about determining causes of defects and problems.

SP 3.1 Analyze Selected Incident Data

Select and analyze the underlying causes of incidents.

The purpose of conducting causal analysis on incidents is to determine the best course of action to address the incidents so that they don’t happen again. Possible courses of action include not addressing the underlying cause and continuing to deal with incidents as they occur or providing a workaround.

Often, analyzing incidents involves work products that are under configuration management.

Refer to the Configuration Management process area for more information about tracking and controlling changes.

Typical Work Products

1. Report of underlying causes of incidents

2. Documented causal analysis activities

Subpractices

1. Identify underlying causes of incidents.

Examples of approaches to identifying underlying causes of incidents include the following:

• Analyze incidents reported by customers to a help desk

• Monitor the service system to identify potential incidents

• Analyze trends in the use of resources

• Analyze strengths and weaknesses of the service system

• Analyze mean times between service system failures and availability

• Analyze external sources of information, such as alerts, news feeds, and websites

Refer to the Risk Management process area for more information about identifying, analyzing, and mitigating risks.

2. Record information about the underlying causes of an incident or group of incidents.

When recording information about the underlying causes of an incident, record sufficient information to properly support causal analysis and resolution.

Examples of information to record include the following:

• Incidents affected or potentially affected by the underlying cause

• Configuration items involved

• Relevant characteristics of the situation in which the incidents did or could occur

3. Conduct causal analysis with the people who are responsible for performing related tasks.

For underlying causes of major incidents, the analysis may involve assembling a separate team to analyze the underlying cause.

Refer to the Causal Analysis and Resolution process area for more information about determining causes of defects and problems.

SP 3.2 Plan Actions to Address Underlying Causes of Selected Incidents

Identify the underlying causes of selected incidents and create an action proposal to address these causes.

After analysis has determined the underlying causes of incidents, the actions to be taken, if any, are planned. Planning includes determining who will act, when, and how. All of this information is documented in an action proposal. The action proposal is used by those who take action to address the underlying causes of incidents.

Typical Work Products

1. Action proposal

2. Contribution to collection of known approaches to addressing underlying causes of incidents

Subpractices

1. Determine which group is best suited to address the underlying cause.

Which group is best suited to address the underlying cause may depend on the type of underlying cause, configuration items involved, and the severity of the relevant incidents.

Examples of groups and departments that deal with different types of underlying causes include the following:

• A network support group handles network issues.

• A UNIX server support team deals with server configuration issues.

• Human Resources controls personnel and privacy issues.

• The Legal department controls issues relating to intellectual property, disclosure of information, data loss, etc.

• Public Relations is responsible for issues relating to the reputation of the organization.

2. Determine the actions to be taken to address the underlying cause.

When analyzing standard incidents, the actions for addressing that standard incident may be documented as a standard action plan. If the incident is not standard, a historical collection of addressed incidents and known errors should be searched to see if the incident is related to others. This data should be maintained to allow this kind of analysis, thus saving time and leveraging effort.

Examples of actions taken to address the underlying cause include the following:

• Replacing a broken component

• Training end users or service delivery staff

• Fixing a software defect

• Not addressing the underlying cause because it is cheaper or less risky to deal with the incidents than address the underlying cause

Refer to the Decision Analysis and Resolution process area for more information about analyzing possible decisions using a formal evaluation process that evaluates identified alternatives against established criteria.

3. Document the actions to be taken in an action proposal.

4. Verify and validate the action proposal to ensure that it effectively addresses the incident.

5. Communicate the action proposal to relevant stakeholders.

SP 3.3 Establish Workarounds for Selected Incidents

Establish and maintain workarounds for selected incidents.

Workarounds are important mechanisms that enable the service to continue in spite of the occurrence of the incident. Therefore, it is important that the workaround be documented and confirmed to be effective before it is used to address incidents with customers and end users.

Typical Work Products

1. Workaround description and instructions

2. Contribution to collection of workarounds for incidents

Subpractices

1. Determine which group is best suited to establish and maintain a workaround.

Determine which group is best suited to define the workaround, describe the steps involved, and communicate this information appropriately.

2. Plan and document the workaround.

3. Verify and validate the workaround to ensure that it effectively addresses the incident.

4. Communicate the workaround to relevant stakeholders.
