Having an incident response plan is a critical step in cyber breach response. The worst time for an organization to realize that it is not prepared for an incident is when a cyber breach occurs.
An effective incident response plan encompasses an incident management process, roles and responsibilities, communication flows, escalations, and postmortem activities, among other components. Each of those components is vital to respond effectively to various types of incidents and help organizations operate during significant cyberattacks.
This chapter discusses the incident response lifecycle, how to build an effective incident response plan, and how to improve incident response capabilities continuously.
An incident response lifecycle is a conceptual model that represents the different phases during the lifespan of a cybersecurity incident. To respond to incidents effectively, incident responders need to follow a structured and organized approach with clearly defined roles and responsibilities.
Various industry standards present slightly different lifecycle approaches. However, they draw on similar concepts, and there is usually a significant amount of overlap between them.
Figure 4.1 displays a lifecycle that the National Institute of Standards and Technology (NIST) in the United States included in its Computer Security Incident Handling Guide.1
In my personal experience, this lifecycle closely reflects the typical sequence of stages and functional activities during the incident response process and is a good model for organizations of all sizes to follow.
Preparation is a fundamental component of the incident response lifecycle that directly impacts the remaining stages. Without appropriate and continuous preparation, incident response teams cannot effectively execute activities in the remaining stages.
Preparation encompasses a wide range of activities that require an integrated and cross-functional effort, funding, and other resources, as well as executive-level support. Arguably, this book is all about preparation and ensuring that enterprises can develop capabilities to enable them to respond to and handle cybersecurity incidents effectively.
It is crucial to emphasize that preparation is not a one-off activity. Instead, it is a continuous and incremental effort that tightly integrates with a continual improvement process that I discuss later in this chapter.
The following list describes common activities that enterprises may undertake as part of the preparation phase. This list is by no means comprehensive. Strategic planning typically drives these activities.
Incident symptoms can manifest themselves in numerous ways. Analysts must pay close attention to potential adverse events to determine whether those events are indicative of an incident, or even a breach. Incident detection and analysis is the process of identifying suspicious events and assessing their details to determine an appropriate response procedure.
Systems and software applications may generate a significant volume of events each day. Analysts often rely on correlation tools and alerting mechanisms to detect those events that may be indicative of potentially malicious activity.
After detecting a suspicious event, analysts must triage it to determine whether the event warrants an incident declaration. As part of the triage phase, analysts need to apply business and technology context to determine the nature of the event. For example, an analyst might review the configuration baseline of a specific application to determine whether the investigated event could have been generated in the course of normal system operations and therefore represents expected system behavior. The outcome of the detection and triage phases is a decision on whether to proceed with incident declaration.
During the detection and triage phases, analysts typically document known event information; collect contextual data, such as asset configuration and system and network architecture; correlate the incident information with previous similar incidents; and review cyber threat intelligence (CTI) associated with the events of interest.
If an event warrants incident declaration, analysts typically determine the scope and impact, assign classification and prioritization to the incident, and log an incident ticket in a case management platform. Analysts may also escalate the incident if it reaches a specific severity threshold. It is crucial to note that as the investigation progresses and analysts establish new evidence, the scope, impact, and severity of the incident may change.
The detection and triage phases can be performed as one task or two separate activities. The latter approach is typical of organizations that have a mature incident response capability or have outsourced security monitoring to a managed security service provider (MSSP).
In some situations, it is clear that an event is indicative of an incident. For example, an internal server may attempt to open a network connection to infrastructure associated with a threat actor. In other cases, an analyst might escalate an event to incident response personnel to perform a more in-depth analysis to establish whether the event is indicative of a cyber incident.
Incident analysis refers to a structured process designed to examine security data and evaluate the cause of an incident. As part of the analysis phase, incident responders create indicators of compromise (IOC) that the enterprise can use to drive containment and eradication activities. To perform a sound and thorough analysis, incident responders must possess advanced skills and experience in the area of digital forensics. Chapter 2 discusses skills in detail. Depending on the nature of an incident, typical analysis activities may include the following:
Containment, eradication, and recovery follow detection and analysis in the incident response lifecycle. Organizations often refer to these activities as remediation.
Containment encompasses specific actions that enterprises take to prevent a threat from spreading, thereby limiting the damage that a cyberattack has on their business. Depending on the nature of a cyberattack, incident responders may execute containment activities in a single step or take a two-phased approach: short-term containment and long-term containment.
After successfully containing an incident, the next step is to take short-term and medium-term steps to eradicate the threat actor from the compromised environment. As in the case of containment, eradication strategies vary from incident to incident. For this reason, incident response teams should document containment and eradication procedures for common incident types that occur in their environments. Examples of eradication actions include the following:
Incident responders need to understand the full scope of a compromise and the associated IOCs before executing containment and eradication activities. As part of the attack strategy, attackers typically establish several footholds in the compromised environment. An insufficiently remediated environment may still allow the attacker access to the network. If there is a means for the attacker to come back to the compromised network, they will likely do so.
Containment and eradication may sometimes be at odds with forensic data acquisition. In some cases, ad hoc containment or eradication can destroy forensic evidence. For example, it is not uncommon for information technology support personnel to re-image malware-infected end-user workstations without considering the need to acquire and preserve digital evidence. Consequently, creating and documenting containment strategies for specific incident types is vital to ensure a balance between effective containment and preserving forensic evidence, especially if the evidence may be required for legal purposes.
The objective of the recovery phase is to restore the affected systems and software applications to a fully operational state. As part of recovery, enterprises also need to ensure that the restored systems and applications have appropriate controls in place to prevent similar incidents from recurring. For example, if a threat actor compromised domain administrative credentials, an enterprise may need to consider controls such as multifactor authentication (MFA) to prevent a similar compromise in the future. MFA is an authentication scheme that requires at least two independent pieces of information from separate categories of credentials, such as a password and a random number generated by an authentication token, to authenticate into a system.
During incidents and breaches that have a severe impact on business operations, enterprises may need to activate a disaster recovery (DR) plan and switch operations to alternate systems that are in a known good state during recovery. Enterprises can leverage standards, such as ISO 22301:2019, to prepare for, respond to, and recover from disruptive events.2
Enterprises typically recover systems and software applications in a phased approach. Depending on the nature and scope of an incident, it may take days or even weeks to recover fully. For this reason, project management and close collaboration between technology and business stakeholders, as well as vendors, are vital.
Post-incident activities include steps that enable organizations to capture opportunities for improvement and implement specific measures to enhance their cyber breach response program. Post-incident activities are typically part of continual improvement that enterprises implement to govern a cyber breach response program. Organizations can implement two types of measures as part of continual improvement: tactical and strategic.
Tactical measures encompass relatively simple program enhancements that do not require substantial funding or dedicated projects. Enterprises typically identify tactical measures through a lessons-learned meeting and a root cause analysis. To identify tactical improvement measures, organizations need to hold a lessons-learned meeting after each major incident, as well as regular meetings to review responses to minor incidents. The outcome of a lessons-learned meeting is a set of tactical measures that enterprises can implement within a relatively short time.
If an enterprise identifies major gaps and issues that cannot be resolved as part of day-to-day operations, the incident response manager must escalate those issues to senior management for risk evaluation. Based on the outcome of the risk evaluation process, an enterprise may decide to implement strategic measures that typically require funding and dedicated projects. In other words, strategic measures are long-term enhancements that require dedicated resources. For example, an organization may choose to implement an endpoint detection and response (EDR) tool as a strategic measure to reduce the risk associated with threats that evade traditional antimalware software. Another way of thinking about tactical vs. strategic measures is in terms of short-term and long-term enhancements.
Process-mature organizations typically implement strategic measures as part of a continuous improvement process. Enterprises implement this process as part of governance to evaluate the performance of a cyber breach response program continually against its objectives. A steering committee is typically the body responsible for implementing and enforcing governance.
The purpose of a cybersecurity incident management process is to minimize the operational and informational impact of a cybersecurity incident, as well as to ensure that the enterprise can continue business operations while under a cyberattack.3 Consequently, the objectives of the incident management process include the following:
An effective incident management process helps enterprises manage residual risk and ensure that business, cybersecurity, and technology stakeholders work together toward the same goals.
The scope of an incident management process encompasses all cybersecurity incidents that negatively impact an enterprise, not only cyber breaches. If not addressed promptly, even minor incidents can progress into enterprise breaches that may negatively impact revenue, brand reputation, and business operations and lead to legal exposure. For this reason, enterprises need to create a cross-functional cybersecurity incident management process to address the aftermath of a cyberattack before real damage occurs.
An effective incident management process delivers several benefits to organizations:
An incident management process is a structured and coordinated set of activities designed to address the aftermath of a cyberattack. A high-quality process contains several components that work in tandem to produce a desired and repeatable outcome.
Enterprises should define specific events that trigger an incident management process. When an event triggers an incident management process, the process takes an input, executes a set of activities, and produces an outcome. For example, a security alert may trigger an incident management process. Incident responders take all known information about the incident and proceed through a series of activities to understand its nature and prevent its spread within the enterprise. The outcome of the process may be incident resolution and control recommendations designed to prevent similar incidents from reoccurring.
Figure 4.2 captures an incident management process as a set of components organized into three categories: controls, enablers, and the process itself. I derived this model from the Information Technology Infrastructure Library (ITIL).4
It is worth mentioning that not all organizations choose to implement metrics and process improvements. Chapter 1 discussed the Capability Maturity Model Integration (CMMI) model that defines process maturity levels.5 Striving to quantitatively manage and optimize an incident management process may not be the end goal for all organizations. In fact, for smaller organizations that are still developing a basic understanding of cybersecurity, this approach may lead to inefficient use of organizational resources.
Furthermore, the ISO 9001:2015 standard6 established a hierarchical relationship between process, procedure, and work instruction, and helped to clarify crucial terms that apply to incident management. Figure 4.3 depicts the relationship between these concepts.
Controls are process components that are necessary to ensure that the process produces outcomes in accordance with the established objectives. Process controls are also vital to continual improvement. Enterprises need to establish the following controls to ensure that their incident management process is up to managing the response to cyberattacks effectively.
Process enablers make up the resources and capabilities required to support an incident management process and achieve specific outcomes.
Chapter 2 discussed a CSIRT as a cross-functional team consisting of cybersecurity, technology, and business stakeholders, as well as third parties that convene to respond to cybersecurity incidents. As part of defining an incident management process, enterprises need to establish interfaces to other organizational functions and integrate with their processes to ensure effective response.
A process interface determines the inputs that an incident management process receives from other processes, as well as the outputs that it produces that may trigger other, subsequent processes. In practical terms, the inputs and outputs are pieces of information that various organizational functions receive or pass into the incident management process so that it can achieve its objectives.
For example, a technology support group may investigate an operational incident, diagnose a cyberattack as its root cause, and trigger the cybersecurity incident management process. The outcome of the operational investigation feeds as an input into the cybersecurity incident management process. In contrast, if the cybersecurity incident management process produces evidence of unauthorized access to data that is subject to legal and regulatory requirements, the findings would trigger and feed into a data privacy process.
To make the transitions between various organizational processes smooth, enterprises need to create communication templates to ensure that the information they pass from one process to another is fit for its intended purpose. For example, the information should contain the necessary details, yet be succinct, written for the target audience, and actionable to facilitate decision making.
The following list briefly describes the typical processes that integrate with a cybersecurity incident management process.
A cybersecurity incident management process typically includes several roles that are responsible for fulfilling various activities within the process. Chapter 2 identified typical business, technology, and third-party functions that an incident response team may call upon to assist in their respective domains. Enterprises must identify, assign, and document roles and responsibilities to those functions to ensure that they carry out and fulfill their activities within the incident management process.
An outline of roles and responsibilities should be succinct, yet contain the necessary details relating to the process. Typically, a brief description of the role and a bulleted list of responsibilities works well.
Another, often complementary, method of documenting roles and responsibilities is to create a responsibility assignment chart. A RACI chart allows organizations to map roles and responsibilities to activities. The acronym is derived from the four responsibilities that organizations typically designate: Responsible, Accountable, Consulted, and Informed.
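A RACI chart maps naturally onto a nested lookup table. The following is a minimal sketch, reading RACI as Responsible, Accountable, Consulted, and Informed. The activities, the role names, and the rule that each activity has exactly one Accountable role and at least one Responsible role are illustrative assumptions, not requirements taken from this chapter.

```python
from collections import Counter

# Hypothetical activities and roles, for illustration only.
# R = Responsible, A = Accountable, C = Consulted, I = Informed.
RACI_CHART = {
    "Triage security alert": {
        "SOC analyst": "R",
        "Incident manager": "A",
        "IT operations": "C",
        "Legal counsel": "I",
    },
    "Isolate compromised host": {
        "IT operations": "R",
        "Incident manager": "A",
        "SOC analyst": "C",
        "Business owner": "I",
    },
}


def validate_raci(chart):
    """Return a list of problems; an empty list means the chart passes."""
    problems = []
    for activity, assignments in chart.items():
        counts = Counter(assignments.values())
        # Assumed convention: exactly one Accountable role per activity.
        if counts["A"] != 1:
            problems.append(
                f"{activity}: expected one Accountable role, found {counts['A']}"
            )
        # Assumed convention: at least one role does the work.
        if counts["R"] < 1:
            problems.append(f"{activity}: no Responsible role assigned")
        unknown = set(counts) - set("RACI")
        if unknown:
            problems.append(f"{activity}: unknown codes {sorted(unknown)}")
    return problems


if __name__ == "__main__":
    for problem in validate_raci(RACI_CHART) or ["chart is consistent"]:
        print(problem)
```

A check of this kind can surface activities that nobody owns before an incident exposes the gap.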
Enterprises can leverage service level agreements (SLAs) and operational level agreements (OLAs) to ensure that entities that participate in the incident management process complete their tasks within the expected time frame and to the agreed-on quality. An SLA is a formal agreement between an organization and a service provider. In contrast, an OLA represents a formal commitment between internal entities.7 For example, an enterprise may establish a response SLA with an incident response firm as part of a retainer agreement. An example of an OLA is an agreement between an incident response team and a technical group within the same organization to establish commitments, such as response time to support an investigation.
Enterprises need to establish SLAs and OLAs that are measurable to determine whether the entities that participate in the incident management process meet their targets. Chapter 1 discussed the SMART framework. Enterprises can also leverage this framework to create unambiguous targets as part of defining OLAs and SLAs.
SLAs and OLAs are excellent tools to set and measure process targets. However, a poor choice of SLAs and OLAs may lead to an incorrect perception of a cyber breach response program. For example, in the world of IT operations, incidents typically have obvious symptoms, such as the low performance of an application component. In such cases, response and resolution targets make perfect sense. In contrast, the symptoms of a cybersecurity incident may not be so obvious. In some cases, a threat actor may dwell on a compromised network for weeks or even months before detection. It may also take a complex, time-consuming investigation before the enterprise determines the full scope and impact of the compromise. For this reason, targets that make perfect sense in IT operations may not always be appropriate for a cyber breach response program. Organizations must define SLAs and OLAs very carefully to ensure that they convey accurate information and are useful in making decisions. For example, defining a metric focused on the time it takes to close an incident typically results in incident response teams focusing on resolving incidents to meet metric requirements at the expense of a thorough investigation. In contrast, agreeing to a support response time SLA with an incident response partner is more beneficial and sets clear expectations for how soon the partner must start providing support.
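A support response time target of the kind described above is straightforward to measure. The sketch below assumes a hypothetical four-hour response target in a retainer agreement; the target value, the function name, and the timestamps are illustrative only.

```python
from datetime import datetime, timedelta

# Hypothetical target: the retainer partner must begin providing
# support within four hours of the request.
RESPONSE_SLA = timedelta(hours=4)


def response_sla_met(requested_at, responded_at, target=RESPONSE_SLA):
    """Return True if support started within the agreed response time."""
    return responded_at - requested_at <= target


if __name__ == "__main__":
    requested = datetime(2024, 3, 1, 9, 0)
    responded = datetime(2024, 3, 1, 11, 30)
    print(response_sla_met(requested, responded))
```

Note that the measurement covers only how quickly support begins. It deliberately says nothing about how long the investigation takes, which avoids the time-to-close incentive discussed above.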
A workflow is a sequence of activities executed in a chronological order to manage the lifecycle of an incident. A typical workflow includes decision points that determine the direction of activity flow based on conditions and rules. A workflow may also contain interfaces to other processes and subprocesses. This section presents a conceptual workflow that captures the main steps in an incident management process. In practice, workflows may be more complex with many decision points. Furthermore, incident response teams often iterate over some of the steps as they scope intrusions and learn about how a threat actor operates in a compromised environment as depicted in Figure 4.4.
Enterprises can detect and identify incidents and cyber breaches through numerous sources. For this reason, it is critical that business and technology stakeholders know how to report suspicious events and incidents to the incident response team. The following list briefly discusses the typical avenues that lead to the detection or identification of an incident.
Organizations need to identify and triage events and incidents that stakeholders report through these sources to establish whether those notifications warrant incident declaration. The next step is to classify and document the incidents before commencing an in-depth analysis.
A vital element of incident management is incident classification and documentation. Organizations need to record and maintain a full historical record of incidents for several reasons, including the following:
As part of incident documentation, analysts need to classify incidents by assigning an appropriate category and severity level to them. Incident classification helps communicate the nature and impact of an incident to business and technology stakeholders. It is also vital to establishing metrics and measuring a cyber breach response program.
Incident categorization involves assigning a category to an incident that reflects the nature of the incident and the resources required to address it. Organizations should establish a categorization scheme that is most appropriate for their maturity and metric requirements.
A single-tier scheme typically works for enterprises that are in the early phases of setting up a cyber breach response program. As the program matures and management identifies the need to track metrics at a granular level to make precise operational adjustments, an organization may choose to implement a two-tier scheme.
A single-tier scheme typically encompasses several high-level categories, such as malware or denial of service. A two-tier scheme, on the other hand, defines high-level incident categories at the top tier and incident types or subcategories at the lower tier. For example, an incident of type “malware” may be further categorized as virus, ransomware, bot, or worm.
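The difference between the two schemes can be sketched as a mapping from top-tier categories to optional subcategories. The category names below come from the examples in the text; the structure and the function name are assumptions for illustration.

```python
# Top-tier categories mapped to lower-tier incident types.
# An empty list means no subcategories are defined for that category.
TWO_TIER_SCHEME = {
    "malware": ["virus", "ransomware", "bot", "worm"],
    "denial of service": [],  # the text gives no subcategories here
}


def categorize(category, subcategory=None):
    """Validate a category assignment and return it as a tuple.

    Omitting the subcategory gives single-tier behavior.
    """
    if category not in TWO_TIER_SCHEME:
        raise ValueError(f"unknown category: {category}")
    if subcategory is not None and subcategory not in TWO_TIER_SCHEME[category]:
        raise ValueError(f"unknown subcategory for {category}: {subcategory}")
    return (category, subcategory)


if __name__ == "__main__":
    print(categorize("malware"))                # single-tier usage
    print(categorize("malware", "ransomware"))  # two-tier usage
```

Because the subcategory is optional, an organization could start with top-tier categories only and introduce the lower tier later without reclassifying existing records.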
To communicate cybersecurity incidents clearly throughout the organization, enterprises must create an incident taxonomy or adopt a taxonomy established by an industry framework. For example, NIST SP 800-61 Revision 2 proposes the following taxonomy:9
Severity identifies the importance of an incident to the enterprise, and it is the primary driver for response actions that an organization takes to resolve it. Two primary criteria exist to determine a severity level of a cybersecurity incident: impact and urgency.
Impact is a measure of the potential damage that an incident causes before it is contained. An incident can cause an operational or informational impact.10
Enterprises should establish both operational and informational criteria to identify the severity of an incident accurately. Enterprises historically used impact on business operations as a criterion to calculate the severity of an IT incident. However, cybersecurity incidents may severely impact organizations while not having an operational impact. For example, unauthorized access or theft of protected data may lead to legal exposure, brand reputation damage, or a loss of competitive advantage in the marketplace in case of intellectual property theft. For this reason, cybersecurity incidents require informational impact evaluation, too. In cases when a cyberattack causes both operational and informational impact, organizations should use the higher impact score in the severity calculation.
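The rule of using the higher of the two impact scores can be expressed in a few lines. This sketch assumes the four impact levels used in this chapter's example tables and assigns them hypothetical numeric ranks so that they can be compared.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Impact levels; the numeric ranks are an assumption for comparison."""
    MINOR = 1
    MODERATE = 2
    SIGNIFICANT = 3
    EXTENSIVE = 4


def overall_impact(operational, informational):
    """Return the higher of the operational and informational impact."""
    return max(operational, informational)


if __name__ == "__main__":
    # Data theft with little operational disruption still rates as extensive.
    print(overall_impact(Impact.MODERATE, Impact.EXTENSIVE).name)
```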
Operational and informational impact criteria may vary among organizations, depending on factors such as risk appetite and the nature of the business. Table 4.1 and Table 4.2 provide examples of typical impact criteria that enterprises may establish.
Urgency is the time that must elapse before an incident has a significant business impact. The shorter the time, the higher the urgency. For example, a ransomware outbreak on the corporate network may need immediate action to prevent a severe business impact. Evidence of a threat actor performing reconnaissance of an Internet-exposed web application may not necessitate the same level of response.
Table 4.1: An example of operational impact criteria
IMPACT | DESCRIPTION |
Extensive | An incident severely impacts multiple core business functions and significantly impairs business operations. An extensive impact may also warrant a crisis declaration. |
Significant | An incident severely impacts one or two core business functions. |
Moderate | An incident has little impact on business operations. |
Minor | An incident has no impact on business operations and is handled as part of day-to-day operations. |
Table 4.2: An example of informational impact criteria
IMPACT | DESCRIPTION |
Extensive | An attacker exfiltrates trade secrets, intellectual property, or a significant amount of data that is subjected to legal and regulatory requirements. |
Significant | An attacker gains unauthorized access to protected data and exfiltrates a small subset of that data. There is also a possibility that the attacker can exfiltrate additional data. |
Moderate | An attacker has gained a foothold in an environment and actively enumerates systems and applications for the data of interest. However, there is no evidence of unauthorized access to or exfiltration of protected data. |
Minor | An attacker has compromised a non-business-critical system or application that does not store or process protected data. Also, there is no evidence of lateral movement within the environment. |
As with impact, enterprises need to establish clear urgency criteria to use in severity calculation. Table 4.3 provides an example of typical urgency criteria based on how far an attacker progressed through the cyberattack lifecycle.
By correlating impact and urgency, organizations can derive the severity level for an incident. The higher the severity, the more damage an incident may cause. The severity level drives the resources that an enterprise dedicates to resolving the incident.
Table 4.3: An example of urgency criteria
URGENCY | DESCRIPTION |
Critical | An attacker accesses highly confidential data or actively deploys destructive malware. |
High | An attacker establishes C2 communication and has moved laterally within the compromised network. |
Medium | An attacker exploits a vulnerability and gains unauthorized access to a system or a software application. The attacker may also establish a foothold to maintain access to the network. |
Low | An attacker performs an active attack on the organization. However, there is no evidence that the attack succeeded. |
One way to establish the severity of an incident is to create a matrix of impact and urgency. Enterprises can also turn the matrix into a heat map by shading its cells with colors. Table 4.4 provides a matrix example based on the impact and urgency criteria discussed earlier in this section.
Table 4.4: An example of a severity matrix
URGENCY | EXTENSIVE IMPACT | SIGNIFICANT IMPACT | MODERATE IMPACT | MINOR IMPACT |
Critical | 1 | 1 | 2 | 2 |
High | 1 | 2 | 2 | 3 |
Medium | 2 | 2 | 3 | 3 |
Low | 3 | 3 | 4 | 4 |
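The severity matrix in Table 4.4 can be read as a two-key lookup. The sketch below encodes the table values as given; the dictionary layout and the function name are illustrative assumptions.

```python
# Severity levels from Table 4.4, keyed by urgency and then by impact.
SEVERITY_MATRIX = {
    "Critical": {"Extensive": 1, "Significant": 1, "Moderate": 2, "Minor": 2},
    "High":     {"Extensive": 1, "Significant": 2, "Moderate": 2, "Minor": 3},
    "Medium":   {"Extensive": 2, "Significant": 2, "Moderate": 3, "Minor": 3},
    "Low":      {"Extensive": 3, "Significant": 3, "Moderate": 4, "Minor": 4},
}


def severity(urgency, impact):
    """Look up the severity level for an urgency and impact pair."""
    try:
        return SEVERITY_MATRIX[urgency][impact]
    except KeyError:
        raise ValueError(f"unknown urgency/impact: {urgency!r}, {impact!r}") from None


if __name__ == "__main__":
    print(severity("High", "Significant"))
```

Keeping the matrix as data rather than as branching logic makes it easy to adjust when an organization tunes its criteria.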
Enterprises can use a numerical or descriptive scheme to designate severity levels. Regardless of the choice of scheme, organizations need to determine what level of response each severity level warrants. The following list contains examples of severity definitions based on the impact and urgency criteria examples discussed earlier in this section.
Incident classification is a dynamic process, and enterprises may need to adjust the category and severity assigned to an incident throughout its lifecycle to reflect its impact and scope accurately.
Nowadays, most enterprises use a ticketing tool or a dedicated incident response platform to track incidents. Analysts leverage those tools to create records that contain a set of data relating to incidents. Some platforms also allow their users to associate incident records with other items, such as a known vulnerability, among other features.
Enterprises need to consider incident record requirements carefully. Specific information, such as a case number or an incident type, applies to all records. Other information may be incident-specific or not available at the time when an analyst creates an incident record, such as the incident scope. I have come across organizations that enforce strict documentation requirements in their case management systems by configuring mandatory fields. However, this approach can hinder the analyst's ability to create a ticket at the early stages of an event, especially in cases where the available information is limited or unknown. A more sensible approach is to enact a policy that requires an incident response team to document information throughout an investigation and update electronic incident records appropriately. Furthermore, management can review incident documentation regularly and implement specific controls when the documentation is lacking in detail or quality.
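One way to reconcile early ticket creation with documentation quality is to require only the fields that are always known and to report the remaining gaps for later review. The following sketch assumes a hypothetical record layout; the field names are illustrative and do not represent a prescribed schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentRecord:
    # Known for every record at creation time.
    case_number: str
    incident_type: str
    # Often unknown early on; filled in as the investigation progresses.
    scope: Optional[str] = None
    severity: Optional[int] = None
    summary: Optional[str] = None

    def missing_fields(self):
        """List fields still undocumented, for periodic management review."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


if __name__ == "__main__":
    record = IncidentRecord(case_number="IR-0001", incident_type="malware")
    print(record.missing_fields())
    record.scope = "two end-user workstations"
    print(record.missing_fields())
```

The analyst can open the ticket with two values, and a reviewer can later query which records still lack detail.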
The following list contains typical incident attributes that enterprises may choose to track in an incident record:
A cybersecurity incident typically progresses through several states from the initial detection to its closure. The following list briefly explains each of those states.
As part of their cybersecurity incident management process, enterprises need to establish escalation procedures. Incidents that typically require escalation may become major incidents, and organizations need to ensure appropriate response before they cause a significant operational or informational impact. Two types of escalations can occur: hierarchical and functional.
A hierarchical escalation raises an incident to higher management levels. Incident response personnel may engage senior management during major incidents to set priorities and provide a strategic direction to the incident response team. For example, when containment requires the organization to isolate a system that supports a core business function from the network, senior management needs to decide whether system availability or containment takes precedence.
A hierarchical escalation may also be necessary if the incident response team requires additional resources or a higher level of support from a group that participates in the process.
Functional escalation raises an incident to a higher level of skill or expertise. Some investigations may require advanced DFIR expertise. In such cases, an analyst may escalate an incident to senior personnel to support or take over the investigation. Another example of functional escalation is hiring external consultants to assist with a specific aspect of the investigation, such as malware reverse-engineering.
Functional and hierarchical escalations are not mutually exclusive, and in some cases, they complement each other. In the previous example, the incident response team needs to escalate the incident to senior management, which typically has the required level of authority to engage external consultants.
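The relationship between the two escalation types can be sketched as a small decision function. The three input flags are hypothetical simplifications of the situations described above; a real procedure would involve more conditions.

```python
def required_escalations(needs_management_decision=False,
                         needs_advanced_expertise=False,
                         needs_external_consultants=False):
    """Return the set of escalation types that a situation calls for."""
    escalations = set()
    # Functional: raise the incident to a higher level of skill or expertise.
    if needs_advanced_expertise or needs_external_consultants:
        escalations.add("functional")
    # Hierarchical: raise the incident to a higher management level.
    # Engaging external consultants is assumed to need management authority.
    if needs_management_decision or needs_external_consultants:
        escalations.add("hierarchical")
    return escalations


if __name__ == "__main__":
    print(sorted(required_escalations(needs_external_consultants=True)))
```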
Investigating complex incidents and breaches typically requires the contribution of multiple cybersecurity, business, and technical stakeholders. To make this effort manageable, enterprises can break down a case into tasks and assign those tasks to specific roles.
A task is a unit of work that a specific role must complete to advance an investigation. For example, an incident manager may assign the task of acquiring forensic images of compromised hosts to one analyst, whereas another analyst may perform log analysis. An incident manager must associate all tasks with the investigated case and track their progress. Furthermore, a single role may be responsible for one or more tasks.
Some incident response platforms allow enterprises to create templates for specific incident types. As part of a template, analysts can create predefined tasks that an incident manager can assign during an investigation. This approach saves time and helps ensure a consistent response to similar incidents. The next section discusses how organizations can leverage this approach to build playbooks for specific incident types.
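As an illustration of how cases, tasks, and templates relate, consider the following minimal Python sketch. It does not represent the data model of any particular incident response platform. The class names, the `TEMPLATES` dictionary, and the task titles are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Task:
    """A unit of work that a specific role completes for a case."""
    title: str
    assignee: Optional[str] = None  # None until an incident manager assigns it
    done: bool = False


@dataclass
class Case:
    """An investigated case with its associated tasks."""
    case_id: str
    incident_type: str
    tasks: List[Task] = field(default_factory=list)

    def open_tasks(self) -> List[Task]:
        """Return the tasks that are not yet complete."""
        return [t for t in self.tasks if not t.done]


# Hypothetical templates: predefined task titles per incident type.
TEMPLATES: Dict[str, List[str]] = {
    "malware": [
        "Acquire forensic images of compromised hosts",
        "Perform log analysis",
    ],
}


def create_case(case_id: str, incident_type: str) -> Case:
    """Create a case pre-populated with tasks from the matching template."""
    titles = TEMPLATES.get(incident_type, [])
    return Case(case_id, incident_type, [Task(title) for title in titles])
```

An incident manager would call `create_case("IR-001", "malware")`, set the `assignee` of each task, and use `open_tasks()` to track progress. An incident type without a template simply starts with an empty task list.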
Major incidents are incidents that cause a significant operational or informational impact and require prompt response. Severity 1 and Severity 2 incidents, as defined in a previous section, would qualify as major incidents.
Major incidents require hierarchical escalation to senior management and often a dedicated management workstream in addition to the technical workstream. For this reason, enterprises need to develop a major incident response process. The process should include the necessary provisions for both tactical and management workstreams that may include the following:
Communication and collaboration are vital to managing major incidents. Both business and technical stakeholders need to be informed about the progress of the incident investigation. For this reason, enterprises need to develop a comprehensive communications plan. Chapter 2 established a CSIRT coordination model with two key roles: incident manager and incident officer. As part of the process, enterprises have to assign a dedicated incident manager to lead the tactical workstream and directly work with the incident officer to ensure that business priorities drive the investigation. These two stakeholders are crucial to managing communication at the tactical and management levels, respectively.
Furthermore, enterprises need to conduct a post-incident review after each major incident to evaluate the response and identify opportunities for improvement. The section “Continual Improvement” discusses this topic in depth.
Incident closure is the final activity in the process. It follows recovery and typically includes the following steps:
An incident response playbook is a predefined series of steps that an incident response team takes when responding to a particular incident type. This approach allows organizations to streamline their processes and handle similar or routine incidents efficiently and consistently. Playbooks are also known as standard operating procedures (SOPs), runbooks, or incident models.
A playbook is an extension of an incident management process, and all of the steps that it contains must conform to the process. An incident management process focuses on an overarching approach to handle the lifecycle of all cybersecurity incidents. In contrast, a playbook focuses on a step-by-step procedure to respond to a specific incident type. Another way of thinking about playbooks is in terms of incident-specific response procedures. For example, a malware outbreak playbook contains concrete steps describing how to handle a malware outbreak at each phase of the incident response lifecycle.
As illustrated in Figure 4.3, organizations can also create work instructions describing how to perform specific steps within a playbook. Specific work instructions may apply across multiple playbooks and are typically valuable to new hires or junior personnel. For example, if a playbook contains a step to acquire a forensic image of a compromised system, an analyst can reference a documented work instruction on how to perform this task for a specific platform, such as Windows or Linux.
A typical playbook contains the following components:
A playbook may also contain interfaces to invoke other playbooks or processes. For example, an Unauthorized Data Access playbook could contain a step to trigger a data privacy incident process.
For playbooks to be effective, they must address specific incident types and focus on steps that are specific to those types only. For example, developing separate playbooks for ransomware and backdoor malware is more effective than creating a generic malware playbook. Also, playbooks should be concise, practical, and in the majority of cases, limited to no more than 25 steps. If a playbook requires more than 25 steps and a complex workflow with several decision points, it is likely too generic and should be broken down into more specific playbooks.
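The guidance above lends itself to a simple automated review. The following Python sketch is one possible way to flag playbooks that have grown too generic. The `Playbook` structure, its field names, and the `review_playbook` function are hypothetical, and the 25-step threshold is the guideline from the text rather than a hard rule.

```python
from dataclasses import dataclass, field
from typing import List

MAX_STEPS = 25  # guideline from the text, not a hard limit


@dataclass
class Playbook:
    incident_type: str
    steps: List[str] = field(default_factory=list)
    # Names of other playbooks or processes that this playbook can trigger.
    invokes: List[str] = field(default_factory=list)


def review_playbook(playbook: Playbook) -> List[str]:
    """Return review warnings; an empty list means nothing was flagged."""
    warnings = []
    if not playbook.steps:
        warnings.append("playbook has no steps")
    if len(playbook.steps) > MAX_STEPS:
        warnings.append(
            f"{len(playbook.steps)} steps exceeds {MAX_STEPS}; "
            "consider splitting into more specific playbooks"
        )
    return warnings
```

A check like this only counts steps. Deciding whether a playbook is specific enough for its incident type still requires human review.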
In my consulting engagements, I regularly come across organizations with a “playbook mentality” that believe scripting response procedures is a silver bullet for incident response. This could not be further from the truth. Enterprises should create playbooks to guide their incident response teams and provide a repeatable procedure with documented preapproved decisions. However, it is vital to understand that even incidents of the same type vary in complexity, scope, and the tactics, techniques, and procedures (TTPs) that an adversary employs to progress through the cyberattack lifecycle. For this reason, even the most sophisticated playbook cannot replace critical thinking and the experience of a seasoned incident response professional.
For that reason, during client engagements I emphasize to senior management that playbooks should guide rather than constrain incident responders. Depending on the specific case, analysts may need to deviate from a standard procedure based on sound and logical judgment of the situation. Playbooks do not replace critical thinking: security analysts and other stakeholders who participate in the incident response process still need to apply it and demonstrate the necessary skills and experience in their respective domains.
This section discusses generic content and recommendations that enterprises may choose to include in their incident response playbooks. The content is organized by the stages of the incident response lifecycle.
Arguably, an incident response team invokes a playbook after incident detection. For this reason, specific technical means of identifying incidents are typically out of scope. However, a playbook should still include specific triage steps. Triage allows incident responders to determine whether the reported event is a cybersecurity incident and collect contextual information before diving into detailed analysis. Typical steps in this phase include the following:
Depending on the nature of an incident and its scope, analysts may acquire and examine different types of data. The purpose of the analysis phase is to determine attacker activity in the compromised environment and understand the full scope of the incident.
The analysis steps must focus on the objectives of the analysis and the type of incident that a playbook addresses. At the same time, the steps must be generic enough not to constrain the analyst. For example, when investigating a backdoor malware on a server system, an analyst may choose to focus on program execution artifacts rather than reviewing every possible forensic artifact that the system may have produced. However, including separate steps to review each program execution artifact would be excessive and impractical. Typical steps in this phase include the following:
It is crucial to emphasize that analysis is an iterative, and often incremental, process. For this reason, analysts iterate through the lifecycle until they reach diminishing returns relating to new findings, such as discovering additional compromised systems. It is essential to communicate this concept to key stakeholders and set expectations accordingly, particularly during high-pressure incidents.
The information that analysts compile during the analysis steps drives containment and eradication activities. These activities are specific to a threat type, and their execution heavily depends on the architecture of the compromised environment, available tools, policies, and other controls. Another crucial factor that impacts containment and eradication is business priorities. For example, isolating a critical server from a network may be a sensible containment strategy from a technical perspective, but not feasible from a business point of view.
The first activity in defining containment and eradication steps as part of a playbook definition is to create a strategy and then select tools and technology to implement that strategy. A strategy to contain an incident and eradicate a threat actor may include the following:
Furthermore, enterprises can enhance security monitoring to detect additional attacker activities or attempts to come back to the compromised network after remediation.
Recovery is typically outside the scope of an incident response playbook. However, in some cases, organizations may want to include in an incident response playbook steps relating to invoking functions that typically participate in the recovery process, such as the DR function or technical groups. For example, an enterprise may choose to include a step in the recovery phase to re-image malware-infected workstations and assign the ownership of that step to a workplace services group.
In addition to the steps necessary to investigate and remediate an incident, organizations need to consider these other components relating to the overall incident management process:
The post-incident phase, also referred to as postmortem, includes activities that follow the recovery phase. Vulnerability management and lessons learned are two vital processes that can help organizations determine the root cause of an incident and the necessary improvements to prevent or reduce the impact of similar incidents in the future. The outcome of these processes typically feeds into the continual improvement process discussed in the next section.
This section briefly discusses vulnerability management as it pertains to cyber breach response. The purpose of an incident management process is to address the aftermath of a cyberattack and minimize the damage it causes to the enterprise. However, if the enterprise does not address the underlying security weaknesses that allow attackers to compromise systems in the first place, similar incident types are likely to recur. This reactive approach often leads to inefficient use of resources and reduced security posture. Vulnerability management is a process that allows enterprises to address the issue.11
As stated previously, vulnerability management is the process of identifying, evaluating, reporting, and remediating security vulnerabilities. A vulnerability is a weakness in a computer system, software application, design, implementation, or control that a threat actor can exploit to gain unauthorized access to a computer network and progress through the cyberattack lifecycle. The purpose of vulnerability management is to identify and mitigate vulnerabilities and minimize the attack surface of computer systems and software applications.
Vulnerability management is often an independent function within the overall cybersecurity program. Investigative findings and security posture weaknesses discovered during incident response must feed into the vulnerability management process to prevent similar incident types from recurring. In simple terms, incident management is a reactive process, whereas vulnerability management attempts to address issues proactively.
The vulnerability management lifecycle, as it pertains to incident response, consists of five steps necessary to identify and address vulnerabilities that lead to cybersecurity incidents. The lifecycle approach applies to vulnerabilities that incident responders identify during incident investigations. A comprehensive vulnerability management program contains additional stages, such as asset discovery and prioritization, that the approach does not include. Figure 4.5 depicts the lifecycle.
Chapter 1 discussed risk management as a driver for cyber breach response. In the cybersecurity domain, vulnerability management has a strong relationship with risk management.
Incident response teams should submit underlying security weaknesses that they identify as part of a root cause analysis to the risk management function for evaluation. For this reason, enterprises must integrate their vulnerability management process with risk management to determine cost-effective measures to address those weaknesses. The risk response process includes the following options:
Lessons learned is the knowledge that enterprises gain from responding to cybersecurity incidents. The purpose of lessons learned is to identify and document what worked well during incident response, as well as any roadblocks and issues. The outcome of a lessons-learned session is a set of specific action items designed to improve future response.
Action items may fall within one of the following categories: operational, tactical, or strategic. Organizations can implement operational items as part of day-to-day operations. However, tactical and strategic action items typically require funding and other resources. For this reason, they must feed into the continual improvement process. I discuss this topic in more detail in the “Continual Improvement” section.
Organizations should conduct a lessons-learned meeting after every major incident. It is also a good practice to hold regular lessons-learned meetings to evaluate the response to less severe incidents to identify actionable improvements.
Lessons learned is an iterative process consisting of several steps necessary to evaluate and improve response to cyberattacks continuously, as depicted in Figure 4.6.
People, process, and technology are critical components of effective incident response. To identify lessons learned, enterprises need to evaluate the performance of each of those components.
The people component encompasses all the stakeholders who participate in the incident response process. The following list provides examples of discussion topics that may help identify lessons learned relating to this component:
The process component is necessary to execute the incident management process in a structured and coordinated manner. The following list provides examples of discussion topics that may help identify lessons learned relating to this component:
The technology component refers to the tools and technologies necessary to respond to incidents effectively, as well as other technology considerations during incident investigations. The following list provides examples of discussion topics that may help identify lessons learned for this component:
This section contains general recommendations for conducting fruitful lessons-learned sessions.
It is critical that the person who leads the discussion makes the attendees comfortable and facilitates the discussion in a nonconfrontational way. During review meetings, stakeholders tend to bring up negative experiences, so keeping the discussion balanced is vital to ensuring that stakeholders identify lessons learned as opposed to assigning blame.
Some participants may be introverted or lack the confidence to speak. Encourage them to participate. In my personal experience, some of the most valuable contributions come from those who quietly evaluate their observations before bringing them up in front of others.
Lastly, an excellent technique to facilitate a meeting is to ask open-ended questions and then follow up with specific targeted questions to elicit further details. Starting with open-ended questions allows the facilitator to collect general feedback before drilling into specific areas of interest.
Chapter 1 briefly discussed the need for continual improvement as a vehicle to evaluate and improve a cyber breach response program continuously. The chapter also discussed the drivers for continual improvement, as well as the organizational levels at which it can take place. This section discusses this topic in depth and provides a methodology that can help organizations successfully implement a continual improvement process.
This section presents two conceptual models that form the foundation of continual improvement: the Deming cycle and the Data, Information, Knowledge, Wisdom (DIKW) hierarchy. The former model describes quality management principles, whereas the latter model explains the principles of knowledge management.
The Deming cycle is a continuous quality improvement model that consists of a sequence of four steps: Plan, Do, Check, and Act (PDCA), as depicted in Figure 4.7. The PDCA cycle is a fundamental tenet of many quality management standards, such as ISO/IEC 27001.12
The DIKW hierarchy is a knowledge management model that describes how data evolves into information, knowledge, and wisdom, respectively, as represented in Figure 4.8.13
The ITIL seven-step improvement process builds on the improvement principles discussed in the previous section. It presents a sequence of actionable steps that can help organizations implement a continual improvement process as part of their governance.14
Figure 4.9 provides a visual representation of this process. The next section provides practical guidance on how to implement each step.
Chapter 1 discussed how to create a vision as part of the overall cyber breach response program. The same strategies apply to defining a vision for improvement.
As part of this step, enterprises need to make sure that their strategy is realistic and aligned to their risk appetite. Answering the following questions is a good start for developing a continual improvement strategy:
After identifying specific improvement initiatives, organizations must document them in a continual improvement register, assign an owner to each initiative, and review them regularly.
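A continual improvement register can be as simple as a structured list. The following Python sketch shows one possible shape for it. The `Initiative` fields, the `due_for_review` helper, and the 90-day review interval are assumptions for illustration only; organizations should choose attributes and a review cadence that fit their own governance.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Initiative:
    """One entry in a (hypothetical) continual improvement register."""
    title: str
    owner: str
    last_reviewed: date


def due_for_review(register: List[Initiative], today: date,
                   interval: timedelta = timedelta(days=90)) -> List[Initiative]:
    """Return initiatives whose last review is older than the interval."""
    return [item for item in register if today - item.last_reviewed > interval]
```

Running `due_for_review` on a regular schedule gives each owner a list of initiatives that have not been reviewed within the agreed interval.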
Metrics and key performance indicators (KPIs) are vital components for measuring the performance of a program. They must be quantifiable and allow visualization of trends over time.15
Ideally, organizations need to establish metrics and KPIs for each objective established during the first step. As with the previous step, enterprises must document the metrics and KPIs that they decide to measure. The following list contains practical tips for getting started with metrics and KPIs:
A common mistake is to create a metric system that is not suitable for the target audience. For example, presenting data relating to SQL injection attacks against a web application to a board of directors is not the most effective way to communicate risk. Instead, focus on metrics that can support KPIs and communicate the intended message. For example, measuring the occurrence of specific incident types and presenting trends over a period of time helps senior management communicate risk more effectively and secure the necessary resources to keep that risk at an acceptable level.
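As a concrete illustration of the second example, the following Python sketch counts incidents per type per calendar month, which is the raw material for a trend chart. The input format, a sequence of `(date opened, incident type)` pairs, and the function name `monthly_counts` are assumptions; in practice, the data would come from the organization's incident tracking system.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple


def monthly_counts(incidents: Iterable[Tuple[date, str]]) -> Dict[str, Dict[str, int]]:
    """Count incidents per type per calendar month (month keys like '2024-03')."""
    counts: Counter = Counter()
    for opened, incident_type in incidents:
        counts[(incident_type, opened.strftime("%Y-%m"))] += 1

    # Regroup into {incident_type: {month: count}} with months in order.
    trend: Dict[str, Dict[str, int]] = {}
    for (incident_type, month), n in sorted(counts.items()):
        trend.setdefault(incident_type, {})[month] = n
    return trend
```

The resulting per-type series can be plotted over several months, so that senior management sees whether a given incident type is rising or falling rather than a list of individual technical events.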
Enterprises may also choose to create a metric database or some other index to provide metadata on metrics. The metadata can include the following elements for each metric:
This approach can help organizations understand their metric system and keep it organized.
The quality of metrics that an enterprise produces heavily depends on the data that the organization uses to produce those metrics. The characteristics that define data quality include the following:
The following list contains practical tips for ensuring data quality in support of metrics:
After the collection phase, organizations must convert raw data into information. Data processing usually consists of the following steps:
In this step, organizations extract actionable intelligence by applying business context to the information and metrics established during the previous step. Stakeholders typically review the presented information for trends and patterns to guide decision making. They may also ask specific questions relating to the program, such as these:
In this step, management assesses the analysis results and makes decisions regarding specific improvement plans. The presentation style and level of detail must be appropriate for management to enable them to make decisions. The following list contains practical tips for ensuring effective presentation:
In this step, an enterprise implements the improvement plans agreed to during the previous step. This step may include the following:
An incident response plan contains several components that must work in tandem to allow organizations to respond to cybersecurity incidents effectively.
Most incidents have a predictable lifecycle. An effective incident management process helps organizations manage incident response activities in a structured and coordinated manner and ensure that they can continue business operations during cyberattacks. It consists of several components and activities that integrate the people, process, and technology dimensions.
For specific incident types, organizations may choose to create playbooks. Playbooks allow organizations to create repeatable procedures and accelerate response activities. However, organizations need to remember that even a state-of-the-art playbook cannot replace skills, experience, and critical thinking.
Finally, to improve their cyber breach response capabilities continually, enterprises need to conduct lessons-learned sessions, identify vulnerabilities, and implement measures to prevent recurring incidents or reduce their impact on the business. Continual improvement is vital to successful governance, and it helps organizations mature and improve their programs incrementally.
nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-61r2.pdf
www.iso.org/standard/75106.html
www.us-cert.gov/sites/default/files/c3vp/crr_resources_guides/CRR_Resource:Guide-IM.pdf
www.tsoshop.co.uk/gempdf/ITIL_Maturity_Model_SA_User_Guide_v1_2W.pdf
cmmiinstitute.com
www.iso.org/standard/62085.html
usa.visa.com/dam/VCOM/download/merchants/cisp-what-to-do-if-compromised.pdf
nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-61r2.pdf
www.us-cert.gov/CISA-Cyber-Incident-Scoring-System
www.iso.org/standard/54534.html