Chapter 9

Security Operations

IN THIS CHAPTER

check Understanding investigations

check Applying security operations concepts and controls

check Responding to incidents

check Preparing for disasters

check Keeping facilities and personnel safe

The Security Operations domain covers lots of essential security concepts and builds on many of the other security domains, including Security and Risk Management (Chapter 3), Asset Security (Chapter 4), Security Architecture and Engineering (Chapter 5), and Communication and Network Security (Chapter 6). Security operations encompasses the routine activities that occur across many of the CISSP domains. This domain represents 13 percent of the CISSP certification exam.

Understand and Support Investigations

Conducting investigations for various purposes is an important function for security professionals. You must understand evidence collection and handling procedures, reporting and documentation requirements, various investigative processes, and digital forensics tools and techniques. Successful conclusions in investigations depend heavily on proficiency in these skills.

Evidence collection and handling

Evidence is information presented in a court of law to confirm or dispel a fact that’s under contention, such as the commission of a crime, the violation of policy, or an ethics matter. A case can’t be brought to trial or other legal proceeding without sufficient evidence to support the case. Thus, properly gathering and protecting evidence is one of the most important and most difficult tasks that an investigator must master.

Important evidence collection and handling topics covered on the CISSP exam include the types of evidence, rules of evidence, admissibility of evidence, chain of custody, and the evidence lifecycle.

Types of evidence

Sources of legal evidence that you can present in a court of law generally fall into one of four major categories:

  • Direct evidence: Oral testimony or a written statement based on information gathered through a witness’s five senses (in other words, an eyewitness account) that proves or disproves a specific fact or issue.
  • Real (or physical) evidence: Tangible objects from the actual crime, such as the tools or weapons used and any stolen or damaged property. May also include visual or audio surveillance tapes generated during or after the event. Physical evidence from a computer crime is not always available.
  • Documentary evidence: Includes originals and copies of business records, computer-generated and computer-stored records, manuals, policies, standards, procedures, and log files. Most evidence presented in a computer crime case is documentary evidence. The hearsay rule (which we discuss in the section “Hearsay rule,” later in this chapter) is an extremely important test of documentary evidence that must be understood and applied to this type of evidence.
  • Demonstrative evidence: Used to aid the court’s understanding of a case. Opinions are considered demonstrative evidence and may be either expert (based on personal expertise and facts) or non-expert (based on facts only). Other examples of demonstrative evidence include models, simulations, charts, and illustrations.

Other types of evidence that may fall into one or more of the above major categories include

  • Best evidence: Original, unaltered evidence, which is preferred by the court over secondary evidence. Read more about this evidence in the section “Best evidence rule,” later in this chapter.
  • Secondary evidence: A duplicate or copy of evidence, such as a tape backup, screen capture, or photograph.
  • Corroborative evidence: Supports or substantiates other evidence presented in a case.
  • Conclusive evidence: Incontrovertible and irrefutable — you know, the smoking gun.
  • Circumstantial evidence: Relevant facts that you can’t directly or conclusively connect to other events, but about which a reasonable person can make a reasonable inference.

Rules of evidence

Important rules of evidence for computer crime cases include the best evidence rule and the hearsay evidence rule. The CISSP candidate must understand both of these rules and their applicability to evidence in computer crime cases.

BEST EVIDENCE RULE

The best evidence rule, defined in the Federal Rules of Evidence, states that “to prove the content of a writing, recording, or photograph, the original writing, recording, or photograph is [ordinarily] required.”

However, the Federal Rules of Evidence define an exception to this rule as “[i]f data are stored in a computer or similar device, any printout or other output readable by sight, shown to reflect the data accurately, is an ‘original’.”

Thus, data extracted from a computer — if that data is a fair and accurate representation of the original data — satisfies the best evidence rule and may normally be introduced into court proceedings as such.

HEARSAY RULE

Hearsay evidence is evidence that’s not based on personal, first-hand knowledge of a witness, but rather comes from other sources. Under the Federal Rules of Evidence, hearsay evidence is normally not admissible in court. This rule exists to prevent unreliable testimony from improperly influencing the outcome of a trial.

Business records, including computer records, have traditionally, and perhaps mistakenly, been considered hearsay evidence by most courts because these records cannot be proven accurate and reliable. One of the most significant obstacles for a prosecutor to overcome in a computer crime case is getting computer records admitted as evidence.

tip A prosecutor may be able to introduce computer records as best evidence, rather than hearsay evidence, which we discuss in the preceding section.

Several courts have acknowledged that the hearsay rules are applicable to computer-stored records containing human statements but are not applicable to computer-generated records untouched by human hands.

Perhaps the most successful and commonly applied test of admissibility for computer records, in general, has been the business records exception, established in the U.S. Federal Rules of Evidence, for records of regularly conducted activity, meeting the following criteria:

  • Made at or near the time (contemporaneously) that the act occurred.
  • Made by a person who has knowledge of the business process or from information transmitted by a person who has knowledge of the business process.
  • Made and relied on during the regular conduct of business or in the furtherance of the business, as verified by the custodian or other witness familiar with the records’ use.
  • Kept for motives that tend to assure their accuracy.
  • In the custody of the witness on a regular basis (as required by the chain of evidence).

tip The chain of evidence establishes accountability for the handling of evidence throughout the evidence lifecycle. See the section “Chain of custody and the evidence lifecycle” later in this chapter.

Admissibility of evidence

Because computer-generated evidence can sometimes be easily manipulated, altered, or tampered with, and because it’s not easily and commonly understood, this type of evidence is usually considered suspect in a court of law. In order to be admissible, evidence must be

  • Relevant: It must tend to prove or disprove facts that are relevant and material to the case.
  • Reliable: It must be reasonably proven that what is presented as evidence is what was originally collected and that the evidence itself is reliable. This is accomplished, in part, through proper evidence handling and the chain of custody. (We discuss this in the upcoming section “Chain of custody and the evidence lifecycle.”)
  • Legally permissible: It must be obtained through legal means. Evidence that’s not legally permissible may include evidence obtained through the following means:
    • Illegal search and seizure: Law enforcement personnel must obtain a prior court order; however, non–law enforcement personnel, such as a supervisor or system administrator, may be able to conduct an authorized search under some circumstances.
    • Illegal wiretaps or phone taps: Anyone conducting wiretaps or phone taps must obtain a prior court order.
    • Entrapment or enticement: Entrapment encourages someone to commit a crime that the individual may have had no intention of committing. Conversely, enticement lures someone toward certain evidence (a honey pot, if you will) after that individual has already committed a crime. Enticement isn’t necessarily illegal, but it does raise certain ethical arguments and may not be admissible in court.
    • Coercion: Coerced testimony or confessions are not legally permissible. Coercion involves compelling a person to involuntarily provide evidence through the use of threats, violence (torture), bribery, trickery, or intimidation.
    • Unauthorized or improper monitoring: Active monitoring must be properly authorized and conducted in a standard manner; users must be notified that they may be subject to monitoring.

Chain of custody and the evidence lifecycle

The chain of custody (or chain of evidence) provides accountability and protection for evidence throughout its entire lifecycle and includes the following information, which is normally kept in an evidence log:

  • Persons involved (Who): Identify any and all individual(s) who discovered, collected, seized, analyzed, stored, preserved, transported, or otherwise controlled the evidence. Also identify any witnesses or other individuals present during any of the above actions.
  • Description of evidence (What): Ensure that all evidence is completely and uniquely described.
  • Location of evidence (Where): Provide specific information about the evidence’s location when it is discovered, analyzed, stored, or transported.
  • Date/Time (When): Record the date and time that evidence is discovered, collected, seized, analyzed, stored, or transported. Also, record date and time information for any evidence log entries associated with the evidence.
  • Methods used (How): Provide specific information about how evidence was discovered, collected, stored, preserved, or transported.

Any time that evidence changes possession or is transferred to a different media type, it must be properly recorded in the evidence log to maintain the chain of custody.
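
The who, what, where, when, and how elements above map naturally onto a structured record. The following minimal sketch (written in Python, with hypothetical field names rather than the format of any particular forensic tool) illustrates how chain-of-custody entries might be captured in an append-only evidence log, so that every change of possession is recorded:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CustodyEntry:
    """One chain-of-custody event for a single piece of evidence."""
    evidence_id: str      # unique identifier for the item (What)
    handled_by: str       # person taking or releasing possession (Who)
    action: str           # e.g., "collected", "analyzed", "transported" (How)
    location: str         # where the action took place (Where)
    timestamp: datetime   # when the action took place (When)
    notes: str = ""       # witnesses, container seals, and so on

@dataclass
class EvidenceLog:
    """Append-only evidence log; existing entries are never altered."""
    entries: list = field(default_factory=list)

    def record(self, entry: CustodyEntry) -> None:
        self.entries.append(entry)

# Example: recording a transfer of possession
log = EvidenceLog()
log.record(CustodyEntry(
    evidence_id="CASE-042-HDD-01",
    handled_by="J. Smith",
    action="transported to evidence locker",
    location="HQ evidence room",
    timestamp=datetime.now(timezone.utc),
    notes="Sealed container; witnessed by K. Lee",
))
```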

Law enforcement officials must strictly adhere to chain of custody requirements, and this adherence is highly recommended for anyone else involved in collecting or seizing evidence. Security professionals and incident response teams must fully understand and follow chain of custody principles and procedures, no matter how minor or insignificant a security incident may initially appear. In both cases, chain of custody serves to prove that digital evidence has not been modified at any point in the forensic examination and analysis.

Even properly trained law enforcement officials sometimes make crucial mistakes in evidence handling and safekeeping. Most attorneys won’t understand the technical aspects of the evidence that you may present in a case, but they will definitely know evidence-handling rules and will most certainly scrutinize your actions in this area. Improperly handled evidence, no matter how conclusive or damaging, will likely be inadmissible in a court of law.

The evidence lifecycle describes the various phases of evidence, from its initial discovery to its final disposition.

The evidence lifecycle has the following five stages:

  • Collection and identification
  • Analysis
  • Storage, preservation, and transportation
  • Presentation in court
  • Final disposition — for example, return to owner or destroy (if it is a copy)

The following sections explain more about each stage.

COLLECTION AND IDENTIFICATION

Collecting evidence involves taking that evidence into custody. Unfortunately, evidence can’t always be collected and must instead be seized. Many legal issues are involved in seizing computers and other electronic evidence. The publication Searching and Seizing Computers and Obtaining Evidence in Criminal Investigations (3rd edition, 2009), published by the U.S. Department of Justice (DOJ) Computer Crime and Intellectual Property Section (CCIPS), provides comprehensive guidance on this subject. This publication is available for download at www.justice.gov/sites/default/files/criminal-ccips/legacy/2015/01/14/ssmanual2009.pdf.

In general, law enforcement officials can search and/or seize computers and other electronic evidence under any of four circumstances:

  • Voluntary or consensual: The owner of the computer or electronic evidence can freely surrender the evidence.
  • Subpoena: A court issues a subpoena to an individual, ordering that individual to deliver the evidence to the court.
  • Search warrant or Anton Piller order: A search warrant is issued to a law enforcement official by the court, allowing that official to search and seize specific evidence. An Anton Piller order is a court order that allows the premises to be searched and evidence seized without prior warning, usually to prevent the possible destruction of evidence.
  • Exigent circumstances: If probable cause exists and the destruction of evidence is imminent, that evidence may be searched or seized without a warrant.

When evidence is collected, it must be properly marked and identified. This ensures that it can later be properly presented in court as actual evidence gathered from the scene or incident. The collected evidence must be recorded in an evidence log with the following information:

  • A description of the particular piece of evidence including any specific information, such as make, model, serial number, physical appearance, material condition, and preexisting damage.
  • The name(s) of the person(s) who discovered and collected the evidence.
  • The exact date and time, specific location, and circumstances of the discovery/collection.

Additionally, the evidence must be marked, using the following guidelines:

  • Mark the evidence: If possible without damaging the evidence, mark the actual piece of evidence with the collecting individual’s initials, the date, and the case number (if known). Seal the evidence in an appropriate container and again mark the container with the same information.
  • Use an evidence tag: If the actual evidence cannot be marked, attach an evidence tag with the same information as above, seal the evidence and tag in an appropriate container, and again mark the container with the same information.
  • Seal the evidence: Seal the container with evidence tape and mark the tape in a manner that will clearly indicate any tampering.
  • Protect the evidence: Use extreme caution when collecting and marking evidence to ensure that it’s not damaged. If you’re using plastic bags for evidence containers, be sure that they’re static free to protect magnetic media.

Always collect and mark evidence in a consistent manner so that you can easily identify evidence and describe your collection and identification techniques to an opposing attorney in court, if necessary.

ANALYSIS

Analysis involves examining the evidence for information pertinent to the case. Analysis should be conducted with extreme caution, by properly trained and experienced personnel only, to ensure the evidence is not altered, damaged, or destroyed.

STORAGE, PRESERVATION, AND TRANSPORTATION

All evidence must be properly stored in a secure facility and preserved to prevent damage or contamination from various hazards, including intense heat or cold, extreme humidity, water, magnetic fields, and vibration. Evidence that’s not properly protected may be inadmissible in court, and the party responsible for collection and storage may be liable. Care must also be exercised during transportation to ensure that evidence is not lost, temporarily misplaced, damaged, or destroyed.

PRESENTATION IN COURT

Evidence to be presented in court must continue to follow the chain of custody and be handled with the same care as at all other times in the evidence lifecycle. This process continues throughout the trial until all testimony related to the evidence is completed and the trial has concluded or the case is settled or dismissed.

FINAL DISPOSITION

After the conclusion of the trial or other disposition, evidence is normally returned to its proper owner. However, under some circumstances, certain evidence may be ordered destroyed, such as contraband, drugs, or drug paraphernalia. Any evidence obtained through a search warrant is legally under the control of the court, possibly requiring the original owner to petition the court for its return.

Reporting and documentation

As described in the preceding section, complete and accurate recordkeeping is critical to each investigation. An investigation’s report is intended to be a complete record of an investigation, and usually includes the following:

  • Incident investigators, including their qualifications and contact information.
  • Names of parties interviewed, including their role, involvement, and contact information.
  • List of all evidence collected, including chain(s) of custody.
  • Tools used to examine or process evidence, including versions.
  • Samples and sampling methodologies used, if applicable.
  • Computers used to examine, process, or store evidence, including a description of configuration.
  • Root-cause analysis of incident, if applicable.
  • Conclusions and opinions of investigators.
  • Hearings or proceedings.
  • Parties to whom the report is delivered.

Investigative techniques

An investigation should begin immediately upon report of an alleged computer crime, policy violation, or incident. Any incident should be handled, at least initially, as a computer crime investigation or policy violation until a preliminary investigation determines otherwise. Different investigative techniques may be required, depending upon the goal of the investigation or applicable laws and regulations. For example, incident handling requires expediency to contain any potential damage as quickly as possible. A root cause analysis requires in-depth examination to determine what happened, how it happened, and how to prevent the same thing from happening again.

However, in all cases, proper evidence collection and handling is essential. Even if a preliminary investigation determines that a security incident was not the result of criminal activity, you should always handle any potential evidence properly, in case either further legal proceedings are anticipated or a crime is later uncovered during the course of a full investigation.

The CISSP candidate should be familiar with the general steps of the investigative process:

  1. Detect and contain an incident.

    Early detection is critical to a successful investigation. Unfortunately, computer-related incidents usually involve passive or reactive detection techniques (such as the review of audit trails and accidental discovery), which often leave a cold evidence trail. Containment minimizes further loss or damage. The computer incident response team (CIRT), which we discuss later in this chapter, is the team that is normally responsible for conducting an investigation. The CIRT should be notified (or activated) as quickly as possible after a computer crime is detected or suspected.

  2. Notify management.

    Management must be notified of any investigations as soon as possible. Knowledge of the investigations should be limited to as few people as possible, on a need-to-know basis. Out-of-band communication methods (reporting in person) should be used to ensure that an intruder does not intercept sensitive communications about the investigation.

  3. Conduct a preliminary investigation.

    This preliminary investigation determines whether an incident or crime actually occurred. Most incidents turn out to be honest mistakes rather than malicious conduct. This step includes reviewing the complaint or report, inspecting damage, interviewing witnesses, examining logs, and identifying further investigation requirements.

  4. Determine whether the organization should disclose that the crime occurred.

    First, and most importantly, determine whether laws or regulations require the organization to disclose a crime or incident. Next, by coordinating with a public relations or public affairs official of the organization, determine whether the organization wants to disclose this information.

  5. Conduct the investigation.

    Conducting the investigation involves three activities:

    1. Identify potential suspects.

      Potential suspects include insiders and outsiders to the organization. One standard discriminator to help determine or eliminate potential suspects is the MOM test: Did the suspect have the Motive, Opportunity, and Means? The Motive might relate to financial gain, revenge, or notoriety. A suspect had Opportunity if he or she had access, whether as an authorized user for an unauthorized purpose or as an unauthorized user — due to the existence of a security weakness or vulnerability — for an unauthorized purpose. And Means relates to whether the suspect had the necessary tools and skills to commit the crime.

    2. Identify potential witnesses.

      Determine whom you want interviewed and who conducts the interviews. Be careful not to alert any potential suspects to the investigation; focus on obtaining facts, not opinions, in witness statements.

    3. Prepare for search and seizure.

      Identify the types of systems and evidence that you plan to search or seize, designate and train the search and seizure team members (normally members of the Computer Incident Response Team, or CIRT), obtain and serve proper search warrants (if required), and determine potential risk to the system during a search and seizure effort.

  6. Report your findings.

    The results of the investigation, including evidence, should be reported to management and turned over to proper law enforcement officials or prosecutors, as appropriate.

remember MOM stands for Motive, Opportunity, and Means.

Digital forensics tools, tactics, and procedures

Digital forensics is the science of conducting a computer incident investigation to determine what has happened and who is responsible, and to collect legally admissible evidence for use in subsequent legal proceedings, such as a criminal investigation, internal investigation, or lawsuit.

Proper forensic analysis and investigation requires in-depth knowledge of hardware (such as endpoint devices and networking equipment), operating systems (including desktop, server, mobile device, and other device operating systems, like routers, switches, and load balancers), applications, databases, and software programming languages, as well as knowledge and experience using sophisticated forensics tools and toolkits.

The types of forensic data-gathering techniques include

  • Hard drive forensics. Here, specialized tools are used to create one or more forensically identical copies of a computer’s hard drive. A device called a write blocker is typically used to prevent any possible alterations to the original drive. Cryptographic checksums can be used to verify that a forensic copy is an exact duplicate of the original (a brief checksum sketch follows this list).

    Tools are then used to examine the contents of the hard drive in order to determine

    • Last known state of the computer
    • History of files accessed
    • History of files created
    • History of files deleted
    • History of programs executed
    • History of web sites visited by a browser
    • History of attempts by the user to remove evidence
  • Live forensics. Here, specialized tools are used to examine a running system, including:

    • Running processes
    • Currently open files
    • Contents of main storage (RAM)
    • Keystrokes
    • Communications traffic in/out of the computer

    Live forensics is difficult to perform because the tools used to collect information can also affect the system being examined.
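
To make the cryptographic checksum idea from the hard drive forensics bullet concrete, here is a minimal sketch in Python (the image file paths are hypothetical) that verifies a forensic copy against the original drive image by comparing SHA-256 digests:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 digest of a (potentially very large) file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths to the original drive image and its forensic copy
original_digest = sha256_of("/evidence/original.img")
copy_digest = sha256_of("/evidence/forensic_copy.img")

if original_digest == copy_digest:
    print("Checksums match: the copy is bit-for-bit identical to the original.")
else:
    print("Checksum mismatch: the copy is NOT an exact duplicate.")
```

In practice, examiners record these digests in the evidence log so that the integrity of the forensic copy can be demonstrated later in court.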

Understand Requirements for Investigation Types

The purpose of an investigation is to determine what happened and who is responsible, and to collect evidence that supports this hypothesis. Closely related to, but distinctly different from, investigations is incident management (discussed in detail later in this chapter). Incident management determines what happened, contains and assesses damage, and restores normal operations.

Investigations and incident management must often be conducted simultaneously in a well-coordinated and controlled manner to ensure that the initial actions of either activity don’t destroy evidence or cause further damage to the organization’s assets. For this reason, it’s important that Computer Incident Response Teams (CIRTs), also known as Computer Emergency Response Teams (CERTs) or Computer Security Incident Response Teams (CSIRTs), be properly trained and qualified to secure a computer-related crime scene or incident while preserving evidence. Ideally, the CIRT includes individuals who will actually conduct the investigation.

Consider the analogy of a police patrol officer who discovers a murder victim. The officer must quickly assess the safety of the situation and secure the crime scene, but at the same time must be careful not to disturb or destroy any evidence. The homicide detective’s job is to gather and analyze the evidence. Ideally, but rarely, the homicide detective would be the one who discovers the murder victim, allowing the detective to assess the safety of the situation, secure the crime scene, and begin collecting evidence. Think of yourself as a CSI-SSP!

Different requirements for various investigation types include

  • Operational. After any damage from a security incident has been contained, operational investigations typically focus on root-cause analysis, lessons learned, and management reporting.
  • Criminal. Criminal investigations require strict adherence to proper evidence collection and handling procedures. The investigation is focused on discovering and preserving evidence for possible prosecution of any culpable parties.
  • Civil. A civil investigation may result from a data breach or regulatory violation, and typically will focus on quantifying any damage, and establishing due diligence or negligence.
  • Regulatory. Regulatory investigations often take the form of external, mandatory audits, and are focused on evaluating security controls and compliance.

Various industry standards and guidelines provide guidance for conducting investigations. These include the American Bar Association’s (ABA) Best Practices in Internal Investigations, various best practice guidelines and toolkits published by the U.S. Department of Justice (DOJ), and ASTM International’s Standard Practice for Computer Forensics (ASTM E2763).

Conduct Logging and Monitoring Activities

Event logging is an essential part of an organization’s IT operations. Increasingly, organizations are implementing centralized log collection systems that often serve as security information and event management (SIEM) platforms.

Intrusion detection and prevention

Intrusion detection is a passive technique used to detect unauthorized activity on a network. An intrusion detection system is frequently called an IDS. Three types of IDSs used today are

  • Network-based intrusion detection (NIDS): Consists of a separate device attached to a network that listens to all network traffic by using various methods (which we describe later in this section) to detect anomalous activity.
  • Host-based intrusion detection (HIDS): This is really a subset of network-based IDS, in which only the network traffic destined for a particular host is monitored.
  • Wireless intrusion detection (WIDS): This is another type of network intrusion detection that focuses on wireless intrusion by scanning for rogue access points.

Both network- and host-based IDSs use several detection methods (a brief detection sketch follows this list):

  • Signature-based: A signature-based IDS compares network traffic that is observed with a list of patterns in a signature file. A signature-based IDS detects any of a known set of attacks, but if an intruder is able to change the patterns that he uses in his attack, then his attack may be able to slip by the IDS without being detected. The other downside of signature-based IDS is that the signature file must be frequently updated.
  • Reputation-based: Closely akin to signature-based detection, reputation-based alerting is all about detecting when communications and other activities involve known-malicious domains and IP networks. Some IDSs update themselves several times daily, including adding to a list of known-malicious domains and IP addresses. Then, when any activities are associated with a known-malicious domain or IP address, the IDS can create an alert that lets personnel know about the activity.
  • Anomaly-based: An anomaly-based IDS monitors all the traffic over the network and builds traffic profiles. Over time, the IDS will report deviations from the profiles that it has built. The upside of anomaly-based IDSs is that there are no signature files to periodically update. The downside is that you may have a high volume of false-positives. Behavior-based and heuristics-based IDSs are similar to anomaly-based IDSs and share many of the same advantages. Rather than detecting anomalies to normal traffic patterns, behavior-based and heuristics-based systems attempt to recognize and learn potential attack patterns.
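
As a rough illustration of the signature-based and reputation-based methods just described (and not a depiction of how any commercial IDS is actually implemented), the following Python sketch flags traffic that matches a hypothetical known-attack byte pattern or involves a hypothetical known-malicious address:

```python
# Hypothetical attack patterns and known-malicious addresses, for illustration only
KNOWN_SIGNATURES = [b"/etc/passwd", b"' OR '1'='1"]
BLOCKLISTED_IPS = {"198.51.100.23", "203.0.113.77"}

def inspect(payload: bytes, source_ip: str) -> list:
    """Return alert messages for a single observed packet."""
    alerts = []
    for signature in KNOWN_SIGNATURES:            # signature-based detection
        if signature in payload:
            alerts.append(f"signature match {signature!r} from {source_ip}")
    if source_ip in BLOCKLISTED_IPS:              # reputation-based detection
        alerts.append(f"traffic involving known-malicious address {source_ip}")
    return alerts

print(inspect(b"GET /../../etc/passwd HTTP/1.1", "198.51.100.23"))
```

An anomaly-based approach would instead build a statistical profile of normal traffic and alert on deviations from that profile.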

Intrusion detection doesn’t stop intruders, but intrusion prevention does … or, at least, it slows them down. Intrusion prevention systems (IPSs) are newer and more common than IDSs, and are designed to detect and block intrusions. An intrusion prevention system is simply an IDS that can take action, such as dropping a connection or blocking a port, when an intrusion is detected.

remember Intrusion detection looks for known attacks and/or anomalous behavior on a network or host.

See Chapter 6 for more on intrusion detection and intrusion prevention systems.

Security information and event management

Security information and event management (SIEM) solutions provide real-time collection, analysis, correlation, and presentation of security logs and alerts generated by various network sources (such as firewalls, IDS/IPS, routers, switches, servers, and workstations).

A SIEM solution can be software- or appliance-based, and may be hosted and managed either internally or by a managed security service provider.

A SIEM requires a lot of up-front configuration and tuning so that only the most important, actionable events are brought to the attention of staff members in the organization. However, it’s worth the effort: a SIEM combs through millions, or even billions, of events daily and presents only the few most important, actionable events so that security teams can respond appropriately.
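
That tuning usually takes the form of correlation rules. The following simplified Python sketch (a stand-in for a real SIEM rule engine, using hypothetical event data) raises an actionable alert only when several failed logins from the same source occur within a short window:

```python
from collections import defaultdict
from datetime import datetime, timedelta

THRESHOLD = 5                   # failed logins before an alert is raised
WINDOW = timedelta(minutes=10)  # correlation window

def correlate_failed_logins(events):
    """events: iterable of (timestamp, source_ip, outcome) tuples from many log sources."""
    failures = defaultdict(list)
    alerts = []
    for timestamp, source_ip, outcome in sorted(events):
        if outcome != "failure":
            continue
        # Keep only failures from this source that fall within the window
        failures[source_ip] = [t for t in failures[source_ip]
                               if timestamp - t <= WINDOW] + [timestamp]
        if len(failures[source_ip]) >= THRESHOLD:
            alerts.append(f"possible brute-force attack from {source_ip} at {timestamp}")
    return alerts

# Hypothetical events collected from firewalls, servers, and workstations
start = datetime(2024, 1, 1, 12, 0)
events = [(start + timedelta(minutes=i), "203.0.113.9", "failure") for i in range(6)]
print(correlate_failed_logins(events))
```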

Many SIEM platforms also have the ability to accept threat intelligence feeds from various vendors including the SIEM manufacturers. This permits the SIEM to automatically adjust its detection and blocking capabilities for the most up-to-date threats.

Continuous monitoring

Continuous monitoring technology collects and reports security data in near real time. Continuous monitoring components may include

  • Discovery: Ongoing inventory of network and information assets, including hardware, software, and sensitive data.
  • Assessment: Automatic scanning and baselining of information assets to identify and prioritize vulnerabilities.
  • Threat intelligence: Feeds from one or more outside organizations that produce high-quality, actionable data.
  • Audit: Nearly real-time evaluation of device configurations and compliance with established policies and regulatory requirements.
  • Patching: Automatic security patch installation and software updating.
  • Reporting: Aggregating, analyzing, and correlating log information and alerts.
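
The audit component, for example, is often implemented by comparing live device configurations against an approved baseline. Here is a minimal Python sketch, assuming hypothetical configuration settings rather than any specific monitoring product:

```python
# Approved baseline settings for a class of devices (hypothetical values)
BASELINE = {
    "ssh_protocol": "2",
    "telnet_enabled": False,
    "ntp_server": "10.0.0.5",
}

def audit_device(device_name: str, running_config: dict) -> list:
    """Report settings that deviate from the approved baseline."""
    findings = []
    for setting, expected in BASELINE.items():
        actual = running_config.get(setting)
        if actual != expected:
            findings.append(f"{device_name}: {setting} is {actual!r}, expected {expected!r}")
    return findings

print(audit_device("edge-router-01", {"ssh_protocol": "2", "telnet_enabled": True}))
```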

Egress monitoring

Egress monitoring (or extrusion detection) is the process of monitoring outbound traffic to discover potential data leakage (or loss). Modern cyberattacks employ various stealth techniques to avoid detection as long as possible for the purpose of data theft. These techniques may include the use of encryption (such as SSL/TLS) and steganography (discussed in Chapter 4).

Data loss prevention (DLP) systems are often used to detect the exfiltration of sensitive data, such as personally identifiable information (PII) or protected health information (PHI), in e-mail messages, data uploads, PNG or JPEG images, and other forms of communication. These technologies often perform deep packet inspection (DPI) to decrypt and inspect outbound traffic that is TLS encrypted.
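
Pattern matching for structured data such as U.S. Social Security numbers is frequently regular-expression based. The following Python sketch uses deliberately simplified, illustrative patterns; production DLP systems add context analysis, checksum validation, and many other techniques:

```python
import re

# Deliberately simplified patterns; real DLP products also validate context,
# checksums (such as the Luhn check for card numbers), and nearby keywords.
PATTERNS = {
    "US SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Payment card number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_outbound(message: str) -> list:
    """Return the names of sensitive-data patterns found in an outbound message."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(message)]

print(scan_outbound("Patient SSN is 123-45-6789, please process today."))
# ['US SSN']
```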

DLP systems can also be used to disable removable media drive interfaces on servers and workstations, and also to encrypt data written onto removable media.

Static DLP tools are used to discover sensitive and proprietary data in databases, file servers, and other data storage systems.

Securely Provisioning Resources

An organization’s information architecture is dynamic and constantly changing. As a result, its security posture is also dynamic and constantly changing. Provisioning (and decommissioning) of various information resources can have significant impacts (both direct and indirect) on the organization’s security posture. For example, an application may either directly introduce new vulnerabilities into an environment or integrate with a database in a way that compromises the integrity of the database. For these reasons, security planning and analysis must be an integral part of every organization’s resource provisioning processes, as well as throughout the lifecycle of all resources. Important security considerations include

  • Asset inventory. Maintaining a complete and accurate inventory is critical to ensure that all potential vulnerabilities and risks in an environment can be identified, assessed, and addressed. Indeed, so many other critical security processes are dependent upon sound asset inventory that asset inventory is one of the most important (but mundane) activities in IT organizations.

    tip Asset inventory is the first control found in the well-known CIS 20 Controls. Asset inventory appears first because it is fundamental to most other security controls.

  • Change management. Change management processes are used to strictly control changes to systems in production environments, so that only duly requested and approved changes are made.
  • Configuration management. Configuration management processes need to be implemented and strictly enforced to ensure that information resources are operated in a safe and secure manner. Organizations typically implement an automated configuration management database (CMDB) as part of a configuration management system, used for managing asset inventory data and for maintaining the configuration history of systems (a brief CMDB sketch follows this list).
  • Physical assets. Physical assets must be protected against loss, damage, or theft. Valuable or sensitive data stored on a physical asset may far exceed the value of the asset itself.
  • Virtual assets. Virtual machine sprawl has increasingly become an issue for organizations with the popularity of virtualization technology and software-defined networks (SDN). Virtual machines (VMs) can be (and often are) provisioned in a matter of minutes, but aren’t always properly decommissioned when they are no longer needed. Dormant VMs often aren’t backed up and can go unpatched for many months. This exposes the organization to increased risk from unpatched security vulnerabilities.

    Of particular concern to security professionals is the implementation of VMs without proper review and approvals. This was not a problem before virtualization, as organizations had other checks and balances in place to prevent the implementation of unauthorized systems (namely, the purchasing process). But VMs can be implemented unilaterally, often without the knowledge or involvement of other personnel within the organization.

  • Cloud assets. As more organizations adopt cloud strategies that include software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS) solutions, it’s important to keep track of these assets. Ultimately, an organization is responsible for the security and privacy of its applications and data — not the cloud service provider. Issues of data residency and trans-border data flow need to be considered.

    A new class of security tools known as cloud access security brokers (CASB) can detect access to, and usage of, cloud-based services. These tools give the organization more visibility into its sanctioned and unsanctioned use of cloud services. Many CASB systems, in cooperation with cloud services, can be used to control the use of cloud services.

  • Applications. These include both commercial and custom applications, private clouds, web services, software as a service (SaaS), and the interfaces and integrations between application components. Securing the provisioning of these assets requires strict access controls; only designated administrators should be able to deploy and configure them.
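
As noted in the configuration management item above, a CMDB is, at its simplest, a store of configuration items and their change history. The following Python sketch (hypothetical fields, not modeled on any particular CMDB product) shows how an asset record and its configuration history might be tracked:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ConfigurationItem:
    """A single asset tracked in a (greatly simplified) CMDB."""
    asset_id: str
    asset_type: str     # e.g., "server", "virtual machine", "SaaS subscription"
    owner: str
    attributes: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def update(self, changes: dict, changed_by: str) -> None:
        """Apply a change and keep the previous state in the configuration history."""
        self.history.append((datetime.now(timezone.utc), changed_by, dict(self.attributes)))
        self.attributes.update(changes)

ci = ConfigurationItem("SRV-0042", "server", "IT Operations",
                       attributes={"os": "Ubuntu 22.04", "patch_level": "2024-01"})
ci.update({"patch_level": "2024-03"}, changed_by="change request CHG-1187")
print(ci.attributes, len(ci.history))
```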

Understand and Apply Foundational Security Operations Concepts

Fundamental security operations concepts that need to be well understood and managed include the principles of need-to-know and least privilege, separation of duties and responsibilities, monitoring of special privileges, job rotation, information lifecycle management, and service-level agreements.

Need-to-know and least privilege

The concept of need-to-know states that only people with a valid business justification should have access to specific information or functions. In addition to having a need-to-know, an individual must have an appropriate security clearance level in order for access to be granted. Conversely, an individual with the appropriate security clearance level, but without a need-to-know, should not be granted access.

One of the most difficult challenges in managing need-to-know is implementing controls that actually enforce it. Also, information owners need to be able to distinguish “I need to know” from “I want to know,” “I want to feel important,” and “I’m just curious.”

Need-to-know is closely related to the concept of least privilege and can help organizations implement least privilege in a practical manner.

The principle of least privilege states that persons should have the capability to perform only the tasks (or have access to only the data) that are required to perform their primary jobs, and no more.

To give an individual more privileges and access than required invites trouble. Offering the capability to perform more than the job requires may become a temptation that results, sooner or later, in an abuse of privilege.

For example, giving a user full permissions on a network share, rather than just read and modify rights to a specific directory, opens the door not only for abuse of those privileges (for example, reading or copying other sensitive information on the network share) but also for costly mistakes (accidentally deleting a file — or the entire directory!). As a starting point, organizations should approach permissions with a “deny all” mentality, then add needed permissions as required.
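
A “deny all” starting point amounts to an allow-list check: access is granted only when a permission has been explicitly assigned, and everything else is refused. The following minimal Python sketch (with hypothetical users and resources) illustrates the idea:

```python
# Explicitly granted permissions; anything not listed here is denied by default.
GRANTS = {
    ("alice", "payroll-share"): {"read"},
    ("bob", "payroll-share"): {"read", "modify"},
}

def is_allowed(user: str, resource: str, action: str) -> bool:
    """Default deny: only explicitly granted (user, resource, action) combinations pass."""
    return action in GRANTS.get((user, resource), set())

print(is_allowed("alice", "payroll-share", "read"))    # True
print(is_allowed("alice", "payroll-share", "delete"))  # False -- never granted
print(is_allowed("carol", "payroll-share", "read"))    # False -- no entry at all
```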

tip Least privilege is also closely related to separation of duties and responsibilities, described in the following section. Distributing the duties and responsibilities for a given job function among several people means that those individuals require fewer privileges on a system or resource.

remember The principle of least privilege states that people should have the fewest privileges necessary to allow them to perform their tasks.

Several important concepts associated with need to know and least privilege include

  • Entitlement. When a new user account is provisioned in an organization, the permissions granted to that account must be appropriate for the level of access required by the user. In too many organizations, human resources simply instructs the IT department to give a new user “whatever so-and-so (another user in the same department) has access to”. Instead, entitlement needs to be based on the principle of least privilege.
  • Aggregation. When people transfer between jobs and/or departments within an organization (see the section on job rotations later in this chapter), they often need different access and privileges to do their new jobs. Far too often, organizational security processes do not adequately ensure that access rights that are no longer required by an individual are actually revoked. Instead, individuals accumulate privileges, and over a period of many years an employee can have far more access and privileges than they actually need. This is known as aggregation, and it’s the antithesis of least privilege! (A brief access review sketch follows this list.)

    Privilege creep is another term commonly used here.

  • Transitive trust. Trust relationships (in the context of security domains) are often established within, and between, organizations to facilitate ease of access and collaboration. A trust relationship enables subjects (such as users or processes) in one security domain to access objects (such as servers or applications) in another security domain (see Chapter 5 and Chapter 7 to learn more about objects and subjects). A transitive trust extends access privileges to the subdomains of a security domain (analogous to inheriting permissions to subdirectories within a parent directory structure). Instead, a nontransitive trust should be implemented by requiring access to each subdomain to be explicitly granted based on the principle of least privilege, rather than inherited.
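
Periodic access reviews are the usual countermeasure to aggregation. The following Python sketch (hypothetical role baselines and entitlements) compares a user’s accumulated entitlements against what their current role actually requires and flags the excess:

```python
# Hypothetical role baseline and a user's accumulated entitlements
ROLE_BASELINE = {
    "accounts-payable-clerk": {"ap-system:read", "ap-system:enter-invoice"},
}

def find_excess_privileges(current_role: str, entitlements: set) -> set:
    """Return entitlements that exceed what the user's current role requires."""
    return entitlements - ROLE_BASELINE.get(current_role, set())

# A user who transferred from IT to accounts payable but kept old access
user_entitlements = {
    "ap-system:read",
    "ap-system:enter-invoice",
    "fileserver:admin",    # left over from a previous role: privilege creep
    "vpn:full-network",
}
print(find_excess_privileges("accounts-payable-clerk", user_entitlements))
# {'fileserver:admin', 'vpn:full-network'} (set ordering may vary)
```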

Separation of duties and responsibilities

The concept of separation (or segregation) of duties and responsibilities ensures that no single individual has complete authority and control of a critical system or process. This practice promotes security in the following ways:

  • Reduces opportunities for fraud or abuse: In order for fraud or abuse to occur, two or more individuals must collude or be complicit in the performance of their duties.
  • Reduces mistakes: Because two or more individuals perform the process, mistakes are less likely to occur or mistakes are more quickly detected and corrected.
  • Reduces dependence on individuals: Critical processes are accomplished by groups of individuals or teams. Multiple individuals should be trained on different parts of the process (for example, through job rotation, discussed in the following section) to help ensure that the absence of an individual doesn’t unnecessarily delay or impede successful completion of a step in the process.

Here are some common examples of separation of duties and responsibilities within organizations:

  • A bank assigns the first three numbers of a six-number safe combination to one employee and the second three numbers to another employee. A single employee isn’t permitted to have all six numbers, so a lone employee is unable to gain access to the safe and steal its contents.
  • An accounting department might separate record entry and internal auditing functions, or accounts payable and check disbursing functions.
  • A system administrator is responsible for setting up new accounts and assigning permissions, which a security administrator then verifies.
  • A programmer develops software code, but a separate individual is responsible for testing and validation, and yet another individual is responsible for loading the code on production systems.
  • Destruction of classified materials may require two individuals to complete or witness the destruction.
  • Disposal of assets may require an approval signature by the office manager and verification by building security.

In smaller organizations, separation of duties and responsibilities can be difficult to implement because of limited personnel and resources.

Privileged account management

Privileged entity controls are the mechanisms, generally built into computer operating systems and network devices, that give privileged access to hardware, software, and data. In UNIX and Windows, the controls that permit privileged functions reside in the operating system. Operating systems for servers, desktop computers, and many other devices use the concept of modes of execution to define privilege levels for various user accounts, applications, and processes that run on a system. For instance, the UNIX root account and Windows Server Enterprise, Domain, and Local Administrator account roles have elevated rights that allow those accounts to install software, view the entire file system and, in some cases, directly access the OS kernel and memory.

Specialized tools are used to monitor and record activities performed by privileged and administrative users. This helps to ensure accountability on the part of each administrator and aids in troubleshooting, through the ability to view actions performed by administrators.

System or network administrators typically use privileged accounts to perform operating system and utility management functions. Supervisor or Administrator mode should be used only for system administration purposes. Unfortunately, many organizations allow system and network administrators to use these privileged accounts or roles as their normal user accounts even when they aren't doing work which requires this level of access. Yet another horrible security practice is to allow administrators to share a single “administrator” or “root” account.

warning System or network administrators occasionally grant root or administrator privileges to normal applications as a matter of convenience, rather than spending the time to figure out exactly what privileges the application actually requires, and then creating an account role for the application with only those privileges. Allowing a normal application these privileges is a serious mistake because applications that run in privileged mode bypass some or all security controls, which could lead to unexpected application behavior. For instance, any user of a payroll application could view or change anyone's data because the application running in privileged mode was never told no by the operating system. Further, if an application running in privileged mode is compromised by an attacker, the attacker may then inherit privileged access for the entire system.

tip Hackers specifically target Supervisor and other privileged modes, because those modes have a great deal of power over systems. The use of Supervisor mode should be limited wherever possible, especially on end-user workstations.

Job rotation

Job rotation (or rotation of duties) is another effective security control that gives many benefits to an organization. Similar to the concept of separation of duties and responsibilities, job rotations involve regularly (or randomly) transferring key personnel into different positions or departments within an organization, with or without notice. Job rotations accomplish several important organizational objectives:

  • Reduce opportunities for fraud or abuse. Regular job rotations can accomplish this objective in the following two ways:
    • People hesitate to set up the means for periodically or routinely stealing corporate information because they know that they could be moved to another shift or task at almost any time.
    • People don’t work with each other long enough to form collusive relationships that could damage the company.
  • Eliminate single points of failure. By ensuring that numerous people within an organization or department know how to perform several different job functions, an organization can reduce dependence on individuals and thereby eliminate single points of failure when an individual is absent, incapacitated, no longer employed with the organization, or otherwise unavailable to perform a critical job function.
  • Promote professional growth. Through cross-training opportunities, job rotations can help an individual’s professional growth and career development, and reduce monotony and/or fatigue.

Job rotations can also include changing workers’ workstations and work locations, which can keep would-be saboteurs off balance and make them less likely to commit sabotage.

As with the practice of separation of duties, small organizations can have difficulty implementing job rotations.

Information lifecycle

The information lifecycle refers to the activities related to the introduction, use, and disposal of information in an organization. The phases in the information lifecycle typically are

  • Plan. Development of formal plans on how to create and use information.
  • Creation. Information is created, collected, received, or captured in some way.
  • Store. Information is stored in an information system.
  • Use. Information is used, maintained, and perhaps disseminated.
  • Protection. Information is protected according to its criticality and sensitivity.
  • Disposal. Information at the end of its service life is discarded. Sensitive information will be erased using techniques to prevent its recovery.

tip The European Union’s General Data Protection Regulation (GDPR) and other privacy regulations bring to light the steps in the information lifecycle, by giving data subjects legal rights regarding use of information about them.

Service-level agreements

Users of business- or mission-critical information systems need to know whether their systems or services will function when they need them, and users need to know more than “Is it up?” or “Is it down again?” Their customers, and others, hold users accountable for getting their work done in a timely and accurate manner, so consequently, those users need to know whether they can depend on their systems and services to help them deliver as promised.

The service-level agreement (SLA) is a quasi-legal document (it’s a real legal document when it is included in or referenced by a contract) that pledges that the system or service will perform to a set of minimum standards, such as

  • Hours of availability: The wall-clock hours that the system or service will be available for users. This could be 24 x 7 (24 hours per day, 7 days per week) or something more limited, such as daily from 4:00 a.m. to 12:00 p.m. Availability specifications may also cite maintenance windows (for instance, Sundays from 2:00 a.m. to 4:00 a.m.) when users can expect the system or service to be down for testing, upgrades, and maintenance.
  • Average and peak number of concurrent users: The maximum number of users who can use the system or service at the same time.
  • Transaction throughput: The number of transactions that the system or service can perform or support in a given time period. Usually, throughput is expressed as transactions per second, per minute, or per hour.
  • Transaction accuracy: The accuracy of transactions that the system or service performs. Generally, this is related to complex calculations (such as calculating sales tax) and accuracy of location data.
  • Data storage capacity: The amount of data that the users can store in the system or service (such as cloud storage). Capacity may be expressed in raw terms (megabytes or gigabytes) or in numbers of transactions.
  • Response times: The maximum periods of time (in seconds) that key transactions take. Response times for long processes (such as nightly runs, batch jobs, and so on) also should be covered in the SLA.
  • Service desk response and resolution times: The amount of time (usually in hours) that a service desk (or help desk) will take to respond to requests for support and resolve any issues.
  • Mean Time Between Failures (MTBF): The amount of time, typically measured in (thousands of) hours, that a component (such as a server hard drive) or system is expected to continuously operate before experiencing a failure.
  • Mean Time to Restore Service (MTRS): The amount of time, typically measured in minutes or hours, that it is expected to take in order to restore a system or service to normal operation after a failure has occurred.
  • Security incident response times: The amount of time (usually in hours or days) between the realization of a security incident and any required notifications to data owners and other affected parties.
  • Escalation process during times of failure: When things go wrong, how quickly the service provider will contact the customer, as well as what steps the provider will take to restore service.

remember Availability is one of the three tenets of information security (Confidentiality, Integrity, and Availability, discussed in Chapter 3). Therefore, SLAs are important security documents.

Because the SLA is a quantified statement, the service provider and the user alike can take measurements to see how well the service provider is meeting the SLA’s standards. This measurement, which is sometimes accompanied by analysis, is frequently called a scorecard.
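
Two of the metrics listed earlier, MTBF and MTRS, combine into the availability figure that frequently appears on such scorecards: availability = MTBF / (MTBF + MTRS). A small worked example in Python, using illustrative numbers only:

```python
# Illustrative figures: a component that averages 8,000 hours between failures
# and takes 4 hours to restore service when it does fail.
mtbf_hours = 8000.0
mtrs_hours = 4.0

availability = mtbf_hours / (mtbf_hours + mtrs_hours)
print(f"Expected availability: {availability:.4%}")   # about 99.95%

# Translate that into expected downtime over a year of 24 x 7 operation
hours_per_year = 24 * 365
print(f"Expected downtime: about {(1 - availability) * hours_per_year:.1f} hours per year")
```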

tip Operational-level agreements (OLAs) and underpinning contracts (UCs) are important SLA supporting documents. An OLA is essentially an SLA between the different interdependent groups that are responsible for the terms of the SLA, for example, a Service Desk and the Desktop Support team. UCs are used to manage third-party relationships with entities that help support the SLA, such as an external service provider or vendor.

Finally, for an SLA to be meaningful, it needs to have teeth! How will the SLA be enforced, and what will happen when violations occur? What are the escalation procedures? Will any penalties or service credits be paid in the event of a violation? If so, how will penalties or credits be calculated?

tip Internal SLAs (and OLAs), such as those between an IT department and their users, typically don’t provide penalties or service credits for service violations. Internal SLAs are structured more as a commitment between IT and the user community, and are useful for managing service expectations. Clearly defined escalation procedures (who gets notified of a problem, when they’re notified, and how and when it goes up the chain of command) are critical in an internal SLA.

tip SLAs rarely, if ever, provide meaningful financial penalties for service violations. For example, an hour of Internet downtime might legitimately cost an e-commerce company $10,000 of business. But most service providers will typically only provide a credit equivalent to the amount paid for the lost hour of Internet service (a few hundred dollars). This may seem incredibly disproportionate, but consider it from the service provider’s perspective. That same credit has to be given to all of their customers that experienced the outage. Thus, an outage could potentially cost the service provider hundreds of thousands of dollars. If service providers were legally obligated to reimburse every customer for their actual losses, it’s fair to guess that no one would be in the business of providing Internet service (or it would cost a few thousand dollars a month for a T-1 circuit). Instead, look for such penalties as an early termination clause that lets you get out of a long-term contract if your service provider repeatedly fails to meet its service level obligations.

Apply Resource Protection Techniques

Resource protection is the broad category of controls that protect information assets and information infrastructure. Resources that require protection include

  • Communications hardware and software: Routers, switches, firewalls, load balancers, intrusion prevention systems, fax machines, Virtual Private Network (VPN) servers, and so on, as well as the software that these devices use.
  • Computers and their storage systems: All corporate servers and client workstations, storage area networks (SANs), network-attached storage (NAS), direct-attached storage (DAS), near-line and offline storage systems, cloud-based storage, and backup devices.
  • Business data: All stored information, such as financial data, sales and marketing information, personnel and payroll data, customer and supplier data, proprietary product or process data, and intellectual property.
  • System data: Operating systems, utilities, user IDs and password files, audit trails, and configuration files.
  • Backup media: Tapes, tape cartridges, removable disks, and off-site replicated disk systems.
  • Software: Application source code, programs, tools, libraries, vendor software, and other proprietary software.

Media management

Media management refers to a broad category of controls that are used to manage information classification and physical media. Data classification refers to the tasks of marking information according to its sensitivity, as well as the subsequent handling, storage, transmission, and disposal procedures that accompany each classification level. Physical media is similarly marked; likewise, controls specify handling, storage, and disposal procedures. See Chapter 4 to learn more about data classification.

Sensitive information such as financial records, employee data, and information about customers must be clearly marked, properly handled and stored, and appropriately destroyed in accordance with established organizational policies, standards, and procedures:

  • Marking: How an organization identifies sensitive information, whether electronic or hard copy. For example, a marking might read PRIVILEGED AND CONFIDENTIAL. See Chapter 4 for a more detailed discussion of data classification.
  • Handling: The organization should have established procedures for handling sensitive information. These procedures detail how employees can transport, transmit, and use such information, as well as any applicable restrictions.
  • Protection: This involves two components:
    • The physical protection of the actual media, such as locked cabinets and secured vehicles.
    • The logical protection of information on media, such as encryption.
  • Storage and Backup: Similar to handling, the organization must have procedures and requirements specifying how sensitive information must be stored and backed up.
  • Retention: Most organizations are bound by various laws and regulations to collect and store certain information, as well as to keep it for minimum and/or maximum specified periods of time. An organization must be aware of these legal requirements and ensure that it’s in compliance with all applicable regulations. Records retention policies should cover any electronic records that may be located on file servers, document management systems, databases, e-mail systems, archives, and records management systems, as well as paper copies and backup media stored at off-site facilities. Organizations that want to retain information longer than required by law should firmly establish why such information should be kept longer. Nowadays, just having information can be a liability, so longer retention should be the exception rather than the norm. (A minimal retention-check sketch follows this list.)
  • Destruction: Sooner or later, an organization must destroy sensitive information. The organization must have procedures detailing how to destroy sensitive information that has been previously retained, regardless of whether the data is in hard copy or saved as an electronic file.
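
For retention in particular, deciding whether a record is eligible for destruction often comes down to a simple date comparison against the retention schedule. Here is a minimal sketch; the classifications and retention periods shown are hypothetical, not prescribed values:

    from datetime import date, timedelta

    # Hypothetical retention schedule (in days) for a few record classifications.
    RETENTION_DAYS = {
        "payroll": 7 * 365,       # keep for at least seven years
        "email": 2 * 365,         # keep for no more than two years
        "system-logs": 365,
    }

    def past_retention(classification, created, today=None):
        """Return True when a record is older than its scheduled retention period."""
        today = today or date.today()
        return (today - created) > timedelta(days=RETENTION_DAYS[classification])

    # A payroll record created in 2015 is now eligible for review and destruction
    # (unless it's subject to a legal hold; see the warning below).
    print(past_retention("payroll", date(2015, 1, 1)))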

warning At the opposite end of the records retention spectrum, many organizations now destroy records (including backup media) as soon as legally permissible in order to limit the scope (and cost) of any future discovery requests or litigation. Before implementing any such draconian retention policies that severely restrict your organization’s retention periods, you should fully understand the negative implications such a policy has for your disaster recovery capabilities. Also, consult with your organization’s legal counsel to ensure that you’re in full compliance with all applicable laws and regulations. Although extremely short retention periods may be a prudent way to limit future discovery requests or litigation, destroying records that are the subject of pending discovery requests or litigation (or records that you reasonably expect to become the subject of future litigation) is illegal. In such cases, don’t destroy pertinent records — otherwise you go to jail. You go directly to jail! You don’t pass Go, you don’t collect $200, and (oh, yeah) you don’t pass the CISSP exam, either — or even remain eligible for CISSP certification!

Hardware and software asset management

Maintaining a complete and accurate inventory with configuration information about all of an organization’s hardware and software information assets is an important security operations function.

Without this information, managing vulnerabilities becomes a truly daunting challenge. With popular trends such as “bring your own device” becoming more commonplace in many organizations, it is critical that organizations work with their information security leaders and end users to ensure that all devices and applications that are used are known to — and appropriately managed by — the organization. This allows any inherent risks to be known — and addressed.
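
As a minimal illustration of what one entry in such an inventory might capture (the fields shown are a common starting point, not a prescribed schema):

    from dataclasses import dataclass, field

    @dataclass
    class AssetRecord:
        """One entry in a hardware/software asset inventory (illustrative fields only)."""
        hostname: str
        owner: str                          # accountable business or IT owner
        asset_type: str                     # for example "server", "laptop", "network device"
        operating_system: str
        installed_software: list = field(default_factory=list)
        last_verified: str = ""             # date the record was last reconciled

    inventory = [
        AssetRecord("web01", "e-commerce team", "server", "Ubuntu 22.04",
                    ["nginx 1.24", "openssl 3.0"], last_verified="2024-06-01"),
    ]

    # Any device that shows up on the network but not in the inventory is, by
    # definition, unmanaged, and its risks are unknown.
    known_hosts = {asset.hostname for asset in inventory}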

Conduct Incident Management

The formal process of detecting, responding to, and fixing a security problem is known as incident management (also known as security incident management).

warning Do not confuse the concept of incident management, described herein, with the more general concept of incident management as defined by the Information Technology Infrastructure Library’s (ITIL) Service Management best practices.

Incident management includes the following steps:

  1. Preparation. Incident management begins before an incident actually occurs. Preparation is the key to quick and successful incident management. A well-documented and regularly practiced incident management (or incident response) plan ensures effective preparation. The plan should include:
    • Response procedures: Include detailed procedures that address different contingencies and situations.
    • Response authority: Clearly define roles, responsibilities, and levels of authority for all members of the Computer Incident Response Team (CIRT).
    • Available resources: Identify people, tools, and external resources (consultants and law enforcement agents) that are available to the CIRT. Training should include use of these resources, when possible.
    • Legal review: The incident response plan should be evaluated by appropriate legal counsel to determine compliance with applicable laws and to confirm that the plan’s provisions are enforceable and defensible.
  2. Detection. Detecting that a security incident or event has occurred is the first and, often, most difficult step in incident management. Detection may occur through automated monitoring and alerting systems, or as the result of a reported security incident (such as a lost or stolen mobile device). Under the best of circumstances, detection may occur in real-time as soon as a security incident occurs, such as malware that is discovered by anti-malware software on a computer. More often, a security incident may not be detected for quite some time (months or years), such as in the case of a sophisticated “low and slow” cyberattack. Determining whether a security incident has occurred is similar to the detection and containment step in the investigative process (discussed earlier in this chapter) and includes defining what constitutes a security incident for your organization.
  3. Response. Upon determination that an incident has occurred, it’s important to immediately begin detailed documentation of every action taken throughout the incident management process. You should also identify the appropriate alert level. (Ask questions such as “Is this an isolated incident or a system-wide event?” and “Has personal or sensitive data been compromised?” and “What laws may have been violated?”) The answers will help you determine who to notify and whether to activate the entire incident response team or only certain members. Next, notify the appropriate people about the incident, both incident response team members and management. All contact information should be documented before an incident, and all notifications and contacts during an incident should be documented in the incident log (a minimal incident-record sketch follows this list).
  4. Mitigation. The purpose of this step is to contain the incident and minimize further loss or damage. For example, you may need to eradicate a virus, deny access, or disable services in order to halt the incident in progress.
  5. Reporting. This step requires assessing the incident and reporting the results to appropriate management personnel and authorities (if applicable). The assessment includes determining the scope and cause of damage, as well as the responsible (or liable) party.
  6. Recovery. Recovering normal operations involves eradicating any components of the incident (for example, removing malware from a system or disabling e-mail service on a stolen mobile device). Think of recovery as returning a system to its pre-incident state.
  7. Remediation. Remediation may include rebuilding systems, repairing vulnerabilities, improving safeguards, and restoring data and services. Do this step in accordance with a business continuity plan (BCP) that properly identifies recovery priorities.
  8. Lessons learned. The final phase of incident management requires evaluating the effectiveness of your incident management plan and identifying any lessons learned — which should include not only what went wrong, but also what went right.
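
To make the documentation and alert-level questions in the Response step (step 3) concrete, here is a minimal sketch of an incident record. The severity levels and fields are illustrative assumptions, not part of any standard:

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class IncidentRecord:
        """Minimal incident log entry, started the moment an incident is declared."""
        summary: str
        detected_at: datetime
        systemwide: bool                # isolated incident or system-wide event?
        data_compromised: bool          # has personal or sensitive data been exposed?
        actions: list = field(default_factory=list)   # timestamped log of every action taken

        @property
        def alert_level(self):
            if self.data_compromised:
                return "high"           # activate the full CIRT and notify management
            return "medium" if self.systemwide else "low"

        def log(self, action):
            self.actions.append((datetime.now(timezone.utc), action))

    incident = IncidentRecord("Malware detected on finance workstation",
                              detected_at=datetime.now(timezone.utc),
                              systemwide=False, data_compromised=False)
    incident.log("Notified CIRT on-call lead")
    print(incident.alert_level)         # "low"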

remember Investigations and incident management follow similar steps but have different purposes: The distinguishing characteristic of an investigation is the gathering of evidence for possible prosecution, whereas incident management focuses on containing the damage and returning to normal operations.

Operate and Maintain Detective and Preventive Measures

Detective and preventive security measures encompass a variety of security technologies and techniques, including:

  • Firewalls. Firewalls are typically deployed at the network or data center perimeter and at other network boundaries, such as between zones of trust. Increasingly, host-based firewalls are being deployed to protect endpoints and virtual servers throughout the data center. Firewalls are discussed in more detail in Chapter 6.
  • Intrusion detection and prevention systems (IDS/IPS). Intrusion detection systems passively monitor traffic in a network segment or to and from a host and provide alerts of suspicious activity. An intrusion prevention system (IPS) can detect and either block an attack or drop the network packets from the attack source. IDS and IPS are discussed earlier in this chapter and in Chapter 6.
  • Whitelisting and blacklisting. Whitelisting involves explicitly allowing some action, such as email delivery from a known sender, traffic from a specific IP address range, or execution of a trusted application. Blacklisting explicitly blocks specific actions. (A minimal sketch of both checks follows this list.)
  • Third-party security services. Third-party security services cover a wide spectrum of possible security services, such as
    • Managed security services (MSS), which typically involve a service provider that monitors an organization’s IT environment for malfunctions and incidents. Service providers can also perform management of infrastructure devices, such as network devices and servers.
    • Vulnerability management services, where a service provider periodically scans internal and external networks, then reports vulnerabilities back to the customer organization for remediation.
    • Security information and event management (SIEM, discussed earlier in this chapter).
    • IP reputation services, usually in the form of a threat intelligence feed to an organization’s IDSs, IPSs, and firewalls.
    • Web filtering, where an on-premises appliance or a cloud-based service limits or blocks end-user access to banned categories of websites (think gambling or pornography), as well as websites known to contain malicious software.
    • Cloud-based malware detection, offered as a service that provides real-time scanning of files in the cloud and leverages the speed and scale of the cloud to detect and prevent zero-day threats more quickly than traditional on-premises antimalware solutions.
    • Cloud-based spam filtering, offered as a service that blocks or quarantines spam and phishing emails before they reach the corporate network, thereby significantly reducing the volume of email traffic and the performance overhead associated with transmitting and processing unwanted and potentially malicious email.
    • DDoS mitigation, typically deployed in an upstream network to drop or reroute DDoS traffic before it impacts the customer’s network, systems, and applications.
  • Sandboxing. A sandbox enables untrusted or unknown programs to be executed in a separate, isolated operating environment, so any security threats or vulnerabilities can be safely analyzed. Sandboxing is used in many types of systems today, including anti-malware, web filtering, and intrusion prevention systems (IPS).
  • Honeypots and honeynets. A honeypot is a decoy system that is used to attract attackers, so their methods and techniques can be observed (somewhat like a trojan horse for the good guys!). A honeynet is a network of honeypots.
  • Anti-malware. Anti-malware (also known as antivirus) software intercepts operating system routines that store and open files. The anti-malware software compares the contents of the file being opened or stored against a database of malware signatures. If a malware signature is matched, the anti-malware software prevents the file from being opened or saved and (usually) alerts the user. Enterprise anti-malware software typically sends an alert to a central management console so that the organization’s security team is alerted and can take the appropriate action. Advanced anti-malware tools use various advanced techniques such as machine learning to detect and block malware from executing on a system.
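
Here is the minimal whitelisting/blacklisting sketch promised above. The addresses, sender, and application names are made up for illustration:

    import ipaddress

    # Hypothetical policy: explicitly allow one source range, explicitly block one
    # sender, and only permit trusted applications to execute.
    ALLOWED_SOURCE_NET = ipaddress.ip_network("203.0.113.0/24")
    BLOCKED_SENDERS = {"spam@example.com"}
    ALLOWED_APPS = {"winword.exe", "excel.exe"}

    def connection_permitted(source_ip):
        """Whitelisting: only traffic from the explicitly allowed range gets through."""
        return ipaddress.ip_address(source_ip) in ALLOWED_SOURCE_NET

    def mail_permitted(sender):
        """Blacklisting: everything is allowed except explicitly blocked senders."""
        return sender.lower() not in BLOCKED_SENDERS

    def app_permitted(executable):
        """Application whitelisting: only trusted executables may run."""
        return executable.lower() in ALLOWED_APPS

    print(connection_permitted("203.0.113.25"))   # True  (inside the allowed range)
    print(mail_permitted("spam@example.com"))     # False (explicitly blocked)
    print(app_permitted("solitaire.exe"))         # False (not on the whitelist)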

Implement and Support Patch and Vulnerability Management

Software bugs and flaws inevitably exist in operating systems, database management systems, and various applications, and they are continually discovered by researchers. Many of these bugs and flaws are security vulnerabilities that could permit an attacker to control a target system and subsequently access sensitive data or critical functions. Patch and vulnerability management is the process of regularly assessing, testing, installing, and verifying fixes and patches for software bugs and flaws as they are discovered.

To perform patch and vulnerability management, follow these basic steps:

  1. Subscribe to security advisories from vendors and third-party organizations.
  2. Perform periodic security scans of internal and external infrastructure to identify systems and applications with insecure configurations and missing patches.
  3. Perform risk analysis on each advisory and missing patch to determine its applicability and risk to your organization (a minimal risk-ranking sketch follows these steps).
  4. Develop a plan to either install the security patch or apply a workaround, if one is available.

    You should base your decision on which solution best eliminates the vulnerability or reduces risk to an acceptable level.

  5. Test the security patch or workaround in a test environment.

    This process involves making sure that stated functions still work properly and that no unexpected side-effects arise as a result of installing the patch or workaround.

  6. Install the security patch in the production environment.
  7. Verify that the patch is properly installed and that systems still perform properly.
  8. Update all relevant documentation to include any changes made or patches installed.
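
Here is the minimal risk-ranking sketch referenced in step 3: it matches advisories against what a scan found installed and lists the gaps, highest severity first. The products, versions, and scores are invented for illustration:

    # Hypothetical advisories: (product, fixed-in version, severity score)
    ADVISORIES = [
        ("openssl", "3.0.14", 7.5),
        ("nginx",   "1.25.4", 5.3),
    ]

    # What a periodic scan found installed on one host.
    INSTALLED = {"openssl": "3.0.10", "nginx": "1.25.4"}

    def version_tuple(version):
        return tuple(int(part) for part in version.split("."))

    # Flag any product running a version older than the fixed release.
    findings = []
    for product, fixed, score in ADVISORIES:
        installed = INSTALLED.get(product)
        if installed and version_tuple(installed) < version_tuple(fixed):
            findings.append((score, product, installed, fixed))

    for score, product, installed, fixed in sorted(findings, reverse=True):
        print(f"{product}: {installed} installed, patch to {fixed} (severity {score})")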

Understand and Participate in Change Management Processes

Change management is the business process used to control architectural and configuration changes in a production environment. Rather than permitting ad hoc changes to systems and the ways they relate to one another, change management imposes a formal process of request, design, review, approval, implementation, and recordkeeping.

Configuration Management is the closely related process of actively managing the configuration of every system, device, and application and then thoroughly documenting those configurations.

  • remember Change Management is the approval-based process that ensures that only approved changes are implemented.

  • Configuration Management is the control that records all of the soft configuration (settings and parameters in the operating system, database, and application) and software changes that are performed with approval from the Change Management process.
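
To make the distinction between the two processes concrete, here is a minimal sketch of the records each one keeps (the field names are illustrative only):

    from dataclasses import dataclass, field

    @dataclass
    class ChangeRequest:
        """Change Management: who asked for the change, who approved it, and when it went in."""
        summary: str
        requested_by: str
        reviewed_by: str = ""
        approved: bool = False
        implemented_on: str = ""        # date the approved change reached production

    @dataclass
    class ConfigurationItem:
        """Configuration Management: the settings that result from approved changes."""
        system: str
        settings: dict = field(default_factory=dict)
        change_history: list = field(default_factory=list)   # links back to ChangeRequest records

    change = ChangeRequest("Increase database connection pool to 200", requested_by="DBA team")
    change.reviewed_by, change.approved, change.implemented_on = "Change Advisory Board", True, "2024-07-01"

    db_config = ConfigurationItem("orders-db", settings={"max_connections": 200})
    db_config.change_history.append(change)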

Implement Recovery Strategies

Developing and implementing effective backup and recovery strategies is critical for ensuring the availability of systems and data. Other techniques and strategies are commonly implemented to ensure the availability of critical systems, even in the event of an outage or disaster.

Backup storage strategies

Backups are performed for a variety of reasons that center around a basic principle: sometimes things go wrong and we need to get our data back. In order to cover all reasonable scenarios, backup storage strategies often involve the following:

  • Secure offsite storage. Store backup media at a remote location, far enough away so that the remote location is not directly affected by the same events (weather, natural disasters, man-made disasters), but close enough so that backup media can be retrieved in a reasonable period of time.
  • Transport via secure courier. This can discourage or prevent theft of backup media while it is in transit to a remote location.
  • Backup media encryption. This helps to prevent any unauthorized third party from being able to recover data from backup media. (A minimal encryption sketch follows this list.)
  • Data replication. Sending data to an offsite or remote data center, or cloud-based storage provider, in near real-time.
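
Here is the minimal backup-encryption sketch referenced above. It assumes the third-party cryptography package is installed and that a backup file named backup.tar already exists; in practice, protecting and escrowing the key is the hard part:

    # pip install cryptography
    from cryptography.fernet import Fernet

    # Generate a key once and store it separately from the backup media.
    key = Fernet.generate_key()
    with open("backup.key", "wb") as key_file:
        key_file.write(key)

    fernet = Fernet(key)

    # Encrypt the backup before it leaves the building...
    with open("backup.tar", "rb") as backup:
        encrypted = fernet.encrypt(backup.read())
    with open("backup.tar.enc", "wb") as out:
        out.write(encrypted)

    # ...so anyone holding only the media (a courier, a thief) can't read the data.
    # Restoring requires both the encrypted media and the key.
    restored = fernet.decrypt(encrypted)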

Recovery site strategies

These include the following:

  • Hot site: A fully functional data center or other facility that is always up and ready, with near real-time replication of production systems and data.
  • Cold site: A data center or facility that may have some recovery equipment available, but not configured, and with no backup data onsite.
  • Warm site: Some hardware and connectivity are prepositioned and configured, plus an offsite copy of backup data.

Selecting a recovery site strategy has everything to do with cost and service level. The faster you want to recover data processing operations in a remote location, the more you will have to spend in order to build a site that is “ready to go” at the speed you require.

In a nutshell: Speed costs.

Multiple processing sites

Many large organizations operate multiple data centers for critical systems with real-time replication and load balancing between the various sites. This is the ultimate solution for large commercial sites that have little or no tolerance for downtime. Indeed, a well-engineered multi-site application can suffer significant whole-data-center outages without customers even knowing anything is wrong.

System resilience, high availability, quality of service, and fault tolerance

System resilience, high availability, quality of service (QoS), and fault tolerance are similar characteristics that are engineered into a system to make it as reliable as possible:

  • System resilience. This includes eliminating single points of failure in system designs and building fail-safes into critical systems.
  • High availability. This typically consists of clustered systems and databases configured in an active-active arrangement (both systems are running and immediately available) or an active-passive arrangement (one system is active, while the other is in standby but can become active, usually within a matter of seconds). In an active-passive cluster, a failover mechanism automatically switches the “active” role from one server in the cluster to another. (A minimal failover health-check sketch follows this list.)
  • Quality of service. Refers to a mechanism where systems that provide various services prioritize certain services to ensure they’re always available or perform at a certain level. For example, Voice over Internet Protocol (VoIP) systems typically are prioritized to ensure sufficient network bandwidth is always available to avoid any traffic delay or degradation of voice quality. Other services that are not as sensitive to delays (such as web browsing or file downloads) will be prioritized at a lower level in such cases.
  • Fault tolerance. This includes engineered redundancies in critical components, such as multiple power supplies, multiple network interfaces, and RAID (redundant array of independent disks) configured storage systems.
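
Here is the minimal active-passive health-check sketch referenced above. Real clustering software handles this far more robustly; the host names, port, and threshold are made up:

    import socket

    ACTIVE, STANDBY = "app-node-1.example.com", "app-node-2.example.com"
    SERVICE_PORT = 443
    FAILURE_THRESHOLD = 3          # consecutive failed probes before failing over

    def node_is_healthy(host, port=SERVICE_PORT, timeout=2.0):
        """Simple TCP probe: can we open a connection to the service port?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    failures = 0

    def check_and_maybe_failover():
        """Return the node that should currently hold the 'active' role."""
        global failures
        if node_is_healthy(ACTIVE):
            failures = 0
            return ACTIVE
        failures += 1
        # After enough consecutive failures, the standby node is promoted.
        return STANDBY if failures >= FAILURE_THRESHOLD else ACTIVE

    print(check_and_maybe_failover())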

Implement Disaster Recovery (DR) Processes

A variety of disasters can beset an organization’s business operations. They fall into two main categories: natural and man-made.

In many cases, formal methodologies are used to predict the likelihood of a particular disaster. For example, a 50-year flood plain describes the area likely to be inundated by a flood so severe that it has only a 1-in-50 chance of occurring in any given year (in other words, a flood expected on average once every 50 years). The likelihood of each of the following disasters depends greatly on local and regional geography:

  • Fires and explosions
  • Earthquakes
  • Storms (snow, ice, hail, prolonged rain, wind, dust, solar)
  • Floods
  • Hurricanes, typhoons, and cyclones
  • Volcanoes and lava flows
  • Tornadoes
  • Landslides
  • Avalanches
  • Tsunamis
  • Pandemics

Many of these occurrences may have secondary effects; often these secondary effects have a bigger impact on business operations, sometimes in a wider area than the initial disaster (for instance, a landslide in a rural area can topple power transmission lines, which results in a citywide blackout). Some of these effects are

  • Utility outages: Electric power, natural gas, water, and so on
  • Communications outages: Telephone, cable, wireless, TV, and radio
  • Transportation outages: Road, airport, train, and port closures
  • Evacuations/unavailability of personnel: From both home and work locations

As if natural disasters weren’t enough, man-made disasters can also disrupt business operations, as a result of both deliberate and accidental acts:

  • Accidents: Hazardous materials spills, power outages, communications failures, and floods due to water supply accidents
  • Crime and mischief: Arson, vandalism, and burglary
  • War and terrorism: Bombings, sabotage, and other destructive acts
  • Cyber attacks/cyber warfare: Denial of Service (DoS) attacks, malware, data destruction, and similar acts
  • Civil disturbances: Riots, demonstrations, strikes, sickouts, and other such events

tip For a more complete reference on disaster recovery planning, we recommend IT Disaster Recovery Planning For Dummies.

Disasters can affect businesses in a lot of ways — some obvious, and others not so obvious.

  • Damage to business buildings. Disasters can damage or destroy a building or make it uninhabitable.
  • Damage to business records. Along with damaging a building, a disaster may damage the building’s contents, including business records, whether they exist on paper, on microfilm, or in electronic form.
  • Damage to business equipment. A disaster may damage business equipment, including computers, copiers, and all sorts of other machinery. Anything electrical or mechanical, from calculators to nuclear reactors, can be damaged in a disaster.
  • Damage to communications. Disasters can damage common carrier facilities, including telephone networks (both landline and cellular), data networks, and even wireless and satellite-based systems. Even if a business’s buildings and equipment are untouched by a disaster, communications outages can be crippling. Further, damaged communications infrastructure in other cities can knock out many businesses’ voice and data networks (the September 11, 2001, attacks had an immediate impact on communications over a wide area of the northeastern U.S.; a number of telecommunications providers had strategic regional facilities there).
  • Damage to public utilities. Power, water, natural gas, and steam services can be damaged by a disaster. Even if a business’s premises are undamaged, a utility outage can cause significant business disruption.
  • Damage to transportation systems. Freeways, roads, bridges, tunnels, railroads, and airports can all be damaged in a disaster. Damaged transportation infrastructure in other regions (where customers, partners, and suppliers are located, for instance) can cripple organizations dependent on the movement of materials, goods, or customers.
  • Injuries and loss of life. Violent disasters in populated areas often cause casualties. When employees, contractors, or customers are killed or injured, businesses are affected in negative ways: There may be fewer customers or fewer available employees to deliver goods and services. The losses don’t have to be the employees or customers themselves; when family members are injured or in danger, employees will usually stay home to care for them and return to work only when those situations have stabilized.
  • Indirect damage: suppliers and customers. If a disaster strikes a region where key suppliers or customers are located, the effect on businesses can be almost as serious as if the business itself suffered damage.

The list above isn’t complete, but should help you think about all the ways a disaster can affect your organization.

Response

Emergency response teams must be prepared for every reasonably possible scenario. Members of these teams need a variety of specialized training to deal with such things as water and smoke damage, structural damage, flooding, and hazardous materials.

Organizations must document all the types of responses so that the response teams know what to do. The emergency response documentation consists of two major parts: how to respond to each type of incident, and the most up-to-date facts about the facilities and equipment that the organization uses.

In other words, you want your teams to know how to deal with water damage, smoke damage, structural damage, hazardous materials, and many other things. Your teams also need to know everything about every company facility: Where to find utility entrances, electrical equipment, HVAC equipment, fire control, elevators, communications, data closets, and so on; which vendors maintain and service them; and so on. And you need experts who know about the materials and construction of the buildings themselves. Those experts might be your own employees, outside consultants, or a little of both.

remember It is the DRP team’s responsibility to identify the experts needed for all phases of emergency response.

Responding to an emergency branches into two activities: salvage and recovery. Tangential to this is preparing financially for the costs associated with salvage and recovery.

Salvage

The salvage team is concerned with restoring full functionality to the damaged facility. This restoration includes several activities:

  • Damage assessment: Arrange a thorough examination of the facility to identify the full extent and nature of the damage. Frequently, outside experts, such as structural engineers, perform this inspection.
  • Salvage assets: Remove assets, such as computer equipment, records, furniture, inventory, and so on, from the facility.
  • Cleaning: Thoroughly clean the facility to eliminate smoke damage, water damage, debris, and more. Outside companies that specialize in these services frequently perform this job.
  • Restoring the facility to operational readiness: Complete repairs, and restock and reequip the facility to return it to pre-disaster readiness. At this point, the facility is ready for business functions to resume there.

remember The salvage team is primarily concerned with the restoration of a facility and its return to operational readiness.

Recovery

Recovery comprises equipping the BCP team (yes, the BCP team — recovery involves both BCP and DRP) with any logistics, supplies, or coordination in order to get alternate functional sites up and running. This activity should be heavily scripted, with lots of procedures and checklists in order to ensure that every detail is handled.

Financial readiness

The salvage and recovery operations can cost a lot of money. The organization must prepare for potentially large expenses (at least several times the normal monthly operating cost) to restore operations to the original facility.

Financial readiness can take several forms, including:

  • Insurance: An organization may purchase an insurance policy that pays for the replacement of damaged assets and perhaps even some of the other costs associated with conducting emergency operations.
  • Cash reserves: An organization may set aside cash to purchase assets for emergency use, as well as to use for emergency operations costs.
  • Line of credit: An organization may establish a line of credit, prior to a disaster, to be used to purchase assets or pay for emergency operations should a disaster occur.
  • Pre-purchased assets: An organization may choose to purchase assets to be used for disaster recovery purposes in advance, and store those assets at or near a location where they will be utilized in the event of a disaster.
  • Letters of agreement: An organization may wish to establish legal agreements that would be enacted in a disaster. These may include the use of emergency work locations (such as nearby hotels), use of fleet vehicles, appropriation of computers used for lower-priority systems, and so on.
  • Standby assets: An organization can use existing assets as items to be re-purposed in the event of a disaster. For example, a computer system that is used for software testing could be quickly re-used for production operations if a disaster strikes.

Personnel

People are the most important resource in any organization. As such, organizations must place human life above all other considerations when developing disaster response plans, and emergency responders must do the same when taking action after a disaster strikes. In terms of life safety, organizations can do several things to ensure the safety of personnel:

  • Evacuation plans. Personnel need to know how to safely evacuate a building or work center. Signs should be clearly posted, and drills routinely held, so that personnel can practice exiting the building or work center calmly and safely. For organizations with large numbers of customers or visitors, additional measures need to be taken so that persons unfamiliar with evacuation routes and procedures can safely exit the facilities.
  • First aid. Organizations need to have plenty of first aid supplies on hand, including longer-term supplies in case a natural disaster prevents paramedics from responding. Personnel should be trained in first aid and CPR ahead of time so they can respond during a disaster, especially when communications and/or transportation facilities are cut off.
  • Emergency supplies. For disasters that require personnel to shelter in place, organizations need to stock emergency water, food, blankets and other necessities in the event that personnel are stranded at work locations for more than a few hours.

remember Personnel are the most important resource in any organization.

Communications

A critical component of the DRP is the communications plan. Employees need to be notified about closed facilities and any special work instructions (such as an alternate location to report for work). The planning team needs to realize that one or more of the usual means of communications may have also been adversely affected by the same event that damaged business facilities. For example, if a building has been damaged, the voice-mail system that people would try to call into so that they could check messages and get workplace status might not be working.

Organizations need to anticipate the effects of an event when considering emergency communications. For instance, you need to establish in advance two or more ways to locate each important staff member. These ways may include landlines, cell phones, spouses’ cell phones, and alternate contact numbers (such as neighbors or relatives).

tip Text messaging is often an effective means of communication, even when mobile communications systems are congested.

Many organizations’ emergency operations plans include the use of audio conference bridges so that personnel can discuss operational issues hour by hour throughout the event. Instead of relying on a single provider (which you might not be able to reach because of communications problems or because it’s affected by the same event), organizations should have a second (and maybe even a third) audio conference provider established. Emergency communications documentation needs to include dial-in information for both (or all three) conference systems.

In addition to internal communications, the DRP must address external communications to ensure that customers, investors, government, and media are provided with accurate and timely information.

Assessment

When a disaster strikes, an organization’s DRP needs to include procedures to assess damage to buildings and equipment.

First, the response team needs to examine buildings and equipment, to determine which assets are a total loss, which are repairable, and which are still usable (although not necessarily in their current location).

For events such as floods, fires, and earthquakes, a professional building inspector will usually need to examine a building to determine whether it is fit for occupation. If not, the next step is determining whether a limited number of personnel will be permitted to enter the building to retrieve needed assets.

Once assessment has been completed, assets can be divided into three categories:

  • Salvage. These are assets that are a total loss and cannot be repaired. In some cases, components can be removed to repair other assets.
  • Repair. Some assets can be repaired and returned to service.
  • Reuse. Undamaged assets can be placed back into service, although this may require them to be moved to an alternate work location if the building cannot be occupied.

Restoration

The ultimate objective of the disaster recovery team is the restoration of work facilities with their required assets, so that business may return to normal. Depending on the nature of the event, restoration may take the form of building repair, building replacement, or permanent relocation to a different building.

Similarly, the assets used in each building may need to undergo their own restoration, whether that takes the form of replacement, repair, or simply placing them back into service in whatever location is chosen.

Prior to full restoration, business operations may be conducted in temporary facilities, possibly by alternate personnel who may be other employees or contractors hired to fill in and help out. These temporary facilities may be located either near the original facilities or some distance away. The circumstances of the event will dictate some of these matters, as well as the organization’s plans for temporary business operations.

Training and awareness

An organization’s ability to respond effectively to a disaster is highly dependent on its advance preparations. In addition to developing high-quality, workable disaster recovery and business continuity plans that are kept up to date, the next most important part is making sure that employees and other needed personnel are periodically trained in the actual response and continuity procedures. Training and practice help to reinforce understanding of proper response procedures, giving the organization the best chance of surviving a disaster.

An important part of training is the participation in various types of testing, which is discussed in the following section.

Test Disaster Recovery Plans

By the time an organization has created a DRP, it has probably spent hundreds of hours and possibly tens (or hundreds) of thousands of dollars on consulting fees. You’d think that after making such a big investment, it would test the DRP to make sure that it really works when an actual disaster strikes!

The following sections outline DRP testing methods.

Read-through

A read-through (or checklist) test is a detailed review of DRP documents, performed by individuals on their own. The purpose of a read-through test is to identify inaccuracies, errors, and omissions in DRP documentation.

It’s easy to coordinate this type of test, because each person who performs the test does so when their schedule permits (provided they complete it before any deadlines).

By itself, a document review is an insufficient way to test a DRP; however, it’s a logical starting place. You should perform one or more of the other DR tests described in the following sections shortly after you do a read-through test.

Walkthrough or tabletop

A walkthrough (or tabletop or structured walkthrough) test is a team approach to the read-through test. Here, several business and technology experts in the organization gather to “walk” through the DRP. A moderator or facilitator leads participants to discuss each step in the DRP so that they can identify issues and opportunities for making the DRP more accurate and complete. Group discussions usually help to identify issues that people will not find when working on their own. Often the participants want to perform the review in a fancy mountain or oceanside retreat, where they can think much more clearly! (Yeah, right.)

During a walkthrough test, the facilitator writes down “parking lot” issues (items to be considered at a later time, written down now so they will not be forgotten) on a whiteboard or flipchart while the group identifies those issues. These are action items that will serve to make improvements to the DRP. Each action item needs to have an accountable person assigned, as well as a completion date, so that the action items will be completed in a reasonable time. Depending upon the extent of the changes, a follow-up walkthrough may need to be conducted at a later time.

tip A walkthrough test usually requires two or more hours to complete.

Simulation

In a simulation test, all the designated disaster recovery personnel practice going through the motions associated with a real recovery. In a simulation, the team doesn’t actually perform any recovery or alternate processing.

An organization that plans to perform a simulation test appoints a facilitator who develops a disaster scenario, using a type of disaster that’s likely to occur in the region. For instance, an organization in San Francisco might choose an earthquake scenario, and an organization in Miami could choose a hurricane.

In a simple simulation, the facilitator reads out announcements as if they’re news briefs. Such announcements describe an unfolding scenario and can also include information about the organization’s status at the time. An example announcement might read like this:

It is 8:15 a.m. local time, and a magnitude 7.1 earthquake has just occurred, fifteen miles from company headquarters. Building One is heavily damaged and some people are seriously injured. Building Two (the one containing the organization’s computer systems) is damaged and personnel are unable to enter the building. Electric power is out, and the generator has not started because of an unknown problem that may be earthquake related. Executives Jeff Johnson and Sarah Smith (CIO and CFO) are backpacking on the Appalachian Trail and cannot be reached.

The disaster-simulation team, meeting in a conference room, discusses emergency response procedures and how the response might unfold. They consider the conditions described to them and identify any issues that could impact an actual disaster response.

The simulation facilitator makes additional announcements throughout the simulation. Just like in a real disaster, the team doesn’t know everything right away — instead, news trickles in. In the simulation, the facilitator reads scripted statements that, um, simulate the way that information flows in a real disaster.

A more realistic simulation can be held at the organization’s emergency response center, where some resources that support emergency response may be available. Another idea is to hold the simulation on a day that is not announced ahead of time, so that responders will be genuinely surprised and possibly be less prepared to respond.

tip Remember to test your backup media to make sure that you can actually restore data from backups!

Parallel

A parallel test involves performing all the steps of a real recovery, except that you keep the real, live production systems running. The actual production systems run in parallel with the disaster recovery systems. The parallel test is very time-consuming, but it does test the accuracy of the applications because analysts compare data on the test recovery systems with production data.

The technical architecture of the target application determines how a parallel test needs to be conducted. The general principle of a parallel test is that the disaster recovery system (meaning the system that remains on standby until a real disaster occurs, at which time the organization presses it into production service) processes work at the same time that the primary system continues its normal work. Precisely how this is accomplished depends on technical details. For a system that operates on batches of data, those batches can be copied to the DR system for processing there, and the results can be compared for accuracy and timeliness.
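
For a batch-oriented system, that comparison can be as simple as diffing the two result sets, as in this minimal sketch (the file names and the assumption of an "id" column are hypothetical):

    import csv

    def load_results(path):
        """Load a batch output file keyed by record ID (assumes an 'id' column)."""
        with open(path, newline="") as f:
            return {row["id"]: row for row in csv.DictReader(f)}

    production = load_results("production_batch_output.csv")
    dr_copy = load_results("dr_batch_output.csv")

    # Records missing from either side, or present with different values, are findings
    # the test team needs to explain before the DR system can be trusted.
    missing_on_dr = production.keys() - dr_copy.keys()
    mismatched = [record_id for record_id in production.keys() & dr_copy.keys()
                  if production[record_id] != dr_copy[record_id]]

    print(f"{len(missing_on_dr)} records missing on DR, {len(mismatched)} mismatched")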

Highly interactive applications are more difficult to test in a strictly parallel test. Instead, it might be necessary to record user interactions on the live system and then “play back” those interactions using an application testing tool. Then responses, accuracy, and timing can be verified after the test to verify whether the DR system worked properly.

While a parallel test may be difficult to set up, its results can provide a good indication of whether disaster recovery systems will perform during a disaster. Also, the risks associated with a parallel test are low, since a failure of the DR system will not impact real business transactions.

remember The parallel test includes loading data onto recovery systems without taking production systems down.

Full interruption (or cutover)

A full interruption (or cutover) test is similar to a parallel test except that in a full interruption test, a function’s primary systems are actually shut off or disconnected. A full interruption test is the ultimate test of a disaster recovery plan because one or more of the business’s critical functions actually depends upon the availability, integrity, and accuracy of the recovery systems.

A full interruption test should be performed only after successful walkthroughs and at least one parallel test. In a full interruption test, the backup systems process the full production workload and support all primary and ancillary functions, including:

  • User access
  • Administrative access
  • Integrations to other applications
  • Support
  • Reporting
  • … And whatever else the main production environment needs to support

remember A full interruption test is the ultimate test of the ability for a disaster recovery system to perform properly in a real disaster, but it’s also the test with the highest risk and cost.

Participate in Business Continuity (BC) Planning and Exercises

Business continuity and disaster recovery planning are closely related but distinctly different activities. As described in Chapter 3, business continuity focuses on keeping a business running after a disaster or other event has occurred, while disaster recovery deals with restoring the organization and its affected processes and capabilities back to normal operations.

tip If you don’t recall the similarities and differences between business continuity and disaster recovery planning, we strongly recommend that you refer back to Chapter 3!

Security professionals need to take an active role in their organization’s business continuity planning activities and related exercises. As a CISSP, you’ll be a recognized expert in the area of business continuity and disaster recovery, and you will need to contribute your specialized knowledge and experience to help your organization develop and implement effective and comprehensive business continuity and disaster recovery plans.

Implement and Manage Physical Security

Physical security is yet another important aspect of the security professional’s responsibilities. Important physical security concepts and technologies are covered extensively in Chapter 5 and Chapter 7.

As with other information security concepts, ensuring physical security requires appropriate controls at the physical perimeter (this includes the building exterior, parking areas, and common grounds) and internal security controls to (most importantly) protect personnel, as well as to protect other physical and information assets from various threats, such as fire, flooding, severe weather, civil disturbances, terrorism, criminal activity, and workplace violence.

Address Personnel Safety and Security Concerns

Security professionals contribute to the safety and security of personnel by helping their organizations develop and implement effective personnel security policies (discussed in Chapter 3), and through physical security measures (discussed in the preceding section, as well as Chapter 5 and Chapter 7).

remember Saving human lives is the first priority in any life-threatening situation.
