CHAPTER 53

MONITORING AND CONTROL SYSTEMS

Caleb S. Coggins and Diane E. Levine

53.1 INTRODUCTION

53.1.1 Prevention, Detection, and Response

53.1.2 Controlling versus Monitoring

53.1.3 Control Loop

53.1.4 Defining the Scope and System Requirements

53.2 CHANGE AND SECURITY IMPLICATIONS

53.2.1 Regulations, Policies, and Frameworks

53.2.2 Change Management

53.2.3 Configuration Protection

53.2.4 Performance Considerations

53.3 SYSTEM MODELS

53.3.1 Internal, One to One, One to Many, and Distributed

53.3.2 Automation and the Human–Machine Interface

53.3.3 Snapshots versus Real Time

53.3.4 Memory Dumps

53.4 TARGETS AND METHODS

53.4.1 Overview

53.4.2 Process Flow and Job Scheduling

53.4.3 Network Connectivity

53.4.4 Environmental Concerns

53.4.5 System State

53.4.6 System Components

53.4.7 Process Activities

53.4.8 File System

53.4.9 Access Controls

53.5 LOG MANAGEMENT

53.5.1 Log Generation

53.5.2 Types of Log File Records

53.5.3 Automation and Resource Allocation

53.5.4 Log Record Security

53.6 DATA AGGREGATION AND REDUCTION

53.6.1 Centralized Data Stores

53.6.2 Filtered Queries

53.6.3 Analyzing Log Records

53.6.4 Dashboards

53.7 NOTIFICATIONS AND REPORTING

53.7.1 Alerts

53.7.2 Trend Analysis and Reporting

53.8 MONITORING AND CONTROL CHALLENGES

53.8.1 Overview

53.8.2 Industrial Control Systems

53.8.3 Mobile Computing

53.8.4 Virtualization

53.9 SUMMARY

53.10 REFERENCES

53.11 NOTES

53.1 INTRODUCTION.

Monitoring and control (M&C) systems address security events through prevention, detection, and response. When aligned with security policy and appropriately implemented, these systems enable organizations to analyze trends, manage security incidents, and mitigate risk. Automation will play an important role as organizations seek to improve the efficiency and usability of real-time monitoring, controls, and secure log management. Although the scope of M&C systems may vary in size and complexity, the central components remain the same: data collection, data reduction, and system response.

Data collection includes log file generation and storage. Computer systems use log files to store records of events. Logged information includes system time, relevant users, affected systems, and the actions performed. Data reduction aggregates and correlates data into an easily understood and usable format. These information dashboards enable information managers and operators to understand quickly the state of the infrastructure, without the historically time-consuming process of manually reviewing individual log files. System response includes alert notification, corrective actions, and recovery procedures. The response may be automated or may involve feedback through a human-machine interface (HMI). The future viability of an information infrastructure relies on the effectiveness of these components.

53.1.1 Prevention, Detection, and Response.

In an ideal scenario, automated systems would intelligently anticipate, detect, and prevent all problems from negatively impacting systems and business processes. However, not all problems merit costly security systems or dedicated sensors. For practical purposes, organizations must deploy cost-effective sensors and controls necessary to mitigate unacceptable risk. Prevention refers to the system's ability to prohibit an undesirable action. Common tools used in network security include in-line intrusion prevention systems (IPSs), intrusion detection systems (IDSs), and unified threat management (UTM) appliances. Deploying these systems at choke points and at the network edge restricts the scope of worm propagation and other undesirable network communications. In the physical security realm, single-person mantraps prevent individuals from tailgating into secure facilities.1

Detection refers to the ability to identify attacks and other security concerns through sensory input. Systems rely on automated detection mechanisms and human observation. M&C systems must have reliable detection mechanisms in order to manage security events and provide meaningful metrics on the state of the information infrastructure. Proactive detection may involve the authorized use of a vulnerability scanner by penetration testers. Reactive detection includes the identification of an attacker who performed a denial of service (DoS) attack on an enterprise network.2

Response refers to the output of M&C systems. A monitoring system may detect a malicious script disabling the host-based firewall on all of the workstations in the internal network, then log the activity and send an administrative alert to an operator or centralized correlation engine. A control system, however, would correct the firewall state or possibly even prohibit someone from making the change in the first place. Additional human response may occur after the administrative alert or dashboard refresh, such as escalation to an incident response team or implementing new controls to protect workstation integrity.3

Failure to detect and report security breaches may have legal ramifications, as many states require security breach notification.4 Implementing defense in depth (layered defenses) increases the likelihood of threat detection, incident prevention, and response. AC power loss at a secure facility and a nonresponsive battery or generator backup may render a closed-circuit television (CCTV) system incapable of detecting a physical attack. An additional layer, such as physical security guards, can provide a degraded level of monitoring while technicians restore power and monitoring systems to normal functionality.5

53.1.2 Controlling versus Monitoring.

Monitoring refers to the observation, logging, and reporting of systems, components, and events. Data collected from monitoring systems may identify intruders, locate system vulnerabilities, confirm operational status, and provide confirmation that controls are online and in place. M&C systems operate in two modes, continuous and batch.

  1. Continuous mode provides ongoing measurement and compliance assurance. Systems that operate in real-time, continuous mode present a more accurate, up-to-the-second view of an organization's operational environment.
  2. Batch mode takes place at specific points in time. An annual audit often utilizes batch mode analysis, to evaluate system controls and scan for noncompliant processes.

Control refers to what we can manipulate and audit. Rather than merely observing and recording information about an unauthorized change in the operational environment, controls can restrict such activities based on predefined configurations and processes. Control systems ensure that the operational environment reflects stated policies and standards. The Control Objectives for Information and Related Technology (COBIT) section AI1.9 and ISO17799 section 9.7 identify the need to monitor system access and use (see Chapters 44 and 54 in this Handbook for discussion of these standards).

Organizations that depend on passwords for authentication should establish password standards in policy and draft procedures to meet those requirements. If the standard includes a 12-character minimum password length, with specific complexity requirements, then password controls would prohibit users from generating a 5-character password such as PASS1.6
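
As a minimal sketch of such a control, the following Python routine enforces an illustrative 12-character minimum and basic complexity classes; the exact rules and messages are assumptions standing in for an organization's actual password standard.

  import re

  # Illustrative policy values; the real settings come from the organization's standard.
  MIN_LENGTH = 12
  COMPLEXITY_PATTERNS = {
      "uppercase letter": r"[A-Z]",
      "lowercase letter": r"[a-z]",
      "digit": r"[0-9]",
      "symbol": r"[^A-Za-z0-9]",
  }

  def password_violations(candidate):
      """Return a list of policy violations; an empty list means the password passes."""
      violations = []
      if len(candidate) < MIN_LENGTH:
          violations.append("shorter than %d characters" % MIN_LENGTH)
      for name, pattern in COMPLEXITY_PATTERNS.items():
          if not re.search(pattern, candidate):
              violations.append("missing at least one %s" % name)
      return violations

  print(password_violations("PASS1"))   # rejected: too short, no lowercase, no symbol

A check of this kind belongs in the account-provisioning path, so that a weak password is refused at creation time rather than merely logged afterward.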

Auditors inspect the operational environment to ensure controls meet regulatory requirements and that they reflect stated policies and standards. Typically, auditors evaluate a sample of control sets at scheduled intervals. These point-in-time, manual efforts provide insight into the infrastructure's degree of compliance. However, they do not provide the same level of efficiency as integrating automation into a framework that supports continuous information assurance.

The environment directly impacts system security. Temperature, humidity, air quality, and other variables may introduce unscheduled system failures in a work environment. Such variables outside our control that directly affect our information infrastructure require monitoring. Computer system management applications monitor internal components as well as temperature and voltage levels. Most organizations may not be able to control the temperature of an overheating processor or a lack of electricity provided by the utility company, but sensory thresholds in a monitoring application may trigger automated scripts or notify local personnel to initiate a system shutdown.

Organizations decide to implement systems with monitoring or control capabilities for a number of reasons including cost, logistical difficulties, and intentional design. For organizations with a nascent security program and incomplete understanding of the operational environment, monitoring can even provide a means of transition to future controls. As part of a cultural shift, monitoring can baseline current activities such as Internet usage and gradually introduce filtering controls to enforce security policies.

53.1.3 Control Loop.

For most environments, humans remain in the control loop. The control loop consists of the controller, the target system, a two-way communication path, and the transmitted data. The control loop is considered a closed loop when the process is automated and does not require human interaction; an open loop involves human interaction. The requirements of the system determine the loop type. When gas pipeline sensors detect a sudden decrease in pressure, an automated control system may need to perform an immediate lockdown to isolate the leak and send alert notifications. Leaving the loop open and waiting for human response may result in significant product loss and possibly in environmental and legal difficulties. In the case of operating system patch deployment, an open loop would allow a system to interrogate hosts, determine patch levels, and report status for further review and action. A supervisor could then approve specific patches to particular systems, after sufficient lab testing and coordination with system stakeholders.
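
The dispatch decision that separates the two loop types can be expressed in a few lines. The Python sketch below is purely illustrative: the threshold, sensor readings, and actuator calls are placeholders rather than any particular SCADA interface. A pressure drop beyond the limit closes the loop automatically, while smaller deviations are queued for human review.

  PRESSURE_DROP_LIMIT = 50.0   # illustrative threshold, in arbitrary pressure units

  def isolate_segment(segment_id):
      # Placeholder for the automated, closed-loop control action.
      print("Closing valves around segment %s and sending alert notifications" % segment_id)

  def queue_for_operator(segment_id, reading):
      # Placeholder for the open-loop path: a human decides the next action.
      print("Segment %s reading %.1f queued for operator review" % (segment_id, reading))

  def handle_reading(segment_id, previous, current):
      drop = previous - current
      if drop >= PRESSURE_DROP_LIMIT:
          isolate_segment(segment_id)              # closed loop: no human in the path
      elif drop > 0:
          queue_for_operator(segment_id, current)  # open loop: human interaction required

  handle_reading("P-12", previous=480.0, current=400.0)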

53.1.4 Defining the Scope and System Requirements.

For M&C systems to operate effectively, management must clearly define the scope and system requirements. The scope specifically refers to the relevant processes and the people who interact with the computing environment. In order to monitor the Research and Development (R&D) Department for network communications regarding intellectual property, the scope may include IT/IS support staff, shared network resources, and existing processes to communicate with remote sites and external departments. The system requirements refer to the capabilities necessary to perform the desired functions of M&C. The requirements for monitoring R&D may include the need for real-time packet captures, an algorithm to identify intellectual property, automated alerts, and a reasonably secured collection device. Installing a network test access port (TAP) on the network uplink and then writing the captured traffic to a data collector for future analysis may seem logical. However, if a site survey identifies the use of undocumented cellular modems, unencrypted wireless hot spots, and personal e-mail accounts for sensitive communications, the network TAP will fail to address the original goal. In this case, the system requirements could expand to include control elements, such as network access control (NAC), mandatory IPSec communication policies, and security awareness training for R&D personnel.

53.2 CHANGE AND SECURITY IMPLICATIONS

53.2.1 Regulations, Policies, and Frameworks.

Compliance requirements and an organization's security program influence the implementation of M&C systems. Regulations may require access controls to restrict information to authorized users and sufficient monitoring to detect unauthorized attempts to gain access to information resources. In response to HIPAA, an organization may need to integrate additional log review procedures as well as appropriate measures to control all systems and processes that touch or potentially transfer personally identifiable information (PII). A public company that regularly generates financial reports needs adequate controls, to ensure accurate reporting. A security policy may reference the need to comply with Sarbanes-Oxley (particularly Section 404). The scope of an M&C system would then need to include computing resources, information process flows, and personnel with access to the financial information system. M&C systems should reflect the expectations of an organization's security policy and should fit into a relevant framework, such as COBIT. Utilizing a control framework provides logical management of M&C systems. The framework enables management to further understand enterprise processes and determine the need for risk mitigation or acceptance. PO9.7 in COBIT identifies control systems as part of the safeguard selection portion of risk management.

53.2.2 Change Management.

M&C systems enhance an organization's ability to identify and manage change in the information infrastructure. Knowing when authorized and unauthorized changes occur can improve problem resolution and can help in attack identification. In a data center environment, centralized system monitors track operational status. If database servers suddenly stop responding to network requests, a monitoring system could alert site staff to investigate the issue immediately. The cause may trace back to the remote exploit of a known vulnerability or to an unscheduled patch installation on production database servers during peak business hours. In either case, the information reported enables management to assess the situation and to implement corrective actions to reduce the likelihood of future attacks or procedural oversights.7

53.2.3 Configuration Protection.

When new systems are deployed or legacy systems are brought up to specific compliance levels, configuration concerns surface. Although policy-based standards may correctly define the appropriate configurations to protect the information assets, failure to consistently apply configuration templates or checklists introduces unnecessary complexity to vulnerability management. To address this challenge, M&C systems can centrally identify anomalies and bring systems back into compliance using predefined criteria. The PCI Data Security Standard (DSS) requires file integrity detection and alerting. However, an organization may also find it advantageous to deploy a system that will not only detect and alert, but will also automatically remediate integrity discrepancies and quarantine the noncompliant files.
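
A minimal sketch of the detect-and-alert portion of file integrity monitoring is shown below in Python, assuming a previously recorded baseline of SHA-256 hashes stored as JSON; the file paths and alert handling are illustrative, and production tools add secure baseline storage, scheduling, remediation, and quarantine.

  import hashlib, json

  def sha256_of(path):
      digest = hashlib.sha256()
      with open(path, "rb") as handle:
          for block in iter(lambda: handle.read(65536), b""):
              digest.update(block)
      return digest.hexdigest()

  def check_integrity(baseline_path):
      """Compare current hashes against the stored baseline and report discrepancies."""
      with open(baseline_path) as handle:
          baseline = json.load(handle)   # e.g., {"/etc/passwd": "ab12...", ...}
      for path, expected in baseline.items():
          try:
              actual = sha256_of(path)
          except OSError:
              print("ALERT: monitored file missing: %s" % path)
              continue
          if actual != expected:
              print("ALERT: integrity mismatch on %s" % path)

  # Example call, assuming a hypothetical baseline file:
  # check_integrity("/var/secure/baseline.json")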

In industrial control systems (ICSs), programmable logic controllers (PLCs) store and run ladder logic programs to control hardware devices in the field. Programmers or automation specialists update these programs using specialized software for industrial networks. Modifying the program will directly affect the output of the actuators connected to the PLC. For that reason, the program must be protected from malicious and accidental alteration. Incorrectly modifying a tank sensor's set point to a value greater than the tank's capacity could result in product loss, requiring manual override. Using a password to protect the running program and layered network defenses, such as a network IPS, contribute to configuration protection in this scenario.

53.2.4 Performance Considerations.

Like all systems, M&C systems require resources to operate. Depending on the operational model, these resources may come from the same host system, a network connection, or a remote device. Failing to gauge the performance impact of a monitoring system could leave the operational environment in a degraded state. Additionally, deploying new systems during peak times in a business cycle may disrupt normal operations. Installing an enterprise file integrity monitoring system and performing a baseline on production systems during peak hours could lead to overutilization of system resources and could halt critical production jobs. Changes in M&C require planning and coordination with relevant business units and technical staff, to ensure that new systems meet both the business needs and the technical requirements to operate efficiently.

53.3 SYSTEM MODELS

53.3.1 Internal, One to One, One to Many, and Distributed.

M&C systems fit into four general models: internal, one to one, one to many, and distributed. The type of model used depends on the complexity of the system and the type of information required. These models may include autonomous systems as well as those that require human interaction.

  1. An internal system is one of the simplest forms of M&C; it monitors or controls itself. A single server or network device may monitor processor utilization or other system components, to evaluate system performance issues. From a control perspective, an internal system may prohibit specific actions, such as unauthorized attempts to modify system files or to log on locally to the system.
  2. The one-to-one model enables one system to monitor or control another independent system. This would be fairly costly, if heavily implemented in large environments. However, it can be quite useful in specific, high-availability situations that require failover. A standby network firewall or critical server would continually monitor and mirror the configuration and session information of the primary system. Once the primary system faults, the monitoring system would assume production functions, to avoid significant disruptions to the business environment. Another example may involve a two-node array of Internet proxy servers. Each server can monitor the other one and trigger an alert when a system faults. Until the system issue is resolved, the array may operate in degraded mode.
  3. The one-to-many model is most common in enterprise environments. A central monitor and control system remotely manages a number of systems. From a central point, all targets may be interrogated and updated, without the need for manual, one-to-one changes. Centralized M&C can also reduce cost by simplifying M&C functions of multiple resources at various sites. This model also improves audit efficiency, as fewer systems must be reviewed to analyze changes throughout the information infrastructure.
  4. The distributed model involves sensors and controls dispersed throughout the environment. The control elements may operate independently or remotely with input from a central control system. Each system feeds information to a central collector for secure log management, reporting, and data reduction. For organizations attempting to manage heterogeneous devices, a distributed model enables staff to implement the best M&C components for specific systems while still maintaining the ability to aggregate data into a centralized repository. This level of aggregation provides a wider view of change in the infrastructure.
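
As an illustration of the data flow in the distributed model (item 4), the Python sketch below has a remote agent forward one event to a central collector over UDP; the collector address, port, and record layout are assumptions rather than any specific product's protocol.

  import json, socket, time

  COLLECTOR_ADDRESS = ("127.0.0.1", 5514)   # hypothetical central collector host and port

  def send_event(source, severity, message):
      """Serialize one event and forward it to the central collector."""
      record = {
          "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
          "source": source,
          "severity": severity,
          "message": message,
      }
      payload = json.dumps(record).encode("utf-8")
      with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
          sock.sendto(payload, COLLECTOR_ADDRESS)

  send_event("dbserver01", "warning", "disk utilization above 85 percent")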

53.3.2 Automation and the Human-Machine Interface.

Automation enables M&C systems to collect sensory information and to initiate a response, without the need for human interaction. Many information systems operate round the clock and generate large volumes of event logs. Delegating to operations staff the task of manually reviewing thousands of event logs and correlating events into meaningful information would be impractical. The purpose of M&C systems is to obtain meaningful intelligence from target systems and to report and correct anomalies. By automating data collection and reduction, managers and staff can review event summaries and respond to situations that have not already been automatically controlled.

The human-machine interface (HMI) is the point at which an operator communicates with a monitoring or control system. In industrial environments, HMI middleware transforms system activities into interactive screens, with enough capabilities for staff to perform specific job functions. The presentation of data is an important factor for HMI. Operators must have the ability to perform their job functions and to maintain adequate situational awareness, without access to unnecessary elements of a system. If the HMI runs on a standard personal computer, the operator should not be permitted to use the system to watch DVD movies or to participate in personal file-sharing networks on the Internet.

The time-saving value of automation has also been realized in autonomous vehicles, such as those implemented by the U.S. Department of the Navy. Unmanned undersea vehicles (UUVs)8 reflect the distributed model, performing autonomous actions, with information exchange through a sensor grid within the Global Information Grid (GIG).9 These systems can monitor environmental variables, detect threats such as unexploded ordnance, respond to change, and communicate with system owners. Much like intrusion detection and prevention systems in a network environment, UUVs can perform functions that may be unsuitable for direct human interaction. Both can perform continuous monitoring and collect a large volume of information that could not be collected through manual efforts. Also, similar to intrusion prevention systems in a network environment, a UUV may detect a threat, alert nearby UUVs, and report status back via the GIG. For network IPSs with worm-throttling capabilities, wormlike activities may be detected and cross-reported to additional sensors for rapid anomaly detection and containment.

53.3.3 Snapshots versus Real Time.

Snapshots provide a point-in-time view of a target system. Auditors use these for regularly scheduled audits. Scanning the environment on a monthly basis to confirm compliance can also unearth security trends within an environment. Trend reports identify security performance over time. An organization may notice a sharp decline in vulnerabilities after an external audit. Over time, without a stable vulnerability management program, the number of vulnerabilities will return to preaudit levels. Snapshots do not provide immediate or ongoing information on changes or corrective actions made to production systems. Additionally, snapshots will not verify that a control is working all of the time, only during the time that the snapshot was taken. If frequent changes impact a critical system during unexpected intervals, real-time M&C may be necessary. Real-time M&C can also address regulatory requirements and point the way toward proactive security.

Real-time monitoring refers to persistent, ongoing observation of a target. Real-time control refers to a control system's ability actively to influence its target. Industrial environments depend on information gathered through real-time M&C. When liquefied natural gas (LNG) travels through an LNG terminal,10 operators should not have to wait until the tanker is empty in order to confirm that any of the product successfully transferred to the storage tanks. An M&C system can continually monitor the volume of product leaving the vessel, compare that data to the volume entering the storage container, trigger alerts if a leak is detected, and initiate a shutoff to avoid product loss. When allocating and distributing resources to meet demand, real-time monitoring provides overseers with an immediate understanding of the areas of need. In a network environment, this may include peak network usage during an accounting cycle or daily floods of lunchtime Internet traffic. In industrial areas, real-time monitoring can detect where electric grid managers should route additional power resources to meet demand, how much to pull from reserves, as well as when to implement rolling blackouts to maintain grid integrity.11
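
The transfer comparison described above amounts to reading two meters on each interval and reacting when the mismatch exceeds a tolerance. The Python sketch below uses canned readings and placeholder callables in place of live instrumentation; the tolerance and units are illustrative.

  LEAK_TOLERANCE = 0.5   # illustrative allowable mismatch per interval, in cubic meters

  def monitor_transfer(read_outflow, read_inflow, request_shutoff, intervals):
      """Compare metered volumes each interval and react to a suspected leak."""
      for _ in range(intervals):
          leaving_vessel = read_outflow()     # placeholder for the vessel-side meter
          entering_storage = read_inflow()    # placeholder for the tank-side meter
          mismatch = leaving_vessel - entering_storage
          if mismatch > LEAK_TOLERANCE:
              print("ALERT: possible leak, mismatch of %.2f cubic meters" % mismatch)
              request_shutoff()
              return False
      return True

  # Canned readings standing in for live real-time instrumentation.
  out_readings = iter([10.0, 10.1, 9.8])
  in_readings = iter([9.9, 9.9, 8.9])
  monitor_transfer(lambda: next(out_readings), lambda: next(in_readings),
                   lambda: print("Shutoff requested"), intervals=3)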

For Internet-facing Web sites, real-time monitoring would identify a distributed denial of service (DDoS) attack and alert staff. With real-time control, a network security device could throttle down or block the attack traffic. For nonmalicious traffic floods, the Web site host might utilize an intelligent load balancer and geographically dispersed reserve systems to redirect requests to available resources. DS13.3 in COBIT identifies the need for adequate event logging to analyze activities over time. However, without well-planned, automated processes and data reduction, the volume of data generated from snapshots and real-time activities will overwhelm staff. Through centralized data collection and reduction, security auditors can evaluate the results of both real-time and snapshot reporting. This combination can identify general trends, ensure that M&C systems continually operate, and preserve evidence of a real-time attack.

53.3.4 Memory Dumps.

Memory dumps are representations of the data in memory. Memory dumps are used most typically after a system failure, but they are also useful in forensic research when investigators want the maximum amount of information possible from the system. There are two approaches to obtaining copies of memory: online, using diagnostic utilities while the system is running, and off-line, from magnetic storage media to which memory regions are copied.

53.3.4.1 Diagnostic Utilities.

Diagnostic utilities are system software routines that can be used for debugging purposes. Also known as debug utilities, these programs run at maximum privilege (root or supervisor level, or their equivalents) and allow the privileged user to see or modify any portion of memory. The utilities usually print or display the contents of memory regions in a variety of formats such as binary, octal (base 8), or hexadecimal (base 16), with conversion to ASCII for easier readability. The utilities generally can provide immediate access to memory structures such as terminal buffers that allow the analyst to see what specific users are typing or seeing on their screens, file buffers that contain data in transit to or from specific open files, spoolers (print buffers), and program-specific regions such as data stacks. Because debug utilities also allow modification of memory regions, they are to be used with the utmost circumspection; for security reasons, it is wise to formulate a policy that no debug utility with root access can be run without having two people present. In high-security operations, the output of debug utilities should be logged to paper files for proof that no unauthorized operations were carried out using these programs. Access to privileged debug programs should be tightly controlled, such as by strict access-control lists or even by encryption using restricted keys.
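
The hexadecimal-plus-ASCII display such utilities produce can be approximated in a few lines of Python. The sketch below formats an arbitrary byte buffer rather than live memory, since reading another process's memory requires privileged, platform-specific interfaces; the sample buffer is invented for illustration.

  def hexdump(data, width=16):
      """Print offset, hexadecimal bytes, and an ASCII rendering, one line per `width` bytes."""
      for offset in range(0, len(data), width):
          chunk = data[offset:offset + width]
          hex_part = " ".join("%02x" % byte for byte in chunk)
          ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
          print("%08x  %-*s  %s" % (offset, width * 3, hex_part, ascii_part))

  hexdump(b"user=alice pass=........\x00\x01\x02\x03")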

53.3.4.2 Output to Magnetic Media or Paper.

One method of doing a memory dump and follow-up analysis is by copying the data from memory onto magnetic media such as tape, removable disks, or rewriteable CDs. Although it was once practical to print the entire contents of memory to paper, the explosive growth of memory sizes makes such printing impractical in today's systems. In 1980, for example, a large multiuser minicomputer might have 1 megabyte of RAM available, so that the printout was a quite manageable, half-inch-thick stack of paper. At the time of this writing, however, when it is common to see PCs with 256 megabytes of RAM, a single printout of the entire contents of memory could be several feet thick.

Copying memory to a static storage medium is, therefore, preferred for memory dumps following system crashes.

53.3.4.3 Navigating the Dump Using Exploratory Utilities.

On production systems using reliable operating systems, crashes are rare and are generally explored thoroughly to identify the causes of the anomaly. Generally, after creating a dump, it is necessary to study it to locate the problem and then to correct it. For large-memory systems, exploratory utilities are used to speed the search for problems in the dump by allowing the analyst to find any named system table or other memory region.

53.3.4.4 Understanding System Tables.

System tables are directories of system data where each datum is identified by an assigned label, by its position in the table, or by pointers from other tables. An understanding of system tables is essential in analyzing system problems.

Regardless of the details of the operating system and the system-specific names of tables, some of the important system tables include:

  • Process control table. Pointers to the process tables for each process that is running on the system or that was running when the copy of memory was obtained
  • Process tables. Detailed information about each process, with pointers to all the tables for that particular process
  • Data stacks. All the variables used by specific processes
  • Buffers. Data in transit to or from files and devices, such as disks and terminals
  • Memory management tables. Lists of available memory blocks
  • Interprocess communications tables. For example, information about resource locking or any logical flags used by multiple processes

Working with someone who understands the detailed structure of the operating system tables can be critically important for security work in which investigators must determine exactly what happened during an intrusion or other unauthorized use of a system.

53.3.4.5 Security Considerations for Dump Data.

Memory dumps must be secured while in use and destroyed when appropriate. The dump contains the totality of a system's information in memory, including such data as:

  • Passwords that had just been typed into terminal buffers for use in changing logons or for accessing restricted subsystems and applications
  • Encryption keys
  • Confidential data obtained from restricted files and not authorized for visualization by operations staff (e.g., medical data from personnel files)
  • Financial data that could be used in frauds
  • National security information restricted to higher levels of clearance than the system administration

53.4 TARGETS AND METHODS.

This section provides an overview of the general classes of logging; Section 53.5 discusses specific types of log file records and their applications in system management, security management, and forensic analysis.

53.4.1 Overview.

Targets and methods used to perform control and monitoring vary, based on the business scope and the target system functions. Targets include process flow, job scheduling, network connectivity, environmental measurement, system state assurance, system component status, process activities, configuration settings, file system information, and access control. Current regulations may include only a subset of an organization's information systems in the scope of compliance. Changes to the infrastructure and compliance requirements are inevitable. It is important to implement scalable and flexible M&C in order to accommodate future needs.

53.4.2 Process Flow and Job Scheduling.

Some organizations use a combination of mainframe and distributed system resources. The mainframe group may use a particular batch job scheduler to process and monitor job status and output. Other distributed systems throughout the organization might use their own schedulers and logging capabilities. The complexity of this design, however, does not facilitate enterprise-wide monitoring and auditing. A centralized job scheduler may be necessary to monitor and control job flow. The centralized system will need to connect to disparate systems to collect job status and initiate commands. For management, an enterprise view of job scheduling can identify unnecessary complexity and redundancy, leading to future process improvement. When shifting to a central job scheduling system, it may seem desirable to shift the entire environment immediately or possibly to revert to a central mainframe. However, a gradual migration to centralized M&C would be a more prudent approach.

53.4.3 Network Connectivity.

Network connectivity refers to the devices, protocols, and communication media (wired or wireless) used in a computing environment. A network operations center will want to monitor the status of network links, health indicators of critical devices, and data streams traversing the network. In a layered environment, physical access to the network cabling may be secured in cable raceways and network closets or at distribution points. Devices will be up to date and configured according to best practices. Unnecessary protocols will be controlled by blocking or routing suspicious traffic to network black holes and honeynets for further intrusion analysis. Network intrusion prevention systems (IPSs) can identify a communications anomaly, block the communication exchange, log the event, and trigger an administrative alert. Prevention requires planning and sufficient controls to deny an attack from achieving its goal. A method to implement M&C for sensitive network nodes may involve moving nodes to isolated segments, deploying in-line network IPSs, and utilizing host-based security agents. Additional enterprise security assessment tools may also scan, report, generate tickets, propose remediation measures, and analyze security trends over time.
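
A minimal sketch of reachability polling for critical devices is shown below in Python; the device names, ports, and timeout are illustrative, and an operational tool would add scheduling, logging, and escalation rather than printing to the console.

  import socket

  # Hypothetical critical devices and the TCP management ports to test.
  CRITICAL_DEVICES = [
      ("core-router-1.example.net", 22),
      ("edge-firewall-1.example.net", 443),
  ]

  def is_reachable(host, port, timeout=3.0):
      """Return True if a TCP connection to host:port succeeds within the timeout."""
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  for host, port in CRITICAL_DEVICES:
      status = "up" if is_reachable(host, port) else "DOWN - investigate"
      print("%s:%d %s" % (host, port, status))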

In addition to monitoring general network connectivity, some networks are specifically deployed for the purpose of M&C. Legacy industrial systems now integrate common Ethernet networking elements into mechanical, electromechanical, and electronic environments. Wireless continues to emerge as an alternative means of M&C. The Zigbee12 specification utilizes the IEEE 802.15.4 standard to provide ad hoc, local networking connectivity. Zigbee is primarily used for M&C. In a plant floor or outdoor facility, additional cabling for M&C systems may be infeasible. A potential solution may be to integrate inexpensive, robust Zigbee control points. Organizations that continue to rely on dial-up modems and standard wired telephone communications may have a need to implement line taps or to detect rogue taps, within the bounds of the law. For geographically dispersed sites, satellite and leased-line network connections bring sites together to manage M&C data centrally.

53.4.4 Environmental Concerns.

Some environmental variables are easier to monitor than control. Whether we are using Zigbee to transmit sensory information or a software package to measure AC voltage irregularities, environmental measures provide additional situational awareness for information systems and staff. Humidity, temperature, smoke, and leaking water directly impact computing systems in a data center. Dedicated air-conditioning units and dehumidifiers can monitor and control humidity and temperature. Indeterminate sources of smoke or water leakage may be difficult to control with automated response, but monitoring and alert notification may trigger corrective procedures by staff in order to mitigate risk.

As part of a business continuity plan, it may become necessary for an organization to monitor and prepare for environmental threats such as floods, structural collapse, and power loss. A monitoring system, with early warning capability, could notify staff to report to an alternate work site or to prepare for contingency operations. In the event that an environmental threat disrupts control systems, staff must be capable of manual override. The EPA recommends familiarity with manual operations, in the event of a supervisory control and data acquisition (SCADA) system failure.13 In addition to manual override, integrating redundant M&C systems with seamless failover can decrease the likelihood of operational disruptions.

53.4.5 System State.

The system state refers to a collection of critical variables on the target system. In industrial control systems, the system state may reflect the overall picture of an electrical grid, or it may reduce to whether a single valve is open or closed. A monitoring system may need to track running processes in memory, services, open port connections, system files, hardware status, and log data. When an antivirus application process fails to start, resulting in a subsequent malware infection, the monitoring system can trigger an alert so that staff or an automated control system can isolate the infection. One method of monitoring the system state is to run an agent on the host system. Software agents monitor a predefined list of items and report status on a regular basis to a central monitoring hub. When the agent fails to respond or “check in,” the monitoring system can issue an alert to investigate the issue.
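
The check-in pattern reduces to recording the last report time for each agent and flagging any agent whose silence exceeds the expected interval. A minimal Python sketch, with an illustrative interval and agent name:

  import time

  CHECKIN_INTERVAL = 300   # seconds; each agent is expected to report at least this often

  last_seen = {}   # agent name -> timestamp of its most recent report

  def record_checkin(agent_name):
      last_seen[agent_name] = time.time()

  def find_silent_agents(now=None):
      """Return agents that have missed their expected check-in window."""
      now = time.time() if now is None else now
      return [name for name, stamp in last_seen.items()
              if now - stamp > CHECKIN_INTERVAL]

  record_checkin("hr-workstation-07")
  print(find_silent_agents(now=time.time() + 600))   # simulated clock: the agent is overdue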

Host intrusion prevention systems (HIPS) protect the system state of nodes in a network environment. HIPS analyze activities based on signature and heuristic detection routines. They provide a reasonable means of M&C, with the potential for centralized reporting and attack correlation. For postmortem analysis, memory dumps can be useful in analyzing problems with the system state. If a network monitoring system detected botnet-related activity originating from a system in human resources, capturing the live system state may be advantageous to an investigation. As part of the evidence collection process of forensics, controlling and preserving the system state requires specialized training and tools such as Helix.14

53.4.6 System Components.

Monitoring applications can track the usage and overall performance of system components, including CPU, memory, and storage media. Operating systems also have built-in tools to monitor components and store results in log files. Inventory management systems collect component-level information as well as software information as configured by operations staff. Six months after an equipment refresh, an audit may identify systems operating with only half of the originally purchased memory. The logged information will help to determine the scope of the issue.

Monitoring the performance of system components can also identify changes in business need or system abuse. With the deployment of a new Web portal, existing Web proxy servers may no longer be capable of handling the heavier load. Logged performance data would corroborate the need to expand the infrastructure or add additional capacity to existing systems. Performance set points trigger alerts when components depart from normal operating levels. The alerts may identify externally driven changes, such as voltage drops, or internal limits, such as maximum memory utilization. Heavy resource utilization on a historically idle PC running HMI may even point to misuse of the system for non-SCADA related activities, such as Internet surfing or video playback.

System component monitors most often attempt to identify overutilized CPUs, heavy memory consumption, and full storage media. However, identifying routinely underutilized resources gives management the ability to reallocate resources to areas of need rather than purchase unnecessary additions to the infrastructure. Disk quotas control the allowable amount of usable media for users and applications. To protect a system from unintentional or malicious overutilization, quotas should be reviewed as part of a layered strategy. Virtual machines use storage, memory, and processor quotas to control the resources utilized on a host system. With large virtualization environments, an enterprise virtual manager system may be necessary to ensure adequate load balancing across all of the virtual system hosts as well as to determine if additional tuning is required to improve virtual machine performance.
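
A minimal sketch of set-point checking for local components, written in Python and assuming the third-party psutil package is installed; the thresholds are illustrative values rather than recommended limits.

  import psutil   # third-party package; assumed to be available

  # Illustrative set points; real values come from capacity planning and policy.
  THRESHOLDS = {"cpu": 90.0, "memory": 85.0, "disk": 90.0}

  def component_alerts():
      """Return readable alerts for components operating above their set points."""
      readings = {
          "cpu": psutil.cpu_percent(interval=1),
          "memory": psutil.virtual_memory().percent,
          "disk": psutil.disk_usage("/").percent,
      }
      return ["%s at %.1f%% (set point %.1f%%)" % (name, value, THRESHOLDS[name])
              for name, value in readings.items() if value > THRESHOLDS[name]]

  for alert in component_alerts():
      print("ALERT:", alert)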

53.4.7 Process Activities.

A process refers to a running program in a computing device. Processes may initiate at the system or user level and may contain multiple threads. M&C processes can prevent malware infections and can identify abnormal process activities. Antivirus systems typically address this area. The SQL Slammer worm targeted a vulnerable process over a network connection, injecting code into memory without writing to disk storage.15 In addition to process monitoring, network traffic analysis and network IPSs could mitigate this type of activity. Using enterprise monitoring tools to aggregate critical process information on a number of systems will help to identify application issues, such as runaway processes.

Monitoring process activities can also play a role in software metering. Organizations may want to identify application usage for different departments, in order to avoid deploying unnecessary applications and paying higher licensing fees. Agents can centrally report elapsed time for target processes, for data aggregation, and for management report generation. For performance tuning and resource evaluation, agents can monitor CPU time for running programs. High CPU time would indicate high utilization of CPU resources. For behavior modification, organizations may choose to monitor and control unauthorized applications. Blocking the execution of known applications by file name or hash value can limit the level of follow-up required to locate and kill remote processes. Monitoring for uncategorized or unapproved processes may further help in identifying the cause of performance degradation on remote systems.
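
A minimal sketch of flagging uncategorized processes, again assuming the third-party psutil package; the allow list is an invented example, and a production control would match on hashes or signatures rather than names alone.

  import psutil   # third-party package; assumed to be available

  # Invented allow list; matching by name alone is weak, hence the hash checks noted above.
  APPROVED = {"systemd", "sshd", "python3", "postgres"}

  def unapproved_processes():
      """Return (pid, name) pairs for running processes not on the approved list."""
      findings = []
      for proc in psutil.process_iter(attrs=["pid", "name"]):
          name = (proc.info.get("name") or "").lower()
          if name and name not in APPROVED:
              findings.append((proc.info["pid"], name))
      return findings

  for pid, name in unapproved_processes():
      print("Unapproved process: pid=%d name=%s" % (pid, name))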

53.4.8 File System.

File system activities provide a wealth of auditing information. File systems store log data ranging from application errors to unauthorized authentication attempts. Just like IDS logging, file system logging requires performance tuning to aggregate the data and to avoid data loss. Logging every file access attempt and data modification on a system will quickly fill up an event log. If system owners configure the log file to overwrite when full, critical events may be lost. The volume of log entries also impacts network resources. A constant flow of file system activity log entries may create a network bottleneck and disrupt the data collection process. Focusing on critical system files, sensitive data, and configuration settings will decrease the number of log entries and provide a more targeted set of results.

The configuration files for operating systems and applications require layers of protection to avoid unauthorized modification. Much like the layers to be discussed in Section 53.5.4, regarding log record security, Access Control Lists (ACLs), encryption, digital signatures, and checksums can play a critical role in protecting objects within the file system. Attempts to bypass M&C systems may begin with configuration files and settings. End users attempting to bypass Internet Web filters may reconfigure their Web browsers to use a rogue proxy server. Attackers attempting to modify the configuration of a poorly secured Web server may leverage the security lapses to target additional systems connected to the internal network. Controlling changes to configuration settings via system policies and HIPS will mitigate these kinds of attacks.

53.4.9 Access Controls.

Access control is a central part of security management and compliance efforts. Organizations often rely on more than one stand-alone system for identification and authentication, including applications, network operating systems, and building security. As a result, they face significant challenges while attempting to communicate, aggregate, and manage access control events. Adding staff to and removing staff from a secure environment may involve badge identification, security tokens, and passwords. An M&C system targeting access control will need to ensure that unauthorized personnel do not have access to any physical or digital resources. Management may first need to develop a process flow with support from operations staff and business stakeholders.

Aggregating access logs and detailing failed login attempts and multiple user instances aid in the identification of system abuse. When users log on to network resources, they establish a session with the target system. Multiple sessions for the same user, at multiple sites, may indicate that a user account has been compromised or is being shared. Depending on the level of importance, this information may surface in an exception report, trigger an administrative alert, or result in an automated account lockout. Similarly, a series of invalid login attempts may reach a threshold and automatically disable the user account for a predefined period of time. This control minimizes the effort required to actively monitor system accounts while ensuring that attackers are not able to perform a brute-force attack on the credentials of an authorized user.
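
A minimal sketch of the lockout logic in Python; the failure threshold, window, and lockout period are illustrative, and a real implementation would record each decision in the audit trail discussed later in this chapter.

  import time
  from collections import defaultdict

  MAX_FAILURES = 5          # illustrative threshold
  WINDOW_SECONDS = 900      # count failures within a 15-minute window
  LOCKOUT_SECONDS = 1800    # predefined lockout period

  failures = defaultdict(list)   # username -> timestamps of recent failed attempts
  locked_until = {}              # username -> time at which the lockout expires

  def record_failed_login(user, now=None):
      now = time.time() if now is None else now
      if locked_until.get(user, 0) > now:
          return "locked"
      recent = [t for t in failures[user] if now - t < WINDOW_SECONDS]
      recent.append(now)
      failures[user] = recent
      if len(recent) >= MAX_FAILURES:
          locked_until[user] = now + LOCKOUT_SECONDS
          print("ALERT: account %s locked after %d failed attempts" % (user, len(recent)))
          return "locked"
      return "failure recorded"

  for _ in range(5):
      record_failed_login("jsmith")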

53.5 LOG MANAGEMENT

53.5.1 Log Generation.

A log file contains a record of events. This record is the basic building block for centralized M&C systems. These files serve as a digital audit trail for system activities. Operating systems and applications may contain built-in logging capabilities. However, logging is not always enabled, or it may even require additional tools to convert the output into a usable or aggregated form. The system monitors activities and writes relevant information as a log file entry. Operating system log files track application issues, access attempts, and system-wide problems. Thresholds and file sizes can be adjusted to meet the requirements of the environment. It may be unnecessary to write a log entry every time a user successfully opens or closes a file on an internal workgroup file server, but doing so would be necessary to enable success and failure auditing for users attempting to log on to a sensitive financial system.

Transaction logs take logging one step further, by storing copies of the actual changes made to a system. They typically have a fixed size; when one fills, the system automatically generates a new file for future transactions. Most common in database applications, transaction logs can be used to rebuild a corrupt database. Using a known-good database backup and all of the transaction logs prior to the failure, databases can be rolled forward to a reliable point in time. Transaction logs should be actively monitored, as they can quickly consume large amounts of storage. Normally, committed transaction logs are purged after a successful backup. Other log types, such as Web site traffic, may automatically generate new log files once per day; these logs may not contain entire Web site transactions, but they will identify the visitor based on predefined criteria. The key here is the flexibility in determining what will be monitored and what format will work best for log generation.

Data retention policies must clearly define how long data should be retained, according to legal requirements and internally driven policy. Log file archiving and storage become problematic over time. System-level log files may reach maximum size and overwrite existing log data or trigger a system shutdown. Centralized log management systems may slow to a crawl when querying a relational database for data to export to a flat file. Log collection and analysis systems must be scalable enough to process large volumes of data that will increase in size and number over time. A system designed to handle an average load of 10GB per day, with a 12-month retention cycle, may require compression technology to manage efficiently a 200GB-per-day average load during the next fiscal year.
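
The sizing question in the preceding paragraph can be checked with simple arithmetic. The Python sketch below computes raw retention storage; the 8:1 compression ratio is an assumption for illustration, not a measured figure.

  def retention_storage_tb(daily_gb, retention_days, compression_ratio=1.0):
      """Storage needed for a retention window, in terabytes (decimal, 1 TB = 1,000 GB)."""
      return daily_gb * retention_days / compression_ratio / 1000.0

  # Designed load versus the projected load, both with a 12-month retention cycle.
  print(retention_storage_tb(10, 365))                        # about 3.7 TB uncompressed
  print(retention_storage_tb(200, 365))                       # about 73 TB uncompressed
  print(retention_storage_tb(200, 365, compression_ratio=8))  # about 9.1 TB at an assumed 8:1 ratio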

53.5.2 Types of Log File Records.

A log file (sometimes known as an audit trail) traditionally contains records about transactions that have updated online files or otherwise represent events of interest (e.g., logging on to the system or initiating a process). Some database or application log files also may contain copies of affected records as they appeared before and after the logged transaction. There are a number of different types of log file records, and each contains a specific kind of information. Typically, all the operating system log records are stored in files that have a maximum configured size and are identified by monotonically increasing file numbers. When the file fills up, the operating system opens another one with the next number in sequence as its identifier. Because log file records of different types have different sizes, on some operating systems the log file is defined with a variable record length.

53.5.2.1 System Boot.

The system boot log record contains information on booting up, or starting, the machine. Since this is related to a specific activity and generally to a specific area of the hardware or media, information on the boot can prove helpful when the boot fails and analysis needs to take place.

53.5.2.2 System Shutdown.

The system shutdown log record contains information on when the system was shut down and by whom. This information can be invaluable when attempting to analyze a problem or find a saboteur. On systems that include emergency shutdowns (e.g., by calling routines such as suddendeath when system parameters fall outside the range of allowable values), the shutdown record may contain specific information about the cause of the emergency shutdown.

From a security standpoint, trustworthy system boot and system shutdown records can prevent a malefactor from concealing a shutdown followed by unauthorized boot to a diagnostic subsystem that would allow file manipulations without log records to track the operations. The boot records would show an unexplained gap between shutdown and boot.

53.5.2.3 Process Initiation.

A process begins when a specific program is loaded and run by a particular user at a particular time. Log records for process initiation show when the various processes were initiated and who initiated them. These files provide a method of tracking employee activity as well as monitoring events that occur. In addition, such records allow cost recovery using chargeback at different rates for different programs or for program launches at different times of day. More important, the record of which programs were executed by whom at which times can be invaluable in forensic research.

53.5.2.4 Process Termination.

When reviewing the process termination log record, an administrator will be able to tell when each process completed or was terminated for some other reason. Some systems may provide more information, such as why an unscheduled or abrupt termination occurred, but not all process termination log records provide that information. Process termination records typically include valuable statistical information such as:

  • Which process spawned or forked the process in question
  • Identification of any processes spawned by the process
  • Number of milliseconds of CPU used
  • Number of files opened and closed by the process
  • Total number of input/output (I/O) operations completed by the process
  • Total size of the memory partitions allocated to the process
  • Maximum size of the data stack
  • Number and maximum size of extra data segments in memory
  • How many swaps to virtual memory were needed during the existence of the process
  • Maximum priority assigned to the process for scheduling by the task manager

53.5.2.5 Session Initiation.

A session consists of the communications between a particular user and a particular server during a particular time period. Whenever a user logs on to a system and initiates a session, a record of that event can be found in the session initiation log record. Frequent review of these records provides important information to alert administrators that an intruder is in the system. For instance, if an administrator knows that an authorized user is on vacation and the session initiation log file shows that a session was initiated for that particular user's ID, chances are significant that an unauthorized party used the system via a borrowed or stolen user ID. These records are particularly valuable during forensic work.

53.5.2.6 Session Termination.

When a session terminates, for whatever reason, a copy of the time when it terminated is generally stored in the session termination log record. Much like the process termination record, the session termination record can include a great deal of aggregated information about the activities carried out during the session, such as total I/O, total number of processes launched and terminated, total number of files opened and closed, and so forth.

53.5.2.7 Invalid Logon Attempts.

The invalid logon attempt file can prove invaluable in cases where logon attempts do not succeed. In some instances, the file can tell if the user attempted to log on with an incorrect password, if the user exceeded the allowed number of failed attempts, or if the user was attempting to log on at a seemingly unusual time. These log records can provide important information in cases where an administrator is attempting to track specific actions to a user or to ascertain if the logon failure was due to a simple error or to an attempted impersonation by an unwanted outsider. In hardwired networks, where every device has a unique identifier, the records usually include a specific identifier that allows administrators to track down the physical device used for the attempted logons.

53.5.2.8 File Open.

The file open log record provides information on when each specific file was opened and by which process; in addition, the record generally records the mode in which the file was opened: for example, exclusive read and write, exclusive read, exclusive write with concurrent read, append only, or concurrent read and write.

53.5.2.9 File Close.

The file close log record gives an administrator information regarding when the file was closed, by which process, and by what means. The file usually captures information on whether the user specifically closed the file or whether some other type of interruption occurred. The records usually include details of total read and write operations, including how many physical blocks were transferred to accomplish the total number of logical I/O operations.

53.5.2.10 Invalid File Access Attempts.

An important log record in the M&C effort, the invalid file access attempt shows the administrator when and to which files there were invalid file access attempts. The records generally show which process attempted the I/O and why it was refused by the file system (e.g., attempted write to a file opened for read-only access, violation of access control list, or violation of file-access barriers).

53.5.2.11 File I/O.

Whenever information is placed into, read out of, or deleted from a file (input/output), the information regarding those changes is captured in the file I/O log. As mentioned earlier, the log includes images of a record before and after it was accessed. The I/O log can be used in file recovery after system or application crashes. Coupled with transaction-initiation and termination records, such data can be used for automatic roll-back or roll-forward recovery systems.

The activities recorded here can prove especially helpful when trying to validate actions that were taken and to attribute them to specific individuals. Detailed logs are typical for databases, where a subsystem provides optional logging of all I/O. Application log records, designed and programmed by application developers, also typically allow administrators to enable such detailed logging.

53.5.2.12 System Console Activity.

The system console activity file provides information on any actions that originate from or are viewed at the system console. Typically, the system console includes not only logon and logoff records but also special requests, such as printer form mounts, specific tape or cartridge mounts, comments sent to the console by batch jobs, and free-form communications from users. The console file records all such activity as well as every command from the system operator, and the system responses to those commands. These records provide an excellent tool for investigators tracking down the specific events in a computer incident.

53.5.2.13 Network Activity.

Network activity files provide valuable information on activity taking place on the network. Depending on the sophistication and settings of the system an administrator is using, the information derived can be plentiful or scant. Specific devices may generate their own records on network activity; for example, routers, gateways, and firewalls may all keep their own log files. However, typically these are circular files in which records to be entered after the file is full are shifted to the start of the file, where they overwrite the oldest records. In forensic work, it is essential to capture such data before the information of interest is obliterated. Unfortunately, in many systems, the volume of network activity is so high that log files contain only the most recent minutes of traffic.

53.5.2.14 Resource Utilization.

A review of the resource utilization log records will show all of the system's resources and the level of utilization for each. By monitoring this file, administrators frequently make important decisions regarding modifying system configuration or expanding the system.

53.5.2.15 Central Processing Unit.

The CPU file shows the capacity and usage of the central processing unit for whichever system is being used and monitored. Based on this information, administrators can monitor when usage is heaviest and the CPU is most stressed, and can decide on utilization rules and requirements as well as on possible CPU upgrades. As in all log file analyses, any outliers (unusual values) and any unexpected change in usage can be investigated. Global CPU utilization records can be compared with the sum of CPU usage collected from process termination records; discrepancies may indicate stealth operation of unauthorized processes, such as malicious software.
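
A hedged Python sketch of that comparison follows; the sample figures and the 5 percent tolerance are assumptions chosen for illustration, not recommended thresholds.

    # Compare global CPU time with the CPU time attributed in process termination records.
    # All figures and the tolerance are illustrative assumptions.
    global_cpu_seconds = 86_400 * 0.62                  # e.g., 62 percent of one CPU-day
    process_termination_records = [
        {"pid": 101, "cpu_seconds": 20_000},
        {"pid": 214, "cpu_seconds": 18_500},
        {"pid": 377, "cpu_seconds": 12_000},
    ]

    attributed = sum(r["cpu_seconds"] for r in process_termination_records)
    unaccounted = global_cpu_seconds - attributed
    if unaccounted > 0.05 * global_cpu_seconds:         # 5 percent tolerance (assumption)
        print(f"Investigate: {unaccounted:,.0f} CPU-seconds not attributed to any logged process")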

53.5.2.16 Disk Space.

The log records for disk space show the amount of disk space originally available on a system, the amount of disk space used (and generally what type of files it is being used for), and the amount of disk space that remains free and available. Comparison of total disk space utilization with the total space allocated to all files can reveal problems such as lost disk space (i.e., space allocated to files that were never closed properly; such space is unusable by the file system because there are no pointers indicating that the disk sectors are actually supposed to be free). Such unallocated sectors or clusters also may be where malefactors hide data they have stored without authorization by bypassing the file system.
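
The comparison can be approximated with standard file system calls, as in the hedged Python sketch below. It ignores file system metadata, sparse files, and unreadable directories, so a modest gap is normal; only a large or growing discrepancy merits investigation.

    # Roughly compare space used on a volume with the space occupied by visible files.
    # Metadata, sparse files, and permission errors make this approximate by nature.
    import os
    import shutil

    def visible_file_bytes(root):
        total = 0
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    pass                                    # skip unreadable or transient files
        return total

    usage = shutil.disk_usage("/")                          # total, used, free for the volume
    gap_gib = (usage.used - visible_file_bytes("/")) / 2**30
    print(f"Space used but not attributable to visible files: {gap_gib:.1f} GiB")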

53.5.2.17 Memory Consumption.

Important information on the amount of memory in a system and the amount actually being used can be obtained from the memory consumption log records. Such records typically also include information on virtual memory usage. Therefore, they can provide warning of thrashing conditions, where memory segments are being copied to disk and read back from disk too often. Details of memory consumption may be useful in tracking down unauthorized processes such as worms.

53.5.2.18 System Level versus Job Level.

Different levels of monitoring, logging, and auditing may take place. Some of this may be preprogrammed, but in most environments it is possible to select the log records and related reports that are desired. For instance, a regular system level audit log may be produced at intervals, just to provide information that the system is up and running without any extraordinary problems. But an in-depth job level audit may take place more frequently, in order to monitor the specific job and ensure that it is running correctly and providing the information needed.

53.5.3 Automation and Resource Allocation.

The importance and value of log management cannot be overemphasized. Utilizing secure, well-organized log management systems to monitor and respond to events can lower the administrative effort required to manage an infrastructure. With the growing demands of compliance and costs associated with data retention, it is important to include log management within the scope of the enterprise data retention policy. E-discovery laws and postmortem attack analysis rely on efficient access to accurate log information. Logs require system resources, ranging from media storage to CPU cycles and memory utilization. Tradeoffs must be considered when setting aside potential business-generating resources for the purpose of infrastructure management.

Manually reviewing millions of log files on various systems, in different time zones and inconsistent formats, is unrealistic for staff. For geographically dispersed sites, real-time log transfers may be delayed due to limited bandwidth available via slow links. Not all logged information is of equal value. Some data records may tie directly to compliance issues, while others provide noncritical benefits to daily operations. Based on policy, management must determine what types of data must be aggregated and reduced in order to monitor and to ensure ongoing compliance. This is part of the scope definition process of M&C systems. Planners and implementers must not only scope current requirements but must ensure that the log management plan will scale to address future regulations and internal security requirements.

53.5.4 Log Record Security.

An organization must protect its log records from unauthorized access and modification. The four most common methods of log record security are: access control lists, checksums, encryption, and digital signatures. Combining these methods, based on the system requirements, can adequately protect most log data. Attackers with unfettered access to unencrypted e-mail log files could download copies of the files and search them for sensitive information as well as modify or delete message content directly from the target server. This type of attack threatens every element of the Parkerian hexad.16 In addition to the common concerns of confidentiality breaches and loss of data integrity, unauthorized access to log files may provide an attacker with additional information on the network environment. Vulnerability scanners may run at scheduled intervals and log detected issues to a central database. Leveraging aggregated data, an attacker can more efficiently compromise a network environment and evade detection routines accustomed to high-footprint reconnaissance.

Operating system and application-level access control lists (ACLs) restrict access to files. Some log files may be written to write-once, read-many (WORM) media for auditors or legal staff. Operations staff may have no legitimate need to access such data. However, they may need to ensure that the communication flow operates continually. Databases can implement even more granular controls on specific records and fields, to avoid unnecessary information disclosure. Physical access controls over the media while in storage or in transit must also be considered when planning an M&C system for log data. Properly implemented ACLs can restrict log data access to authorized personnel and system processes.

Organizations subject to PCI DSS are required to implement integrity controls on log data. Historically, 32-bit cyclic redundancy checks (CRCs) generated checksums to guard against file alteration. Today, stronger cryptographic checksums, such as those generated by the SHA-2 family, provide a more reasonable level of integrity assurance.17 Checksums do not require constant human interaction. For tamper protection, a monitoring system may baseline a system and store copies of files or their hash equivalents for periodic or real-time file comparison. Changes to logs or other critical files could then be centrally reported for manual investigation. A control system could automatically quarantine the unauthorized file and restore a copy of the original file.
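
A minimal Python sketch of such a baseline-and-compare routine follows; the directory, file pattern, and baseline file name are hypothetical. In practice, the baseline itself must be stored where an intruder who can alter the logs cannot also alter the baseline.

    # Baseline SHA-256 digests of log files, then re-hash and report anything that changed.
    # Directory, file pattern, and baseline file name are hypothetical.
    import hashlib
    import json
    from pathlib import Path

    def sha256_of(path, chunk_size=65536):
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def take_baseline(log_dir="/var/log", baseline_file="log_baseline.json"):
        digests = {str(p): sha256_of(p) for p in Path(log_dir).glob("*.log")}
        Path(baseline_file).write_text(json.dumps(digests, indent=2))

    def verify_baseline(baseline_file="log_baseline.json"):
        for path, old_digest in json.loads(Path(baseline_file).read_text()).items():
            if sha256_of(path) != old_digest:
                print(f"ALERT: {path} has changed since the baseline was taken")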

Digital signatures use public key cryptography to provide data integrity and non-repudiation. These build on standard cryptographic checksums because they not only confirm file integrity but also identify who or what entity is providing that assurance. This becomes significant when controlling software distribution. Many organizations download third-party software updates over the Internet. To ensure package integrity, vendors can digitally sign the files to ensure that the recipient received an unaltered software update. For organizations that must submit reports to external recipients, such as government offices, digital signatures can decrease the paper trail and provide a more efficient, digital means of meeting regulatory requirements.
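
The Python sketch below signs a log record set with Ed25519 and verifies it later; any alteration makes verification fail. It assumes the third-party cryptography package is installed, and the sample log contents are invented for illustration.

    # Sign log data and verify it later; any alteration makes verification fail.
    # Assumes the third-party "cryptography" package; the sample record is invented.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric import ed25519

    log_bytes = b"2008-06-01T02:14:07Z host=web01 event=logon_failure user=root\n"

    private_key = ed25519.Ed25519PrivateKey.generate()
    public_key = private_key.public_key()
    signature = private_key.sign(log_bytes)

    try:
        public_key.verify(signature, log_bytes)          # raises InvalidSignature if altered
        print("Signature valid: contents unaltered and signer confirmed")
    except InvalidSignature:
        print("ALERT: log contents or signature have been altered")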

Encryption must be considered for log data in transit and at rest. Encrypting entire log files reduces the ability to view or alter the contents. It also increases resource usage on the target system during the encryption process. As with checksums, the keys required for decryption should be stored in a secure location, to increase the difficulty of defeating the control. Utilizing whole-disk encryption and then transferring a log file over a network in plaintext is not a consistent method of log security. Establishing secure network links, such as implementing IPSec, provides end-to-end link security. A combination of network and system-level encryption may be necessary to secure log files and other sensitive information properly.
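
As a hedged illustration of encryption at rest, the Python sketch below uses Fernet, an authenticated symmetric scheme from the third-party cryptography package. The file names and sample record are hypothetical, and the key handling is deliberately simplified; in practice the key belongs in a protected key store, never beside the data it protects.

    # Encrypt a log file at rest and decrypt it for an authorized reviewer.
    # Assumes the third-party "cryptography" package; file names are hypothetical.
    from cryptography.fernet import Fernet

    # Create a tiny sample log file so the sketch is self-contained.
    with open("mail_server.log", "wb") as sample:
        sample.write(b"2008-06-01T02:14:07Z relay=mx01 action=deliver rcpt=jdoe\n")

    key = Fernet.generate_key()                 # in practice, fetched from a protected key store
    fernet = Fernet(key)

    with open("mail_server.log", "rb") as source:
        ciphertext = fernet.encrypt(source.read())
    with open("mail_server.log.enc", "wb") as target:
        target.write(ciphertext)

    # Later, an authorized reviewer holding the key recovers the original records.
    with open("mail_server.log.enc", "rb") as source:
        recovered = fernet.decrypt(source.read())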

The chain of custody should also be considered when planning a log management system. Some log data may need to be used in court. Without a verifiable audit trail, the contents may be inadmissible. A documented chain of custody for media transfers between staff and sites can validate physical controls. Implementing cipher block chaining (CBC) can further protect against unauthorized changes to log data. The previous log file ciphertext would XOR with the current plaintext to create a new value for encryption operations. This concept can also be useful for record sets within log files and database applications, as it will identify record deletions and other data modifications. See Chapter 7 in this Handbook for more information on cryptography.
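
The XOR-and-encrypt chaining described above can be shown in a few lines. The Python sketch below uses fixed 16-byte records with hypothetical contents and AES from the third-party cryptography package; each ciphertext depends on the one before it, so deleting or altering an earlier record invalidates everything that follows.

    # Illustrate CBC-style chaining over fixed-length log records: C[i] = E_k(P[i] XOR C[i-1]).
    # Records are exactly 16 bytes here for simplicity; contents are hypothetical.
    # ECB is used only as the raw per-block cipher inside the hand-built chain.
    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key = os.urandom(32)
    initialization_vector = os.urandom(16)

    def xor_blocks(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def encrypt_block(block):
        encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
        return encryptor.update(block) + encryptor.finalize()

    records = [b"REC0001 user=abc", b"REC0002 user=def", b"REC0003 user=ghi"]
    previous = initialization_vector
    chained_ciphertexts = []
    for record in records:
        ciphertext = encrypt_block(xor_blocks(record, previous))
        chained_ciphertexts.append(ciphertext)
        previous = ciphertext            # the next record is chained to this ciphertext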

53.6 DATA AGGREGATION AND REDUCTION.

53.6.1 Centralized Data Stores.

Systems generate logs in a variety of formats that are not always easy to access. The first phase in data aggregation is ensuring that logging is working properly on the target system. The second phase, gradually connecting logging points to a centralized data store, leverages existing logging mechanisms to create an enterprise view of events. Centralized data storage provides a single point of analysis and audit, without the time-consuming process of manually interrogating individual systems throughout the environment. Log data must be communicated in a secure manner. An agent-based system needs to establish a secure communication channel with the primary collector, to avoid distributing log details to an unauthorized collector. This security measure may be built into the aggregating application, or it may utilize IPSec over controlled network links.
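
A hedged sketch of an agent's forwarding path appears below: records travel to the collector over TLS, and certificate verification against an internal CA is what keeps the agent from handing data to an impostor. The host name, CA file, and sample record are placeholders; port 6514 is the registered port for syslog over TLS.

    # Forward a log record to the central collector over TLS, trusting only the internal CA.
    # Host name, CA file, and the sample record are placeholders for illustration.
    import socket
    import ssl

    COLLECTOR_HOST = "logs.example.internal"
    COLLECTOR_PORT = 6514                                   # registered syslog-over-TLS port
    context = ssl.create_default_context(cafile="internal-ca.pem")

    def forward(record: str) -> None:
        with socket.create_connection((COLLECTOR_HOST, COLLECTOR_PORT)) as raw_socket:
            with context.wrap_socket(raw_socket, server_hostname=COLLECTOR_HOST) as tls_socket:
                tls_socket.sendall(record.encode("utf-8") + b"\n")

    # Example call (the collector address above is a placeholder):
    # forward("2008-06-01T02:14:07Z host=web01 event=logon_failure user=root")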

Processing time is a major problem with large-volume data stores, creating a conflict between complete log processing and incident response time. Depending on the resources available, it may be more economically advantageous to store secured log data in compressed archives while using agent-based monitoring only to report critical security issues or operational problems to central storage. The time and effort to log, parse, and report on every data block received by the central collector may exceed internal security policy requirements and the perceived cost benefit. Filtering out noncritical events and alerts would then be part of the baselining process for the central M&C system. Storing and analyzing entire log files may serve as a secondary (slower) monitoring mechanism, while security-specific alerts are handled on an immediate basis.

53.6.2 Filtered Queries.

The volume of log file records can be overwhelming and generally impossible for operations staff to review individually and on a regular basis. Centrally storing the data does not change the infeasibility of the task; it does, however, lend itself to improved data reduction. Transforming raw data into a meaningful, reduced form can minimize the noise and enable staff and automated processes to target higher-priority activities. Running filtered queries against log records can provide more immediate results on potential attacks or system issues. A scheduled query may list all failed logon attempts. Additional filtering may drill down to a set of computing assets (e.g., servers processing credit card data) or user types such as root-level accounts. Another, more specific query may list only network traffic anomalies between the R&D Department and an array of Internet proxy servers.
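
The Python sketch below runs that kind of filtered query against a central store. SQLite stands in for the real database, and the table layout, column names, and sample record are hypothetical.

    # Run a filtered query for recent failed root logons against a central log store.
    # SQLite stands in for the real store; schema and sample data are hypothetical.
    import sqlite3

    store = sqlite3.connect(":memory:")
    store.execute("""CREATE TABLE auth_events
                     (event_time TEXT, host TEXT, account TEXT, outcome TEXT, source_ip TEXT)""")
    store.execute("""INSERT INTO auth_events
                     VALUES (datetime('now'), 'web01', 'root', 'failure', '10.0.9.4')""")

    rows = store.execute(
        """SELECT event_time, host, source_ip
           FROM auth_events
           WHERE outcome = 'failure' AND account = 'root'
             AND event_time >= datetime('now', '-1 day')
           ORDER BY event_time"""
    ).fetchall()

    for event_time, host, source_ip in rows:
        print(f"{event_time}  {host}  failed root logon from {source_ip}")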

53.6.3 Analyzing Log Records.

Log records can be valuable to the system administrator, but only if they are put to use. Properly using a log file requires monitoring and reviewing its records and then taking actions based on the findings. Such analysis typically uses utilities provided by the supplier of the operating system or by third parties, such as commercial software companies or freeware distributors.

53.6.3.1 Volume Considerations.

Someone must make a decision regarding how big log files should be. Such a decision is generally made by the system administrator and is based on what is considered useful and manageable. If money is involved—for instance, if additional storage capacity is needed—the chief technology officer (CTO), chief information officer (CIO), chief operating officer (COO), system auditor, and a variety of other staff members might be called on to review the activities and volumes of the logs and to participate in any decisions regarding expenditures for new equipment. However, in the three decades preceding this writing in 2008, disk space costs have fallen about 40 percent per year compounded annually—from approximately $488,000 per gigabyte to approximately $0.29 per gigabyte in constant dollars, so disk space is not much of an issue anymore.18 More important is that log files, like any file, may be left in an inconsistent state if the system crashes; closing a log file and opening a new one in the series is a prophylactic measure that ensures that less information will be lost in disk I/O buffers if there is a system crash.
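
The compounded rate is easy to verify from the figures cited in note 18, as the short Python calculation below shows.

    # Reproduce the compounded annual change from the figures cited in note 18.
    cost_per_gb_1980 = 488_000          # approximate cost per GB in 1980, in 2008 dollars
    cost_per_gb_2008 = 0.29
    years = 28

    annual_factor = (cost_per_gb_2008 / cost_per_gb_1980) ** (1 / years)
    print(f"Annual factor: {annual_factor:.2f}")            # ~0.60
    print(f"Annual decline: {1 - annual_factor:.0%}")       # ~40%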

53.6.3.2 Archiving Log Files.

Archiving is a very important part of making and using log files. Every company needs to decide how long to keep its log files and where to keep them. Decisions of this nature sometimes are made based on space considerations, but more frequently are made based on legal requirements within specific industries. Some companies may decide to keep materials for a year or two, while others may be mandated by law to retain archived log files for seven or eight years.

The decisions regarding the archiving of log files should never be made arbitrarily; in every instance a careful review and check of company, legal, and industry requirements should be conducted. Based on the findings of that review, a written policy and procedures should be produced. Every company needs to tell its employees what these policies and procedures are, so they can be aware of what materials are being kept, where they can be found, and how they can be accessed.

53.6.3.3 Platform-Specific Programs for Log File Analysis.

Many operating systems are in existence, and although there are certain similarities in how they function, there are also differences. Special training on each specific operating system may be necessary in order to do a correct analysis of the platform-specific log files it generates.

53.6.3.4 Exception Reports.

The volume of log file records can be overwhelming. For example, a multiuser system with 10,000 users per day (e.g., a large automated banking application) could generate millions of records of the logons, logoffs, file openings, file closings, and record changes alone. Exception reports allow the analyst to focus on specific characteristics (e.g., withdrawals of more than $200 from ATMs at a specific branch between 2:00 A.M. and 3:00 A.M. last Tuesday night) or on statistical characteristics, such as the largest withdrawals in the network at any time. Such tools greatly simplify detection and analysis of anomalies and are particularly helpful in tracking down criminal behavior or other unauthorized activities.
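
A minimal Python sketch of the ATM example follows; the withdrawal records, branch names, time window, and dollar threshold are all invented for illustration.

    # Exception report: withdrawals over $200 at one branch between 2:00 and 3:00 A.M.
    # All records, names, and thresholds are invented for illustration.
    from datetime import datetime

    withdrawals = [
        {"time": datetime(2008, 6, 3, 2, 17), "branch": "Elm Street", "amount": 260},
        {"time": datetime(2008, 6, 3, 14, 5), "branch": "Elm Street", "amount": 500},
        {"time": datetime(2008, 6, 3, 2, 41), "branch": "Main Street", "amount": 220},
    ]

    exceptions = [
        w for w in withdrawals
        if w["branch"] == "Elm Street" and w["amount"] > 200 and 2 <= w["time"].hour < 3
    ]
    for w in exceptions:
        print(f'{w["time"]:%Y-%m-%d %H:%M}  {w["branch"]}  ${w["amount"]}')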

53.6.3.5 Artificial Intelligence.

Artificial intelligence (AI) programs can check user profiles for unusual and suspicious activities and generate an alert and a report whenever such activities are noticed. Because the intelligence and its parameters are preprogrammed, operators do not have to launch and monitor the program manually; the system launches it on a regular basis and collects the data, and anything of an unusual nature causes an alert to be generated.

53.6.3.6 Chargeback Systems.

Chargeback systems, such as internal billing services and those used by external service bureaus, prove valuable in monitoring and tracing system activities through their logs. In some states and cities, legislation requires that these types of services keep accurate logs for audit purposes. The original records are used not only for tracking but also in court cases as evidence. In most situations, both internal and external services are the secondary “proof” of transactions, while the initial system logs provide the primary proof.

The log file records provide the basis for sophisticated billing algorithms that encourage rational use of resources. In addition, because unexpected changes in expenses can stimulate interest by accounting staff and managers, billing systems sometimes can help identify unauthorized usage, such as online gambling, excessive downloads from the Internet, and installation of unauthorized code for processing, using idle time on a computer.

53.6.4 Dashboards.

Dashboards consolidate information into a relevant and easily understood summary. Text, visual, and auditory elements may be employed, depending on management needs and system capability. Dashboards build on Web-based technology with a database backend to present a high-level overview of operational status at near-real-time refresh rates. Charts, graphs, LED-style indicators, system visualizations, and alert banners enhance the dashboard experience when used in moderation. Employees typically reference dashboards from a standard Web browser or on a large video screen mounted in a common area.

Department- and management-specific dashboards should also be considered when implementing specialty M&C systems. Depending on the sensitivity of the information, rotating the dashboards on common screens or within the Web portal itself may provide an opportunity to increase employee awareness of company-wide and department-specific performance. Associates may learn the current level of business risk, the current operational performance (e.g., system uptime, product generated, or sales volume), or even the current state of cybersecurity as reported by Internet security sources.19

Rather than implement dashboards, some organizations continue to rely on a myriad of system consoles to monitor and control specific elements within an infrastructure. As a result, they miss out on the benefits of quickly understanding the state of the environment by leveraging enterprise data reduction tools through a ubiquitous Web interface. System management consoles often cater more to technical operations staff than to upper management. Some vendors attempt to provide all-in-one consoles for various system management tasks. However, they generally do not provide a broad, holistic view of the business environment. As a result, system management consoles and dashboards will likely continue to coexist for the foreseeable future.

53.7 NOTIFICATIONS AND REPORTING.

53.7.1 Alerts.

Log management must include layers of automation to reduce the time required to review and respond to events in the environment. How an M&C system deals with alerts will directly impact the security of the target system. Overwhelming operators with too many alerts will result in slower problem-response time as well as the potential to ignore critical issues. The purpose of alerts is to improve responsiveness to major issues, such as a leaking pipeline or multisite network outage. Alerts may be configured at specific set points, so that a responder will be notified only when the attack threshold is reached. Elements that are controlled in the environment may have a higher threshold for alert notification. A sudden influx of malware from the Internet may be immediately detected and filtered at the network edge. Notifying operations staff of a large volume of malware may be useful. However, spamming security administrators with alerts on every blocked attack would be an inefficient use of alert notification resources.
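
A hedged Python sketch of set-point alerting appears below; the window size, threshold, and notification stub are assumptions chosen for illustration rather than recommended values.

    # Count blocked attacks in a sliding window and alert only when the set point is crossed.
    # Window size, set point, and the notification stub are illustrative assumptions.
    from collections import deque
    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=5)
    SET_POINT = 500                        # blocked attacks per window before anyone is paged
    events = deque()

    def notify(message):                   # stand-in for e-mail, SMS, or a dashboard banner
        print(f"ALERT: {message}")

    def record_blocked_attack(now=None):
        now = now or datetime.now()
        events.append(now)
        while events and now - events[0] > WINDOW:
            events.popleft()               # drop events that have aged out of the window
        if len(events) == SET_POINT:       # fire once as the threshold is crossed, not per packet
            notify(f"{SET_POINT} attacks blocked within {WINDOW}; possible sustained attack")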

Additionally, the means of alert notification play an important role in business continuity and disaster recovery. Out-of-band monitoring systems may detect a network equipment error that could not otherwise be detected by the failing device itself. For alert recipients, multiple notification means may be necessary. E-mail, pager, cell phone, and SMS messaging are the most common forms. Alert visualization is another valuable means for operators actively monitoring the environment. Similar to the dashboard concept, visualization can reduce data collected from sensors into a visual map, with graphical indicators to identify system issues. A combination of dashboard-style alert monitoring and targeted message delivery may provide the best coverage for issues of high importance.

In industrial environments, operators utilize HMI tools to maintain situational awareness with field equipment. HMI uses virtual buttons and other visual elements to simplify industrial functions. If alerts are frequently generated for noncritical activities, staff will clear the alert list and ignore them over time. Ignoring alerts for critical issues can directly impact production and lead to costly repairs. In this situation, a management dashboard could display performance statistics that may trigger a field investigation into the root cause of declines in production performance.

53.7.2 Trend Analysis and Reporting.

Data aggregation, reduction, and correlation lay the foundation for trend analysis. Data related to compliance requirements can be analyzed to determine the pace of security improvements within the organization. Further analysis may identify the level of consistency of an internal security program as well as the true drivers of audit compliance. If system vulnerabilities and business risk continually peak in between scheduled external audits, management will need to investigate the root cause of this unstable, cyclic behavior.

Chargeback refers to the ability to charge for the use of IT services. Some organizations include these costs in the overhead of doing business rather than billing individuals and departments. Regardless of the financial preference, enabling chargeback monitoring in an environment provides an understanding of how information resources are used. If the Marketing Department utilization has grown from 40 to 70 percent of the Internet bandwidth, and the Help Desk use has expanded from 30 to 50 percent of the grid computing resources, the trends may encourage management to realign resources to meet business demand or investigate resource abuse. Monitoring resource utilization and demand can also benefit development and testing cycles, so that production resources are properly allocated.

Exception reports focus on specific data anomalies, such as employee expense reports that suddenly exceed $3,000. Managers may want to employ continuous monitoring of the financial system to evaluate exception report trends. In addition to misuse and fraud, exception reports can identify system problems. Generating exception reports for systems that regularly operate below acceptable performance levels may point to a failing component. Management at an industrial site that regularly fails to account for missing product may uncover procedural issues that result in wasted material, or defective equipment that must be repaired to resolve the ongoing product loss. Exception reports may also identify an isolated issue or a widespread problem. If change management recorded the replacement of hardware on five pipeline segments and they all started leaking at the same time, aggregated log data and exception reports would support the need for hardware and procedural reviews. Similarly, an exception report based on sensor data collected in November 2002 along the Denali Fault20 may correlate with an environmental issue (e.g., an earthquake) that would explain the cause of an equipment malfunction.

53.8 MONITORING AND CONTROL CHALLENGES.

53.8.1 Overview.

Legacy systems do not always possess sufficient M&C for today's business and regulatory requirements. The art of retrofitting security into a legacy system requires overcoming physical, technical, and psychosocial barriers. Physical barriers range from wired ICS installations to wireless mobile communications. Technical issues include learning to detect and monitor unique communication protocols and managing systems that historically operated in isolated locations with little or no security oversight. Psychosocial challenges are often the most difficult. Organizations and business sectors unaccustomed to mature M&C systems typically deny the need for such systems. Rationalization attempts generally include denial that an attacker would ever target their systems, claims that the business is running well, and concerns that additional changes or complexity will likely disrupt production. Security awareness training can address these misperceptions, while pointing to the lack of M&C data available to support any claims that operations is running as efficiently and securely as it should.

53.8.2 Industrial Control Systems.

Industrial control systems (ICSs) operate in a number of industries, including oil, water, and power. ICSs are generally divided into distributed control systems (DCSs) and supervisory control and data acquisition (SCADA). DCSs are more commonly associated with processes such as oil refining. SCADA systems involve more human interaction via HMI, and they collect data from a number of remote devices. Some systems contain elements of both DCS and SCADA. Security issues have increased, due to network connectivity and perceived risk to the critical infrastructure of the United States. These systems are expected to continue to integrate on a global scale, with improved, centralized data aggregation and reporting. Management will find value not only in aggregating their own organization's data but in collaborating in industry forums to implement more stable, interoperable, and secure industrial M&C systems. Like mainframes, SCADA systems no longer operate in an isolated environment. Technicians dial in with a modem or through a virtual private network (VPN) tunnel over the Internet to manage a PLC remotely. Remote sites often communicate via leased lines, satellite, and dial-up modems. Some industrial sites utilize Zigbee wireless equipment rather than installing standard cabling to monitor and control remote devices.

M&C challenges for these systems include authentication, network security, configuration protection, and physical security. Historically, authentication has not been required to use a PC to communicate directly with a PLC over a SCADA network. Some vendors now attempt to build authentication into the HMI. However, the network protocols used in SCADA (e.g., MODBUS/TCP, DNP3) can still send unsolicited messages to the PLC to execute arbitrary commands. Configuration protection is typically associated with the running program on a PLC, which may have a weak or nonexistent password. From the operator standpoint, HMI usually runs on an unpatched PC with application-level security controls. These controls can be bypassed by attacking the improperly secured, underlying operating system. Physical security remains a challenge, as equipment may not be located in a secured area, or operators may lack adequate security awareness training to implement proper security procedures. Fortunately, these M&C systems may leverage existing information security resources to mitigate risk in a layered architecture. NIST Special Publication 800-82 clearly describes the security landscape surrounding these systems. Network firewalls, intrusion prevention systems, and more robust physical security would all contribute to a more secure SCADA environment. Vulnerability detection tools now target SCADA protocols, with the goal of identifying weaknesses and mitigating risk. Developing more secure communication protocols, and establishing configuration protection guidelines, would further mitigate the authentication weaknesses in SCADA deployments.
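
Where the protocol itself cannot authenticate, passive monitoring can at least flag writes from unexpected sources. The hedged Python sketch below parses the MODBUS/TCP (MBAP) header of a captured frame and alerts on write function codes arriving from hosts outside an approved list; the addresses and the sample frame are hypothetical.

    # Flag MODBUS/TCP write commands (function codes 5, 6, 15, 16) from unapproved hosts.
    # Approved addresses and the sample frame are hypothetical.
    import struct

    APPROVED_MASTERS = {"10.10.1.5"}                 # the sanctioned HMI/engineering station
    WRITE_FUNCTION_CODES = {5, 6, 15, 16}            # write coil(s) and register(s)

    def inspect(source_ip, payload):
        # MBAP header: transaction ID (2), protocol ID (2), length (2), unit ID (1), then function code (1).
        if len(payload) < 8:
            return
        _txn, _proto, _length, _unit, function_code = struct.unpack(">HHHBB", payload[:8])
        if function_code in WRITE_FUNCTION_CODES and source_ip not in APPROVED_MASTERS:
            print(f"ALERT: write command (function {function_code}) to PLC from unapproved host {source_ip}")

    # Example: a captured "write single coil" request from an unapproved address.
    inspect("192.168.7.22", struct.pack(">HHHBB", 1, 0, 6, 1, 5) + b"\x00\x01\xff\x00")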

53.8.3 Mobile Computing.

As we continue to integrate mobile and distributed computing systems, the complexity and difficulty required to secure systems increases. We have more variables to address. Information flows even further from the traditional network perimeter. With advances in deperimeterization, M&C systems must ensure that data remains secure while in transit and at rest, as determined by organizational policy. Mobile technologies communicate over wireless links, invisible to traditional M&C systems. Some wireless devices require centralized system management but rarely connect to the corporate network. Mobile encryption and mandatory VPN connectivity help to control data storage and transmission. With centralized policies, these systems ensure consistent data communications throughout an organization.

Radio-frequency identification (RFID) monitoring systems wirelessly identify and collect information from the target system. NIST Special Publication 800-98 describes the technological benefits and risks related to implementing this technology. M&C systems often collect sensitive information. The privacy issues inherent in RFID will require thorough risk assessment, to avoid potential data loss. Tagging all of an organization's employees with RFID badges that contain PII subject to HIPAA regulations would be unwise, as the RFID tags could potentially be read from over one meter away, without the holder's knowledge. From an inventory perspective, RFID tags provide a means to automate site audits of company assets or to collect asset information without manually writing it down. To reduce the risk of compromise, the tag standards include integrity checks (CRCs) and require a password before a tag can be rendered nonresponsive.

53.8.4 Virtualization.

Organizations use a variety of computing equipment to meet their business needs. Some equipment may remain underutilized, while critical database servers require additional disk space and working memory. Virtualization provides the ability to slice and pool computing resources into more efficient, targeted, and often portable business resources. Virtual systems generally consist of physical hardware, a virtualization interface (VI), and a virtual machine (VM). The VM may be an application or an entire operating system. Since the VI oversees the VMs, we must implement appropriate M&C at the VI layer as well as within and around the VMs.

Users of a virtualized operating system may not be able to detect that it is virtualized. For example, a Web hosting company may wish to provide dedicated virtual servers to its customers without the additional expense of dedicated hardware. In this situation, the organization could deploy FreeBSD servers and implement a degree of virtualization using “jails”21 to provide a high level of compartmentalized, system-level access to customers. An operator with root access to a jailed environment should not see the processes executed by other jails, nor the underlying system hosting the jails. However, the host system owner may monitor or control elements of the host system as well as those within jailed environments. Additional monitoring of the host system may also be necessary, to detect jailbreak attempts.

Another type of virtualization, paravirtualization, utilizes a hypervisor as the VI between the hardware and the VM. Rather than create identical jailed environments, a hypervisor can be used to support more than one type of guest VM, such as Ubuntu, Fedora, and PC-BSD. This configuration is possible because the hypervisor handles all communications (hypercalls) between the hardware and the VMs. Along the same lines as a FreeBSD host with jails, the hypervisor can monitor and control activities within and around the VMs with varying degrees of transparency to the VM user. The capability of stealth VM monitoring makes virtualization a useful monitoring tool for honeypot research. For a greater level of system assurance, organizations will need tools to detect and prevent attacks against the hypervisor.

Portability, an advantage of virtualization, poses a challenge for M&C systems. VMs can migrate from one hardware device to another frequently, while the integrity of an off-line VM is unknown. In a managed, virtualized infrastructure, performance degradation issues detected in the VI may trigger VMs to migrate from one physical server to a less utilized resource. An unstable VM may also be manually forced off-line and rapidly redeployed using a backup virtual disk, without disrupting other VMs sharing the same physical hardware. Tracking these moving targets requires reliable access to the virtual pool of resources, documentation mapping virtual to physical resources, as well as a clearly defined policy for virtual system provisioning. For example, an organization with policies that require clear separations between confidential and public systems should prohibit the same hypervisors from managing both payroll applications and public kiosk VMs on the same hosts.

53.9 SUMMARY.

Monitoring and control systems protect our business environments from undesirable events and operational inefficiency. We may never control every element of our information systems. However, we must be able to identify what needs to be monitored and what must be controlled. Regulations and internal policies define those needs. Frameworks, such as COBIT, provide a means to organize controls into specific objectives. The selection of an appropriate system model is based on scope and the business requirements. Human interaction plays a critical role in many legacy systems. However, to meet the growing need for reporting and continuous assurance, organizations must integrate automation into M&C systems. Even with automation, unique challenges surface in industrial controls, mobile computing, virtual systems, grid computing, and other network environments. In these situations, we must return to the fundamental questions of M&C:

  • What are the compliance and internal policy requirements?
  • What systems, components, or processes need to be monitored or controlled?
  • What model will be used to monitor or control the target?
  • Where will the data be stored?
  • How will the data records be secured?
  • What kind of data reduction and reporting is required?
  • Who or what will respond to detected issues?
  • What measures exist to protect M&C systems from compromise?

Monitoring, control, and auditing interrogate information systems, identify problems, initiate corrective measures, and generate meaningful information on the state of our infrastructure. The systems may involve human interaction or function in an automated, closed-loop format. Systems generate log files based on system activities, store them in secure locations, and enable aggregators to collect and reduce the data into actionable information. Although log aggregation improves the analysis process, the resources required to store and manipulate the data can become problematic. Not all events are useful. In order to manage reasonable data volumes, only specific activities should be logged or collected by the central management system. Log file data provides a means to analyze events historically at any measured point in time—a necessary part of infrastructure management. Organizations need to understand the value of assets, align business with regulatory requirements, and find new ways to leverage M&C to improve operational performance.

53.10 REFERENCES

Control Objectives for Information and Related Technology (COBIT): www.isaca.org/cobit.

ISO/IEC 17799:2005: www.iso.org/iso/en/prods-services/popstds/informationsecurity.html.

Karygiannis, T., B. Eydt, G. Barber, L. Bunn, and T. Phillips. NIST Special Publication 800-98. Guidelines for Securing Radio Frequency Identification (RFID) Systems. U.S. Department of Commerce, April 2007, http://csrc.nist.gov/publications/nistpubs/800-98/SP800-98_RFID-2007.pdf.

Payment Card Industry Data Security Standard (PCI DSS): www.pcisecuritystandards.org.

Sarbanes-Oxley Act of 2002. 15 U.S.C. 7201. Public Law 107-204. 107th Congress; www.sec.gov/about/laws/soa2002.pdf.

Stouffer, K., J. Falco, and K. Kent. NIST Special Publication 800-82. Guide to Supervisory Control and Data Acquisition (SCADA) and Industrial Control Systems Security. U.S. Department of Commerce, September 2006, http://csrc.nist.gov/publications/drafts/800-82/2nd-Draft-SP800-82-clean.pdf.

53.11 NOTES

1. For more information on intrusion detection systems and unified threat management appliances, see Chapter 27 in this Handbook.

2. For more information about DoS attacks, see Chapter 18 in this Handbook.

3. For more information on incident response, see Chapter 56 in this Handbook.

4. Legislation on Notice of Security Breaches. National Conference of State Legislatures, www.ncsl.org/programs/lis/cip/priv/breach.htm.

5. For more information on physical and facilities security, see Chapters 22 and 23 in this Handbook.

6. See Chapter 28 in this Handbook for discussion of identification and authentication.

7. For more information on software development and quality assurance, see Chapter 39 in this Handbook; for information on managing software patches, see Chapter 40.

8. U.S. Department of Navy, “The Navy Unmanned Undersea Vehicle (UUV) Master Plan,” November 9, 2004, www.navy.mil/navydata/technology/uuvmp.pdf.

9. Department of Defense Directive No. 8100.1, “Global Information Grid (GIG) Overarching Policy,” September 19, 2002, www.defenselink.mil/cio-nii/docs/d81001p.pdf.

10. Federal Energy Regulatory Commission, “Existing LNG Terminals,” www.ferc.gov/industries/lng/indus-act/terminals/exist-term/everett.asp.

11. Consortium for Electric Reliability Technology Solutions, “Demand-Response, Spinning-Reserve Demonstration—A New Tool to Reduce the Likelihood of Outages,” News release, July 26, 2006, http://certs.lbl.gov/press/press-7-26-06.html.

12. Zigbee Alliance: www.zigbee.org/.

13. U.S. Environmental Protection Agency, “Suggested Pre-Hurricane Activities for Water and Wastewater Facilities,” www.epa.gov/safewater/hurricane/pre-hurricane.html.

14. Helix by e-Fense: www.e-fense.com/helix/.

15. CERT® Advisory CA-2003-04 MS-SQL Server Worm, January 27, 2003, www.cert.org/advisories/CA-2003-04.html.

16. See Chapter 3, page 3 in this Handbook.

17. J. Wack, M. Tracy, and M. Souppaya, NIST Special Publication 800-42, Guideline on Network Security Testing, U.S. Department of Commerce, October 2003, http://csrc.nist.gov/publications/nistpubs/800-42/NIST-SP800-42.pdf.

18. An HP7925 120MB disk drive cost $25,000 in 1980, about $36,366 in 2008 currency. The cost per GB in 1980 was ~$488,000 in 2008 dollars. In 2008, a Hitachi 7K1000 1 terabyte drive for sale at consumer stores costs ~$300, or about $0.29/GB. The ratio of the cost per GB today versus in 1980 is ~6 × 10⁻⁷. Taking the 28th root of that ratio gives an annual factor of ~0.60; that is, costs fell about 40 percent per year compounded annually.

19. Talisker Security Wizardry, The Computer Network Defence Internet Operational Picture, http://securitywizardry.com/radar.htm.

20. U.S. Geological Survey, Fact Sheet 017-03, “The USGS Earthquake Hazards Program in NEHRP—Investing in a Safer Future,” by J. R. Filson, J. McCarthy, W. L. Ellsworth, and M. L. Zoback. Last modified May 17, 2005. http://pubs.usgs.gov/fs/2003/fs017-03/

21. For more information on jails, see Chapter 15 of the FreeBSD Documentation Project's FreeBSD Handbook, contributed by Matteo Riondato: www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/jails.html.
