This chapter discusses specific sensors in the service domain. Service sensors, such as HTTP server logs and mail transfer logs, describe the activity of a particular service: who sent mail to whom, what URLs were accessed in the last five minutes, and any other activity moderated through that service.
As we saw in the previous chapter, service domain data is log data. Where available, logs are often preferable to other sources because they are generated by the affected process, removing the interpretation and guesswork often needed with network data. Service logs provide concrete information about events that, viewed from the network perspective, are hard to reconstruct.
Logs have a number of problems, the most important one being a management headache—in order to use a log, you have to know it exists and get access to it. In addition, host-based logs come in a large number of formats, many of them poorly documented. At the risk of a sweeping generalization, the overwhelming majority of logs are designed for debugging and troubleshooting individual hosts, not for evaluating security across networks. Where possible, you’ll often need to reconfigure them to include more security-relevant information, possibly needing to write your own aggregation programs. Finally, logs are a target; attackers will modify or disable logging if possible.
Logs complement network data. Network data is good at finding blind spots, confirming phenomena reported in logs and identifying things that the logs won’t pick up. An effective security system combines both: network data for a broad scope, service logs for fine detail.
The remainder of this chapter focuses on data from a number of host logs, including system logfiles. We begin by discussing several varieties of log data and preferable message formats. We then discuss specific host and service logs: HTTP server log formats, email log formats, and Unix system logs.
This section discusses common logfile formats, including CLF and ELF, the standard log formats for HTTP traffic. The formats discussed here are customizable, and I will provide guidelines for improving the log messages in order to provide more security-relevant information.
HTTP is the modern internet’s reason for existence, and since its development in 1991, it has metamorphosed from a simple library protocol into the internet’s glue. Applications where formerly a developer would have implemented a new service are now routinely offloaded to HTTP and REST APIs.
HTTP is a challenging service to nail down. The core is incredibly simple, but any modern web browsing session involves combining HTTP, HTML, and JavaScript to create ad hoc clients of immense complexity. In this section, we briefly discuss the core components of HTTP with a focus on the analytical aspects.
HTTP is fundamentally a very simple file access service. To understand how simple it is today, try the exercise in Example 5-1 using netcat. netcat (which can also be invoked as nc, perhaps because administrators found it so useful that they wanted to make it easy to invoke) is a flexible network port access tool that can be used to directly send information to ports. It is an ideal tool for quickly bashing together clients with minimal scripting.
host$ echo 'GET /' | nc www.google.com 80 > google.html
Executing the command in this example should produce a valid HTML file. In its simplest, most unadorned form, an HTTP session consists of opening up a connection, passing a method and a URI, and receiving a file in return.
HTTP is simple enough to be run at the command line by hand if need be; however, that also means that an enormous amount of functionality is handed over to optional headers. When dealing with HTTP logs, the primary challenge is deciding which headers to include and which to ignore. If you try the very simple command in Example 5-1 on other servers, you’ll find it tends to hang—without additional information such as the Host or User-Agent headers, the server will wait.
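To see why those headers matter, here is a minimal sketch in Python that assembles a bare-bones HTTP/1.1 request by hand. The User-Agent string is a placeholder, and the fetch helper is illustrative rather than a production client.

```python
import socket

def build_request(host, path="/"):
    """Assemble a minimal HTTP/1.1 request. The Host header is mandatory
    in HTTP/1.1; many servers stall or refuse without it."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "User-Agent: demo-client\r\n"   # placeholder agent string
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")

def fetch(host, path="/", port=80):
    """Send the request over a raw socket and return the full response."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Compare this with the netcat version: the only functional difference is the explicit headers, which is exactly the extra information modern servers wait for.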
There are two standards for HTTP log data: Common Log Format (CLF) and Extended Log Format (ELF). Most HTTP log generators (such as Apache’s mod_log_config) provide extensive configuration options.
CLF is a single-line logging format developed by the National Center for Supercomputing Applications (NCSA) for the original HTTP server; the W3C provides a minimal definition of the standard. A CLF event is defined as a seven-value single-line record in the following format:
remotehost rfc931 authuser [date] "request" status bytes
where remotehost is the IP name or address of the remote host, rfc931 is the remote login account name of the user, authuser is the user’s authenticated name, date is the date and time of the request, request is the request, status is the HTTP status code, and bytes is the number of bytes.
Pure CLF has several eccentricities that can make parsing problematic. The rfc931 and authuser fields are effectively artifacts; in the vast majority of CLF records, these fields will be set to -. The actual format of the date value is unspecified and can vary between different HTTP server implementations.
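A CLF record can be pulled apart with a single regular expression. The sketch below assumes the seven-field layout described above; because the date format is unspecified, the date is captured as an opaque string rather than parsed.

```python
import re

# Sketch of a CLF parser. The date pattern is deliberately loose because
# CLF leaves the timestamp format up to the server implementation.
CLF_RE = re.compile(
    r'^(?P<remotehost>\S+)\s+(?P<rfc931>\S+)\s+(?P<authuser>\S+)\s+'
    r'\[(?P<date>[^\]]+)\]\s+"(?P<request>[^"]*)"\s+'
    r'(?P<status>\d{3})\s+(?P<bytes>\d+|-)$'
)

def parse_clf(line):
    """Return a dict of the seven CLF fields, or None if the line
    does not match."""
    m = CLF_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    # bytes may be "-" when no body was returned
    rec["bytes"] = None if rec["bytes"] == "-" else int(rec["bytes"])
    return rec
```

A Combined Log Format line can be handled the same way by appending two quoted-string groups for the Referer and User-Agent values.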
A common modification of CLF is Combined Log Format. The Combined Log Format adds two additional fields to CLF: the HTTP Referer field and the User-Agent string.
ELF is an expandable columnar format that has largely been confined to Microsoft’s Internet Information Server (IIS), although tools such as Bluecoat also use it for logging. As with CLF, the W3C maintains the standard on its website.
An ELF file consists of a sequence of directives followed by a sequence of entries. Directives are used to define attributes common to the entries, such as the date of all entries (the Date directive) and the fields in the entry (the Fields directive). Each entry in ELF is a single HTTP request, and the fields that are defined by the directive are included in that entry.
ELF fields come in one of three forms: identifier, prefix-identifier, or prefix(header). The prefix is a one- or two-character string that defines the direction the information took (c for client, s for server, r for remote). The identifier describes the contents of the field, and the prefix(header) value includes the corresponding HTTP header. For example, cs-method is in the prefix-identifier format and describes the method sent from client to server, while time is a plain identifier denoting the time at which the session ended.
Example 5-2 shows simple outputs from CLF, Combined Log Format, and ELF. Each event is a single line.
#CLF
192.168.1.1 - - [2012/Oct/11 12:03:45 -0700] "GET /index.html" 200 1294

# Combined Log Format
192.168.1.1 - - [2012/Oct/11 12:03:45 -0700] "GET /index.html" 200 1294 "http://www.example.com/link.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

#ELF
#Version: 1.0
#Date: 2012/Oct/11 00:00:00
#Fields: time c-ip cs-method cs-uri
12:03:45 192.168.1.1 GET /index.html
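A minimal ELF reader is straightforward if you assume directives of the form #Name: value and whitespace-delimited entry fields (this sketch does not handle quoted values containing spaces):

```python
def parse_elf(lines):
    """Sketch of an ELF reader: directives start with '#', and entries are
    whitespace-delimited, keyed by the most recent #Fields directive."""
    fields = None
    directives = {}
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            name, _, value = line[1:].partition(":")
            directives[name.strip()] = value.strip()
            if name.strip() == "Fields":
                fields = value.split()
            continue
        if fields:
            # Pair each token with its field name from the directive
            entries.append(dict(zip(fields, line.split())))
    return directives, entries
```

This is the self-description property of ELF in action: the log carries its own schema, so the reader needs no external configuration.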
Most HTTP logs are some form of CLF output. Although ELF is an expandable format, I find the need to carry the header around problematic in that I don’t expect to change formats that much, and would rather that individual log records be interpretable without this information. Based on principles I discussed earlier, here is how I modify CLF records:
Remove the rfc931 and authuser fields. These fields are artifacts and waste space.
Convert the date to epoch time and represent it as a numeric string. In addition to my general disdain for text over numeric representations, time representations have never been standardized in HTTP logfiles. You’re better off moving to a numeric format to ignore the whims of the server.
Incorporate the server IP address, the source port, and the destination port. I expect to move the logfiles to a central location for analysis, so I need the server address to differentiate them. This gets me closer to a five-tuple that I can correlate with other data.
Add the duration of the event, again to help with timing correlation.
Add the host header. In case I’m dealing with virtual hosts, this also helps me identify systems that contact the server without using DNS as a moderator.
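As a sketch of the date conversion, the snippet below assumes Apache's default CLF timestamp layout (%d/%b/%Y:%H:%M:%S %z); other servers will need a different format string, which is precisely why normalizing to epoch time pays off.

```python
from datetime import datetime

def clf_date_to_epoch(stamp):
    """Convert an Apache-style CLF timestamp, e.g.
    '11/Oct/2012:12:03:45 -0700', to integer epoch seconds.
    The format string is an assumption; adjust per server."""
    return int(datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z").timestamp())
```

Once every record carries epoch seconds, timing correlation against flow data and other logs becomes simple integer arithmetic.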
SMTP log messages vary by the mail transfer agent (MTA) used and are highly configurable. In this section, we discuss two log formats that are representative of the major Unix and Windows families: sendmail and Microsoft Exchange.
We focus on logging the transfer of email messages. The logging tools for these applications provide an enormous amount of information about the server’s internal status, connection attempts, and other data that, while enormously valuable, requires a book of its own.
Sendmail logs through syslog, and consequently is capable of sending an enormous number of informational messages besides the actual email transaction. For our purposes, we are concerned with two classes of log messages: messages describing connections to and from the mail server, and messages describing actual mail delivery.
By default, sendmail will send messages to /var/maillog, although the logging information it sends is controlled by sendmail’s internal logging level. The logging level ranges from 1 to 96; a log level of n logs all messages of severity 1 to n. Notable log levels include 9 (all message deliveries logged), 10 (inbound connections logged), 12 (outbound connections logged), and 14 (connection refusals logged). Of note is that anything above log level 8 is considered an informational log in syslog, and anything above 11 a debug log message.
A sendmail log line consists of five fixed values, followed by a list of one or more equates:
<date> <host> sendmail[<pid>]: <qid>: <equates>
where <date> is the date, <host> is the name of the host, sendmail is a literal string, <pid> is the sendmail process ID, and <qid> is an internal queue ID used to uniquely identify messages.
Sendmail sends at least two log messages when sending an email message, and the only way to group those messages together is through the qid. Equates are descriptive parameters given in the form <key>=<value>. Sendmail can send a number of potential equates, listed in Table 5-1.
Equate | Description |
---|---|
arg1 | Current sendmail implementations enable internal filtering using rulesets; arg1 is the argument passed to the ruleset. |
from | The from address of the envelope. |
msgid | The message ID of the email. |
quarantine | If sendmail quarantines a mail, this is the reason it was held. |
reject | If sendmail rejects a mail, this is the reason for rejection. |
relay | This is the name and address of the host that sent the message; in recipient lines, it’s the host that sent it, and in sender lines, the host that received it. |
ruleset | This is the ruleset that processed the message, and provides the justification for rejecting, quarantining, or sending the message. |
stat | The status of a message’s delivery. |
to | The email address of a target; multiple to equates can appear for a single message. |
For every email message received, sendmail generates at least two log lines. The first line is the receipt line, and describes the message’s point of origin. The final line, the sender line, describes the disposition of the mail, such as whether it was sent, quarantined, rejected, or bounced.
Sendmail will take one of four basic actions with a message: reject it, quarantine it, bounce it, or send it. Rejection is implemented by message filtering and is used for spam filtering; a rejected message is dropped. Quarantined messages are moved off the queue to a separate area for further review. A bounce means the mail was not sent to the target, and results in a nondelivery report being sent back to the origin.
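A minimal sketch of parsing and qid-based grouping, assuming the five-field line layout described above and comma-separated equates (the sample lines in the usage below are invented):

```python
import re
from collections import defaultdict

# Assumes the classic "<date> <host> sendmail[<pid>]: <qid>: <equates>" shape.
LINE_RE = re.compile(
    r'^(?P<date>\w{3}\s+\d+\s[\d:]+)\s+(?P<host>\S+)\s+'
    r'sendmail\[(?P<pid>\d+)\]:\s+(?P<qid>\w+):\s+(?P<rest>.*)$'
)

def parse_line(line):
    """Parse one sendmail log line into fixed fields plus an equate dict."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Equates are comma-separated key=value pairs
    rec["equates"] = dict(
        kv.split("=", 1) for kv in rec.pop("rest").split(", ") if "=" in kv
    )
    return rec

def group_by_qid(lines):
    """Group receipt and sender lines for the same message by queue ID."""
    messages = defaultdict(list)
    for line in lines:
        rec = parse_line(line)
        if rec:
            messages[rec["qid"]].append(rec)
    return messages
```

Grouping by qid is what lets you reconstruct a complete transaction: the receipt line supplies from and relay, while the sender line supplies to and stat.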
Exchange has one master log format for handling messages, the Message Tracking Log (MTL). Table 5-2 describes the fields.
Field name | Description |
---|---|
date-time | The ISO 8601 representation of the date and time. |
client-ip | The IP address of the host that submitted the message to the server. |
client-hostname | The hostname of the host that submitted the message to the server. |
server-ip | The IP address of the server. |
server-hostname | The hostname of the server. |
source-context | Optional information about the source, such as an identifier for the transport agent. |
connector-id | The name of the connector. |
source | Exchange enumerates a number of source identities for defining the origin of a message, such as an inbox rule, a transport agent, or DNS. The source field contains that identity. |
event-id | The event type. This is also an enumerable quantity, and includes a number of status messages about how the message was handled. |
internal-message-id | An internal integer identifier used by Exchange to differentiate messages. The ID is not shared between Exchange servers, so if a message is passed around, this value will change. |
message-id | The standard SMTP message ID. Exchange will create one if the message does not already have one. |
network-message-id | This is a message ID like message-id, but one that persists across all copies of the message. |
recipient-address | The addresses of the recipients; this is a semicolon-delimited list of names. |
recipient-status | A per-recipient status code indicating how each recipient was handled. |
total-bytes | The total size of the message in bytes. |
recipient-count | The size of the recipient-address list; that is, the number of recipients. |
related-recipient-address | Certain Exchange events (such as redirection) will result in additional recipients being added to the list; those addresses are added here. |
reference | This is message-specific information; the contents are a function of the type of message (defined in event-id). |
message-subject | The subject found in the Subject: header field. |
sender-address | The sender, as specified in the Sender: header field (or the From: field if Sender: is absent). |
return-path | The return email address, as specified in the MAIL FROM: command of the envelope. |
message-info | Event type–dependent message information. |
directionality | The direction of the message; an enumerable quantity. |
tenant-id | No longer used. |
original-client-ip | The IP address of the client. |
original-server-ip | The IP address of the server. |
custom-data | Additional data dependent on the type of event. |
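MTL files can be read as comment-prefixed CSV. The sketch below assumes a '#Fields:' comment names the columns, as in IIS-style logs; the sample record is invented for illustration and uses only a subset of the fields in Table 5-2.

```python
import csv
import io

def read_mtl(stream):
    """Yield one dict per record from a message tracking log, keyed by the
    column names given in the '#Fields:' comment line."""
    fields = None
    for line in stream:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].strip().split(",")
            continue
        if fields:
            # csv handles quoted values that contain commas
            yield dict(zip(fields, next(csv.reader([line]))))

# Invented sample for illustration
sample = io.StringIO(
    "#Fields: date-time,client-ip,event-id,recipient-address\n"
    "2012-10-11T19:03:45.000Z,192.168.1.1,RECEIVE,b@example.com\n"
)
records = list(read_mtl(sample))
```

As with ELF, the log carries its own schema, so a reader built this way keeps working when administrators change which fields are logged.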
A number of additional useful services may or may not be present on your network, and as part of mapping and situational awareness, you should identify them and determine what logging is possible. In this section, I provide a small list of the services I consider most critical to keep track of.
These are largely enterprise services, and any discussion on the proper generation and configuration of logs will be a function of the application’s configuration. In addition, for several of these services, you may need to reconfigure them to log the data you need. Many of these services will provide logging through syslog or the Windows Event Manager.
Turning on all the logging mentioned here is going to drown analysts in a lot of data, which is primarily going to be used for rare, high-risk events. Consequently, when developing a logging plan, you should consider policies and processes for increasing or decreasing targeted logging as needed.
The process of staging logging up or down will be a function of events or of criticality. In the former case, you may increase logging because a particular event or trigger has raised concerns; for example, if a new exploit is in the wild, you may stage up logging for services vulnerable to that exploit. In the latter case, you may always keep high-information logging for critical services, such as fileshares containing your intellectual property.
If you have any form of active directory or other user management (Microsoft Active Directory, OpenLDAP), this information should be available. Directory services will generally consist of a database of users, events that update the database itself (addition, removal, or updates of users), as well as login and logoff events.
Consider collecting a complete, periodic dump of the directory and keeping it somewhere where the ops and analysis teams can get their hands on it quickly, such as storing it in Redis. You can expect that analysts will regularly access this data for context.
Logon and logoff data should be sent as a low-priority stream directly to your main console. It is useful for annotating where users are in the system at any time, and will often be cross-referenced with other actions.
Besides HTTP, file transfer and file storage includes services such as SharePoint, NFS Mounts, FTP, anything using SMB, any code repositories (Git, GitHub, GitLab, SourceSafe, CVS, SVN), as well as web-based services such as Confluence. In the case of these services, you are most interested in monitoring users and times—who accessed the system, when they accessed the system, and how much they accessed.
Volumes are a good, if coarse, indicator of file transfer. A good companion metric is to check for locality anomalies (see Chapter 14). Fileshares inside an enterprise are accessed by people who have a job to do; the majority of them are going to visit the same files over and over, working with a limited subset.1
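As a sketch of such a locality check (the user names, paths, and the notion of a fixed per-user working set are simplifications for illustration):

```python
from collections import defaultdict

def build_working_sets(history):
    """Build each user's historical working set: the files they
    normally touch. history is an iterable of (user, path) pairs."""
    sets = defaultdict(set)
    for user, path in history:
        sets[user].add(path)
    return sets

def novelty_ratio(accesses, working_set):
    """Fraction of a day's accesses that fall outside the working set;
    a ratio near 1.0 suggests a locality anomaly worth a second look."""
    if not accesses:
        return 0.0
    new = sum(1 for path in accesses if path not in working_set)
    return new / len(accesses)
```

Systems that legitimately roam the fileshare, such as backup jobs and internal search-engine crawlers, will score high on this metric and should be excluded from it.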
Databases, including SQL servers such as Oracle, Postgres, and MySQL and NoSQL systems such as HDFS, should be tracked for data loss prevention and bulk transfer, just as with the file transfer systems. In addition, if possible, log the queries and check for anomalous query strings. In an enterprise environment, you should expect to see users rarely interacting with the console; rather, they should be using predictable sequences of SQL statements run through a form. Distance metrics (see Chapter 12 for more information) can be used to match these strings to look for anomalous queries such as a SELECT *.
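A crude sketch of such a distance check, using Jaccard distance over whitespace-delimited tokens; the template and the 0.6 threshold are invented for illustration, and a real deployment would learn templates from observed traffic and likely use a finer-grained metric.

```python
def jaccard_distance(a, b):
    """1.0 minus the Jaccard similarity of the two queries' token sets."""
    ta, tb = set(a.upper().split()), set(b.upper().split())
    return 1.0 - len(ta & tb) / len(ta | tb)

def is_anomalous(query, templates, threshold=0.6):
    """Flag a query that is far from every known-good template."""
    return all(jaccard_distance(query, t) > threshold for t in templates)

# Hypothetical template learned from form-driven traffic
templates = ["SELECT name, email FROM users WHERE id = ?"]
```

A form-driven query differs from its template only in parameter values, so its distance stays small; a hand-typed SELECT * shares few tokens with any template and stands out.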
Host logs can be transferred off their hosts in a number of ways, depending on how the logs are generated and on the capabilities of the operating system. The most common approaches involve using regular file transfers or the syslog protocol. A newer approach uses message queues to transport log information.
Most logging applications write to a rotating logfile (see, for example, the rotated system logs in “Accessing and Manipulating Logfiles”). In these cases, the logfile will be closed and archived after a fixed period and a new file will be started. Once the file is closed, it can be copied over to a different location to support analytics.
File transfer is simple. It can be implemented using SSH or any other copying protocol. The major headache is ensuring that the files are actually complete when copied; the rotation period for the file effectively dictates your response time. For example, if a file is rotated every 24 hours, then you will, on average, have to wait a day to get hold of the latest events.
The grandfather of systematic system logging utilities is syslog, a standard approach to logging originally developed for Unix systems that now comprises a standard, a protocol, and a general framework for discussing logging messages. Syslog defines a fixed message format and enables messages to be sent to logger daemons that might reside on the host or be remotely located.
All syslog messages contain a time, a facility, a severity, and a text message. Tables 5-3 and 5-4 describe the facilities and priorities encoded in the syslog protocol. As Table 5-3 shows, the facilities referred to by syslog comprise a variety of fundamental systems (some of them largely obsolete). Of more concern is what facilities are not covered—DNS and HTTP, for example. The priorities (in Table 5-4) are generally more germane, as the vocabulary for their severity has entered into common parlance.
Value | Meaning |
---|---|
0 | Kernel |
1 | User level |
2 | Mail system |
3 | System daemons |
4 | Security/authorization |
5 | Syslog internal messages |
6 | Line printer |
7 | Network news |
8 | UUCP |
9 | Clock daemon |
10 | Security/authorization |
11 | FTP daemon |
12 | NTP subsystem |
13 | Log audit |
14 | Log alert |
15 | Clock daemon |
16–23 | Reserved for local use |
Value | Meaning |
---|---|
0 | Emergency: system is unusable |
1 | Alert: action must be taken immediately |
2 | Critical: critical conditions |
3 | Error: error conditions |
4 | Warning: warning conditions |
5 | Notice: normal but significant condition |
6 | Informational: informational messages |
7 | Debug: debugging information |
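On the wire, facility and severity are packed into the PRI value that leads every syslog message: PRI = facility × 8 + severity, per RFC 5424. For example, a security/authorization (facility 4) critical (severity 2) message carries <34>. A sketch:

```python
def encode_pri(facility, severity):
    """Pack facility and severity into a syslog PRI value (RFC 5424)."""
    return facility * 8 + severity

def decode_pri(pri):
    """Unpack a PRI value into (facility, severity)."""
    return divmod(pri, 8)
```

Decoding PRI is often the first step in normalizing syslog feeds from heterogeneous hosts, since it lets you filter by severity before parsing the free-text message body.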
Syslog’s reference implementations are UDP-based, and the UDP standard results in several constraints. Most importantly, UDP datagram length is constrained by the MTU of the layer 2 protocol carrying the datagram, effectively imposing a hard limit of about 1,450 characters on any syslog message. The syslog protocol itself specifies that messages should be less than 1,024 characters, but this is erratically enforced, while the UDP cutoff will affect long messages. In addition, syslog runs on top of UDP, which means that when messages are dropped, they are lost forever.
The easiest way to solve this problem is to use TCP-based syslog, which is implemented in the open source domain with tools such as syslog-ng and rsyslog. Both of these tools provide TCP transport, as well as a number of other capabilities such as database interfaces, the ability to rewrite messages en route, and selective transport of syslog messages to different receivers. Windows does not support syslog natively, but a number of commercial applications provide similar functionality.
1 Keep track of systems that do file transfers, and internal search engines. Both will be false positives.