Chapter 5. Sensors in the Service Domain

This chapter discusses specific sensors in the service domain. Service sensors, such as HTTP server logs and mail transfer logs, describe the activity of a particular service: who sent mail to whom, which URLs were accessed in the last five minutes, and other activity mediated by that service.

As we saw in the previous chapter, service domain data is log data. Where available, logs are often preferable to other sources because they are generated by the affected process, removing the interpretation and guesswork often needed with network data. Service logs provide concrete information about events that, viewed from the network perspective, are hard to reconstruct.

Logs have a number of problems, the most important one being a management headache—in order to use a log, you have to know it exists and get access to it. In addition, host-based logs come in a large number of formats, many of them poorly documented. At the risk of a sweeping generalization, the overwhelming majority of logs are designed for debugging and troubleshooting individual hosts, not for evaluating security across networks. Where possible, you’ll often need to reconfigure them to include more security-relevant information, possibly needing to write your own aggregation programs. Finally, logs are a target; attackers will modify or disable logging if possible.

Logs complement network data. Network data is good at finding blind spots, confirming phenomena reported in logs, and identifying things that the logs won’t pick up. An effective security system combines both: network data for a broad scope, service logs for fine detail.

The remainder of this chapter is focused on data from a number of host logs, including system logfiles. We begin by discussing several varieties of log data and preferable message formats. We then discuss specific host and service logs: HTTP server log formats, email log formats, and Unix system logs.

Representative Logfile Formats

This section discusses common logfile formats, including ELF and CLF, the standard log formats for HTTP servers. The formats discussed here are customizable, and I will provide guidelines for improving the log messages in order to provide more security-relevant information.

HTTP: CLF and ELF

HTTP is the modern internet’s reason for existence, and since its development in 1991, it has metamorphosed from a simple library protocol into the internet’s glue. Applications where formerly a developer would have implemented a new service are now routinely offloaded to HTTP and REST APIs.

HTTP is a challenging service to nail down. The core is incredibly simple, but any modern web browsing session involves combining HTTP, HTML, and JavaScript to create ad hoc clients of immense complexity. In this section, we briefly discuss the core components of HTTP with a focus on the analytical aspects.

HTTP is fundamentally a very simple file access service. To understand how simple it is today, try the exercise in Example 5-1 using netcat. netcat (which can also be invoked as nc, perhaps because administrators found it so useful that they wanted to make it easy to invoke) is a flexible network port access tool that can be used to directly send information to ports. It is an ideal tool for quickly bashing together clients with minimal scripting.

Example 5-1. Accessing an HTTP server using the command line
host$ echo 'GET /' | nc www.google.com 80 > google.html

Executing the command in this example should produce a valid HTML file. In its simplest, most unadorned form, an HTTP session consists of opening up a connection, passing a method and a URI, and receiving a file in return.

HTTP is simple enough to be run at the command line by hand if need be—however, that also means that an enormous amount of functionality is handed over to optional headers. When dealing with HTTP logs, the primary challenge is deciding which headers to include and which to ignore. If you try the very simple command in Example 5-1 on other servers, you’ll find it tends to hang—without additional information such as the Host or User-Agent, the server will wait.
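To see what a minimally polite client looks like, the sketch below builds a request with the headers most servers now expect and sends it over a raw socket, much as nc does. The User-Agent string is a placeholder of my own invention.

```python
import socket

def build_request(host, path="/"):
    """Assemble the minimal header set most servers insist on;
    Example 5-1 works without these only against forgiving servers."""
    return (
        "GET {} HTTP/1.1\r\n"
        "Host: {}\r\n"
        "User-Agent: log-survey-sketch\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).format(path, host)

def fetch(host, path="/", port=80, timeout=5.0):
    """Send the request over a raw socket and return the raw
    response: status line, headers, and body."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(build_request(host, path).encode("ascii"))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Because Connection: close is set, the server terminates the session after one response, so the read loop ends at end-of-file rather than hanging.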

There are two standards for HTTP log data: Common Log Format (CLF) and Extended Log Format (ELF). Most HTTP log generators (such as Apache’s mod_log) provide extensive configuration options.

CLF is a single-line logging format developed by the National Center for Supercomputing Applications (NCSA) for the original HTTP server; the W3C provides a minimal definition of the standard. A CLF event is defined as a seven-value single-line record in the following format:

remotehost rfc931 authuser [date] "request" status bytes

Where remotehost is the IP name or address of the remote host, rfc931 is the remote login account name of the user, authuser is the user’s authenticated name, date is the date and time of the request, request is the request line exactly as sent by the client, status is the HTTP status code returned to the client, and bytes is the number of bytes in the response.

Pure CLF has several eccentricities that can make parsing problematic. The rfc931 and authuser fields are effectively artifacts; in the vast majority of CLF records, these fields will simply be set to a dash (-). The actual format of the date value is unspecified and can vary between different HTTP server implementations.

A common modification of CLF is Combined Log Format. The Combined Log Format adds two additional fields to CLF: the HTTP Referer field and the User-Agent string.
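A Combined Log Format line can be pulled apart with a single regular expression. The sketch below (field names are my own) also accepts pure CLF, since the two trailing quoted fields are optional:

```python
import re

# One line of Combined Log Format; CLF proper is the same record
# minus the two quoted fields (referer and user-agent) at the end.
COMBINED_RE = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<useragent>[^"]*)")?'
)

def parse_combined(line):
    """Return a dict of CLF/Combined fields, or None on a non-matching
    line; referer and useragent are None for pure CLF records."""
    m = COMBINED_RE.match(line)
    return m.groupdict() if m else None
```

Note that the regex treats the request line as an opaque quoted string; splitting it into method, URI, and protocol version is a second step, since malformed requests from scanners frequently break that structure.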

ELF is an expandable columnar format that has largely been confined to Microsoft’s Internet Information Server (IIS), although tools such as Bluecoat also use it for logging. As with CLF, the W3C maintains the standard on its website.

An ELF file consists of a sequence of directives followed by a sequence of entries. Directives are used to define attributes common to the entries, such as the date of all entries (the Date directive), and the fields in the entry (the Fields directive). Each entry in ELF is a single HTTP request, and the fields that are defined by the directive are included in that entry.

ELF fields come in one of three forms: identifier, prefix-identifier, or prefix(header). The prefix is a one- or two-character string that defines the direction the information took (c for client, s for server, r for remote). The identifier describes the contents of the field, and the prefix(header) value includes the corresponding HTTP header. For example, cs-method is in the prefix-identifier format and describes the method sent from client to server, while time is a plain identifier denoting the time at which the session ended.

Example 5-2 shows simple outputs from CLF, Combined Log Format, and ELF. Each event is a single line.

Example 5-2. Examples of CLF and ELF
#CLF
192.168.1.1 - - [2012/Oct/11 12:03:45 -0700] "GET /index.html" 200 1294

# Combined Log Format
192.168.1.1 - - [2012/Oct/11 12:03:45 -0700] "GET /index.html" 200 1294
"http://www.example.com/link.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

#ELF
#Version: 1.0
#Date: 2012/Oct/11 00:00:00
#Fields: time c-ip cs-method cs-uri
12:03:45 192.168.1.1 GET /index.html
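The ELF records above can be consumed with a small reader that tracks the Fields directive. This sketch assumes simple whitespace-separated tokens; real ELF quotes values that contain spaces, which a production parser would need to handle:

```python
def parse_elf(lines):
    """Parse W3C Extended Log Format: '#'-prefixed directives set the
    column layout, and each remaining line is one entry. Yields one
    dict per entry, keyed by the names in the Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            directive, _, value = line[1:].partition(":")
            if directive.strip() == "Fields":
                fields = value.split()
            continue
        yield dict(zip(fields, line.split()))
```

Because the field list travels with the file, the parser adapts to whatever columns a given server was configured to emit, which is ELF's main advantage over CLF.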

Most HTTP logs are some form of CLF output. Although ELF is an expandable format, I find the need to carry the header around problematic in that I don’t expect to change formats that much, and would rather that individual log records be interpretable without this information. Based on principles I discussed earlier, here is how I modify CLF records:

  1. Remove the rfc931 and authuser fields. These fields are artifacts and waste space.

  2. Convert the date to epoch time and represent it as a numeric string. In addition to my general disdain for text over numeric representations, time representations have never been standardized in HTTP logfiles. You’re better off moving to a numeric format to ignore the whims of the server.

  3. Incorporate the server IP address, the source port, and the destination port. I expect to move the logfiles to a central location for analysis, so I need the server address to differentiate them. This gets me closer to a five-tuple that I can correlate with other data.

  4. Add the duration of the event, again to help with timing correlation.

  5. Add the host header. In case I’m dealing with virtual hosts, this also helps me identify systems that contact the server without using DNS as a moderator.
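The epoch conversion in step 2 is a one-liner once you know the server’s date format. Because that format is unstandardized, the sketch below takes the format string as a parameter; the two formats shown are assumptions drawn from Apache’s common default and from Example 5-2:

```python
from datetime import datetime

def clf_date_to_epoch(value, fmt):
    """Convert a CLF timestamp to Unix epoch seconds. The format
    string is a parameter because CLF never standardized the date
    field; pass whatever your server actually emits."""
    return int(datetime.strptime(value, fmt).timestamp())

# Apache's common default form, e.g. 11/Oct/2012:12:03:45 -0700
APACHE_FMT = "%d/%b/%Y:%H:%M:%S %z"
# The form used in Example 5-2, e.g. 2012/Oct/11 12:03:45 -0700
EXAMPLE_FMT = "%Y/%b/%d %H:%M:%S %z"
```

Since %z captures the timezone offset, the resulting value is absolute time, so records centralized from servers in different timezones sort correctly against each other.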

Simple Mail Transfer Protocol (SMTP)

SMTP log messages vary by the mail transfer agent (MTA) used and are highly configurable. In this section, we discuss two log formats that are representative of the major Unix and Windows families: sendmail and Microsoft Exchange.

We focus on logging the transfer of email messages. The logging tools for these applications provide an enormous amount of information about the server’s internal status, connection attempts, and other data that, while enormously valuable, requires a book of its own.

Sendmail

Sendmail moderates mail exchange through syslog, and consequently is capable of sending an enormous number of informational messages besides the actual email transaction. For our purposes, we are concerned with two classes of log messages: messages describing connections to and from the mail server, and messages describing actual mail delivery.

By default, sendmail will send messages to /var/maillog, although the logging information it sends is controlled by sendmail’s internal logging level. The logging level ranges from 1 to 96; a log level of n logs all messages of severity 1 to n. Notable log levels include 9 (all message deliveries logged), 10 (inbound connections logged), 12 (outbound connections logged), and 14 (connection refusals logged). Of note is that anything above log level 8 is considered an informational log in syslog, and anything above 11 a debug log message.

A sendmail log line consists of five fixed values, followed by a list of one or more equates:

<date> <host> sendmail[<pid>]: <qid>: <equates>

where <date> is the date, <host> is the name of the host, sendmail is a literal string, <pid> is the sendmail process ID, and <qid> is an internal queue ID used to uniquely identify messages. Sendmail sends at least two log messages for each email message it handles, and the only way to group those messages together is through the qid. Equates are descriptive parameters given in the form <key>=<value>. The equates most relevant to message tracking are listed in Table 5-1.

Table 5-1. Relevant sendmail equates

arg1: Current sendmail implementations enable internal filtering using rulesets; arg1 is the argument passed to the ruleset.
from: The from address of the envelope.
msgid: The message ID of the email.
quarantine: If sendmail quarantines a mail, this is the reason it was held.
reject: If sendmail rejects a mail, this is the reason for rejection.
relay: The name and address of the relaying host; in receipt lines, the host that sent the message, and in sender lines, the host that received it.
ruleset: The ruleset that processed the message; it provides the justification for rejecting, quarantining, or sending the message.
stat: The status of a message’s delivery.
to: The email address of a target; multiple to equates can appear in the same line.

For every email message received, sendmail generates at least two log lines. The first line is the receipt line, and describes the message’s point of origin. The final line, the sender line, describes the disposition of the mail, such as whether it was sent, quarantined, rejected, or bounced.

Sendmail will take one of four basic actions with a message: reject it, quarantine it, bounce it, or send it. Rejection is implemented by message filtering and is used for spam filtering; a rejected message is dropped. Quarantined messages are moved off the queue to a separate area for further review. A bounce means the mail was not sent to the target, and results in a nondelivery report being sent back to the origin.
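Reconstructing a message’s history therefore means parsing the fixed prefix, splitting the equates, and grouping records by qid. The sketch below does the first two steps; the sample values in the test are hypothetical, and the comma split is a simplification that a quoted value could defeat:

```python
import re
from collections import defaultdict

# <date> <host> sendmail[<pid>]: <qid>: <equates>
SENDMAIL_RE = re.compile(
    r'^(?P<date>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+'
    r'sendmail\[(?P<pid>\d+)\]:\s+(?P<qid>\w+):\s+(?P<equates>.*)$'
)

def parse_equates(text):
    """Split 'from=<a@b>, size=123, ...' into a dict of lists (lists
    because the 'to' equate can repeat within one line). Splitting on
    ', ' is a simplification: a value containing that sequence would
    be cut apart."""
    result = defaultdict(list)
    for pair in text.split(", "):
        key, _, value = pair.partition("=")
        result[key].append(value)
    return dict(result)
```

Feeding every parsed line into a dict keyed by qid joins each receipt line with its corresponding sender line, giving one consolidated record per message.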

Microsoft Exchange: Message Tracking Logs

Exchange has one master log format for handling messages, the Message Tracking Log (MTL). Table 5-2 describes the fields.

Table 5-2. MTL fields

date-time: The ISO 8601 representation of the date and time.
client-ip: The IP address of the host that submitted the message to the server.
client-hostname: The fully qualified domain name (FQDN) of client-ip.
server-ip: The IP address of the server.
server-hostname: The FQDN of server-ip.
source-context: Optional information about the source, such as an identifier for the transport agent.
connector-id: The name of the connector.
source: Exchange enumerates a number of source identities for defining the origin of a message, such as an inbox rule, a transport agent, or DNS. The source field contains this identity.
event-id: The event type. This is also an enumerable quantity, and includes a number of status messages about how the message was handled.
internal-message-id: An internal integer identifier used by Exchange to differentiate messages. The ID is not shared between Exchange servers, so if a message is passed around, this value will change.
message-id: The standard SMTP message ID. Exchange will create one if the message does not already have one.
network-message-id: A message ID like internal-message-id, except that it is shared across copies of the message; it is created when a message is cloned or duplicated, such as when it’s sent to a distribution list.
recipient-address: The addresses of the recipients; a semicolon-delimited list of names.
recipient-status: A per-recipient status code indicating how each recipient was handled.
total-bytes: The total size of the message in bytes.
recipient-count: The size of recipient-address in terms of number of recipients.
related-recipient-address: Certain Exchange events (such as redirection) result in additional recipients being added to the list; those addresses are added here.
reference: Message-specific information; the contents are a function of the type of message (defined in event-id).
message-subject: The subject found in the Subject: header.
sender-address: The sender, as specified in the Sender: header; if Sender: is absent, From: is used instead.
return-path: The return email address, as specified in Mail From:.
message-info: Event type-dependent message information.
directionality: The direction of the message; an enumerable quantity.
tenant-id: No longer used.
original-client-ip: The IP address of the client.
original-server-ip: The IP address of the server.
custom-data: Additional data dependent on the type of event.

Additional Useful Logfiles

A number of additional useful services may or may not be present on your network, and as part of mapping and situational awareness, you should identify them and determine what logging is possible. In this section, I provide a small list of the services I consider most critical to keep track of.

These are largely enterprise services, and any discussion on the proper generation and configuration of logs will be a function of the application’s configuration. In addition, for several of these services, you may need to reconfigure them to log the data you need. Many of these services will provide logging through syslog or the Windows Event Manager.

Staged Logging

Turning on all the logging mentioned here is going to drown analysts in a lot of data, which is primarily going to be used for rare, high-risk events. Consequently, when developing a logging plan, you should consider policies and processes for increasing or decreasing targeted logging as needed.

The process of staging up or staging down logging will be a function of events or a criticality. In the former case, you may increase logging because a particular event or trigger has raised concerns—for example, if a new exploit is in the wild, you may stage up logging for services vulnerable to that exploit. In the latter case, you may always keep high-information logging for critical services—monitoring fileshares for your IP, for example.

LDAP and Directory Services

If you have any form of active directory or other user management (Microsoft Active Directory, OpenLDAP), this information should be available. Directory services will generally consist of a database of users, events that update the database itself (addition, removal, or updates of users), as well as login and logoff events.

Consider collecting a complete, periodic dump of the directory and keeping it somewhere where the ops and analysis teams can get their hands on it quickly, such as storing it in Redis. You can expect that analysts will regularly access this data for context.

Logon and logoff data should be sent as a low-priority stream directly to your main console. It is useful for annotating where users are in the system at any time, and will often be cross-referenced with other actions.

File Transfer, Storage, and Databases

Besides HTTP, file transfer and file storage include services such as SharePoint, NFS mounts, FTP, anything using SMB, any code repositories (Git, GitHub, GitLab, SourceSafe, CVS, SVN), as well as web-based services such as Confluence. For these services, you are most interested in monitoring users and times: who accessed the system, when they accessed it, and how much they accessed.

Volumes are a good, if coarse, indicator of file transfer. A good companion metric is to check for locality anomalies (see Chapter 14). Fileshares inside of an enterprise are accessed by people who have a job—the majority of them are going to visit the same files repeatedly, working with a limited subset.1

Databases, including SQL servers such as Oracle, Postgres, and MySQL and NoSQL systems such as HDFS, should be tracked for data loss prevention and bulk transfer, just as with the file transfer systems. In addition, if possible, log the queries and check for anomalous query strings. In an enterprise environment, you should expect to see users rarely interacting with the console; rather, they should be using predictable sequences of SQL statements run through a form. Distance metrics (see Chapter 12 for more information) can be used to match these strings to look for anomalous queries such as a SELECT *.
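Chapter 12’s distance metrics warrant their own discussion, but even the standard library’s difflib gives a crude version of the idea. The templates and queries below are hypothetical examples of my own:

```python
import difflib

def max_similarity(query, templates):
    """Score a SQL statement against a set of known-good query
    templates; low maximum similarity flags statements worth a look.
    A crude stand-in for the distance metrics of Chapter 12."""
    return max(
        difflib.SequenceMatcher(None, query.upper(), t.upper()).ratio()
        for t in templates
    )
```

In practice the templates come from observing the application’s normal form-driven traffic; a bulk pull such as a SELECT * scores well below the routine parameterized queries.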

Logfile Transport: Transfers, Syslog, and Message Queues

Host logs can be transferred off their hosts in a number of ways, depending on how the logs are generated and on the capabilities of the operating system. The most common approaches involve using regular file transfers or the syslog protocol. A newer approach uses message queues to transport log information.

Transfer and Logfile Rotation

Most logging applications write to a rotating logfile (see, for example, the rotated system logs in “Accessing and Manipulating Logfiles”). In these cases, the logfile will be closed and archived after a fixed period and a new file will be started. Once the file is closed, it can be copied over to a different location to support analytics.

File transfer is simple. It can be implemented using SSH or any other copying protocol. The major headache is ensuring that the files are actually complete when copied; the rotation period for the file effectively dictates your response time. For example, if a file is rotated every 24 hours, an event will sit in the current file for up to a day (half a day on average) before you can collect it.

Syslog

The grandfather of systematic system logging utilities is syslog, a standard approach to logging originally developed for Unix systems that now comprises a standard, a protocol, and a general framework for discussing logging messages. Syslog defines a fixed message format and enables messages to be sent to logger daemons that might reside on the host or be remotely located.

All syslog messages contain a time, a facility, a severity, and a text message. Tables 5-3 and 5-4 describe the facilities and priorities encoded in the syslog protocol. As Table 5-3 shows, the facilities referred to by syslog comprise a variety of fundamental systems (some of them largely obsolete). Of more concern is what facilities are not covered—DNS and HTTP, for example. The priorities (in Table 5-4) are generally more germane, as the vocabulary for their severity has entered into common parlance.

Table 5-3. Syslog facilities

0: Kernel
1: User level
2: Mail
3: System daemons
4: Security/authorization
5: syslogd
6: Line printer
7: Network news
8: UUCP
9: Clock daemon
10: Security/authorization
11: ftpd
12: ntpd
13: Log audit
14: Log alert
15: Clock daemon
16-23: Reserved for local use

Table 5-4. Syslog priorities

0: Emergency (system is unusable)
1: Alert (action must be taken immediately)
2: Critical (critical conditions)
3: Error (error conditions)
4: Warning (warning conditions)
5: Notice (normal but significant condition)
6: Informational (informational messages)
7: Debug (debugging information)

Syslog’s reference implementations are UDP-based, and the UDP standard results in several constraints. Most importantly, UDP datagram length is constrained by the MTU of the layer 2 protocol carrying the datagram, effectively imposing a hard limit of about 1,450 characters on any syslog message. The syslog protocol itself specifies that messages should be less than 1,024 characters, but this is erratically enforced, while the UDP cutoff will affect long messages. In addition, syslog runs on top of UDP, which means that when messages are dropped, they are lost forever.

The easiest way to solve this problem is to use TCP-based syslog, which is implemented in the open source domain by tools such as syslog-ng and rsyslog. Both tools provide TCP transport, as well as a number of other capabilities such as database interfaces, the ability to rewrite messages en route, and selective transport of syslog messages to different receivers. Windows does not support syslog natively, but a number of commercial applications provide similar functionality.
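To illustrate the TCP side, Python’s standard logging library can ship records to an rsyslog or syslog-ng receiver over a stream socket. The logger name and facility choice here are arbitrary:

```python
import logging
import logging.handlers
import socket

def make_tcp_syslog_logger(host, port=514):
    """Build a logger that forwards records to a remote collector
    over TCP syslog; TCP avoids the silent loss and size truncation
    of UDP transport at the cost of a standing connection."""
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_AUTH,
        socktype=socket.SOCK_STREAM,
    )
    logger = logging.getLogger("security")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Note that the handler opens its connection at construction time, so the collector must already be listening; production deployments usually put a local relay on each host to buffer through collector outages.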

Further Reading

  1. M. O’Leary, Cyber Operations: Building, Defending, and Attacking Modern Computer Networks (New York: Apress, 2015).

1 Keep track of systems that do file transfers, and internal search engines. Both will be false positives.
