This chapter discusses specific sensors in the service domain. Service sensors, such as HTTP server logs and mail transfer logs, describe the activity of a particular service: who sent mail to whom, what URLs were accessed in the last five minutes, and any other activity moderated through that service.
As we saw in the previous chapter, service domain data is log data. Where available, logs are often preferable to other sources because they are generated by the affected process, removing the interpretation and guesswork often needed with network data. Service logs provide concrete information about events that, viewed from the network perspective, are hard to reconstruct.
Logs have a number of problems, the most important one being a management headache—in order to use a log, you have to know it exists and get access to it. In addition, host-based logs come in a large number of formats, many of them poorly documented. At the risk of a sweeping generalization, the overwhelming majority of logs are designed for debugging and troubleshooting individual hosts, not for evaluating security across networks. Where possible, you’ll often need to reconfigure them to include more security-relevant information, possibly needing to write your own aggregation programs. Finally, logs are a target; attackers will modify or disable logging if possible.
Logs complement network data. Network data is good at finding blind spots, confirming phenomena reported in logs and identifying things that the logs won’t pick up. An effective security system combines both: network data for a broad scope, service logs for fine detail.
The remainder of this chapter focuses on data from a number of host logs, including system logfiles. We begin by discussing several varieties of log data and preferable message formats. We then discuss specific host and service logs: HTTP server log formats, email log formats, and Unix system logs.
This section discusses common logfile formats, including CLF and ELF, the standard log formats for HTTP traffic. The formats discussed here are customizable, and I will provide guidelines for improving the log messages in order to provide more security-relevant information.
HTTP is the modern internet’s reason for existence, and since its development in 1991, it has metamorphosed from a simple library protocol into the internet’s glue. Applications where formerly a developer would have implemented a new service are now routinely offloaded to HTTP and REST APIs.
HTTP is a challenging service to nail down. The core is incredibly simple, but any modern web browsing session involves combining HTTP, HTML, and JavaScript to create ad hoc clients of immense complexity. In this section, we briefly discuss the core components of HTTP with a focus on the analytical aspects.
HTTP is fundamentally a very simple file access service. To understand how simple it is today, try the exercise in Example 5-1 using netcat. netcat (which can also be invoked as nc, perhaps because administrators found it so useful that they wanted to make it easy to invoke) is a flexible network port access tool that can be used to directly send information to ports. It is an ideal tool for quickly bashing together clients with minimal scripting.
host$ echo 'GET /' | nc www.google.com 80 > google.html
Executing the command in this example should produce a valid HTML file. In its simplest, most unadorned form, an HTTP session consists of opening up a connection, passing a method and a URI, and receiving a file in return.
HTTP is simple enough to be run at the command line by hand if need be; however, that also means that an enormous amount of functionality is handed over to optional headers. When dealing with HTTP logs, the primary challenge is deciding which headers to include and which to ignore. If you try the very simple command in Example 5-1 on other servers, you’ll find it tends to hang—without additional information such as the Host or User-Agent headers, the server will wait.
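To see why those headers matter, here is a minimal sketch in Python that assembles a bare-bones HTTP/1.1 request by hand. The User-Agent string is a placeholder, and the fetch helper is illustrative rather than a production client.

```python
import socket

def build_request(host, path="/"):
    """Assemble a minimal HTTP/1.1 request. The Host header is mandatory
    in HTTP/1.1; many servers stall or refuse without it."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "User-Agent: demo-client\r\n"   # placeholder agent string
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")

def fetch(host, path="/", port=80):
    """Send the request over a raw socket and return the full response."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Compare this with the netcat version: the only functional difference is the explicit headers, which is exactly the extra information modern servers wait for.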
There are two standards for HTTP log data: Common Log Format (CLF) and Extended Log Format (ELF). Most HTTP log generators (such as Apache’s mod_log_config) provide extensive configuration options.
CLF is a single-line logging format developed by the National Center for Supercomputing Applications (NCSA) for the original HTTP server; the W3C provides a minimal definition of the standard. A CLF event is defined as a seven-value single-line record in the following format:
remotehost rfc931 authuser [date] "request" status bytes
where remotehost is the IP name or address of the remote host, rfc931 is the remote login account name of the user, authuser is the user’s authenticated name, date is the date and time of the request, request is the request, status is the HTTP status code, and bytes is the number of bytes.
Pure CLF has several eccentricities that can make parsing problematic. The rfc931 and authuser fields are effectively artifacts; in the vast majority of CLF records, these fields will be set to -. The actual format of the date value is unspecified and can vary between different HTTP server implementations.
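A CLF record can be pulled apart with a single regular expression. The sketch below assumes the seven-field layout described above; because the date format is unspecified, the date is captured as an opaque string rather than parsed.

```python
import re

# Sketch of a CLF parser. The date pattern is deliberately loose because
# CLF leaves the timestamp format up to the server implementation.
CLF_RE = re.compile(
    r'^(?P<remotehost>\S+)\s+(?P<rfc931>\S+)\s+(?P<authuser>\S+)\s+'
    r'\[(?P<date>[^\]]+)\]\s+"(?P<request>[^"]*)"\s+'
    r'(?P<status>\d{3})\s+(?P<bytes>\d+|-)$'
)

def parse_clf(line):
    """Return a dict of the seven CLF fields, or None if the line
    does not match."""
    m = CLF_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    # bytes may be "-" when no body was returned
    rec["bytes"] = None if rec["bytes"] == "-" else int(rec["bytes"])
    return rec
```

A Combined Log Format line can be handled the same way by appending two quoted-string groups for the Referer and User-Agent values.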
A common modification of CLF is Combined Log Format. The Combined Log Format adds two additional fields to CLF: the HTTP Referer field and the User-Agent string.
ELF is an expandable columnar format that has largely been confined to Microsoft’s Internet Information Server (IIS), although tools such as Bluecoat also use it for logging. As with CLF, the W3C maintains the standard on its website.
An ELF file consists of a sequence of directives followed by a sequence of entries. Directives are used to define attributes common to the entries, such as the date of all entries (the Date directive) and the fields in the entry (the Fields directive). Each entry in ELF is a single HTTP request, and the fields that are defined by the directive are included in that entry.
ELF fields come in one of three forms: identifier, prefix-identifier, or prefix(header). The prefix is a one- or two-character string that defines the direction the information took (c for client, s for server, r for remote). The identifier describes the contents of the field, and the prefix(header) value includes the corresponding HTTP header. For example, cs-method is in the prefix-identifier format and describes the method sent from client to server, while time is a plain identifier denoting the time at which the session ended.
Example 5-2 shows simple outputs from CLF, Combined Log Format, and ELF. Each event is a single line.
#CLF
192.168.1.1 - - [2012/Oct/11 12:03:45 -0700] "GET /index.html" 200 1294

# Combined Log Format
192.168.1.1 - - [2012/Oct/11 12:03:45 -0700] "GET /index.html" 200 1294 "http://www.example.com/link.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

#ELF
#Version: 1.0
#Date: 2012/Oct/11 00:00:00
#Fields: time c-ip cs-method cs-uri
12:03:45 192.168.1.1 GET /index.html
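A minimal ELF reader is straightforward if you assume directives of the form #Name: value and whitespace-delimited entry fields (this sketch does not handle quoted values containing spaces):

```python
def parse_elf(lines):
    """Sketch of an ELF reader: directives start with '#', and entries are
    whitespace-delimited, keyed by the most recent #Fields directive."""
    fields = None
    directives = {}
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            name, _, value = line[1:].partition(":")
            directives[name.strip()] = value.strip()
            if name.strip() == "Fields":
                fields = value.split()
            continue
        if fields:
            # Pair each token with its field name from the directive
            entries.append(dict(zip(fields, line.split())))
    return directives, entries
```

This is the self-description property of ELF in action: the log carries its own schema, so the reader needs no external configuration.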
Most HTTP logs are some form of CLF output. Although ELF is an expandable format, I find the need to carry the header around problematic in that I don’t expect to change formats that much, and would rather that individual log records be interpretable without this information. Based on principles I discussed earlier, here is how I modify CLF records:
Remove the rfc931 and authuser fields. These fields are artifacts and waste space.
Convert the date to epoch time and represent it as a numeric string. In addition to my general disdain for text over numeric representations, time representations have never been standardized in HTTP logfiles. You’re better off moving to a numeric format to ignore the whims of the server.
Incorporate the server IP address, the source port, and the destination port. I expect to move the logfiles to a central location for analysis, so I need the server address to differentiate them. This gets me closer to a five-tuple that I can correlate with other data.
Add the duration of the event, again to help with timing correlation.
Add the host header. In case I’m dealing with virtual hosts, this also helps me identify systems that contact the server without using DNS as a moderator.
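As a sketch of the date conversion, the snippet below assumes Apache's default CLF timestamp layout (%d/%b/%Y:%H:%M:%S %z); other servers will need a different format string, which is precisely why normalizing to epoch time pays off.

```python
from datetime import datetime

def clf_date_to_epoch(stamp):
    """Convert an Apache-style CLF timestamp, e.g.
    '11/Oct/2012:12:03:45 -0700', to integer epoch seconds.
    The format string is an assumption; adjust per server."""
    return int(datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z").timestamp())
```

Once every record carries epoch seconds, timing correlation against flow data and other logs becomes simple integer arithmetic.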
SMTP log messages vary by the mail transfer agent (MTA) used and are highly configurable. In this section, we discuss two log formats that are representative of the major Unix and Windows families: sendmail and Microsoft Exchange.
We focus on logging the transfer of email messages. The logging tools for these applications provide an enormous amount of information about the server’s internal status, connection attempts, and other data that, while enormously valuable, requires a book of its own.
Sendmail logs through syslog, and consequently is capable of sending an enormous number of informational messages besides the actual email transaction. For our purposes, we are concerned with two classes of log messages: messages describing connections to and from the mail server, and messages describing actual mail delivery.
By default, sendmail will send messages to /var/maillog, although the logging information it sends is controlled by sendmail’s internal logging level. The logging level ranges from 1 to 96; a log level of n logs all messages of severity 1 to n. Notable log levels include 9 (all message deliveries logged), 10 (inbound connections logged), 12 (outbound connections logged), and 14 (connection refusals logged). Of note is that anything above log level 8 is considered an informational log in syslog, and anything above 11 a debug log message.
A sendmail log line consists of five fixed values, followed by a list of one or more equates:
<date> <host> sendmail[<pid>]: <qid>: <equates>
where <date> is the date, <host> is the name of the host, sendmail is a literal string, <pid> is the sendmail process ID, and <qid> is an internal queue ID used to uniquely identify messages.
Sendmail sends at least two log messages when sending an email message, and the only way to group those messages together is through the qid. Equates are descriptive parameters given in the form <key>=<value>. Sendmail can send a number of potential equates, listed in Table 5-1.
Equate | Description |
---|---|
arg1 | Current sendmail implementations enable internal filtering using rulesets; arg1 is the argument passed to the ruleset. |
from | The from address of the envelope. |
msgid | The message ID of the email. |
quarantine | If sendmail quarantines a mail, this is the reason it was held. |
reject | If sendmail rejects a mail, this is the reason for rejection. |
relay | This is the name and address of the host that sent the message; in recipient lines, it’s the host that sent it, and in sender lines, the host that received it. |
ruleset | This is the ruleset that processed the message, and provides the justification for rejecting, quarantining, or sending the message. |
stat | The status of a message’s delivery. |
to | The email address of a target; multiple to equates can appear for a single message. |
For every email message received, sendmail generates at least two log lines. The first line is the receipt line, and describes the message’s point of origin. The final line, the sender line, describes the disposition of the mail, such as whether it was sent, quarantined, rejected, or bounced.
Sendmail will take one of four basic actions with a message: reject it, quarantine it, bounce it, or send it. Rejection is implemented by message filtering and is used for spam filtering; a rejected message is dropped. Quarantined messages are moved off the queue to a separate area for further review. A bounce means the mail was not sent to the target, and results in a nondelivery report being sent back to the origin.
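A minimal sketch of parsing and qid-based grouping, assuming the five-field line layout described above and comma-separated equates (the sample lines in the usage below are invented):

```python
import re
from collections import defaultdict

# Assumes the classic "<date> <host> sendmail[<pid>]: <qid>: <equates>" shape.
LINE_RE = re.compile(
    r'^(?P<date>\w{3}\s+\d+\s[\d:]+)\s+(?P<host>\S+)\s+'
    r'sendmail\[(?P<pid>\d+)\]:\s+(?P<qid>\w+):\s+(?P<rest>.*)$'
)

def parse_line(line):
    """Parse one sendmail log line into fixed fields plus an equate dict."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Equates are comma-separated key=value pairs
    rec["equates"] = dict(
        kv.split("=", 1) for kv in rec.pop("rest").split(", ") if "=" in kv
    )
    return rec

def group_by_qid(lines):
    """Group receipt and sender lines for the same message by queue ID."""
    messages = defaultdict(list)
    for line in lines:
        rec = parse_line(line)
        if rec:
            messages[rec["qid"]].append(rec)
    return messages
```

Grouping by qid is what lets you reconstruct a complete transaction: the receipt line supplies from and relay, while the sender line supplies to and stat.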
Exchange has one master log format for handling messages, the Message Tracking Log (MTL). Table 5-2 describes the fields.
Field name | Description |
---|---|
date-time | The ISO 8601 representation of the date and time. |
client-ip | The IP address of the host that submitted the message to the server. |
client-hostname | The hostname of the host that submitted the message to the server. |
server-ip | The IP address of the server. |
server-hostname | The hostname of the server. |
source-context | Optional information about the source, such as an identifier for the transport agent. |
connector-id | The name of the connector. |
source | Exchange enumerates a number of source identities for defining the origin of a message, such as an inbox rule, a transport agent, or DNS. The source field contains that identity. |
event-id | The event type. This is also an enumerable quantity, and includes a number of status messages about how the message was handled. |
internal-message-id | An internal integer identifier used by Exchange to differentiate messages. The ID is not shared between Exchange servers, so if a message is passed around, this value will change. |
message-id | The standard SMTP message ID. Exchange will create one if the message does not already have one. |
network-message-id | This is a message ID like message-id, but one that persists across all copies of the message. |
recipient-address | The addresses of the recipients; this is a semicolon-delimited list of names. |
recipient-status | A per-recipient status code indicating how each recipient was handled. |
total-bytes | The total size of the message in bytes. |
recipient-count | The size of the recipient-address list; that is, the number of recipients. |
related-recipient-address | Certain Exchange events (such as redirection) will result in additional recipients being added to the list; those addresses are added here. |
reference | This is message-specific information; the contents are a function of the type of message (defined in event-id). |
message-subject | The subject found in the Subject: header field. |
sender-address | The sender, as specified in the Sender: header field (or the From: field if Sender: is absent). |
return-path | The return email address, as specified in the MAIL FROM: command of the envelope. |
message-info | Event type–dependent message information. |
directionality | The direction of the message; an enumerable quantity. |
tenant-id | No longer used. |
original-client-ip | The IP address of the client. |
original-server-ip | The IP address of the server. |
custom-data | Additional data dependent on the type of event. |
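MTL files can be read as comment-prefixed CSV. The sketch below assumes a '#Fields:' comment names the columns, as in IIS-style logs; the sample record is invented for illustration and uses only a subset of the fields in Table 5-2.

```python
import csv
import io

def read_mtl(stream):
    """Yield one dict per record from a message tracking log, keyed by the
    column names given in the '#Fields:' comment line."""
    fields = None
    for line in stream:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].strip().split(",")
            continue
        if fields:
            # csv handles quoted values that contain commas
            yield dict(zip(fields, next(csv.reader([line]))))

# Invented sample for illustration
sample = io.StringIO(
    "#Fields: date-time,client-ip,event-id,recipient-address\n"
    "2012-10-11T19:03:45.000Z,192.168.1.1,RECEIVE,b@example.com\n"
)
records = list(read_mtl(sample))
```

As with ELF, the log carries its own schema, so a reader built this way keeps working when administrators change which fields are logged.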
A number of additional useful services may or may not be present on your network, and as part of mapping and situational awareness, you should identify them and determine what logging is possible. In this section, I provide a small list of the services I consider most critical to keep track of.
These are largely enterprise services, and any discussion on the proper generation and configuration of logs will be a function of the application’s configuration. In addition, for several of these services, you may need to reconfigure them to log the data you need. Many of these services will provide logging through syslog or the Windows Event Manager.
Turning on all the logging mentioned here is going to drown analysts in a lot of data, which is primarily going to be used for rare, high-risk events. Consequently, when developing a logging plan, you should consider policies and processes for increasing or decreasing targeted logging as needed.
The process of staging logging up or down will be a function of events or of criticality. In the former case, you may increase logging because a particular event or trigger has raised concerns; for example, if a new exploit is in the wild, you may stage up logging for services vulnerable to that exploit. In the latter case, you may always keep high-information logging for critical services, such as fileshares containing your intellectual property.
If you have any form of active directory or other user management (Microsoft Active Directory, OpenLDAP), this information should be available. Directory services will generally consist of a database of users, events that update the database itself (addition, removal, or updates of users), as well as login and logoff events.
Consider collecting a complete, periodic dump of the directory and keeping it somewhere where the ops and analysis teams can get their hands on it quickly, such as storing it in Redis. You can expect that analysts will regularly access this data for context.
Logon and logoff data should be sent as a low-priority stream directly to your main console. It is useful for annotating where users are in the system at any time, and will often be cross-referenced with other actions.
Besides HTTP, file transfer and file storage includes services such as SharePoint, NFS Mounts, FTP, anything using SMB, any code repositories (Git, GitHub, GitLab, SourceSafe, CVS, SVN), as well as web-based services such as Confluence. In the case of these services, you are most interested in monitoring users and times—who accessed the system, when they accessed the system, and how much they accessed.
Volumes are a good, if coarse, indicator of file transfer. A good companion metric is to check for locality anomalies (see Chapter 14). Fileshares inside an enterprise are accessed by people who have a job to do; the majority of them are going to visit the same files over and over, working with a limited subset.1
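As a sketch of such a locality check (the user names, paths, and the notion of a fixed per-user working set are simplifications for illustration):

```python
from collections import defaultdict

def build_working_sets(history):
    """Build each user's historical working set: the files they
    normally touch. history is an iterable of (user, path) pairs."""
    sets = defaultdict(set)
    for user, path in history:
        sets[user].add(path)
    return sets

def novelty_ratio(accesses, working_set):
    """Fraction of a day's accesses that fall outside the working set;
    a ratio near 1.0 suggests a locality anomaly worth a second look."""
    if not accesses:
        return 0.0
    new = sum(1 for path in accesses if path not in working_set)
    return new / len(accesses)
```

Systems that legitimately roam the fileshare, such as backup jobs and internal search-engine crawlers, will score high on this metric and should be excluded from it.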
Databases, including SQL servers such as Oracle, Postgres, and MySQL and NoSQL systems such as HDFS, should be tracked for data loss prevention and bulk transfer, just as with the file transfer systems. In addition, if possible, log the queries and check for anomalous query strings. In an enterprise environment, you should expect to see users rarely interacting with the console; rather, they should be using predictable sequences of SQL statements run through a form. Distance metrics (see Chapter 12 for more information) can be used to match these strings to look for anomalous queries such as a SELECT *.
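A crude sketch of such a distance check, using Jaccard distance over whitespace-delimited tokens; the template and the 0.6 threshold are invented for illustration, and a real deployment would learn templates from observed traffic and likely use a finer-grained metric.

```python
def jaccard_distance(a, b):
    """1.0 minus the Jaccard similarity of the two queries' token sets."""
    ta, tb = set(a.upper().split()), set(b.upper().split())
    return 1.0 - len(ta & tb) / len(ta | tb)

def is_anomalous(query, templates, threshold=0.6):
    """Flag a query that is far from every known-good template."""
    return all(jaccard_distance(query, t) > threshold for t in templates)

# Hypothetical template learned from form-driven traffic
templates = ["SELECT name, email FROM users WHERE id = ?"]
```

A form-driven query differs from its template only in parameter values, so its distance stays small; a hand-typed SELECT * shares few tokens with any template and stands out.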
Host logs can be transferred off their hosts in a number of ways, depending on how the logs are generated and on the capabilities of the operating system. The most common approaches involve using regular file transfers or the syslog protocol. A newer approach uses message queues to transport log information.
Most logging applications write to a rotating logfile (see, for example, the rotated system logs in “Accessing and Manipulating Logfiles”). In these cases, the logfile will be closed and archived after a fixed period and a new file will be started. Once the file is closed, it can be copied over to a different location to support analytics.
File transfer is simple. It can be implemented using SSH or any other copying protocol. The major headache is ensuring that the files are actually complete when copied; the rotation period for the file effectively dictates your response time. For example, if a file is rotated every 24 hours, then you will, on average, have to wait a day to get hold of the latest events.
The grandfather of systematic system logging utilities is syslog, a standard approach to logging originally developed for Unix systems that now comprises a standard, a protocol, and a general framework for discussing logging messages. Syslog defines a fixed message format and enables messages to be sent to logger daemons that might reside on the host or be remotely located.
All syslog messages contain a time, a facility, a severity, and a text message. Tables 5-3 and 5-4 describe the facilities and priorities encoded in the syslog protocol. As Table 5-3 shows, the facilities referred to by syslog comprise a variety of fundamental systems (some of them largely obsolete). Of more concern is what facilities are not covered—DNS and HTTP, for example. The priorities (in Table 5-4) are generally more germane, as the vocabulary for their severity has entered into common parlance.
Value | Meaning |
---|---|
0 | Kernel |
1 | User level |
2 | Mail system |
3 | System daemons |
4 | Security/authorization |
5 | Syslog internal messages |
6 | Line printer |
7 | Network news |
8 | UUCP |
9 | Clock daemon |
10 | Security/authorization |
11 | FTP daemon |
12 | NTP subsystem |
13 | Log audit |
14 | Log alert |
15 | Clock daemon |
16–23 | Reserved for local use |
Value | Meaning |
---|---|
0 | Emergency: system is unusable |
1 | Alert: action must be taken immediately |
2 | Critical: critical conditions |
3 | Error: error conditions |
4 | Warning: warning conditions |
5 | Notice: normal but significant condition |
6 | Informational: informational messages |
7 | Debug: debugging information |
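On the wire, facility and severity are packed into the PRI value that leads every syslog message: PRI = facility × 8 + severity, per RFC 5424. For example, a security/authorization (facility 4) critical (severity 2) message carries <34>. A sketch:

```python
def encode_pri(facility, severity):
    """Pack facility and severity into a syslog PRI value (RFC 5424)."""
    return facility * 8 + severity

def decode_pri(pri):
    """Unpack a PRI value into (facility, severity)."""
    return divmod(pri, 8)
```

Decoding PRI is often the first step in normalizing syslog feeds from heterogeneous hosts, since it lets you filter by severity before parsing the free-text message body.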
Syslog’s reference implementations are UDP-based, and the UDP standard results in several constraints. Most importantly, UDP datagram length is constrained by the MTU of the layer 2 protocol carrying the datagram, effectively imposing a hard limit of about 1,450 characters on any syslog message. The syslog protocol itself specifies that messages should be less than 1,024 characters, but this is erratically enforced, while the UDP cutoff will affect long messages. In addition, syslog runs on top of UDP, which means that when messages are dropped, they are lost forever.
The easiest way to solve this problem is to use TCP-based syslog, which is implemented in the open source domain with tools such as syslog-ng and rsyslog. Both of these tools provide TCP transport, as well as a number of other capabilities such as database interfaces, the ability to rewrite messages en route, and selective transport of syslog messages to different receivers. Windows does not support syslog natively, but a number of commercial applications provide similar functionality.
1 Keep track of systems that do file transfers, and internal search engines. Both will be false positives.