Chapter 3. Email

The vast majority of the scams that you might want to investigate are initiated by an email message. So it is only natural that these messages are a major target for forensic analysis. In this chapter, I will show you how to dissect message headers and distinguish between the real and forged information contained therein. I will show how you go about tracking back spam to its source and the approaches that spammers use to make that as difficult as possible. Then I will move on to the contents of email messages and show how you can safely extract attachments that may contain viruses or spyware.

Message Headers

The content of an email message is what first gets our attention but, in terms of forensics, the header block is the most interesting. Every message contains a series of header lines that instruct mail servers where to deliver it, tell mail readers how to process its content, and provide a record of the path taken by the message from its source to its destination. One reference on headers is RFC 2076 (Common Internet Message Headers), which can be found at http://rfc.net/rfc2076.html, but, as you will see, there is considerable variation in their format.

The fundamental flaw with email is that certain headers can be forged. This is what allows spam and all the other scams to flourish, even in the face of sophisticated filters and detection software. In looking at messages that are of interest to you, you need to understand what header information can be forged and what you can rely on. Let’s start by looking at the headers for a simple, legitimate message. The following is an email sent from my machine to a Gmail account at Google. I have deleted a few of the Gmail-specific headers and modified the addresses to protect privacy.

    Delivered-To: [email protected]
    Return-Path: <[email protected]>
    Received: by 10.54.18.32 with SMTP id 32cs2945wrr;
            Fri, 25 Feb 2005 15:27:07 -0800 (PST)
    Received: by 10.54.7.40 with SMTP id 40mr65062wrg;
            Fri, 25 Feb 2005 15:27:05 -0800 (PST)
    Received: from gateway.craic.com
            (gateway.craic.com [208.12.16.5])
            by mx.gmail.com
            with ESMTP id 9si124319wrl.2005.02.25.15.26.58;
            Fri, 25 Feb 2005 15:27:04 -0800 (PST)
    Received: from [192.168.2.7] (nexus.craic.com [208.12.16.2])
            by gateway.craic.com (8.11.6/8.11.6)
            with ESMTP id j1PNQvl31568
            for <[email protected]>;
            Fri, 25 Feb 2005 15:26:58 -0800
    Message-ID: <[email protected]>
    Date: Fri, 25 Feb 2005 15:26:57 -0800
    From: ABC <[email protected]>
    User-Agent: Mozilla Thunderbird 0.9 (X11/20041103)
    X-Accept-Language: en-us, en
    MIME-Version: 1.0
    To: [email protected]
    Subject: Test
    Content-Type: text/plain; charset=ISO-8859-1; format=flowed
    Content-Transfer-Encoding: 7bit

    This is a test

These headers are usually hidden in common email clients, but you can reveal them easily enough—for example, by selecting View Message Source in Mozilla Thunderbird.

Message headers fall into five classes. The basic addressing information is contained in the From and To lines, and information about the content is contained in the Subject line and those that begin with Content. The path taken from the sender through to delivery is recorded in the Received lines, and the unique identity of this message is captured in the Date and Message-ID lines. Ancillary information that might be useful for the email client is usually found in headers that begin with X-. The specific headers can vary widely according to the email client that was used to create the messages.

Looking at this example, you see that the sender named on the From line has sent a simple test message to the Gmail address on the To line. From the User-Agent header, you know that user ABC sent the message from the Mozilla Thunderbird email client.

The most interesting headers are the Received headers. In a legitimate email, each one of these represents a step taken by the message between two mail servers, or between a mail client and a server. With each additional step taken, a new header is added to the top of the message. By looking at these headers, you should be able to trace the complete path taken by a message from its source to its destination and vice versa.
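As a rough sketch of this bottom-to-top reading order, the following Python uses the standard email package to pull out the Received headers and reverse them so that the oldest hop comes first. The sample is an abbreviated copy of the headers shown earlier, with the addresses left out; the function name received_hops is my own, not part of any library.

```python
from email import message_from_string

# Abbreviated headers from the example message; addresses are omitted.
RAW = """\
Received: by 10.54.18.32 with SMTP id 32cs2945wrr;
        Fri, 25 Feb 2005 15:27:07 -0800 (PST)
Received: by 10.54.7.40 with SMTP id 40mr65062wrg;
        Fri, 25 Feb 2005 15:27:05 -0800 (PST)
Received: from gateway.craic.com
        (gateway.craic.com [208.12.16.5])
        by mx.gmail.com
        with ESMTP id 9si124319wrl.2005.02.25.15.26.58;
        Fri, 25 Feb 2005 15:27:04 -0800 (PST)
Subject: Test

This is a test
"""

def received_hops(raw_message):
    """Return the Received headers oldest first, each collapsed to one line."""
    msg = message_from_string(raw_message)
    hops = msg.get_all("Received", [])
    # Each MTA adds its header at the top, so the oldest hop is the last one.
    return [" ".join(hop.split()) for hop in reversed(hops)]

hops = received_hops(RAW)
for number, hop in enumerate(hops, 1):
    print(number, hop)
```

Reversing the list simply mirrors the way the headers accumulate; it says nothing about whether any individual header is truthful.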

Servers in this context are called Mail Transfer Agents (MTAs), and the majority of these communicate through either the Simple Mail Transfer Protocol (SMTP) or its extended form, Extended SMTP (ESMTP). In spite of Internet standards, the format used for Received headers is variable. In most cases, it takes this form:

    Received: from string (hostname [host IP address])
              by recipient host
              with protocol id message ID
              for recipient;
              timestamp

string

This is typically the hostname of the sending MTA, but it can be anything.

hostname

The hostname of the MTA if it can be determined by reverse DNS lookup on the IP address.

host IP address

The IP address of the sending MTA.

recipient host

This is typically the hostname of the receiving MTA. It is sometimes followed by the version of the MTA software running on that host.

protocol

The mail transfer protocol that was used for the transfer, such as SMTP.

message ID

A unique identifier for this transfer that can be searched for in the log files on the recipient MTA.

recipient

The email address of the recipient.

timestamp

The date and time at which the message was received by the MTA.

Note the use of parentheses and square brackets around the sending MTA. This will help distinguish truth from fiction when you look at forged headers.

Look at this example:

    Received: from biotech.craic.com (biotech.craic.com [208.12.16.3])
            by gateway.craic.com (8.11.6/8.11.6)
            with ESMTP id j21IBV720506
            for <[email protected]>;
            Tue, 1 Mar 2005 10:11:31 -0800

The numeric IP address in the square brackets defines the sending MTA, and a reverse DNS lookup by the receiving MTA has identified this machine as biotech.craic.com. That hostname is repeated in the string that precedes the parentheses. The message has been received by gateway.craic.com. No IP address is needed for the receiver, since that MTA implicitly knows its own hostname. The version of the MTA software it runs (8.11.6) follows in parentheses. The protocol used is ESMTP, and the unique ID that follows should also appear in the log files on that server. The format of these IDs is arbitrary. This header includes the intended recipient for this message, although many headers do not. Finally, there is a timestamp that tells when the message was transferred, including the offset from Greenwich Mean Time (GMT), which in this case is minus eight hours because the server is located in Seattle.
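To make the field layout concrete, here is a minimal sketch that picks the common form of a Received header apart with a regular expression. The pattern and the function name parse_received are assumptions of mine that cover only the layout described above; real headers vary enough that a failed match should be expected from time to time. The for clause is left out of the sample.

```python
import re

# Covers only the common layout:
#   from <string> (<hostname> [<ip>]) by <recipient host> ...
# The hostname is optional because a failed reverse lookup leaves only the IP.
RECEIVED_RE = re.compile(
    r"from\s+(?P<helo>\S+)\s+"
    r"\((?P<rdns>[^\s\[\]]+)?\s*\[(?P<ip>[0-9.]+)\]\)\s+"
    r"by\s+(?P<by>\S+)"
)

def parse_received(value):
    """Return the sender string, reverse DNS name, IP and receiver, or None."""
    flat = " ".join(value.split())  # undo header folding
    match = RECEIVED_RE.search(flat)
    return match.groupdict() if match else None

EXAMPLE = """from biotech.craic.com (biotech.craic.com [208.12.16.3])
        by gateway.craic.com (8.11.6/8.11.6)
        with ESMTP id j21IBV720506;
        Tue, 1 Mar 2005 10:11:31 -0800"""

fields = parse_received(EXAMPLE)
print(fields)

# A header where the reverse lookup failed has nothing before the brackets.
no_rdns = parse_received("from stender.com ([200.217.130.152]) by gateway.craic.com")
print(no_rdns)
```

Keeping the string before the parentheses separate from the two values inside them matters because only the values inside were determined by the receiving server.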

The string that precedes the parentheses on the from line is a favorite target for forging and it is worth understanding where this comes from. An SMTP or ESMTP transfer is initiated when the sending MTA identifies itself to the receiver. It does so by sending the string HELO, or EHLO in the case of ESMTP, followed by an identifying string. This can be anything the sender chooses and is the string that appears in the Received header. If the source of the message is a Linux system, then the default value for this string is taken from that system’s hostname in the file /etc/hosts. Changing that value will forge the apparent source of a message from that system.

Now that you know how to read these headers, you can retrace the steps taken by the example message, starting with the last Received header and working back to the first. The message appears to be sent from nexus to gateway. This is only partly correct. nexus happens to be a firewall between an internal network and the Internet. So gateway sees nexus as the source even though the real origin is behind that firewall. In this instance, you can identify that machine from the preceding string [192.168.2.7], but that will not generally be the case. The message is transferred to mx.gmail.com, then to IP address 10.54.7.40, and finally to 10.54.18.32. You can tell that these two addresses are part of Google’s private network because those numbers fall within one of the ranges of IP addresses that are reserved for internal networks.
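Checking whether an address falls in one of the reserved ranges is easy to automate. This small sketch uses Python's standard ipaddress module on the addresses from the example; the helper name is_internal is mine.

```python
import ipaddress

def is_internal(address):
    """True if the address lies in a range reserved for private networks."""
    return ipaddress.ip_address(address).is_private

ADDRESSES = ["192.168.2.7", "208.12.16.2", "208.12.16.5",
             "10.54.7.40", "10.54.18.32"]

results = {addr: is_internal(addr) for addr in ADDRESSES}
for addr, internal in results.items():
    print(addr, "private" if internal else "public")
```

A private address in a Received header tells you that the hop happened inside someone's network, but not whose network it was; that has to come from the surrounding headers.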

Look at the time difference between the first and last header and see that it took nine seconds to deliver the message. Timestamps are extremely useful in assessing the performance of mail transfers, and a discrepancy in a series of them is often a clear indication that one or more headers have been forged. But timestamps are only as accurate as the clocks from which they derive. Keeping your system clocks synchronized using the Network Time Protocol (NTP) is strongly encouraged. You can find more information about this at http://www.ntp.org.
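The nine-second figure can be checked with a short sketch that parses the timestamp at the end of the oldest and newest Received headers. It assumes the timestamp follows the last semicolon, as it does in the examples in this chapter; the function name hop_time is my own.

```python
from email.utils import parsedate_to_datetime

def hop_time(received_value):
    """Parse the timestamp that follows the last semicolon of a Received header."""
    stamp = received_value.rsplit(";", 1)[1]
    stamp = stamp.split("(")[0].strip()  # drop a trailing comment such as (PST)
    return parsedate_to_datetime(stamp)

oldest = hop_time(
    "from [192.168.2.7] (nexus.craic.com [208.12.16.2]) "
    "by gateway.craic.com (8.11.6/8.11.6) with ESMTP id j1PNQvl31568; "
    "Fri, 25 Feb 2005 15:26:58 -0800"
)
newest = hop_time(
    "by 10.54.18.32 with SMTP id 32cs2945wrr; "
    "Fri, 25 Feb 2005 15:27:07 -0800 (PST)"
)

delay = newest - oldest
print(delay.total_seconds(), "seconds")
```

Because each timestamp carries its own offset, the subtraction remains valid when hops are in different time zones. A negative delay between consecutive hops is the kind of discrepancy worth a closer look, bearing in mind that it may only reflect a badly set clock.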

There is one other header to which you need to pay special attention. As well as the unique ID assigned by each MTA along the delivery route, the message itself has a second ID that is carried with it throughout its passage. For example:

    Message-ID: <[email protected]>

This Message-ID tag was assigned by the mail client used to create the message. These IDs allow you to search for a given message in the log files on multiple servers.

Take a look at some of the legitimate messages in your own Inbox and get a feel for the variation in headers and the steps that messages have to take to get from one place to another.

Forged Headers

Now consider an example where the headers have been forged to make the message appear to come from another source. The following headers are taken from a message that purported to come from the FBI, telling me that I had been visiting illegal web sites. In fact, the message contained a virus and was sent from an infected computer.

    Return-Path: <[email protected]>
    Received: from nvwyu.gov (i528C1073.versanet.de [82.140.16.115])
            by gateway.craic.com (8.11.6/8.11.6)
            with SMTP id j1R0aU702669
            for <[email protected]>; Sat, 26 Feb 2005 16:36:30 -0800
    From: [email protected]
    To: [email protected]
    Date: Sat, 26 Feb 2005 23:17:43 GMT
    Subject: You visit illegal websites
    Message-ID: <[email protected]>
    [...]

At face value, this looks like a message from the FBI with the From, Return-Path, and Message-ID headers all referring to the domain fbi.gov. But the single Received header tells a different story. The message was received by gateway and because I control this machine, I trust it to report the correct IP address of the sending MTA. The hostname within the parentheses is the result of a reverse DNS lookup by my server, so I also trust this. This is clearly not an FBI host. The domain is owned by an ISP located in Germany, and the alphanumeric string used as the hostname (i528C1073) has the look of an address assigned to a subscriber’s computer, most likely at home. Preceding the parentheses is a fictitious domain, nvwyu.gov, which has been made up by the sender.

This illustrates how some email headers are easy to forge whereas certain others, generated by trusted servers, can be relied upon. Being able to distinguish between the two is an important skill.

Because the message was generated by a virus infection somewhere on the Internet, there was no need for the originator to hide the identity of the machine that sent the message. Additionally, only one step was necessary to deliver the message, making it impossible to disguise the path it took. Things are very different in the case of spam, where there is perhaps a single source for the messages and the sender really wants to remain incognito. Here are the headers for a piece of spam that touts a pornographic web site:

    Return-Path: <[email protected]>
    Received: from stender.com ([200.217.130.152])
            by gateway.craic.com (8.11.6/8.11.6)
            with ESMTP id j1MHOWl20248
            for <[email protected]>;
            Tue, 22 Feb 2005 09:24:36 -0800
    Received: from inkk.tk (MX-HOST.DOT.tk [195.20.32.78])
            by stender.com with esmtp
            id 1FAAC78CA3 for <[email protected]>;
            Tue, 22 Feb 2005 09:24:37 -0800
    Message-ID: <[email protected]>
    From: "Aggravation E. Envelops" <[email protected]>

The message apparently originated at inkk.tk and was delivered to gateway.craic.com by way of stender.com. But things are not as they appear to be. Look at the first line of the top Received header. This was added by gateway, which I trust. The IP address in this line has to have been correct at the time the message was sent; otherwise, the transfer could not have happened. My server has tried to perform a reverse DNS lookup on 200.217.130.152 and failed. Using whois, I can infer that this server is based in Brazil. There is a hostname on that line (stender.com), but it is outside the parentheses. If I run dig on that, it returns 216.10.106.149, which, in turn, maps to a network based in Massachusetts. Now that is quite a discrepancy, and it indicates that this hostname is forged.
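The comparison I made by hand with dig can be sketched as a forward lookup of the claimed hostname followed by a check against the address that actually connected. In this sketch the resolver is passed in as a parameter so that the example runs without touching the network; the stub below simply returns the address that dig reported at the time, and the function names are my own. DNS records change, so a live lookup today would say nothing about 2005.

```python
import socket

def helo_resolves_to(helo, connecting_ip, resolve=socket.gethostbyname_ex):
    """True if a forward lookup of the claimed hostname includes connecting_ip."""
    try:
        _name, _aliases, addresses = resolve(helo)
    except OSError:
        # The name does not resolve at all, so it cannot confirm the sender.
        return False
    return connecting_ip in addresses

def stub_resolver(hostname):
    """Stand-in for DNS, returning the result quoted in the text."""
    return (hostname, [], ["216.10.106.149"])

matches = helo_resolves_to("stender.com", "200.217.130.152",
                           resolve=stub_resolver)
print("consistent" if matches else "hostname does not match connecting IP")
```

A mismatch is a strong hint rather than proof. Machines behind firewalls or on shared hosting can legitimately announce a name that does not resolve back to the connecting address.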

Once I have encountered an MTA that is forging its identity, then I can no longer trust anything about the Received headers that describe earlier steps in the delivery route. Any professional spammer is going to be using a specialized MTA that can forge these headers to look like anything they want. Most likely they have purchased commercial software that is intended to perform precisely this task.

Forging Your Own Headers

There are good reasons why you might want to forge the headers of your own messages. I have several scripts that run as root and send out notification emails whenever certain events take place. I don’t want people replying to root, so I forge the From address to either my address or that of the recipient. This is a useful technique that illustrates just how easy it is to generate spam.

You can try this for yourself using sendmail on a Unix system. Regular mail clients like Outlook and Thunderbird are not set up to do this. Start by writing a simple message to yourself in a file using an editor. Put your address in the To line and set the From line to whatever you like. In this example, I am going to impersonate someone at O’Reilly. Add a Reply-To header and even make up your own Message-Id. For example:

    To: [email protected]
    From: [email protected]
    Reply-To: [email protected]
    Message-Id: <[email protected]>
    Subject: Test

    Hello World

Tell sendmail to read those headers from the file rather than the command line by giving it the -t flag.

            % /usr/lib/sendmail -t < test_message

The message as received should look similar to this:

    Return-Path: <[email protected]>
    Received: from biotech.craic.com (biotech.craic.com [208.12.16.3])
            by gateway.craic.com (8.11.6/8.11.6)
            with ESMTP id j21NSQ721278
            for <[email protected]>; Tue, 1 Mar 2005 15:28:26 -0800
    Date: Tue, 1 Mar 2005 15:28:21 -0800
    Reply-To: [email protected]
    Message-Id: <[email protected]>
    To: [email protected]
    From: [email protected]
    Subject: Test

    Hello World

While this looks totally convincing when viewed in a mail client, the headers still show the correct Return-Path and hostname for the sender. You can fix the first of these problems by specifying the envelope sender address on the command line with the -f option, thus:

            % /usr/lib/sendmail -t -f [email protected] < test_message

To change the hostname, you need to edit the line in the /etc/hosts file that contains the sender’s IP address. The fake hostname should precede the real one, like this:

    208.12.16.3   bogus.oreilly.com   biotech.craic.com

With both of these in place, the headers of the received message are close to what you want:

    Return-Path: <[email protected]>
    Received: from bogus.oreilly.com (biotech.craic.com [208.12.16.3])
            by gateway.craic.com (8.11.6/8.11.6)
            with ESMTP id j21Mui721208
            for <[email protected]>; Tue, 1 Mar 2005 14:56:44 -0800
    Date: Tue, 1 Mar 2005 14:56:44 -0800
    Reply-To: [email protected]
    Message-Id: <[email protected]>
    To: [email protected]
    From: [email protected]
    Subject: Test

All I would need to do to make this a near-perfect forgery is remove the reverse DNS entry for biotech. It’s that easy.

Tracking the Spammer

Before you take this newfound knowledge and start your own spam empire, bear in mind that spammers are being identified and prosecuted with increasing success. How are the authorities able to track these people down?

What they have that you and I do not is access to the ISPs. Starting with an individual spam message, they can slowly but surely work their way back via the mail server logs at multiple ISPs to identify the original source. It is laborious work, justifying to each ISP that they need to provide access to their logs, search them, document the evidence, and then move one more step back through the chain. That effort goes up by at least an order of magnitude every time the delivery route includes a server in a foreign country. Often that will stop an investigation in its tracks—a fact that has not gone unnoticed by the professional spammers.

sendmail, as well as most other MTAs, can be configured to record information about the messages it handles in log files. The default level of logging in sendmail captures pretty much the same information as the Received headers in the messages themselves. But there is much less opportunity for forgery in these logs, at least as long as the server has not been compromised. More importantly, by examining log files, we might be able to discover groups of related messages being transferred at the same time, indicative of a coordinated spam campaign rather than a single unsolicited message. Distinctions like this are very important in legal proceedings related to spam.

By way of an example, consider the MTA log entries that relate to the forged email that we just created in the previous section. We begin on gateway, the MTA that received the delivered message. A typical location for these log files on a Unix or Mac OS X system is /var/log. We can use the message ID generated on that server to find the matching records.

            % grep j21Mui721208 /var/log/maillog
    Mar  1 14:56:44 gateway sendmail[21208]: j21Mui721208:
         from=<[email protected]>, size=286, class=0, nrcpts=1,
         msgid=<[email protected]>, proto=ESMTP, daemon=MTA,
         relay=biotech.craic.com [208.12.16.3]
    Mar  1 14:56:44 gateway sendmail[21209]: j21Mui721208:
         to=<[email protected]>, delay=00:00:00, xdelay=00:00:00,
         mailer=local, pri=30022, dsn=2.0.0, stat=Sent

Every transfer results in two log file records. The first one records the arrival of the message from biotech, including the address of the sender and the message-specific unique ID. The second entry records the delivery of this message to the mailbox of the recipient. The string stat=Sent is the status of this delivery attempt, which was successful. Both records contain the server-specific message ID, but only the first contains the message-specific ID. That is important when you move to the machine biotech and search its mail log. You don’t have the server-specific ID, so you have to search for the message-specific ID. That returns only one record, but you can locate the server-specific ID from that and use that to get the pair.

    Mar  1 14:56:44 biotech sendmail[16099]: j21Muir16099:
         [email protected], size=158, class=0, nrcpts=1,
         msgid=<[email protected]>, relay=root@localhost
    Mar  1 14:56:44 biotech sendmail[16102]: j21Muir16099:
         [email protected], [email protected] (0/0),
         delay=00:00:00, xdelay=00:00:00, mailer=esmtp, pri=30158,
         relay=craic.com. [208.12.16.5], dsn=2.0.0, stat=Sent
         (j21Mui721208 Message accepted for delivery)

The first record here contains the string relay=root@localhost. The term localhost is the default name any Unix machine uses to refer to itself, indicating that the message originated on this machine, rather than being relayed from another source. Also, you can see that the real identity of the sender was user root. The second record reports that the message was sent to gateway and that it was accepted. So with a few simple steps, you have uncovered that the message that claimed to come from an O’Reilly address in fact came from the root account on biotech.

Bear in mind that this is a very simple example. There are many ways in which spammers can make tracing the source of their messages difficult or impossible.
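The two-step search described in this section, first by message-specific ID and then by server-specific ID, can be sketched as follows. The sketch assumes that each log record occupies one physical line, which is how sendmail writes them (the listings above are wrapped for the page). The log lines are hypothetical stand-ins modeled on the biotech records, with placeholder addresses and Message-IDs; the function names are my own.

```python
import re

# The queue ID is the token that follows "sendmail[pid]:" in each record.
RECORD_RE = re.compile(r"sendmail\[\d+\]:\s+(\w+):")

def queue_ids_for(lines, message_id):
    """Step 1: find the server-specific IDs of records naming a Message-ID."""
    needle = "msgid=<%s>" % message_id
    ids = []
    for line in lines:
        match = RECORD_RE.search(line)
        if match and needle in line and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids

def records_for(lines, queue_id):
    """Step 2: collect every record that carries a given server-specific ID."""
    wanted = []
    for line in lines:
        match = RECORD_RE.search(line)
        if match and match.group(1) == queue_id:
            wanted.append(line)
    return wanted

# Hypothetical records; addresses and Message-IDs are placeholders.
LOG = [
    "Mar  1 14:56:44 biotech sendmail[16099]: j21Muir16099: "
    "from=root, size=158, class=0, nrcpts=1, "
    "msgid=<example-id@example.invalid>, relay=root@localhost",
    "Mar  1 14:56:44 biotech sendmail[16102]: j21Muir16099: "
    "to=<someone@example.invalid>, delay=00:00:00, mailer=esmtp, "
    "stat=Sent (j21Mui721208 Message accepted for delivery)",
    "Mar  1 15:02:10 biotech sendmail[16150]: j21N2Ar16150: "
    "from=root, size=90, msgid=<other-id@example.invalid>, "
    "relay=root@localhost",
]

ids = queue_ids_for(LOG, "example-id@example.invalid")
pair = records_for(LOG, ids[0])
print(ids)
print(len(pair), "records")
```

In practice you would read the lines from the mail log under /var/log, which generally requires root privileges on the server in question.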

Viruses, Worms, and Spam

In some cases, the spammers have been able to hijack the computers of unsuspecting users on the Internet, either by a targeted attack or through virus infections. The Sobig series of worms is widely believed to be an example of this. This family of worms was disseminated across the Internet beginning in 2003. They showed a clear evolution in their design from the first (Sobig.A) through the sixth (Sobig.F), in terms of their ability to sidestep the defenses that were quickly raised against them. That evolution also appears to reflect improvements in the secondary function of the worm, which was to install email proxy servers on infected computers.

Having access to a network of these proxy servers is of great value to the spammers. Not only do they greatly reduce the chance that their identity will be revealed, but by constantly switching between proxies, they can prevent their emails being rejected by the spam blacklist servers. These keep track of machines that have sent large amounts of spam. If any given machine sends only a small number of messages, then it will never be blacklisted.

The evolution of Sobig through its fifth incarnation is summarized nicely in a report by the LURHQ Threat Intelligence Group, which can be found at http://www.lurhq.com/sobig-e.html. For a more detailed technical analysis, written by a group of analysts who have chosen to remain anonymous, you might find this document of interest: http://spamkings.oreilly.com/WhoWroteSobig.pdf. It offers a fascinating insight into the world of virus tracking and even names the individual whom the authors believe created the worm.

The networks of compromised machines have been termed botnets, with individual computers called zombies or bots. Their implications for computer security go beyond spamming to include distributed denial-of-service attacks on target machines and networks. The Honeynet Project and Research Alliance has published a detailed whitepaper about botnets (http://www.honeynet.org/papers/bots/).

That level of analysis is beyond the scope of this book, but we can use our forensic skills to look at sets of related spam messages and perhaps infer something about the software used to generate the email.

In the face of increasingly sophisticated spam-blocking software, spammers are forced to continually generate unique email messages. Anyone who looks at spam messages will be familiar with the many ways of intentionally misspelling Viagra, oxycontin, etc., along with all the extraneous text that is used to get past spam filters. A similar approach is taken to the message headers. The goal is to continually change the headers so that spam filters can never determine a signature that clearly indicates spam. Most bulk mailers now include this feature. However, while specific strings may be continually changed, the algorithms used to generate them do not and they can serve as unique signatures by themselves. This is an ongoing battle between bulk mailers and spam filters, but you can place yourself at the front line with some simple analyses.

In the earlier section “Forged Headers,” I showed the headers for a spam message about a pornography site. That was one example of a series of similar messages that are clearly from the same source. At the time of this writing, I receive one or two new messages from this series every hour. No two messages have the same sender, but all senders have names like Reuse L. Idahoan, Aggravation E. Envelops, Hatching B. Saunter, and so forth. Right away I can see a simple algorithm at work. Every sender consists of a forename, a middle initial, and a last name. The software probably performs random lookups in a dictionary of words. Similar algorithms are used to generate other headers. The content boundary string, the headers with the X- prefix, and a forged Received header all show clear patterns across the examples.

Most striking is the pattern contained in the Message-ID headers, of which eight examples are shown here.

    Message-ID: <[email protected]>
    Message-ID: <[email protected]>
    Message-ID: <[email protected]>
    Message-ID: <[email protected]>
    Message-ID: <[email protected]>
    Message-ID: <[email protected]>
    Message-ID: <[email protected]>
    Message-ID: <[email protected]>
                     #####   #        #        #

The last line shows a hash mark wherever a character has been conserved in a specific position through all these examples. The dollar signs in each line are of particular interest. They split the string into blocks of 12, 8, and 8 characters before the @ character. In itself, this is a clear signature for the mailing software being used here. It can be used to identify this software being used in other spam campaigns beyond this current onslaught of porn.

In fact, this pattern is so distinctive that I noticed it right away when I read the technical analysis of the Sobig worm that I mentioned earlier in this section. That report includes examples of the message headers generated by the Send-Safe bulk mailing software, all of which match the signature. That software is linked by the authors of the report to the Sobig worm and its installation of email proxy servers on infected machines. When I looked at the addresses of the machines that transferred these spam examples to my server, every one was different, and several had reverse DNS lookups that suggested they were personal machines on cable modems or DSL connections. This is strong evidence that this recent campaign is related to the Sobig infections and may be using the proxy servers created by that worm.

Like the rifling marks found on a bullet at a crime scene, patterns like this are able to link separate incidents in very specific ways.
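The hash-mark line under the Message-ID examples can be generated mechanically. The following sketch marks every position where all the strings agree and reports the lengths of the blocks between separator characters. The sample strings are short made-up stand-ins, not the real Message-IDs, and both function names are my own.

```python
def conserved_mask(strings, mark="#"):
    """Mark each position at which every string has the same character."""
    width = min(len(s) for s in strings)
    return "".join(
        mark if len({s[i] for s in strings}) == 1 else " "
        for i in range(width)
    )

def block_lengths(local_part, separator="$"):
    """Lengths of the blocks that the separator splits a string into."""
    return [len(block) for block in local_part.split(separator)]

# Made-up strings for illustration only.
SAMPLES = ["AB12$CD", "AX92$QD"]
mask = conserved_mask(SAMPLES)
for sample in SAMPLES:
    print(sample)
print(mask)

# A made-up local part with the 12, 8, 8 layout described above.
lengths = block_lengths("ABCDEFGHIJKL$12345678$abcdefgh")
print(lengths)
```

With only a handful of examples, a position can agree by chance, so the mask becomes more convincing as the number of messages grows.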

Message Attachments

While the direct content of a message is displayed clearly in our mail readers, to be read or deleted as we see fit, an attachment poses a dilemma. We cannot easily determine its contents without examining it, but that process alone can expose us to any computer virus that it might contain. This section will explain how you can safely extract the contents of a suspicious attachment and determine their function. Consider this email as an example:

    From: [email protected]
    To: [email protected]
    Subject: Re: Submit a Virus Sample
    Date: Sat, 15 Jan 2005 23:58:39 +0800

    The sample file you sent contains a new virus version of mydoom.j.
    Please clean your system with the attached signature.

    Sincerly,
     Robert Ferrew

    +++ Attachment: No Virus found
    +++ MessageLabs AntiVirus - www.messagelabs.com

Although that sounds vaguely convincing, I’m not going to trust an email from an antivirus company, Symantec, which appears to screen its messages with software from its competitor, MessageLabs. We can assume that the attached file, datfiles.zip, contains a virus or something equally nasty. How can we isolate the payload and figure out what it represents?

Warning

It should go without saying that you should not attempt any extraction or analysis of viruses, worms, or spyware on any Windows system.

On a Unix system, download the entire email message into a new directory and look at the text. Here are the relevant lines from our example. It has three parts: the mail headers, the text of the message, and a large block of encoded text.

    From: [email protected]
    To: [email protected]
    Subject: Re: Submit a Virus Sample
    Date: Sat, 15 Jan 2005 23:58:39 +0800
    MIME-Version: 1.0
    Content-Type: multipart/mixed;
            boundary="----=_NextPart_000_0016----=_NextPart_000_0016"

    This is a multi-part message in MIME format.

    ------=_NextPart_000_0016----=_NextPart_000_0016
    Content-Type: text/plain;
            charset="Windows-1252"
    Content-Transfer-Encoding: 7bit

    The sample file you sent contains a new virus version of mydoom.j.
    [...]

    ------=_NextPart_000_0016----=_NextPart_000_0016
    Content-Type: application/octet-stream;
            name="datfiles.zip"
    Content-Transfer-Encoding: base64
    Content-Disposition: attachment;
            filename="datfiles.zip"

    UEsDBAoAAAAAAEtqLzKjiB3egHMAAIBzAABTAAAAZG9jdW1lbnQudHh0ICAg
    ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg
    [...]
    ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIC5l
    eGVQSwUGAAAAAAEAAQCBAAAA8XMAAAAA

    ------=_NextPart_000_0016----=_NextPart_000_0016--

The Content-Type header line tells us that the message is in MIME format with multiple parts in potentially different formats:

    Content-Type: multipart/mixed;
            boundary="----=_NextPart_000_0016----=_NextPart_000_0016"

It also tells us the string that is used to mark the boundaries between the different parts. It doesn’t matter what the string is, as long as it doesn’t occur in the real text of any part. Typically these are long cryptic strings such as the one used here:

    ----=_NextPart_000_0016----=_NextPart_000_0016

Looking through the message, we can see three lines that match this string. These are the boundaries of the two parts to this email, which are the text of the message, followed by the encoded attachment. The third instance of the boundary string is slightly different. It ends with two dashes. This signifies that there are no more parts to the message after this.

Each part of the message, defined by these strings, has its own header lines that tell us what format it is in. The headers for the message part are:

    Content-Type: text/plain;
            charset="Windows-1252"
    Content-Transfer-Encoding: 7bit

These tell us this block of content is plain text in the Windows-1252 character set, a single-byte encoding that covers Western European languages. This would be different if the text used, say, Japanese characters. More interesting are the headers for the attachment:

    Content-Type: application/octet-stream;
            name="datfiles.zip"
    Content-Transfer-Encoding: base64
    Content-Disposition: attachment;
            filename="datfiles.zip"

Here the content type is application/octet-stream, which means that the part is arbitrary binary data of no more specific type. Because email is transported as text, binary data such as executables or images must be encoded as simple ASCII text before it can be transmitted. The particular encoding used here is given in the Content-Transfer-Encoding header and is Base64, which is perhaps the most common type. I talk a bit more about Base64 in Chapter 4 in the context of disguising information. The Content-Disposition header tells us the filename that should be used if and when the attachment block is saved to disk in the recipient’s email client. These headers are followed by a large block of indecipherable characters, which represents the encoded attachment.

To reveal what this contains, you need to decode this block. Your email client will do this for you but, as that is the way in which the payload of a virus is normally installed, you need to take a more cautious approach.

A simple and effective tool for this purpose is munpack, which was written by John G. Myers at Carnegie Mellon University. It can be downloaded, along with its partner mpack, from ftp://ftp.andrew.cmu.edu/pub/mpack/. The tools are compiled and installed on a Unix or Mac OS X system in a default location by the commands make and make install. Windows users will find binary executables at a number of download sites.

munpack is very easy to use. Given the name of the file containing your email, it will extract the attachment and report the name of the file it saved its contents to.

    % munpack virus_sample.eml
    datfiles.zip (application/octet-stream)

It actually creates two files: datfiles.zip and one called datfiles.desc. The latter contains the contents of the message part of the email.
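
If munpack is not available, much the same separation can be sketched with the email package in the Python standard library. This is an illustrative alternative, not the tool used above: the function name list_parts and the tiny SAMPLE message are made up for the example. Note that it only decodes the parts into memory and never writes anything to disk under the sender's chosen filename.

```python
import email
from email import policy

# A made-up two-part message shaped like the one discussed in the text.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b'Content-Type: text/plain; charset="Windows-1252"\r\n'
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"\r\n"
    b"See attached.\r\n"
    b"--XYZ\r\n"
    b'Content-Type: application/octet-stream; name="datfiles.zip"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b'Content-Disposition: attachment; filename="datfiles.zip"\r\n'
    b"\r\n"
    b"UEsDBA==\r\n"
    b"--XYZ--\r\n"
)

def list_parts(raw_bytes):
    """Return (content_type, filename, decoded_bytes) for each leaf part."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold other parts, not content
        # decode=True undoes the Content-Transfer-Encoding (Base64 here).
        payload = part.get_payload(decode=True)
        parts.append((part.get_content_type(), part.get_filename(), payload))
    return parts

for content_type, filename, payload in list_parts(SAMPLE):
    print(content_type, filename, len(payload), "bytes")
```

To examine a real message, you would read the saved file in binary mode and pass its bytes to list_parts in place of SAMPLE.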

Having successfully extracted the payload from its delivery mechanism, you can now focus on what it contains. The .zip suffix suggests that it is a zip archive containing one or more files. But why should you trust that? The standard Unix command file can help us here. It knows about a wide range of file types and uses several approaches to make a best guess. You simply pass it the filename:

    % file datfiles.zip
    datfiles.zip: Zip archive data, at least v1.0 to extract
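
One of the approaches that file uses is to compare the first few bytes of a file against a table of known signatures. The following Python sketch shows only that one idea with a handful of well-known signatures. The real file command consults a far larger database and applies many other tests, and the names SIGNATURES and guess_type are made up for this example.

```python
# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = {
    b"PK\x03\x04": "Zip archive",
    b"MZ": "MS-DOS/Windows executable",
    b"%PDF": "PDF document",
    b"\x89PNG": "PNG image",
}

def guess_type(data):
    """Guess a file type from its leading bytes, ignoring its name entirely."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"

print(guess_type(b"PK\x03\x04" + b"\x00" * 20))  # Zip archive
print(guess_type(b"MZ\x90\x00"))                 # MS-DOS/Windows executable
```

Because the check looks only at content, renaming an executable to something harmless-looking does not change the answer.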

This does indeed appear to be a zip file, so let’s unpack it and see what’s inside. unzip is a standard Unix program that will take care of this. Windows users can use an equivalent tool, such as winzip or pkunzip. If you want to play it safe, then create a new directory, move the zip file into that and unpack it there so as not to overwrite any other files that might have the same names. To be especially cautious, you can have unzip list the files first without extracting them using the -l option:

    % unzip -l datfiles.zip
    Archive:  datfiles.zip
      Length     Date   Time    Name
     --------    ----   ----    ----
        29568  01-15-05 13:18   document.txt
                               .exe
     --------                   -------
        29568                   1 file

This tells us the archive contains a single file called document.txt...or does it? Actually it is a single file called document.txt .exe, where the .txt and .exe are separated by 67 spaces. This trick is often used in virus or spyware attachments. By padding out the filename with whitespace, the creator hopes that you will not notice the .exe suffix that indicates that it is an executable. For the sake of readability, I’ve renamed the file to document.txt.exe in the following paragraphs.
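
A listing like this can also be checked automatically. The sketch below uses the zipfile module in the Python standard library to read the archive's directory without extracting anything, and flags names that contain a long run of whitespace or end in an executable suffix. The suffix list, the three-space threshold, and the function name suspicious_names are assumptions chosen for illustration.

```python
import io
import re
import zipfile

# Suffixes that Windows treats as runnable (an illustrative, incomplete list).
EXECUTABLE_SUFFIXES = (".exe", ".scr", ".pif", ".com", ".bat")

def suspicious_names(zip_bytes):
    """List member names that are padded with whitespace or look executable."""
    flagged = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # namelist() reads only the archive directory; nothing is extracted.
        for name in archive.namelist():
            padded = re.search(r"\s{3,}", name) is not None
            executable = name.lower().rstrip().endswith(EXECUTABLE_SUFFIXES)
            if padded or executable:
                flagged.append(name)
    return flagged
```

Given the bytes of the archive above, the function would report the single padded filename. Printing each name with repr() makes the embedded spaces visible.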

Now let’s throw caution to the wind and actually unzip the file and then run file on its product:

    % unzip datfiles.zip
    Archive:  datfiles.zip
     extracting: document.txt.exe
    % file document.txt.exe
    document.txt                                                                   .
    exe: MS-DOS executable (EXE), OS/2 or MS Windows

This confirms the suspicion that this is a Windows executable file. Now, we’re getting pretty close to what is most likely a virus. While it may have no effect on a Linux or Mac OS X system, I just don’t want to push my luck by trying to run the program and seeing what happens. And, of course, if you are doing this on a Windows system then don’t run it! Not only that, but if you use Samba to share filesystems between Unix and Windows, then make sure no one is able to run it from the Windows side by accident!

We can go a bit further without risking any damage. Although most of the content of an executable program is binary, there are often text strings embedded therein. These represent things such as error messages, library names, and so forth. We can look for these using another standard Unix program called strings. This will interpret a binary file as text and output any strings of at least four printable characters that it finds. You will want to pipe the output into more as it produces a lot of garbage, but hidden in there are real words and, sometimes, complete sentences. To see what it can reveal about a regular program, try it out on a standard Unix program:

    % strings /bin/sh | more

Running it on our suspect file produces a large amount of output, of which a sampling is shown here:

    % strings document.txt.exe | more
    !Windows Program
    KERNEL32.dll
    LoadLibraryA
    GetProcAddress
    bAZD$
    +;_+
    RyR
    [...]
    CU'l
    nfig9x.dql
    Protec
    KERN`L32.dql
    [...]

There is not a lot of recognizable text, but there are a few interesting things. The first few lines refer to a standard Windows library, KERNEL32.dll, and to two of its functions that a program uses to load other libraries at run time. Then we get into all the gobbledygook. But down near the bottom is the word “Protec”. That looks out of place and worth running through Google to see what is known about it. Sure enough, there is a worm called Protec.B listed on the web sites of antivirus companies, so perhaps this is an instance of that payload.
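
The core of what strings does is simple enough to sketch in a few lines of Python, which may be useful on a system where the command is not installed. This is an approximation rather than a reimplementation: it treats only the ASCII characters from space to tilde as printable, and the function name extract_strings is made up for the example.

```python
import re

def extract_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII characters found in data."""
    # \x20 to \x7e covers space through tilde, the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]

sample = b"\x00\x01KERNEL32.dll\x00ab\x00LoadLibraryA\xff"
for text in extract_strings(sample):
    print(text)
```

For a real file, you would read it in binary mode and pass the bytes to extract_strings. The two-character run "ab" in the sample is skipped because it is shorter than the minimum length.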

Windows users do not have the tools file or strings built in to their operating system. This can be addressed by installing the Cygwin package (available at http://www.cygwin.com/), which provides Windows equivalents of most common Unix command-line tools.

Delving any deeper into the dissection of viruses and worms would be beyond the scope of this book. But you can learn a lot by applying these simple Unix commands to the attachments that you come across in your Inbox. Look at a few examples of viruses or worms and you will notice similar approaches taken by their authors to their packaging and the naming of files. Even more interesting can be attachments that attempt to install spyware. Dissecting these can lead to a series of files that would, if they got the chance, install themselves on a Windows system and seriously impact its performance. To learn more about the disassembly of binary executables and similar techniques, you might want to look at Security Warrior by Cyrus Peikari and Anton Chuvakin (O’Reilly).

Message Content

From a forensics perspective, the content of a message is actually the least interesting part. If the message carries a virus or spyware, then the payload will be contained in the attachment. If it is a phishing attempt, then the web site it links to is where your interest will lie.

The experts in spam analysis and filtering can do a far better job than I at describing the techniques they use to classify messages and decide if they represent spam or not. This is a fascinating area that combines advanced computer science, with its statistical and pattern recognition algorithms, and practical software engineering that builds and deploys tools in an ongoing battle with the spammers.

There are three main approaches to dealing with spam. Here are resources for each of them that you might find useful.

Rule-based filtering looks for specific strings and signatures within a message and assigns a score based on the matches it finds. SpamAssassin is a leading open source tool that uses this approach (http://spamassassin.apache.org/).

Statistical filtering, using Bayesian analysis, looks at things like word frequencies in sets of messages that have been manually classified as spam or not, typically by the end user. As such, it reflects their personal interests and can adapt to changes in the types of email that an individual receives. This is the approach taken in the Thunderbird email client, among others. A good introduction to Bayesian filtering is this paper by Paul Graham: http://www.paulgraham.com/spam.html.

Block lists are the third approach. If spam can be traced back to a specific network address, then that address can be added to a Block List, or blacklist, of known spammers. A mail server can look up the address of each MTA that wants to transfer a message and automatically reject those that are on the list. The Spamhaus Block List is a leading example of this approach, and their web site is an excellent resource: http://www.spamhaus.org. This approach will become less effective in the face of proxy servers that were created by the Sobig worms. The problem facing block lists is that they can only react to addresses that have been used repeatedly to send spam. As I show in Chapter 11, spammers are able to use large networks of hijacked computers such that no one address is used enough to be included in the block lists.
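
Block lists of this kind are usually published through DNS: the mail server reverses the octets of the connecting address, appends the name of the list's zone, and looks up the result. The Python sketch below shows that convention. The function names are made up for the example, the zone passed in is the caller's choice, and a real deployment would need to follow the list operator's usage policy and check the returned address against the operator's documented codes.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def dnsbl_lookup(ip, zone):
    """Return the address the list answers with, or None if the name does not resolve.

    Caution: a resolver failure also yields None, so None is not proof
    that the address is absent from the list.
    """
    try:
        return socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return None

# 192.0.2.99 is a documentation address; sbl.example is a placeholder zone.
print(dnsbl_query_name("192.0.2.99", "sbl.example"))  # 99.2.0.192.sbl.example
```

By convention, a listed address resolves to an answer in the 127.0.0.0/8 range, and an unlisted one does not resolve at all.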

Believe it or not, not everyone receives spam. Should you be in that enviable position and want to see what you are missing out on, you can find an archive of the stuff at http://www.spamarchive.org/. This can also be a great resource for anyone wanting to test spam-filtering software.

I return to the subject of message content in Chapter 4, specifically discussing the many ways in which phishing attempts try to disguise the real URLs of their fake web sites. I will end this chapter with the speculation that some spam may not be what we think it is.

Is It Really Spam?

The amount of spam that I receive every day is absurd. All spam is stupid, but some is more stupid than others, and it amazes me how many emails I get from the widows of Sani Abacha, Yasser Arafat, and various oil company executives, all offering a piece of the action if I help them transfer their millions out of their respective countries. These are the so-called 419 advance fee scams that we are all familiar with. At this point almost everyone on the planet must know about the scam and so you would think this type of email would be on the decline. But I seem to get more of them every day. Perhaps there is more to it than meets the eye.

One theory is that some of these are not spam at all. Embedded within their usual colorful prose are hidden messages that will only be noticed by those who know where to look. The rest of us will treat the emails as spam and ignore them.

In principle, it’s a simple and effective way to broadcast secret messages to members of a criminal gang or terrorist group. Anyone monitoring Internet traffic, even if they focused on emails received by a single address, would find it difficult to distinguish one piece of fake spam from the torrent of real spam that many of us receive every day. Even having achieved that, it would be impossible to identify the intended recipient among the thousands of other people who received the same message.

Spy novels from the Cold War era were full of agents passing messages to one another via cryptic classified ads in the back pages of the Times. Fake spam could well be the modern equivalent.

The ways in which a secret message could be embedded in an email are countless. The message ID string could represent a phone number. The first letters of each line could form a sentence. The pixels of a photograph could contain hidden text. These are all examples of steganography, an approach to hiding information in plain sight that has been used since the days of ancient Greece. Whereas encryption makes the content of a message unreadable to everyone but the sender and the recipient, steganography hides the message within a larger block of information. The two approaches are complementary. Steganography has received a lot of attention in recent years as a way to embed information within photographs or audio tracks. For example, it is possible to change the low order bits of pixels in a photograph with no noticeable impact on the image quality. Algorithms exist that embed a message throughout the image and that can extract the message at a later date from a copy of the image, or even a fragment thereof in certain cases. The hidden message can represent a copyright statement and be used to track the illegal copying of images.
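
The low order bit technique can be sketched in Python as follows. This is the simplest possible variant, assuming the image is available as a flat sequence of 8-bit pixel values and that the recipient knows the message length in advance. It is not one of the robust algorithms mentioned above, and the names embed and extract are made up for the example.

```python
def embed(pixels, message):
    """Hide message bytes in the lowest bit of successive pixel values."""
    # Unpack each message byte into eight bits, most significant first.
    bits = [(byte >> i) & 1 for byte in message for i in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("message too long for this image")
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit  # clear the low bit, then set it
    return bytes(out)

def extract(pixels, length):
    """Recover length bytes from the lowest bits of the pixel values."""
    bits = [p & 1 for p in pixels[: length * 8]]
    return bytes(
        sum(bit << (7 - j) for j, bit in enumerate(bits[i:i + 8]))
        for i in range(0, len(bits), 8)
    )

cover = bytes(range(256))            # stand-in for real pixel data
stego = embed(cover, b"Meet me at 8")
print(extract(stego, 12))            # b'Meet me at 8'
```

Each pixel value changes by at most one, which is why the alteration is invisible to the eye. It also means that this naive scheme would not survive recompression or resizing of the image.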

Text is a poor substrate for steganography compared to images. If you mess with the bits of any character, then you get a different character and the text will not make any sense. Instead you need to define sets of equivalent words and phrases and use the information content of the hidden message to direct the selection from those alternatives. This might appear overly complicated, but you can experiment with the concept courtesy of the web site http://www.spammimic.com. SpamMimic is based on an idea by Peter Wayner and uses a grammar derived from sentences typically found in spam. On their web site, you can enter the text of your secret message and their algorithm will use it to assemble a realistic-looking piece of spam. The bits of information from your message are embedded throughout the resulting spam in such a way that it can be decoded by pasting the text back into the web site. The system has a very low capacity for embedded information (in contrast to a photograph, for example), so it works best with short messages. Here is an example of the spam it generates, giving the message “Meet me at 8”:

    Dear Friend , This letter was specially selected to
    be sent to you ! We will comply with all removal requests
    ! This mail is being sent in compliance with Senate
    bill 1621 ; Title 5 ; Section 303 ! Do NOT confuse
    us with Internet scam artists . Why work for somebody
    else when you can become rich within 38 days ! Have
    you ever noticed people are much more likely to BUY
    with a credit card than cash & nearly every commercial
    on television has a .com on in it ! Well, now is your
    chance to capitalize on this ! We will help you sell
    more & SELL MORE ! You can begin at absolutely no cost
    to you . But don't believe us . Ms Anderson of New
    Mexico tried us and says "Now I'm rich many more things
    are possible" . We assure you that we operate within
    all applicable laws . DO NOT DELAY - order today .
    Sign up a friend and your friend will be rich too !
    Best regards .

If this message arrived in my Inbox, I would definitely treat it as spam and delete it, unless I knew to look out for it.
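
The idea of letting the bits of a hidden message select among equivalent phrases can be sketched in a few lines of Python. This toy is not SpamMimic's actual grammar or algorithm: each position offers just two interchangeable phrases, so each sentence carries a single bit, and the alternative phrases and the names SLOTS, encode, and decode are invented for illustration.

```python
# Each slot offers two interchangeable phrases: index 0 encodes a 0 bit,
# index 1 encodes a 1 bit. The second phrase in each pair is invented.
SLOTS = [
    ["Dear Friend ,", "Dear Colleague ,"],
    ["This letter was specially selected to be sent to you !",
     "Your address was chosen from our list of valued customers !"],
    ["We will comply with all removal requests !",
     "You may ask to be removed at any time !"],
    ["DO NOT DELAY - order today .", "Act now while supplies last ."],
]

def encode(bits):
    """Assemble cover text by choosing one phrase per slot according to each bit."""
    if len(bits) > len(SLOTS):
        raise ValueError("not enough slots for this many bits")
    return " ".join(SLOTS[i][bit] for i, bit in enumerate(bits))

def decode(text):
    """Recover the bits by recognizing which phrase was used in each slot."""
    bits = []
    rest = text
    for options in SLOTS:
        for bit, phrase in enumerate(options):
            if rest.startswith(phrase):
                bits.append(bit)
                rest = rest[len(phrase):].lstrip()
                break
        else:
            break  # no phrase from this slot matched: end of hidden data
    return bits

cover_text = encode([1, 0, 1, 1])
print(cover_text)
print(decode(cover_text))  # [1, 0, 1, 1]
```

With only one bit per sentence, hiding even the eleven-character message above would take dozens of sentences, which illustrates why the capacity of text is so low.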

It is a fascinating area of technology, but is there any evidence that the technique has actually been used? If you search Google, you will find plenty of people suggesting that it can and does occur, but no hard evidence as yet. In the era of global terrorism, this must be a growing concern for those at the National Security Agency and others tasked with monitoring electronic communication.
