Chapter 6. Filtering

It's been said before, and it's worth repeating: qmail is a very modular email architecture. Because of this modularity, it is relatively easy to alter the behavior of the overall system by wrapping the basic components or by inserting a script or program between them. Filtering email is a perfect example of the power of this design. This is done by filtering the communication between architectural components; so while filtering email is the primary operation discussed in this chapter, filtering architectural interfaces is the method by which this expansion or modification of the architecture is achieved.

Basic Filtering Architecture

The basic qmail architecture, trimmed down to just the parts relevant to delivery (and thus filtering) of email, is shown in the following figure:

[Figure: Basic Filtering Architecture]

Almost any of qmail's components can be wrapped and used for filtering purposes. Which components to wrap depends on the specific behavior desired. In many cases there are multiple ways of achieving the same thing and choosing which method to use requires planning. For example, if some email needs to be blocked or rejected, it is better to catch that email earlier in its path through the system rather than later. This reduces the amount of time and resources spent on email that is not delivered. Thus, the most common place to block mail is before it is queued for delivery. Filtering mail (i.e. modifying it) is often done in multiple places, depending on what kind of filtering is desired. For example, filtering for spam is sometimes done before mail is queued, though this makes it harder to implement user-specific filtering rules completely. If user-specific filters are desired, filtering at the delivery stage is often preferable.

Exactly where a component is inserted affects what filtering options are available. For example, a wrapper around qmail-queue can examine the full content of an email, but does not know precisely where the email came from. A wrapper around qmail-smtpd knows exactly where the email originated, but in order to inspect the content of the email the wrapper must process SMTP itself, making the wrapper far more complicated and prone to security vulnerabilities. The following figure describes typical filters for each location in the architecture.

[Figure: Basic Filtering Architecture]

Connection decisions determine whether to allow a connection or not, and are based solely on information about the connection (such as the client's IP address and port number). A basic example of this is provided by the tcprules files discussed in Chapter 1. A content gateway is a filter that decides whether to allow a given email to continue to the next stage of delivery based on the content of that email. A content modifier, on the other hand, is a filter that always allows the email through to the next stage, but usually alters the email's content as it does so. In practice, the distinctions between these types of filters are often blurred, as some filters can both block and modify. Examples are given throughout the rest of this chapter.

The reasons for wrapping each component are relatively straightforward.

  • qmail-inject: A wrapper around qmail-inject can make decisions about which users are allowed to send email, can prevent users from sending spam or viruses, can help fill out form-letter emails, or something similar.

  • qmail-smtpd: A wrapper around qmail-smtpd can set useful environment variables (such as RELAYCLIENT), check for protocol conformance, or check the client against DNS blacklists, among other things.

  • qmail-queue: A wrapper around qmail-queue can check messages for viruses, spam, valid DomainKeys signatures and similar tricks, and can even feed modified versions of the email to qmail-queue instead of the original.

  • qmail-remote: A wrapper around qmail-remote is useful for making sure that all outbound email is properly signed with a DomainKeys signature or similar email modifications.

  • qmail-local: A wrapper around qmail-local is similar in purpose to a qmail-queue wrapper. The difference is that by the time the message gets to this point in the delivery process, it is a copy of the message for a single recipient. This is a useful thing to know if different recipients have different filtering preferences. And, of course, qmail-local can deliver to programs like procmail and maildrop that both filter and discard mail according to recipient preferences.

In addition to wrapping each component, of course, the components themselves can be modified to perform a task, though this usually requires more programming experience. However, if speed is a consideration, modifying components is often faster than wrapping them.

The most commonly wrapped components are qmail-queue and qmail-smtpd, because the most common filters intercept spam and viruses. In addition to hand-written wrapper scripts, several popular qmail-queue wrappers provide an array of filtering options, including Qscanq, Inter7's simscan, and qmail-scanner.

These wrapper programs generally save incoming email to a file; feed it to other filtering or scanning programs (e.g. to detect viruses); and, depending on the result of these external programs, either delete, quarantine, or reject the email or feed it back to qmail-queue for delivery. They can also extract any MIME-encoded attachments from the email for separate scanning.

Qscanq is the simplest of the three, supporting few virus scanners and no inherent spam scanning. It either silently destroys virus-laden email or rejects it. Inter7's simscan is more complex, supporting several virus scanners as well as SpamAssassin's spamc scanner. simscan either rejects or destroys virus-infected messages and conditionally rejects, deletes, or passes through messages based on their SpamAssassin score. The qmail-scanner wrapper is the most complex of the three and, because it is written in Perl rather than C, has more per-message overhead than the other wrappers. However, it supports almost all popular virus scanners, message quarantining, SpamAssassin analysis, internal pattern matching, and many other features.

The most versatile wrapper is qmail-qfilter. It provides a mechanism to run arbitrary programs or scripts on each message and rejects, deletes, or passes through messages based on the result of each program.

The qmail-smtpd program is, unlike qmail-queue, more often wrapped with simple scripts and small, single-purpose programs instead of multi-purpose utility programs. As an example, included in the ucspi-tcp software package is a program called rblsmtpd. This program relies on the environment variable TCPREMOTEIP, as defined by tcpserver. It will look up that IP address in a DNS-based blacklist (specified on the command line) and, depending on whether the IP is listed in the blacklist or not, will either print an SMTP rejection message or execute the program specified in its arguments—usually, qmail-smtpd. For example, rblsmtpd can be used as follows:

tcpserver -u `id -u qmaild` -g `id -g qmaild` \
    0 smtp rblsmtpd -r some.blacklist.domain qmail-smtpd

In addition to wrappers, there are also several drop-in replacements for qmail-smtpd. For example, mailfront (http://untroubled.org/mailfront) by Bruce Guenter, supports SMTP-AUTH, virus scanning, sender/recipient filtering, and many other features. Linux Magic's magic-smtpd (http://www.linuxmagic.com/opensource/magicmail/magic-smtpd) is similar and includes features like SMTP tarpitting, user validation, TLS, SMTP-AUTH, and many others. These programs would be used as follows (using mailfront as an example):

tcpserver -u `id -u qmaild` -g `id -g qmaild` \
    0 smtp mailfront

The structure of these commands can be a bit confusing. The tcpserver program sets up the network connection and then runs a single program. tcpserver's view of its arguments is that they are in four ordered categories:

tcpserver tcpserverArguments aProgramToRun argumentsToThatProgram

It identifies the program to run by simply assuming that the first argument that it doesn't understand (and that doesn't start with a hyphen) is the name of a program and that all subsequent arguments are arguments to that program. For example, when using rblsmtpd with qmail-smtpd, tcpserver views it in the following manner:

tcpserver tcpserverArguments rblsmtpd argumentsToRblsmtpd

tcpserver knows nothing of qmail-smtpd or any other program; because it does not recognize rblsmtpd as one of its own arguments, it treats rblsmtpd as the name of the program to run. Once the program runs, tcpserver has no further control over it. rblsmtpd behaves similarly. When it runs, it views its arguments in four ordered categories:

rblsmtpd rblsmtpdArguments aProgramToRun argumentsToThatProgramIfAny

When it's used with qmail-smtpd, it sees this:

rblsmtpd rblsmtpdArguments qmail-smtpd argumentsToQmailSmtpdIfAny

Once rblsmtpd runs and makes its decision, it then executes whatever program the text qmail-smtpd identifies. Like tcpserver, when rblsmtpd runs a program, it has no control over it; unlike tcpserver, which spawns a child to run (exec) the program, rblsmtpd runs (execs) the named program itself, and the new program completely replaces rblsmtpd in memory.

This telescoping, cascading, or chaining behavior—where the wrapper does some action and then runs (execs) the next program in the chain—is a common wrapping technique.
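The chaining pattern can be sketched as a small shell wrapper. This is a minimal illustration under stated assumptions, not a replacement for rblsmtpd: the DENYLIST variable and its default file path are invented for the example, and the single-line SMTP refusal is deliberately simplified. TCPREMOTEIP is the variable set by tcpserver, as described above.

```shell
#!/bin/sh
# Hypothetical chaining wrapper (not part of qmail or ucspi-tcp).
# DENYLIST and its default path are assumptions made for this sketch.
check_and_chain() {
    denylist="${DENYLIST:-/etc/smtp-deny.list}"
    if [ -n "$TCPREMOTEIP" ] && [ -f "$denylist" ] &&
        grep -qxF "$TCPREMOTEIP" "$denylist"
    then
        # Refuse the client with a single SMTP reply and stop here.
        printf '554 connections from %s are refused\r\n' "$TCPREMOTEIP"
        return 0
    fi
    # exec replaces this shell with the next program in the chain.
    exec "$@"
}

# Run only when a next program was named on the command line.
if [ "$#" -gt 0 ]; then
    check_and_chain "$@"
fi
```

Saved under a name of your choosing, such a script would occupy the same position in the tcpserver command line that rblsmtpd occupies above, with qmail-smtpd as its argument.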

Sending Mail Without a Queue

One of the nice things about having such a modular architecture is that pieces of the architecture can be rearranged or removed. A good example of doing this is removing the queue from the picture, so that messages are sent immediately to another email server. There are several ways of achieving this end depending on the specific behavior required. The two primary methods are described here.

Dr. Bernstein's website has simple directions (http://cr.yp.to/qmail/mini.html) for setting up a queue-less qmail installation that uses the Quick Mail Queueing Protocol (QMQP) to transmit email messages to another qmail server (or any server that understands QMQP). He calls this setup mini-qmail. QMQP (similar to QMTP) avoids some of the latency of SMTP, and optimizes message transmission. However, only qmail currently supports QMQP, and so it may not be the best option in a mixed environment. These instructions direct that the qmail-queue binary be replaced with the qmail-qmqpc binary. This program uses the same interface as qmail-queue but rather than queuing messages, it transmits them to another server via the QMQP protocol, as directed by the control/qmqpservers file (details are in the qmail-qmqpc man page).

Creating a queue-less qmail installation that uses SMTP is a little more work than the QMQP-only setup but it can cooperate with servers that do not support QMQP. The easiest way to do it is, similar to mini-qmail, to replace qmail-queue with a program that will instead transmit messages to a remote server. Such a program, akin to qmail-qmqpc, must support the qmail-queue interface. The qmail-queue replacement program does not, however, need to interact with the network; it can simply use qmail-remote to transmit messages via SMTP. For example, the following script could work:

#!/bin/bash
# read the envelope information first (from file descriptor 1)
# each envelope field ends with a 0 byte; the sender is prefixed
# with the letter F and each recipient with the letter T
read -r -u 1 -d $'\0' sender
sender="${sender#F}"
i=0
while read -r -u 1 -d $'\0' recipient ; do
    # an empty field marks the end of the recipient list
    [ -z "$recipient" ] && break
    recipients[$i]="${recipient#T}"
    i=$(($i+1))
done
# Now, generate the Received header,
# feed the message to qmail-remote, and capture the output
printf 'Received: (qmail %i invoked by uid %i); %s\n%s\n' \
    "$$" "$UID" "$(date '+%d %b %Y %k:%M:%S %z')" "$(cat -)" | \
    /var/qmail/bin/qmail-remote \
    "$SMARTHOST" "$sender" "${recipients[@]}" | \
    while read -r -d $'\0' result ; do
        case "$result" in
        K*) # success
            exit 0;;
        Z*) # temporary failure
            exit 71;;
        D*) # permanent failure
            exit 31;;
        esac
    done

For this script to work, the SMARTHOST environment variable must be defined or $SMARTHOST must be replaced by the correct host (either a hostname or a square bracketed IP address) to send mails to.
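A replacement like this depends on the qmail-queue envelope format, so it helps to be able to build and inspect an envelope by hand. The following is a rough sketch for experimentation; the function names are invented for the example, and the addresses in the usage note are placeholders.

```shell
#!/bin/sh
# Build a qmail-queue style envelope on standard output:
# F<sender>\0, then T<recipient>\0 per recipient, then a final \0.
make_envelope() {
    printf 'F%s\000' "$1"
    shift
    for rcpt in "$@"; do
        printf 'T%s\000' "$rcpt"
    done
    printf '\000'
}

# Read an envelope from standard input and print one address per line.
parse_envelope() {
    tr '\000' '\n' | while IFS= read -r field; do
        # An empty field marks the end of the recipient list.
        [ -z "$field" ] && break
        case "$field" in
            F*) printf 'sender: %s\n' "${field#F}" ;;
            T*) printf 'recipient: %s\n' "${field#T}" ;;
        esac
    done
}
```

To exercise a qmail-queue replacement by hand, an envelope written to a file with make_envelope can be supplied on file descriptor 1 (for example with a redirection such as 1<envelope) while the message itself is supplied on standard input.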

Blocking Viruses

The most common means of transferring viruses to new vulnerable computers is through email. Originally, viruses preferred to invisibly infect innocuous files that were later transferred from computer to computer at the behest of a person sharing files for legitimate purposes. While this can still happen, the overwhelming majority of virus-laden email is sent by autonomous viruses that either have randomly generated recipients or have found recipient addresses on the infected computer. More to the point, these emails generally contain nothing of value. With that change in behavior, the best way to handle emails containing viruses has also changed. In the past, the expectation was that email containing a virus should be modified, the virus stripped out, and the sanitized message delivered to its original destination. These days, infected emails are so rarely legitimate that rather than delivering them, the most common response is simply to delete the infected messages entirely.

Both policies, however, rely upon detecting infected emails, for which there are many options.

Heavyweight Filtering

Heavyweight filtering is the filtering that most often comes to mind for virus eradication. Each email is decoded, attachments (if any) are separated, and all of the component parts of the email are scanned with virus scanning software to determine whether they contain a virus. Such scanning is considered heavyweight because of the sheer enormity of the task. Virus scanners maintain large databases of every known computer virus and how to detect it. There are millions of different species of computer viruses, and new viruses are discovered almost hourly. A good virus scanning installation must maintain this large database, test for every single virus in that database, and also keep this database updated. Usually these tasks are automated, but this scanning method is time-consuming for every message. In high-load situations, such virus detection might not be feasible.

Some of the most common virus scanners are:

  • AVG Anti-Virus

  • Trend InterScan VirusWall

  • F-Prot Antivirus scanner

  • NAI/McAfee scanner

  • H+BEDV's AntiVir scanner

  • Kaspersky's AVPLinux scanner

  • Command's virus scanner

  • the F-Secure Anti-Virus scanner

  • the InocuLAN Anti-Virus scanner

  • BitDefender Linux Edition

  • Central Command's Vexira anti-virus scanner

  • the ESET NOD32 Anti-Virus scanner

  • the open-source Clam Anti-Virus (ClamAV) scanner

These scanners must be hooked into the qmail architecture to scan each email. The most common way of doing this is with a wrapper around qmail-queue. Some virus scanners come with a qmail-queue wrapper; however, most require a separate wrapper, such as those discussed previously.

Lightweight Filtering

There are other options for protecting email users from viruses that do not require a heavyweight approach or that can reduce the load on a heavyweight virus scanner. The simplest options start with policy decisions, such as banning certain types of files. For example, most viruses are Microsoft Windows executable files, a fact that has made banning executable files from email a popular policy. Simple filters that reject email containing attachments whose names end in .exe (filters such as qmail-scanner) can be effective; however, not all Microsoft Windows executable file names end in .exe. Indeed, they can use almost any suffix and still be recognized as executable by Windows. However, all Microsoft Windows executable files begin with information that tells Windows how to load the program. This information is the same in all Windows programs and is required in order to run the file. It is small, always at the beginning of the file, and thus easily detected. There is a small patch for qmail-smtpd, written by Russ Nelson (http://www.qmail.org/qmail-smtpd-viruscan-1.3.patch), that makes it easy to enforce a policy banning Windows executable files from email. This patch requires very little overhead and is effective even in high-load situations.

To hide from such filters, some viruses send themselves as zip-compressed archives and rely on the recipient to uncompress the virus and run it. ZIP files, like executable files, all begin with a similar pattern and can be identified and banned with Russ Nelson's patch. Be warned, however, that some file formats (such as some OpenOffice files) are, unbeknownst to the user, really zip-compressed collections of several components, and banning all ZIP files also bans these files.

Not all viruses are executable files so Russ Nelson's patch is not a complete solution. For example, scripts (like Visual Basic .vbs files) and Microsoft Office macro-viruses are not technically executable files, and so cannot be blocked by this patch. Additionally, it might not be feasible to ban executable files and/or ZIP files from email. However, when such measures are possible, simple filters such as Russ Nelson's patch or a suffix-detector provide an effective first line of defense against viruses.
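The idea of recognizing a file by its opening bytes can be illustrated on an attachment that has already been decoded and saved to disk. This is only a sketch of the principle and is not how Russ Nelson's patch is implemented; the function name is invented here. It relies on two well-known signatures: Windows executables begin with the two bytes "MZ", and ZIP archives begin with "PK".

```shell
#!/bin/sh
# Classify a decoded attachment by its first two bytes.
# Usage: classify_attachment FILE
classify_attachment() {
    # tr drops any 0 bytes so the shell never has to store them.
    magic=$(dd if="$1" bs=1 count=2 2>/dev/null | tr -d '\000')
    case "$magic" in
        MZ) echo executable ;;
        PK) echo zip ;;
        *)  echo other ;;
    esac
}
```

A wrapper could reject or quarantine the message when the result is executable, and apply whatever local policy has been chosen for zip.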

Stopping Spam from Getting In

Eliminating spam is one of the most important tasks of today's email administrators. There are two equally important facets to eliminating spam: preventing it from being sent by your server and preventing it from being delivered to your users. Of the two, preventing it from being delivered is often the harder.

Sender Validation

Strictly speaking, sender validation is not an anti-spam technique, though it is often regarded as such. One of the interesting details of the SMTP protocol is that the sender of a given message is not restricted to the address of the actual sender. A person sending a message can specify any return address, just as they can specify any destination address. In a trustworthy environment, where no one has a reason to hide his or her identity, this is not a problem. However, on today's networks, viruses, scammers, spammers, and so-called phishers all wish to hide their identity when sending email.

The reason sender validation is usually considered an anti-spam technique is the belief that if spammers could not hide their identity, spam would be easier to block. However, it is important to recognize that a spammer can send spam that correctly identifies himself or herself as the sender. Accurate sender addresses do not make the mail any less spammy, but they make messages from known spammers easy to refuse and messages from known non-spammers impossible to mistake for spam. Unfortunately, qmail-compatible software for blocking specific validated senders does not exist.

There have been many ideas for reliably verifying email senders. The two most popular are the Sender Policy Framework (SPF) and DomainKeys. DomainKeys is an earlier version of DomainKeys Identified Mail (DKIM), which is not as widely deployed or supported as DomainKeys. However, given that DKIM will shortly (as of this writing) be approved by the IETF, it will eventually be widely implemented.

SPF

The basic idea behind SPF is that a domain should specify (in DNS) precisely which servers are authorized to send its email. Thus, an email claiming to be from a particular domain (such as ebay.com) but not originating from one of that domain's approved servers is considered a forgery. This works well in most circumstances, but its drawback is that it breaks a widely used feature of email called forwarding, not just for SPF users but for anyone contacting an SPF-using domain.

People often have email addresses that are merely forwarding addresses, such that all mail sent to that address is resent to another address. On virtually every email server prior to the invention of SPF, forwarded email was delivered unmodified. The problem with this approach is illustrated in the following figure:

[Figure: SPF]

Imagine that an address at example2.net is a forwarding address that forwards to an address at example3.org. If a message is sent from an example.com address to the example2.net address, the example2.net server then sends the message to the example3.org server. From the example3.org server's perspective, the example2.net server is sending email claiming to be from example.com. If example.com publishes an SPF record asserting that its email only originates from its own servers, and if example3.org blocks messages based on SPF information, it will refuse the example.com message forwarded by the example2.net server. This thwarts a viable (and popular) method of managing email accounts.

The SPF workaround is to re-encode email return addresses so that each server only sends from its own domain. For example, example2.net could re-encode the return address as an address within example2.net, so the message does not appear to be claiming to be from example.com. If the message bounced back from example3.org for some reason, example2.net would receive it and could then decode the return address and bounce it back to the original sender at example.com.

The problem with this solution is that a spammer could convince the example2.net server to relay spam by sending spam to such a re-encoded address. To prevent such abuse, each server will need to remember every email it forwarded for every user, so that it can refuse to relay bounces (or spam that looks like a bounce) for messages it didn't send. More to the point, each server will have to remember all of these messages until it can be certain that they will not bounce, which is an unpredictable amount of time. For example, the example2.net server cannot know whether the example3.org server is the message's final stop or whether the message will be forwarded again and then bounce. Keep in mind that a message can sit in a server's queue for a long time (typically up to seven days) before it is bounced.
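The re-encoding workaround can be made concrete with a toy scheme. The encoding below (local part, an equals sign, then the original domain, all placed under the forwarder's domain) is invented purely for illustration, as are the function names and the address alice@example.com; deployed schemes such as the Sender Rewriting Scheme add a hash and a timestamp precisely because a bare encoding like this one can be forged by anyone.

```shell
#!/bin/sh
# Toy return-path re-encoding, for illustration only.
# encode_return_path ORIGINAL_SENDER FORWARDING_DOMAIN
encode_return_path() {
    printf '%s=%s@%s\n' "${1%@*}" "${1##*@}" "$2"
}

# decode_return_path REENCODED_ADDRESS
decode_return_path() {
    enc_local="${1%@*}"      # drop the forwarder's domain
    printf '%s@%s\n' "${enc_local%=*}" "${enc_local##*=}"
}

encode_return_path alice@example.com example2.net
# prints: alice=example.com@example2.net
decode_return_path alice=example.com@example2.net
# prints: alice@example.com
```

Because decode_return_path accepts any address of the right shape, a server using it without further checks would relay to whatever destination a sender chose to encode, which is exactly the abuse described above.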

DomainKeys

The DomainKeys concept is to cryptographically sign all messages with a public key encryption system, where the signature asserts that a server approved by the domain has sent the email. The information necessary to verify the validity of the signature is published in DNS, so that any server may validate the signature and prove that the message was signed by a server approved by that domain to send email. Thus, for a domain that asserts (in a DomainKeys policy record stored in DNS) that it signs all of its email, a message missing a signature or containing an incorrect signature is considered a forgery. Unlike SPF, this allows emails to be forwarded without being modified. The caveat is that DomainKeys requires that messages not be modified (between the DomainKeys signature and the end of the message). Thus, while Received headers and any other headers prepended to the message are acceptable, other modifications—such as spam-filter headers or signatures appended to either the end of the message or the end of the headers—can invalidate the DomainKeys signature. The DomainKeys specification includes the ability to name, in the signature (in an h= tag), what headers were originally included in the signature. This makes the signature much more tolerant of header modification, as additional headers (such as spam-filter headers) inserted in the wrong place in the message can be excluded from the signature. However, the body of the message and the headers that were included in the signature cannot change without invalidating the signature.

Russ Nelson has contributed significantly to qmail's ability to verify and sign messages with DomainKeys. Most significantly, he is the original author of the libdomainkeys library (http://domainkeys.sourceforge.net), which provides many ways of creating and validating DomainKeys signatures.

Russ Nelson also wrote a patch to the qmail source that creates a new component, qmail-dk (http://www.qmail.org/qmail-1.03-dk-0.53.patch), that can be used as a wrapper around qmail-queue for signing and verifying DomainKeys signatures. This program is convenient, but not always necessary. The libdomainkeys package includes a utility called dktest that can, with the appropriate shell-script glue, provide some of the basic features that qmail-dk provides. For example, to verify DomainKeys signatures in incoming email and add a header proclaiming the result, a script like the following will work:

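#!/bin/sh
# DKQUEUE names the real qmail-queue program
[ "$DKQUEUE" ] || DKQUEUE=/var/qmail/bin/qmail-queue
# verify only when DKVERIFY is set (even if it is empty)
if printenv | grep -q '^DKVERIFY=' ; then
    # save the message so that it can be read twice
    tmp=`mktemp -t dk.verify.XXXXXXXXXXX`
    cat - >"$tmp"
    # queue dktest's output (minus its first line),
    # followed by the original message
    (dktest -v <"$tmp" 2>/dev/null | awk 'NR>1' ; \
        cat "$tmp" ) | "$DKQUEUE"
    retval=$?
    rm "$tmp"
    exit $retval
else
    exec "$DKQUEUE"
fi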

This script can, in conjunction with the QMAILQUEUE patch (or as a qmail-queue replacement, provided that the real qmail-queue can still be used with a different name) tag each incoming email with a "DomainKey-Status" header, recording the results of checking the DomainKeys signature. The DKQUEUE environment variable is consulted for the location of the qmail-queue binary, and the DKVERIFY environment variable is used to enable verification.

The qmail-dk program serves two purposes: it verifies signatures (like the above script), and it creates signatures. Because qmail-dk is a qmail-queue wrapper, it must decide whether to sign a message when the message is queued. In order to sign all messages, every part of the system that queues messages must use it—from qmail-smtpd to qmail-inject to qmail-local (which queues messages in response to delivery instructions). Unfortunately, qmail-dk does not support the inclusion of an h= tag in the signature.

An alternative to wrapping qmail-queue with something like qmail-dk is to wrap qmail-remote. This signs all remotely delivered messages, no matter how they were enqueued. While qmail-dk does not serve easily as a qmail-remote wrapper, the following script (using dktest) is a suitable option.

This script, which supports the h= tag, presumes that its name is qmail-remote and that the original qmail-remote was renamed qmail-remote.orig:

#!/bin/bash
# DKSIGN is the private key file; % stands for the signing domain
[ "$DKSIGN" ] || DKSIGN="/etc/domainkeys/%/default"
[ "$DKREMOTE" ] || DKREMOTE=/var/qmail/bin/qmail-remote.orig
# default to the domain of the envelope sender (the second argument)
DOMAIN="${DOMAIN:-${2##*@}}"
if [[ $DKSIGN == *%* ]] ; then
    DKSIGN="${DKSIGN%%%*}${DOMAIN}${DKSIGN#*%}"
fi
if [ -f "$DKSIGN" ] ; then
    # save the message so that it can be read twice
    tmp=$( mktemp -t dk.sign.XXXXXXXXXXX )
    cat - >"$tmp"
    # send the signature header, then the original message
    (dktest -s "$DKSIGN" -c nofws -h <"$tmp" 2>/dev/null | \
        sed 's/; d=.*;/; d='"$DOMAIN"';/'; \
        cat "$tmp" ) | \
        "$DKREMOTE" "$@"
    retval=$?
    rm "$tmp"
    exit $retval
else
    exec "$DKREMOTE" "$@"
fi

With this script, the environment variable DKSIGN specifies the location of the private key for the signing process, DKREMOTE specifies the location of the original qmail-remote program, and finally, DOMAIN specifies the sending domain (if it is unspecified, the script will assume the domain is the sender's domain). More conveniently, just like qmail-dk, DKSIGN can contain a percent sign (%) that is replaced by the signing domain name, thus making signing for multiple domains with separate keys more convenient.
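The percent-sign substitution is compact enough to be easy to misread, so here it is isolated into two small functions that can be tried at a prompt. The function names are invented for this sketch; the parameter expansions are the same ones the script uses.

```shell
#!/bin/sh
# Extract the domain from an address such as user@example.com.
sender_domain() {
    printf '%s\n' "${1##*@}"
}

# expand_key_path TEMPLATE DOMAIN
# Replace the first % in TEMPLATE with DOMAIN; leave other paths alone.
expand_key_path() {
    case "$1" in
        *%*) printf '%s%s%s\n' "${1%%%*}" "$2" "${1#*%}" ;;
        *)   printf '%s\n' "$1" ;;
    esac
}

expand_key_path '/etc/domainkeys/%/default' "$(sender_domain user@example.com)"
# prints: /etc/domainkeys/example.com/default
```

Here `${1%%%*}` keeps everything before the first percent sign and `${1#*%}` keeps everything after it, so only the first percent sign in the template is replaced.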

Identifying Spam

Identifying spam is a complex task relying on one fundamental assumption: spam is not normal email and is not sent in the normal way. Now, at one level, this assumption seems obvious: of course spam is abnormal, and of course spam is sent by bulk mailers or virus bots or other unusual software. Nearly all of the effective anti-spam tactics rely on one or both of these details. For that reason, the distinction between real email (or ham) and spam email is precisely what spammers attempt to blur. Once spam is identified, it can either be tagged or blocked.

Lightweight

Lightweight methods for identifying spam are considered lightweight because they do not rely on complex analysis of email in order to make decisions, and the decisions are almost always binary: each email is accepted or rejected, with no gray area in between. How this decision is reached generally depends on how much information is available when the decision is made. In some cases, email is considered spam before the sender has sent even a single character of the email. If it is blocked at that point, bandwidth is not wasted in receiving the message. The methods described here are not an exhaustive list, but merely a set of popular examples.

Domain Name System Black-Lists

The technology known as Domain Name System Black-List (DNSBL) is a method for quickly deciding which messages are not acceptable. Essentially, the idea is this: spammers have a limited number of computers, and once these are identified, no messages from them need to be accepted. This of course makes the underlying assumption that messages from spammer-owned computers are always spam, and messages from other computers are never spam. The lists of computers known or suspected to be under spammer control are stored in a public database (namely, the DNS database). When a computer attempts to contact an email server that uses a DNSBL, the server looks up the connecting computer in the public database. If the connecting computer is listed as a spammer, the attempt is rejected.

Unfortunately, these public databases are far from perfect—some people who are listed are not spammers, and many spammers are not listed at all. As a result, DNSBLs are rather blunt instruments, to be used with care. Because of the proliferation of spam, however, many feel that the use of such blacklists is absolutely essential. Before using a blacklist, however, make sure that its policies regarding what senders get listed make sense for your server.
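The lookup itself is an ordinary DNS query: by convention, the octets of the client's IPv4 address are reversed and the blacklist's zone name is appended. The sketch below only builds that query name; the function name is invented, the zone is the placeholder used earlier in this chapter, and IPv6 addresses are not handled.

```shell
#!/bin/sh
# dnsbl_query_name IPV4_ADDRESS ZONE
# Print the DNS name to look up for this client in this blacklist.
dnsbl_query_name() {
    printf '%s\n' "$1" |
        awk -F. -v zone="$2" '{ print $4 "." $3 "." $2 "." $1 "." zone }'
}

dnsbl_query_name 192.0.2.99 some.blacklist.domain
# prints: 99.2.0.192.some.blacklist.domain
```

Passing the resulting name to a resolver tool such as host or dig shows whether the address is listed: a listed address resolves to an answer, while an unlisted one does not resolve at all.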

Checking for SMTP Violations

Another technique for quickly identifying spam is to categorize a given computer as a spammer based on its disregard for SMTP protocol details. For example, SMTP clients are required to wait until the server greets them to begin sending data. Spammers, however, are frequently in a hurry, and most servers do not object to receiving incoming email all at once. For this reason, spammers often begin transmitting data as quickly as possible without waiting for the server's participation. It is easy for a server to wait a second before greeting the client, just to see if the client will wait to be greeted. Not waiting for a greeting is an easily detected violation of the SMTP protocol common to many spammers.

Spammers often attempt to gain anonymity by sending email through someone else's computer, often exploiting unusual software in the process. For example, unsecured HTTP proxies can hide the spammer's identity, but send HTTP commands in addition to SMTP commands. In order for such subterfuge to work, the spammer relies on most servers ignoring the HTTP commands and accepting the SMTP commands. However, the presence of both types of commands is an easy way to identify a sender who is doing something improper and, therefore, is a spammer.

The popular greylisting technique also fits into this spammer-identifying category. Greylisting is a technique of validating sending clients by temporarily refusing to accept messages from unrecognized clients. Once an initial attempt is made and fails with a temporary error code, that sender is added to the list of recognized senders. Further delivery attempts are accepted. When this happens to a typical email, the sending server waits and re-attempts delivery later. Spammers, however, often do not care whether their messages are delivered and do not re-attempt later, identifying themselves as spammers. This technique can be altered to require specific messages to be retried rather than specific clients, but the fundamental concept is the same.
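The core of greylisting fits in a few lines. This is a deliberately naive sketch: the function name and the flat state file are assumptions for the example, and real implementations typically also enforce a minimum delay before accepting a retry and expire old entries.

```shell
#!/bin/sh
# greylist_check CLIENT_IP STATE_FILE
# Print "tempfail" the first time a client is seen, "accept" afterwards.
greylist_check() {
    if grep -qxF "$1" "$2" 2>/dev/null; then
        echo accept
    else
        # Remember the client so that its retry is recognized.
        printf '%s\n' "$1" >> "$2"
        echo tempfail
    fi
}
```

A wrapper around qmail-smtpd could answer with a temporary SMTP error when the result is tempfail and run qmail-smtpd when it is accept.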

Monitoring protocol violations, however, is obviously not a full solution; spammers only need to use software that fully implements SMTP and they will get through. The reason this is still a successful method of detecting spammers is that spammers find it inconvenient to obey all of the SMTP requirements, and most recipient servers are not picky about protocol violations.

Pattern Matching

Once a message is transmitted, heavyweight scanners can analyze it in depth. Looking for simple patterns in an email's text is often surprisingly effective. For example, emails naming obscure pharmaceuticals or beginning with the phrase "Dear friend", or those obfuscating English words with misspellings and numbers (for example, "pr0n") are often spam. On the other hand, these patterns also appear in messages that discuss spam, so a pattern matcher can mistakenly identify such a message as spam when it is not.
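
A minimal pattern matcher along these lines might look like the following sketch. The two patterns are taken from the examples above; the list name and function name are invented for illustration, and a real rule set would be far larger.

```python
import re

# Hypothetical rule list built from the examples in the text.
SPAM_PATTERNS = [
    re.compile(r"^\s*dear friend\b", re.IGNORECASE),
    re.compile(r"\bpr0n\b", re.IGNORECASE),
]


def matched_patterns(body: str):
    """Return the source text of every pattern found in the message body."""
    return [p.pattern for p in SPAM_PATTERNS if p.search(body)]
```

Note that a message such as "our filter now catches pr0n" also matches, which is exactly the kind of false positive described above.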

Heavyweight

While lightweight spam identification techniques are often effective, their simplicity leaves them vulnerable to avoidance or compensation techniques implemented by spammers. Heavyweight spam identification, on the other hand, is designed to be more robust and to adapt as spammers change their tactics.

Bayesian and other Machine-Learning Techniques

A relatively recent development in spam fighting has been the popularization of Bayesian networks to identify spam, beginning in 2002 (the idea was first proposed in 1998). Bayesian networks work in terms of probability: each word or phrase in a message has a certain probability of identifying the message as spam. The more "probably spam" words in a message, the more likely the message is spam. Similarly, the more "probably not spam" words in a message, the less likely the message is spam. What makes this technique particularly interesting is its ability to use email examples provided by the user to discover important words and phrases and to fine-tune its identifying power (probability) over time. In this way, the filter learns what spam looks like and adapts as spam changes. Messages from family members, for example, can be analyzed by the Bayesian classifier to discover what specific names, words, and/or phrases are less likely to be used in spam, and what rarely used words or nonsensical word pairs are more likely to be used in spam. There are many examples of scanners that work this way, including:

Bayesian classification is a simple method of machine learning, and is susceptible to obfuscation of the content of the email. However, the idea of applying machine-learning techniques to the problem of spam identification is a powerful one. Several software projects are available that use more advanced mathematical models than the basic Bayesian model. For example, the CRM114 (http://crm114.sourceforge.net/) filter organizes words and phrases into hidden Markov models rather than Bayesian networks, and the DSPAM (http://dspam.nuclearelephant.com/) software adds neural networks and several advanced enhancements to standard Bayesian learning and classification techniques.
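
To make the word-probability idea concrete, here is a deliberately small sketch of a trainable word classifier. It is not the algorithm of any of the projects named above: the class name, the whitespace tokenizer, the add-one smoothing, and the assumption that spam and ham are equally likely beforehand are all simplifications made for this example.

```python
import math
from collections import Counter


class WordClassifier:
    """Toy spam classifier that learns per-word probabilities from examples."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}  # messages trained per class

    def train(self, text: str, label: str) -> None:
        # Count each distinct word once per message.
        self.counts[label].update(set(text.lower().split()))
        self.totals[label] += 1

    def spam_probability(self, text: str) -> float:
        # Add up the log-odds contributed by each word. Add-one smoothing
        # avoids zero probabilities for words seen in only one class.
        log_odds = 0.0
        for word in set(text.lower().split()):
            p_spam = (self.counts["spam"][word] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp to keep math.exp() from overflowing on extreme scores.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Correcting a mistake is simply another call to train() with the right label, which is how such a filter adapts over time.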

Ensemble Identification

One of the most popular forms of heavyweight analysis is ensemble analysis, also referred to as the toolkit or arsenal approach. Rather than analyzing messages using a particular method, messages are examined with multiple methods, and the outputs of these methods are considered recommendations rather than authoritative decisions. These recommendations are combined with a weighted scoring system that allows techniques with low effectiveness to be taken into consideration. Messages with scores over a certain threshold are then considered spam, while messages under that threshold are not. One of the most popular examples of this type of spam identification technique is SpamAssassin (http://spamassassin.apache.org/).

SpamAssassin uses a large library of pattern-based filters, consults DNSBLs and several online spam databases, and also includes a Bayesian analyzer. The benefit of this approach is that new ideas in identifying spam can be added to the arsenal as they are developed. The catch, however, is that the software is very complicated, and has more overhead than simpler or more straightforward classification techniques because it uses all of the identification techniques rather than just one of them. Additionally, the influence given to each technique must reflect its expected level of accuracy, and the right balance between rules is difficult to gauge in the general case. SpamAssassin, for example, carefully tailors the relative weights of rules in its collection to be accurate over a large set of unrelated emails.
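
The weighted scoring at the heart of the ensemble approach can be sketched as follows. This is not SpamAssassin's code or rule format: the rule names, weights, and threshold are hypothetical values picked for illustration.

```python
def ensemble_score(message, rules):
    """Combine rule recommendations into a score.

    `rules` is an iterable of (name, weight, test) tuples, where
    test(message) returns True when the rule fires.
    """
    hits = [(name, weight) for name, weight, test in rules if test(message)]
    total = sum(weight for _, weight in hits)
    return total, [name for name, _ in hits]


def is_spam(message, rules, threshold=5.0):
    """Treat a message as spam when its combined score exceeds the threshold."""
    score, _ = ensemble_score(message, rules)
    return score > threshold
```

Because each rule only contributes a weight, a weak indicator can be given a small value and a reassuring one (such as a known sender) a negative value, without either deciding the outcome alone.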

Quarantines and Challenges

Perhaps the most involved filters are the kind that hold email messages hostage while a human decides whether they are spam (and should be deleted) or are ham (and should be delivered). These types of filters fall into two categories: quarantines and challenges. With a quarantine the recipient decides whether a message is spam or not, while with a challenge, the sender must certify a message is not spam. The more common of the two is quarantine, usually combined with heavyweight scoring. Because heavyweight identification typically places email somewhere on a continuum between spam and ham, messages not considered ham are often delivered to a special folder for holding spam. This folder is a quarantine where messages are examined by the recipient, and mistaken categorizations can be identified.

Challenge-based spam identification is less popular, but also effective. One of the most popular forms of challenge-based spam identification is called Tagged Message Delivery Agent (TMDA). When a message is received, it is stored in a holding area while a challenge message is sent back to the sender. The challenge message contains instructions to be followed by the sender so as to identify himself or herself as a non-spammer, such as visiting a message-specific URL or replying to a message-specific email address. When the sender's validating action is performed, the original message is delivered, and in some cases the sender is then added to a list of known good senders so that no challenges are made in the future. This technique is effective against spam, because it requires a valid return address in addition to some of the spammer's precious time. At a spammer's usual scale, where millions of messages are sent in an average hour, responding to the TMDA message is not worth the effort. On the other hand, TMDA challenges are frequently effective against ordinary busy people as well. Some people consider being challenged to prove that they are not spammers to be rude or, at the very least, a waste of time. Consider, for example, a person asking an expert a question. The expert might wish to respond, but balk at spending time verifying his non-spammer status. Of course, such situations are easily avoided through judicious use of white lists, but TMDA and other challenge-based methods make such problems easy to overlook.
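
The hold-and-challenge flow can be illustrated with a short sketch. It is not TMDA's implementation: the class, its method names, and the in-memory storage are assumptions made to show the sequence of events, and sending the actual challenge email is left out.

```python
import secrets


class ChallengeQueue:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self):
        self.held = {}            # token -> (sender, message)
        self.known_good = set()   # senders that already passed a challenge

    def receive(self, sender, message):
        """Return ("deliver", None) or ("challenge", token)."""
        if sender in self.known_good:
            return ("deliver", None)
        # The token would be embedded in a message-specific URL or address.
        token = secrets.token_urlsafe(16)
        self.held[token] = (sender, message)
        return ("challenge", token)

    def confirm(self, token):
        """Release the held message for a valid token, otherwise return None."""
        entry = self.held.pop(token, None)
        if entry is None:
            return None
        sender, message = entry
        self.known_good.add(sender)  # no challenge for this sender next time
        return message
```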

Mistakes

One of the most critical things to consider when evaluating any anti-spam technique is the problem of mistakes. No technique is perfect, and the questions to answer when evaluating a technique include:

  1. Can false positives (incorrectly tagged or blocked email) be detected?

  2. Can false positives be fixed?

  3. Can false negatives (spam that was not blocked or tagged) be corrected?

  4. Can user errors be fixed?

In the lightweight categories, detecting false positives is problematic because messages identified as spam are often prevented from being delivered. For example, if a domain is mistakenly listed in a DNSBL, all email communication with that domain is cut off, making it difficult for users of that domain to complain that their messages are not being delivered. Often, the only way for domains using the DNSBL to discover the error is for someone from the blocked domain to inform their intended recipient of the problem via a non-email method. This is, of course, only possible if all affected parties have both the desire and ability to use a non-email method of contact. Rather than blocking email that is considered spam, the alternative is to tag it with an identifying header. Email so tagged is delivered to a special spam folder, quarantine area, or something similar. This allows spam-like email to be reviewed for mistakes, but does not take advantage of the primary purpose of lightweight filters: to make quick, final decisions and avoid spending time and resources analyzing each email in depth. Thus tagging email only makes sense when using a heavyweight filter.
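
Tagging rather than blocking can be sketched with Python's standard email package. The header names X-Spam-Flag and X-Spam-Score and the folder names are assumptions chosen for this example; any header that the delivery stage agrees to look for would do.

```python
from email import message_from_string


def tag_as_spam(raw_message: str, score: float) -> str:
    """Return the message with identifying spam headers added."""
    msg = message_from_string(raw_message)
    msg["X-Spam-Flag"] = "YES"
    msg["X-Spam-Score"] = f"{score:.1f}"
    return msg.as_string()


def destination_folder(raw_message: str) -> str:
    """Pick a delivery folder based on the tag, instead of discarding mail."""
    msg = message_from_string(raw_message)
    flagged = msg.get("X-Spam-Flag", "").upper() == "YES"
    return "Spam" if flagged else "INBOX"
```

The message body is untouched, so a mistakenly tagged message can still be read and moved back by the recipient.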

Whether mistakes can be fixed is another issue. With learning filters, such as Bayesian classifiers, correcting mistakes and thus improving accuracy is fundamental to the filter's design. On the other hand, informing the filter of its mistakes is frequently inconvenient, particularly with a large group of users who use different mail clients. Some filters, like DSPAM, provide a full web interface for submitting incorrectly tagged email messages, while others rely on command-line access to the mail server or some concoction by the system administrator. A good way to handle such tools in an IMAP-based environment is by creating magic folders that will—either by some server-based hook, cron, or some other method—cause the spam identification system to re-interpret messages placed in them.

On the other hand, correcting errors in a DNSBL or a spam database ranges anywhere from impossible to difficult, depending on the philosophy of those who maintain it. It is often necessary to use a local hand-written white list to avoid or countermand known misidentifications in such lightweight spam prevention techniques.

Stopping Spam from Getting Out

Preventing users from receiving spam is only half of the spam battle. The other half is to avoid sending spam. This may seem like a simple task on a user-by-user basis, but preventing users and your email system from sending spam on a wide-scale basis is more difficult. How to accomplish this depends upon the environment (i.e. how users send email), the resources devoted to the task, and the level of trust and convenience afforded to each user.

An obvious way to address the problem is to treat outbound email similarly to inbound email. For example, software such as SpamAssassin can scan each email before it is sent, and prevent messages identified as spam from being sent. This, however, is frequently overkill, and is particularly unnecessary if one's users are unlikely to be spammers.

Sender Restrictions

A simplistic approach is to perform basic checks on outbound email, such as ensuring the sender of every email is a valid recipient, limiting the amount of email a sender sends per hour, prohibiting the use of BCC: headers, or something similar. These restrictions, of course, can be onerous, depending on the users being supported, so limits should be chosen carefully.
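
As one example of such a restriction, a per-sender hourly limit can be sketched as a sliding window. The class name and the default limit of 100 messages are hypothetical; an appropriate limit depends entirely on the users being supported.

```python
import time
from collections import defaultdict, deque


class SenderRateLimiter:
    """Allow each sender at most `max_per_hour` messages per sliding window."""

    def __init__(self, max_per_hour=100, window=3600.0):
        self.max_per_hour = max_per_hour
        self.window = window
        self.sent = defaultdict(deque)  # sender -> timestamps of accepted mail

    def allow(self, sender, now=None):
        now = time.time() if now is None else now
        history = self.sent[sender]
        # Forget messages that have aged out of the window.
        while history and now - history[0] >= self.window:
            history.popleft()
        if len(history) >= self.max_per_hour:
            return False
        history.append(now)
        return True
```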

Bounce-Back Spam

Qmail is often criticized for its default policy of accepting all email destined for domains it considers local and then generating bounce messages for any email whose recipient does not exist. While this is a legitimate practice according to the SMTP protocol, it creates the problem of bounce-back spam (also known as blow-back or back-scatter). The problem stems from the fact that the sender's address might be inaccurate or invalid. When a spammer sends several million messages to random usernames at one of the configured local domains, qmail accepts them all and then generates bounces for all of the non-existent addresses (likely, all of them). Since the sender's address for these emails is probably inaccurate, either qmail is left with a large number of undeliverable bounce messages in its queue or qmail sends bounce messages to whatever legitimate email addresses the spammer chose to use as the return addresses. For example, a spammer can send a message that uses your email address as its return address to lasdkfkjhqw@example.com. The lasdkfkjhqw user likely does not exist at example.com, and if example.com is using an unmodified qmail installation, it will send a bounce message to your address to inform you that the spammer's message could not be delivered.

Recipient Validation

A partial solution to the problem of bounce-back spam is to modify qmail so that it will only accept messages addressed to valid recipients. For example, a modified (or wrapped) qmail can check whether a given user exists in the recipient domain rather than simply checking whether the recipient domain is listed in the control/rcpthosts file. Checking whether a user exists can be a complicated task. For example, mailing lists often use temporary addresses for administrative tasks (like handling user subscription requests) and virtual domains can have external user databases that qmail cannot access easily (such as vpopmail or GNU Mailman domains). As such, there are many different patches to qmail that use different methods of determining whether a user exists or not. Some of the most popular are:

  • Oliver Neubauer's validrcptto patch (http://www3.sympatico.ca/humungusfungus/code/validrcptto.html) considers a user valid if it is listed in the file control/validrcptto. This introduces two complications: first, the user list must be kept up to date, and second, no wild-card addresses are allowed (such as those used by many mailing list software packages, like ezmlm and GNU Mailman).

  • Dr. Erwin Hoffmann wrote, as part of his SPAMCONTROL collection of patches, the RECIPIENTS extension patch (http://www.fehcom.de/qmail/recipients/recipients-044_tgz.bin). This is similar to the validrcptto patch, but relies on a CDB file rather than a text file to list all the valid recipients (making lookups faster, particularly when the list of valid recipients is long). It also accepts all wildcard extension addresses of the addresses listed in its CDB file. In other words, if a given address is listed, extension addresses built from that address are accepted as well. Like the validrcptto patch, the centralized list of users must be kept up to date.

  • Paul Jarc wrote the realrcptto patch (http://multivac.cwru.edu/qmail/) to use the same tests that qmail-send uses to choose a delivery location. This works well when users are defined entirely within qmail's configuration files, or when all delivery locations are accessible by the qmaild user (i.e. the user that runs qmail-smtpd) but does not work well otherwise. For example, this patch accepts all messages for virtual domains controlled entirely by a single .qmail-default file (such as vpopmail domains and GNU Mailman domains) and does not correctly reject messages addressed to recipients that do not exist for those domains (because qmail-send would not reject those messages either; the bounce message normally comes from the vpopmail or Mailman software).

  • Jay Soffian's RCPTCHECK patch (http://www.soffian.org/downloads/qmail/qmail-smtpd-doc.html) relies on a sysadmin-provided external script or program to determine whether a recipient is acceptable or not. In a sense, this is the most flexible approach. It can be made to use any method necessary to validate a user, and allows the script to run as whatever user is necessary to perform the verification without requiring qmail-smtpd to have sufficient privileges to do so itself. However, this flexibility requires the sysadmin to write the script or program to perform the validation, which is more work.

This task does not absolutely require patching qmail; a qmail-queue wrapper can also perform it. Using a wrapper rather than a patch in this case has the drawback that the entire message must be received before the qmail-queue wrapper is triggered. A patch can check the recipients' validity as the sender lists them. If the recipients are invalid, rejecting them earlier saves bandwidth.

None of these methods is a full solution to the problem of bounce-back spam, because none of them can guarantee that a given message is deliverable.
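
The list-based checks described in this section can be sketched as follows. This is not the code or file format of any of the patches above: the function names are invented, the list is assumed to hold one address per line, and extension addresses are assumed to use a hyphen between the user name and the extension.

```python
def load_valid_recipients(lines):
    """Build a lookup set from an iterable of lines, one address per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def is_valid_recipient(address, valid, allow_extensions=False):
    """Return True if the address, or a listed base address, is in `valid`."""
    address = address.lower()
    if address in valid:
        return True
    if not allow_extensions:
        return False
    local, sep, domain = address.rpartition("@")
    if not sep:
        return False
    # Try successively shorter bases: user-a-b, then user-a, then user.
    while "-" in local:
        local = local.rpartition("-")[0]
        if f"{local}@{domain}" in valid:
            return True
    return False
```

Whichever variant is used, the check only shows that an address is listed, not that a message to it can actually be delivered.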

Recipient Validation is Insufficient

User validation is sufficient to prevent bounces in many cases, but some common examples where bounce-back spam cannot be prevented include:

  • The recipient does not have enough room in his or her mailbox to hold the message. Normally, qmail simply leaves such messages in its queue and keeps retrying delivery until the message is either delivered or it is older than the allowable queue lifetime (defined in control/queuelifetime), normally seven days.

  • The recipient forwards the message elsewhere. If the user's email is forwarded to a non-local account, it might not be deliverable right away, for a variety of reasons that depend on the destination server—the destination might be offline, might have temporary problems, the user might have run out of quota on that system, or any of a number of other problems may occur.

  • The recipient refuses the message in his or her .qmail file. If the user's .qmail file contains something like: |bouncesaying 'go away', then incoming messages are bounced to their return address.

  • The recipient set a vacation message. If the user's .qmail file contains something like: |autorespond 'I am on vacation.', then a message is sent to the return address of any message sent to that user.

  • The recipient may be a mailing list. Messages to mailing lists often generate automated response emails, such as warnings that only subscribers can post or instructions for how to subscribe to the list. Generally accepted best practice for mailing lists allowing anyone to self-subscribe is to require address confirmation. In other words, when someone attempts to subscribe, the list sends a message to the subscribing address with instructions for completing the subscription (this prevents people from subscribing others to lists without their knowledge). Thus, if a spammer sends a message to a mailing list's subscription address, a confirmation email is sent to the return address of the spammer's email.

Though bounce-back spam can never be fully eliminated, validating recipients still dramatically reduces the problem.

One of the downsides of validating recipients, however, is that it allows spammers to quickly discover (via the guess-and-check method) which users are valid on such a system, and in the future the spammer can direct spam to only those recipients. While this is technically possible, in practice it is uncommon for spammers to spend the time necessary to track the success of each address, because valid addresses to spam can be obtained with much less time-consuming methods.

Summary

This chapter covered the general topic of expanding the qmail architecture, with particular focus on the details of spam and virus prevention. Many different techniques for addressing the many facets of spam and viruses in today's world were discussed. Armed with this knowledge, a system administrator can harden an email system against spam in ways that are effective, efficient, and appropriate for a system's particular needs, limitations, and available resources. The next chapter covers more advanced topics—SSL support and mailing list support—that rely in different ways upon the understanding of the architecture
