Making fetchmail, Exim and the DCC work together

Spam is the same thing, lots of times.

Abstract

The Distributed Checksum Clearinghouse (DCC) filters out spam by taking checksums of messages passing through email servers and counting how many times a particular checksum has been seen. Checksums which have been seen many times correspond to spam or to popular legitimate email lists.

This document describes how to combine fetchmail, Exim 3.x and the DCC to filter out spam on a Debian Linux system with a small number of users, such as a home Linux box. The same method will probably work for any other fetchmail/Exim based system, too, although the paths to files may differ.

Caveats

This document assumes you are happy downloading, untarring and compiling software, and that you know a little bit about how to configure Exim already.

This system works for me. Test it first before relying on it. I provide this page in the hope that it will be useful, but if the system eats all your email and spits out the pieces, don't come running to me.

If you are running a large email system, you should run your own DCC server and investigate using the DCC as a sendmail milter. Hopefully someone will use the similar features in Exim 4 to integrate the DCC with Exim at some point.

Introduction

So, the spammers have finally guessed or harvested enough of my previously well hidden addresses that it's worth using some sort of filtering.

My email configuration uses fetchmail to download email using POP3. fetchmail then talks to Exim, the local mail server, and Exim delivers mail to user accounts. In this document, I'm assuming you already have a working fetchmail/Exim configuration. If you don't, stop reading now and get one. Make sure it works before continuing.

To deal with the spam, I decided to use the Distributed Checksum Clearinghouse or DCC. The DCC works by keeping track of the number of copies of a particular message which are flowing through various mail servers on the internet. A message which has been seen many times is either spam or a popular mailing list. The DCC requires you to white-list your legitimate mailing lists.

It seems to me that this is the most elegant method of filtering spam, since it relies on the one distinguishing feature of spam: unsolicited copies of the same thing, lots of times. (If it was solicited, you'd have whitelisted it when you subscribed, right?) As well as a straight checksum on the message body, the DCC uses "fuzzy" checksums to count how many times a message has been seen, so trivial variations on the same message are counted together.
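To make the counting idea concrete, here is a toy sketch in Python. It is emphatically not the DCC's real algorithm: the "fuzzy" checksum below simply ignores case, digits, punctuation and spacing, and the class and function names are made up for illustration.

```python
import hashlib
import re
from collections import Counter


def body_checksum(body: str) -> str:
    """Exact checksum: any change to the body changes the result."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def fuzzy_checksum(body: str) -> str:
    """Toy 'fuzzy' checksum (an assumption, not the DCC's method).

    Lower-cases the text and collapses everything that is not a letter,
    so trivial variations produce the same checksum.
    """
    letters_only = re.sub(r"[^a-z]+", " ", body.lower()).strip()
    return hashlib.sha256(letters_only.encode("utf-8")).hexdigest()


class ToyClearinghouse:
    """Counts how often each checksum has been reported."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def report(self, body: str) -> dict:
        """Record one sighting and return the running counts for it."""
        result = {}
        for name, func in (("Body", body_checksum), ("Fuz1", fuzzy_checksum)):
            key = (name, func(body))
            self.counts[key] += 1
            result[name] = self.counts[key]
        return result


if __name__ == "__main__":
    dcc = ToyClearinghouse()
    print(dcc.report("Buy cheap pills now! Offer 1234"))
    print(dcc.report("Buy cheap pills NOW!!  Offer 9876"))
```

The second message differs from the first only in case, punctuation and the offer number, so its exact count stays at 1 while its fuzzy count rises to 2.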

Getting hold of the DCC client

What you need is the dccproc tarball. dccproc is a DCC client to which you can pipe email to see how many times the DCC has seen it. By default, it outputs the email you pipe to it, with an X-DCC header added. This mail can then be fed back into the mail system, enabling the header to be used for filtering.

Extract the dccproc source using uncompress and tar. Then cd to the directory created by the tar file and configure the build to install under /usr/local/dcc/, so that it does not clash with the packaging system:

./configure --prefix=/usr/local/dcc

Once that's done, it's just make and make install, as usual. You will need to be root for the install step.

If you have a firewall of some description, you will need to open port 6277 for UDP packets. The client will attempt to contact one of the public DCC servers by default.

Test dccproc by piping it an email. The email needs to be the raw mail as found in your mailbox, without any MIME processing. The output should be a copy of the input with a header added. The header will look a bit like this:

X-DCC-wanadoo-be-Metrics: verence 1016; Body=1 Fuz1=1 Fuz2=1

The header shows the "brand" of DCC server (in this case, wanadoo.be's server) and the counts for various checksums (the straight body checksum and the two fuzzy checksums). These counts are how many times the DCC server has seen messages like the one you're looking at.
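If you want to look at these counts from a script, something like the following Python sketch would pull the header apart. The function name and the layout of the returned dictionary are my own invention; the text before the semicolon is kept verbatim rather than interpreted.

```python
import re

# Matches "X-DCC-<brand>-Metrics: <value>", where the value may span
# continuation lines.
HEADER_RE = re.compile(
    r"^X-DCC-(?P<brand>[^:\s]+)-Metrics:\s*(?P<value>.*)$",
    re.IGNORECASE | re.DOTALL,
)


def parse_dcc_header(header: str) -> dict:
    """Split an X-DCC header into brand, prefix and checksum counts."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError("not an X-DCC metrics header")
    # Unfold continuation lines into single spaces.
    value = " ".join(match.group("value").split())
    prefix, _, counts_part = value.partition(";")
    counts = {}
    for token in counts_part.split():
        name, sep, count = token.partition("=")
        if sep:
            # Counts are numbers, or a special word such as "many".
            counts[name] = int(count) if count.isdigit() else count
    return {
        "brand": match.group("brand"),
        "prefix": prefix.strip(),
        "counts": counts,
    }


if __name__ == "__main__":
    line = "X-DCC-wanadoo-be-Metrics: verence 1016; Body=1 Fuz1=1 Fuz2=1"
    print(parse_dcc_header(line))
```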

Configuring fetchmail

Note: If fetchmail is run regularly using the daemon option or from crontab, you should stop it while making the changes to fetchmail and Exim.

To your existing fetchmail configuration (usually in the file .fetchmailrc in the directory of the user who runs fetchmail), add the line:

mda "/usr/sbin/exim -oi -oee -oMr fetchmail -f '<%F>' '<%T>'"

For example, my .fetchmailrc looks like:

poll pop3.demon.co.uk with proto sdps
no dns
localdomains verence.demon.co.uk
user "verence" there with password "password" is * here
mda "/usr/sbin/exim -oi -oee -oMr fetchmail -f '<%F>' '<%T>'"
options fetchall

Note: Don't use this .fetchmailrc if you're not using Demon Internet. It won't work, as it specifies the SDPS protocol and disables any DNS lookups.

What this does is cause fetchmail to deliver mail by calling Exim from the command line and piping the mail to it, rather than using SMTP. Doing this lets us use the "-oMr" option, which specifies the protocol used to receive the mail (search the Exim specification for "-oMr"). We set this protocol value to "fetchmail". As we'll see below, this means that we can tell Exim to use the DCC to check only messages coming in via fetchmail, and not local mail or outbound messages. The DCC should really only be used on non-local mail: there's no point cluttering up the system with checksums from internal mail, and you don't want to accidentally filter common automatic messages.

Configuring Exim

This configuration is for Exim 3.x, which is the version of Exim supplied with Debian at the time of writing.

I'm assuming you've got a working Exim configuration based on the supplied template.

You need to alter the exim.conf file, /etc/exim/exim.conf. In the main configuration section, add or modify the trusted_users option so it includes the user who runs fetchmail. On my system "paul" runs fetchmail, so:

trusted_users = mail:paul

In the transports section, add the following transport. It doesn't matter where in that section you add it.

# This transport passes messages to the DCC process to see whether
# they are spam.
dcc:
  driver = pipe
  command = "/usr/sbin/exim -oMr dcc -bS"
  transport_filter = "/usr/local/bin/dccproc -f $sender_address -w \
                      /usr/local/dcc/whiteclnt -A  -t $recipients_count"
  user = mail
  group = mail
  log_output
  bsmtp = all
  prefix =

When this transport is used, it filters the mail through the dccproc program and then runs Exim again to pass the mail back into the system using batch SMTP. Now we need to use this transport on mail from fetchmail. So, at the top of the directors configuration section (the order does matter here, as the directors are checked in order), add:

# This checksums incoming mail using the DCC
checksum:
  driver = smartuser
  transport = dcc
  condition = "${if eq {$received_protocol}{fetchmail}{1}{0}}"
  user = mail

This causes mail received with the "fetchmail" protocol to be fed to the "dcc" transport we just created. As we specified "-oMr fetchmail" in the arguments fetchmail uses to call Exim, Exim will use this director and the "dcc" transport to pass incoming mail through dccproc.
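It may not be obvious why this doesn't loop forever, given that the "dcc" transport hands the message straight back to Exim. The reason is that the transport re-injects it with "-oMr dcc", so the director's condition is false the second time round. Here is a small Python model of that decision; it is only an illustration of the logic, and the function names and the "remaining directors" label are mine, not Exim's.

```python
def choose_transport(received_protocol: str) -> str:
    """Mimic the 'checksum' director's condition."""
    if received_protocol == "fetchmail":
        return "dcc"
    # Anything else falls through to the directors below 'checksum'.
    return "remaining directors"


def trace_delivery(initial_protocol: str) -> list:
    """Follow a message until it reaches the remaining directors."""
    steps = []
    protocol = initial_protocol
    while True:
        transport = choose_transport(protocol)
        steps.append((protocol, transport))
        if transport != "dcc":
            return steps
        # The dcc transport re-injects with "exim -oMr dcc -bS",
        # so the next pass sees the protocol "dcc", not "fetchmail".
        protocol = "dcc"


if __name__ == "__main__":
    for proto in ("fetchmail", "local"):
        print(proto, "->", trace_delivery(proto))
```

Mail that arrives with the "fetchmail" protocol makes exactly one trip through dccproc, and anything else skips it entirely.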

Test all this by sending yourself an email which will be delivered to your external POP3 account. Run fetchmail once, manually. You should end up with an email in your inbox with headers which look a bit like this:

Received: from mail by verence.demon.co.uk with dcc (Exim 3.36 #1 (Debian))
        id 181SY1-00009F-00
	for ...
Received: from paul by verence.demon.co.uk with fetchmail (Exim 3.36 #1
    (Debian))
        id 181SY0-00009A-00
	for ...
Received: from pop3.demon.co.uk
        by localhost with POP3 (fetchmail-5.9.11)
        for ...
     15 Oct 2002 15:19:48 +0100 (BST)
...
X-DCC-wanadoo-be-Metrics: verence 1016; Body=1 Fuz1=1 Fuz2=1

If this doesn't work, you'll need to figure out what's gone wrong. Good luck! The Exim logs in /var/log/exim/mainlog should help.
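If you'd rather not check the headers by eye, a rough Python sketch like this one reads a raw message and reports which "with ..." protocols appear in its Received headers and whether an X-DCC header is present. The function name and the shape of the result are hypothetical, and the "with" pattern is only a guess that fits the sample headers above.

```python
import email
import re
import sys


def check_delivery(raw_message: str) -> dict:
    """Report the Received protocols and whether an X-DCC header exists."""
    msg = email.message_from_string(raw_message)
    protocols = []
    for received in msg.get_all("Received", []):
        # Pick out the word after "with", e.g. "with dcc".
        found = re.search(r"\bwith\s+(\S+)", received)
        if found:
            protocols.append(found.group(1))
    has_dcc_header = any(
        re.fullmatch(r"X-DCC-.+-Metrics", name, re.IGNORECASE)
        for name in msg.keys()
    )
    return {
        "protocols": protocols,
        "has_dcc_header": has_dcc_header,
        # The message should have been received once via fetchmail
        # and once more after the trip through dccproc.
        "passed_through_dcc": has_dcc_header
        and "fetchmail" in protocols
        and "dcc" in protocols,
    }


if __name__ == "__main__":
    # Usage: python3 check_delivery.py < saved-message
    print(check_delivery(sys.stdin.read()))
```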

Actually filtering spam

Using the -c option to dccproc, you can make it exit with a non-zero exit status if the checksum counts exceed specified thresholds. However, this causes the delivery process to stop. dccproc is intended for use with procmail, and the idea is that procmail will bounce the message; Exim instead assumes that a non-zero exit status means a mistake in the configuration, and defers delivery. The best thing to do is to examine the X-DCC header in an Exim filter file.

If you're like me, you probably use Exim's .forward file filtering language to sort mail from mailing lists into separate mailboxes. You can also use it to delete mail which looks like spam, or save it to a separate folder.

Here's an example section from a .forward file:

#   Exim filter

... Deliver your mail from mailing lists into boxes BEFORE this test...

if $message_headers matches "(?m)^X-DCC-.*-Metrics:(.*(?:\n\\\\s+.*)*)" then
  if $1 contains "many" or
     ${extract{Body}{$1}{$value}{0}} is above 15 or
     ${extract{Fuz1}{$1}{$value}{0}} is above 15 or
     ${extract{Fuz2}{$1}{$value}{0}} is above 15 then
    save $home/mail/spam
    seen finish
  endif
endif

Note: you must follow the instruction about delivering mail from mailing lists before doing this check. If you don't, you may find you start classifying your mailing lists as spam.

This looks for an X-DCC header, allowing for headers which have continuation lines. It pulls out the Body=, Fuz1= and Fuz2= parts and checks them against thresholds (after checking to see if any counts are "many", the special value which people use to indicate messages they consider to be spam). We cope with X-DCC headers which do not contain a particular checksum count (because the message is too short) by giving a default value of 0 for counts which are not there.
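If the filter syntax is hard to follow, the same decision can be written out in Python. This is only a sketch for trying the logic on sample headers, not something Exim runs; the function name is made up, and the threshold of 15 is copied from the filter above.

```python
import re

# Same idea as the filter's pattern: the header line plus any
# continuation lines that start with a space or tab.
METRICS_RE = re.compile(
    r"^X-DCC-.*-Metrics:(.*(?:\n[ \t]+.*)*)",
    re.MULTILINE | re.IGNORECASE,
)
COUNT_NAMES = ("Body", "Fuz1", "Fuz2")


def looks_like_bulk(headers: str, threshold: int = 15) -> bool:
    """Return True if the X-DCC counts suggest bulk mail."""
    match = METRICS_RE.search(headers)
    if match is None:
        return False  # no X-DCC header, so nothing to judge
    value = match.group(1)
    if "many" in value:
        return True  # someone reported this as definite spam
    for name in COUNT_NAMES:
        found = re.search(rf"\b{name}=(\d+)", value)
        count = int(found.group(1)) if found else 0  # missing means 0
        if count > threshold:  # "is above" means strictly greater
            return True
    return False


if __name__ == "__main__":
    sample = "X-DCC-wanadoo-be-Metrics: verence 1016; Body=1 Fuz1=1 Fuz2=1\n"
    print(looks_like_bulk(sample))
```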

Spam is delivered to ~/mail/spam. You probably want to change that. If you just want to delete spam, remove the save $home/mail/spam line. However, it's a bad idea to delete the spam mails immediately after you start using this system. First, you should look at the spam folder to see whether you're getting false positives (for example, mailing lists you'd forgotten about). You may also want to adjust the thresholds: the example above uses 15, and 35 is what works for me at the moment.
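One way to get a feel for the thresholds is to count how many messages in a folder each candidate value would have caught. The Python sketch below does this for an mbox file. The function names are hypothetical, it assumes the folder is in mbox format, and 15 and 35 are just the two values mentioned on this page.

```python
import mailbox
import re
import sys


def highest_count(header_value: str):
    """Return "many" or the largest Body/Fuz1/Fuz2 count (0 if none)."""
    if "many" in header_value:
        return "many"
    counts = [
        int(n)
        for n in re.findall(r"\b(?:Body|Fuz1|Fuz2)=(\d+)", header_value)
    ]
    return max(counts, default=0)


def summarise(messages, thresholds=(15, 35)) -> dict:
    """Count the messages each threshold would file as spam."""
    caught = {t: 0 for t in thresholds}
    for msg in messages:
        for name, value in msg.items():
            if re.fullmatch(r"X-DCC-.+-Metrics", name, re.IGNORECASE):
                top = highest_count(value)
                for t in thresholds:
                    if top == "many" or top > t:
                        caught[t] += 1
                break  # only look at the first X-DCC header
    return caught


if __name__ == "__main__":
    # Usage: python3 summarise.py /path/to/mbox
    box = mailbox.mbox(sys.argv[1], create=False)
    print(summarise(box))
```

Running it over both your inbox and your spam folder gives a rough idea of how many legitimate messages a lower threshold would start catching.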

Run a spam trap

If you've got old addresses which get spam and nothing but spam (for example, old posting addresses from Usenet), you can put these to good use as "spam traps", by reporting mail sent to them to the DCC with the special count value of "many". This count value is used to indicate mails which are considered to be definite spam. As a result, other DCC users will be protected from that spam. If one of your spam traps gets the same spam as a legitimate address before the legitimate address sees it, you'll be doing yourself a favour as well, since the DCC will have already marked that message as spam.

The easiest way to do this is to set up a forward file for those spam addresses which pipes mail to dccproc and then discards it. Here's an example I use, which receives mail sent to various addresses which often get spam:

#   Exim filter
# spamtrap, report to DCC and razor
pipe "/usr/local/bin/dccproc -t many -o /dev/null"
pipe "/usr/bin/razor-report"
save $home/mail/spam
seen finish

This also reports to Vipul's Razor. I don't use that because I think the DCC is better (I don't trust Razor's trust mechanism), but there's no harm in giving them a hand. If you've just got the DCC installed, remove the line containing the razor-report command.
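If a spam message slips through to a legitimate address and you want to report it by hand, the same dccproc invocation can be wrapped in a few lines of Python. This is a sketch under assumptions: it reuses the path and the "-t many -o /dev/null" options from the forward file above, the function name is made up, and the exit status is simply passed back without interpretation.

```python
import subprocess
import sys

# Same command as the spam trap's pipe line; adjust the path to suit.
DCCPROC_REPORT = ["/usr/local/bin/dccproc", "-t", "many", "-o", "/dev/null"]


def report_spam(raw_message: bytes, command=None) -> int:
    """Pipe one raw message to the reporting command.

    Returns the command's exit status. The optional 'command'
    argument exists only so the function can be tried out without
    dccproc installed.
    """
    cmd = list(command) if command is not None else DCCPROC_REPORT
    completed = subprocess.run(cmd, input=raw_message, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Usage: python3 report_spam.py < saved-spam-message
    sys.exit(report_spam(sys.stdin.buffer.read()))
```

As with the test earlier, the message must be the raw mail as found in your mailbox, without any MIME processing.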