John Bokma Fighting spam

SpamAssassin cookbook


Below are some rules and other settings I added to my user_prefs file, the per-user configuration file of the excellent mail-classifying Perl program SpamAssassin. Note that I call it a classifying program, not a filter or a spam-blocking program. SpamAssassin adds information to an email about its class, spam or ham, along with the score of the email. This information can be used to move email to separate mail folders.

Every time you modify the user_prefs file, run spamassassin in lint mode, e.g.:

spamassassin --lint

SpamAssassin can also be installed on a computer running a Microsoft Windows (Win32) operating system. Michael Bell has written a HOW-TO on porting SpamAssassin.

Extra classification rules

I give each rule the prefix LR_, meaning local rule, to distinguish it from the rules that come with SpamAssassin. Each rule consists of three parts: a Perl regular expression defining the actual rule, a description, and a score.

It should be noted that the default score used for each rule in SpamAssassin is the result of a long run of a genetic algorithm that assigns each score in such a way that the false positives and false negatives reported are minimal for a huge set of spam and non-spam (ham). Adding rules and/or changing score values can disturb this fine balance and result in more classification errors. Hence the score value I use in each rule is an example only, and you should always verify each mail reported as spam before deletion. Don't throw away mail classified as spam without looking at it.

Perl programmers should take care not to accidentally put a terminating ; after the pattern match.
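To see why one extra local rule can tip a message over the edge, here is a minimal Python sketch of the scoring idea, not SpamAssassin's actual code. It assumes the scores of all matching rules are simply added up and compared with the required_score setting, which is 5.0 by default. The function name classify is made up for this illustration, the scores are the example values from the rules on this page, and in a real run the stock rules contribute to the total as well.

```python
# Example scores copied from the local rules on this page.
LOCAL_SCORES = {
    "LR_SUBJECT_VERY_LONG": 1.5,
    "LR_SUBJECT_JOHN": 1.0,
    "LR_SUBJECT_RND": 2.0,
    "LR_SUBJECT_NUMBER_FUN": 3.0,
}

# SpamAssassin's default required_score setting.
REQUIRED_SCORE = 5.0


def classify(hits, scores=LOCAL_SCORES, required=REQUIRED_SCORE):
    """Add up the scores of the rules that hit and compare with the threshold.

    'hits' is a list of rule names. This is a simplified model: the real
    program also adds the scores of every stock rule that matched.
    """
    total = sum(scores[name] for name in hits)
    return total, ("spam" if total >= required else "ham")


if __name__ == "__main__":
    print(classify(["LR_SUBJECT_JOHN"]))
    print(classify(["LR_SUBJECT_JOHN", "LR_SUBJECT_RND", "LR_SUBJECT_NUMBER_FUN"]))
```

A single low-scoring local rule leaves a message well below the threshold, but three of them together already exceed it, which is why every local score should be chosen with care.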

Very long subject lines

header LR_SUBJECT_VERY_LONG    Subject =~ /.{500}/
describe LR_SUBJECT_VERY_LONG  Subject contains a lot of characters
score LR_SUBJECT_VERY_LONG     1.5

This rule is matched if a subject contains at least 500 characters.

Subject starts with my name

header LR_SUBJECT_JOHN    Subject =~ /^john/i
describe LR_SUBJECT_JOHN  Subject: starts with John
score LR_SUBJECT_JOHN     1

This rule is matched if a subject starts with john. The case does not matter thanks to the i option after the closing / in the pattern match.

A lot of spammers start their subject with the part in front of the @ taken from your email address. I have several rules like the above, including "postmaster", "hostmaster" and even "ostmaster". I suspect the latter is caused by a flaw in a bot looking for email addresses.
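Writing the same three lines for every local part gets tedious. The following Python sketch generates them. The helper subject_prefix_rule is made up for this example, and so are the generated rule names other than LR_SUBJECT_JOHN. It assumes the local part consists of letters and digits only, so nothing in it needs escaping inside the regular expression.

```python
def subject_prefix_rule(local_part, score=1):
    """Return the header/describe/score lines for one "Subject starts with" rule.

    Assumes local_part contains only letters and digits, so it can be
    placed in the pattern without escaping.
    """
    name = "LR_SUBJECT_" + local_part.upper()
    return "\n".join([
        f"header {name}    Subject =~ /^{local_part}/i",
        f"describe {name}  Subject: starts with {local_part}",
        f"score {name}     {score}",
    ])


if __name__ == "__main__":
    for part in ("john", "postmaster", "hostmaster", "ostmaster"):
        print(subject_prefix_rule(part))
        print()
```

Paste the output into user_prefs and run spamassassin --lint afterwards, as with any other change.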

Subject contains %RND_UC_CHAR[2-8]

header LR_SUBJECT_RND    Subject =~ /%RND_UC_CHAR\[2-8\]/
describe LR_SUBJECT_RND  Subject contains %RND_UC_CHAR[2-8]
score LR_SUBJECT_RND     2

I see this one now and then. It is probably the result of a badly configured spam-sending program or a spammer who did not read the manual.

Subject starts and ends with a number in brackets

# Example: Subject: (857)Scientificaly proven to work(699)

header LR_SUBJECT_NUMBER_FUN    Subject =~ /^\(\d+\).+\(\d+\)$/
describe LR_SUBJECT_NUMBER_FUN  Subject starts and ends with a number between ()
score LR_SUBJECT_NUMBER_FUN     3
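Before adding a pattern like this one it helps to try it on a few subjects. The sketch below does so in Python, whose re module handles this particular expression the same way Perl does. The function name subject_number_fun is made up, and the subjects other than the example above are invented test data.

```python
import re

# Same pattern as in the LR_SUBJECT_NUMBER_FUN rule.
NUMBER_FUN = re.compile(r"^\(\d+\).+\(\d+\)$")


def subject_number_fun(subject):
    """True if the subject starts and ends with a number between ()."""
    return NUMBER_FUN.search(subject) is not None


if __name__ == "__main__":
    for subject in (
        "(857)Scientificaly proven to work(699)",  # example from this page
        "(1) first item on the agenda",            # number at the start only
        "Minutes of the meeting (2004)",           # number at the end only
    ):
        print(subject_number_fun(subject), subject)
```

Only the first subject matches, because the pattern is anchored at both ends and requires at least one character between the two bracketed numbers.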

Subject warns about a virus

header LR_VIRUS_IT_1   Subject =~ /^Attenzione: Catturato un Virus$/
describe LR_VIRUS_IT_1  Virus warning
score LR_VIRUS_IT_1     6

Those automatically generated warnings sent to innocent bystanders are very annoying. I am keeping a list of clueless senders of those messages.

URI ends in digits

uri LR_URI_NUMERIC_ENDING       m|^https?://.+?\d{4,}$|i
describe LR_URI_NUMERIC_ENDING  Ends in a number of at least 4 digits
score LR_URI_NUMERIC_ENDING     1

A URI ending in four or more digits matches this rule. Links like that are often used for non-working "unsubscribe" links or in tracking systems.

I used the m|...| form instead of /.../ to make the rule more readable. Otherwise the forward slashes have to be escaped using a backslash.

URI has lc as subdomain

uri LR_URI_LC_SUB       m|^https?://lc\.|i
describe LR_URI_LC_SUB  Has lc as subdomain
score LR_URI_LC_SUB     1

The lc subdomain is used now and then in links by spammers. Note that the . must be escaped; otherwise it means "any character".

URI contains @castleamber.com

uri LR_URI_AMBER       m|^https?://\S+?=[-_\+a-z0-9\.]+\@castleamber\.com|i
describe LR_URI_AMBER  Uses @castleamber.com in URI
score LR_URI_AMBER     2

A lot of unsubscribe links contain your email address. I own the domain castleamber.com. Create a rule for your own domain. Note that the @ and the . in the domain name must be escaped by putting a \ in front.
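To adapt the rule to another domain, only the escaped domain name, the rule name and the description change. The Python sketch below builds the three lines and returns the bare pattern as well, so it can be tried on a sample URI first. The helper own_domain_uri_rule and the sample URIs are made up for this example, and the simple escaping only handles the dots in the domain name.

```python
import re


def own_domain_uri_rule(domain, name, score=2):
    """Return (pattern, rule_text) for a "my address appears in the URI" rule.

    Only the dots of the domain are escaped, which is enough for ordinary
    domain names.
    """
    escaped = domain.replace(".", r"\.")
    pattern = rf"^https?://\S+?=[-_\+a-z0-9\.]+\@{escaped}"
    rule_text = "\n".join([
        f"uri {name}       m|{pattern}|i",
        f"describe {name}  Uses @{domain} in URI",
        f"score {name}     {score}",
    ])
    return pattern, rule_text


if __name__ == "__main__":
    pattern, rule_text = own_domain_uri_rule("castleamber.com", "LR_URI_AMBER")
    print(rule_text)
    # Made-up sample URIs: the first carries an address, the second does not.
    for uri in (
        "http://example.com/unsubscribe?email=john@castleamber.com",
        "http://castleamber.com/",
    ):
        print(bool(re.search(pattern, uri, re.IGNORECASE)), uri)
```

Called with castleamber.com the helper reproduces the pattern shown above. A link to the domain itself does not match, because the pattern requires an = followed by an address.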

Top level domain 'us'

uri LR_US_TLD       /^(?:https?:\/\/|mailto:)[^\/]+\.us(?:\/|$)/i
describe LR_US_TLD  Contains a URL in the US top-level domain
score LR_US_TLD     1.5

Since there is a 'biz' rule I decided to add a 'us' rule as well. A similar rule can be made for the 'info' tld.

Fred Mobach (Systemhouse Mobach bv) warned me that there are non-profit organisations and government-related organisations that use the 'us' tld, so be careful when you add this rule. Bottom line: always check email marked as spam before you delete it.
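The Python sketch below shows what the pattern does and does not catch. The expression is the one from the rule, written without the escaped slashes, which Python does not need. The function name hits_us_tld and all sample addresses are invented for this illustration.

```python
import re

# Same expression as in the LR_US_TLD rule; / needs no escaping in Python.
US_TLD = re.compile(r"^(?:https?://|mailto:)[^/]+\.us(?:/|$)", re.IGNORECASE)


def hits_us_tld(uri):
    """True if the host part of the URI ends in .us."""
    return US_TLD.search(uri) is not None


if __name__ == "__main__":
    for uri in (
        "http://www.example.us/offer",      # matches
        "mailto:info@example.us",           # matches
        "http://www.library.example.us/",   # matches, spam or not
        "http://www.example.com/about.us",  # no match: .us is in the path
        "http://example.usa.com/",          # no match: .us is not the tld
    ):
        print(hits_us_tld(uri), uri)
```

The third sample is the kind of legitimate address the warning above is about: the rule cannot tell it apart from a spammer's link.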

Re: CAPS, word word word

header __LR_RE_CAPS      Subject =~ /^Re: [A-Z]{2,8}, \S+ \S+ \S+$/
header __LR_MPOP_WEBMAIL X-Mailer =~ /^mPOP Web-Mail 2\.19$/
body __LR_LONG_WORDS_5   /([a-z]{6,} ){5,}/
meta LR_EHOSTZZ          (__LR_RE_CAPS && __LR_MPOP_WEBMAIL && __LR_LONG_WORDS_5)
describe LR_EHOSTZZ      e-hostzz spam and look-a-likes
score LR_EHOSTZZ         6

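This is a new set of rules for the so-called e-hostzz spam and its look-a-likes. The three rules whose names start with a double underscore are sub-rules: they carry no score of their own, and the meta rule LR_EHOSTZZ only matches if all three of them match.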
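The way the three tests are combined can be sketched in Python as follows. This is a simplified model and not how SpamAssassin evaluates rules internally: the body is treated as one string, the function name lr_ehostzz is made up, the sample message is invented, and the dot in the version number is escaped so that it only matches a literal dot.

```python
import re

RE_CAPS = re.compile(r"^Re: [A-Z]{2,8}, \S+ \S+ \S+$")  # Subject test
MPOP_WEBMAIL = re.compile(r"^mPOP Web-Mail 2\.19$")      # X-Mailer test
LONG_WORDS_5 = re.compile(r"([a-z]{6,} ){5,}")           # body test


def lr_ehostzz(subject, x_mailer, body):
    """True only if all three sub-tests match, like the && in the meta rule."""
    return bool(
        RE_CAPS.search(subject)
        and MPOP_WEBMAIL.search(x_mailer)
        and LONG_WORDS_5.search(body)
    )


if __name__ == "__main__":
    # Invented sample message in the style the rules are aimed at.
    subject = "Re: QWERT, hello there friend"
    body = "please consider another special discount offered tonight only"
    print(lr_ehostzz(subject, "mPOP Web-Mail 2.19", body))
    print(lr_ehostzz(subject, "Some Other Mailer 1.0", body))
```

Each test on its own is far too weak to deserve a score of 6. Only the combination is specific enough, which is the point of using a meta rule.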

Blacklist

I have several email addresses at castleamber.com. Some are no longer in use, and some were even invented by spammers, hence the following list. Some spammers also send spam to one address and CC the same mail to others within the same domain. If one of the blacklisted addresses is used, the mail gets a very high score, so it seems useful to create a few bogus addresses to serve as spam traps.

blacklist_to  yse@castleamber.com
blacklist_to  google@castleamber.com
blacklist_to  master@castleamber.com
blacklist_to  ostmaster@castleamber.com
blacklist_to  aster@castleamber.com
blacklist_to  ohn@castleamber.com
blacklist_to  uses@castleamber.com
blacklist_to  contains@castleamber.com

The last two were harvested from this very page: I used "contains" as a section title and "uses" in the section itself, and the email address harvesting bot somehow dropped the space in both.

Some spam domains have very obvious names. The following is an excerpt taken from my user_prefs. You can download a list that I update daily.

blacklist_from  *@advertisingbymail.com
blacklist_from  *@bestbuyuu.com
blacklist_from  *@getquickernews.com
blacklist_from  *@getquicknews.com
blacklist_from  *@myperfectstuff.com
blacklist_from  *@yourgreateststuff.com
blacklist_from  *@greatnewsnow.com
blacklist_from  *@mygreateststuff.com
blacklist_from  *@myquicknews.com
blacklist_from  *@getwebrx.com
blacklist_from  *@purchze3.com
blacklist_from  *@thiergreatstuff.com

blacklist_from  *.greatoffersbymail.com

Note that the * acts as a wildcard that can be put in front of a . too. For example, greatoffersbymail.com uses several different subdomains to send spam from, but all of them end in .greatoffersbymail.com.

I keep a list of spam-sending domains to be used for blacklisting. You can download the list in the 'blacklist_from' format as well.
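How such a wildcard entry lines up with a sender address can be modelled in a few lines of Python. This is a sketch of the glob idea only, not SpamAssassin's own matching code. It assumes that * stands for any run of characters, ? for exactly one character, and that case is ignored. The function names and the sample sender addresses are made up.

```python
import re


def glob_to_regex(pattern):
    """Translate a blacklist entry into a compiled regular expression.

    '*' becomes '.*', '?' becomes '.', everything else is taken literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def is_blacklisted(address, patterns):
    """True if the whole address matches at least one entry."""
    return any(glob_to_regex(p).fullmatch(address) for p in patterns)


if __name__ == "__main__":
    entries = ["*@getwebrx.com", "*.greatoffersbymail.com"]
    for sender in (
        "news@getwebrx.com",
        "deals@mail7.greatoffersbymail.com",
        "deals@greatoffersbymail.com",
    ):
        print(is_blacklisted(sender, entries), sender)
```

Under this model the last sender is not caught, because *.greatoffersbymail.com requires a dot in front of the domain name. An extra *@greatoffersbymail.com entry would cover mail sent from the bare domain.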
