New and upcoming features in SpamAssassin v3



  1. New and upcoming features in SpamAssassin v3
     ApacheCon 2004, November 15, 2004
     By: Theo Van Dinter

  2. Project Changes
     - Became an ASF Top Level Project
     - New logo!
     - License change from GPL/PAL to ASL2
       - Took 4+ months for 100+ retroactive CLAs
     - Moved from SourceForge to ASF
       - Mailing lists
       - CVS to Subversion

  3. Project Changes
     - New version number scheme (x.y.z vs x.yz)
     - Minimum perl version increased from 5.005 to 5.6.1
     - Major API changes
     - Code cleanup
     - Message parsing
     - Module merging and separation

  4. Project Changes: 2.60 vs 3.0.0
     - 1712 commits
     - 941k vs 1.0m (gzip release file)
     - 1 year exactly between releases
       - 9 months in 2.70 & 3.0.0 development
       - 2 months in pre-release mode (scores and testing)
       - 1 month in release candidate mode (beta testing)

  5. Changes per Type
     - Code is in multiple pieces:
       - Message filter: read in message, parse, rewrite output
       - Rule engine: run hundreds of rules over message contents, handle priorities, scoring (weight) per rule, etc.

  6. Filter Changes

  7. Message Parser
     - Core of the filter is the message parser
     - v2 did OK, but complex MIME didn’t work at all
     - Removed Mail::Audit support and NoMailAudit module, replaced with Message and Message::Node
     - Ground-up rewrite of parser
       - Goal: handle even complex MIME messages
       - Better emulation of common MUA behavior

  8. Message Parser
     - Linear vs recursive
     - New internal tree structure
     - Just-In-Time (JIT) behavior when possible

  9. Message Parser
     - MUA emulation
       - OE HTML heuristic
       - Content-Type boundary handling
       - Non-RFC compliance

  10. Filter Changes
      - Configuration parser separated from options
        - Makes handling of option values standardized
      - Parsing is much faster
        - Hash lookup, not linear if-then-else logic
      - Configuration files can now include other files
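The "hash lookup, not linear if-then-else" point can be illustrated with a small dispatch table. This is a minimal Python sketch, not SpamAssassin's Perl implementation: the handler helpers, the `load` callback, and the small set of option names are assumptions chosen for illustration, and comment stripping is deliberately naive.

```python
from typing import Callable, Dict, Optional


def set_float(key: str) -> Callable[[dict, str], None]:
    """Build a handler that stores the option value as a float."""
    def handler(settings: dict, value: str) -> None:
        settings[key] = float(value)
    return handler


def set_int(key: str) -> Callable[[dict, str], None]:
    """Build a handler that stores the option value as an int."""
    def handler(settings: dict, value: str) -> None:
        settings[key] = int(value)
    return handler


# Illustrative option table: one dictionary lookup per config line
# replaces a long if/elif chain over every known option name.
HANDLERS: Dict[str, Callable[[dict, str], None]] = {
    "required_score": set_float("required_score"),
    "report_safe": set_int("report_safe"),
}


def parse_config(text: str, settings: dict,
                 load: Optional[Callable[[str], str]] = None,
                 depth: int = 0) -> dict:
    """Parse 'name value' lines; 'include NAME' pulls in another file.

    `load` is a hypothetical callback that returns the text of an
    included file, so the sketch needs no real filesystem access.
    """
    if depth > 10:
        raise ValueError("include nesting too deep")
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # naive comment stripping
        if not line:
            continue
        name, _, value = line.partition(" ")
        value = value.strip()
        if name == "include":
            if load is None:
                raise ValueError("include used but no loader given")
            parse_config(load(value), settings, load, depth + 1)
            continue
        handler = HANDLERS.get(name)  # hash lookup, not a linear scan
        if handler is None:
            raise KeyError(f"unknown option: {name}")
        handler(settings, value)
    return settings
```

Keeping the value handling in per-option handlers is also what makes option values easy to standardize: each type (float, int, and so on) is validated in one place.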

  11. Filter Changes: ArchiveIterator
      - Added support for UW mbx format
        - Now handles: file, mbox, mbx, dir
      - spamassassin script now uses ArchiveIterator
        - MUCH faster for batch operations
      - Message parser JIT behavior makes remove markup mode super fast!

  12. ArchiveIterator Batch Mode Example

          $ ls -la 1000spam
          -rw-r--r-- 1 felicity fame 4375748 Oct 10 18:15 1000spam
          $ time formail -s spamassassin-260 -L < 1000spam > 1000spam.spam
          900.550u 71.580s 17:06.06 94.7% 0+0k 0+0io 724863pf+0w
          $ time formail -s spamassassin-260 -d < 1000spam.spam > 1000spam.clean
          706.440u 55.200s 13:21.33 95.0% 0+0k 0+0io 722119pf+0w

      diff reported 51 differences, all Subject header whitespace related

          $ time spamassassin-30 -L --mbox 1000spam > 1000spam.spam
          69.700u 0.600s 1:13.44 95.7% 0+0k 0+0io 730pf+0w
          $ time spamassassin-30 -d --mbox 1000spam.spam > 1000spam.clean
          3.360u 0.290s 0:03.66 99.7% 0+0k 0+0io 726pf+0w

      diff reported 0 differences; scanning: 14x faster, removing markup: 209x!
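The quoted speedups can be reproduced from the `time` output above. The short Python check below assumes the ratios are based on CPU time (user plus system seconds) rather than wall-clock time; that reading is an inference from the numbers, since the slide does not say which figure was used.

```python
def cpu_seconds(user: float, system: float) -> float:
    """Total CPU time as reported by `time` (user + system)."""
    return user + system


# Values copied from the transcript above.
scan_260 = cpu_seconds(900.550, 71.580)   # formail + spamassassin-260 -L
scan_30 = cpu_seconds(69.700, 0.600)      # spamassassin-30 -L --mbox
strip_260 = cpu_seconds(706.440, 55.200)  # formail + spamassassin-260 -d
strip_30 = cpu_seconds(3.360, 0.290)      # spamassassin-30 -d --mbox

scan_speedup = scan_260 / scan_30
strip_speedup = strip_260 / strip_30

# prints: scanning: 13.8x, removing markup: 208.7x
print(f"scanning: {scan_speedup:.1f}x, removing markup: {strip_speedup:.1f}x")
```

Rounded, these give the 14x and 209x figures on the slide.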

  13. Changes to spamd
      - spamd is the daemonized spamassassin scanner
      - Previous versions accepted a message, then forked to process it
      - 3.0.0 pre-forks children, which “randomly” accept connections and do the processing
        - Causes lots of challenges, such as reverting user configuration, GC, resource usage, etc.

  14. Changes to spamd
      - Output log includes mass-check compatible output:

          Oct 10 09:34:57 eclectic spamd[14215]: result: Y 14 -
            ADDRESS_IN_SUBJECT,BAYES_99,DNS_FROM_AHBL_RHSBL,EXCUSE_1,
            EXCUSE_3,EXCUSE_7,HTML_90_100,HTML_IMAGE_RATIO_02,
            HTML_MESSAGE,MARKETING_PARTNERS,MIME_QP_LONG_LINE,
            MPART_ALT_DIFF,RAZOR2_CHECK,RCVD_IN_SBL,URIBL_OB_SURBL,
            URIBL_SBL,URIBL_WS_SURBL scantime=2.1,size=6896,
            mid=<61DF7FD0F6D44F26B764B5C2CE4C9ECFA6D1E8@anbok.com>,
            bayes=0.999669884503962,autolearn=no
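A log line in this shape is straightforward to post-process. The following Python sketch is an assumption-laden illustration, not a tool shipped with SpamAssassin: it infers the layout from the single example above (a flag, an integer score, a rule list, then `key=value` fields), treats `Y` as "spam" and any other flag as "not spam", and assumes field values contain no commas or whitespace.

```python
import re
from typing import Optional

# Layout inferred from the one sample line; other spamd versions may differ.
RESULT_RE = re.compile(r"result:\s+(\S)\s+(-?\d+)\s+-\s+(.*)", re.S)


def parse_result(line: str) -> Optional[dict]:
    """Split a spamd 'result:' log line into flag, score, rules, fields."""
    m = RESULT_RE.search(line)
    if m is None:
        return None
    flag, score, rest = m.groups()
    rules, fields = [], {}
    for token in re.split(r"[,\s]+", rest.strip()):
        if not token:
            continue
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value          # e.g. scantime, size, mid, bayes
        else:
            rules.append(token)          # rule names carry no '='
    return {
        "is_spam": flag == "Y",
        "score": int(score),
        "rules": rules,
        "fields": fields,
    }


line = ("Oct 10 09:34:57 eclectic spamd[14215]: result: Y 14 - "
        "BAYES_99,RAZOR2_CHECK,URIBL_SBL scantime=2.1,size=6896, "
        "mid=<61DF7FD0F6D44F26B764B5C2CE4C9ECFA6D1E8@anbok.com>, "
        "bayes=0.999669884503962,autolearn=no")
parsed = parse_result(line)
```

Here `line` is a shortened copy of the sample entry; `parsed["rules"]` holds the three rule names and `parsed["fields"]["autolearn"]` holds `"no"`.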

  15. Engine Changes

  16. Rule Changes
      - 2.60 had 872 rules, 3.0.0 has 628
        - 407 kept, 465 removed, 221 added, includes renamed rules
      - 2.60 had 160 sub-rules, 3.0.0 has 227
        - 130 kept, 30 removed, 97 added

  17. New Rules
      - RCVD_IN_XBL: Spamhaus exploit list
      - DRUGS_*: Common drug references
      - LONGWORDS: Lots of 5+ letter words in a row

  18. New Rules
      - HTML_BACKHAIR_*: Catches HTML obfuscation techniques:

          up to 8<b></b>0% </strong> by purch<b></b>asing onl<b></b>ine for ac</amount>cess to mi<grab>llions of pr<clergy>ivate, sen<hem>sitive <cab>online re</maxwellian>cords,<br>
          Free<kkmx7fb1lxwk0p1> O<ku0j5aa3xhln6z1<k9lntxebsm7452>>nl<k8sk2493yb31md1>ine Consult<br>
          Order pr<kn10yxtomj0e82>escrip<k0602x82qzft>tion onl<kv5mh0x2lq1npz>ine and <k3eh16dp3swwg1e>Cheap<br>

  19. New Rules
      - MPART_ALT_DIFF: Looks for multipart/alternative messages with significantly different word lists in text/plain and text/html parts

          ------=_NextPart_000_00AM_08K3791OO_07L.777L91H0
          Content-Type: text/plain

          Get a capable html e-mailer
          ------=_NextPart_000_00AM_08K3791OO_07L.777L91H0
          Content-Type: text/html

          Buy my pills and mortgage!
          ------=_NextPart_000_00AM_08K3791OO_07L.777L91H0--
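The idea behind MPART_ALT_DIFF can be sketched by comparing the word sets of the two alternative parts. The Python below is only an illustration of that idea: the tokenizer, the tag stripping, and the dissimilarity measure (one minus the overlap of the two word sets) are assumptions, and the slide does not describe how SpamAssassin actually measures "significantly different".

```python
import email
import re


def words(text: str) -> set:
    """Lower-cased word set; the tokenization is deliberately simple."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def strip_tags(html: str) -> str:
    """Crude tag removal, adequate only for a sketch."""
    return re.sub(r"<[^>]*>", " ", html)


def alt_diff_ratio(raw_message: str) -> float:
    """Return 0.0 for identical word sets and 1.0 for disjoint ones."""
    msg = email.message_from_string(raw_message)
    plain, html = set(), set()
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True) or b""
        text = payload.decode("utf-8", errors="replace")
        if ctype == "text/plain":
            plain |= words(text)
        else:
            html |= words(strip_tags(text))
    if not plain or not html:
        return 0.0  # nothing to compare
    return 1.0 - len(plain & html) / len(plain | html)


SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/plain\n"
    "\n"
    "Get a capable html e-mailer\n"
    "--b1\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>Buy my pills and mortgage!</p>\n"
    "--b1--\n"
)
ratio = alt_diff_ratio(SAMPLE)
```

For `SAMPLE`, which mirrors the slide's example with a shortened boundary, the two parts share no words, so `ratio` is 1.0. A rule built on this would fire above some threshold, which would need tuning against real mail.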

  20. New Rules: Spammers make it easy?
      - MSGID_SPAM_CAPS
        - Message-ID header is in /^[A-Z]+@/ format
        - Catches 11.3% of spam, no FPs

          Message-ID: <SXEXBAZDNVTGYMYBTRKUWOSQ@finklfan.com>
          Message-ID: <EKGSGWAIBTGZTSHZZBBZ@yahoo.com>
          Message-ID: <CMIVFJJHOPNXVBXUUP@hoardermail.com>
          Message-ID: <HAWZFYXQLDVBHGKSSMVDS@t-online.de>
          Message-ID: <YYIMPKBREVIVSFCLRKKBFI@webtv.com>
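The MSGID_SPAM_CAPS pattern is simple enough to try directly. This Python sketch uses the expression from the slide, with one assumption: an optional leading `<` is allowed so the pattern can be applied to the header value as it appears in the message. The function name is hypothetical.

```python
import re

# The slide's /^[A-Z]+@/ with an optional leading angle bracket (assumption).
MSGID_CAPS_RE = re.compile(r"^<?[A-Z]+@")


def msgid_spam_caps(message_id: str) -> bool:
    """True when the local part of the Message-ID is only capital letters."""
    return bool(MSGID_CAPS_RE.match(message_id.strip()))
```

The five sample IDs above match, while an ID such as `<61DF7FD0F6D44F26B764B5C2CE4C9ECFA6D1E8@anbok.com>` does not, because its local part contains digits.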

  21. New Rules: Spammers make it easy?
      - RCVD_DOUBLE_IP_SPAM
        - Received header is fake with two IPs listed
        - Catches 12.5% of spam, no FPs

          Received: from [119.227.62.1] by 64.142.3.173 with ESMTP id <110617-93232>; Fri, 27 Aug 2004 23:59:33 +0300
          Received: from 110.56.100.200 by 211.190.241.62; Sun, 10 Oct 2004 10:03:35 +0600
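A check in the spirit of RCVD_DOUBLE_IP_SPAM might look like the Python sketch below. The regular expression is an assumption derived only from the two sample headers (a bare or bracketed IP after `from`, immediately followed by `by` and another bare IP); the actual rule definition is not shown on the slide.

```python
import re

IP = r"\d{1,3}(?:\.\d{1,3}){3}"

# "from <ip> by <ip>", with optional brackets around the first address.
DOUBLE_IP_RE = re.compile(
    r"^from\s+\[?(" + IP + r")\]?\s+by\s+(" + IP + r")(?=[\s;]|$)"
)


def rcvd_double_ip(received_value: str) -> bool:
    """Check a Received header value (without the 'Received:' prefix)."""
    return bool(DOUBLE_IP_RE.match(received_value.strip()))
```

A normal MTA-generated header, which names a host after `by` and usually carries a parenthesized peer address after the HELO string, does not match this pattern.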

  22. New Rules: Spammers make it easy?
      - X_MESSAGE_INFO
        - X-Message-Info header exists...
        - Catches 18.0% of spam, no FPs

          X-Message-Info: 7wCUko664gJL/isOpbpHZpUXeysrI7Ea
          X-Message-Info: TBEqiuUDX224aiZQ59TCWxBY0AToUL99HSW7V9gnf576J
          X-Message-Info: 5%RNDLCCHAR37%RNDDIGIT15iI/zPMjruQBFrbQUxdR2AManr
          X-Message-Info: %RNDUCCHAR15c%RNDUCCHAR1548fspGLBoaq%RNDUCCHAR16opvCRRkfnGFQoxl3

  23. New Rules: Spammers make it easy?
      - RCVD_HELO_IP_MISMATCH
        - Received header indicates sender used IP for HELO, but it doesn’t match the sender’s IP
        - 25.7% of spam, 0.03% FPs, all misconfigured MTAs

          Received: from 65.214.43.12 (unknown [211.222.252.28]) by bblisa.bblisa.org (Postfix) with SMTP id DD6DE1768DB for <felicity@kluge.net>; Sat, 11 Sep 2004 00:38:56 -0400 (EDT)
          Received: from 64.142.3.173 (unknown [219.248.62.167]) by bugzilla.spamassassin.org (Postfix) with SMTP id CFD6C83899 for <felicity@kluge.net>; Wed, 13 Oct 2004 22:06:28 -0700 (PDT)
          Received: from 66.92.69.221 (unknown [211.217.181.250]) by eclectic.kluge.net (Postfix) with SMTP id BECCD444550 for <felicity@kluge.net>; Thu, 14 Oct 2004 00:53:37 -0400 (EDT)
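The comparison behind RCVD_HELO_IP_MISMATCH can be sketched as follows. This Python example assumes the Postfix-style layout seen in the samples, `from HELO (rdns [IP]) by ...`, and requires an exact address match; how the real rule parses other MTA formats, or how strict its comparison is, is not stated on the slide.

```python
import ipaddress
import re

# Postfix-style: from <helo> (<rdns> [<peer ip>]) by ...
RCVD_RE = re.compile(
    r"^from\s+(?P<helo>\S+)\s+\((?P<rdns>\S+)\s+\[(?P<ip>[0-9.]+)\]\)"
)


def helo_ip_mismatch(received_value: str) -> bool:
    """True when the HELO string is an IPv4 address that differs from
    the connecting address recorded by the receiving MTA."""
    m = RCVD_RE.match(received_value.strip())
    if m is None:
        return False
    try:
        helo_ip = ipaddress.IPv4Address(m.group("helo").strip("[]"))
        peer_ip = ipaddress.IPv4Address(m.group("ip"))
    except ValueError:
        return False  # HELO was a hostname, or an address was malformed
    return helo_ip != peer_ip
```

Headers where the HELO string is a hostname are ignored, since the rule as described only concerns senders that used an IP address for HELO.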

  24. New Rules: Spammers make it easy?
      - MIME_BOUND_DD_DIGITS
        - MIME boundary is simply /^--[0-9]+/
        - Catches 36.5% of spam, no FPs

          Content-Type: text/html; boundary="--5050984427071928258"
          Content-Type: text/plain; boundary="--5895368826571874203"
          Content-Type: multipart/alternative; boundary="--2396152152574698241"
          Content-Type: multipart/mixed; boundary="--44188425536568249"
          Content-Type: multipart/related; boundary="--610294112918606"
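A boundary check like MIME_BOUND_DD_DIGITS can be sketched with Python's standard `email` package. Two assumptions are made here: the boundary is taken from the top-level Content-Type header, and the pattern is anchored at the end as well, which is stricter than the expression quoted on the slide.

```python
import email
import re

# Two dashes followed by digits only (end anchor is an added assumption).
DD_DIGITS_RE = re.compile(r"^--[0-9]+$")


def mime_bound_dd_digits(raw_headers: str) -> bool:
    """True when the Content-Type boundary is '--' plus digits only."""
    msg = email.message_from_string(raw_headers)
    boundary = msg.get_boundary()  # None when no boundary parameter exists
    return boundary is not None and bool(DD_DIGITS_RE.match(boundary))
```

Boundaries generated by common mail clients, such as `----=_NextPart_000_...`, contain non-digit characters and would not match.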

  25. Rules
      - AutoWhiteList (AWL) now on by default
        - Partially due to change from command-line option to configuration parameter
        - Mainly because the idea and code are mature and work fairly well
      - AWL tracks From address, sending IP network, and average message scores over time, and moves future mail scores towards the average

          felicity@kluge.net|ip=66.92 => # of messages received
          felicity@kluge.net|ip=66.92|totscore => total score of messages received
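The AWL bookkeeping described above (a message count and a total score per address and network) can be sketched in a few lines of Python. The class name, the in-memory dictionaries, and the `factor` of 0.5 that controls how far a score is pulled towards the historical mean are all assumptions for illustration; the slide gives only the key layout.

```python
class AutoWhitelist:
    """Minimal in-memory sketch of the AWL averaging idea."""

    def __init__(self, factor: float = 0.5):
        self.factor = factor   # assumed: 0 = no adjustment, 1 = use the mean
        self.count = {}        # key -> number of messages received
        self.totscore = {}     # key -> total score of messages received

    @staticmethod
    def key(addr: str, ip: str) -> str:
        """Key on From address plus the first two octets of the sender IP."""
        network = ".".join(ip.split(".")[:2])
        return f"{addr.lower()}|ip={network}"

    def adjust(self, addr: str, ip: str, score: float) -> float:
        """Return the score moved towards the sender's running average,
        then record the unadjusted score in the history."""
        k = self.key(addr, ip)
        seen = self.count.get(k, 0)
        adjusted = score
        if seen:
            mean = self.totscore[k] / seen
            adjusted = score + (mean - score) * self.factor
        self.count[k] = seen + 1
        self.totscore[k] = self.totscore.get(k, 0.0) + score
        return adjusted


awl = AutoWhitelist()
awl.adjust("felicity@kluge.net", "66.92.69.221", 1.0)
awl.adjust("felicity@kluge.net", "66.92.69.221", 1.0)
third = awl.adjust("felicity@kluge.net", "66.92.69.221", 9.0)
```

After two messages scoring 1.0, a third message scoring 9.0 from the same address and network is pulled halfway towards the mean, so `third` is 5.0 under the assumed factor.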

  26. Bayes Changes
      - Storage backend now has “plugin” capability
        - Berkeley DB (BDB) is default, added SQL in v3
      - Added capability to backup & restore
        - Good for backup and recovery, modifying stored values, converting between storage backends, etc.
      - Added flock locking option for all SA DB access
      - Tokens are now stored as hash values, not raw
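Storing a fixed-size digest instead of the raw token can be sketched as follows. The choice of SHA-1 and of keeping five bytes of the digest is an assumption made for this Python illustration; the slide says only that tokens are stored as hash values.

```python
import hashlib


def token_key(token: str, nbytes: int = 5) -> bytes:
    """Map a token to a short, fixed-size database key.

    Assumed scheme: the trailing `nbytes` bytes of the SHA-1 digest.
    """
    return hashlib.sha1(token.encode("utf-8")).digest()[-nbytes:]
```

Every key then has the same length regardless of token length, at the cost of no longer being able to read the original token back out of the database.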

  27. Bayes in SQL
      - Supports MySQL and PostgreSQL natively
      - Lots of benefits, generally faster overall
        - Scanning: 3-30% faster, depending on # of tokens
        - Learning: 2-3x slower, requires multiple SQL commands per update
        - Expiry: 6-7x faster, BDB does lots of I/O, etc.
      - For more information, see Michael Parker’s presentation following this one!

  28. Out with the GA!
      - Replaced Genetic Algorithm (GA) for score generation with Perceptron Learner
      - No one wanted to deal with the GA code
        - Did anyone understand the code? Not really.
        - Most time spent kluging around glue scripts
      - Perceptron is much, much faster
        - GA took 6-24 hours/scoreset for 2.5 and 2.6
        - Perceptron took 8 minutes/scoreset for 3.0.0

  29. General Perceptron
      [Diagram: input layer of rule hits (ACCEPT_CREDIT_CARDS, BAYES_99, HTML_80_90, IMPOTENCE, URIBL_WS_SURBL, YOU_WON), each with its own weight (w1, w36, w289, w399, w724, w770), feeding a sigma (Σ) node followed by a sigmoid gain function]
      - Per message, input is bit array of rules which were hit
      - Multiply input bits by respective rule weights, and sum
      - Squash result into 0-1 range (ham vs spam)
      - Modify weights so result approaches desired value
      - At end, weights become scores via linear transformation
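The training loop listed above can be sketched as a single sigmoid unit trained with gradient descent. This Python example is a generic illustration, not the SpamAssassin perceptron code: the learning rate, epoch count, bias term, squared-error gradient, and the `scale` used in the final linear transformation are all assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Squash the weighted sum into the 0-1 range."""
    return 1.0 / (1.0 + math.exp(-x))


def activation(weights: Sequence[float], bias: float,
               hits: Sequence[int]) -> float:
    """Sum the weights of the rules that hit, then apply the sigmoid."""
    return sigmoid(bias + sum(w for w, h in zip(weights, hits) if h))


def train(messages: Sequence[Tuple[Sequence[int], int]], n_rules: int,
          epochs: int = 200, rate: float = 0.5,
          seed: int = 0) -> Tuple[List[float], float]:
    """messages: (bit array of rule hits, 1 for spam or 0 for ham)."""
    rng = random.Random(seed)
    weights = [0.0] * n_rules
    bias = 0.0
    data = list(messages)
    for _ in range(epochs):
        rng.shuffle(data)
        for hits, is_spam in data:
            out = activation(weights, bias, hits)
            # Squared-error gradient through the sigmoid (delta rule).
            delta = (is_spam - out) * out * (1.0 - out)
            bias += rate * delta
            for i, hit in enumerate(hits):
                if hit:
                    weights[i] += rate * delta
    return weights, bias


def to_scores(weights: Sequence[float], scale: float = 1.0) -> List[float]:
    """Linear transformation from weights to rule scores (scale assumed)."""
    return [scale * w for w in weights]


# Toy corpus: rules 0 and 1 hit only spam, rule 2 hits only ham.
CORPUS = [
    ([1, 0, 0], 1), ([0, 1, 0], 1), ([1, 1, 0], 1),
    ([0, 0, 1], 0), ([0, 0, 0], 0),
]
weights, bias = train(CORPUS, n_rules=3)
```

On the toy corpus, the two rules that only hit spam end up with positive weights and the rule that only hits ham ends up with a negative one, mirroring how learned weights would turn into positive and negative rule scores.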
