High-Speed Detection of Unsolicited Bulk Email
Sheng-Ya Lin, Cheng-Chung Tan, Jyh-Charn (Steve) Liu, Computer Science Department, Texas A&M University
Michael Oehler, National Security Agency
Dec. 4, 2007
Outline
• Motivation
• Progressive Email Classifier (PEC) system architecture
• Experimental results
Email Spamming: No Longer Just a Nuisance
• Some facts:
– Botnet farms can hit any target (> 10^6)
– Bandwidth waste (3:1 or higher)
– Network resource exploitation and information stealing (malware planting)
– Highly effective hit-and-run strategy (BGP, DNS, domain name, credit card fraud)
• Existing anti-spam software
– Large number of software copies and signatures to maintain
– Comprehensive detection rules, but slow to respond
• Signature management is a major bottleneck
– Acquisition and deployment of signatures to numerous machines
– A small variation in a known signature can easily defeat a signature-based filter
– Spammers can test their designs against anti-spam software before starting the (hit-and-run) campaign
Spamming Behavior at a Glance
• Spammers do not have full freedom in launching spam campaigns.
– They must follow the transport protocols to deliver messages
– Messages must be perceivable and appealing to human users
– Expensive to compose and personalize spamming messages:
  • Interactive (click my URL links) or passive
  • Low yield rate combined with greed leads to high spamming volumes
  • Cheap to launch spamming: millions of zombie machines each send a few copies
– Any "hit back, interactive" method could cause severe harm to the innocent
• Summary
– Very difficult for spammers to achieve financial goals without leaving noticeable signatures, i.e., feature instances
– The challenge is keeping up with their speed, volume, and diversity
Our Approach
• Lossy detection:
– Focus mainly on the major offenders
– Avoid false positives
• Timely acquisition of instances of selected features:
– Position the detector at the Network Access Points (NAP)
  • Highest concentration of samples for an enterprise network
  • Detect them before the flood enters the network
  • Work at the algorithm and data structure level, rather than on any particular hardware platform
– Broad spectrum of computing resources/constraints
• Regular emails are expected to have random distributions of strings that happen to fall into the spamming feature space
– Moderated delivery of bulk, legitimate email
• A spamming stream has invariant and variant parts
– An invariant that also appears in regular emails cannot be used for filtering
– For the first-cut effort: URLs (over 95% of spam messages have them)
Competitive Aging-Scoring Scheme (CASS)
• A spamming invariant (string) is called a feature instance (FI). The essence of our technique: "Extract the FIs of emails and keep track of their occurrences. If an FI's count exceeds a threshold, it marks an UNBE stream."
• In a naïve approach, it takes O(1) to update the score of an FI, but O(N) to age all other FIs
– A major computing cost
• CASS: a constant-time algorithm
– The time-to-live of an FI is reset each time its score is increased by one (when a new copy arrives)
– The time-to-live of all other FIs is reduced by one
– New complexity: O(1) for both scoring and aging
– When the score exceeds a threshold: move the FI to the blacklist
– When no further copies arrive within a time-out period: discard the FI
  • The time-out may not be a fixed physical time
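The constant-time aging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the age table is a ring of M slots indexed by a virtual clock that advances once per scored feature instance, so "aging everyone else by one" is just moving the clock. The class name `Cass`, the method `observe`, and the returned status strings are hypothetical.

```python
class Cass:
    """Sketch of constant-time scoring and aging (assumed ring-buffer design)."""

    def __init__(self, score_threshold, age_slots):
        self.S = score_threshold          # UNBE threshold on the score
        self.M = age_slots                # time-out, in virtual clock ticks
        self.ring = [None] * age_slots    # ring[t % M] = feature last seen at tick t
        self.clock = 0                    # virtual clock, one tick per scored feature
        self.entries = {}                 # feature -> [score, ring slot]
        self.blacklist = set()

    def observe(self, feature):
        # Blacklisted features never reach the scoreboard or advance the clock.
        if feature in self.blacklist:
            return "blocked"
        slot = self.clock % self.M
        self.clock += 1
        # Whatever still sits in this slot was last seen M ticks ago: expire it.
        stale = self.ring[slot]
        if stale is not None:
            del self.entries[stale]
            self.ring[slot] = None
        entry = self.entries.get(feature)
        if entry is None:
            entry = self.entries[feature] = [0, slot]
        else:
            self.ring[entry[1]] = None    # leave the old slot: time-to-live is reset
            entry[1] = slot
        entry[0] += 1
        if entry[0] > self.S:
            del self.entries[feature]
            self.blacklist.add(feature)
            return "blacklisted"
        self.ring[slot] = feature
        return "tracked"
```

Each call touches one ring slot and at most two dictionary entries, so scoring and aging both stay O(1) on average regardless of how many features are being tracked.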
PEC Architecture
[Figure: PEC architecture. Components shown: email flow from Sendmail; feature instance extraction; hash table of known strings (32-bit hash vs. string, Berkeley DB); aging and scoring of unknown strings; new string identified; birth and death of strings.]
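As a rough illustration of the front end of this pipeline, the sketch below extracts URL feature instances from a message body and separates those already in the known-string table from those that would go on to aging and scoring. The regular expression, the function names, and the use of a plain Python set in place of the Berkeley DB table are all assumptions made for the example.

```python
import re

# Hypothetical pattern: an http(s) URL runs until whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""", re.IGNORECASE)


def extract_feature_instances(body):
    """Return the distinct URLs in a message body, in order of first appearance."""
    found = []
    for match in URL_RE.findall(body):
        url = match.rstrip(".,;)")        # drop trailing sentence punctuation
        if url not in found:
            found.append(url)
    return found


def split_known_unknown(body, known_strings):
    """Partition extracted URLs into (known, unknown) against the known-string set."""
    urls = extract_feature_instances(body)
    known = [u for u in urls if u in known_strings]
    unknown = [u for u in urls if u not in known_strings]
    return known, unknown
```

A message whose URLs are all unknown would be passed to the scoreboard; one containing a known string can be handled immediately.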
Data Structure of Scoreboard
[Figure: scoreboard update path. Labels shown: URL1, URL2 (entries for feature instances); hash function producing H1(URL) (20 bits) and H2(URL); score table, with the data structure of a cell (hash_low, score, age_table location); index of SMT (76 bits); ++Score and the test "Exceeds UNBE threshold (S)?"; age table with slots 0 to n-1 holding HURL1, HURL2, with the data structure of an HURL entry (score_table location, 32 bits); miss count and the test "Exceeds age threshold (M)"; removal of the entry H1(URL).H2(URL).]
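The two-part hash and the cross-linked cells in this figure can be sketched as below. The slides do not name the hash function, so CRC-32 from the standard library stands in for it; treating H1 as the high 20 bits and H2 as the low 12 bits is also an assumption, as are the record and field names beyond those printed on the slide.

```python
import zlib
from dataclasses import dataclass

INDEX_BITS = 20   # H1(URL): addresses the score table
LOW_BITS = 12     # H2(URL): stored in the cell as hash_low


def hash32(url):
    """Stand-in 32-bit hash (CRC-32); the real hash function is not specified."""
    return zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF


def split_hash(h):
    """Split a 32-bit hash into (H1, H2) = (high 20 bits, low 12 bits)."""
    return h >> LOW_BITS, h & ((1 << LOW_BITS) - 1)


@dataclass
class ScoreCell:
    hash_low: int   # H2, disambiguates hashes that share the same H1
    score: int      # number of copies seen so far
    age_slot: int   # location of this feature in the age table


@dataclass
class AgeCell:
    score_index: int  # location of the owning cell in the score table
```

Because each record stores the location of its partner, a hit can move a feature to the newest age slot, and an expiry can delete the matching score cell, without searching either table.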
A Snapshot
[Figure: snapshot of the scoreboard with S = 10, M = 20, queue size = 20. Example hashed URLs: (414738 (20-bit) + 3724 (12-bit)) and (124489 (20-bit) + 176 (12-bit)). Panels show the current feature being processed, an entry moved to the blacklist history, the active features arranged in a MOD queue by age (placement mod N, newest to oldest, with the current time location marked), the purge of entry [862 1822], and the next feature instance.]
Testbed Environment
Three modules are included:
1. Email generation
2. PEC (blacklist and scoreboard)
3. Control and visualization console
Experimental Configuration
• Email generator: Intel P4 3.0 GHz, Windows XP
• Email server: Xeon 3.0 GHz, two single-core CPUs, Linux, Sendmail 8.14.1
• Within a batch, the sender sends 2000 emails (uniformly mixed UNBEs and regulars).
– S: 50
– M: 2048
– Average mail size: 1.5 KB
– One mail per 0.088 seconds on average
Workflow of Email Generation
[Figure: email generation workflow. Labels shown: Windows control console with simulation parameters; density generation (bulk/regular, uniform distribution) producing a sequence such as U R U U ... R; feature dictionary (bulk and regular URL structures, bulk and regular image sources); random text; spamming keyword selection; subject generation; "From" generation; message composer (MIME); emails sent over the SMTP protocol to the Linux email server (Sendmail).]
UNBE Generation
• Both UNBE and regular copies are injected with URL links or remote image sources
– The density and the locations of variants and invariants in the body of each copy can be adjusted when generating MIME messages
– UNBE features are extracted from the 2005 TREC Public Spam Corpus, http://plg.uwaterloo.ca/~gvcormac/treccorpus/about.html
– Variants: random text taken from web sites
– Keywords: user defined (not tested in this report)
• The message composer calls an SMTP library to send the generated emails to Sendmail
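A message composer of this kind might look like the sketch below, which uses Python's standard `email` and `smtplib` modules rather than whatever library the testbed actually used. The function names, the 20-word filler length, and the example addresses are hypothetical; the invariant is a URL placed in both the plain-text and HTML parts, and the variant is random filler text.

```python
import random
import smtplib
from email.message import EmailMessage


def compose_message(sender, recipient, subject, invariant_url, variant_words, rng):
    """Build a MIME message with random variant text and one invariant URL."""
    filler = " ".join(rng.choice(variant_words) for _ in range(20))
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(f"{filler}\n{invariant_url}\n")
    msg.add_alternative(
        f'<html><body><p>{filler}</p><a href="{invariant_url}">link</a></body></html>',
        subtype="html",
    )
    return msg


def send(msg, host="localhost", port=25):
    """Hand the message to an SMTP server (assumed to be the test Sendmail host)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Seeding `rng` (for example `random.Random(7)`) makes a generated batch reproducible, which helps when comparing detection latency across runs.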
Detection Latency of a Single UNBE Source
• The threshold and the age table length are held fixed while the UNBE density varies.
• Six UNBE densities are tested (50, 100, 150, 200, 250, 300 UNBE messages/bin).
[Figure: detection latency (unit: virtual clock, axis 0 to 2500) versus number of messages in a bin (50 to 300); curves: experimental value and expected value.]
Effects of Multiple UNBE Sources
• Given an UNBE source A, six tests were made in which one additional UNBE source is added to the experiment at a time.
– The six lines are marked as test[1-6]
• The density (instances/batch)
– A: fixed at 100
– Other UNBE sources: increased from 50 to 300
• Result:
– The detection latency of an UNBE decreases with the number of UNBE sources
  • When a source is captured, it is blocked from the scoreboard. The density measured in VC for the others increases.
[Figure: detection latency (axis 0 to 2500) versus number of messages in a bin for each non-A UNBE source (50 to 300); curves: test 1 through test 6.]
Throughput of URL Parser
[Figure: throughput (1000 bodies/sec, axis 0 to 30) versus size of mail body (1.5 KB to 7.5 KB).]
The average email size ranges from 1.5 KB to 7.5 KB, and each email has 2 URLs.
Throughput of Scoreboard and Blacklist
• Scoreboard: 1.2M transactions
• Blacklist: 0.9M (avg. 30 B) URLs, not including database access
[Figure: throughput (K URLs/sec, axis 0 to 1000) versus URL length (30 to 150 bytes).]
Pointer Table: reduce memory need (at a small cost in delay)
• In the detection window, only a limited number of hashed values need to be tracked
• A full table for a 32-bit hash takes too much space
• The higher-order bits are used as the index, and the remaining bits are kept in a linked list (one per entry)
• If the pointer table uses 20 bits for indexing, it has 1M entries; with an age table length of 20K~70K, the maximum depth of a linked list pointed to by the pointer table is 2.
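The index-plus-chain layout can be sketched as follows. This is an illustrative structure only: Python lists stand in for the linked lists, the class and method names are hypothetical, and `max_depth` is included so the chain depth for a given workload can be measured rather than assumed.

```python
class PointerTable:
    """High-order bits index a bucket; low-order bits are kept in a short chain."""

    def __init__(self, index_bits=20, hash_bits=32):
        self.low_bits = hash_bits - index_bits
        self.low_mask = (1 << self.low_bits) - 1
        self.buckets = [None] * (1 << index_bits)   # None marks an empty bucket

    def _split(self, h):
        return h >> self.low_bits, h & self.low_mask

    def get(self, h):
        index, low = self._split(h)
        for entry in self.buckets[index] or ():
            if entry[0] == low:
                return entry[1]
        return None

    def put(self, h, value):
        index, low = self._split(h)
        chain = self.buckets[index]
        if chain is None:
            chain = self.buckets[index] = []
        for entry in chain:
            if entry[0] == low:
                entry[1] = value
                return
        chain.append([low, value])

    def remove(self, h):
        index, low = self._split(h)
        chain = self.buckets[index]
        if chain is None:
            return False
        for i, entry in enumerate(chain):
            if entry[0] == low:
                del chain[i]
                if not chain:
                    self.buckets[index] = None
                return True
        return False

    def max_depth(self):
        return max((len(c) for c in self.buckets if c), default=0)
```

With the default 20 index bits the table has 2^20 buckets instead of 2^32 direct slots, and a lookup costs one index operation plus a scan of a chain that stays short as long as the number of tracked hashes is small relative to the table.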
Threshold Setting
• Q1: "What is the minimum value of M to detect an UNBE attack (of known density) with a success probability higher than α?"
– (A smaller M retires entries sooner)
• Q2: "For a given M, what is the maximum value of S to guarantee that the probability of the detection latency < ς is greater than α?"
– (A larger S makes false positives less likely, but lets more copies enter the network before detection)
Compute M and S λ = μ μ + μ /( ) f b f Γ λ = − α Get M (2, M ) 1 ς = λ E H [ ]/ + S 1 − = α + α + + α + α 2 ( S 1) S E H ( ) 1/ 1/ ... 1/ 2/ + S 1 − − + = − α + α α − S S 1 Get S (1 2 )/( 1) 20
Detection latency when S = 24, M = 55
The prediction model is conservative
Sensitivity of TR vs. S
Sensitivity of TR vs. M (S = 24)
Summary
• PEC demonstrates the feasibility of high-speed UNBE filtering at the network vantage points
• The method is not meant to replace existing solutions, but to defeat major offenders
– 80-20 rule
• Expansion of the techniques to handle multiple features (bad words, dirty subnets, blacklists, etc.)
– Integration/interface with existing tools
Thank You!