BotGraph: Large Scale Spamming Botnet Detection
Yao Zhao, Yinglian Xie*, Fang Yu*, Qifa Ke*, Yuan Yu*, Yan Chen and Eliot Gillum‡
EECS Department, Northwestern University
* Microsoft Research Silicon Valley
‡ Microsoft Corporation
Web-Account Abuse Attack
[Figure: attack diagram with the labels: zombie (compromised host), spammer's server, user/password, CAPTCHA solver.]
Problems and Challenges
• Detect web-account abuse with Hotmail logs
  – Input: user activity traces (signup, login, email-sending records)
  – Goal: stop aggressive account signup, limit outgoing spam
• Algorithmic challenge:
  – Attack is stealthy: individual account detection is difficult
  – Attack is large scale: finding correlated activities
  – Low false positive and false negative rates
• Engineering challenge:
  – Large user population: >500 million accounts
  – Large data volume: 300GB-400GB data per month
The BotGraph System
• A graph-based approach to attack detection
  – A large user-user graph to capture bot-account correlations
  – Identify 26M bot-accounts with a low false positive rate in two months
• Efficient implementation using Dryad/DryadLINQ
  – Graph construction/analysis is not easily parallelizable
  – Hundreds of millions of nodes, hundreds of billions of edges
  – Process 200GB-300GB data in 1.5 hours with a 240-machine cluster
The first to provide a systematic solution to the new attack
System Architecture
1. History-based algorithm to detect aggressive signups
   Signup data (ID, IP, time) → EWMA-based change detection → aggressive signups → verification & prune → signup botnets
2. Graph-based algorithm to find correlations
   Login data (ID, IP, time) → graph generation → login graph → random-graph-based clustering → suspicious clusters → verification & prune → spamming botnets
3. Parallel algorithm on DryadLINQ clusters
Additional input shown in the diagram: sendmail data (ID, time, # of recipients)
Detect Aggressive Signups
[Figure: number of signup accounts per day, 1-Jul to 9-Jul, showing the signup count against the EWMA prediction; annotations mark a large prediction error and the return to normal.]
• Simple and efficient
• Detect 20 million malicious accounts in 2 months
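The EWMA-based change detection named on this slide can be sketched as follows. This is only an illustration, not the authors' implementation: the function name `ewma_anomalies`, the smoothing factor `alpha`, and the absolute error threshold `max_error` are all assumptions introduced here, since the slides do not give the actual parameters or the exact anomaly test.

```python
def ewma_anomalies(counts, alpha=0.5, max_error=10.0):
    """Return the indices of days whose signup count exceeds the EWMA
    prediction by more than max_error.

    counts    -- daily signup counts for one IP address (hypothetical input)
    alpha     -- EWMA smoothing factor (assumed value)
    max_error -- prediction-error threshold (assumed value)
    """
    if not counts:
        return []
    flagged = []
    prediction = float(counts[0])  # seed the prediction with the first day
    for day in range(1, len(counts)):
        observed = counts[day]
        error = observed - prediction  # positive when signups spike
        if error > max_error:
            flagged.append(day)
        # Standard EWMA update: blend the new observation into the prediction.
        prediction = alpha * observed + (1 - alpha) * prediction
    return flagged


# Hypothetical per-IP signup counts with a one-day burst.
daily_signups = [2, 3, 2, 2, 20, 3, 2, 2, 2]
print(ewma_anomalies(daily_signups))
```

Because each IP address is scored independently from its own history, this kind of check fits the per-IP data partitioning mentioned later in the deck.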
Detect Stealthy Accounts by Graphs
• Observation: bot-accounts work collaboratively
  A user-user graph to model behavior similarities
• Normal users
  – Share IP addresses in one AS with DHCP assignment
• Bot-users
  – Likely to share different IPs across ASes
User-user Graph
• Node: Hotmail account
• Edge weight: # of ASes of the shared IP addresses
  – Consider edges with weight > 1
• Key observations
  – Bot-users form a giant connected component while normal users do not
  – Interpreted by random graph theory
[Figure: example graph over User1 to User6 with edges labeled by weight, from 1 AS to 5 ASes.]
Random Graph Theory
• Random graph G(n, p)
  – n nodes; each pair of nodes has an edge with probability p; average degree d = (n - 1) · p
• Theorem
  – If d < 1, then with high probability the largest component in the graph has size less than O(log n)
    No large connected subgraph
  – If d > 1, then with high probability the graph contains a giant component of size on the order of O(n)
    Most nodes are in one connected subgraph
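The threshold behavior stated in the theorem can be observed with a small simulation. The sketch below is an illustration only: it samples G(n, p) directly and measures the largest component with a union-find structure. The function name `largest_component_size`, the graph size, and the two average degrees (0.5 and 2.0) are choices made here, not values from the slides.

```python
import random


def largest_component_size(n, p, seed=0):
    """Sample G(n, p) and return the size of its largest connected component."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Find the root of x, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # edge (i, j) exists with probability p
                ri, rj = find(i), find(j)
                if ri != rj:
                    if size[ri] < size[rj]:
                        ri, rj = rj, ri
                    parent[rj] = ri  # union by size
                    size[ri] += size[rj]
    return max(size[find(v)] for v in range(n))


n = 1000
below = largest_component_size(n, 0.5 / (n - 1))  # average degree d = 0.5
above = largest_component_size(n, 2.0 / (n - 1))  # average degree d = 2.0
print("d = 0.5:", below, "d = 2.0:", above)
```

With d below 1 the largest component stays small relative to n, while with d above 1 a single component absorbs a large fraction of the nodes, which is the property the deck relies on to separate bot-users from normal users.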
Graph-based Bot-user Detection
• Step 1: detect giant connected components in the user-user graph
• Step 2: hierarchical algorithm to identify the correct groupings
  – Different bot-user groups may be mixed
  – Difficult to choose a fixed edge threshold
  – Easier validation with correct group statistics
• Step 3: prune normal-user groups
  – Due to national proxies, cell phone users, Facebook applications, etc.
Hierarchical Bot-Group Extraction
[Figure: hierarchy of components labeled G, A, B, C, D and E, extracted at thresholds T=2, T=3 and T=4, with the 1st, 2nd and 3rd groups marked.]
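One way to read the hierarchical extraction is as repeated connected-component detection with an increasing edge-weight threshold T. The sketch below follows that reading under stated assumptions: the names `connected_components` and `extract_groups`, the `min_size` cutoff, and the stopping threshold `max_threshold` are hypothetical, and the step that decides which level of the hierarchy to report as a bot group is not covered.

```python
def connected_components(edges, threshold):
    """Return components (sets of nodes) using only edges with weight >= threshold."""
    adjacency = {}
    for (u, v), weight in edges.items():
        if weight >= threshold:
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def extract_groups(edges, threshold=2, max_threshold=4, min_size=3):
    """Build a tree of components by raising the threshold inside each component."""
    tree = []
    for component in connected_components(edges, threshold):
        if len(component) < min_size:  # assumed cutoff for tiny components
            continue
        children = []
        if threshold < max_threshold:
            inner = {pair: w for pair, w in edges.items()
                     if pair[0] in component and pair[1] in component}
            children = extract_groups(inner, threshold + 1, max_threshold, min_size)
        tree.append({"threshold": threshold, "members": component,
                     "children": children})
    return tree


# Hypothetical weighted user-user graph: two tight groups joined by a weak edge.
example_edges = {
    ("a", "b"): 4, ("b", "c"): 4, ("a", "c"): 4,
    ("d", "e"): 3, ("e", "f"): 3, ("d", "f"): 3,
    ("c", "d"): 2,
}
for group in extract_groups(example_edges):
    print(group["threshold"], sorted(group["members"]), len(group["children"]))
```

In the example, the weak edge between the two triangles holds everything together at T=2 and disappears at T=3, which mirrors the point on the previous slide that different bot-user groups may be mixed under a single fixed threshold.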
Parallel Implementation on DryadLINQ
• EWMA-based signup abuse detection
  – Partition data by IP
  – Can achieve real-time detection
• User-user graph construction
  – Two algorithms and optimizations
  – Process 200GB-300GB data in 1.5 hours with 240 machines
• Connected component extraction
  – Divide and conquer
  – Process a graph of 8.6 billion edges in 7 minutes
Graph Construction 1: Simple Data Parallelism
• Potential edges
  – Select ID group by IP (Map)
  – Generate potential edges (ID_i, ID_j, IP_k) (Reduce)
• Edge weights
  – Select IP group by ID pair (Map)
  – Calculate edge weight (Reduce)
• Problem
  – Weight-1 edges are two orders of magnitude more numerous than the others
  – Their computation/communication is unnecessary
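The two group-by stages above can be sketched on a single machine as follows. This is a plain Python illustration under stated assumptions, not DryadLINQ code: the function name `build_user_graph`, the `(user_id, ip)` record layout, and the `ip_to_as` lookup table are hypothetical, and the edge weight follows the earlier slide's definition (number of ASes of the shared IP addresses).

```python
from collections import defaultdict
from itertools import combinations


def build_user_graph(logins, ip_to_as):
    """Return {(id_i, id_j): weight} where weight is the number of distinct
    ASes among the IP addresses that both users logged in from.

    logins   -- iterable of (user_id, ip) login records (hypothetical layout)
    ip_to_as -- dict mapping each IP to its AS (hypothetical lookup table)
    """
    # Stage 1: group user IDs by IP, then emit potential edges (id_i, id_j, ip_k).
    users_by_ip = defaultdict(set)
    for user_id, ip in logins:
        users_by_ip[ip].add(user_id)

    # Stage 2: group the shared IPs by ID pair and count the distinct ASes.
    ases_by_pair = defaultdict(set)
    for ip, users in users_by_ip.items():
        for id_i, id_j in combinations(sorted(users), 2):
            ases_by_pair[(id_i, id_j)].add(ip_to_as[ip])

    return {pair: len(ases) for pair, ases in ases_by_pair.items()}


# Hypothetical login records using documentation-range IP addresses.
example_logins = [
    ("u1", "192.0.2.1"), ("u2", "192.0.2.1"), ("u3", "192.0.2.1"),
    ("u1", "198.51.100.7"), ("u2", "198.51.100.7"),
]
example_ip_to_as = {"192.0.2.1": "AS-A", "198.51.100.7": "AS-B"}

graph = build_user_graph(example_logins, example_ip_to_as)
strong_edges = {pair: w for pair, w in graph.items() if w > 1}
print(graph)
print(strong_edges)
```

Even in this toy input most pairs end up with weight 1 and are discarded by the final filter, which illustrates the problem noted on the slide: the work spent generating and shipping those edges is wasted.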
Graph Construction 2: Selective Filtering
Comparison of Two Algorithms
• Method 1
  – Simple and scalable
• Method 2
  – Optimized to filter out weight-1 edges
  – Utilizes Join functionality, data compression and broadcast optimization
Detection Results
• Data description
  – Two datasets
    • Jun 2007 and Jan 2008
  – Three types of data
    • Signup log (IP, ID, time)
    • Login log (IP, ID, time)
    • Sendmail log (ID, time, # of recipients)
  – 500M users and 200~300GB data per month
Detection of Signup Abuse
Detection by User-user Graph
Validations
• Manual check
  – Sampled groups verified by the Hotmail team
  – Almost no false positives
• Comparison with known spamming users
  – Detect 86% of complained accounts
  – Up to 54% of detected accounts are our new findings
• Email sending sizes per group
  – Most groups have a sharp peak
  – The remaining contain several peaks
• False positive estimation
  – Naming pattern (0.44%)
  – Signup time (0.13%)
Possible to Evade BotGraph?
• Evade signup detection: be stealthy
• Evade graph-based detection
  – Fixed IP/AS binding
    • Low utilization rate
    • Bot-accounts bound to one host are easy to group
  – Be stealthy (send as few emails as a normal user)
Either approach severely limits attackers' spam throughput
Conclusions
• A graph-based approach to attack detection
  – Identify 26M bot-accounts with a low false positive rate in two months
• Efficient implementation using Dryad/DryadLINQ
  – Process 200GB-300GB data in 1.5 hours with a 240-machine cluster
Large-scale data mining for network security is effective and practical
Q & A? Thanks!