Link Spam Detection Based on Mass Estimation
Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen
Roadmap
■ Search engine spamming
■ Link spamming
■ PageRank contribution
■ Spam mass
  • Definition
  • Estimation
  • Algorithm
■ Experiments
Very Large Data Bases ● Seoul, September 13, 2006
Spamming: Example #1
[Screenshot: search results for the query “austria ski”, including the spam hosts asiandiveholidays.com, asianmp3.com, mp3thailand.com, thailandhealthcaretimes.com, thailandpropertytimes.com]
Spamming: Example
[Screenshot]
Spamming: Introduction
Spamming = misleading search engines to obtain higher-than-deserved ranking
Link spamming = building link structures that boost PageRank scores
Spamming: Our Target
Detect pages that achieve high PageRank through link spamming
[Diagram: a spam farm of boosting nodes s1, …, sk pointing to target node s0, plus good nodes g1, …, gm linking in, with k ≫ m]
PageRank Contribution
[Diagram, built up step by step: walks from good and spam nodes contributing PageRank to target node p0]
In the example:
p0⁺ = 2c²(1 − c)/n + 2c(1 − c)/n
p0⁻ = 6c²(1 − c)/n + c(1 − c)/n
Spam Mass: Definition
■ Absolute mass
  • Amount (part) of PageRank coming from spam
  • In the example: a.m. = p0⁻
■ Relative mass
  • Fraction of PageRank coming from spam
  • In the example: r.m. = p0⁻ / p0 = 5/7
  • More useful in practice
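A one-line numeric illustration of the two quantities (the values 5 and 7 mirror the slide's toy example; they are not measurements):

```python
# Toy example: a page's PageRank p0 totals 7 units, 5 of which come from spam.
p0 = 7.0
p0_spam = 5.0                   # absolute mass: PageRank contributed by spam nodes
relative_mass = p0_spam / p0    # relative mass: fraction of PageRank from spam
assert abs(relative_mass - 5 / 7) < 1e-12
```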
Spam Mass: Estimation
Ideally…
[Diagram: p0’s PageRank split exactly into good and spam contributions]
Spam Mass: Estimation
In practice…
■ Approximate the set of good nodes by a subset called the good core
■ Estimate the spam contribution as p̃0⁻ = p0 − p0⁺
Spam Mass: Algorithm
1. Create good core
2. Compute PageRank scores pᵢ and pᵢ⁺
3. Compute estimated relative mass m̃ᵢ = (pᵢ − pᵢ⁺) / pᵢ
4. For all pages i with large PageRank: mark page i as spam if m̃ᵢ > threshold
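A minimal sketch of the four steps above, assuming the graph is given as adjacency lists. The node names, threshold, and damping factor c = 0.85 are illustrative assumptions, not the paper's experimental settings, and the sketch labels every node rather than only high-PageRank ones:

```python
# Sketch of the spam mass algorithm (assumed representation, not the paper's code).

def pagerank(nodes, out_links, v, c=0.85, iters=100):
    """Linear-model PageRank: p = c * T^T p + (1 - c) v (no dangling-node fix)."""
    p = {x: (1 - c) * v.get(x, 0.0) for x in nodes}
    for _ in range(iters):
        nxt = {x: (1 - c) * v.get(x, 0.0) for x in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = c * p[u] / len(targets)
                for w in targets:
                    nxt[w] += share
        p = nxt
    return p

def detect_spam(nodes, out_links, good_core, threshold=0.5, c=0.85):
    n = len(nodes)
    v_all = {x: 1.0 / n for x in nodes}         # uniform random jump
    v_good = {x: 1.0 / n for x in good_core}    # jump restricted to the good core
    p = pagerank(nodes, out_links, v_all, c)         # step 2: p_i
    p_plus = pagerank(nodes, out_links, v_good, c)   # step 2: p_i^+
    result = {}
    for x in nodes:
        m = (p[x] - p_plus[x]) / p[x]           # step 3: estimated relative mass
        # Step 4 (the paper applies this only to pages with large PageRank):
        result[x] = ('spam' if m > threshold else 'good', m)
    return result

# Tiny demo (hypothetical hosts): a 4-node farm s1..s4 <-> s0 next to good hosts g1 <-> g2.
nodes = ['g1', 'g2', 's0', 's1', 's2', 's3', 's4']
out = {'g1': ['g2'], 'g2': ['g1'], 's0': ['s1', 's2', 's3', 's4'],
       's1': ['s0'], 's2': ['s0'], 's3': ['s0'], 's4': ['s0']}
labels = detect_spam(nodes, out, good_core={'g1', 'g2'})
print(labels['s0'])   # farm target: no good-core contribution, relative mass 1.0 -> spam
```

Since no good-core PageRank reaches the farm, the target's estimated relative mass is exactly 1, while the good hosts' mass is 0.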
Experiments: Data
■ Yahoo! web index → host graph
  • 73.3M nodes
  • 979M links
■ Good core
  • High-quality web directory hosts: 16,780
  • Governmental hosts: 55,320
  • Educational hosts: 434,000
Experiments: Data
■ Sample
  • 0.1% of nodes with PageRank > 10× the minimum
  • 892 nodes
  • Manually labeled good or spam
■ Relative mass groups (approx. same size)
  • Group 1: 44 samples with smallest rel. mass
  …
  • Group 20: 40 samples with largest rel. mass
Experiments: Relative Mass
[Bar chart: composition (good / spam / anomalous) of sample groups 1–20; the spam fraction grows with relative mass, from a small minority in the low-mass groups to roughly 90–100% in the highest-mass groups]
Experiments: Relative Mass
■ Anomalies
  • *.alibaba.com
  • *.blogger.com.br
  • Polish hosts → only 12 .pl hosts in the good core
Experiments: Core Size
[Plot: estimated precision (0.4–0.8) vs. relative mass threshold (0–0.98) for good cores of different sizes: 100% core, 10% core, 1% core, 0.1% core, and an .it-only core]
Related Work
■ PageRank analyses: [Bianchini+2005], [Langville+2004]
■ Link spam analyses: [Baeza+2005], [Gyöngyi+2005]
■ Link spam detection
  • Statistics: [Fetterly+2004], [Benczúr+2005]
  • Collusion detection: [Zhang+2004], [Wu+2005]
■ TrustRank: [Gyöngyi+2004], [Wu+2006]
Conclusions
■ Search engine spamming
  • Manipulation of search engine ranking
  • Focus on link spamming
■ Spam mass
  • ≈ PageRank contribution of spam
  • Useful in link spam detection
■ Strong experimental results
  • Virtually 100% of the top 47K nodes are spam
  • 94% of the top 105K nodes are spam
Link Spamming: Model
■ Spam farm
  1. Target node s0
  2. Boosting nodes s1, s2, … (e.g., generated pages: “Ski Austria travel… great cheap ski Switzerland Italy travel, best rates, winter sports, hotels”)
  3. Hijacked links from good nodes g1, g2 (e.g., a comment on Joe’s Blog: “Great pictures! See my Austria ski vacation.” by as7869)
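To see why such a farm works, here is a small, self-contained simulation (the damping factor, farm size, and graph are assumptions for illustration): a target page exchanging links with k boosting pages ends up with a PageRank far above the (1 − c)/n baseline of an unlinked page.

```python
# Illustrative simulation: PageRank boost from a simple spam farm.
c = 0.85            # damping factor (assumed)
k = 50              # number of boosting nodes (assumed)
n = k + 2           # node 0 = target s0, nodes 1..k = boosters, node k+1 = unlinked page

# s0 links to every booster; every booster links back to s0.
out = {0: list(range(1, k + 1))}
out.update({i: [0] for i in range(1, k + 1)})

p = [(1 - c) / n] * n
for _ in range(200):                      # power iteration on p = c*T^T p + (1-c)/n
    nxt = [(1 - c) / n] * n
    for u, targets in out.items():
        share = c * p[u] / len(targets)
        for w in targets:
            nxt[w] += share
    p = nxt

baseline = (1 - c) / n                    # score of a page with no in-links
print(p[0] / baseline)                    # farm multiplies s0's score by (1 + ck)/(1 - c^2)
```

With these assumed parameters the closed-form boost (1 + ck)/(1 − c²) is about 157×, which is why the paper targets exactly this kind of structure.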
Link Spamming: Model
■ Spam farm alliances
PageRank
■ Probabilistic model: p = c Uᵀ p + (1 − c) v
  • U = U(T, v): stochastic transition matrix
  • ‖v‖ = 1
■ Linear model: (I − c Tᵀ) p = (1 − c) v
  • No adjustment for nodes without out-links (transition matrix T has all-zero rows)
  • Advantages
    – For p = PR(v) and v = v1 + v2: p = p1 + p2, where p1 = PR(v1) and p2 = PR(v2)
    – Faster to compute
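The linearity advantage noted above is easy to verify numerically. The sketch below (graph and jump vectors are arbitrary assumptions) computes PR(v1), PR(v2), and PR(v1 + v2) with the same power iteration and checks that the results add up:

```python
# Numeric check that linear-model PageRank is linear in the jump vector v.
def pr(out, n, v, c=0.85, iters=200):
    p = [(1 - c) * v[i] for i in range(n)]
    for _ in range(iters):
        nxt = [(1 - c) * v[i] for i in range(n)]
        for u, targets in out.items():
            share = c * p[u] / len(targets)
            for w in targets:
                nxt[w] += share
        p = nxt
    return p

n = 4
out = {0: [1, 2], 1: [2], 2: [0]}   # node 3 has no out-links: its mass simply leaks
v1 = [0.5, 0.5, 0.0, 0.0]
v2 = [0.0, 0.0, 0.25, 0.25]
p1, p2 = pr(out, n, v1), pr(out, n, v2)
p12 = pr(out, n, [a + b for a, b in zip(v1, v2)])
assert all(abs(p12[i] - p1[i] - p2[i]) < 1e-9 for i in range(n))
```

Because the iteration is affine in v, PR(v1 + v2) matches PR(v1) + PR(v2) to rounding error; this is what lets the spam mass method split PageRank into good and spam parts by splitting the jump vector.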
PageRank Contribution
■ Walk W from x to y: x = x0, x1, …, xk = y
  • Weight π(W) = out(x0)⁻¹ ··· out(xk−1)⁻¹
■ Contribution of x to y over W: cᵏ π(W) (1 − c)/n
■ PageRank contribution p_yˣ of x to y: sum over all walks
  • Possibly infinite number of walks if there are cycles
  • p_yˣ = PR(random jump to x only)
■ See also [Jeh+2003]
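The identity p_yˣ = PR(random jump to x only) gives a direct way to check that per-node contributions sum to the full PageRank. A small sketch on an assumed 3-node cycle:

```python
# Check that PageRank decomposes into per-node contributions:
# p_y = sum over x of p_y^x, with p_y^x = PR_y(jump vector = (1/n) * e_x).
def pr(out, n, v, c=0.85, iters=200):
    p = [(1 - c) * v[i] for i in range(n)]
    for _ in range(iters):
        nxt = [(1 - c) * v[i] for i in range(n)]
        for u, targets in out.items():
            share = c * p[u] / len(targets)
            for w in targets:
                nxt[w] += share
        p = nxt
    return p

n = 3
out = {0: [1], 1: [2], 2: [0]}      # a 3-node cycle (assumed example; has infinitely many walks)
total = pr(out, n, [1.0 / n] * n)   # ordinary PageRank, uniform jump
contrib = [pr(out, n, [1.0 / n if i == x else 0.0 for i in range(n)])
           for x in range(n)]       # contribution of each x: jump restricted to x
for y in range(n):
    assert abs(total[y] - sum(contrib[x][y] for x in range(n))) < 1e-9
```

Even though the cycle admits infinitely many walks, the jump-restricted PageRank runs converge and their sum recovers the uniform-jump PageRank exactly.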