Link Spam Detection Based on Mass Estimation


  1. Link Spam Detection Based on Mass Estimation
     Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen

  2. Roadmap
     ▪ Search engine spamming
     ▪ Link spamming
     ▪ PageRank contribution
     ▪ Spam mass
       • Definition
       • Estimation
       • Algorithm
     ▪ Experiments
     Very Large Data Bases ● Seoul, September 13, 2006

  3. Spamming: Example
     #1 search result for the query “austria ski”

  4. Spamming: Example
     #1 search result for the query “austria ski”
     • asiandiveholidays.com
     • asianmp3.com
     • mp3thailand.com
     • thailandhealthcaretimes.com
     • thailandpropertytimes.com

  5. Spamming: Example

  6. Spamming: Introduction
     Spamming = misleading search engines to obtain higher-than-deserved ranking

  8. Spamming: Introduction
     Spamming = misleading search engines to obtain higher-than-deserved ranking
     Link spamming = building link structures that boost PageRank score

  9. Spamming: Our Target
     Detect pages that achieve high PageRank through link spamming
     [figure: target node $s_0$ with spam nodes $s_1, \dots, s_k$ and good nodes $g_1, \dots, g_m$, where $k \gg m$]

  10. PageRank Contribution

  11. PageRank Contribution
      [figure: node $p_0$]

  15. PageRank Contribution
      $p_0^+ = 2c^2(1 - c)/n + 2c(1 - c)/n$
      $p_0^- = 6c^2(1 - c)/n + c(1 - c)/n$
      [figure: node $p_0$]

  16. Spam Mass: Definition
      ▪ Absolute mass
        • Amount (part) of PageRank coming from spam: a.m. = $p_0^- = 5$
      ▪ Relative mass
        • Fraction of PageRank coming from spam: r.m. = $p_0^- / p_0 = 5/7$
        • More useful in practice
      [figure: node $p_0$ with labels 5 and 2]

  17. Spam Mass: Estimation
      Ideally… [figure: node $p_0$]

  18. Spam Mass: Estimation
      In practice… [figure: $p_0^+$]
      ▪ Approximate the set of good nodes by a subset called the good core

  19. Spam Mass: Estimation
      In practice… $p_0^- = p_0 - p_0^+$ [figure: $p_0^+$]
      ▪ Approximate the set of good nodes by a subset called the good core

  20. Spam Mass: Algorithm
      1. Create good core
      2. Compute PageRank scores $p_i$ and $p_i^+$
      3. Compute estimated relative mass $m_i = (p_i - p_i^+) / p_i$
      4. For all pages $i$ with large PageRank: mark page as spam if $m_i$ > threshold
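
A minimal sketch of the four algorithm steps, assuming the linear PageRank model from the backup slides, a uniform random jump of 1/n for $p_i$, and a jump of 1/n restricted to the good core for $p_i^+$. The function names, the toy graph, and both thresholds are hypothetical and are not taken from the talk.

```python
def linear_pagerank(out_links, v, c=0.85, iterations=200):
    """Iterate p <- c * T^T p + (1 - c) * v (no fix-up for dangling nodes)."""
    p = [(1 - c) * vi for vi in v]
    for _ in range(iterations):
        nxt = [(1 - c) * vi for vi in v]
        for i, targets in enumerate(out_links):
            if not targets:
                continue  # node without outlinks: its score is not redistributed
            share = c * p[i] / len(targets)
            for j in targets:
                nxt[j] += share
        p = nxt
    return p


def relative_mass(out_links, good_core, c=0.85):
    """Steps 2 and 3: return (p, m) with m_i = (p_i - p_i^+) / p_i."""
    n = len(out_links)
    uniform = [1.0 / n] * n
    # Assumption: the core-based jump keeps weight 1/n on core nodes, 0 elsewhere.
    core_jump = [1.0 / n if i in good_core else 0.0 for i in range(n)]
    p = linear_pagerank(out_links, uniform, c)
    p_plus = linear_pagerank(out_links, core_jump, c)
    m = [(pi - gi) / pi for pi, gi in zip(p, p_plus)]
    return p, m


def detect_spam(out_links, good_core, rank_threshold, mass_threshold, c=0.85):
    """Step 4: among pages with large PageRank, flag those with large relative mass."""
    p, m = relative_mass(out_links, good_core, c)
    return [i for i in range(len(p))
            if p[i] >= rank_threshold and m[i] > mass_threshold]


# Hypothetical toy graph: node 0 is a spam target boosted by nodes 1-3;
# nodes 4 and 5 are good pages that link to each other and form the good core.
toy_graph = [[], [0], [0], [0], [5], [4]]
core = {4, 5}
p, m = relative_mass(toy_graph, core)
flagged = detect_spam(toy_graph, core, rank_threshold=0.05, mass_threshold=0.9)
print(flagged)
```

In this toy graph the boosting nodes also have relative mass 1, but their PageRank falls below the rank threshold, so only the target is flagged.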

  21. Experiments: Data
      ▪ Yahoo! web index
      ▪ Host graph
        • 73.3M nodes
        • 979M links
      ▪ Good core
        • High-quality web directory: 16,780
        • Governmental hosts: 55,320
        • Educational hosts: 434,000

  22. Experiments: Data
      ▪ Sample
        • 0.1% of nodes with PageRank > 10× minimum
        • 892 nodes
        • Manually labeled good, spam
      ▪ Relative mass groups (approx. same size)
        • Group 1: 44 samples with smallest rel. mass
        • …
        • Group 20: 40 samples with largest rel. mass
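
The grouping step can be sketched as follows: sort the labeled samples by relative mass, cut them into groups of roughly equal size, and report the spam fraction per group. The slide does not say how the group boundaries were chosen, so the even split below is an assumption, and the helper names and sample data are hypothetical.

```python
def split_into_groups(samples, group_count):
    """samples: list of (relative_mass, label); returns groups in increasing mass order."""
    ordered = sorted(samples, key=lambda s: s[0])
    base, extra = divmod(len(ordered), group_count)
    groups, start = [], 0
    for g in range(group_count):
        size = base + (1 if g < extra else 0)  # the first `extra` groups get one more sample
        groups.append(ordered[start:start + size])
        start += size
    return groups


def spam_fraction(group):
    """Fraction of samples in a non-empty group that carry the label 'spam'."""
    return sum(1 for _, label in group if label == "spam") / len(group)


# Hypothetical labeled samples, not data from the experiments.
samples = [(0.1 * i, "spam" if i >= 6 else "good") for i in range(10)]
groups = split_into_groups(samples, 5)
fractions = [spam_fraction(g) for g in groups]
print(fractions)
```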

  23. Experiments: Relative Mass
      [figure: sample group composition in % (good, anomalous, spam) by sample group number, 1 to 20]

  24. Experiments: Relative Mass
      ▪ Anomalies
        • *.alibaba.com
        • *.blogger.com.br
        • Polish hosts (only 12 .pl in good core)

  25. Experiments: Relative Mass

  26. Experiments: Core Size
      [figure: estimated precision versus relative mass threshold (0.98 down to 0) for the 100% core, 10% core, 1% core, 0.1% core, and .it core]

  27. Related Work
      ▪ PageRank analyses
        • [Bianchini+2005], [Langville+2004]
      ▪ Link spam analyses
        • [Baeza+2005], [Gyöngyi+2005]
      ▪ Link spam detection
        • Statistics: [Fetterly+2004], [Benczúr+2005]
        • Collusion detection: [Zhang+2004], [Wu+2005]
      ▪ TrustRank
        • [Gyöngyi+2004], [Wu+2006]

  28. Conclusions
      ▪ Search engine spamming
        • Manipulation of search engine ranking
        • Focus on link spamming
      ▪ Spam mass
        • ~ PageRank contribution of spam
        • Useful in link spam detection
      ▪ Strong experimental results
        • Virtually 100% of top 47K nodes spam
        • 94% of top 105K nodes spam

  29. Link Spamming: Model
      ▪ Spam farm

  30. Link Spamming: Model
      ▪ Spam farm
        1. Target node
      [figure: target node $s_0$]

  31. Link Spamming: Model
      ▪ Spam farm
        1. Target node
        2. Boosting nodes
      [figure: target node $s_0$ and boosting nodes $s_1$ to $s_4$; sample page text: “Ski Austria travel… Great cheap ski Switzerland Italy travel best rates winter sports hotels”]

  32. Link Spamming: Model
      ▪ Spam farm
        1. Target node
        2. Boosting nodes
        3. Hijacked links from good nodes
      [figure: target node $s_0$, boosting nodes $s_1$ to $s_4$, good nodes $g_1$ and $g_2$; sample comment on “Joe’s Blog”: “Great pictures! See my Austria ski vacation. (by as7869)”]

  33. Link Spamming: Model
      ▪ Spam farm alliances

  34. PageRank
      ▪ Probabilistic model: $p = c\,U^T p + (1 - c)\,v$
        • $U = U(T, v)$ stochastic transition matrix
        • $\|v\| = 1$
      ▪ Linear model: $(I - c\,T^T)\,p = (1 - c)\,v$
        • No adjustment for nodes without outlinks (transition matrix $T$ has all-zero rows)
        • Advantages
          - For $p = \mathrm{PR}(v)$ and $v = v_1 + v_2$, $p = p_1 + p_2$ where $p_1 = \mathrm{PR}(v_1)$ and $p_2 = \mathrm{PR}(v_2)$
          - Faster to compute
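
A small sketch of the linear model and its additivity in the jump vector, assuming a fixed-point iteration as the solver (the slide does not prescribe one). The function name `linear_pagerank`, the adjacency-list graph format, and the example graph are hypothetical.

```python
def linear_pagerank(out_links, v, c=0.85, iterations=200):
    """Approximate the solution of (I - c T^T) p = (1 - c) v by fixed-point iteration.

    out_links[i] lists the nodes that node i links to; T[i][j] = 1 / out(i) for
    each link i -> j, and rows of nodes without outlinks stay all-zero.
    """
    p = [(1 - c) * vi for vi in v]
    for _ in range(iterations):
        nxt = [(1 - c) * vi for vi in v]
        for i, targets in enumerate(out_links):
            if not targets:
                continue  # no adjustment for nodes without outlinks
            share = c * p[i] / len(targets)
            for j in targets:
                nxt[j] += share
        p = nxt
    return p


# Hypothetical 4-node graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2.
graph = [[1, 2], [2], [0], [2]]
v = [0.25, 0.25, 0.25, 0.25]   # uniform random jump
v1 = [0.25, 0.25, 0.0, 0.0]    # v = v1 + v2
v2 = [0.0, 0.0, 0.25, 0.25]

p = linear_pagerank(graph, v)
p1 = linear_pagerank(graph, v1)
p2 = linear_pagerank(graph, v2)

# Additivity: PR(v1 + v2) agrees with PR(v1) + PR(v2) up to rounding.
max_gap = max(abs(a - (b + d)) for a, b, d in zip(p, p1, p2))
print(max_gap)
```

The additivity is what lets a PageRank score be split into a part that comes from one set of nodes and a part that comes from the rest.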

  35. PageRank Contribution
      ▪ Walk $W$ from $x$ to $y$: $x = x_0, x_1, \dots, x_k = y$
        • Weight $\pi(W) = \mathrm{out}(x_0)^{-1} \cdots \mathrm{out}(x_{k-1})^{-1}$
      ▪ Contribution of $x$ to $y$ over $W$: $c^k \pi(W) (1 - c) / n$
      ▪ PageRank contribution $p_y^x$ of $x$ to $y$: taken over all walks
        • Possibly infinite # of walks if there are cycles
        • $p_y^x = \mathrm{PR}$(random jump to $x$ only)
      ▪ See also [Jeh+2003]
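
A hedged sketch of the walk-based definition: instead of listing walks one by one, it accumulates the total weight $\pi(W)$ of all walks of each length $k$ from $x$ and truncates at a maximum length, since cycles make the number of walks infinite. The zero-length walk from $x$ to itself is counted with weight 1, an assumption chosen so that the result matches PR with random jump to $x$ only. The function name, the truncation length, and the two-node example are hypothetical.

```python
def contribution(out_links, x, y, c=0.85, max_len=200):
    """Approximate the contribution of x to y: sum over walks of c^k * pi(W) * (1 - c) / n."""
    n = len(out_links)
    # weight[j] = total pi(W) over all walks of the current length from x to j
    weight = [0.0] * n
    weight[x] = 1.0  # the zero-length walk from x to itself
    total = 0.0
    damp = 1.0  # holds c ** k for the current walk length k
    for _ in range(max_len + 1):
        total += damp * weight[y]
        nxt = [0.0] * n
        for i, targets in enumerate(out_links):
            if weight[i] and targets:
                share = weight[i] / len(targets)  # each step multiplies by 1 / out(i)
                for j in targets:
                    nxt[j] += share
        weight = nxt
        damp *= c
    return total * (1 - c) / n


# Hypothetical two-node cycle 0 -> 1 -> 0: walks from 0 to 1 have lengths 1, 3, 5, ...
two_cycle = [[1], [0]]
c = 0.85
from_0_to_1 = contribution(two_cycle, 0, 1, c)
from_1_to_1 = contribution(two_cycle, 1, 1, c)
closed_form = c / (2 * (1 + c))  # (1 - c)/n * (c + c^3 + ...) with n = 2
print(from_0_to_1, closed_form)
```

For this cycle the two contributions to node 1 add up to its PageRank of 1/2, which illustrates how a score decomposes by source node.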
