Link Spam Detection Based on Mass Estimation
Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen
Roadmap
■ Search engine spamming
■ Link spamming
■ PageRank contribution
■ Spam mass
  • Definition
  • Estimation
  • Algorithm
■ Experiments
Very Large Data Bases ● Seoul, September 13, 2006
Spamming: Example #1
[Screenshot: search results for the query “austria ski”, including the spam hosts asiandiveholidays.com, asianmp3.com, mp3thailand.com, thailandhealthcaretimes.com, thailandpropertytimes.com]
Spamming: Example
[Screenshot]
Spamming: Introduction
Spamming = misleading search engines to obtain higher-than-deserved ranking
Link spamming = building link structures that boost PageRank scores
Spamming: Our Target
Detect pages that achieve high PageRank through link spamming
[Diagram: a spam farm of boosting nodes s1, …, sk pointing to target node s0, plus good nodes g1, …, gm linking in, with k ≫ m]
PageRank Contribution
[Diagram, built up step by step: walks from good and spam nodes contributing PageRank to target node p0]
In the example:
p0⁺ = 2c²(1 − c)/n + 2c(1 − c)/n
p0⁻ = 6c²(1 − c)/n + c(1 − c)/n
Spam Mass: Definition
■ Absolute mass
  • Amount (part) of PageRank coming from spam
  • In the example: a.m. = p0⁻
■ Relative mass
  • Fraction of PageRank coming from spam
  • In the example: r.m. = p0⁻ / p0 = 5/7
  • More useful in practice
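A one-line numeric illustration of the two quantities (the values 5 and 7 mirror the slide's toy example; they are not measurements):

```python
# Toy example: a page's PageRank p0 totals 7 units, 5 of which come from spam.
p0 = 7.0
p0_spam = 5.0                   # absolute mass: PageRank contributed by spam nodes
relative_mass = p0_spam / p0    # relative mass: fraction of PageRank from spam
assert abs(relative_mass - 5 / 7) < 1e-12
```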
Spam Mass: Estimation
Ideally…
[Diagram: p0’s PageRank split exactly into good and spam contributions]
Spam Mass: Estimation
In practice…
■ Approximate the set of good nodes by a subset called the good core
■ Estimate the spam contribution as p̃0⁻ = p0 − p0⁺
Spam Mass: Algorithm
1. Create good core
2. Compute PageRank scores pᵢ and pᵢ⁺
3. Compute estimated relative mass m̃ᵢ = (pᵢ − pᵢ⁺) / pᵢ
4. For all pages i with large PageRank: mark page i as spam if m̃ᵢ > threshold
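A minimal sketch of the four steps above, assuming the graph is given as adjacency lists. The node names, threshold, and damping factor c = 0.85 are illustrative assumptions, not the paper's experimental settings, and the sketch labels every node rather than only high-PageRank ones:

```python
# Sketch of the spam mass algorithm (assumed representation, not the paper's code).

def pagerank(nodes, out_links, v, c=0.85, iters=100):
    """Linear-model PageRank: p = c * T^T p + (1 - c) v (no dangling-node fix)."""
    p = {x: (1 - c) * v.get(x, 0.0) for x in nodes}
    for _ in range(iters):
        nxt = {x: (1 - c) * v.get(x, 0.0) for x in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = c * p[u] / len(targets)
                for w in targets:
                    nxt[w] += share
        p = nxt
    return p

def detect_spam(nodes, out_links, good_core, threshold=0.5, c=0.85):
    n = len(nodes)
    v_all = {x: 1.0 / n for x in nodes}         # uniform random jump
    v_good = {x: 1.0 / n for x in good_core}    # jump restricted to the good core
    p = pagerank(nodes, out_links, v_all, c)         # step 2: p_i
    p_plus = pagerank(nodes, out_links, v_good, c)   # step 2: p_i^+
    result = {}
    for x in nodes:
        m = (p[x] - p_plus[x]) / p[x]           # step 3: estimated relative mass
        # Step 4 (the paper applies this only to pages with large PageRank):
        result[x] = ('spam' if m > threshold else 'good', m)
    return result

# Tiny demo (hypothetical hosts): a 4-node farm s1..s4 <-> s0 next to good hosts g1 <-> g2.
nodes = ['g1', 'g2', 's0', 's1', 's2', 's3', 's4']
out = {'g1': ['g2'], 'g2': ['g1'], 's0': ['s1', 's2', 's3', 's4'],
       's1': ['s0'], 's2': ['s0'], 's3': ['s0'], 's4': ['s0']}
labels = detect_spam(nodes, out, good_core={'g1', 'g2'})
print(labels['s0'])   # farm target: no good-core contribution, relative mass 1.0 -> spam
```

Since no good-core PageRank reaches the farm, the target's estimated relative mass is exactly 1, while the good hosts' mass is 0.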
Experiments: Data
■ Yahoo! web index → host graph
  • 73.3M nodes
  • 979M links
■ Good core
  • High-quality web directory hosts: 16,780
  • Governmental hosts: 55,320
  • Educational hosts: 434,000
Experiments: Data
■ Sample
  • 0.1% of nodes with PageRank > 10× the minimum
  • 892 nodes
  • Manually labeled good or spam
■ Relative mass groups (approx. same size)
  • Group 1: 44 samples with smallest rel. mass
  …
  • Group 20: 40 samples with largest rel. mass
Experiments: Relative Mass
[Bar chart: composition (good / spam / anomalous) of sample groups 1–20; the spam fraction grows with relative mass, from a small minority in the low-mass groups to roughly 90–100% in the highest-mass groups]
Experiments: Relative Mass
■ Anomalies
  • *.alibaba.com
  • *.blogger.com.br
  • Polish hosts → only 12 .pl hosts in the good core
Experiments: Core Size
[Plot: estimated precision (0.4–0.8) vs. relative mass threshold (0–0.98) for good cores of different sizes: 100% core, 10% core, 1% core, 0.1% core, and an .it-only core]
Related Work
■ PageRank analyses: [Bianchini+2005], [Langville+2004]
■ Link spam analyses: [Baeza+2005], [Gyöngyi+2005]
■ Link spam detection
  • Statistics: [Fetterly+2004], [Benczúr+2005]
  • Collusion detection: [Zhang+2004], [Wu+2005]
■ TrustRank: [Gyöngyi+2004], [Wu+2006]
Conclusions
■ Search engine spamming
  • Manipulation of search engine ranking
  • Focus on link spamming
■ Spam mass
  • ≈ PageRank contribution of spam
  • Useful in link spam detection
■ Strong experimental results
  • Virtually 100% of the top 47K nodes are spam
  • 94% of the top 105K nodes are spam
Link Spamming: Model
■ Spam farm
  1. Target node s0
  2. Boosting nodes s1, s2, … (e.g., generated pages: “Ski Austria travel… great cheap ski Switzerland Italy travel, best rates, winter sports, hotels”)
  3. Hijacked links from good nodes g1, g2 (e.g., a comment on Joe’s Blog: “Great pictures! See my Austria ski vacation.” by as7869)
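To see why such a farm works, here is a small, self-contained simulation (the damping factor, farm size, and graph are assumptions for illustration): a target page exchanging links with k boosting pages ends up with a PageRank far above the (1 − c)/n baseline of an unlinked page.

```python
# Illustrative simulation: PageRank boost from a simple spam farm.
c = 0.85            # damping factor (assumed)
k = 50              # number of boosting nodes (assumed)
n = k + 2           # node 0 = target s0, nodes 1..k = boosters, node k+1 = unlinked page

# s0 links to every booster; every booster links back to s0.
out = {0: list(range(1, k + 1))}
out.update({i: [0] for i in range(1, k + 1)})

p = [(1 - c) / n] * n
for _ in range(200):                      # power iteration on p = c*T^T p + (1-c)/n
    nxt = [(1 - c) / n] * n
    for u, targets in out.items():
        share = c * p[u] / len(targets)
        for w in targets:
            nxt[w] += share
    p = nxt

baseline = (1 - c) / n                    # score of a page with no in-links
print(p[0] / baseline)                    # farm multiplies s0's score by (1 + ck)/(1 - c^2)
```

With these assumed parameters the closed-form boost (1 + ck)/(1 − c²) is about 157×, which is why the paper targets exactly this kind of structure.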
Link Spamming: Model
■ Spam farm alliances
PageRank
■ Probabilistic model: p = c Uᵀ p + (1 − c) v
  • U = U(T, v): stochastic transition matrix
  • ‖v‖ = 1
■ Linear model: (I − c Tᵀ) p = (1 − c) v
  • No adjustment for nodes without out-links (transition matrix T has all-zero rows)
  • Advantages
    – For p = PR(v) and v = v1 + v2: p = p1 + p2, where p1 = PR(v1) and p2 = PR(v2)
    – Faster to compute
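The linearity advantage noted above is easy to verify numerically. The sketch below (graph and jump vectors are arbitrary assumptions) computes PR(v1), PR(v2), and PR(v1 + v2) with the same power iteration and checks that the results add up:

```python
# Numeric check that linear-model PageRank is linear in the jump vector v.
def pr(out, n, v, c=0.85, iters=200):
    p = [(1 - c) * v[i] for i in range(n)]
    for _ in range(iters):
        nxt = [(1 - c) * v[i] for i in range(n)]
        for u, targets in out.items():
            share = c * p[u] / len(targets)
            for w in targets:
                nxt[w] += share
        p = nxt
    return p

n = 4
out = {0: [1, 2], 1: [2], 2: [0]}   # node 3 has no out-links: its mass simply leaks
v1 = [0.5, 0.5, 0.0, 0.0]
v2 = [0.0, 0.0, 0.25, 0.25]
p1, p2 = pr(out, n, v1), pr(out, n, v2)
p12 = pr(out, n, [a + b for a, b in zip(v1, v2)])
assert all(abs(p12[i] - p1[i] - p2[i]) < 1e-9 for i in range(n))
```

Because the iteration is affine in v, PR(v1 + v2) matches PR(v1) + PR(v2) to rounding error; this is what lets the spam mass method split PageRank into good and spam parts by splitting the jump vector.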
PageRank Contribution
■ Walk W from x to y: x = x0, x1, …, xk = y
  • Weight π(W) = out(x0)⁻¹ ··· out(xk−1)⁻¹
■ Contribution of x to y over W: cᵏ π(W) (1 − c)/n
■ PageRank contribution p_yˣ of x to y: sum over all walks
  • Possibly infinite number of walks if there are cycles
  • p_yˣ = PR(random jump to x only)
■ See also [Jeh+2003]
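The identity p_yˣ = PR(random jump to x only) gives a direct way to check that per-node contributions sum to the full PageRank. A small sketch on an assumed 3-node cycle:

```python
# Check that PageRank decomposes into per-node contributions:
# p_y = sum over x of p_y^x, with p_y^x = PR_y(jump vector = (1/n) * e_x).
def pr(out, n, v, c=0.85, iters=200):
    p = [(1 - c) * v[i] for i in range(n)]
    for _ in range(iters):
        nxt = [(1 - c) * v[i] for i in range(n)]
        for u, targets in out.items():
            share = c * p[u] / len(targets)
            for w in targets:
                nxt[w] += share
        p = nxt
    return p

n = 3
out = {0: [1], 1: [2], 2: [0]}      # a 3-node cycle (assumed example; has infinitely many walks)
total = pr(out, n, [1.0 / n] * n)   # ordinary PageRank, uniform jump
contrib = [pr(out, n, [1.0 / n if i == x else 0.0 for i in range(n)])
           for x in range(n)]       # contribution of each x: jump restricted to x
for y in range(n):
    assert abs(total[y] - sum(contrib[x][y] for x in range(n))) < 1e-9
```

Even though the cycle admits infinitely many walks, the jump-restricted PageRank runs converge and their sum recovers the uniform-jump PageRank exactly.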