Web Spam Challenges
Carlos Castillo, Yahoo! Research
chato@yahoo-inc.com
WEBSPAM-UK200[67]
● Got a crawl from Boldi/Vigna/Santini in 2006Q3
● Wrote to 20-30 people to ask for volunteers
  – Most said yes, and most of them didn't defect
● Created an interface for labeling
  – 3-4 days of work to get it right
● Labeled a few thousand elements together
● Then did basically the same again in 2007Q3
Why is it good to do collaborative labeling?
● The labeling reflects some degree of consensus
  – After-the-fact methodological discussions can be very distracting; in the end there were none
● Webmasters do not harass you
  – Responsibility is shared, and furthermore you tell search engines not to use these labels
● Labelers gain insights about the problem
Why is it bad to do collaborative labeling?
● In this particular problem, it is very expensive
  – You get more labels for less money if you just pay Mechanical Turk workers (MTs) for the labels
Lessons (learned?)
● Would build WEBSPAM-UK2006 from the 2nd or 3rd crawl instead of using the 1st one
● Would try to raise money for WEBSPAM-UK2007 and do it with MTs
  – If the money were enough, try to go for a larger collection
Web Spam Challenges
● The good
  – Saved participants a lot of processing, and thus...
  – Got several submissions with diverse approaches
  – Baseline was strong, but not too strong
  – Side effect: a good dataset for learning on graphs
● The bad
  – Train/test splits at host level (challenge I; fixed in III; see the sketch after this list)
  – Snowball sampling (challenge II; fixed in III)
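To make the split issue concrete (a toy illustration under my own assumptions, not the challenge's actual protocol): when train and test hosts live in the same crawl, any feature computed over the full link graph mixes information from both sides of the split. The three-host graph below is hypothetical.

```python
# Hypothetical host-level link graph from a single crawl.
links = {"a.uk": ["b.uk"], "b.uk": ["c.uk"], "c.uk": ["a.uk", "b.uk"]}

def in_degree(graph):
    """Count incoming links per host over the given subgraph."""
    deg = {h: 0 for h in graph}
    for targets in graph.values():
        for t in targets:
            deg[t] = deg.get(t, 0) + 1
    return deg

train_hosts, test_hosts = ["a.uk", "b.uk"], ["c.uk"]

# Feature computed on the full graph: it also sees links from the test side.
full = in_degree(links)
# Feature computed on the training subgraph only: no test information.
train_only = in_degree({h: [t for t in links[h] if t in train_hosts]
                        for h in train_hosts})
print(full["b.uk"], train_only["b.uk"])  # 2 vs. 1: the test host leaked in
```

Computing graph features strictly within each side of the split removes this channel; per the slide, the third challenge fixed the issue.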
Lessons (learned?)
● Would do mostly the same
● Avoid the mistakes above
● Promote the competition much more
  – Try to appeal to a wider audience
  – Get sponsorship for a prize
What is the point of all this?
● Remove a roadblock for researchers working on a topic
● Encourage multiple approaches to a problem, in parallel
● Keep web data flowing into universities
● Allow repeatability of the results
So, if a new dataset+challenge appears
● It has to be a good problem: novel, difficult, and with a range of potential applications
● Why? Because if we are going to encourage many information-retrieval researchers to work on this problem, there has to be a large treasure chest to split at the end
Good signals to look for
● “The dataset for X removed a roadblock towards a complex information-retrieval problem for which no other dataset existed”
● “Research about X was only done inside companies before”
● “Problem X was increasingly threatening Web-based work/search/collaboration/etc.”
Ideas (1/4)
● Disruptive or non-cooperative behaviour in:
  – Peer-production sites
    ● Examples: review/opinion/tag/tag-as-vote/vote spam
    ● Adversary: wants to promote his agenda/business
  – Social networks
    ● Example: find fake users
    ● Adversary: wants to be seen as multiple independent people
    ● Example: find users who are too aggressive in promoting their own stuff? Most social networks have norms against it (Wikipedia/kuro5hin/Digg/etc.)
Ideas (2/4)
● Plagiarism or missing attribution
  – Web-scale automatic identification of sources for the statements in a document (see the sketch below)
  – Adversary: wants to make his posting look original
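One standard building block for this idea (a minimal sketch, not something from the talk): Broder-style w-shingling scores how much contiguous text two documents share, which is the usual first pass for spotting copied statements. The texts and window size below are made up for illustration.

```python
def shingles(text, w=4):
    """Set of w-word shingles (contiguous word windows) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard overlap of the two shingle sets: near 1 for near-duplicates."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

original = "the oil spill from the tanker has killed five hundred seals"
copied   = "the oil spill from the tanker has killed many seals they say"
print(resemblance(original, copied))  # ~0.42: a large shared fraction signals copying
```

At web scale this pairwise comparison would be replaced by sketching/indexing (e.g. min-hashing of shingles), but the overlap score is the core signal.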
Ideas (3/4)
● Checking/joining facts on the Web
  – “The capital of Spain is Toledo (Wikipedia: Madrid)”
  – “The oil spill from the tanker has killed 500 seals (BBC: 541 seals; FOX: 2 anchovies)”
  – Adversary: wants you to believe something wrong
  – Related problem: revealing networks of mutually-reinforcing sites pushing a certain agenda
● An aspect of credibility on the Web (there is already a workshop on that)
Ideas (4/4)
● Simpler problem: validating citations
  – Does this citation to page P actually support the claim it is cited for? Where specifically in P?
  – Adversary: wants to convince you of something that is not supported by the pages he is linking to
  – E.g.: someone wants to convince you that Einstein believed in a personal God by quoting him selectively, but you have access to all his books/letters/etc.
Summary of proposals
● Non-cooperative behaviour in peer-production networks
● Disruptive usage of social networking sites
● Distortions or falsehoods on the web
● Citations: missing attribution (plagiarism)
● Citations: distorted attribution (invalid citation)