Query-log mining for detecting spam queries
Carlos Castillo (1), Claudio Corsi (2), Debora Donato (1), Paolo Ferragina (2), Aristides Gionis (1)
(1) Yahoo! Research Labs, Barcelona, Spain
(2) University of Pisa, Italy

motivation
- Query logs provide valuable information for queries and for documents: implicit tags, wisdom of crowds
- Human-constructed directories provide high-quality classification labels for (a subset of) documents
⇒ Identify spam by combining information contained in query logs and in web directories, and usage mining

main idea
- Query graphs: bipartite graphs between queries and documents
- Extract features from query graphs
- "Semantic" features obtained by propagating web-directory topic labels on the query graph
- Use the obtained features to improve the accuracy of spam detection
- Also characterize queries as spam-attracting
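The bipartite query graph described above can be sketched as two adjacency maps, one per side of the graph. This is a minimal illustration, not the authors' implementation: the function name `build_query_graph`, the `(query, document)` record layout, the use of event counts as edge weights, and the example URLs are all assumptions made for this sketch.

```python
from collections import defaultdict


def build_query_graph(records):
    """Build a weighted bipartite graph from (query, document) pairs.

    Assumption: each record is one logged event linking a query to a
    document, and the edge weight is the number of such events.
    """
    q_to_d = defaultdict(lambda: defaultdict(int))  # query -> {document: weight}
    d_to_q = defaultdict(lambda: defaultdict(int))  # document -> {query: weight}
    for query, doc in records:
        q_to_d[query][doc] += 1
        d_to_q[doc][query] += 1
    return q_to_d, d_to_q


# Hypothetical log records, for illustration only.
records = [
    ("cheap flights", "a.example/flights"),
    ("cheap flights", "b.example/deals"),
    ("free ringtones", "b.example/deals"),
    ("cheap flights", "a.example/flights"),
]
q_to_d, d_to_q = build_query_graph(records)
```

Keeping both directions makes it cheap to compute per-query and per-document features from the same structure.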

click graph, view graph, and anticlick graph

syntactic features
- Degree of a node (query or document)
- For document d: topQ_x(d), the set of queries adjacent to d that are among the fraction x of the most frequent queries in the query log
- For document d: topT_y(d), the set of query terms that compose the queries adjacent to d in G and that are among the fraction y of the most frequent terms in the query log
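A minimal sketch of the three syntactic features for one document, under stated assumptions: "the fraction x of the most frequent queries" is read here as the top x share of distinct queries ranked by frequency, terms are obtained by whitespace splitting, and a term's frequency is the summed frequency of the queries containing it. The helper names and the sample data are hypothetical.

```python
from collections import Counter


def top_fraction(counter, fraction):
    """Return the most frequent keys, keeping the given share of distinct keys."""
    k = int(len(counter) * fraction)
    return {key for key, _ in counter.most_common(k)}


def syntactic_features(doc, d_to_q, query_freq, x, y):
    """Compute degree, topQ_x(d) and topT_y(d) for a document d."""
    adjacent = set(d_to_q.get(doc, ()))

    # Term frequencies derived from query frequencies (assumed definition).
    term_freq = Counter()
    for query, count in query_freq.items():
        for term in query.split():
            term_freq[term] += count

    adjacent_terms = {term for query in adjacent for term in query.split()}
    return {
        "degree": len(adjacent),
        "topQ": adjacent & top_fraction(query_freq, x),
        "topT": adjacent_terms & top_fraction(term_freq, y),
    }


# Hypothetical data, for illustration only.
query_freq = Counter({
    "cheap flights": 50,
    "free ringtones": 30,
    "pisa tower": 5,
    "barcelona weather": 1,
})
d_to_q = {"b.example/deals": {"cheap flights": 3, "pisa tower": 1}}
features = syntactic_features("b.example/deals", d_to_q, query_freq, x=0.5, y=0.25)
```

In this example the document is adjacent to two queries, only one of which falls in the top half of the query log.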

topics
- Intuition: a multi-topic attractor has the potential of being spam
- Topic labels can be obtained from a web directory
- ...but not for all documents

propagation
- Read the result at each node as a distribution, and compute its entropy
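Reading a node's topic scores as a distribution and taking its entropy can be sketched as follows. This assumes Shannon entropy in bits and scores that are non-negative but not necessarily normalized; the function name and the topic labels are illustrative only.

```python
import math


def topic_entropy(scores):
    """Shannon entropy (in bits) of a node's topic scores, read as a distribution."""
    total = sum(scores.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for score in scores.values():
        if score > 0:
            p = score / total
            entropy -= p * math.log2(p)
    return entropy


single_topic = topic_entropy({"Arts": 1.0})
two_topics = topic_entropy({"Arts": 1.0, "Sports": 1.0})
```

A node concentrated on one topic has entropy 0, while a node spread evenly over many topics has high entropy, which matches the multi-topic-attractor intuition.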

propagation
- Propagation by weighted average, followed by normalization:

  $$\mathrm{score}^{i+1}_{v}(c) \mathrel{+}= \alpha^{i-1} \sum_{(v',v) \in E} \mathrm{score}^{i}_{v'}(c) \times f(v',v)$$

- Propagation by random walk, inspired by topic-sensitive PageRank
- "Semantic features": entropy of the distribution of topic scores (documents and queries)
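The weighted-average update on this slide can be sketched as a synchronous iteration over the graph. Several points are assumptions made for the sketch, not details given in the slides: f(v', v) is taken to be an edge weight, every undirected edge is used in both directions, scores are normalized per node to sum to 1 after each step, and the values of `alpha` and `iterations` are arbitrary.

```python
def propagate(edges, seed_scores, alpha=0.5, iterations=2):
    """Propagate topic scores over a graph by damped weighted averaging.

    edges: list of (u, v, weight) undirected edges; weight plays the role of f.
    seed_scores: {node: {topic: score}} for nodes labeled by the directory.
    """
    # Treat every undirected edge as two directed pairs (v', v).
    directed = []
    for u, v, weight in edges:
        directed.append((u, v, weight))
        directed.append((v, u, weight))

    scores = {node: dict(topics) for node, topics in seed_scores.items()}
    for i in range(1, iterations + 1):
        damping = alpha ** (i - 1)
        # Start from the current scores, then add the neighbors' contributions.
        updated = {node: dict(topics) for node, topics in scores.items()}
        for src, dst, weight in directed:
            for topic, value in scores.get(src, {}).items():
                target = updated.setdefault(dst, {})
                target[topic] = target.get(topic, 0.0) + damping * value * weight
        # Normalization: make each node's scores sum to 1.
        for topics in updated.values():
            total = sum(topics.values())
            if total > 0:
                for topic in topics:
                    topics[topic] /= total
        scores = updated
    return scores


# Hypothetical graph: query "q" links two labeled documents and one unlabeled one.
edges = [("q", "d1", 1.0), ("q", "d2", 1.0), ("q", "d3", 1.0)]
seeds = {"d1": {"Arts": 1.0}, "d2": {"Sports": 1.0}}
result = propagate(edges, seeds, alpha=0.5, iterations=2)
```

After two steps the unlabeled document `d3` inherits a mixed distribution through the query, while `d1` stays dominated by its original label.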

datasets
- Query log: sample of 1.6M queries from the Yahoo! query log
- Web directory: DMOZ, 4.2M documents
- Labeled spam collection: the WEBSPAM-UK2006 dataset

statistics on the query graphs

              Document-level               Host-level
              C_d      A_d      V_d        C_h      A_h      V_h
Queries       1.59M    0.75M    2.78M      1.59M    0.75M    2.78M
Docs/hosts    2.75M    1.31M    23.47M     0.83M    0.40M    3.08M
Edges         3.69M    1.67M    40.71M     3.50M    1.53M    3.45M
C_D(0)        0.05     0.08     0.03       0.28     0.35     0.15
C_Q(1)        0.18     0.24     0.39       0.58     0.75     0.92
C_D(2)        0.22     0.22     0.45       0.70     0.75     0.94
CC_max        0.32     0.19     0.92       0.80     0.83     0.98
|CC|          0.21     0.23     0.007      0.08     0.06     0.006

finding web spam

Feature set    Features   TP       FP      F1      AUC
Content (C)    98         75.8%    9.8%    0.692   0.912
Links (L)      139        84.2%    9.5%    0.739   0.939
Usage (U)      61         54.2%    7.4%    0.557   0.872
C ∪ L          237        83.9%    8.6%    0.756   0.952
C ∪ U          159        68.4%    6.6%    0.693   0.917
L ∪ U          200        78.5%    6.5%    0.757   0.951
C ∪ L ∪ U      298        78.9%    6.2%    0.765   0.951

finding spam-attracting queries
- Define the "spamicity" of a query: the fraction of spam results shown to the user
- Task 1: predict if query spamicity is "< 0.5" or "≥ 0.5"
  AUC: 0.798, true positive rate: 73.7%, false positives: 29.0%
- Task 2: predict if query spamicity is "= 0.5" or "≥ 0.5"
  AUC: 0.838, true positive rate: 74.0%, false positives: 22.1%
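The spamicity definition and the 0.5 threshold of the first task can be sketched directly. The function names, the representation of results as a list of document identifiers, and the set of known spam documents are assumptions made for this illustration.

```python
def spamicity(results_shown, spam_docs):
    """Fraction of the results shown for a query that are labeled spam."""
    if not results_shown:
        return 0.0
    spam_count = sum(1 for doc in results_shown if doc in spam_docs)
    return spam_count / len(results_shown)


def is_spam_attracting(results_shown, spam_docs, threshold=0.5):
    """Binary label for the first task: spamicity >= threshold."""
    return spamicity(results_shown, spam_docs) >= threshold


# Hypothetical result lists and spam labels, for illustration only.
spam_docs = {"a", "b"}
half_spam = spamicity(["a", "b", "c", "d"], spam_docs)
mostly_clean = spamicity(["a", "c", "d", "e"], spam_docs)
```

These labels are only the prediction target; the classifier itself would be trained on features of the query, such as those extracted from the query graph.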

summary
- Use query-log mining and DMOZ class labels for spam detection
- Detect spam that has already "fooled" the search engine
- The propagation method can be useful in other tasks, too
- Future: extract better features and improve the results

Thank you!
