Linked Latent Dirichlet Allocation in Web Spam Filtering


  1. Linked Latent Dirichlet Allocation in Web Spam Filtering
     István Bíró, Dávid Siklósi, Jácint Szabó, András A. Benczúr
     Data Mining and Web Search Group, Computer and Automation Research Institute, Hungarian Academy of Sciences
     AIRWeb Workshop, April 21, 2009, Madrid, Spain

  2. Latent Dirichlet Allocation
     - Blei, Ng, Jordan, 2003
     - fully generative statistical natural language model
     - extension of latent semantic indexing (LSI)
     - has better perplexity than LSI
     - a document is represented as a bag of words (no bigrams/trigrams are taken into account)
     - many extensions and variations of LDA have been developed and successfully applied

  3. The Latent Dirichlet Allocation Model
     - topic: a distribution over the words
     - document: a distribution over the topics
     - for every word position of the corpus, draw a topic from that document's topic distribution, and then draw a word from that topic's word distribution (see the sketch below)
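
A minimal sketch of this generative process in Python (not the authors' implementation), using the symmetric Dirichlet priors quoted later in the talk (α = 50/k, β = 200/|V|) and toy corpus sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, n_docs, doc_len = 30, 1000, 3, 50      # topics, vocabulary size, documents, words per document
alpha, beta = 50.0 / K, 200.0 / V            # symmetric Dirichlet priors from the parameter slide

phi = rng.dirichlet([beta] * V, size=K)      # each topic: a distribution over the words
for d in range(n_docs):
    theta = rng.dirichlet([alpha] * K)       # each document: a distribution over the topics
    doc = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)           # draw a topic for this word position
        doc.append(rng.choice(V, p=phi[z]))  # draw a word (index) from that topic
```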

  4. Latent Dirichlet Allocation in Practice
     - given a collection of documents
     - keep only semantic words: delete stopwords, stem
     - create the vocabulary
     - choose an appropriate number of topics (about 100)
     - run model inference to create the model (a pipeline sketch follows this list)
     - for a topic, the word distribution gives a semantic theme
     - for a document, the topic distribution describes to which themes it belongs
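
A minimal sketch of this pipeline using the gensim library; this only illustrates the steps listed above, not the implementation behind the experiments. The toy corpus and stopword list are made up, and stemming is omitted for brevity:

```python
from gensim import corpora, models

raw_docs = ["cheap pills buy cheap pills online",
            "latent topic models for large web corpora"]
stopwords = {"for", "online"}                                   # hypothetical stopword list
texts = [[w for w in doc.lower().split() if w not in stopwords]
         for doc in raw_docs]                                   # keep only semantic words

dictionary = corpora.Dictionary(texts)                          # create the vocabulary
bows = [dictionary.doc2bow(t) for t in texts]                   # bag-of-words representation

lda = models.LdaModel(bows, id2word=dictionary, num_topics=10)  # about 100 topics on a real corpus
print(lda.show_topic(0))    # word distribution of a topic: a semantic theme
print(lda[bows[0]])         # topic distribution of a document: which themes it belongs to
```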

  5. Related Link-Based Models
     - copycat and citation influence models (Dietz, Bickel, Scheffer 2007)
     - link-PLSA-LDA and pairwise-link-LDA (Nallapati, Ahmed, Xing, Cohen 2008)
     - these extend LDA over a bipartition of the corpus into citing and cited documents, such that influence flows along links from cited to citing documents
     - linked LDA is similar to the citation influence model
     - the main differences: in linked LDA there is no need for a citing and a cited copy of each document, and influence may flow along paths of length more than one

  6. Linked LDA
     - an extension of the LDA model that exploits links between documents
     - besides LDA's word and topic distributions, each document has an additional distribution over its out-neighbors (a hedged sketch of the generative process follows below)
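
A hedged sketch of how such a link distribution can enter the generative process, in the spirit of the citation influence model: the assumption here is that for each word position the model first draws an influencing document (the document itself or one of its out-neighbors) and then draws the topic from the influencing document's topic distribution. The exact model is specified in the paper; all names and sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
K, V = 30, 1000
docs = ["a", "b", "c"]
out_neighbors = {"a": ["a", "b", "c"], "b": ["b", "c"], "c": ["c"]}  # a document may influence itself

phi = rng.dirichlet([200.0 / V] * V, size=K)                          # topic -> word distributions
theta = {d: rng.dirichlet([50.0 / K] * K) for d in docs}              # document -> topic distributions
chi = {d: rng.dirichlet([1.0] * len(out_neighbors[d])) for d in docs} # document -> link distribution

def generate_word(d):
    nbrs = out_neighbors[d]
    r = nbrs[rng.choice(len(nbrs), p=chi[d])]   # first draw an influencing document
    z = rng.choice(K, p=theta[r])               # then a topic from the influencer's topic distribution
    return rng.choice(V, p=phi[z])              # then a word from that topic
```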

  7. Linked LDA: the smoothing parameter vector γ_d
     - γ_d(c) ∝ multiplicity of the d → c link
     - Σ_c γ_d(c) = document length / p
     - p is a normalization parameter (a sketch of γ_d follows below)
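
A minimal sketch of this definition with toy link multiplicities: the entries of γ_d are proportional to the multiplicities of the outgoing links and are scaled so that they sum to document length / p.

```python
from collections import Counter

def gamma(out_links, doc_length, p):
    """out_links: the targets of d's outgoing links, one entry per link (repeats = multiplicity)."""
    counts = Counter(out_links)                          # multiplicity of each d -> c link
    total = sum(counts.values())
    scale = doc_length / p                               # the entries must sum to doc_length / p
    return {c: scale * m / total for c, m in counts.items()}

print(gamma(["c1", "c1", "c2"], doc_length=300, p=4))    # {'c1': 50.0, 'c2': 25.0}, sum = 75 = 300/4
```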

  8. Experiments
     - linked LDA on WEBSPAM-UK2007, which is apparently primarily content spammed
     - ~115,000 sites (~6,000 labeled): ~4,000 in the train set, ~2,000 in the test set
     - document: the concatenation of all pages of a site
     - directed links are weighted by their multiplicity (maximum weight ~10)
     - the topic distribution of a site is used as features
     - classifiers: C4.5 on the public content and link features, SVM on tf.idf, BayesNet on the linked LDA features
     - combination by log-odds averaging (Lynam and Cormack); a sketch follows below
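
A minimal sketch of log-odds averaging as a combination scheme: each classifier's spam probability is converted to log-odds, the log-odds are averaged, and the average is mapped back to a probability. The probability values below are made up.

```python
import math

def log_odds_average(probs, eps=1e-6):
    logits = [math.log((p + eps) / (1.0 - p + eps)) for p in probs]
    avg = sum(logits) / len(logits)                  # average the log-odds
    return 1.0 / (1.0 + math.exp(-avg))              # map back to a probability

# e.g. spam probabilities from the C4.5, SVM and BayesNet classifiers for one site
print(log_odds_average([0.9, 0.6, 0.7]))
```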

  9. LDA Parameters
     - k: number of topics
     - p: normalization parameter
     - the Dirichlet parameter vector β is constant 200 / |V|, and α is constant 50 / k

               p = 1    p = 4    p = 10
     k = 30    0.768    0.784    0.783
     k = 90    0.764    0.777    0.773

     Table: AUC for linked LDA with various parameters, classified by BayesNet.

  10. Baseline Methods

      features                        AUC
      linked LDA with BayesNet        0.784
      LDA with BayesNet               0.766
      tf.idf with SVM                 0.795
      public (link) with C4.5         0.724
      public (content) with C4.5      0.782

      Table: AUC for the baseline methods.

  11. Combination

      features                                  AUC
      tf.idf & LDA                              0.827
      tf.idf & linked LDA                       0.831
      public & LDA                              0.820
      public & linked LDA                       0.829
      public & tf.idf                           0.827
      public & tf.idf & LDA                     0.845
      public & tf.idf & linked LDA              0.854
      public & tf.idf & LDA & linked LDA        0.854

      Table: AUC when combining the classifier outputs with a log-odds based random forest (see the sketch below). For linked LDA the parameters are p = 4, k = 30.
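
A hedged sketch of one way a log-odds based random forest combination can be set up: the base classifiers' predicted spam probabilities are converted to log-odds and used as features for a random forest (scikit-learn here). The feature values and labels are made up, and the exact setup behind the numbers above may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def to_log_odds(p, eps=1e-6):
    return np.log((p + eps) / (1.0 - p + eps))

# rows: sites; columns: spam probabilities from the public, tf.idf and linked LDA classifiers
train_probs = np.array([[0.9, 0.8, 0.7],
                        [0.2, 0.1, 0.3],
                        [0.6, 0.7, 0.4],
                        [0.1, 0.2, 0.2]])
train_labels = np.array([1, 0, 1, 0])            # 1 = spam, 0 = normal

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(to_log_odds(train_probs), train_labels)
print(rf.predict_proba(to_log_odds(np.array([[0.8, 0.6, 0.5]])))[:, 1])
```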

  12. Conclusion and Future Work
      - linked LDA slightly outperforms LDA
      - combining the tf.idf, public, and linked LDA features with a log-odds based random forest achieves an AUC of 0.854, beating the Web Spam Challenge 2008 winner (0.848)
      - future work: measure the quality of the inferred linked LDA edge weights by using them in a stacked graphical classification

  13. Questions?
      jacint@ilab.sztaki.hu, ibiro@ilab.sztaki.hu, sdavid@ilab.sztaki.hu
