Benjamin Markines, Ciro Cattuto, Filippo Menczer
ISI Foundation

• Beneficiaries
  • Spammer
  • Intermediary
• Non-beneficiaries
  • Systems: search engines, tagging
  • Information consumers
  • Original authors
• ?: Surfer, Advertisers

• Features
  • Folksonomy description
  • Feature descriptions
    • post level
    • resource level
    • user level
  • Feature analysis
• Dataset description
• Social spam detection

Folksonomy: $F = (U, T, R, Y)$, $Y \subseteq U \times T \times R$ (the triples)
[Figure: example folksonomy with users alice, bob; tags web, tech, news; resources www2009.org, wired.com, cnn.com]
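A minimal sketch of how the folksonomy triples could be held in memory and grouped into posts. The specific tag assignments below are hypothetical, since the figure's edges are not recoverable from the slide text, and the helper name `group_posts` is an illustrative choice rather than anything from the talk.

```python
from collections import defaultdict

# Hypothetical triples (user, tag, resource); the figure's actual
# tag assignments are not recoverable from the slide text.
Y = {
    ("alice", "web", "www2009.org"),
    ("alice", "tech", "wired.com"),
    ("bob", "news", "cnn.com"),
    ("bob", "tech", "wired.com"),
    ("bob", "web", "wired.com"),
}


def group_posts(triples):
    """Map each post (u, r) to its tag set T(u, r)."""
    tags_of = defaultdict(set)
    for user, tag, resource in triples:
        tags_of[(user, resource)].add(tag)
    return dict(tags_of)


# The three node sets of the folksonomy, derived from the triples.
U = {u for u, _, _ in Y}
T = {t for _, t, _ in Y}
R = {r for _, _, r in Y}

posts = group_posts(Y)
```

Here `posts[("bob", "wired.com")]` plays the role of $T(u, r)$, the tag set of one post.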

• TagSpam
  $f_{TagSpam}(u, r) = \frac{1}{|T(u,r)|} \sum_{t \in T(u,r)} \Pr(t)$
  where $T(u,r)$ is the set of tags in the post of user $u$ on resource $r$, and $\Pr(t)$ is the spam probability of tag $t$.
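TagSpam averages the per-tag spam probability over the tags of one post. The sketch below assumes the probabilities are already available as a dictionary; treating unseen tags as probability 0 and empty posts as score 0 are assumptions made here, not choices stated on the slide.

```python
def tag_spam(post_tags, spam_prob):
    """Average Pr(t) over the tags T(u, r) of a single post.

    spam_prob maps tag -> Pr(t). Tags missing from the map count as 0.0
    (an assumption for this sketch).
    """
    tags = set(post_tags)
    if not tags:
        return 0.0  # assumption: a post without tags scores 0
    return sum(spam_prob.get(t, 0.0) for t in tags) / len(tags)


# Hypothetical probabilities, for illustration only.
example_probs = {"cheap-pills": 0.9, "web": 0.1}
score = tag_spam({"cheap-pills", "web"}, example_probs)
```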

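• TagBlur
  $f_{TagBlur}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u,r)} \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon}$
  where $Z$ is the number of tag pairs in the post and $\sigma(t_1, t_2)$ is the tag similarity.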
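TagBlur is high when the tags of a post are mutually dissimilar. A rough sketch follows, assuming a similarity function with values in [0, 1] is supplied by the caller; the default `eps` value and the score of 0 for posts with fewer than two tags are assumptions of this example.

```python
from itertools import combinations


def tag_blur(post_tags, sim, eps=0.01):
    """Mean inverse similarity over tag pairs, shifted so that a post
    whose tags are all maximally similar (sim == 1) scores 0.

    sim(t1, t2) is any tag similarity in [0, 1]; eps avoids division
    by zero. The default eps is a placeholder, not a value from the talk.
    """
    pairs = list(combinations(sorted(set(post_tags)), 2))
    if not pairs:
        return 0.0  # assumption: fewer than two tags means no blur
    total = sum(1.0 / (sim(a, b) + eps) for a, b in pairs)
    # len(pairs) is the normalizer Z (number of tag pairs).
    return total / len(pairs) - 1.0 / (1.0 + eps)
```

With a similarity of 1 for every pair the score is 0, and it grows as pairs become less similar, up to $1/\epsilon - 1/(1+\epsilon)$ when every pair has similarity 0.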

[Figure: labels random, τ = 10^-4, WWW 2009, HT 2009]

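• DomFp
  $f_{DomFp}(r) = \frac{\sum_{k \in K} \sigma(k(r), k) \cdot \Pr(k)}{\sum_{k \in K} \sigma(k(r), k)}$
  where $k(r)$ is the DOM shingle fingerprint of resource $r$, $K$ is the set of known fingerprints, $\sigma$ is the fingerprint similarity, and $\Pr(k)$ is the spam probability of fingerprint $k$.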
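DomFp is a similarity-weighted average of the spam probabilities of known fingerprints. The sketch below is one possible reading: it assumes a fingerprint is the sequence of opening HTML tags on a page and that similarity is the Jaccard overlap of length-3 shingles. Both are illustrative assumptions; the slide does not specify them.

```python
from html.parser import HTMLParser


class _TagSequence(HTMLParser):
    """Collects opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def dom_fingerprint(html):
    """Assumed fingerprint: tuple of opening tag names, content ignored."""
    parser = _TagSequence()
    parser.feed(html)
    parser.close()
    return tuple(parser.tags)


def shingles(seq, width=3):
    """All contiguous windows of `width` items (whole seq if shorter)."""
    if len(seq) < width:
        return {tuple(seq)} if seq else set()
    return {tuple(seq[i:i + width]) for i in range(len(seq) - width + 1)}


def similarity(fp_a, fp_b):
    """Jaccard overlap of shingle sets (an assumed choice of sigma)."""
    a, b = shingles(fp_a), shingles(fp_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def dom_fp(page_fp, known):
    """known maps fingerprint k -> Pr(k); returns the weighted average."""
    num = den = 0.0
    for k, prob in known.items():
        s = similarity(page_fp, k)
        num += s * prob
        den += s
    return num / den if den else 0.0  # assumption: no match scores 0


# Hypothetical labeled fingerprints, for illustration only.
spam_fp = dom_fingerprint("<html><body><div><a></a><a></a></div></body></html>")
ham_fp = dom_fingerprint(
    "<html><head><title>x</title></head><body><p>hi</p></body></html>"
)
known = {spam_fp: 1.0, ham_fp: 0.0}
```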

• NumAds
  $f_{NumAds}(r) = g(r) / g_{max}$
  where $g(r)$ is the number of ads on resource $r$.

• Plagiarism
  $f_{Plagiarism}(r) = y(r) / y_{max}$
  where $y(r)$ is the number of more authoritative sources matching an exact phrase.

• ValidLinks
  $f_{ValidLinks}(u) = \frac{|V_u|}{|R_u|}$
  where $|V_u|$ is the number of valid links and $|R_u|$ is the total number of links of user $u$.
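ValidLinks is the fraction of a user's links that still resolve. The sketch below takes the validity check as a parameter so the ratio can be computed without network access; the default checker issues one HEAD request per URL with the standard library, and treating any 2xx or 3xx status as valid is an assumption of this example.

```python
import urllib.error
import urllib.request


def head_ok(url, timeout=5.0):
    """Return True if a HEAD request succeeds (assumed notion of 'valid')."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        return False


def valid_links(user_resources, is_valid=head_ok):
    """Fraction |V_u| / |R_u| of a user's resources that pass is_valid."""
    resources = set(user_resources)
    if not resources:
        return 0.0  # assumption: a user with no links scores 0
    valid = {r for r in resources if is_valid(r)}
    return len(valid) / len(resources)
```

For offline experiments a stub can stand in for the network check, for example `valid_links(urls, is_valid=lambda u: u in known_good)`, where `known_good` is a hypothetical set of URLs.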

• BibSonomy.org
• Spam is labeled at user level
• Aggregate for user level:
  $f(u) = \frac{1}{|P(u)|} \sum_{(u,r) \in P(u)} f(u, r)$
  where $P(u)$ is the set of posts of user $u$.
• Sampled 1000 users
  • 500 spammers
• 500 users each in the training set and the test set
  • 250 spammers
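Because labels exist only per user, post-level scores are averaged over each user's posts. A small sketch of that aggregation follows; the input layout (a dictionary keyed by `(user, resource)`) is an assumption of this example.

```python
from collections import defaultdict


def aggregate_by_user(post_scores):
    """Average post-level scores f(u, r) into user-level scores f(u).

    post_scores maps (user, resource) -> score for that post.
    """
    per_user = defaultdict(list)
    for (user, _resource), score in post_scores.items():
        per_user[user].append(score)
    return {user: sum(s) / len(s) for user, s in per_user.items()}


# Hypothetical post-level scores, for illustration only.
example = {
    ("alice", "www2009.org"): 0.0,
    ("alice", "wired.com"): 0.5,
    ("bob", "cnn.com"): 1.0,
}
user_scores = aggregate_by_user(example)
```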

[Figure: ROC curves, true positive rate (tp) vs. false positive rate (fp), for TagSpam, TagBlur, DomFp, ValidLinks, NumAds, Plagiarism]

                linear SVM               AdaBoost
Features        Accuracy  FP    F1       Accuracy  FP    F1
TagSpam         95.82%    .061  .957     94.66%    .048  .943
+ TagBlur       96.75%    .048  .966     96.06%    .044  .958
+ DomFp         96.75%    .048  .966     96.06%    .044  .958
+ ValidLinks    96.52%    .048  .964     96.75%    .026  .965
+ NumAds        96.52%    .048  .964     97.22%    .026  .970
+ Plagiarism    96.75%    .048  .966     98.38%    .022  .983
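For readers less familiar with the three reported columns, the sketch below shows how accuracy, false-positive rate (FP), and F1 follow from confusion-matrix counts. It assumes spam is the positive class; the counts in the example are hypothetical and are not taken from the experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, false-positive rate, and F1 from confusion counts.

    Assumes 'spammer' is the positive class, so fp counts legitimate
    users wrongly flagged as spammers.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, fp_rate, f1


# Hypothetical counts, for illustration only.
accuracy, fp_rate, f1 = classification_metrics(tp=8, fp=2, tn=8, fn=2)
```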

[Figure: percent correctly classified (94 to 99) vs. number of features (1 to 6), for linear SVM and AdaBoost]

• Web/Email Spam
  • Attenberg and Suel 2008, Gyöngyi et al. 2004
• Social Spam
  • Heymann et al. 2007
• Spam Detection
  • Krause et al. 2008, Caverlee et al. 2008, Benevenuto et al. 2008, Koutrika et al. 2007
• ECML PKDD Discovery Challenge 2008
  • Held by BibSonomy team
  • Gkanogiannis and Kalamboukis 2008, Chevalier and Gramme 2008, Kim and Hwang 2008

• Identified/analyzed 6 features for spam detection
• TagSpam alone achieves 0.99 ROC AUC, outperforming ECML PKDD Discovery Challenge 2008
• Accuracy over 98% with AdaBoost
  • False-positive rate: 0.022
• These results set the state of the art
  • could improve by combining with other features, e.g. Krause et al. 2008
• Limitations
  • Efficiency issues
  • Bootstrap issues

TagSpam
  • Depends on a set of labeled tags
TagBlur
  • Depends on a notion of similarity/distances
  • Assumes spam does not dominate the folksonomy, affecting distances
DomFp
  • Depends on a set of labeled fingerprints
  • Requires page download
NumAds
  • Requires page download
Plagiarism
  • Requires page download
  • Search engine cooperation
ValidLinks
  • HEAD request per resource

Benjamin Markines, Ciro Cattuto, Filippo Menczer
BibSonomy Team, http://www.bibsonomy.org
ISI Foundation
