Benjamin Markines, Ciro Cattuto, Filippo Menczer
ISI Foundation
[Diagram labels: Beneficiaries (Spammer, Intermediary); Non-beneficiaries (Systems: search engines, tagging; Information consumers; Original authors); ? (Surfer, Advertisers)]
Features; Folksonomy description; Feature descriptions (post level, resource level, user level); Feature analysis; Dataset description; Social spam detection
Folksonomy example: tags web, tech, news; users alice, bob; resources www2009.org, wired.com, cnn.com
F = (U, T, R, Y), Y ⊆ U × T × R (the triples)
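To make the triple representation concrete, here is a minimal sketch of a folksonomy stored as a set of (user, tag, resource) triples and grouped into posts. The particular triples and the helper name `build_posts` are assumptions made up for illustration; only the user, tag, and resource names come from the slide.

```python
from collections import defaultdict

# Hypothetical toy triples (user, tag, resource); which user assigned
# which tag to which resource is invented for this example.
Y = {
    ("alice", "web", "www2009.org"),
    ("alice", "tech", "wired.com"),
    ("alice", "news", "cnn.com"),
    ("bob", "tech", "wired.com"),
    ("bob", "news", "wired.com"),
}

def build_posts(triples):
    """Group triples into posts: (user, resource) -> set of tags T(u, r)."""
    posts = defaultdict(set)
    for user, tag, resource in triples:
        posts[(user, resource)].add(tag)
    return dict(posts)

posts = build_posts(Y)

# The sets U, T, R are the projections of the triples.
U = {u for u, _, _ in Y}
T = {t for _, t, _ in Y}
R = {r for _, _, r in Y}
```

Grouping by (user, resource) gives direct access to the tag set of each post, which is the unit the post-level features operate on.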
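TagSpam: average spam probability of the tags in the post by user u on resource r
\[ f_{\mathrm{TagSpam}}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} \Pr(t) \]
where T(u, r) is the set of tags in the post and Pr(t) is the spam probability of tag t.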
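The TagSpam score is a plain average, which a short sketch can make explicit. The tag probabilities below are hypothetical; in practice they would have to be estimated from a set of labeled tags. The `default` value used for unseen tags is also an assumption, not something the slides specify.

```python
def tag_spam(post_tags, tag_spam_prob, default=0.0):
    """Mean spam probability Pr(t) over the tags T(u, r) of one post.

    tag_spam_prob maps tag -> Pr(t). Tags missing from the map get
    `default` (an assumption for this sketch).
    """
    tags = set(post_tags)
    if not tags:
        return 0.0  # an empty post carries no tag evidence
    return sum(tag_spam_prob.get(t, default) for t in tags) / len(tags)

# Hypothetical per-tag spam probabilities.
pr = {"web": 0.1, "news": 0.2, "pills": 0.9}

score_clean = tag_spam({"web", "news"}, pr)
score_mixed = tag_spam({"web", "pills"}, pr)
```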
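TagBlur
\[ f_{\mathrm{TagBlur}}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \left[ \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon} \right] \]
where Z is the number of tag pairs and σ(t_1, t_2) is the tag similarity.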
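A minimal sketch of the TagBlur computation follows, assuming a symmetric similarity σ with values in [0, 1], so that averaging over unordered pairs equals averaging over ordered ones. The similarity table, the value of ε, and the choice to return 0 for posts with fewer than two tags are all assumptions for illustration.

```python
from itertools import combinations

def tag_blur(post_tags, similarity, eps=0.1):
    """Average of 1/(sigma + eps) - 1/(1 + eps) over the tag pairs of a post.

    Each pair term is 0 when the two tags are maximally similar
    (sigma = 1) and grows as they become unrelated.
    """
    pairs = list(combinations(sorted(set(post_tags)), 2))
    if not pairs:
        return 0.0  # assumption: a post with fewer than two tags is not blurry
    total = sum(
        1.0 / (similarity(a, b) + eps) - 1.0 / (1.0 + eps)
        for a, b in pairs
    )
    return total / len(pairs)

# Hypothetical symmetric similarity table; unknown pairs get 0.
toy_sim = {frozenset(("web", "tech")): 1.0}

def sim(a, b):
    return toy_sim.get(frozenset((a, b)), 0.0)

focused = tag_blur({"web", "tech"}, sim)    # one pair, sigma = 1
blurry = tag_blur({"web", "pills"}, sim)    # one pair, sigma = 0
```

The subtraction of 1/(1 + ε) only shifts the scale so that a perfectly coherent post scores 0.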
[Figure labels: random; τ = 10^-4; WWW 2009; HT 2009]
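DomFp: DOM shingle fingerprint
\[ f_{\mathrm{DomFp}}(r) = \frac{\sum_{k \in K} \sigma(k(r), k) \cdot \Pr(k)}{\sum_{k \in K} \sigma(k(r), k)} \]
where k(r) is the fingerprint of resource r, σ is the similarity between fingerprints, and Pr(k) is the spam probability of fingerprint k.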
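The DomFp score is a similarity-weighted average of the spam probabilities of known fingerprints. The sketch below assumes a fingerprint is the set of w-shingles over a page's sequence of DOM tag names and that σ is Jaccard similarity; both choices, the shingle width, and the toy pages are assumptions for illustration rather than the authors' exact procedure.

```python
def shingles(tag_sequence, w=3):
    """Set of w-length windows over a sequence of DOM tag names."""
    seq = list(tag_sequence)
    if len(seq) < w:
        return {tuple(seq)} if seq else set()
    return {tuple(seq[i:i + w]) for i in range(len(seq) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dom_fp(page_fp, known):
    """Similarity-weighted mean of Pr(k) over known fingerprints.

    known: iterable of (fingerprint, spam_probability) pairs.
    """
    num = den = 0.0
    for fp, spam_prob in known:
        s = jaccard(page_fp, fp)
        num += s * spam_prob
        den += s
    return num / den if den > 0 else 0.0  # no similar fingerprint: no evidence

# Hypothetical labeled fingerprints.
spam_fp = shingles(["html", "body", "div", "a"])
ham_fp = shingles(["html", "body", "div", "p"])
known = [(spam_fp, 0.9), (ham_fp, 0.1)]

score = dom_fp(shingles(["html", "body", "div", "a"]), known)
```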
NumAds: number of ads
\[ f_{\mathrm{NumAds}}(r) = \frac{g(r)}{g_{\max}} \]
Plagiarism: number of more authoritative sources matching an exact phrase
\[ f_{\mathrm{Plagiarism}}(r) = \frac{y(r)}{y_{\max}} \]
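NumAds and Plagiarism share the same shape: a raw count per resource divided by the maximum count. A minimal sketch of that normalization is below. The counts are hypothetical, and taking the maximum over the observed resources is an assumption about how the normalizing constant is obtained.

```python
def max_normalize(counts):
    """Map resource -> count / max count, giving scores in [0, 1]."""
    c_max = max(counts.values(), default=0)
    if c_max == 0:
        return {r: 0.0 for r in counts}  # nothing to normalize against
    return {r: c / c_max for r, c in counts.items()}

# Hypothetical ad counts g(r) per resource.
ad_counts = {"www2009.org": 0, "wired.com": 2, "cheap-pills.example": 4}
num_ads = max_normalize(ad_counts)
```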
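ValidLinks: number of valid links over total number of links
\[ f_{\mathrm{ValidLinks}}(u) = \frac{|V_u|}{|R_u|} \]
where V_u is the set of valid links and R_u the set of all links posted by user u.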
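One way to compute the ValidLinks ratio is to issue an HTTP HEAD request per link and count the ones that succeed. The sketch below assumes that "valid" means the request completes with a 2xx or 3xx status, which is an interpretation rather than a definition taken from the slides. The link checker is passed in as a parameter so the ratio can be exercised without network access.

```python
import urllib.request

def head_ok(url, timeout=5.0):
    """Return True if a HEAD request to url completes with status < 400."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; ValueError covers bad URLs.
        return False

def valid_links(user_urls, is_valid=head_ok):
    """Fraction |V_u| / |R_u| of a user's links that pass the validity check."""
    urls = list(user_urls)
    if not urls:
        return 0.0
    return sum(1 for u in urls if is_valid(u)) / len(urls)

# Offline usage with a stand-in checker (hypothetical set of live URLs).
live = {"http://a.example", "http://b.example", "http://c.example"}
ratio = valid_links(
    ["http://a.example", "http://b.example", "http://c.example", "http://dead.example"],
    is_valid=lambda u: u in live,
)
```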
Dataset: BibSonomy.org
Spam is labeled at user level
Aggregate post-level features for user level:
\[ f(u) = \frac{1}{|P(u)|} \sum_{(u, r) \in P(u)} f(u, r) \]
where P(u) is the set of posts by user u
Sampled 1000 users (500 spammers)
500 users in training set/test set (250 spammers)
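Because labels exist only per user, post-level scores have to be averaged over each user's posts. A minimal sketch of that aggregation, with hypothetical scores and a made-up helper name:

```python
from collections import defaultdict

def aggregate_by_user(post_scores):
    """Average post-level scores f(u, r) over each user's posts P(u).

    post_scores maps (user, resource) -> f(u, r).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (user, _resource), score in post_scores.items():
        totals[user] += score
        counts[user] += 1
    return {user: totals[user] / counts[user] for user in totals}

# Hypothetical post-level scores.
post_scores = {
    ("alice", "www2009.org"): 0.1,
    ("alice", "wired.com"): 0.3,
    ("bob", "cheap-pills.example"): 0.9,
}
user_scores = aggregate_by_user(post_scores)
```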
[Figure: bar chart comparing TagSpam, TagBlur, DomFp, ValidLinks, NumAds, and Plagiarism under two measures: chi-squared (axis 0 to 450) and Pearson correlation (axis 0 to 1)]
[Figure: ROC curves, true positive rate (tp) vs. false positive rate (fp), for TagSpam, TagBlur, DomFp, ValidLinks, NumAds, and Plagiarism]
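A ROC curve for a single feature can be summarized by its area under the curve (AUC). As a hedged illustration, the sketch below computes AUC through its rank interpretation: the probability that a randomly chosen spammer scores higher than a randomly chosen legitimate user, with ties counted as one half. The scores and labels are hypothetical, and this is not necessarily how the authors computed their figures.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for spammer, 0 for legitimate user. Ties count 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical user-level feature scores and labels.
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise loop is quadratic in the number of users; it is fine for a sample of a few hundred users but a sort-based version would scale better.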
[Figure: bar chart of a per-feature score on a 0 to 1 scale for TagSpam, TagBlur, DomFp, ValidLinks, NumAds, and Plagiarism]
                 linear SVM                   AdaBoost
Features         Accuracy   FP     F1         Accuracy   FP     F1
TagSpam          95.82%     .061   .957       94.66%     .048   .943
+ TagBlur        96.75%     .048   .966       96.06%     .044   .958
+ DomFp          96.75%     .048   .966       96.06%     .044   .958
+ ValidLinks     96.52%     .048   .964       96.75%     .026   .965
+ NumAds         96.52%     .048   .964       97.22%     .026   .970
+ Plagiarism     96.75%     .048   .966       98.38%     .022   .983
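For reference, the three reported metrics can be derived from a confusion matrix as sketched below, treating spam as the positive class and reading FP as the false-positive rate (consistent with the conclusions slide, though the table itself does not spell this out). The labels and predictions are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Accuracy, false-positive rate, and F1 with spam (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "fp_rate": fp_rate, "f1": f1}

# Hypothetical labels and classifier outputs.
metrics = evaluate(
    y_true=[1, 1, 1, 0, 0, 0, 0, 1],
    y_pred=[1, 1, 0, 0, 0, 1, 0, 1],
)
```

In a spam filter the false-positive rate deserves separate attention because it measures how many legitimate users would be wrongly flagged.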
[Figure: percent correctly classified (94 to 99) vs. number of features (1 to 6), for linear SVM and AdaBoost]
Web/Email Spam: Attenberg and Suel 2008, Gyöngyi et al. 2004
Social Spam: Heymann et al. 2007
Spam Detection: Krause et al. 2008, Caverlee et al. 2008, Benevenuto et al. 2008, Koutrika et al. 2007
ECML PKDD Discovery Challenge 2008 (held by BibSonomy team): Gkanogiannis and Kalamboukis 2008, Chevalier and Gramme 2008, Kim and Hwang 2008
Identified and analyzed 6 features for spam detection
TagSpam alone achieves 0.99 ROC AUC, outperforming ECML PKDD Discovery Challenge 2008
Accuracy over 98% with AdaBoost
False-positive rate: 0.022
These results set the state of the art, and could improve further by combining with other features, e.g. Krause et al. 2008
Limitations: efficiency issues, bootstrap issues
TagSpam: depends on a set of labeled tags
TagBlur: depends on a notion of similarity/distances; assumes spam does not dominate the folksonomy, affecting distances
DomFp: depends on a set of labeled fingerprints; requires page download
NumAds: requires page download
Plagiarism: requires page download; search engine cooperation
ValidLinks: HEAD request per resource
Benjamin Markines, Ciro Cattuto, Filippo Menczer
BibSonomy Team, http://www.bibsonomy.org
ISI Foundation