Benjamin Markines, Ciro Cattuto, Filippo Menczer; ISI Foundation

  1. Benjamin Markines, Ciro Cattuto, Filippo Menczer; ISI Foundation

  2. Who is affected by spam:
      - Beneficiaries: Spammer, Intermediary
      - Non-beneficiaries: Systems (search engines, tagging), Information consumers, Original authors
      - ?: Surfer, Advertisers

  3. Outline:
      - Features
      - Folksonomy description
      - Feature descriptions: post level, resource level, user level
      - Feature analysis
      - Dataset description
      - Social spam detection

  4. Folksonomy (figure): users alice, bob; tags web, tech, news; resources www2009.org, wired.com, cnn.com. $F = (U, T, R, Y)$, with $Y \subseteq U \times T \times R$ (the triples)
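
A minimal sketch of how the triples Y could be stored and grouped into posts, assuming U are users, T are tags and R are resources. The specific (user, tag, resource) assignments below are invented for illustration, since the figure's edges did not survive extraction, and `post_tags` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical triples (user, tag, resource). Which user applied which tag
# to which resource is NOT given in the slide; these rows are made up.
Y = {
    ("alice", "web", "www2009.org"),
    ("alice", "tech", "wired.com"),
    ("bob", "tech", "wired.com"),
    ("bob", "news", "wired.com"),
    ("bob", "news", "cnn.com"),
}


def post_tags(triples):
    """Group triples into posts: (user, resource) -> set of tags."""
    posts = defaultdict(set)
    for user, tag, resource in triples:
        posts[(user, resource)].add(tag)
    return dict(posts)


posts = post_tags(Y)
U = {u for u, _, _ in Y}  # users
T = {t for _, t, _ in Y}  # tags
R = {r for _, _, r in Y}  # resources
```

Grouping by (user, resource) gives the per-post tag sets that the post-level features operate on.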

  5. TagSpam: $f_{\mathrm{TagSpam}}(u, r) = \frac{1}{|T(u,r)|} \sum_{t \in T(u,r)} \Pr(t)$, where $u$ is the user, $r$ the resource, $T(u,r)$ the post tags, and $\Pr(t)$ the spam probability of tag $t$
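
TagSpam is the mean spam probability over the tags of a post. A minimal sketch follows, assuming the per-tag probabilities Pr(t) have already been estimated from labeled data and are passed in as a dictionary. The function name, the toy probabilities, and the choice to score unseen tags as 0.0 are illustrative assumptions, not something the slides specify.

```python
def tag_spam(tags, tag_spam_prob, default=0.0):
    """Mean of Pr(t) over the tags T(u, r) of a single post.

    tags: iterable of tags in the post.
    tag_spam_prob: dict tag -> Pr(t), estimated elsewhere from labeled tags.
    default: score for tags without an estimate (an assumption of this sketch).
    """
    tags = set(tags)
    if not tags:
        return default
    return sum(tag_spam_prob.get(t, default) for t in tags) / len(tags)


# Toy probabilities, invented for illustration only.
pr = {"cheap-pills": 0.9, "web": 0.1}
score = tag_spam({"cheap-pills", "web"}, pr)
```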

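  6. TagBlur: $f_{\mathrm{TagBlur}}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u,r)} \left[ \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon} \right]$, where $Z$ is the number of tag pairs and $\sigma$ is the tag similarity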
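
TagBlur averages, over the tag pairs of a post, the term 1/(σ(t1, t2) + ε) − 1/(1 + ε), so a post whose tags are mutually dissimilar scores high. The sketch below assumes a symmetric similarity σ with values in [0, 1] supplied by the caller, iterates over unordered pairs (for a symmetric σ the mean is the same as over ordered pairs), and uses ε = 0.01. The value of ε and the zero score for posts with fewer than two tags are assumptions of this sketch.

```python
from itertools import combinations


def tag_blur(tags, sim, eps=0.01):
    """Mean of 1/(sim + eps) - 1/(1 + eps) over unordered tag pairs.

    tags: iterable of tags in the post T(u, r).
    sim: callable (t1, t2) -> similarity in [0, 1], assumed symmetric.
    eps: small constant that keeps the first term finite when sim is 0.
    """
    pairs = list(combinations(sorted(set(tags)), 2))
    if not pairs:  # fewer than two distinct tags: no pair to compare
        return 0.0
    total = sum(1.0 / (sim(a, b) + eps) - 1.0 / (1.0 + eps) for a, b in pairs)
    return total / len(pairs)


# Identical-meaning tags (similarity 1) give 0; unrelated tags give the maximum.
focused = tag_blur({"web", "www", "internet"}, lambda a, b: 1.0)
blurred = tag_blur({"web", "loans", "pills"}, lambda a, b: 0.0)
```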

  7. (Figure; surviving labels: random, $\tau = 10^{-4}$, WWW 2009, HT 2009)

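  8. DomFp: $f_{\mathrm{DomFp}}(r) = \frac{\sum_{k \in K} \sigma(k(r), k) \cdot \Pr(k)}{\sum_{k \in K} \sigma(k(r), k)}$, where $k(r)$ is the DOM shingle fingerprint of resource $r$, $\sigma$ is the fingerprint similarity, and $\Pr(k)$ is the spam probability of fingerprint $k$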
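
DomFp is a similarity-weighted average of the spam probabilities of known fingerprints. The sketch below assumes a fingerprint is the set of w-length shingles over a page's sequence of DOM tag names and that σ is Jaccard similarity. Both are stand-ins chosen for illustration, since the slide does not say how fingerprints are built or compared, and the tag sequences are invented.

```python
def shingles(tag_sequence, w=3):
    """Set of w-length windows over a sequence of DOM tag names."""
    seq = tuple(tag_sequence)
    if len(seq) < w:
        return {seq} if seq else set()
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def dom_fp(fingerprint, labeled, sim=jaccard):
    """Similarity-weighted mean of Pr(k) over labeled fingerprints K.

    labeled: list of (fingerprint k, Pr(k)) pairs.
    Returns 0.0 when the page resembles no labeled fingerprint (assumption).
    """
    weighted = [(sim(fingerprint, k), p) for k, p in labeled]
    denom = sum(s for s, _ in weighted)
    if denom == 0:
        return 0.0
    return sum(s * p for s, p in weighted) / denom


# Invented tag sequences: a link-heavy page and a text-heavy page.
spam_fp = shingles(["html", "body", "div", "a", "a", "a"])
ham_fp = shingles(["html", "body", "p", "p", "p"])
K = [(spam_fp, 1.0), (ham_fp, 0.0)]
mixed = dom_fp(shingles(["html", "body", "div", "a", "a", "p"]), K)
```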

  9. NumAds: $f_{\mathrm{NumAds}}(r) = g(r) / g_{\max}$, where $g(r)$ is the number of ads on resource $r$

  10. Plagiarism: $f_{\mathrm{Plagiarism}}(r) = y(r) / y_{\max}$, where $y(r)$ is the number of more authoritative sources matching an exact phrase from resource $r$
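
NumAds and Plagiarism are both raw counts divided by a maximum (g(r)/g_max and y(r)/y_max). A minimal sketch of that normalization follows, assuming the maximum is taken over the resources in the dataset at hand. The slides do not say how the maxima are chosen, and the counts below are invented.

```python
def max_normalize(counts):
    """Divide each count by the largest count, giving scores in [0, 1].

    counts: dict resource -> non-negative count (e.g. ads found on the page).
    If every count is 0, all scores are 0.0 (assumption of this sketch).
    """
    largest = max(counts.values(), default=0)
    if largest == 0:
        return {r: 0.0 for r in counts}
    return {r: c / largest for r, c in counts.items()}


# Invented ad counts per resource.
num_ads = max_normalize({"a.example": 0, "b.example": 5, "c.example": 10})
```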

  11. ValidLinks: $f_{\mathrm{ValidLinks}}(u) = |V_u| / |R_u|$, the number of valid links over the total number of links posted by user $u$

  12. BibSonomy.org
      - Spam is labeled at user level
      - Aggregate for user level: $f(u) = \frac{1}{|P(u)|} \sum_{(u,r) \in P(u)} f(u, r)$, where $P(u)$ is the set of posts of user $u$
      - Sampled 1000 users, 500 spammers
      - 500 users in training set/test set, 250 spammers
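
Because labels exist only per user, a post-level feature f(u, r) is averaged over each user's posts to obtain f(u). A minimal sketch of that aggregation, with invented scores and a hypothetical function name:

```python
from collections import defaultdict


def user_level(post_scores):
    """Average a post-level feature over each user's posts.

    post_scores: dict (user, resource) -> f(u, r).
    Returns dict user -> f(u), the mean over that user's posts P(u).
    """
    by_user = defaultdict(list)
    for (user, _resource), score in post_scores.items():
        by_user[user].append(score)
    return {user: sum(scores) / len(scores) for user, scores in by_user.items()}


# Invented post-level scores for two users.
f_u = user_level({
    ("alice", "www2009.org"): 0.2,
    ("alice", "wired.com"): 0.4,
    ("bob", "cnn.com"): 1.0,
})
```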

  13. (Figure; text garbled in extraction, labels not recoverable)

  14. (Figure: ROC curves, true positive rate (tp) versus false positive rate (fp), for TagSpam, TagBlur, DomFp, ValidLinks, NumAds, Plagiarism)
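
A ROC curve like the ones on this slide can be traced for a single feature by sweeping a threshold over the feature's user-level scores and recording the false positive and true positive rates. The sketch below also computes the area under the curve with the trapezoid rule. The scores and labels are invented, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """(fp rate, tp rate) points as the threshold sweeps from high to low.

    scores: feature values, higher meaning more spam-like.
    labels: True for spammers, False for legitimate users.
    Users with tied scores are added together, giving one point per score.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one spammer and one legitimate user")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = i = 0
    while i < len(ranked):
        current = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoid rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented scores: one spammer is ranked below one legitimate user.
curve = roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
area = auc(curve)
```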

  15. (Figure; text garbled in extraction, labels not recoverable)

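  16. Accuracy, false-positive rate (FP), and F1 as features are added cumulatively:

      | Features     | linear SVM Accuracy | linear SVM FP | linear SVM F1 | AdaBoost Accuracy | AdaBoost FP | AdaBoost F1 |
      |--------------|---------------------|---------------|---------------|-------------------|-------------|-------------|
      | TagSpam      | 95.82%              | .061          | .957          | 94.66%            | .048        | .943        |
      | + TagBlur    | 96.75%              | .048          | .966          | 96.06%            | .044        | .958        |
      | + DomFp      | 96.75%              | .048          | .966          | 96.06%            | .044        | .958        |
      | + ValidLinks | 96.52%              | .048          | .964          | 96.75%            | .026        | .965        |
      | + NumAds     | 96.52%              | .048          | .964          | 97.22%            | .026        | .970        |
      | + Plagiarism | 96.75%              | .048          | .966          | 98.38%            | .022        | .983        |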
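
For reference, the three reported quantities can be computed from a classifier's confusion counts as sketched below, using the standard definitions and treating spammers as the positive class. The counts in the usage example are invented for a 500-user test set with 250 spammers and do not reproduce any entry of the results.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, false-positive rate and F1 from confusion counts.

    tp: spammers flagged as spam      fn: spammers missed
    fp: legitimate users flagged      tn: legitimate users passed
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, fp_rate, f1


# Invented counts: 250 spammers and 250 legitimate users.
accuracy, fp_rate, f1 = metrics(tp=240, fp=5, tn=245, fn=10)
```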

  17. (Figure: percent correctly classified, 94 to 99, versus number of features, 1 to 6, for linear SVM and AdaBoost)

  18. Related work:
      - Web/Email Spam: Attenberg and Suel 2008, Gyöngyi et al. 2004
      - Social Spam: Heymann et al. 2007
      - Spam Detection: Krause et al. 2008, Caverlee et al. 2008, Benevenuto et al. 2008, Koutrika et al. 2007
      - ECML PKDD Discovery Challenge 2008 (held by BibSonomy team): Gkanogiannis and Kalamboukis 2008, Chevalier and Gramme 2008, Kim and Hwang 2008

  19. Conclusions:
      - Identified and analyzed 6 features for spam detection
      - TagSpam alone achieves 0.99 ROC AUC, outperforming the ECML PKDD Discovery Challenge 2008 results
      - Accuracy over 98% with AdaBoost
      - False-positive rate: 0.022
      - These results set the state of the art; they could improve by combining with other features, e.g. Krause et al. 2008
      - Limitations: efficiency issues, bootstrap issues

  20. Limitations by feature:
      - TagSpam: depends on a set of labeled tags
      - TagBlur: depends on a notion of similarity/distances; assumes spam does not dominate the folksonomy, affecting distances
      - DomFp: depends on a set of labeled fingerprints; requires page download
      - NumAds: requires page download
      - Plagiarism: requires page download; search engine cooperation
      - ValidLinks: HEAD request per resource

  21. Closing slide:
      - Outline: Features; Folksonomy description; Feature descriptions (post level, resource level, user level); Feature analysis; Social spam detection
      - Benjamin Markines, Ciro Cattuto, Filippo Menczer
      - BibSonomy Team, http://www.bibsonomy.org
      - ISI Foundation
