Web Dynamics
Web Spam
Dr. Marc Spaniol
Saarbrücken, June 24, 2010
Databases and Information Systems, Prof. Dr. G. Weikum
MPII-Sp-0710-1/49
Agenda
• Web spam
  - Why and what?
  - Spam taxonomy
  - Overview
  - Strategies in detail
    o Link spam
    o Link farms
  - Examples
• Countermeasures
  - Spam detection
  - Labeling and assessment
  - Combating spam
  - Web spam challenge
• Conclusion
Web Spam: Why?
“Spam industry had a revenue potential of $4.5 billion in year 2004 if they had been able to completely fool all search engines on all commercially viable queries” [Amitay 2004]
[Figure: time elapsed to reach hit position; time spent looking at hit position] [Granka, Joachims, Gay 2004]
Web Spam: What’s the Problem?
2004 .de crawl (Courtesy: T. Suel)
• Reputable: 70.0%
• Spam: 16.5%
• Non-existent: 7.9%
• Ad: 3.7%
• Weborg: 0.8%
• Unknown: 0.4%
• Empty: 0.4%
• Alias: 0.3%
Web Spam
• Target of spammers
  - Not end users (directly)
  - High revenue from customers for search engine “optimization” (especially Google)
  - Indirect revenue
    o Affiliate programs, Google AdSense
    o Ad display, traffic funneling
• Spam taxonomy
  - Content spam
    o Keywords
    o Popular expressions
    o Mis-spellings
  - Link spam
    o “Farms”
    o Densely connected sites
    o Redirects
  - Cloaking and hiding
  - Spam in social media
[Benczúr et al. 2008]
Overview
• Spamming
  - Boosting
    o Term
    o Links
  - Hiding
    o Content hiding
    o Cloaking
    o Redirection
Spammed Ranking Elements
• Term frequency (tf in the tf.idf, Okapi BM25 etc. ranking schemes)
• Term frequency weighted by HTML elements
  - Title
  - Headers
  - Font size
  - Face
• Heaviest weight in ranking
  - URL, domain name part
  - Anchor text: <a href="…">Best Saarbruecken nightlife</a>
• Structural information
  - URL length
  - Depth from server root
  - Indegree
  - PageRank
  - Link based centrality
⇒ All Web information retrieval ranking elements are spammed
Content Spam
• Domain name
    adjustableloanmortgagemastersonline.compay.dahannusaprima.co.uk
    buy-canon-rebel-20d-lens-case.camerasx.com
• Anchor text (title, H1, etc.)
    <a href="target.html">free, great deals, cheap, inexpensive, cheap, free</a>
• Meta keywords
    <meta name="keywords" content="UK Swingers, UK, swingers, swinging, genuine, adult contacts, connect4fun, sex, …">
[Gyöngyi, Garcia-Molina, 2005]
Parking Domain
<div style="position:absolute; top:20px; width:600px; height:90px; overflow:hidden;">
<font size=-1>atangledweb.co.uk currently offline<br>atangledweb.co.uk back soon<br></font><br><br>
<a href="http://www.atangledweb.co.uk"><font size=-1>atangledweb.co.uk</font></a><br><br><br>
Soundbridge HomeMusic WiFi Media Play<a class=l href="http://www.atangledweb.co.uk/index01.html">-</a>...
SanDisk Sansa e250 - 2GB MP3 Player -<a class=l href="http://www.atangledweb.co.uk/index02.html">-</a>...
AIGO F820+ 1GB Beach inspired MP3 Pla<a class=l href="http://www.atangledweb.co.uk/index03.html">-</a>...
Targus I-Pod Mini Sound Enhancer<a class=l href="http://www.atangledweb.co.uk/index04.html">-</a>...
Sony NWA806FP.CE7 4GB video WALKMAN <a class=l href="http://www.atangledweb.co.uk/index05.html">-</a>...
Ministry of Sound 512MB MP3 player<a class=l href="http://www.mp3roze.co.uk/cat7000.html">-</a>...
Nokia 6125 - Fold Design - 1.3 Megapi<a class=l href="http://www.mp3roze.co.uk/cat7001.html">-</a>...
Keyword Stuffing & Generated Copies
[Figure not preserved]
Google Ads
[Figure not preserved]
Link Spam
“Hyperlink structure contains an enormous amount of latent human annotation that can be extremely valuable for automatically inferring notions of authority.” (Chakrabarti et al. ’99)
• Hyperlinks: Good, Bad, Ugly
  - Good: honest link, human annotation
  - Bad: no value of recommendation, e.g. “affiliate programs”, navigation, ads …
  - Ugly: deliberate manipulation, link spam
PageRank
PageRank of page p_0:
  $p_0 = c \sum_i \frac{p_i}{|F(i)|} + \frac{1-c}{N}$
  - p_i: PageRank of a page p_i pointing to p_0
  - |F(i)|: number of outgoing links from p_i
  - c: damping factor; (1 - c): random jump
  - N: number of all pages
Generalized (vector):
  $\mathbf{p} = c\,T'\mathbf{p} + \frac{1-c}{N}\,\mathbf{1}_N$
  - p: score vector
  - T: transition matrix (T' denotes its transpose)
  - 1_N: “1” vector of length N
• A page is important if it is pointed to by many other pages
• Based on the link structure
⇒ The PageRank algorithm is vulnerable to link spamming
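The vector equation above can be evaluated by repeated substitution. The following is a minimal sketch, not the lecture's implementation: the function name `pagerank`, the toy graph, and the choice to let pages without out-links simply lose their score mass are all assumptions made for illustration.

```python
def pagerank(out_links, c=0.85, iterations=100):
    """Iterate p = c * T'p + (1 - c)/N * 1 on a small link graph.

    out_links maps each page to the list of pages it links to.
    Assumption: a page without out-links passes on no score at all.
    """
    pages = list(out_links)
    n = len(pages)
    p = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page first receives the random-jump share (1 - c)/N.
        new_p = {page: (1.0 - c) / n for page in pages}
        for page, targets in out_links.items():
            if not targets:
                continue
            # A page splits c times its score evenly over its out-links.
            share = c * p[page] / len(targets)
            for target in targets:
                new_p[target] += share
        p = new_p
    return p


# Hypothetical toy graph: page "c" is pointed to by three other pages.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
```

In this toy graph, page "d" has no in-links and therefore keeps only the random-jump share (1 - c)/N, while "c" collects score from three pages. That is exactly the effect a spammer tries to manufacture with boosting pages.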
Link Farms
• Entry point from the honest web
  - Honey pots: copies of quality content
  - Dead links to parking domain
  - Blog or guestbook comment spam
[Figure labels: WWW, Hijacked, Farm]
Spam Farm: Pages
[Figure: target page p_0, boosting pages p_1, p_2, …, p_k, leakage λ_0, λ_1, λ_2, …, λ_k]
Target page
• Each farm has only one
• The target of the spammer is to increase this page’s ranking
Boosting pages
• Controlled by the spammer
• Pointing to the target page in order to increase its PageRank
Spam Farm: External Links
[Figure: target page p_0, boosting pages p_1, p_2, …, p_k, leakage λ_0, λ_1, λ_2, …, λ_k]
Leakage
• Fractions of PageRank
• Links to the farm pages are added from pages outside the farm (forum, blog, …)
• The spammer has no or limited control over them
• λ = λ_0 + … + λ_k
Simple Farm Model
[Figure: WWW, boosting pages 1, 2, …, k, target page p_0]
• PageRank of target page: p_0
• Number of boosting pages: k
• Number of all pages: N
• Damping factor: c
• Leakage contributed by accessible pages: λ
• PageRank of each farm page: (1-c)/N
$p_0 = \lambda + k c \frac{1-c}{N} + \frac{1-c}{N} = \lambda + \frac{(1-c)(ck+1)}{N}$
⇒ By making k large, we can make p_0 as large as we want
⇒ No multiplier effect for “acquired” PageRank
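A small numeric check of the simple farm formula, written as a sketch. The function names and the sample values for k, N, and λ are assumptions chosen for illustration; leakage enters as a plain additive term, as in the formula on this slide.

```python
def simple_farm_target_rank(k, n_pages, leakage, c=0.85):
    """Sum the three contributions to the target page of a simple farm."""
    # Boosting pages have no in-links, so each holds only the random-jump share.
    boosting_rank = (1.0 - c) / n_pages
    # Each of the k boosting pages has one out-link and passes c times its score.
    from_boosting = k * c * boosting_rank
    own_random_jump = (1.0 - c) / n_pages
    return leakage + from_boosting + own_random_jump


def simple_farm_closed_form(k, n_pages, leakage, c=0.85):
    """Closed form: lambda + (1 - c)(ck + 1) / N."""
    return leakage + (1.0 - c) * (c * k + 1) / n_pages


# Hypothetical sizes: one million pages on the web, 1e-5 leakage.
small = simple_farm_target_rank(k=1_000, n_pages=1_000_000, leakage=1e-5)
large = simple_farm_target_rank(k=10_000, n_pages=1_000_000, leakage=1e-5)
```

The target score grows linearly in k, and the leakage term is not amplified by the farm, which is what "no multiplier effect" refers to.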
Optimal Farm Model
[Figure: WWW, boosting pages 1, 2, …, k, target page q_0]
• Optimal PageRank of target page: q_0
• Number of boosting pages: k
• Number of all pages: N
• Damping factor: c
• Leakage contributed by accessible pages: λ
• PageRank of each farm page: c q_0/k + (1-c)/N
$q_0 = \lambda + ck\left[\frac{c q_0}{k} + \frac{1-c}{N}\right] + \frac{1-c}{N}$
$\phantom{q_0} = \lambda + c^2 q_0 + \frac{c(1-c)k}{N} + \frac{1-c}{N}$
…
$\phantom{q_0} = \frac{\lambda}{1-c^2} + \frac{(1-c)(ck+1)}{N(1-c^2)} = \frac{p_0}{1-c^2}$
⇒ By making k large, we can make q_0 as large as we want
⇒ For c = 0.85, “performance” gain: 1/(1-c^2) = 3.6
⇒ Multiplier effect for “acquired” PageRank
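The fixed point of the recursion for q_0 can also be found numerically. This is a sketch under stated assumptions: the function names, the starting value 0, the iteration count, and the sample values are illustrative only; the formulas are the ones on this slide and on the simple farm slide.

```python
def optimal_farm_target_rank(k, n_pages, leakage, c=0.85, iterations=200):
    """Iterate q0 = lambda + c*k*(c*q0/k + (1-c)/N) + (1-c)/N to its fixed point."""
    q0 = 0.0
    for _ in range(iterations):
        # The target links back to all k boosting pages, so each one receives
        # c*q0/k on top of its random-jump share.
        boosting_rank = c * q0 / k + (1.0 - c) / n_pages
        q0 = leakage + c * k * boosting_rank + (1.0 - c) / n_pages
    return q0


def simple_farm_closed_form(k, n_pages, leakage, c=0.85):
    """p0 of the simple farm: lambda + (1 - c)(ck + 1) / N."""
    return leakage + (1.0 - c) * (c * k + 1) / n_pages


# Hypothetical sizes: 1,000 boosting pages, one million pages, 1e-5 leakage.
c = 0.85
q0 = optimal_farm_target_rank(k=1_000, n_pages=1_000_000, leakage=1e-5, c=c)
p0 = simple_farm_closed_form(k=1_000, n_pages=1_000_000, leakage=1e-5, c=c)
gain = q0 / p0
```

The error shrinks by a factor of c^2 per step, so 200 iterations are far more than needed. The ratio `gain` should agree with 1/(1-c^2), the value of about 3.6 quoted for c = 0.85.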
Simple vs. Optimal Farm
Simple (target p_0, boosting pages p_1, …, p_k, leakage λ):
• Each boosting page only points to the target page
  $p_0 = c\lambda + \frac{(1-c)(ck+1)}{N}$
Optimal, two variants (target q_0 with boosting pages q_1, …, q_k; target r_0 with boosting pages r_1, …, r_k; leakage λ each):
• Short reinforcement loop(s)
• The target points to all boosting pages
• There are no links among boosting pages
• Fewer links
  $q_0 = \frac{p_0}{1-c^2}$
  $r_0 = \frac{p_0}{1-c^2}$
Optimality without Leakage
For mathematical simplification only.
Idea: interpret leakage as m additional boosting pages q_{k+1}, …, q_{k+m}, with
  $m = \frac{\lambda N}{1-c}$
Required equality:
  $\frac{c\lambda}{1-c^2} + \frac{(1-c)(ck+1)}{(1-c^2)N} \overset{!}{=} \frac{(1-c)[c(k+m)+1]}{(1-c^2)N}$
[Figure: target page q_0 with boosting pages q_1, q_2, …, q_k and additional pages q_{k+1}, …, q_{k+m}]
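The value of m follows directly from the required equality on this slide. The short derivation below is a worked check that uses only the symbols already defined (c, λ, k, m, N), with leakage entering as cλ as written here.

```latex
\begin{align*}
\frac{(1-c)\,[c(k+m)+1]}{(1-c^2)N}
  &= \frac{(1-c)(ck+1)}{(1-c^2)N} + \frac{(1-c)\,c\,m}{(1-c^2)N} \\
\intertext{Matching the right-hand side against the left-hand side of the
required equality leaves only the leakage term:}
\frac{c\lambda}{1-c^2} &= \frac{(1-c)\,c\,m}{(1-c^2)N} \\
\lambda &= \frac{(1-c)\,m}{N}
\quad\Longrightarrow\quad
m = \frac{\lambda N}{1-c}
\end{align*}
```

Each additional boosting page thus stands in for (1-c)/N of leakage, which is the random-jump share a page without in-links holds.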
Alliance of Two Farms
Intuitive: each boosting page points to both targets and vice versa (2(k + m) new links).
[Figure: targets p_0 and q_0; boosting pages p_1, p_2, …, p_k and q_1, q_2, …, q_m]
$p_0 = c \sum_{i=1}^{k} \frac{p_i}{2} + c \sum_{j=1}^{m} \frac{q_j}{2} + \frac{1-c}{N}$
$q_0 = c \sum_{i=1}^{k} \frac{p_i}{2} + c \sum_{j=1}^{m} \frac{q_j}{2} + \frac{1-c}{N}$
$p_i = \frac{c\,(p_0 + q_0)}{k+m} + \frac{1-c}{N}, \quad i = 1, \dots, k$
$q_j = \frac{c\,(p_0 + q_0)}{k+m} + \frac{1-c}{N}, \quad j = 1, \dots, m$
Redistribution of PageRank:
$p_0 = q_0 = \frac{c(k+m)/2 + 1}{(1+c)N}$
Convenient for the smaller farm.
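The alliance equations can be solved by fixed-point iteration and compared with the closed form. This is a sketch only: the function names, the sample farm sizes, and the comparison against a single optimal farm without leakage (using q_0 = p_0/(1-c^2) with λ = 0) are assumptions made for illustration.

```python
def alliance_target_rank(k, m, n_pages, c=0.85, iterations=500):
    """Iterate the four alliance equations; by symmetry p0 equals q0."""
    p0 = q0 = 0.0
    for _ in range(iterations):
        # Each target links to all k + m boosting pages.
        boost = c * (p0 + q0) / (k + m) + (1.0 - c) / n_pages
        # Each boosting page splits its score between the two targets.
        p0 = q0 = c * (k + m) * boost / 2.0 + (1.0 - c) / n_pages
    return p0


def alliance_closed_form(k, m, n_pages, c=0.85):
    """Closed form: (c(k + m)/2 + 1) / ((1 + c) N)."""
    return (c * (k + m) / 2.0 + 1.0) / ((1.0 + c) * n_pages)


def single_optimal_farm(k, n_pages, c=0.85):
    """Optimal farm without leakage: (1 - c)(ck + 1) / (N (1 - c^2))."""
    return (1.0 - c) * (c * k + 1) / (n_pages * (1.0 - c ** 2))


# Hypothetical farm sizes: a small farm (k = 100) joins a large one (m = 900).
allied = alliance_target_rank(k=100, m=900, n_pages=1_000_000)
small_alone = single_optimal_farm(k=100, n_pages=1_000_000)
large_alone = single_optimal_farm(k=900, n_pages=1_000_000)
```

In the alliance both targets end up with the score of a farm of average size (k + m)/2. Under these assumptions the target of the smaller farm gains and the target of the larger farm loses, which matches the remark that the alliance is convenient for the smaller farm.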