Link Spam Alliances
Zoltán Gyöngyi, Hector Garcia-Molina
Very Large Data Bases ● Trondheim, September 1, 2005

Class List
■ Spam 101: Intro to web spam
■ Spam 221: Spamming PageRank
  • Spam farm model
  • Optimal farm structure
  • Alliances of two farms
  • Larger alliances
■ Spam 321: Link spam detection seminar

Spam 101
kaiser pharmacy online

Spam 101
Save today on Viagra, Lipitor, Zoloft, … Phentermine 90 Pills/$119

Spam 101
Pet shops commonly carry fish for home aquariums, small birds such as parakeets, small mammals such as fancy rats and hamsters…

Spam 101
Pharmacy is the profession of compounding and dispensing medication. More recently, the term has come to include other services…
Lawyers Loans Mortgage Ringtones Viagra

Spam 101
Spamming = misleading search engines to obtain higher-than-deserved ranking
Link spamming = building link structures that boost PageRank score

Spam 221: PageRank
A page is important if many important pages point to it

$p_0 = c \sum_i \frac{p_i}{\mathrm{out}(i)} + (1 - c)$

  • $p_0$: PageRank of page $p_0$
  • $p_i$: PageRank of a page $p_i$ that points to page $p_0$

Spam 221: PageRank

$p_0 = c \sum_i \frac{p_i}{\mathrm{out}(i)} + (1 - c)$

  • $c$: damping factor ≈ 0.85
  • $1 - c$: random jump probability ≈ 0.15 (uniform static score)
  • $\mathrm{out}(i)$: outdegree of page $p_i$
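
The formula on this slide can be evaluated by simple fixed-point iteration. The following is a minimal sketch, not the authors' implementation: the function name `pagerank`, the dictionary-based graph representation, and the three-page toy graph are all assumptions made for illustration. It uses the slide's unnormalized form, in which every page gets a static score of $1 - c$ and a page without outlinks simply passes nothing on.

```python
def pagerank(out_links, c=0.85, iterations=100):
    """Iterate p = c * sum(p_i / out(i)) + (1 - c) for every page.

    out_links maps each page to the list of pages it links to.
    Every linked page must also appear as a key.
    """
    scores = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        # Start every page from the uniform static score (1 - c).
        new_scores = {page: 1.0 - c for page in out_links}
        for page, targets in out_links.items():
            for target in targets:
                # Each page splits its score evenly among its outlinks.
                new_scores[target] += c * scores[page] / len(targets)
        scores = new_scores
    return scores


# Hypothetical toy graph: pages "a" and "b" both point to "t".
toy_graph = {"a": ["t"], "b": ["t"], "t": []}
scores = pagerank(toy_graph)
print(scores)
```

In this toy graph the two source pages have no inlinks and keep only the static score, while "t" collects a damped share of each of them.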

Spam 221: Spam Farm Model
[Diagram: pages 0, 1, 2, …, k, with a "?" marking the link structure still to be determined]

Spam 221: Spam Farm Model
■ Single target page $p_0$
  • Increase exposure
  • In particular, increase PageRank

Spam 221: Spam Farm Model
■ Boosting pages $p_1, \ldots, p_k$
  • Owned/controlled by spammer
[Sample page text: Canada Rx / Cheap Canadian drugs here / import pharmacy online best prescriptions discount savings]

Spam 221: Spam Farm Model
■ Leakages $\lambda_0, \ldots, \lambda_k$
  • Fractions of PageRank
  • Through hijacked links
    – Spammer has limited access to source page
  • $\lambda = \lambda_0 + \cdots + \lambda_k$
[Sample hijacked link: Joe’s Blog, posted on 04/28/05 … Comments: "Great thoughts! I also wrote about this issue in my blog." (by as7869)]

Spam 221: Optimal Farm
■ Simple
  • Every link points to $p_0$
  • $p_0 = \lambda + (1 - c)(c k + 1)$
■ Optimal
  • Links to boosting pages
  • $q_0 = p_0 / (1 - c^2)$
  • 3.6x increase in target PageRank for $c = 0.85$
[Diagrams: simple farm with target $p_0$, boosting pages $p_1, \ldots, p_k$ and leakage $\lambda$; optimal farm with target $q_0$, boosting pages $q_1, \ldots, q_k$ and leakage $\lambda$]

Spam 221: Optimal Farm
■ Optimal
  • $q_0 = p_0 / (1 - c^2)$
  • Links to boosting pages
  • 3.6x increase in target PageRank
■ Optimal #2
  • $r_0 = p_0 / (1 - c^2)$
  • Same PageRank
  • Fewer links
[Diagrams: optimal farm with target $q_0$, boosting pages $q_1, \ldots, q_k$ and leakage $\lambda$; second optimal farm with target $r_0$, boosting pages $r_1, \ldots, r_k$ and leakage $\lambda$]

Spam 221: Optimal Farm
Lesson #1: Short loop(s) increase target PageRank
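
The 3.6x figure can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names and the farm size `k = 1000` are hypothetical, leakage is added directly to the target score as in the slide's formula, and the iterated version assumes the "optimal" structure is the one in which the target links back to all $k$ boosting pages.

```python
def simple_farm_target(k, c=0.85, leakage=0.0):
    """Target score when every boosting page links only to the target."""
    return leakage + (1 - c) * (c * k + 1)


def optimal_farm_target(k, c=0.85, leakage=0.0):
    """Closed form from the slide: simple-farm score divided by (1 - c^2)."""
    return simple_farm_target(k, c, leakage) / (1 - c ** 2)


def iterate_optimal_farm(k, c=0.85, leakage=0.0, iterations=200):
    """Recompute the target score by iterating the PageRank equations.

    Assumed structure: k boosting pages link to the target, and the
    target links back to all k boosting pages.
    """
    target, boost = 1.0, 1.0
    for _ in range(iterations):
        # Each boosting page receives 1/k of the target's damped score.
        new_boost = c * target / k + (1 - c)
        # The target receives the damped score of all k boosting pages.
        new_target = leakage + c * k * boost + (1 - c)
        target, boost = new_target, new_boost
    return target


k = 1000  # hypothetical farm size
simple = simple_farm_target(k)
optimal = optimal_farm_target(k)
print(f"simple {simple:.2f}, optimal {optimal:.2f}, ratio {optimal / simple:.2f}")
print(f"iterated optimal {iterate_optimal_farm(k):.2f}")
```

The ratio depends only on $c$, not on the farm size or the leakage, because both enter the numerator that is divided by $1 - c^2$.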

Spam 221: Two Farms
■ Alliances = interconnected farms
  • Single spammer, several target pages/farms
  • Multiple spammers
What happens if you and I team up?

Spam 221: Two Farms
■ We can do this…
■ … but it won’t help: target scores balance out
  • $d = c / (1 + c)$
  • $p_0 = q_0 = d (k + m) / 2$

Spam 221: Two Farms
■ However, we can also do this…
  • Remove the links to boosting pages
■ … and both target scores increase
  • $p_0 = d k + c d m + 1$
  • $q_0 = d m + c d k + 1$
  • For $k = m$, we have a 6.7x increase
[Diagram: farm with target $p_0$ and boosting pages $p_1, \ldots, p_k$, interconnected with a farm with target $q_0$ and boosting pages $q_1, \ldots, q_m$]

Spam 221: Two Farms
Lesson #2: Target pages should only link to other targets
Lesson #3: In an alliance of two, both participants win
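
The two-farm formulas can be reproduced with a short calculation. This is a sketch under assumptions, not the authors' code: it assumes zero leakage, boosting pages that link only to their own target, and two targets that link only to each other. The function names and the farm sizes are hypothetical.

```python
def two_farm_targets(k, m, c=0.85):
    """Closed-form target scores from the slide, with d = c / (1 + c)."""
    d = c / (1 + c)
    p0 = d * k + c * d * m + 1
    q0 = d * m + c * d * k + 1
    return p0, q0


def iterate_two_farms(k, m, c=0.85, iterations=200):
    """Iterate the PageRank equations for the assumed alliance structure."""
    boost = 1 - c  # boosting pages have no inlinks, so they keep the static score
    p0 = q0 = 1.0
    for _ in range(iterations):
        # Each target collects its own boosting pages plus the other target.
        p0, q0 = (c * (k * boost + q0) + (1 - c),
                  c * (m * boost + p0) + (1 - c))
    return p0, q0


c = 0.85
k = m = 1000  # hypothetical, equal farm sizes
p0, q0 = two_farm_targets(k, m, c)
simple = (1 - c) * (c * k + 1)  # simple single farm without leakage
print(f"allied target {p0:.2f}, simple farm {simple:.2f}, ratio {p0 / simple:.2f}")
print(iterate_two_farms(k, m, c))
```

For $k = m$ the allied score simplifies to $c k + 1$, so its ratio to the simple farm score $(1 - c)(c k + 1)$ is $1 / (1 - c) \approx 6.67$. This suggests that the 6.7x figure is measured against the simple farm rather than the optimal single farm.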

Spam 221: Larger Alliances
■ “Extremes”
  • Ring core
  • Completely connected core

Spam 221: Larger Alliances
■ Target scores for ring/complete cores
  • 10 farms of sizes 1000, 2000, …, 10000
[Figure: target PageRank (0 to 6000) versus farm number (1 to 10), with curves for Complete, Ring and Optimal Single]
Problem: farm 10 “loses” in a ring

Spam 221: Larger Alliances
Lesson #4: Larger alliances need to be stable to keep all participants happy
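
The ring/complete comparison for the 10 farms can be approximated with a small solver over the target pages only. This is a hedged sketch: the function name `core_target_scores` is hypothetical, leakage is assumed to be zero, boosting pages are assumed to link only to their own target, and the direction of the ring is an assumption because the slide's diagram is not available.

```python
def core_target_scores(farm_sizes, core_links, c=0.85, iterations=500):
    """Iterate the PageRank equations restricted to the target pages.

    farm_sizes[i] is the number of boosting pages of farm i.
    core_links[i] lists the targets that target i links to.
    """
    # Each boosting page scores (1 - c) and passes c * (1 - c) to its
    # target; the target also receives its own static score (1 - c).
    base = [c * (1 - c) * k + (1 - c) for k in farm_sizes]
    scores = list(base)
    for _ in range(iterations):
        new_scores = list(base)
        for i, linked in enumerate(core_links):
            for j in linked:
                new_scores[j] += c * scores[i] / len(linked)
        scores = new_scores
    return scores


c = 0.85
sizes = [1000 * i for i in range(1, 11)]
n = len(sizes)

# ASSUMPTION: in the ring, farm i's target links to farm (i - 1)'s target,
# and farm 1's target links to farm 10's target.
ring_links = [[(i - 1) % n] for i in range(n)]
complete_links = [[j for j in range(n) if j != i] for i in range(n)]

ring = core_target_scores(sizes, ring_links, c)
complete = core_target_scores(sizes, complete_links, c)
# Optimal single farm without leakage: (1 - c)(c k + 1) / (1 - c^2).
single = [(1 - c) * (c * k + 1) / (1 - c ** 2) for k in sizes]

for farm, (s, r, f) in enumerate(zip(single, ring, complete), start=1):
    print(f"farm {farm:2d}: single {s:7.1f}  ring {r:7.1f}  complete {f:7.1f}")
```

Under this assumed ring direction, farm 10 receives its only core link from the smallest farm and ends up below its optimal single-farm score, while every farm scores higher in the complete core than on its own. With the opposite ring direction the numbers differ, so the output should be read as an illustration of the stability issue, not as a reproduction of the figure.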

Spam 221: Larger Alliances
■ Stable alliance = no farm has incentive to split off
  • Alliances of two are always stable
  • Larger alliances are not necessarily stable
■ Dynamics (see paper)
  • Should a new farm be added?
  • What about adding more boosting pages?
  • When/with whom should a farm split off?
  • Should a “loser” be compensated?

Spam 321: Spam Detection
■ Identifying regular structures
  • Inlink/outlink/PageRank distribution “unnatural”
  • Fetterly et al., 2004
  • Benczúr et al., 2005
  • $p_1 = p_2 = \cdots = p_k$
[Diagram: farm with target $p_0$, boosting pages $p_1, \ldots, p_k$ and leakage $\lambda$]

Spam 321: Spam Detection
■ Detecting collusion
  • Alliance cores preserve (capture) PageRank
  • Zhang et al., 2004
  • $(p_0 + q_0) / (\sum_i p_i + \sum_j q_j) \approx c / (1 - c)$
[Diagram: farm with target $p_0$ and boosting pages $p_1, \ldots, p_k$, interconnected with a farm with target $q_0$ and boosting pages $q_1, \ldots, q_m$]
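
The collusion signature can be checked against the two-farm formulas from the earlier slides. A minimal sketch, assuming zero leakage and boosting pages without inlinks (so each scores $1 - c$); the function name `core_capture_ratio` and the farm sizes are hypothetical.

```python
def core_capture_ratio(k, m, c=0.85):
    """Ratio of the core's PageRank to the boosting pages' PageRank."""
    d = c / (1 + c)
    p0 = d * k + c * d * m + 1
    q0 = d * m + c * d * k + 1
    # Boosting pages have no inlinks, so each keeps the static score (1 - c).
    boosting_total = (k + m) * (1 - c)
    return (p0 + q0) / boosting_total


c = 0.85
ratio = core_capture_ratio(1000, 1000, c)
print(f"core/boosting ratio {ratio:.4f}, c / (1 - c) = {c / (1 - c):.4f}")
```

The sum $p_0 + q_0$ equals $c (k + m) + 2$, so the ratio exceeds $c / (1 - c)$ by $2 / ((k + m)(1 - c))$, which shrinks as the farms grow.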

Spam 321: Spam Detection
■ Estimating spam mass
  • Target PageRank depends on boosting
  • Work in progress
  • $(p_0 - p'_0) / p_0$ large

Review Session
■ Link spammers target PageRank
■ Spam farm model
  • Single target page
  • Boosting pages + leakage
■ Alliances of two
  • Always better than alone
■ Larger alliances
  • Different core structures
  • Not necessarily stable
    – Conditions on joining and leaving

Review Session
■ Related work
  • Bianchini et al., 2005. Inside PageRank
  • Langville and Meyer, 2004. Deeper Inside PageRank
  • Baeza-Yates et al., 2005. PageRank Increase under Different Collusion Topologies
■ Future work
  • Spam detection
  • Cost model extension

Spam 221: Larger Alliances
■ Various core structures
  • 4 farms of size 50
  • One target probed (others symmetrical)
[Figure: target PageRank (top panel) and number of graphs (bottom panel) by score group, 1 to 20; the ring core is labeled]