Power Iteration example (graph with nodes y, a, m):

          y    a    m
     y   1/2  1/2   0
     a   1/2   0    1
     m    0   1/2   0

Set $r_j = 1/N$, then:
  1: $r'_j = \sum_{i \to j} r_i / d_i$
  2: $r = r'$
  Goto 1

Flow equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2

Example (iterations 0, 1, 2, ...):
  r_y:  1/3   1/3   5/12   9/24  ...  6/15
  r_a:  1/3   3/6   1/3    11/24 ...  6/15
  r_m:  1/3   1/6   3/12   1/6   ...  3/15
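A minimal Python sketch of this power iteration (my own code, not from the slides; assumes numpy):

```python
import numpy as np

# Column-stochastic link matrix M for the y/a/m graph: M[j][i] = 1/d_i if i -> j.
M = np.array([[0.5, 0.5, 0.0],   # r_y = r_y/2 + r_a/2
              [0.5, 0.0, 1.0],   # r_a = r_y/2 + r_m
              [0.0, 0.5, 0.0]])  # r_m = r_a/2

r = np.full(3, 1/3)              # start from the uniform vector, r_i = 1/N
for _ in range(50):
    r = M @ r                    # r' = M . r
print(r)                         # -> approx [0.4, 0.4, 0.2] = [6/15, 6/15, 3/15]
```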
Details!
Power iteration: a method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue):
  $r^{(1)} = M \cdot r^{(0)}$
  $r^{(2)} = M \cdot r^{(1)} = M(M r^{(0)}) = M^2 \cdot r^{(0)}$
  $r^{(3)} = M \cdot r^{(2)} = M(M^2 r^{(0)}) = M^3 \cdot r^{(0)}$
Claim: the sequence $M \cdot r^{(0)},\ M^2 \cdot r^{(0)},\ \dots,\ M^k \cdot r^{(0)},\ \dots$ approaches the dominant eigenvector of $M$.
Details!
Claim: the sequence $M \cdot r^{(0)},\ M^2 \cdot r^{(0)},\ \dots,\ M^k \cdot r^{(0)},\ \dots$ approaches the dominant eigenvector of $M$.
Proof: Assume $M$ has $n$ linearly independent eigenvectors $x_1, x_2, \dots, x_n$ with corresponding eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$, where $\lambda_1 > \lambda_2 > \dots > \lambda_n$.
Vectors $x_1, x_2, \dots, x_n$ form a basis and thus we can write:
  $r^{(0)} = c_1 x_1 + c_2 x_2 + \dots + c_n x_n$
  $M r^{(0)} = M(c_1 x_1 + c_2 x_2 + \dots + c_n x_n)$
  $\quad = c_1 (M x_1) + c_2 (M x_2) + \dots + c_n (M x_n)$
  $\quad = c_1 (\lambda_1 x_1) + c_2 (\lambda_2 x_2) + \dots + c_n (\lambda_n x_n)$
Repeated multiplication on both sides produces
  $M^k r^{(0)} = c_1 (\lambda_1^k x_1) + c_2 (\lambda_2^k x_2) + \dots + c_n (\lambda_n^k x_n)$
Details!
Proof (continued): Repeated multiplication on both sides produces
  $M^k r^{(0)} = c_1 (\lambda_1^k x_1) + c_2 (\lambda_2^k x_2) + \dots + c_n (\lambda_n^k x_n)$
  $M^k r^{(0)} = \lambda_1^k \left[ c_1 x_1 + c_2 \left(\frac{\lambda_2}{\lambda_1}\right)^k x_2 + \dots + c_n \left(\frac{\lambda_n}{\lambda_1}\right)^k x_n \right]$
Since $\lambda_1 > \lambda_2$, the fractions $\frac{\lambda_2}{\lambda_1}, \frac{\lambda_3}{\lambda_1}, \dots < 1$, and so $\left(\frac{\lambda_i}{\lambda_1}\right)^k \to 0$ as $k \to \infty$ (for all $i = 2 \dots n$).
Thus: $M^k r^{(0)} \approx c_1 (\lambda_1^k x_1)$
Note: if $c_1 = 0$ then the method won't converge.
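A quick numeric sanity check of this claim (my own sketch, not from the slides; a random positive column-stochastic matrix guarantees $c_1 \neq 0$ for the uniform start vector):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((5, 5))
M /= M.sum(axis=0)                    # make columns sum to 1 (column stochastic)

r = np.full(5, 1/5)                   # r(0)
for _ in range(100):
    r = M @ r                         # r(k) = M^k r(0)

w, V = np.linalg.eig(M)
x1 = np.real(V[:, np.argmax(np.real(w))])   # dominant eigenvector (eigenvalue 1)
x1 /= x1.sum()                        # normalize it to sum to 1, like r
print(np.allclose(r, x1, atol=1e-8))  # True: the iterates converged to x1
```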
Imagine a random web surfer:
- At any time $t$, the surfer is on some page $i$
- At time $t+1$, the surfer follows an out-link from $i$ uniformly at random
- Ends up on some page $j$ linked from $i$
- The process repeats indefinitely
Let $p(t)$ be the vector whose $i$-th coordinate is the probability that the surfer is at page $i$ at time $t$.
So $p(t)$ is a probability distribution over pages.
Where is the surfer at time $t+1$?
Follows a link uniformly at random: $p(t+1) = M \cdot p(t)$
Suppose the random walk reaches a state $p(t+1) = M \cdot p(t) = p(t)$;
then $p(t)$ is a stationary distribution of the random walk.
Our original rank vector $r$ satisfies $r = M \cdot r$.
So, $r$ is a stationary distribution for the random walk.
A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time $t = 0$ is.
$r^{(t+1)} = M \cdot r^{(t)}$, or equivalently $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
- Does this converge?
- Does it converge to what we want?
- Are the results reasonable?
$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
Does this converge? Example (nodes a and b linking only to each other): the scores oscillate forever.
  r_a:  1  0  1  0  ...
  r_b:  0  1  0  1  ...
  Iteration 0, 1, 2, ...
$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
Does it converge to what we want? Example (a links to b, and b is a dead end): all the score leaks out.
  r_a:  1  0  0  0  ...
  r_b:  0  1  0  0  ...
  Iteration 0, 1, 2, ...
2 problems:
(1) Some pages are dead ends (have no out-links)
    The random walk has "nowhere" to go to
    Such pages cause importance to "leak out"
(2) Spider traps (all out-links are within the group)
    The random walker gets "stuck" in a trap
    And eventually spider traps absorb all importance
Power Iteration with a spider trap (m now links only to itself):

          y    a    m
     y   1/2  1/2   0
     a   1/2   0    0
     m    0   1/2   1

Set $r_j = 1/N$ and iterate $r_j = \sum_{i \to j} \frac{r_i}{d_i}$:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2
  r_m = r_a/2 + r_m

Example (iterations 0, 1, 2, ...):
  r_y:  1/3   2/6   3/12   5/24  ...  0
  r_a:  1/3   1/6   2/12   3/24  ...  0
  r_m:  1/3   3/6   7/12   16/24 ...  1

All the PageRank score gets "trapped" in node m.
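The same iteration loop run on the spider-trap matrix shows the absorption numerically (again my own sketch):

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],   # r_y = r_y/2 + r_a/2
              [0.5, 0.0, 0.0],   # r_a = r_y/2
              [0.0, 0.5, 1.0]])  # r_m = r_a/2 + r_m   (m links only to itself)

r = np.full(3, 1/3)
for _ in range(100):
    r = M @ r
print(r.round(3))                # -> [0., 0., 1.]: m absorbs all the score
```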
The Google solution for spider traps: At each time step, the random surfer has two options:
- With probability $\beta$, follow a link at random
- With probability $1-\beta$, jump to some random page
Common values for $\beta$ are in the range 0.8 to 0.9.
The surfer will teleport out of a spider trap within a few time steps.
Power Iteration with a dead end (m has no out-links):

          y    a    m
     y   1/2  1/2   0
     a   1/2   0    0
     m    0   1/2   0

Set $r_j = 1/N$ and iterate $r_j = \sum_{i \to j} \frac{r_i}{d_i}$:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2
  r_m = r_a/2

Example (iterations 0, 1, 2, ...):
  r_y:  1/3   2/6   3/12   5/24  ...  0
  r_a:  1/3   1/6   2/12   3/24  ...  0
  r_m:  1/3   1/6   1/12   2/24  ...  0

Here the PageRank "leaks" out since the matrix is not column stochastic.
Teleports: Follow random teleport links with probability 1.0 from dead ends.
Adjust the matrix accordingly (the dead-end column m becomes uniform):

          y    a    m                  y    a    m
     y   1/2  1/2   0             y   1/2  1/2  1/3
     a   1/2   0    0     ==>     a   1/2   0   1/3
     m    0   1/2   0             m    0   1/2  1/3
Why are dead ends and spider traps a problem, and why do teleports solve the problem?
Spider traps are not a problem for convergence, but with traps the PageRank scores are not what we want.
  Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps.
Dead ends are a problem: the matrix is not column stochastic, so our initial assumptions are not met.
  Solution: Make the matrix column stochastic by always teleporting when there is nowhere else to go.
Google's solution that does it all: At each step, the random surfer has two options:
- With probability $\beta$, follow a link at random
- With probability $1-\beta$, jump to some random page

PageRank equation [Brin-Page, '98]:
  $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta) \frac{1}{N}$
where $d_i$ is the out-degree of node $i$.
This formulation assumes that $M$ has no dead ends. We can either preprocess matrix $M$ to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead ends.
PageRank equation [Brin-Page, '98]:
  $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta) \frac{1}{N}$

The Google Matrix A:
  $A = \beta M + (1-\beta) \left[\frac{1}{N}\right]_{N \times N}$
where $[1/N]_{N \times N}$ is the $N$-by-$N$ matrix with all entries $1/N$.
We have a recursive problem: $r = A \cdot r$. And the Power method still works!
What is $\beta$? In practice $\beta = 0.8$ to $0.9$ (make 5 steps on avg., then jump).
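A sketch of building the Google matrix and running the power method on it (my own code; dense numpy, so suitable for toy graphs only):

```python
import numpy as np

def google_matrix(M, beta=0.8):
    """A = beta*M + (1-beta)*[1/N]_{NxN} for a column-stochastic M (no dead ends)."""
    N = M.shape[0]
    return beta * M + (1 - beta) / N * np.ones((N, N))

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])    # the spider-trap example from above
A = google_matrix(M, beta=0.8)

r = np.full(3, 1/3)
for _ in range(100):
    r = A @ r                      # power method on A
print(r)                           # -> approx [7/33, 5/33, 21/33]
```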
Example: $A = 0.8 \cdot M + 0.2 \cdot [1/3]_{3 \times 3}$

            1/2  1/2   0           1/3  1/3  1/3        7/15  7/15   1/15
  A = 0.8   1/2   0    0   + 0.2   1/3  1/3  1/3   =    7/15  1/15   1/15
             0   1/2   1           1/3  1/3  1/3        1/15  7/15  13/15

  r_y:  1/3  0.33  0.24  0.26  ...  7/33
  r_a:  1/3  0.20  0.20  0.18  ...  5/33
  r_m:  1/3  0.46  0.52  0.56  ...  21/33
  Iteration 0, 1, 2, ...
Key step is matrix-vector multiplication: $r^{new} = A \cdot r^{old}$
Easy if we have enough main memory to hold $A$, $r^{old}$, $r^{new}$.
Say N = 1 billion pages, and we need 4 bytes for each entry (say):
- 2 billion entries for the two vectors, approx 8GB
- But $A = \beta M + (1-\beta)[1/N]_{N \times N}$ is dense: it has $N^2 = 10^{18}$ entries, and $10^{18}$ is a large number!
Suppose there are N pages. Consider page $i$, with $d_i$ out-links.
We have $M_{ji} = 1/d_i$ when $i \to j$ and $M_{ji} = 0$ otherwise.
The random teleport is equivalent to:
- Adding a teleport link from $i$ to every other page and setting its transition probability to $(1-\beta)/N$
- Reducing the probability of following each out-link from $1/d_i$ to $\beta/d_i$
- Equivalently: tax each page a fraction $(1-\beta)$ of its score and redistribute it evenly
$r = A \cdot r$, where $A_{ji} = \beta M_{ji} + \frac{1-\beta}{N}$
  $r_j = \sum_{i=1}^{N} A_{ji} \cdot r_i$
  $\quad = \sum_{i=1}^{N} \left[ \beta M_{ji} + \frac{1-\beta}{N} \right] \cdot r_i$
  $\quad = \sum_{i=1}^{N} \beta M_{ji} \cdot r_i + \frac{1-\beta}{N} \sum_{i=1}^{N} r_i$
  $\quad = \sum_{i=1}^{N} \beta M_{ji} \cdot r_i + \frac{1-\beta}{N}$, since $\sum_i r_i = 1$
So we get: $r = \beta M \cdot r + \left[\frac{1-\beta}{N}\right]_N$
where $[x]_N$ is a vector of length N with all entries x.
Note: here we assumed M has no dead ends.
We just rearranged the PageRank equation
  $r = \beta M \cdot r + \left[\frac{1-\beta}{N}\right]_N$
where $[(1-\beta)/N]_N$ is a vector with all N entries equal to $(1-\beta)/N$.
M is a sparse matrix! (with no dead ends): ~10 links per node, approx 10N entries.
So in each iteration, we need to:
- Compute $r^{new} = \beta M \cdot r^{old}$
- Add a constant value $(1-\beta)/N$ to each entry in $r^{new}$
Note: if M contains dead ends then $\sum_j r_j^{new} < 1$ and we also have to renormalize $r^{new}$ so that it sums to 1.
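A sketch of one iteration in this sparse form (my code; assumes scipy, and `pagerank_step` is my own name):

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank_step(M, r_old, beta=0.8):
    """One iteration: r_new = beta*M*r_old + (1-beta)/N (M has no dead ends)."""
    N = r_old.shape[0]
    return beta * (M @ r_old) + (1 - beta) / N

M = csr_matrix(np.array([[0.5, 0.5, 0.0],    # the y/a/m spider-trap example,
                         [0.5, 0.0, 0.0],    # stored as a sparse matrix
                         [0.0, 0.5, 1.0]]))
r = np.full(3, 1/3)
for _ in range(100):
    r = pagerank_step(M, r)
print(r)   # -> approx [7/33, 5/33, 21/33], matching the Google-matrix result
```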
Complete algorithm:
Input: directed graph $G$ (can have spider traps and dead ends) and parameter $\beta$
Output: PageRank vector $r$

Set $r_j^{(0)} = \frac{1}{N}$, $t = 1$
Repeat until convergence ($\sum_j |r_j^{(t)} - r_j^{(t-1)}| > \varepsilon$):
  $\forall j:\ r'^{(t)}_j = \sum_{i \to j} \beta \frac{r_i^{(t-1)}}{d_i}$
  ($r'^{(t)}_j = 0$ if the in-degree of $j$ is 0)
  Now re-insert the leaked PageRank:
  $\forall j:\ r_j^{(t)} = r'^{(t)}_j + \frac{1-S}{N}$, where $S = \sum_j r'^{(t)}_j$
  $t = t + 1$

If the graph has no dead ends then the amount of leaked PageRank is $1-\beta$. But since we have dead ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
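A sketch of this full algorithm, including the re-insertion of leaked PageRank (my own code; variable names are mine):

```python
import numpy as np

def pagerank(M, beta=0.8, eps=1e-10):
    """M is column stochastic except for all-zero columns at dead ends."""
    N = M.shape[0]
    r = np.full(N, 1 / N)
    while True:
        r_prime = beta * (M @ r)         # dead-end columns contribute nothing
        S = r_prime.sum()                # S < 1: teleport mass plus dead-end leakage
        r_new = r_prime + (1 - S) / N    # re-insert all leaked PageRank uniformly
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# Dead-end example from the slides: m has no out-links (all-zero column).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
print(pagerank(M))
```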
Encode the sparse matrix using only its nonzero entries.
Space proportional roughly to the number of links: say 10N, or 4*10*1 billion = 40GB.
Still won't fit in memory, but will fit on disk.

  source node | degree | destination nodes
  0           | 3      | 1, 5, 7
  1           | 5      | 17, 64, 113, 117, 245
  2           | 2      | 13, 23
Assume enough RAM to fit r_new into memory; store r_old and matrix M on disk.
One step of power iteration is:
- Initialize all entries of r_new to $(1-\beta)/N$
- For each page i (of out-degree d_i):
    Read into memory: i, d_i, dest_1, ..., dest_{d_i}, r_old(i)
    For j = 1...d_i:
      r_new(dest_j) += β · r_old(i) / d_i

  source | degree | destination
  0      | 3      | 1, 5, 6
  1      | 4      | 17, 64, 113, 117
  2      | 2      | 13, 23
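A sketch of this streamed update in pure Python (my framing: the "disk" is just an iterable of (source, degree, destinations, r_old_i) records, and the small example records are hypothetical):

```python
def power_step(records, N, beta=0.8):
    r_new = [(1 - beta) / N] * N                # initialize every entry to (1-beta)/N
    for i, d_i, dests, r_old_i in records:      # streamed one record at a time
        for j in dests:
            r_new[j] += beta * r_old_i / d_i    # scatter page i's rank along its links
    return r_new

# Hypothetical tiny example (N = 7, r_old uniform):
records = [(0, 3, [1, 5, 6], 1/7),
           (1, 2, [0, 5], 1/7),
           (2, 2, [3, 4], 1/7)]
print(power_step(records, N=7))
```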
Assume enough RAM to fit r_new into memory; store r_old and matrix M on disk.
In each iteration, we have to:
- Read r_old and M
- Write r_new back to disk
Cost per iteration of the Power method: 2|r| + |M|
Question: What if we could not even fit r_new in memory?
Break r_new into k blocks that fit in memory; scan M and r_old once for each block.

  src | degree | destination
  0   | 4      | 0, 1, 3, 5
  1   | 2      | 0, 5
  2   | 2      | 3, 4
This is similar to a nested-loop join in databases:
Break r_new into k blocks that fit in memory; scan M and r_old once for each block.
Total cost: k scans of M and r_old.
Cost per iteration of the Power method: k(|M| + |r|) + |r| = k|M| + (k+1)|r|
Can we do better?
Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration.
Break M into stripes! Each stripe contains only the destination nodes in the corresponding block of r_new.

  Stripe for r_new block {0, 1}:
    src | degree | destination
    0   | 4      | 0, 1
    1   | 3      | 0
    2   | 2      | 1
  Stripe for r_new block {2, 3}:
    0   | 4      | 3
    2   | 2      | 3
  Stripe for r_new block {4, 5}:
    0   | 4      | 5
    1   | 3      | 5
    2   | 2      | 4
Break M into stripes: each stripe contains only the destination nodes in the corresponding block of r_new.
Some additional overhead per stripe, but it is usually worth it.
Cost per iteration of the Power method: $|M|(1+\varepsilon) + (k+1)|r|$
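A sketch of the block-stripe update (my own framing; the stripes below split the consistent 6-node example from the "break r_new into blocks" slide, so node degrees match):

```python
def block_stripe_step(stripes, blocks, r_old, beta=0.8):
    N = len(r_old)
    r_new = [0.0] * N
    for stripe, (lo, hi) in zip(stripes, blocks):
        # while scanning this stripe, only r_new[lo:hi] needs to be in memory
        for i, d_i, dests in stripe:
            assert all(lo <= j < hi for j in dests)   # stripe invariant
            for j in dests:
                r_new[j] += beta * r_old[i] / d_i
    return [x + (1 - beta) / N for x in r_new]

stripes = [
    [(0, 4, [0, 1]), (1, 2, [0])],              # destinations 0-1
    [(0, 4, [3]), (2, 2, [3])],                 # destinations 2-3
    [(0, 4, [5]), (1, 2, [5]), (2, 2, [4])],    # destinations 4-5
]
blocks = [(0, 2), (2, 4), (4, 6)]
# Note: nodes 3-5 are dead ends in this toy graph, so some mass leaks;
# the point here is the scan pattern, not dead-end handling.
print(block_stripe_step(stripes, blocks, [1/6] * 6))
```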
Some problems with PageRank:
- Measures generic popularity of a page: biased against topic-specific authorities.
  Solution: Topic-Specific PageRank (next)
- Uses a single measure of importance; there are other models of importance.
  Solution: Hubs-and-Authorities
- Susceptible to link spam: artificial link topologies created in order to boost page rank.
  Solution: TrustRank
Figure: PageRank scores on an example graph. B and C score highest (38.4 and 34.3), E gets 8.1, A, D, and F get 3.3-3.9, and the remaining pages get 1.6 each.
Example: $A = 0.8 \cdot M + 0.2 \cdot [1/3]_{3 \times 3}$ (e.g., the y-to-y entry is $0.8 \cdot \frac{1}{2} + 0.2 \cdot \frac{1}{3} = 7/15$, and the m-to-m entry is $0.8 \cdot 1 + 0.2 \cdot \frac{1}{3} = 13/15$):

            1/2  1/2   0           1/3  1/3  1/3        7/15  7/15   1/15
  A = 0.8   1/2   0    0   + 0.2   1/3  1/3  1/3   =    7/15  1/15   1/15
             0   1/2   1           1/3  1/3  1/3        1/15  7/15  13/15

  r_y:  1/3  0.33  0.24  0.26  ...  7/33
  r_a:  1/3  0.20  0.20  0.18  ...  5/33
  r_m:  1/3  0.46  0.52  0.56  ...  21/33

$r = A \cdot r$; equivalently, $r = \beta M \cdot r + \left[\frac{1-\beta}{N}\right]_N$
Complete algorithm:
Input: directed graph $G$ (with spider traps and dead ends) and parameter $\beta$
Output: PageRank vector $r$

Set $r_j^{(0)} = \frac{1}{N}$, $t = 1$
do:
  $\forall j:\ r'^{(t)}_j = \sum_{i \to j} \beta \frac{r_i^{(t-1)}}{d_i}$
  ($r'^{(t)}_j = 0$ if the in-degree of $j$ is 0)
  Now re-insert the leaked PageRank:
  $\forall j:\ r_j^{(t)} = r'^{(t)}_j + \frac{1-S}{N}$, where $S = \sum_j r'^{(t)}_j$
  $t = t + 1$
while $\sum_j |r_j^{(t)} - r_j^{(t-1)}| > \varepsilon$

If the graph has no dead ends then the amount of leaked PageRank is $1-\beta$. But since we have dead ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
Some problems with PageRank:
- Measures generic popularity of a page: will ignore/miss topic-specific authorities.
  Solution: Topic-Specific PageRank (next)
- Uses a single measure of importance; there are other models of importance.
  Solution: Hubs-and-Authorities
- Susceptible to link spam: artificial link topologies created in order to boost page rank.
  Solution: TrustRank
Instead of generic popularity, can we measure popularity within a topic?
Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g., "sports" or "history".
Allows search queries to be answered based on the interests of the user.
Example: the query "Trojan" wants different pages depending on whether you are interested in sports, history, or computer security.
The random walker has a small probability of teleporting at any step.
The teleport can go to:
- Standard PageRank: any page with equal probability (to avoid dead-end and spider-trap problems)
- Topic-Specific PageRank: a topic-specific set of "relevant" pages (the teleport set)
Idea: Bias the random walk. When the walker teleports, she picks a page from a set S.
S contains only pages that are relevant to the topic, e.g., Open Directory (DMOZ) pages for a given topic/query.
For each teleport set S, we get a different vector r_S.
To make this work, all we need is to update the teleportation part of the PageRank formulation:
  $A_{ij} = \beta M_{ij} + (1-\beta)/|S|$ if $i \in S$
  $A_{ij} = \beta M_{ij}$ otherwise
A is stochastic!
We weighted all pages in the teleport set S equally; we could also assign different weights to pages.
Compute as for regular PageRank: multiply by M, then add a vector. This maintains sparseness.
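A sketch of topic-specific PageRank with a uniform teleport set S (my own code; applied to the earlier y/a/m example rather than the 4-node graph on the next slide):

```python
import numpy as np

def topic_pagerank(M, S, beta=0.8, iters=100):
    """Power iteration with teleports restricted to the set S."""
    N = M.shape[0]
    tele = np.zeros(N)
    tele[list(S)] = 1 / len(S)    # uniform weights over S (could be non-uniform)
    r = np.full(N, 1 / N)
    for _ in range(iters):
        r = beta * (M @ r) + (1 - beta) * tele   # never materializes the dense A
    return r

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])    # y/a/m graph with the spider trap at m
print(topic_pagerank(M, S={0}))    # teleport set S = {y}
```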
Suppose S = {1}, β = 0.8. (The graph figure with nodes 1-4 and β-scaled transition probabilities is omitted.)

  Node   Iteration:   0     1     2    ...  stable
  1                  0.25  0.4   0.28       0.294
  2                  0.25  0.1   0.16       0.118
  3                  0.25  0.3   0.32       0.327
  4                  0.25  0.2   0.24       0.261

Varying the teleport set S and β:
  S={1,2,3,4}, β=0.8:   r = [0.13, 0.10, 0.39, 0.36]
  S={1,2,3},   β=0.8:   r = [0.17, 0.13, 0.38, 0.30]
  S={1,2},     β=0.8:   r = [0.26, 0.20, 0.29, 0.23]
  S={1},       β=0.8:   r = [0.29, 0.11, 0.32, 0.26]
  S={1},       β=0.90:  r = [0.17, 0.07, 0.40, 0.36]
  S={1},       β=0.70:  r = [0.39, 0.14, 0.27, 0.19]
Create different PageRanks for different topics, e.g., the 16 DMOZ top-level categories: arts, business, sports, ...
Which topic ranking to use?
- The user can pick from a menu
- Classify the query into a topic
- Can use the context of the query, e.g., the query is launched from a web page talking about a known topic
- History of queries, e.g., "basketball" followed by "Jordan"
- User context, e.g., the user's bookmarks, ...
Random Walk with Restarts: S is a single element
[Tong-Faloutsos, '06]
Figure: a small graph with unit edge weights over nodes A, B, D, E, F, G, H, I, J.
a.k.a.: relevance, closeness, 'similarity', ...
Shortest path is not a good proximity measure:
- No effect of degree-1 nodes (E, F, G)!
- Ignores multi-faceted relationships
Network flow is not a good proximity measure either:
- Does not punish long paths
[Tong-Faloutsos, '06]
What a good proximity measure should account for:
- Multiple connections
- Quality of connection
- Direct & indirect connections
- Length, degree, weight, ...
SimRank: random walks from a fixed node on k-partite graphs.
Setting: a k-partite graph with k types of nodes, e.g., authors, conferences, tags.
Topic-Specific PageRank from node u: teleport set S = {u}.
The resulting scores measure similarity to node u.
Problem: must be done once for each node u.
Suitable for sub-Web-scale applications.
Figure: a bipartite graph of conferences (IJCAI, KDD, ICDM, SDM, AAAI, NIPS, ...) and authors (Philip S. Yu, Ning Zhong, R. Ramakrishnan, M. Jordan, ...).
Q: What is the most related conference to ICDM?
A: Topic-Specific PageRank with teleport set S = {ICDM}
Figure: conferences most related to ICDM by the resulting PageRank scores, e.g., KDD (0.011), PAKDD (0.009), PKDD (0.008), SDM (0.007), ICML, CIKM, and ICDE (0.005 each), ECML, SIGMOD, and DMKD (0.004 each).
"Normal" PageRank: teleports uniformly at random to any node; all nodes have the same probability of the surfer landing there:
  S = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
Topic-Specific PageRank, also known as Personalized PageRank: teleports to a topic-specific set of pages; nodes can have different probabilities of the surfer landing there:
  S = [0.1, 0, 0, 0.2, 0, 0, 0.5, 0, 0, 0.2]
Random Walk with Restarts: Topic-Specific PageRank where the teleport is always to the same node:
  S = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
Spamming: any deliberate action to boost a web page's position in search engine results, incommensurate with the page's real value.
Spam: web pages that are the result of spamming.
This is a very broad definition; the SEO industry might disagree! (SEO = search engine optimization)
Approximately 10-15% of web pages are spam.
Early search engines:
- Crawl the Web
- Index pages by the words they contained
- Respond to search queries (lists of words) with the pages containing those words
Early page ranking:
- Attempt to order pages matching a search query by "importance"
- First search engines considered: (1) the number of times query words appeared, and (2) the prominence of word position, e.g., title or header
As people began to use search engines to find things on the Web, those with commercial interests tried to exploit search engines to bring people to their own sites - whether the visitors wanted to be there or not.
Example: a shirt-seller might pretend to be about "movies".
Techniques for achieving high relevance/importance for a web page followed.
How do you make your page appear to be about movies?
(1) Add the word "movie" 1,000 times to your page; set the text color to the background color, so only search engines would see it.
(2) Or, run the query "movie" on your target search engine, see what page came first in the listings, and copy it into your page, made "invisible".
These and similar techniques are term spam.
Google's solution to term spam:
- Believe what people say about you, rather than what you say about yourself: use the words in the anchor text (the words that appear underlined to represent the link) and its surrounding text.
- PageRank as a tool to measure the "importance" of Web pages.
Our hypothetical shirt-seller loses:
- Saying he is about movies doesn't help, because others don't say he is about movies.
- His page isn't very important, so it won't be ranked high for shirts or movies.
Example: the shirt-seller creates 1,000 pages, each linking to his with "movie" in the anchor text.
- These pages have no links in, so they get little PageRank.
- So the shirt-seller can't beat truly important movie pages, like IMDB.
SPAM FARMING
Once Google became the dominant search engine, spammers began to work out ways to fool Google.
Spam farms were developed to concentrate PageRank on a single page.
Link spam: creating link structures that boost the PageRank of a particular page.
Three kinds of web pages from a spammer's point of view:
- Inaccessible pages
- Accessible pages: e.g., blog comment pages; the spammer can post links to his pages
- Owned pages: completely controlled by the spammer; may span multiple domain names
Spammer's goal: maximize the PageRank of target page t.
Technique:
- Get as many links as possible from accessible pages to target page t
- Construct a "link farm" to get a PageRank multiplier effect
Figure: a link farm. Accessible pages link to the target page t (owned by the spammer); t links to M farm pages (millions of them), and each farm page links back to t.
One of the most common and effective organizations for a link farm.
Let N be the number of pages on the web and M the number of pages the spammer owns.
Let x be the PageRank contributed by accessible pages, and y the PageRank of target page t.
Rank of each "farm" page: $\frac{\beta y}{M} + \frac{1-\beta}{N}$
  $y = x + \beta M \left[ \frac{\beta y}{M} + \frac{1-\beta}{N} \right] + \frac{1-\beta}{N}$
  $\quad = x + \beta^2 y + \frac{\beta(1-\beta)M}{N} + \frac{1-\beta}{N}$
The last term is very small; ignore it. Now we solve for y:
  $y = \frac{x}{1-\beta^2} + c \frac{M}{N}$, where $c = \frac{\beta}{1+\beta}$
  $y = \frac{x}{1-\beta^2} + c \frac{M}{N}$, where $c = \frac{\beta}{1+\beta}$
For $\beta = 0.85$, $1/(1-\beta^2) = 3.6$: a multiplier effect for the acquired PageRank.
By making M large, we can make y as large as we want.
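A quick arithmetic check of this formula (the values of x, M, and N below are made up for illustration):

```python
beta = 0.85
amplifier = 1 / (1 - beta**2)    # = 3.6: each unit of external rank x becomes 3.6 units of y
c = beta / (1 + beta)            # ~ 0.46: fraction of M/N the farm captures
x, M, N = 0.001, 10_000, 1_000_000_000   # hypothetical numbers
y = amplifier * x + c * M / N
print(round(amplifier, 2), y)    # 3.6, ~0.0036: the farm multiplies the spammer's rank
```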
Combating term spam:
- Analyze text using statistical methods, similar to email spam filtering
- Also useful: detecting approximate duplicate pages
Combating link spam:
- Detection and blacklisting of structures that look like spam farms (leads to another war - hiding and detecting spam farms)
- TrustRank = topic-specific PageRank with a teleport set of trusted pages, e.g., .edu domains and similar domains for non-US schools
Basic principle: approximate isolation - it is rare for a "good" page to point to a "bad" (spam) page.
- Sample a set of seed pages from the web
- Have an oracle (human) identify the good pages and the spam pages in the seed set
- This is an expensive task, so we must make the seed set as small as possible