
Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to the arrival of basket B. Intuitively: if all subsets of S are being counted, they are frequent/hot, and thus S has the potential to be hot.


  1. Power Iteration:
     - Set r_j = 1/N
     - 1: r'_j = Σ_{i→j} r_i / d_i
     - 2: r = r'
     - Goto 1

     Graph on nodes y, a, m with column-stochastic matrix M:
            y    a    m
       y    ½    ½    0
       a    ½    0    1
       m    0    ½    0

     Flow equations:
       r_y = r_y/2 + r_a/2
       r_a = r_y/2 + r_m
       r_m = r_a/2

     Example (iterations 0, 1, 2, …):
       r_y = 1/3, 1/3, 5/12,  9/24, … → 6/15
       r_a = 1/3, 3/6, 4/12, 11/24, … → 6/15
       r_m = 1/3, 1/6, 3/12,  4/24, … → 3/15
     J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 237
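The iteration above can be sketched in a few lines of Python. This uses exact fractions so the slide's values are reproduced; the node names come from the example:

```python
# Power iteration on the y/a/m graph above, with exact fractions.
from fractions import Fraction

N = 3
r = {"y": Fraction(1, N), "a": Fraction(1, N), "m": Fraction(1, N)}

for _ in range(50):
    # The dict literal is evaluated with the old r before rebinding,
    # so each line below implements one flow equation from the slide.
    r = {
        "y": r["y"] / 2 + r["a"] / 2,  # half of y's and half of a's rank
        "a": r["y"] / 2 + r["m"],      # half of y's rank plus all of m's
        "m": r["a"] / 2,               # half of a's rank
    }

# r approaches (6/15, 6/15, 3/15)
```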

  2. Details!
     - Power iteration: a method for finding the dominant eigenvector (the
       vector corresponding to the largest eigenvalue)
       r^(1) = M · r^(0)
       r^(2) = M · r^(1) = M(M r^(0)) = M² · r^(0)
       r^(3) = M · r^(2) = M(M² r^(0)) = M³ · r^(0)
     - Claim: the sequence M · r^(0), M² · r^(0), …, M^k · r^(0), … approaches
       the dominant eigenvector of M

  3. Details!
     - Claim: the sequence M · r^(0), M² · r^(0), …, M^k · r^(0), … approaches
       the dominant eigenvector of M
     - Proof:
       - Assume M has n linearly independent eigenvectors x_1, x_2, …, x_n with
         corresponding eigenvalues λ_1, λ_2, …, λ_n, where λ_1 > λ_2 > … > λ_n
       - Vectors x_1, x_2, …, x_n form a basis, so we can write:
         r^(0) = c_1 x_1 + c_2 x_2 + … + c_n x_n
       - M r^(0) = M(c_1 x_1 + c_2 x_2 + … + c_n x_n)
                 = c_1 (M x_1) + c_2 (M x_2) + … + c_n (M x_n)
                 = c_1 (λ_1 x_1) + c_2 (λ_2 x_2) + … + c_n (λ_n x_n)
       - Repeated multiplication on both sides produces
         M^k r^(0) = c_1 (λ_1^k x_1) + c_2 (λ_2^k x_2) + … + c_n (λ_n^k x_n)

  4. Details!
     - Claim: the sequence M · r^(0), M² · r^(0), …, M^k · r^(0), … approaches
       the dominant eigenvector of M
     - Proof (continued):
       - Repeated multiplication on both sides produces
         M^k r^(0) = c_1 (λ_1^k x_1) + c_2 (λ_2^k x_2) + … + c_n (λ_n^k x_n)
       - M^k r^(0) = λ_1^k [ c_1 x_1 + c_2 (λ_2/λ_1)^k x_2 + … + c_n (λ_n/λ_1)^k x_n ]
       - Since λ_1 > λ_2, the fractions λ_2/λ_1, λ_3/λ_1, … < 1,
         and so (λ_i/λ_1)^k → 0 as k → ∞ (for all i = 2 … n)
       - Thus: M^k r^(0) ≈ c_1 (λ_1^k x_1)
       - Note: if c_1 = 0 then the method won't converge
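The claim is easy to check numerically. A minimal sketch of the power method (NumPy assumed), using the y/a/m matrix from the earlier example, whose dominant eigenvalue is 1 because the matrix is column stochastic:

```python
import numpy as np

# Column-stochastic matrix from the y/a/m example
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

r = np.full(3, 1.0 / 3.0)  # r(0); any start vector with c_1 != 0 works
for _ in range(100):
    r = M @ r              # after k steps, r = M^k r(0)

# r is now (approximately) the dominant eigenvector: M r = r
```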

  5. Imagine a random web surfer:
     - At any time t, the surfer is on some page i
     - At time t+1, the surfer follows an out-link from i uniformly at random
       (each with probability 1/d_out(i))
     - Ends up on some page j linked from i
     - Process repeats indefinitely
     Let:
     - p(t) … vector whose i-th coordinate is the probability that the surfer
       is at page i at time t
     - So, p(t) is a probability distribution over pages

  6. Where is the surfer at time t+1?
     - Follows a link uniformly at random: p(t+1) = M · p(t)
     - Suppose the random walk reaches a state p(t+1) = M · p(t) = p(t);
       then p(t) is a stationary distribution of the random walk
     - Our original rank vector r satisfies r = M · r
     - So, r is a stationary distribution for the random walk

  7. A central result from the theory of random walks (a.k.a. Markov
     processes): For graphs that satisfy certain conditions, the stationary
     distribution is unique and will eventually be reached no matter what the
     initial probability distribution at time t = 0 is.

  8. r^(t+1) = M · r^(t), or equivalently r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
     - Does this converge?
     - Does it converge to what we want?
     - Are results reasonable?

  9. r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
     - Example (two nodes with a → b and b → a):
       r_a = 1, 0, 1, 0, …
       r_b = 0, 1, 0, 1, …   (iterations 0, 1, 2, …)
       The iteration oscillates and never converges.

  10. r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
      - Example (a → b, and b is a dead end):
        r_a = 1, 0, 0, 0, …
        r_b = 0, 1, 0, 0, …   (iterations 0, 1, 2, …)
        All the rank leaks out.

  11. Two problems:
      - (1) Some pages are dead ends (have no out-links)
        - Random walk has "nowhere" to go to
        - Such pages cause importance to "leak out"
      - (2) Spider traps (all out-links are within the group)
        - Random walker gets "stuck" in a trap
        - And eventually spider traps absorb all importance

  12. Power Iteration:
      - Set r_j = 1/N
      - r_j = Σ_{i→j} r_i / d_i
      - And iterate

      Graph y, a, m, where m now links only to itself (m is a spider trap):
             y    a    m
        y    ½    ½    0
        a    ½    0    0
        m    0    ½    1

      Flow equations:
        r_y = r_y/2 + r_a/2
        r_a = r_y/2
        r_m = r_a/2 + r_m

      Example (iterations 0, 1, 2, …):
        r_y = 1/3, 2/6, 3/12,  5/24, … → 0
        r_a = 1/3, 1/6, 2/12,  3/24, … → 0
        r_m = 1/3, 3/6, 7/12, 16/24, … → 1

      All the PageRank score gets "trapped" in node m.
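Running the same power iteration on this trapped graph shows the drain into m. A small sketch, with the matrix transcribed from the slide:

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],   # r_y = r_y/2 + r_a/2
              [0.5, 0.0, 0.0],   # r_a = r_y/2
              [0.0, 0.5, 1.0]])  # r_m = r_a/2 + r_m  (self-loop: the trap)

r = np.full(3, 1.0 / 3.0)
for _ in range(200):
    r = M @ r

# The trap absorbs all importance: r -> (0, 0, 1)
```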

  13. The Google solution for spider traps: At each time step, the random
      surfer has two options:
      - With probability β, follow a link at random
      - With probability 1-β, jump to some random page
      - Common values for β are in the range 0.8 to 0.9
      - The surfer will teleport out of the spider trap within a few time steps

  14. Power Iteration:
      - Set r_j = 1/N
      - r_j = Σ_{i→j} r_i / d_i
      - And iterate

      Graph y, a, m, where m now has no out-links (m is a dead end):
             y    a    m
        y    ½    ½    0
        a    ½    0    0
        m    0    ½    0

      Flow equations:
        r_y = r_y/2 + r_a/2
        r_a = r_y/2
        r_m = r_a/2

      Example (iterations 0, 1, 2, …):
        r_y = 1/3, 2/6, 3/12, 5/24, … → 0
        r_a = 1/3, 1/6, 2/12, 3/24, … → 0
        r_m = 1/3, 1/6, 1/12, 2/24, … → 0

      Here the PageRank "leaks" out since the matrix is not column stochastic.

  15. Teleports: Follow random teleport links with probability 1.0 from dead
      ends. Adjust the matrix accordingly: the all-zero column of the dead end
      m becomes uniform.

             y    a    m              y    a    m
        y    ½    ½    0         y    ½    ½    ⅓
        a    ½    0    0    →    a    ½    0    ⅓
        m    0    ½    0         m    0    ½    ⅓

  16. Why are dead ends and spider traps a problem, and why do teleports solve
      the problem?
      - Spider traps are not a problem mathematically, but with traps the
        PageRank scores are not what we want
        - Solution: Never get stuck in a spider trap by teleporting out of it
          in a finite number of steps
      - Dead ends are a problem
        - The matrix is not column stochastic, so our initial assumptions are
          not met
        - Solution: Make the matrix column stochastic by always teleporting
          when there is nowhere else to go

  17. Google's solution that does it all: At each step, the random surfer has
      two options:
      - With probability β, follow a link at random
      - With probability 1-β, jump to some random page
      - PageRank equation [Brin-Page, '98]:
        r_j = Σ_{i→j} β r_i / d_i + (1-β) · 1/N
        (d_i … out-degree of node i)
      This formulation assumes that M has no dead ends. We can either
      preprocess matrix M to remove all dead ends or explicitly follow random
      teleport links with probability 1.0 from dead ends.

  18. PageRank equation [Brin-Page, '98]:
        r_j = Σ_{i→j} β r_i / d_i + (1-β) · 1/N
      - The Google Matrix A:
        A = β M + (1-β) [1/N]_{N×N}
        ([1/N]_{N×N} … N-by-N matrix where all entries are 1/N)
      - We have a recursive problem: r = A · r,
        and the Power method still works!
      - What is β?
        - In practice β = 0.8, 0.9 (make 5 steps on avg., then jump)

  19. Example: A = 0.8 · M + 0.2 · [1/N]_{N×N}

             y    a    m                ⅓  ⅓  ⅓
        y    ½    ½    0
    M = a    ½    0    0      [1/N] =   ⅓  ⅓  ⅓
        m    0    ½    1                ⅓  ⅓  ⅓

             7/15  7/15   1/15
    A =      7/15  1/15   1/15
             1/15  7/15  13/15

    Power iteration (iterations 0, 1, 2, …):
        r_y = 1/3, 0.33, 0.24, 0.26, … → 7/33
        r_a = 1/3, 0.20, 0.20, 0.18, … → 5/33
        r_m = 1/3, 0.46, 0.52, 0.56, … → 21/33
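A quick sketch of this computation: build A from M and the uniform matrix, then power-iterate. The limit (7/33, 5/33, 21/33) matches the slide.

```python
import numpy as np

beta, N = 0.8, 3
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])                # m is a spider trap
A = beta * M + (1 - beta) / N * np.ones((N, N))

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r

# r -> (7/33, 5/33, 21/33): m no longer captures all the rank
```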

  20. Key step is matrix-vector multiplication: r_new = A · r_old
      - Easy if we have enough main memory to hold A, r_old, r_new
      - Say N = 1 billion pages, and we need 4 bytes for each entry (say):
        - 2 billion entries for the two vectors, approx. 8 GB
        - Matrix A has N² entries: 10^18 is a large number!
      - A = β·M + (1-β) [1/N]_{N×N}, e.g. with β = 0.8:

            ½ ½ 0         ⅓ ⅓ ⅓     7/15 7/15  1/15
    A = 0.8 ½ 0 0  + 0.2  ⅓ ⅓ ⅓  =  7/15 1/15  1/15
            0 ½ 1         ⅓ ⅓ ⅓     1/15 7/15 13/15

  21. Suppose there are N pages. Consider page i, with d_i out-links.
      - We have M_ji = 1/|d_i| when i → j, and M_ji = 0 otherwise
      - The random teleport is equivalent to:
        - Adding a teleport link from i to every other page, with transition
          probability (1-β)/N
        - Reducing the probability of following each out-link from 1/|d_i| to
          β/|d_i|
        - Equivalently: Tax each page a fraction (1-β) of its score and
          redistribute it evenly

  22. r = A · r, where A_ji = β M_ji + (1-β)/N

      r_j = Σ_{i=1..N} A_ji · r_i
          = Σ_{i=1..N} [ β M_ji + (1-β)/N ] · r_i
          = Σ_{i=1..N} β M_ji · r_i + (1-β)/N · Σ_{i=1..N} r_i
          = Σ_{i=1..N} β M_ji · r_i + (1-β)/N        (since Σ_i r_i = 1)

      So we get: r = β M · r + [(1-β)/N]_N
      ([x]_N … a vector of length N with all entries x)
      Note: Here we assumed M has no dead ends.

  23. We just rearranged the PageRank equation
        r = β M · r + [(1-β)/N]_N
      where [(1-β)/N]_N is a vector with all N entries equal to (1-β)/N
      - M is a sparse matrix! (with no dead ends)
        - ~10 links per node, approx. 10N entries
      - So in each iteration, we need to:
        - Compute r_new = β M · r_old
        - Add a constant value (1-β)/N to each entry in r_new
        - Note: if M contains dead ends then Σ_j r_new_j < 1 and we also have
          to renormalize r_new so that it sums to 1

  24. The complete algorithm
      - Input:
        - Directed graph G (can have spider traps and dead ends)
        - Parameter β
      - Output: PageRank vector r_new
      - Set: r_old_j = 1/N
      - Repeat until convergence (Σ_j |r_new_j - r_old_j| < ε):
        - ∀j: r'_new_j = Σ_{i→j} β r_old_i / d_i
              r'_new_j = 0 if in-degree of j is 0
        - Now re-insert the leaked PageRank:
          ∀j: r_new_j = r'_new_j + (1 - S)/N, where S = Σ_j r'_new_j
        - r_old = r_new
      If the graph has no dead ends then the amount of leaked PageRank is 1-β.
      But since we have dead ends the amount of leaked PageRank may be larger.
      We have to explicitly account for it by computing S.
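The algorithm above can be sketched directly in Python. The graph below is a hypothetical 4-node adjacency list in which node 3 is a dead end, so the leak S < 1 is visible:

```python
def pagerank(out_links, N, beta=0.8, eps=1e-10):
    """PageRank with dead-end handling via re-insertion of leaked rank."""
    r = [1.0 / N] * N
    while True:
        r_new = [0.0] * N
        for i, dests in out_links.items():           # edges i -> j
            for j in dests:
                r_new[j] += beta * r[i] / len(dests)
        S = sum(r_new)                               # S < 1 if there are dead ends
        r_new = [x + (1.0 - S) / N for x in r_new]   # re-insert leaked PageRank
        if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:
            return r_new
        r = r_new

# Hypothetical graph: 0 -> 1,2; 1 -> 2; 2 -> 0; node 3 is a dead end
ranks = pagerank({0: [1, 2], 1: [2], 2: [0]}, N=4)
```

Because the leaked mass is redistributed each pass, the result always sums to 1 and the dead end still receives its share of the teleport mass.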

  25. Encode the sparse matrix using only its nonzero entries
      - Space proportional roughly to the number of links
      - Say 10N, or 4*10*1 billion = 40 GB
      - Still won't fit in memory, but will fit on disk

      source node | degree | destination nodes
      ------------|--------|-----------------------
      0           | 3      | 1, 5, 7
      1           | 5      | 17, 64, 113, 117, 245
      2           | 2      | 13, 23

  26. Assume enough RAM to fit r_new into memory
      - Store r_old and matrix M on disk
      - One step of power-iteration is:
        Initialize all entries of r_new = (1-β)/N
        For each page i (of out-degree d_i):
          Read into memory: i, d_i, dest_1, …, dest_{d_i}, r_old(i)
          For j = 1 … d_i:
            r_new(dest_j) += β r_old(i) / d_i

      source | degree | destination
      -------|--------|-------------------
      0      | 3      | 1, 5, 6
      1      | 4      | 17, 64, 113, 117
      2      | 2      | 13, 23
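A sketch of this one streaming step, with the "disk file" simulated by a list of (source, degree, destinations) rows. The nodes, edges, and sizes are hypothetical; only r_new is assumed to fit in memory:

```python
beta, N = 0.8, 7
disk_rows = [                     # (i, d_i, [dest_1, ..., dest_{d_i}])
    (0, 3, [1, 5, 6]),
    (1, 4, [0, 2, 3, 4]),
    (2, 2, [5, 6]),
]
r_old = [1.0 / N] * N             # in practice streamed from disk entry by entry

r_new = [(1 - beta) / N] * N      # initialize every entry to (1-beta)/N
for i, d_i, dests in disk_rows:   # read i, d_i, dest_1..dest_{d_i}, r_old(i)
    for j in dests:
        r_new[j] += beta * r_old[i] / d_i
```

Note that nodes 3-6 have no rows (they are dead ends here), so r_new sums to less than 1; a full implementation would renormalize as on the previous slide.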

  27. Assume enough RAM to fit r_new into memory
      - Store r_old and matrix M on disk
      - In each iteration, we have to:
        - Read r_old and M
        - Write r_new back to disk
      - Cost per iteration of Power method: = 2|r| + |M|
      - Question: What if we could not even fit r_new in memory?

  28. Break r_new into k blocks that fit in memory; scan M and r_old once for
      each block.

      src | degree | destination
      ----|--------|------------
      0   | 4      | 0, 1, 3, 5
      1   | 2      | 0, 5
      2   | 2      | 3, 4

      (r_new is split into blocks, e.g. {0,1}, {2,3}, {4,5}; M and r_old are
      read in full for each block.)

  29. Similar to nested-loop join in databases
      - Break r_new into k blocks that fit in memory
      - Scan M and r_old once for each block
      - Total cost:
        - k scans of M and r_old
        - Cost per iteration of Power method:
          k(|M| + |r|) + |r| = k|M| + (k+1)|r|
      - Can we do better?
        - Hint: M is much bigger than r (approx. 10-20x), so we must avoid
          reading it k times per iteration

  30. Break M into stripes! Each stripe contains only destination nodes in the
      corresponding block of r_new. For the example above (blocks {0,1}, {2,3},
      {4,5}):

      Stripe 1: src | degree | destination    Stripe 2: src | degree | destination
                0   | 4      | 0, 1                     0   | 4      | 3
                1   | 2      | 0                        2   | 2      | 3

      Stripe 3: src | degree | destination
                0   | 4      | 5
                1   | 2      | 5
                2   | 2      | 4

  31. Break M into stripes
      - Each stripe contains only destination nodes in the corresponding block
        of r_new
      - Some additional overhead per stripe
        - But it is usually worth it
      - Cost per iteration of Power method: = |M|(1+ε) + (k+1)|r|

  32. - Measures generic popularity of a page
        - Biased against topic-specific authorities
        - Solution: Topic-Specific PageRank (next)
      - Uses a single measure of importance
        - Other models of importance
        - Solution: Hubs-and-Authorities
      - Susceptible to link spam
        - Artificial link topologies created in order to boost page rank
        - Solution: TrustRank

  33. n lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org Mining of Massive Datasets http://www.mmds.org

  34. (Figure: example PageRank scores on a small web graph — B = 38.4,
      C = 34.3, E = 8.1, D = 3.9, F = 3.9, A = 3.3, and five peripheral nodes
      with 1.6 each.)

  35. Example: A = 0.8 · M + 0.2 · [1/N]_{N×N}
      (e.g. A_yy = 0.8·½ + 0.2·⅓ = 7/15 and A_mm = 0.8·1 + 0.2·⅓ = 13/15)

             y    a    m                ⅓  ⅓  ⅓
        y    ½    ½    0
    M = a    ½    0    0      [1/N] =   ⅓  ⅓  ⅓
        m    0    ½    1                ⅓  ⅓  ⅓

             7/15  7/15   1/15
    A =      7/15  1/15   1/15
             1/15  7/15  13/15

    Power iteration r = A · r (iterations 0, 1, 2, …):
        r_y = 1/3, 0.33, 0.24, 0.26, … → 7/33
        r_a = 1/3, 0.20, 0.20, 0.18, … → 5/33
        r_m = 1/3, 0.46, 0.52, 0.56, … → 21/33

    Equivalently: r = β M · r + [(1-β)/N]_N

  36. The complete algorithm
      - Input:
        - Directed graph G (with spider traps and dead ends)
        - Parameter β
      - Output: PageRank vector r
      - Set: r_j^(0) = 1/N, t = 1
      - do:
        - ∀j: r'_j^(t) = Σ_{i→j} β r_i^(t-1) / d_i
              r'_j^(t) = 0 if in-degree of j is 0
        - Now re-insert the leaked PageRank:
          ∀j: r_j^(t) = r'_j^(t) + (1 - S)/N, where S = Σ_j r'_j^(t)
        - t = t + 1
      - while Σ_j |r_j^(t) - r_j^(t-1)| > ε
      If the graph has no dead ends then the amount of leaked PageRank is 1-β.
      But since we have dead ends the amount of leaked PageRank may be larger.
      We have to explicitly account for it by computing S.

  37. - Measures generic popularity of a page
        - Will ignore/miss topic-specific authorities
        - Solution: Topic-Specific PageRank (next)
      - Uses a single measure of importance
        - Other models of importance
        - Solution: Hubs-and-Authorities
      - Susceptible to link spam
        - Artificial link topologies created in order to boost page rank
        - Solution: TrustRank

  38. Instead of generic popularity, can we measure popularity within a topic?
      - Goal: Evaluate Web pages not just according to their popularity, but by
        how close they are to a particular topic, e.g. "sports" or "history"
      - Allows search queries to be answered based on the interests of the user
        - Example: The query "Trojan" wants different pages depending on
          whether you are interested in sports, history, or computer security

  39. Random walker has a small probability of teleporting at any step
      - Teleport can go to:
        - Standard PageRank: any page, with equal probability
          - To avoid dead-end and spider-trap problems
        - Topic-Specific PageRank: a topic-specific set of "relevant" pages
          (the teleport set)
      - Idea: Bias the random walk
        - When the walker teleports, she picks a page from a set S
        - S contains only pages that are relevant to the topic
          - E.g., Open Directory (DMOZ) pages for a given topic/query
        - For each teleport set S, we get a different rank vector r_S

  40. To make this work, all we need is to update the teleportation part of the
      PageRank formulation:
        A_ij = β M_ij + (1-β)/|S|   if i ∈ S
        A_ij = β M_ij + 0           otherwise
      - A is stochastic!
      - We weighted all pages in the teleport set S equally
        - Could also assign different weights to pages!
      - Compute as for regular PageRank:
        - Multiply by M, then add a vector
        - Maintains sparseness
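A minimal sketch of this update: the only change from regular PageRank is the teleport vector, which is uniform over S instead of over all pages. The 3-node graph and the teleport set are hypothetical:

```python
import numpy as np

def topic_pagerank(M, S, beta=0.8, iters=200):
    """Power iteration with teleports restricted to the set S."""
    N = M.shape[0]
    v = np.zeros(N)
    v[list(S)] = 1.0 / len(S)          # uniform teleport distribution over S
    r = np.full(N, 1.0 / N)
    for _ in range(iters):
        r = beta * (M @ r) + (1 - beta) * v   # multiply by M, then add a vector
    return r

# Hypothetical column-stochastic graph: 0 -> {1, 2}, 1 -> {0}, 2 -> {1}
M = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 1.0],
              [0.5, 0.0, 0.0]])
r = topic_pagerank(M, S={0})           # every teleport lands on node 0
```

Note M itself stays sparse and unchanged; only the added vector differs per teleport set.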

  41. Example: Suppose S = {1}, β = 0.8
      (graph on nodes 1-4; figure omitted)

      Node | Iter 0 | 1   | 2    | … | stable
      -----|--------|-----|------|---|-------
      1    | 0.25   | 0.4 | 0.28 | … | 0.294
      2    | 0.25   | 0.1 | 0.16 | … | 0.118
      3    | 0.25   | 0.3 | 0.32 | … | 0.327
      4    | 0.25   | 0.2 | 0.24 | … | 0.261

      Varying the teleport set and β:
        S={1,2,3,4}, β=0.8:  r = [0.13, 0.10, 0.39, 0.36]
        S={1,2,3},   β=0.8:  r = [0.17, 0.13, 0.38, 0.30]
        S={1,2},     β=0.8:  r = [0.26, 0.20, 0.29, 0.23]
        S={1},       β=0.8:  r = [0.29, 0.11, 0.32, 0.26]
        S={1},       β=0.70: r = [0.39, 0.14, 0.27, 0.19]
        S={1},       β=0.80: r = [0.29, 0.11, 0.32, 0.26]
        S={1},       β=0.90: r = [0.17, 0.07, 0.40, 0.36]

  42. Create different PageRanks for different topics
      - The 16 DMOZ top-level categories: arts, business, sports, …
      - Which topic ranking to use?
        - User can pick from a menu
        - Classify query into a topic
        - Can use the context of the query
          - E.g., query is launched from a web page talking about a known topic
          - History of queries, e.g., "basketball" followed by "Jordan"
        - User context, e.g., user's bookmarks, …

  43. Random Walk with Restarts: S is a single element

  44. [Tong-Faloutsos, '06]
      (Figure: a small graph on nodes A-J with unit edge weights, used to ask
      which nodes are most "related" to a query node.)
      a.k.a.: Relevance, Closeness, 'Similarity' …

  45. Shortest path is not a good proximity measure:
      - No effect of degree-1 nodes (E, F, G)!
      - Ignores multi-faceted relationships

  46. Network flow is not a good proximity measure:
      - Does not punish long paths

  47. [Tong-Faloutsos, '06]
      (Figure: the same A-J graph.) A good proximity measure should account for:
      - Multiple connections
      - Quality of connection
      - Direct & indirect connections
      - Length, degree, weight, …

  48. SimRank: Random walks from a fixed node on k-partite graphs
      - Setting: k-partite graph with k types of nodes
        - E.g.: Authors, Conferences, Tags
      - Topic-Specific PageRank from node u: teleport set S = {u}
      - Resulting scores measure similarity to node u
      - Problem:
        - Must be done once for each node u
        - Suitable for sub-Web-scale applications

  49. (Figure: bipartite Conference-Author graph with conference nodes such as
      IJCAI, KDD, ICDM, SDM, AAAI, NIPS and author nodes Philip S. Yu,
      Ning Zhong, R. Ramakrishnan, M. Jordan, …)
      Q: What is the most related conference to ICDM?
      A: Topic-Specific PageRank with teleport set S = {ICDM}

  50. (Figure: conferences related to ICDM by Topic-Specific PageRank score —
      KDD, PAKDD, PKDD, SDM, ICML, CIKM, ICDE, ECML, SIGMOD, DMKD, with scores
      ranging from 0.011 down to 0.004.)

  51. "Normal" PageRank:
      - Teleports uniformly at random to any node
      - All nodes have the same probability of the surfer landing there:
        S = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
      Topic-Specific PageRank, also known as Personalized PageRank:
      - Teleports to a topic-specific set of pages
      - Nodes can have different probabilities of the surfer landing there:
        S = [0.1, 0, 0, 0.2, 0, 0, 0.5, 0, 0, 0.2]
      Random Walk with Restarts:
      - Topic-Specific PageRank where the teleport is always to the same node:
        S = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

  52. Spamming:
      - Any deliberate action to boost a web page's position in search engine
        results, incommensurate with the page's real value
      Spam:
      - Web pages that are the result of spamming
      - This is a very broad definition
        - The SEO industry might disagree!
        - SEO = search engine optimization
      - Approximately 10-15% of web pages are spam

  53. Early search engines:
      - Crawl the Web
      - Index pages by the words they contained
      - Respond to search queries (lists of words) with the pages containing
        those words
      Early page ranking:
      - Attempt to order pages matching a search query by "importance"
      - First search engines considered:
        - (1) Number of times query words appeared
        - (2) Prominence of word position, e.g. title, header

  54. As people began to use search engines to find things on the Web, those
      with commercial interests tried to exploit search engines to bring people
      to their own sites, whether the users wanted to be there or not.
      - Example: A shirt-seller might pretend to be about "movies"
      - Techniques for achieving high relevance/importance for a web page

  55. How do you make your page appear to be about movies?
      - (1) Add the word "movie" 1,000 times to your page
        - Set the text color to the background color, so only search engines
          would see it
      - (2) Or, run the query "movie" on your target search engine
        - See what page came first in the listings
        - Copy it into your page, and make it "invisible"
      - These and similar techniques are term spam

  56. Believe what people say about you, rather than what you say about
      yourself
      - Use words in the anchor text (the words that appear underlined to
        represent the link) and its surrounding text
      - PageRank as a tool to measure the "importance" of Web pages

  57. Our hypothetical shirt-seller loses:
      - Saying he is about movies doesn't help, because others don't say he is
        about movies
      - His page isn't very important, so it won't be ranked highly for shirts
        or movies
      - Example:
        - Shirt-seller creates 1,000 pages, each linking to his page with
          "movie" in the anchor text
        - These pages have no links in, so they get little PageRank
        - So the shirt-seller can't beat truly important movie pages, like IMDB


  59. SPAM FARMING

  60. Once Google became the dominant search engine, spammers began to work out
      ways to fool Google
      - Spam farms were developed to concentrate PageRank on a single page
      - Link spam: Creating link structures that boost the PageRank of a
        particular page

  61. Three kinds of web pages from a spammer's point of view:
      - Inaccessible pages
      - Accessible pages
        - E.g., blog comment pages
        - The spammer can post links to his pages
      - Owned pages
        - Completely controlled by the spammer
        - May span multiple domain names

  62. Spammer's goal: Maximize the PageRank of target page t
      - Technique:
        - Get as many links as possible from accessible pages to target page t
        - Construct a "link farm" to get a PageRank multiplier effect

  63. (Figure: link-farm structure. Accessible pages link to the target page t;
      t links to M owned "farm" pages 1, 2, …, M, each of which links back to
      t; inaccessible pages sit outside.)
      One of the most common and effective organizations for a link farm.

  64. N … # pages on the web, M … # of pages the spammer owns
      - x: PageRank contributed by accessible pages
      - y: PageRank of target page t
      - Rank of each "farm" page: β y / M + (1-β)/N
      - y = x + β M [ β y / M + (1-β)/N ] + (1-β)/N
          = x + β² y + β(1-β) M/N + (1-β)/N
        The last term (1-β)/N is very small; ignore it.
      - Solving for y:
        y = x / (1-β²) + c · M/N, where c = β / (1+β)

  65. y = x / (1-β²) + c · M/N, where c = β / (1+β)
      - For β = 0.85, 1/(1-β²) = 3.6
      - Multiplier effect for acquired PageRank
      - By making M large, we can make y as large as we want
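A quick numeric check of the formula above. The values of x, M, and N are hypothetical, chosen only to show the amplification:

```python
# Spam-farm model: y = x/(1 - beta^2) + c*M/N, with c = beta/(1 + beta)
beta = 0.85
multiplier = 1.0 / (1.0 - beta ** 2)   # amplification of acquired PageRank
c = beta / (1.0 + beta)

x = 0.0001                             # rank contributed by accessible pages (hypothetical)
M, N = 1_000_000, 10_000_000_000       # farm pages vs. pages on the web (hypothetical)
y = x * multiplier + c * M / N         # rank of the target page
```

For β = 0.85 the multiplier evaluates to about 3.6, matching the slide, and y grows linearly with M.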

  66. Combating term spam:
      - Analyze text using statistical methods
      - Similar to email spam filtering
      - Also useful: detecting approximate duplicate pages
      Combating link spam:
      - Detection and blacklisting of structures that look like spam farms
        - Leads to another war: hiding and detecting spam farms
      - TrustRank = topic-specific PageRank with a teleport set of trusted
        pages
        - Example: .edu domains, similar domains for non-US schools

  67. Basic principle: Approximate isolation
      - It is rare for a "good" page to point to a "bad" (spam) page
      - Sample a set of seed pages from the web
      - Have an oracle (human) identify the good pages and the spam pages in
        the seed set
        - This is an expensive task, so we must make the seed set as small as
          possible
