Fully Personalized PageRank · Similarity Search

To Randomize or Not To Randomize: Space Optimal Summaries for Hyperlink Analysis

Tamás Sarlós, Eötvös University and Computer and Automation Institute, Hungarian Academy of Sciences

Joint work with András A. Benczúr, Károly Csalogány, Dániel Fogaras, and Balázs Rácz

Tamás Sarlós et al., Hungarian Academy of Sciences: Space Optimal Summaries for Hyperlink Analysis

Contents

1. Efficient algorithms for Fully Personalized PageRank
   ◮ Definition, motivation and preliminaries
   ◮ Rounding
   ◮ Sketching
   ◮ Lower bounds
   ◮ Experiments
2. Link-based similarity search with SimRank
   ◮ Definition
   ◮ Reduction of SimRank to Personalized PageRank

Personalized PageRank – Definition and Motivation

Definition: random surfer with teleportation distribution $r$ and teleportation probability $c \approx 0.15$

$\mathrm{PPR}_r(u) = c \cdot r(u) + (1 - c) \sum_{v:(vu) \in E} \mathrm{PPR}_r(v) / d^+(v)$

where $d^+(v)$ is the out-degree of $v$.

Motivation: search engines
◮ Improved ranking
◮ Fighting link spam

Slow to compute naively with the power method
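
The naive power method mentioned on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `personalized_pagerank`, the dictionary graph representation, and the assumption that every page has at least one out-link are all choices made for the example.

```python
def personalized_pagerank(out_links, r, c=0.15, iterations=200):
    """Power iteration for PPR_r(u) = c*r(u) + (1-c) * sum_{v->u} PPR_r(v)/d+(v).

    out_links: dict mapping each node to a non-empty list of successors
               (assumption: no dangling nodes).
    r:         dict of teleportation probabilities that sum to 1.
    """
    ppr = {u: r.get(u, 0.0) for u in out_links}
    for _ in range(iterations):
        # Teleportation part of the update.
        nxt = {u: c * r.get(u, 0.0) for u in out_links}
        # Each page v passes (1-c) of its score, split evenly over its out-links.
        for v, succs in out_links.items():
            share = (1.0 - c) * ppr[v] / len(succs)
            for u in succs:
                nxt[u] += share
        ppr = nxt
    return ppr


# Hypothetical three-page graph; the surfer always teleports to page "a".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = personalized_pagerank(graph, {"a": 1.0})
```

Every iteration touches all edges, and a separate run is needed for each teleportation distribution, which is why this approach is slow for full personalization.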

Personalized PageRank – Linearity

Linearity:

$\mathrm{PPR}_{\alpha_1 r_1 + \alpha_2 r_2}(u) = \alpha_1 \mathrm{PPR}_{r_1}(u) + \alpha_2 \mathrm{PPR}_{r_2}(u)$

Single page teleportation suffices:

$\mathrm{PPR}_r(u) = \sum_v r(v) \cdot \mathrm{PPR}_v(u)$
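
Linearity can be checked numerically on a toy graph. The sketch below compares PPR computed directly for a mixed teleportation distribution with the weighted sum of single-page PPR vectors. The helper `ppr`, the graph, and the weights are illustrative assumptions, and pages without out-links are not handled.

```python
def ppr(out_links, r, c=0.15, iterations=200):
    """Plain power iteration; assumes every node has at least one out-link."""
    vec = {u: r.get(u, 0.0) for u in out_links}
    for _ in range(iterations):
        nxt = {u: c * r.get(u, 0.0) for u in out_links}
        for v, succs in out_links.items():
            for u in succs:
                nxt[u] += (1.0 - c) * vec[v] / len(succs)
        vec = nxt
    return vec


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = {"a": 0.5, "b": 0.3, "c": 0.2}

# PPR for the mixed teleportation distribution r, computed directly.
direct = ppr(graph, r)

# The same vector assembled from single-page PPR vectors PPR_v.
single = {v: ppr(graph, {v: 1.0}) for v in graph}
combined = {u: sum(r[v] * single[v][u] for v in graph) for u in graph}

max_gap = max(abs(direct[u] - combined[u]) for u in graph)
```

Here `max_gap` is at the level of floating-point rounding, which is what makes a database of single-page vectors sufficient for answering queries with arbitrary `r`.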

Personalized PageRank – Preliminaries

Two-phase algorithm
1. precomputes a PPR database
2. answers PageRank queries using the database

Exact PPR on a graph of n ≈ millions ... billions of vertices:

Method                              Storage requirement        Personalization
Topic sensitive [Haveliwala 02]     O(t · n) words             t ≈ 10–100 topics
Hub decomposition [Jeh–Widom 03]    O(h · n) words             h ≈ 100,000 pages
Lower bound [Fogaras–Rácz 04]       Ω(n²) bits, infeasible     all pages

Sampling Fully Personalized PageRank

◮ Express $\mathrm{PPR}_u(v)$ as the probability that a random walk starting at $u$ ends in $v$
◮ Sample ending points of random walks as above
◮ First algorithm with no restriction on $u$
◮ Additive error $\pm\epsilon$; probability $\delta$ of falling outside this bound
◮ Uses $O(n \cdot \epsilon^{-2} \log(1/\delta) \log n)$ bits of space
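
A minimal sketch of the sampling idea, under stated assumptions: the walk stops with probability c before every step and otherwise follows a uniformly random out-link, so its endpoint is distributed as PPR_u. The function names, the toy graph, the fixed seed, and the sample count are illustrative, not taken from the talk.

```python
import random


def sample_endpoint(out_links, start, c, rng):
    """Walk from `start`; stop with probability c before each step."""
    node = start
    while rng.random() >= c:  # continue with probability 1 - c
        node = rng.choice(out_links[node])
    return node


def monte_carlo_ppr(out_links, start, c=0.15, samples=20000, seed=0):
    """Estimate PPR_start(v) as the fraction of walks that end in v."""
    rng = random.Random(seed)
    counts = {u: 0 for u in out_links}
    for _ in range(samples):
        counts[sample_endpoint(out_links, start, c, rng)] += 1
    return {u: counts[u] / samples for u in out_links}


# Hypothetical graph in which every page has at least one out-link.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
estimate = monte_carlo_ppr(graph, "a")
```

Only the sampled endpoints need to be stored per page, which is where the dependence of the space bound on the number of samples comes from.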

Power Iteration and Dynamic Programming

[Figure: example graph on the vertices u, v_1, ..., v_6, w]

Power iteration amplifies the error downwards

Dynamic programming [Jeh–Widom WWW 2003] averages the error upward

$\mathrm{PPR}_u^{(k+1)} = c\,\chi_u + (1 - c) \cdot \sum_{v:(uv) \in E} \mathrm{PPR}_v^{(k)} / d^+(u)$

Problem: in a small-world graph, the number of non-zeros grows quickly in $u$'s neighborhood

Rounded Dynamic Programming

Repeat $k_{\max} = 2 \log_{1-c} \epsilon$ times for all $u$:

$\widehat{\mathrm{PPR}}_u = \mathrm{Round}_k\Big(c\,\chi_u + (1 - c) \cdot \sum_{v:(uv) \in E} \widehat{\mathrm{PPR}}_v / d^+(u)\Big)$

◮ Space: $n$ sparse $\widehat{\mathrm{PPR}}_u$ vectors in $O(n \cdot (1/\epsilon) \log n)$ bits, optimal for top queries
◮ Can gradually decrease rounding error $\epsilon_k$ from $\epsilon_1 = 1$ to $\epsilon_{k_{\max}} = \epsilon$
◮ Deterministic output; inductive proof shows $\mathrm{PPR}_u(v) - 2\epsilon/c \le \widehat{\mathrm{PPR}}_u(v) \le \mathrm{PPR}_u(v)$
◮ Preprocessing: linear $O((n + m)/(c\epsilon))$ time
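
The rounded iteration can be sketched as below. Two assumptions are made because the slide does not spell them out: Round is taken to round every entry down to a multiple of ε (dropping entries below ε), and a single fixed ε is used in all iterations instead of the gradually decreasing ε_k. The function name and the toy graph are illustrative.

```python
import math


def rounded_dp(out_links, c=0.15, eps=1e-3):
    """Dynamic programming over sparse, rounded PPR_u vectors.

    Assumptions: every node has an out-link, Round floors each entry to a
    multiple of eps, and eps is the same in every iteration.
    """
    k_max = math.ceil(2 * math.log(eps) / math.log(1 - c))
    approx = {u: {} for u in out_links}  # start from all-zero sparse vectors
    for _ in range(k_max):
        nxt = {}
        for u, succs in out_links.items():
            vec = {u: c}  # the c * chi_u term
            for v in succs:
                for w, val in approx[v].items():
                    vec[w] = vec.get(w, 0.0) + (1 - c) * val / len(succs)
            # Round down and keep only entries that are at least eps.
            nxt[u] = {w: math.floor(x / eps) * eps
                      for w, x in vec.items() if x >= eps}
        approx = nxt
    return approx


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
approx = rounded_dp(graph)
```

Because entries are only ever rounded down, the result never exceeds the exact value, and each stored vector has at most 1/ε non-zero entries.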

Dynamic Programming with Sketches

Drunken Surfer

◮ Mix up memories by a random hash $h(v)$ of pages $v$

$\mathrm{SPPR}_u(i) = \sum_{v: h(v) = i} \mathrm{PPR}_u(v)$ for $i = 1, \ldots, 2e/\epsilon$

◮ Use surfers for $j = 1, \ldots, \log 1/\delta$ and use minimum vote: Count-Min Sketch [Cormode–Muthukrishnan 05]

$\widehat{\mathrm{PPR}}_u(v) = \min_{j = 1, \ldots, \log 1/\delta} \mathrm{SPPR}_u^{(j)}(h_j(v))$
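
A small Count-Min sketch over a single PPR vector might look like this. It is a sketch under assumptions: the hash functions are random lookup tables over the known pages rather than the pairwise-independent families used in practice, the logarithm in the number of rows is taken as natural, and the vector `ppr_u` is made-up data.

```python
import math
import random


def build_sketch(vector, eps, delta, seed=0):
    """Hash the entries of `vector` into depth rows of width buckets."""
    width = math.ceil(2 * math.e / eps)
    depth = max(1, math.ceil(math.log(1 / delta)))
    rng = random.Random(seed)
    # Stand-in hash functions: one random bucket per page and per row.
    hashes = [{v: rng.randrange(width) for v in vector} for _ in range(depth)]
    rows = [[0.0] * width for _ in range(depth)]
    for v, value in vector.items():
        for j in range(depth):
            rows[j][hashes[j][v]] += value
    return rows, hashes


def query(rows, hashes, v):
    """Minimum vote over the rows; never below the true value."""
    return min(row[h[v]] for row, h in zip(rows, hashes))


# Hypothetical PPR_u vector with geometrically decaying scores.
ppr_u = {f"page{i}": 0.5 ** (i + 1) for i in range(20)}
rows, hashes = build_sketch(ppr_u, eps=0.05, delta=0.01)
estimates = {v: query(rows, hashes, v) for v in ppr_u}
```

Since all scores are non-negative, collisions can only add mass to a bucket, so taking the minimum over rows gives the tightest of several overestimates.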

Dynamic Programming with Sketches Cont'd

◮ Dynamic programming over sketches by their linearity
◮ A variant also gives linear time preprocessing
◮ $O(n \cdot (1/\epsilon) \log(1/\delta))$ bits of space, optimal for value queries

$\mathrm{PPR}_u(v) - 2\epsilon/c - \epsilon \le \widehat{\mathrm{PPR}}_u(v) \le \mathrm{PPR}_u(v) + 2\epsilon/c$
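
Linearity is what lets the dynamic programming run directly on sketches. The following sketch, under assumptions, compares two routes on a toy graph: sketching the vectors produced by the unrounded iteration, and running the same iteration on sketches only. Rounding is left out to isolate the linearity argument, and the lookup-table hashes, function names, and parameters are illustrative.

```python
import random


def sketch(vector, hashes, width):
    """Linear map from a sparse vector to rows of hashed bucket sums."""
    rows = [[0.0] * width for _ in hashes]
    for v, val in vector.items():
        for row, h in zip(rows, hashes):
            row[h[v]] += val
    return rows


def dp_exact(out_links, c, iterations):
    """Unrounded dynamic programming on full PPR_u vectors."""
    vecs = {u: {w: 0.0 for w in out_links} for u in out_links}
    for _ in range(iterations):
        nxt = {}
        for u, succs in out_links.items():
            vec = {w: 0.0 for w in out_links}
            vec[u] = c
            for v in succs:
                for w, val in vecs[v].items():
                    vec[w] += (1 - c) * val / len(succs)
            nxt[u] = vec
        vecs = nxt
    return vecs


def dp_sketched(out_links, hashes, width, c, iterations):
    """The same recurrence, applied to sketches only."""
    sk = {u: sketch({}, hashes, width) for u in out_links}
    for _ in range(iterations):
        nxt = {}
        for u, succs in out_links.items():
            rows = sketch({u: c}, hashes, width)  # sketch of c * chi_u
            for v in succs:
                for j, row in enumerate(sk[v]):
                    for i, val in enumerate(row):
                        rows[j][i] += (1 - c) * val / len(succs)
            nxt[u] = rows
        sk = nxt
    return sk


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
width, depth = 2, 3
rng = random.Random(0)
hashes = [{v: rng.randrange(width) for v in graph} for _ in range(depth)]

full = dp_exact(graph, 0.15, 50)
small = dp_sketched(graph, hashes, width, 0.15, 50)
max_gap = max(abs(a - b)
              for u in graph
              for row_a, row_b in zip(sketch(full[u], hashes, width), small[u])
              for a, b in zip(row_a, row_b))
```

The two routes agree up to floating-point rounding, so the full vectors never have to be materialized during preprocessing.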

Lower Bounds

Reduction to one-way communication complexity of bit-vector probing

Alice holds a bit string $y \in \{0, 1\}^s$; Bob holds an index $i$ and must output $y_i$.

1. Alice creates $G(y)$
2. Alice transmits the PPR database of $G(y)$
3. Bob queries the database for $\mathrm{PPR}_{u(i)}(v(i))$

Experiments

Stanford WebBase: 80 M nodes, 800 M edges

Measured accuracy over 1000 random nodes

Effect of rounding with $k_{\max} = 35$ iterations.

[Figure: maximum error versus rounding error ε; curves: DP with rounding, worst case bound]

Quality of Approximate Rankings @ t

Precision = Recall: $\dfrac{|\text{approximate top-}t \,\cap\, \text{true top-}t|}{t}$

Kendall's Tau: $1 - \dfrac{2 \cdot \#\text{inversions in approximate top-}t}{\binom{t}{2}}$
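
Both measures are straightforward to compute from two score tables. The sketch below assumes that an inversion is a pair in the approximate top-t list whose order disagrees with the true scores, and that ties are broken by page name; the function names and the example scores are made up for illustration.

```python
from itertools import combinations


def top_t(scores, t):
    """Pages sorted by decreasing score; ties broken by name."""
    return sorted(scores, key=lambda v: (-scores[v], v))[:t]


def precision_at_t(approx, true, t):
    """Fraction of the approximate top-t that is also in the true top-t."""
    return len(set(top_t(approx, t)) & set(top_t(true, t))) / t


def kendall_tau_at_t(approx, true, t):
    """1 - 2 * inversions / C(t, 2) over the approximate top-t (needs t >= 2)."""
    ranked = top_t(approx, t)
    # A pair (a, b), with a listed before b, is inverted if the true
    # scores would have placed b ahead of a.
    inversions = sum(1 for a, b in combinations(ranked, 2) if true[a] < true[b])
    return 1 - 2 * inversions / (t * (t - 1) / 2)


true_scores = {"a": 0.40, "b": 0.30, "c": 0.20, "d": 0.10}
approx_scores = {"a": 0.35, "b": 0.20, "c": 0.30, "d": 0.15}

precision = precision_at_t(approx_scores, true_scores, 3)
tau = kendall_tau_at_t(approx_scores, true_scores, 3)
```

In this example the approximate top-3 is a, c, b: the set matches the true top-3, and one of the three pairs is inverted.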

Precision

[Figure: precision versus size of top list t (10 to 1000); curves: Rounding ε = 10^-5, Rounding ε = 2 · 10^-5, Sketch, Monte Carlo, BFS]

Kendall's Tau

[Figure: Kendall's τ versus size of top list t (10 to 1000); curves: Rounding ε = 10^-5, Rounding ε = 2 · 10^-5, Sketch, Monte Carlo, BFS]

SimRank – Preliminaries and Sampling

"Two pages are similar if pointed to by similar pages" [Jeh–Widom 02]

$\mathrm{Sim}^{(k)}(v_1, v_2) = \begin{cases} (1 - c) \cdot \dfrac{\sum_{u_1} \sum_{u_2} \mathrm{Sim}^{(k-1)}(u_1, u_2)}{d^-(v_1) \cdot d^-(v_2)} & \text{if } v_1 \ne v_2 \\ 1 & \text{if } v_1 = v_2 \end{cases}$

where $u_1$ and $u_2$ range over the in-neighbors of $v_1$ and $v_2$, and $d^-$ denotes in-degree.

$(1 - c)^{k'}$-weighted path pair summation (incl. sampling [Fogaras–Rácz 05]) over

$v_1 = w_0, w_1, \ldots, w_{k'-1}, w_{k'} = u$
$v_2 = w'_0, w'_1, \ldots, w'_{k'-1}, w'_{k'} = u$
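
The SimRank recursion can be iterated directly on a small graph. This is a minimal sketch under assumptions: similarity starts from the identity, a pair involving a page without in-links gets similarity 0, and the function name, the graph, and c = 0.6 (decay factor 1 − c = 0.4) are illustrative.

```python
def simrank(in_links, c=0.6, iterations=10):
    """Iterate Sim^(k) from Sim^(0) = identity; decay factor is 1 - c."""
    nodes = list(in_links)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        nxt = {}
        for v1 in nodes:
            for v2 in nodes:
                if v1 == v2:
                    nxt[(v1, v2)] = 1.0
                elif not in_links[v1] or not in_links[v2]:
                    nxt[(v1, v2)] = 0.0  # assumption: no in-links, no similarity
                else:
                    total = sum(sim[(u1, u2)]
                                for u1 in in_links[v1]
                                for u2 in in_links[v2])
                    pairs = len(in_links[v1]) * len(in_links[v2])
                    nxt[(v1, v2)] = (1 - c) * total / pairs
        sim = nxt
    return sim


# Hypothetical graph given by in-links: "a" points to "b" and "c",
# and both "b" and "c" point to "d".
in_links = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
sim = simrank(in_links)
```

Pages "b" and "c" share their only in-neighbor, so their similarity equals the decay factor 1 − c. The all-pairs table is quadratic in the number of pages, which is why sampling and the reduction to Personalized PageRank matter at web scale.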

SimRank – Reduction to Personalized PageRank

Version 0 reduction: count path pairs from $v_1$ and $v_2$ that may meet several times

$\mathrm{Sim}^{(0)}_{v_1, v_2} = \sum_{k > 0} (1 - c)^k \sum_u \mathrm{RP}^{[k]}_{v_1}(u)\, \mathrm{RP}^{[k]}_{v_2}(u)$

Recursively define the self-similarity SimRank of at least $t + 1$ inner meeting points as $\mathrm{SSim}^{(t+1)}(v)$

SimRank – Reduction to Personalized PageRank

Obtain SimRank by inclusion-exclusion of self-similarities

$\mathrm{Sim}(v_1, v_2) = \sum_{k > 0} (1 - c)^k \sum_u \mathrm{RP}^{[k]}_{v_1}(u)\, \mathrm{RP}^{[k]}_{v_2}(u) \cdot \mathrm{SSim}(u)$

$\mathrm{SSim}(u) = 1 - \mathrm{SSim}^{(0)}(u) + \mathrm{SSim}^{(1)}(u) - \mathrm{SSim}^{(2)}(u) + \ldots$

Converges for $1 - c < 1/2$; carrying the approximation through requires some technicalities

Conclusion

◮ Efficient algorithms + lower bounds = space-optimal summaries for
   ◮ Fully Personalized PageRank and for
   ◮ SimRank with decay factor < 1/2
◮ At the heart of it: low space approximation of large vectors in the $\|\cdot\|_\infty$ norm
◮ Works well in practice

Thank you!

◮ http://www.ilab.sztaki.hu/websearch

Algorithms Compared

Algorithm                                                                      Running time
Dynamic Programming with ε = 2 · 10^-5 and ε = 10^-5 rounding to varying ε_k   1.5 and 2.25 days
Dynamic Programming with ε = 6 · 10^-3, δ = 4 · 10^-3 sketches                 6 days
Monte Carlo sampling with N = 10000 samples                                    6 days
Breadth First Search heuristic                                                 3.5 days