Chapter 5: Link Analysis for Authority Scoring
IRDM WS 2005

5.1 PageRank (S. Brin and L. Page 1997/1998)
5.2 HITS (J. Kleinberg 1997/1999)
5.3 Comparison and Extensions
5.4 Topic-specific and Personalized PageRank
5.5 Efficiency Issues
5.6 Online Page Importance
5.7 Spam-Resilient Authority Scoring

5.3 Comparison and Extensions

The literature contains a plethora of variations on PageRank and HITS. Key points are:
• mutual reinforcement between hubs and authorities
• re-scaling of edge weights (normalization)

Unified notation (for a link graph with n nodes):
L    - n × n link matrix, $L_{ij} = 1$ if there is an edge (i,j), 0 else
din  - n × 1 vector with $(d_{in})_i = indegree(i)$; $D_{in} = diag(d_{in})$ is n × n
dout - n × 1 vector with $(d_{out})_i = outdegree(i)$; $D_{out} = diag(d_{out})$ is n × n
x    - n × 1 authority vector
y    - n × 1 hub vector
Iop  - operation applied to incoming links
Oop  - operation applied to outgoing links

HITS and PageRank in Unified Framework

HITS: $x = Iop(y)$, $y = Oop(x)$
  with $Iop(y) = L^T y$, $Oop(x) = L x$

PageRank: $x = Iop(x)$
  with $Iop(x) = P^T x$
  with $P^T = L^T D_{out}^{-1}$ or $P^T = \alpha L^T D_{out}^{-1} + (1-\alpha) \frac{1}{n} e e^T$
  (e is the n × 1 vector of ones)

SALSA (PageRank-style computation with mutual reinforcement):
  $x = Iop(y)$ with $Iop(y) = P^T y$ with $P^T = L^T D_{out}^{-1}$
  $y = Oop(x)$ with $Oop(x) = Q x$ with $Q = L D_{in}^{-1}$

Other models of link analysis can be cast into this framework, too.
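
To make the Iop/Oop view concrete, here is a minimal sketch of HITS and PageRank as fixed-point iterations over a 0/1 link matrix. The function names, the toy graph, the iteration count, and the L1 rescaling after each round are illustrative assumptions, not part of the slides. The PageRank variant assumes every node has at least one out-link.

```python
def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def normalize(v):
    """Rescale v to L1 norm 1 (left unchanged if it is all zeros)."""
    s = sum(abs(t) for t in v)
    return [t / s for t in v] if s else v


def hits(L, iters=100):
    """HITS: x = Iop(y) = L^T y, y = Oop(x) = L x, rescaled each round."""
    n = len(L)
    LT = [list(col) for col in zip(*L)]
    x = [1.0 / n] * n
    y = [1.0 / n] * n
    for _ in range(iters):
        x = normalize(matvec(LT, y))
        y = normalize(matvec(L, x))
    return x, y


def pagerank(L, alpha=0.85, iters=100):
    """PageRank: x = P^T x with P^T = alpha L^T Dout^-1 + (1 - alpha)(1/n) e e^T.

    Assumes every node has at least one out-link (no dangling nodes).
    """
    n = len(L)
    dout = [sum(row) for row in L]
    x = [1.0 / n] * n
    for _ in range(iters):
        # Random-jump term; equals (1 - alpha)(1/n) e e^T x while sum(x) = 1.
        new = [(1.0 - alpha) / n] * n
        for i in range(n):
            for j in range(n):
                if L[i][j]:
                    # Node i passes x_i / dout_i along each of its out-links.
                    new[j] += alpha * x[i] / dout[i]
        x = normalize(new)  # guards against rounding drift
    return x


# Hypothetical toy graph with edges 0 -> 2, 1 -> 2, 2 -> 0.
L = [[0, 0, 1],
     [0, 0, 1],
     [1, 0, 0]]
authority, hub = hits(L)
ranks = pagerank(L)
```

On this toy graph HITS concentrates the authority mass on node 2, while PageRank also gives node 0 a high score because node 2 passes its full weight to it.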

A Family of Link Analysis Methods

General scheme:
$Iop(\cdot) = D_{in}^{-p} L^T D_{out}^{-q} (\cdot)$ and $Oop(\cdot) = Iop^T(\cdot)$

Specific instances, each applied to x and y as $x = Iop(y)$, $y = Oop(x)$:

Out-link normalized Rank (Onorm-Rank):
$Iop(\cdot) = L^T D_{out}^{-1/2} (\cdot)$, $Oop(\cdot) = D_{out}^{-1/2} L (\cdot)$

In-link normalized Rank (Inorm-Rank):
$Iop(\cdot) = D_{in}^{-1/2} L^T (\cdot)$, $Oop(\cdot) = L D_{in}^{-1/2} (\cdot)$

Symmetric normalized Rank (Snorm-Rank):
$Iop(\cdot) = D_{in}^{-1/2} L^T D_{out}^{-1/2} (\cdot)$, $Oop(\cdot) = D_{out}^{-1/2} L D_{in}^{-1/2} (\cdot)$

Some properties of Snorm-Rank:
$x = Iop(y) = Iop(Oop(x))$
→ $\lambda x = A^{(S)} x$ with $A^{(S)} = D_{in}^{-1/2} L^T D_{out}^{-1} L D_{in}^{-1/2}$
→ Solution: $\lambda = 1$, $x = d_{in}^{1/2}$ (componentwise square root)
and analogously for hub scores: $\lambda y = H^{(S)} y$ → $\lambda = 1$, $y = d_{out}^{1/2}$
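
The Snorm-Rank fixed point can be checked numerically. The sketch below applies one Oop step followed by one Iop step to the componentwise square root of the in-degree vector. The function name and the toy graph are hypothetical, and the graph is chosen so that every node has nonzero in-degree and out-degree, which the inverse square roots require.

```python
import math


def snorm_round(L, x):
    """Apply y = Oop(x), then x' = Iop(y), for Snorm-Rank.

    Oop = Dout^-1/2 L Din^-1/2 and Iop = Din^-1/2 L^T Dout^-1/2.
    Assumes all in-degrees and out-degrees are nonzero.
    """
    n = len(L)
    dout = [sum(L[i]) for i in range(n)]
    din = [sum(L[i][j] for i in range(n)) for j in range(n)]
    y = [
        sum(L[i][j] * x[j] / math.sqrt(din[j]) for j in range(n)) / math.sqrt(dout[i])
        for i in range(n)
    ]
    x_new = [
        sum(L[i][j] * y[i] / math.sqrt(dout[i]) for i in range(n)) / math.sqrt(din[j])
        for j in range(n)
    ]
    return x_new, y


# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 0, 1 -> 2, 2 -> 0.
L = [[0, 1, 1],
     [1, 0, 1],
     [1, 0, 0]]
n = len(L)
din = [sum(L[i][j] for i in range(n)) for j in range(n)]
dout = [sum(row) for row in L]

x0 = [math.sqrt(d) for d in din]
x1, y1 = snorm_round(L, x0)
# x1 reproduces x0 (eigenvalue 1), and y1 equals sqrt(dout).
```

This is consistent with the slide's observation that Snorm-Rank authority scores are determined by in-degrees alone.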

Experimental Results

Construct neighborhood graph from result of query "star".
Compare authority-scoring ranks:

Rank  HITS                                   OnormRank                              PageRank
1     www.starwars.com                       www.starwars.com                       www.starwars.com
2     www.lucasarts.com                      www.lucasarts.com                      www.lucasarts.com
3     www.jediknight.net                     www.jediknight.net                     www.paramount.com
4     www.sirstevesguide.com                 www.paramount.com                      www.4starads.com/romanc
5     www.paramount.com                      www.sirstevesguide.com                 www.starpages.net
6     www.surfthe.net/swma/                  www.surfthe.net/swma/                  www.dailystarnews.com
7     insurrection.startrek.com              insurrection.startrek.com              www.state.mn.us
8     www.startrek.com                       www.fanfix.com                         www.star-telegram.com
9     www.fanfix.com                         shop.starwars.com                      www.starbulletin.com
10    www.physics.usyd.edu.au/.../starwars   www.physics.usyd.edu.au/.../starwars   www.kansascity.com
...
19                                                                                  www.jediknight.net
21                                                                                  insurrection.startrek.com
23                                                                                  www.surfthe.net/swma

Bottom line: Differences between all kinds of authority ranking methods are fairly minor!

More LAR (Link Analysis Ranking) Methods

HubAveraging (similar to ONorm for hubs):
$a(q) = \sum_{p \in IN(q)} h(p)$, $h(p) = \frac{1}{|OUT(p)|} \sum_{q \in OUT(p)} a(q)$

AuthorityThreshold (only k best authorities per hub):
$a(q) = \sum_{p \in IN(q)} h(p)$, $h(p) = \sum_{q \in OUT\text{-}k(p)} a(q)$
with $OUT\text{-}k(p) = \text{argmax-}k_q \{ a(q) \mid q \in OUT(p) \}$

Max (AuthorityThreshold with k=1):
$a(q) = \sum_{p \in IN(q)} h(p)$, $h(p) = a(\mathrm{argmax}_q \{ a(q) \mid q \in OUT(p) \})$

BreadthFirstSearch (transitive citations up to depth k):
$a(q) = \sum_{j=1}^{k} \left(\frac{1}{2}\right)^{j-1} |N^{(j)}(q)|$
where $N^{(j)}(q)$ are the nodes that have a path to q by alternating o OUT and i IN steps with j = o + i
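
The first three methods differ only in the hub update, so they can share one iteration loop. The following is a minimal sketch under assumptions not stated on the slide: the graph is a dictionary of out-link lists, authority scores are rescaled to sum 1 after each round, and the iteration count is fixed. All function names and the toy graph are hypothetical.

```python
def lar_scores(out_links, hub_rule, iters=50):
    """Iterate a(q) = sum of h(p) over p in IN(q), then h(p) = hub_rule(...).

    hub_rule receives the list of authority scores a(q) for q in OUT(p).
    """
    nodes = sorted(set(out_links) | {q for qs in out_links.values() for q in qs})
    in_links = {v: [] for v in nodes}
    for p, qs in out_links.items():
        for q in qs:
            in_links[q].append(p)
    h = {v: 1.0 for v in nodes}
    a = {v: 1.0 for v in nodes}
    for _ in range(iters):
        a = {q: sum(h[p] for p in in_links[q]) for q in nodes}
        total = sum(a.values()) or 1.0
        a = {q: s / total for q, s in a.items()}  # rescale to sum 1
        h = {p: hub_rule([a[q] for q in out_links.get(p, [])]) for p in nodes}
    return a, h


def hub_averaging(scores):
    """HubAveraging: mean authority score of the pages a hub points to."""
    return sum(scores) / len(scores) if scores else 0.0


def authority_threshold(k):
    """AuthorityThreshold: sum of the k best authority scores per hub."""
    def rule(scores):
        return sum(sorted(scores, reverse=True)[:k])
    return rule


def max_rule(scores):
    """Max: the single best authority score per hub."""
    return max(scores, default=0.0)


# Hypothetical toy graph: three hubs pointing to authorities a, b, c.
out_links = {"p1": ["a", "b"], "p2": ["a"], "p3": ["a", "c"]}
a_avg, h_avg = lar_scores(out_links, hub_averaging)
a_max, h_max = lar_scores(out_links, max_rule)
a_at1, h_at1 = lar_scores(out_links, authority_threshold(1))
```

With HubAveraging, hub p2 scores higher than p1 and p3 because it points only to the strongest authority. With Max, all three hubs score the same.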

LAR as Bayesian Learning

Postulate a probabilistic model for links p → q:
$P[p \to q] = \frac{\exp(h_p a_q + e_p)}{1 + \exp(h_p a_q + e_p)}$
with parameters $\theta = (h_1, ..., h_n, a_1, ..., a_n, e_1, ..., e_n)$

Postulate a prior $f(\theta)$ for the parameters $\theta$:
normal distribution $(\mu, \sigma)$ for each $e_i$,
exponential distribution $(\lambda = 1)$ for each $a_i$, $h_i$

Posterior $f(\theta | G)$ for links i → j ∈ G:
$f(\theta | G) \sim f(G | \theta) f(\theta)$

Theorem:
$f(\theta | G) \sim \prod_{i=1..n} e^{-h_i} e^{-a_i} e^{-(e_i - \mu)^2 / (2\sigma^2)} \cdot \prod_{(i,j) \in G} e^{a_j h_i + e_i} \, / \prod_{i,j} \left(1 + e^{a_j h_i + e_i}\right)$

Estimate $\hat{\theta} = E[\theta | G]$ using numerical algorithms.

Alternative simpler model:
$P[p \to q] = \frac{h_p a_q}{1 + h_p a_q}$

LAR Quality Measures: Score Distances

Consider two n-dimensional authority score vectors a and b.

$d_1$ distance:
$d_1(a, b) = \min_{\alpha, \beta \geq 1} \sum_{i=1..n} | \alpha a_i - \beta b_i |$
with scaling weights $\alpha$, $\beta$ to compensate for normalization distortions

One could alternatively use an $L_q$ norm rather than $L_1$.
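
A minimal sketch of the $d_1$ distance follows. It relies on two observations that are not on the slide. First, an optimal pair can always be chosen with $\alpha = 1$ or $\beta = 1$, since dividing both weights by the smaller one cannot increase the sum. Second, with one weight fixed at 1 the objective is convex and piecewise linear in the other, so its minimum over $[1, \infty)$ lies at 1 or at a breakpoint. The function name is hypothetical.

```python
def d1_distance(a, b):
    """min over alpha, beta >= 1 of sum_i |alpha * a_i - beta * b_i|."""

    def best_scale(u, v):
        # Minimise sum_i |u_i - s * v_i| over s >= 1.
        # Candidates: s = 1 and every breakpoint u_i / v_i that exceeds 1.
        candidates = [1.0] + [
            ui / vi for ui, vi in zip(u, v) if vi != 0 and ui / vi > 1
        ]
        return min(
            sum(abs(ui - s * vi) for ui, vi in zip(u, v)) for s in candidates
        )

    # Either alpha = 1 (scale b) or beta = 1 (scale a).
    return min(best_scale(a, b), best_scale(b, a))


# The same ranking under two different normalizations has distance 0.
same_up_to_scale = d1_distance([0.5, 0.5], [1.0, 1.0])
# Scores concentrated on different nodes cannot be reconciled by scaling.
disjoint = d1_distance([1.0, 0.0], [0.0, 1.0])
```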

LAR Quality Measures: Rank Distances

Consider the top-k of two rankings $\tau_1$ and $\tau_2$, or full permutations of 1..n.

• overlap similarity
  $OSim(\tau_1, \tau_2) = | top(k, \tau_1) \cap top(k, \tau_2) | / k$

• Kendall's $\tau$ measure
  $KDist(\tau_1, \tau_2) = \frac{| \{ (u,v) \mid u, v \in U, u \neq v, \text{ and } \tau_1, \tau_2 \text{ disagree on the relative order of } u, v \} |}{|U| \cdot (|U| - 1)}$
  with $U = top(k, \tau_1) \cup top(k, \tau_2)$ (with missing items set to rank k+1)
  For a pair that is tied in one ranking and ordered in the other, count p with 0 ≤ p ≤ 1:
  p=0 gives weak KDist, p=1 gives strict KDist

• footrule distance
  $Fdist(\tau_1, \tau_2) = \frac{1}{|U|} \sum_{u \in U} | \tau_1(u) - \tau_2(u) |$

(normalized) Fdist is an upper bound for KDist and Fdist/2 is a lower bound
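
The three rank distances can be sketched directly from these definitions. The code below assumes rankings are given as lists with the best item first and without duplicates. It counts ordered pairs, as in the KDist formula above, and divides the footrule sum by |U| only. The function names and example rankings are hypothetical.

```python
def rank_map(ranking, k, universe):
    """Rank of each item in the top-k; items outside the top-k get rank k + 1."""
    pos = {u: i + 1 for i, u in enumerate(ranking[:k])}
    return {u: pos.get(u, k + 1) for u in universe}


def osim(t1, t2, k):
    """Overlap similarity of the two top-k sets."""
    return len(set(t1[:k]) & set(t2[:k])) / k


def kdist(t1, t2, k, p=0.0):
    """Kendall distance over U; p is the penalty for a tie in one ranking."""
    U = sorted(set(t1[:k]) | set(t2[:k]))
    if len(U) < 2:
        return 0.0
    r1, r2 = rank_map(t1, k, U), rank_map(t2, k, U)
    total = 0.0
    for u in U:
        for v in U:
            if u == v:
                continue
            diff1, diff2 = r1[u] - r1[v], r2[u] - r2[v]
            if diff1 == 0 or diff2 == 0:
                total += p            # tied in one ranking, ordered in the other
            elif (diff1 > 0) != (diff2 > 0):
                total += 1.0          # the rankings disagree on this pair
    return total / (len(U) * (len(U) - 1))


def fdist(t1, t2, k):
    """Footrule distance: mean absolute rank difference over U."""
    U = sorted(set(t1[:k]) | set(t2[:k]))
    r1, r2 = rank_map(t1, k, U), rank_map(t2, k, U)
    return sum(abs(r1[u] - r2[u]) for u in U) / len(U)


t1 = ["a", "b", "c"]
t2 = ["a", "c", "d"]
overlap = osim(t1, t2, 3)
kendall = kendall_weak = kdist(t1, t2, 3)
footrule = fdist(t1, t2, 3)
```

With p=0 the tie penalty vanishes (weak KDist); passing p=1.0 gives the strict variant.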

LAR Similarity

Two LAR algorithms A and B are similar on the class $\mathcal{G}$ of graphs with n nodes under authority distance measure d if for n → ∞:
$\max \{ d(A(G), B(G)) \mid G \in \mathcal{G} \} = o(M_n(d, L_q))$
where $M_n(d, L_q)$ is the maximum distance under d for any two n-dimensional vectors x and y that have $L_q$ norm 1 (which is $\Theta(n^{1-1/q})$ for $d_1$ distance and $L_q$ norm)

Two LAR algorithms A and B are weakly (strictly) rank-similar on the class $\mathcal{G}$ of graphs with n nodes under weak (strict) rank distance r if for n → ∞:
$\max \{ r(A(G), B(G)) \mid G \in \mathcal{G} \} = o(1)$

Theorems:
SALSA and Indegree are similar and strictly rank-similar.
No other LAR algorithms are known to be similar or weakly rank-similar.

LAR Stability

For graphs G = (V, E) and G' = (V, E') the link distance $d_{link}$ is:
$d_{link}(G, G') = | (E \cup E') - (E \cap E') |$
For a graph $G \in \mathcal{G}$, we define $C_k(G) = \{ G' \in \mathcal{G} \mid d_{link}(G, G') \leq k \}$

LAR algorithm A is stable on the class $\mathcal{G}$ of graphs with n nodes under authority distance measure d if for every k > 0 and for n → ∞:
$\max \{ d(A(G), A(G')) \mid G \in \mathcal{G}, G' \in C_k(G) \} = o(M_n(d, L_q))$

LAR algorithm A is weakly (strictly) rank-stable on the class $\mathcal{G}$ of graphs with n nodes under weak (strict) rank distance r if for every k > 0 and for n → ∞:
$\max \{ r(A(G), A(G')) \mid G \in \mathcal{G}, G' \in C_k(G) \} = o(1)$

Theorems:
Indegree is stable.
No other LAR algorithm is known to be stable or weakly rank-stable (but some are under modified stability definitions).
PageRank is stable with high probability for power-law graphs.
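
The link distance is the size of the symmetric difference of the two edge sets. A short sketch, with edges represented as (source, target) tuples and hypothetical function names:

```python
def link_distance(edges, edges_prime):
    """d_link(G, G') = |(E ∪ E') - (E ∩ E')| for two graphs on the same nodes."""
    return len(set(edges) ^ set(edges_prime))


def in_neighborhood(edges, edges_prime, k):
    """True if G' lies in C_k(G), i.e. differs from G in at most k links."""
    return link_distance(edges, edges_prime) <= k


# Hypothetical example: G' drops the edge (2, 3) and adds the edge (3, 1).
E = {(1, 2), (2, 3)}
E_prime = {(1, 2), (3, 1)}
distance = link_distance(E, E_prime)
```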

LAR Experimental Comparison: Queries

Experimental setup:
• 34 queries
• root sets of 200 pages each, obtained from Google
• base sets computed using Google with the first 50 predecessors per page

Source: Borodin et al., ACM TOIT 2005

LAR Experimental Comparison: Precision@10
Source: Borodin et al., ACM TOIT 2005

LAR Experimental Comparison: Key Authorities
Is there a winner at all?
Source: Borodin et al., ACM TOIT 2005

LAR Results for Query "Classical Guitar" (1)
Source: Borodin et al., ACM TOIT 2005

LAR Results for Query "Classical Guitar" (2)
Source: Borodin et al., ACM TOIT 2005

LAR Results for Query "Classical Guitar" (3)
Source: Borodin et al., ACM TOIT 2005

5.4 Topic-specific PageRank [Haveliwala 2003]

Given: a (small) set of topics $c_k$, each with a set $T_k$ of authorities (taken from a directory such as ODP (www.dmoz.org) or a bookmark collection)

Key idea: change the PageRank random walk by biasing the random-jump probabilities to the topic authorities $T_k$:
$r_k = \varepsilon p_k + (1 - \varepsilon) A' r_k$
with $A'_{ij} = 1/outdegree(j)$ for $(j,i) \in E$, 0 else
with $(p_k)_j = 1/|T_k|$ for $j \in T_k$, 0 else (instead of $p_j = 1/n$)

Approach:
1) Precompute topic-specific PageRank vectors $r_k$
2) Classify user query q (incl. query context) w.r.t. each topic $c_k$ → probability $w_k := P[c_k | q]$
3) Total authority score of doc d is $\sum_k w_k r_k(d)$
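
The three steps can be sketched as follows. This is a minimal illustration, not Haveliwala's implementation. The function names, the toy cycle graph, the value of $\varepsilon$, the iteration count, and the query-topic weights are all assumptions. The query classifier from step 2 is not modelled, so its output probabilities $w_k$ are simply given as numbers.

```python
def topic_pagerank(edges, n, topic_set, eps=0.15, iters=100):
    """Iterate r = eps * p + (1 - eps) * A' r with jumps biased to topic_set.

    edges holds pairs (j, i) meaning page j links to page i.
    """
    out = [[] for _ in range(n)]
    for j, i in edges:
        out[j].append(i)
    # Biased jump vector: (p_k)_j = 1/|T_k| on the topic authorities, 0 elsewhere.
    p = [1.0 / len(topic_set) if j in topic_set else 0.0 for j in range(n)]
    r = p[:]
    for _ in range(iters):
        new = [eps * pj for pj in p]
        for j in range(n):
            for i in out[j]:
                new[i] += (1.0 - eps) * r[j] / len(out[j])  # A'_ij = 1/outdegree(j)
        s = sum(new)
        r = [v / s for v in new]  # keep r a probability vector
    return r


def combined_score(topic_vectors, weights, d):
    """Step 3: total authority score of document d, sum_k w_k * r_k(d)."""
    return sum(w * r[d] for w, r in zip(weights, topic_vectors))


# Hypothetical toy graph: a 4-cycle 0 -> 1 -> 2 -> 3 -> 0, with two topics.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
r_topic_a = topic_pagerank(edges, 4, {0})   # topic A has authority set {0}
r_topic_b = topic_pagerank(edges, 4, {2})   # topic B has authority set {2}

# Assumed classifier output for some query: P[A | q] = 0.5, P[B | q] = 0.5.
weights = [0.5, 0.5]
score_doc0 = combined_score([r_topic_a, r_topic_b], weights, 0)
score_doc2 = combined_score([r_topic_a, r_topic_b], weights, 2)
```

Because the vectors $r_k$ do not depend on the query, they can be computed offline. Only the weighted sum in step 3 is evaluated at query time.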