link analysis

Link Analysis, Paolo Boldi, DSI, LAW (Laboratory for Web Algorithmics), Università degli Studi di Milano

Ranking, search engines, social networks. Ranking is of utmost importance in IR, search engines and also in ...


  1. PageRank: the Web-Surfer Metaphor [Figure: example graph with nodes labelled 0 to 9]


  5. PageRank: the Web-Surfer Metaphor. What if the surfer reaches a node with no outlinks (a dangling node)?

  6. PageRank: the Web-Surfer Metaphor. In that case, (s)he jumps to a random node with probability 1.

  7. PageRank: the Web-Surfer Metaphor. The PageRank of a page is the average fraction of time spent by the surfer on that page.
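The surfer metaphor can be tried out directly with a short simulation. The sketch below is only an illustration under stated assumptions: the graph `toy_graph` is hypothetical (the graph drawn on the slides is not recoverable from this transcript), the damping factor is taken to be 0.85, and both the restart and the dangling-node jump are assumed to be uniform over all nodes.

```python
import random

def simulate_surfer(out_links, alpha=0.85, steps=200_000, seed=0):
    """Estimate PageRank as the fraction of time a random surfer spends on each node.

    Assumption: uniform preference vector and uniform dangling-node distribution.
    """
    rng = random.Random(seed)
    nodes = sorted(out_links)
    visits = {node: 0 for node in nodes}
    current = rng.choice(nodes)
    for _ in range(steps):
        visits[current] += 1
        successors = out_links[current]
        if successors and rng.random() < alpha:
            current = rng.choice(successors)  # follow a random outlink
        else:
            current = rng.choice(nodes)       # dangling node or restart: jump anywhere
    return {node: count / steps for node, count in visits.items()}

# Hypothetical toy graph (not the one on the slides); node 3 is dangling.
toy_graph = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
fractions = simulate_surfer(toy_graph)
```

On this toy graph node 2 collects the largest share of visits, since it is the only node with two in-links from nodes that have few outlinks.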

  14. What does PageRank depend on? PageRank can be formally defined as the limit distribution of a stochastic process whose states are Web pages. What does this distribution depend on? (more on all this later)
      ◮ the web graph $G$;
      ◮ the preference vector $v$;
      ◮ the dangling-node distribution $u$;
      ◮ the damping factor $\alpha$.
      How does PageRank depend on each of these factors? What happens at limit values (e.g., $\alpha \to 1$)?

  20. PageRank: formal definition
      ◮ Is the definition of PageRank well defined? Are we all using the same definition?
      ◮ The row-normalised matrix of a (web) graph $G$ is the matrix $\bar{G}$ such that $(\bar{G})_{ij}$ is one over the outdegree of $i$ if there is an arc from $i$ to $j$ in $G$, and zero otherwise (in general, and usually, not stochastic because of rows of zeroes).
      ◮ $d$ is the characteristic vector of dangling nodes (nodes without outgoing arcs).
      ◮ Let $v$ and $u$ be distributions, which we will call the preference and the dangling-node distribution.
      ◮ Let $\alpha$ be the damping factor.

  28. PageRank: formal definition (2)
      ◮ PageRank $r$ is defined (up to a scalar) by the eigenvector equation
        $$ r \left( \alpha (\bar{G} + d^T u) + (1 - \alpha) \mathbf{1}^T v \right) = r $$
      ◮ Equivalently, as the unique stationary state of the Markov chain $\alpha (\bar{G} + d^T u) + (1 - \alpha) \mathbf{1}^T v$, which we call a Markov chain with restart [Boldi, Lonati, Santini & Vigna 2006].
      ◮ Some notation: writing $P = \bar{G} + d^T u$ and $M = \alpha P + (1 - \alpha) \mathbf{1}^T v$, the equation becomes $r \left( \alpha P + (1 - \alpha) \mathbf{1}^T v \right) = r$, that is, $r M = r$.
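As a concrete reading of the definitions, the following sketch builds the row-normalised matrix, the dangling-node indicator, and the two matrices $\bar{G} + d^T u$ and $\alpha(\bar{G} + d^T u) + (1-\alpha)\mathbf{1}^T v$ as plain nested lists. The function names and the graph `toy_graph` are hypothetical, and the code assumes nodes are numbered 0 to n-1 with no repeated arcs.

```python
def row_normalised(out_links, n):
    """Entry (i, j) is 1/outdegree(i) if there is an arc i -> j, and 0 otherwise."""
    g_bar = [[0.0] * n for _ in range(n)]
    for i, successors in out_links.items():
        for j in successors:
            g_bar[i][j] = 1.0 / len(successors)
    return g_bar

def markov_chain_with_restart(out_links, v, u, alpha):
    """Return (P, M) with P = G_bar + d^T u and M = alpha * P + (1 - alpha) * 1^T v."""
    n = len(v)
    g_bar = row_normalised(out_links, n)
    d = [1.0 if not out_links[i] else 0.0 for i in range(n)]  # dangling-node indicator
    # P replaces each all-zero (dangling) row with the distribution u.
    p = [[g_bar[i][j] + d[i] * u[j] for j in range(n)] for i in range(n)]
    # Every row of 1^T v equals v, so the restart term adds (1 - alpha) * v[j].
    m = [[alpha * p[i][j] + (1 - alpha) * v[j] for j in range(n)] for i in range(n)]
    return p, m

# Hypothetical toy graph; node 3 is dangling.
toy_graph = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
uniform = [0.25] * 4
P, M = markov_chain_with_restart(toy_graph, v=uniform, u=uniform, alpha=0.85)
row_sums = [sum(row) for row in M]
```

Each row of `M` sums to 1, which is what makes it the transition matrix of a Markov chain even though the row-normalised matrix alone is not stochastic.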

  34. PageRank closed formula. Fixing $r \mathbf{1}^T = 1$:
      $$ r M = r $$
      $$ r \left( \alpha P + (1 - \alpha) \mathbf{1}^T v \right) = r $$
      $$ \alpha r P + (1 - \alpha) v = r $$
      $$ (1 - \alpha) v = r (I - \alpha P), $$
      which yields the following closed formula for PageRank:
      $$ r = (1 - \alpha) v (I - \alpha P)^{-1}. $$
      So it's a linear system: use Gauss–Seidel! Or use the Power Iteration Method:
      $$ \lim_{k \to \infty} x \cdot M^k $$
      Equivalently:
      $$ r = (1 - \alpha) v \sum_{k=0}^{\infty} (\alpha P)^k. $$
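A minimal sketch of the Power Iteration Method, followed by a check that its result also satisfies the linear system $(1-\alpha) v = r (I - \alpha P)$. The matrix `P` below is a hypothetical example with the dangling row already replaced by a uniform $u$; the tolerance and iteration cap are arbitrary choices.

```python
def pagerank_power(P, v, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate x <- alpha * x P + (1 - alpha) * v, which equals x M when x sums to 1."""
    n = len(v)
    x = list(v)
    for _ in range(max_iter):
        x_next = [
            alpha * sum(x[i] * P[i][j] for i in range(n)) + (1 - alpha) * v[j]
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(x_next, x)) < tol:
            return x_next
        x = x_next
    return x

# Hypothetical 4-node example; the last row is a dangling node patched with uniform u.
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.25, 0.25, 0.25, 0.25],
]
v = [0.25] * 4
alpha = 0.85
r = pagerank_power(P, v, alpha)

# Residual of (1 - alpha) v = r (I - alpha P), component by component.
residual = [
    (1 - alpha) * v[j] - (r[j] - alpha * sum(r[i] * P[i][j] for i in range(4)))
    for j in range(4)
]
```

The residual is zero up to rounding, so the fixed point of the iteration and the solution of the linear system coincide on this example.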

  39. PageRank and graph paths
      ◮ Let $G^*(-, i)$ be the set of all paths ending in $i$;
      ◮ For any $\pi \in G^*(-, i)$, let $b(\pi)$ denote the branching contribution of $\pi$, i.e., the product of the outdegrees of the nodes that are met on the path (excluding the ending node);
      ◮ The expression
        $$ r = (1 - \alpha) v \sum_{k=0}^{\infty} (\alpha P)^k $$
        can be rewritten as
        $$ (r)_i = (1 - \alpha) \sum_{\pi \in G^*(-, i)} \frac{v_{s(\pi)}}{b(\pi)} \alpha^{|\pi|}, $$
        where $s(\pi)$ is the starting node of $\pi$ and $|\pi|$ is its length. Replacing $\alpha^{|\pi|}$ with $f(|\pi|)$ gives
        $$ (r)_i = (1 - \alpha) \sum_{\pi \in G^*(-, i)} \frac{v_{s(\pi)}}{b(\pi)} f(|\pi|), $$
        where $f(-)$ is a suitable damping function that goes to zero sufficiently fast [Baeza–Yates, Boldi & Castillo 2006].
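The path formulation can be checked numerically on a small graph by enumerating every path of bounded length and comparing the result with the truncated matrix series. This is a sketch under assumptions: the graph is a hypothetical one with no dangling nodes (so that $P$ is simply the row-normalised matrix), paths may revisit nodes, and both sums are cut at the same length `max_len`.

```python
def path_sum(out_links, v, alpha, target, max_len):
    """(1 - alpha) * sum of v[s(pi)] * alpha**|pi| / b(pi) over paths of length <= max_len ending in target."""
    in_links = {j: [] for j in out_links}
    for i, successors in out_links.items():
        for j in successors:
            in_links[j].append(i)

    def extend(first_node, length, weight):
        # weight = alpha**length / (product of outdegrees along the path, end node excluded)
        total = v[first_node] * weight
        if length < max_len:
            for pred in in_links[first_node]:
                total += extend(pred, length + 1, weight * alpha / len(out_links[pred]))
        return total

    return (1 - alpha) * extend(target, 0, 1.0)

def truncated_series(out_links, v, alpha, max_len):
    """(1 - alpha) * v * sum_{k=0}^{max_len} (alpha P)^k, with P the row-normalised matrix."""
    n = len(v)
    term = list(v)   # v (alpha P)^k, starting at k = 0
    total = list(v)
    for _ in range(max_len):
        nxt = [0.0] * n
        for i, successors in out_links.items():
            for j in successors:
                nxt[j] += alpha * term[i] / len(successors)
        term = nxt
        total = [t + x for t, x in zip(total, term)]
    return [(1 - alpha) * t for t in total]

# Hypothetical graph without dangling nodes.
graph = {0: [1, 2], 1: [2], 2: [0]}
v = [1 / 3] * 3
by_paths = [path_sum(graph, v, 0.85, i, max_len=12) for i in range(3)]
by_series = truncated_series(graph, v, 0.85, max_len=12)
```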

  43. Iteration vs. approximation. We can rewrite the summation as follows:
      $$ r = v + v \sum_{k=1}^{\infty} \alpha^k \left( P^k - P^{k-1} \right). $$
      Thus, the rational function $r$ can be approximated using its Maclaurin polynomials (i.e., truncated series).
      Theorem. The $n$-th approximation of PageRank computed by the Power Method with damping factor $\alpha$ and starting vector $v$ coincides with the $n$-th degree Maclaurin polynomial of PageRank evaluated at $\alpha$:
      $$ v M^n = v + v \sum_{k=1}^{n} \alpha^k \left( P^k - P^{k-1} \right). $$
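The theorem is easy to test numerically. The sketch below computes $v M^n$ with the Power Method and the degree-$n$ polynomial $v + v \sum_{k=1}^{n} \alpha^k (P^k - P^{k-1})$ separately, then compares them. The matrix `P`, the choice $n = 10$ and the helper names are assumptions made for the illustration.

```python
def times(x, P):
    """Row vector times matrix."""
    n = len(x)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

def power_method(P, v, alpha, n_iter):
    """Return v M^n, using x M = alpha * x P + (1 - alpha) * v for x summing to 1."""
    x = list(v)
    for _ in range(n_iter):
        xp = times(x, P)
        x = [alpha * a + (1 - alpha) * b for a, b in zip(xp, v)]
    return x

def maclaurin_polynomial(P, v, alpha, degree):
    """Return v + sum_{k=1}^{degree} alpha**k * (v P^k - v P^(k-1))."""
    total = list(v)
    prev = list(v)                # v P^(k-1)
    for k in range(1, degree + 1):
        cur = times(prev, P)      # v P^k
        total = [t + alpha**k * (c - p) for t, c, p in zip(total, cur, prev)]
        prev = cur
    return total

# Hypothetical stochastic matrix (dangling row already patched).
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.25, 0.25, 0.25, 0.25],
]
v = [0.25] * 4
iterate = power_method(P, v, alpha=0.85, n_iter=10)
polynomial = maclaurin_polynomial(P, v, alpha=0.85, degree=10)
```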

  48. One $\alpha$ to rule them all...
      Corollary. The difference between the $k$-th and the $(k-1)$-th approximation of PageRank (as computed by the Power Method with starting vector $v$), divided by $\alpha^k$, is the $k$-th coefficient of the power series of PageRank.
      As a consequence, the data obtained computing PageRank for a given $\alpha$ can be used to compute immediately PageRank for any other $\alpha$, obtaining the result of the Power Method after the same number of iterations.
      By saving the Maclaurin coefficients during the computation of PageRank with a specific $\alpha$ it is possible to study the behaviour of PageRank when $\alpha$ varies.
      Even more is true, of course: using standard series differentiation techniques, one can approximate the $k$-th derivative.
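A sketch of how the corollary can be used in practice: run the Power Method once with one damping factor, recover the Maclaurin coefficients from successive differences, and re-evaluate the polynomial at a different damping factor. The matrix `P`, the two damping factors (0.85 and 0.5) and the 30 iterations are illustrative assumptions.

```python
def power_iterates(P, v, alpha, n_iter):
    """Return [x_0, ..., x_n] with x_0 = v and x_{k+1} = alpha * x_k P + (1 - alpha) * v."""
    n = len(v)
    iterates = [list(v)]
    for _ in range(n_iter):
        x = iterates[-1]
        iterates.append([
            alpha * sum(x[i] * P[i][j] for i in range(n)) + (1 - alpha) * v[j]
            for j in range(n)
        ])
    return iterates

def maclaurin_coefficients(iterates, alpha):
    """Coefficient k is (x_k - x_{k-1}) / alpha**k; coefficient 0 is x_0 = v."""
    coeffs = [iterates[0]]
    for k in range(1, len(iterates)):
        coeffs.append([(a - b) / alpha**k for a, b in zip(iterates[k], iterates[k - 1])])
    return coeffs

def evaluate(coeffs, alpha):
    """Evaluate the truncated power series at a (possibly different) damping factor."""
    n = len(coeffs[0])
    return [sum(c[j] * alpha**k for k, c in enumerate(coeffs)) for j in range(n)]

# Hypothetical stochastic matrix (dangling row already patched).
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.25, 0.25, 0.25, 0.25],
]
v = [0.25] * 4
coeffs = maclaurin_coefficients(power_iterates(P, v, 0.85, 30), 0.85)
reused = evaluate(coeffs, 0.5)                 # no further matrix products needed
direct = power_iterates(P, v, 0.5, 30)[-1]     # Power Method run again from scratch
```

Evaluating the saved coefficients costs only vector operations, whereas the direct run repeats every matrix-vector product.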

  49. Some typical behaviours [Figure: four plots, each with a horizontal axis running from 0 to 1 and vertical values on the order of 1e-08]

  50. An example [Figure: the example graph on nodes 0 to 9, next to a plot with one curve per node (node 0 to node 9), horizontal axis from 0 to 1, vertical axis from 0 to 0.5]
      $$ r_0(\alpha) = -5 \, \frac{(\alpha^2 + 18\alpha + 4)(-1 + \alpha)}{8\alpha^4 + \alpha^3 - 170\alpha^2 - 20\alpha + 200} $$

  51. An example [Figure: same graph and plot as in the previous slide]
      $$ r_1(\alpha) = -2 \, \frac{(\alpha^2 + 2\alpha + 10)(-1 + \alpha)}{8\alpha^4 + \alpha^3 - 170\alpha^2 - 20\alpha + 200} $$

  57. The magic value $\alpha = 0.85$. One usually computes and considers only $r(0.85)$. Why 0.85?
      ◮ “The smart guys at Google use 0.85” (???).
      ◮ “It works pretty well”.
      ◮ Iterative algorithms that approximate PageRank converge quickly if $\alpha = 0.85$: larger values would require more iterations; moreover...
      ◮ ...numeric instability arises when $\alpha$ is too close to 1...
      ◮ ...yet, we believe that understanding how $r(\alpha)$ changes when $\alpha$ is modified is important.
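The claim that larger damping factors need more iterations can be observed on a small example. The sketch counts Power Method steps until two successive iterates differ by less than a tolerance, for three damping factors. The matrix `P`, the tolerance `1e-10` and the L1 stopping rule are assumptions chosen for the illustration, and the counts depend on all of them.

```python
def iterations_to_converge(P, v, alpha, tol=1e-10, max_iter=10_000):
    """Count Power Method steps until successive iterates differ by less than tol (L1 norm)."""
    n = len(v)
    x = list(v)
    for step in range(1, max_iter + 1):
        x_next = [
            alpha * sum(x[i] * P[i][j] for i in range(n)) + (1 - alpha) * v[j]
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(x_next, x)) < tol:
            return step
        x = x_next
    return max_iter

# Hypothetical stochastic matrix (dangling row already patched).
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.25, 0.25, 0.25, 0.25],
]
v = [0.25] * 4
counts = {alpha: iterations_to_converge(P, v, alpha) for alpha in (0.5, 0.85, 0.99)}
```

On a graph this small the effect is mild, because the matrix itself mixes quickly; on web-scale graphs the dependence on the damping factor is much more pronounced.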

  64. Some literature
      ◮ PageRank (values and rankings) change significantly when $\alpha$ is modified [Pretto 2002; Langville & Meyer 2004].
      ◮ Convergence rate of the Power Method is $\alpha$ [Haveliwala & Kamvar 2003].
      ◮ The condition number of the PageRank problem is $(1 + \alpha) / (1 - \alpha)$ [Haveliwala & Kamvar 2003].
      ◮ PageRank can be computed in the $\alpha \approx 1$ zone using Arnoldi-type methods [Del Corso, Gullì & Romani 2005; Golub & Greif 2006].
      ◮ PageRank can be extrapolated when $\alpha \approx 1$ (even $\alpha > 1$!) using an explicit formula based on the Jordan normal form [Serra–Capizzano 2005; Brezinski & Redivo–Zaglia 2006].
      ◮ Choose $\alpha = 1/2$! [Avrachenkov, Litvak & Kim 2006]
