Link Analysis (Stony Brook University, CSE545, Spring 2019)


  1. Link Analysis (Stony Brook University, CSE545, Spring 2019)

  2. The Web, circa 1998

  3. The Web, circa 1998: Match keywords, language (information retrieval). Explore directory.

  4. The Web, circa 1998: Match keywords, language (information retrieval): easy to game with “term spam”. Explore directory: time-consuming; not open-ended.

  5. Enter PageRank ...

  6. PageRank Key Idea: Consider the citations of the website.

  7. PageRank Key Idea: Consider the citations of the website. Who links to it? and what are their citations?

  8. PageRank Key Idea: Consider the citations of the website. Who links to it? and what are their citations? Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  9. PageRank. View 1: Flow Model: in-links as votes. (Example graph with nodes A–F.) Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  10. PageRank. View 1: Flow Model: in-links as votes. Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  11. PageRank. View 1: Flow Model: in-links (citations) as votes, but citations from important pages should count more => use recursion to figure out if each page is important. Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  12. PageRank. View 1: Flow Model: How to compute? Each page j has an importance (i.e. rank, r_j); n_j is its number of out-links. Flow equation: r_j = Σ_{i→j} r_i / n_i. (Example graph with nodes A–D.)

  13. PageRank. View 1: Flow Model, example: page D receives r_A/1, r_B/4, and r_C/2, so r_D = r_A/1 + r_B/4 + r_C/2. (Each page j has a rank r_j; n_j is its number of out-links.)
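As a sketch of the flow computation above (the rank values below are hypothetical, only the out-link counts 1, 4, and 2 come from the slide):

```python
# Flow model: page D receives rank from each in-linking page i,
# weighted by 1/n_i (n_i = number of out-links of page i).
r_A, r_B, r_C = 0.3, 0.2, 0.2    # hypothetical current ranks (not from the slide)
n_A, n_B, n_C = 1, 4, 2          # out-link counts used on the slide
r_D = r_A / n_A + r_B / n_B + r_C / n_C
# r_D = 0.3 + 0.05 + 0.1 = 0.45
```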

  14. PageRank. View 1: Flow Model: How to compute? Each page j has an importance (i.e. rank, r_j); n_j is its number of out-links. (Example graph with nodes A–D.)

  15. PageRank. View 1: Flow Model: A system of equations (reconstructed here from the transition matrix for the graph A–D): r_A = r_B/2 + r_C; r_B = r_A/3 + r_D/2; r_C = r_A/3 + r_D/2; r_D = r_A/3 + r_B/2. Each page j has a rank r_j; n_j is its number of out-links.

  16. PageRank. View 1: Flow Model: A system of equations (as on slide 15). How to compute? Each page j has a rank r_j; n_j is its number of out-links.

  17. PageRank. View 1: Flow Model: Solve the system of equations. (With the additional constraint that the ranks sum to 1, the solution is unique.) Each page j has a rank r_j; n_j is its number of out-links.
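The "Solve" step can be sketched as a plain linear solve. This is a sketch, not the course's code: the flow equations r = M·r come from the transition matrix shown on slide 18, and one redundant equation is replaced by the normalization sum(r) = 1:

```python
import numpy as np

# Column-stochastic transition matrix for the 4-node graph (slide 18).
M = np.array([[0,   1/2, 1, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0  ]])

A = M - np.eye(4)                    # flow equations: (M - I)·r = 0
A[3, :] = 1.0                        # replace a redundant row with sum(r) = 1
b = np.array([0.0, 0.0, 0.0, 1.0])
r = np.linalg.solve(A, b)            # -> [1/3, 2/9, 2/9, 2/9]
```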

  18. PageRank (graph with nodes A, B, C, D). Transition Matrix, M:

      to \ from    A      B      C      D
      A            0      1/2    1      0
      B            1/3    0      0      1/2
      C            1/3    0      0      1/2
      D            1/3    1/2    0      0

  19. View 2: Matrix Formulation (graph A–D). Transition Matrix, M, as on slide 18.
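Building M from the link structure can be sketched as follows; the adjacency list here is inferred from the matrix on the slide (column j holds 1/n_j in the rows that page j links to):

```python
import numpy as np

# Out-links per page, inferred from the slide's transition matrix.
links = {'A': ['B', 'C', 'D'],
         'B': ['A', 'D'],
         'C': ['A'],
         'D': ['B', 'C']}

pages = sorted(links)                      # ['A', 'B', 'C', 'D']
idx = {p: i for i, p in enumerate(pages)}
M = np.zeros((len(pages), len(pages)))
for src, outs in links.items():
    for dst in outs:                       # column = "from", row = "to"
        M[idx[dst], idx[src]] = 1 / len(outs)
# Every column sums to 1 (column-stochastic).
```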

  20. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? (Graph A–D; Transition Matrix M as on slide 18.)

  21. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? To start, all pages are equally likely, at ¼ each. (Graph A–D; Transition Matrix M as on slide 18.)

  22. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? To start, all are equally likely at ¼; suppose the surfer ends up at D. (Graph A–D; Transition Matrix M as on slide 18.)

  23. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? To start, all are equally likely at ¼; the surfer ends up at D. C and B are then equally likely: P(->D->B) = ¼·½; P(->D->C) = ¼·½. (Graph A–D; Transition Matrix M as on slide 18.)

  24. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? To start, all are equally likely at ¼; the surfer ends up at D. C and B are then equally likely: P(->D->B) = ¼·½; P(->D->C) = ¼·½. The surfer ends up at C; then A is the only option: P(->D->C->A) = ¼·½·1. (Graph A–D; Transition Matrix M as on slide 18.)

  25. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? ... (Graph A–D; Transition Matrix M as on slide 18.)

  28. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? To start: N=4 nodes, so r = [¼, ¼, ¼, ¼]. (Graph A–D; Transition Matrix M as on slide 18.)

  29. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? To start: N=4 nodes, so r = [¼, ¼, ¼, ¼]; after 1st iteration: M·r = [3/8, 5/24, 5/24, 5/24]. (Graph A–D; Transition Matrix M as on slide 18.)

  30. View 2: Matrix Formulation. Innovation: What pages would a “random Web surfer” end up at? To start: N=4 nodes, so r = [¼, ¼, ¼, ¼]; after 1st iteration: M·r = [3/8, 5/24, 5/24, 5/24]; after 2nd iteration: M(M·r) = M²·r = [15/48, 11/48, …]. (Graph A–D; Transition Matrix M as on slide 18.)
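The two iterations above can be reproduced exactly with rational arithmetic; a minimal sketch using Python's fractions module:

```python
from fractions import Fraction as F

# Transition matrix M from slide 18, as exact fractions.
M = [[F(0),    F(1, 2), F(1), F(0)   ],
     [F(1, 3), F(0),    F(0), F(1, 2)],
     [F(1, 3), F(0),    F(0), F(1, 2)],
     [F(1, 3), F(1, 2), F(0), F(0)   ]]

def matvec(M, r):
    """One step of the random surfer: r' = M·r."""
    return [sum(row[j] * r[j] for j in range(len(r))) for row in M]

r0 = [F(1, 4)] * 4      # start uniform: [1/4, 1/4, 1/4, 1/4]
r1 = matvec(M, r0)      # [3/8, 5/24, 5/24, 5/24], as on the slide
r2 = matvec(M, r1)      # [15/48, 11/48, 11/48, 11/48]
```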

  31. Innovation: What pages would a “random Web surfer” end up at? To start: N=4 nodes, so r = [¼, ¼, ¼, ¼]; after 1st iteration: M·r = [3/8, 5/24, 5/24, 5/24]; after 2nd iteration: M(M·r) = M²·r = [15/48, 11/48, …]. Power iteration algorithm:

      initialize: r[0] = [1/N, …, 1/N], r[-1] = [0, …, 0]
      while (err_norm(r[t], r[t-1]) > min_err):
          …

      err_norm(v1, v2) = |v1 - v2|   # L1 norm

  (Graph A–D; “Transition Matrix”, M, as on slide 18.)

  32. Innovation: What pages would a “random Web surfer” end up at? To start: N=4 nodes, so r = [¼, ¼, ¼, ¼]; after 1st iteration: M·r = [3/8, 5/24, 5/24, 5/24]; after 2nd iteration: M(M·r) = M²·r = [15/48, 11/48, …]. Power iteration algorithm:

      initialize: r[0] = [1/N, …, 1/N], r[-1] = [0, …, 0]
      while (err_norm(r[t], r[t-1]) > min_err):
          r[t+1] = M·r[t]
          t += 1
      solution = r[t]

      err_norm(v1, v2) = |v1 - v2|   # L1 norm

  (Graph A–D; “Transition Matrix”, M, as on slide 18.)
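The slide's power iteration algorithm can be sketched in NumPy (min_err and max_iter are my choices, not values from the slide):

```python
import numpy as np

# Transition matrix M from slide 18 (columns sum to 1).
M = np.array([[0,   1/2, 1, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0  ]])

def power_iteration(M, min_err=1e-10, max_iter=1000):
    """Repeat r <- M·r until the L1 change falls below min_err."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)                  # r[0] = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < min_err:   # err_norm, L1
            return r_next
        r = r_next
    return r

r = power_iteration(M)    # converges to [1/3, 2/9, 2/9, 2/9]
```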

  33. View 3: Eigenvectors. As err_norm gets smaller we are moving toward: r = M·r. Power iteration algorithm:

      initialize: r[0] = [1/N, …, 1/N], r[-1] = [0, …, 0]
      while (err_norm(r[t], r[t-1]) > min_err):
          r[t+1] = M·r[t]
          t += 1
      solution = r[t]

      err_norm(v1, v2) = |v1 - v2|   # L1 norm

  34. View 3: Eigenvectors. As err_norm gets smaller we are moving toward: r = M·r. We are actually just finding the eigenvector of M: power iteration finds the principal eigenvector. x is an eigenvector of A if: A·x = 𝛍·x. Power iteration algorithm:

      initialize: r[0] = [1/N, …, 1/N], r[-1] = [0, …, 0]
      while (err_norm(r[t], r[t-1]) > min_err):
          r[t+1] = M·r[t]
          t += 1
      solution = r[t]

      err_norm(v1, v2) = |v1 - v2|   # L1 norm

  (Leskovec et al., 2014; http://www.mmds.org/)

  35. View 3: Eigenvectors. As err_norm gets smaller we are moving toward: r = M·r. We are actually just finding the eigenvector of M. x is an eigenvector of A if: A·x = 𝛍·x. Here 𝛍 = 1 (the eigenvalue for the 1st principal eigenvector), since the columns of M sum to 1. Thus, if r is x, then M·r = 1·r. Power iteration algorithm:

      initialize: r[0] = [1/N, …, 1/N], r[-1] = [0, …, 0]
      while (err_norm(r[t], r[t-1]) > min_err):
          r[t+1] = M·r[t]
          t += 1
      solution = r[t]

      err_norm(v1, v2) = sum(|v1 - v2|)   # L1 norm
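A quick check of the eigenvector view, using numpy.linalg.eig: the principal eigenvalue of M comes out as 1, and the rescaled eigenvector matches the PageRank vector:

```python
import numpy as np

# Transition matrix M from slide 18 (columns sum to 1).
M = np.array([[0,   1/2, 1, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0  ]])

vals, vecs = np.linalg.eig(M)
i = int(np.argmax(vals.real))     # index of the principal eigenvalue
x = vecs[:, i].real               # the corresponding eigenvector
r = x / x.sum()                   # rescale to a probability vector
# vals[i] is 1 (columns of M sum to 1); r is [1/3, 2/9, 2/9, 2/9]
```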

  36. View 4: Markov Process. Where is the surfer at time t+1? p(t+1) = M·p(t), where p(t) is the probability of being at a given node at time t. Suppose p(t+1) = p(t); then p(t) is a stationary distribution of a random walk. Thus, r is a stationary distribution.

  37. View 4: Markov Process. Where is the surfer at time t+1? p(t+1) = M·p(t), where p(t) is the probability of being at a given node at time t. Suppose p(t+1) = p(t); then p(t) is a stationary distribution of a random walk. Thus, r is a stationary distribution. This is aka a 1st-order Markov Process, with a rich probabilistic theory. One finding: the stationary distribution is unique if there are no “dead ends” (a node that can’t propagate its rank) and no “spider traps” (a set of nodes with no way out). This is also known as the chain being stochastic, irreducible, and aperiodic.
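Both claims can be sketched numerically: the PageRank vector r = [1/3, 2/9, 2/9, 2/9] for this deck's 4-node graph is stationary under M, while a dead end leaks rank. The 2-node dead-end graph below is my own illustration, not from the slides:

```python
import numpy as np

# Stationary distribution: p(t+1) = M·p(t) = p(t).
M = np.array([[0,   1/2, 1, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0  ]])
r = np.array([1/3, 2/9, 2/9, 2/9])
stationary = np.allclose(M @ r, r)   # True: M·r = r

# Dead end: B has no out-links, so its column is all zeros, the matrix
# is sub-stochastic, and total rank leaks away with each step.
M_dead = np.array([[0.0, 0.0],       # A: nothing links to A
                   [1.0, 0.0]])      # B: A links to B; B links nowhere
p = np.array([0.5, 0.5])
for _ in range(5):
    p = M_dead @ p
# p.sum() is now 0.0: all rank has leaked out
```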
