Link Analysis, Stony Brook University, CSE545, Fall 2016 (PowerPoint presentation)
  1. Link Analysis Stony Brook University CSE545, Fall 2016

  2. The Web , circa 1998


  4. The Web, circa 1998. Two ways to find pages: match keywords, language (information retrieval); or explore a directory.

  5. The Web, circa 1998. Matching keywords, language (information retrieval) is easy to game with “term spam”; exploring a directory is time-consuming and not open-ended.

  6. Enter PageRank ...

  7. PageRank Key Idea: Consider the citations of the website in addition to keywords.

  8. PageRank Key Idea: Consider the citations of the website in addition to keywords. Who links to it? and what are their citations?

  9. PageRank Key Idea: Consider the citations of the website in addition to keywords. Who links to it? and what are their citations? The Web as a directed graph:

  10. PageRank Key Idea: Consider the citations of the website in addition to keywords. Who links to it? and what are their citations? Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  11. PageRank Key Idea: Consider the citations of the website in addition to keywords. Flow Model: in-links as votes Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?

  12. PageRank Key Idea: Consider the citations of the website in addition to keywords. Flow Model: in-links (citations) as votes But citations from important pages should count more. Use recursion to figure out if each page is important. Innovation 1: What pages would a “random Web surfer” end up at? Innovation 2: Not just own terms but what terms are used by citations?


  14. PageRank. Key idea: consider the citations of the website in addition to keywords. Flow model: in-links (citations) as votes, but citations from important pages should count more; use recursion to figure out if each page is important. How to compute? Each page j has an importance (i.e. rank) r_j = Σ_{i→j} r_i / n_i, where n_i is |out-links of page i|.

  15. PageRank. How to compute, on an example graph with nodes A, B, C, D? Each page j has an importance (i.e. rank, r_j); n_j is |out-links of j|.

  16. PageRank. Example: page D receives in-links from A, B, and C, which have 1, 4, and 2 out-links respectively, so r_D = r_A/1 + r_B/4 + r_C/2.

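The vote computation on slide 16 can be sketched in Python. This is a minimal sketch: the current rank values are assumed to be the uniform start (¼ each, an assumption for illustration); only the out-degrees n_A = 1, n_B = 4, n_C = 2 come from the slide.

```python
# Flow-model vote for page D: r_D = r_A/1 + r_B/4 + r_C/2.
# Current ranks are the uniform start r = 1/N (assumed for illustration).
ranks = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
out_degree = {"A": 1, "B": 4, "C": 2}   # n_i = |out-links of i|, per the slide
in_links_of_D = ["A", "B", "C"]

# Each in-link i contributes r_i / n_i of its own importance to D.
r_D = sum(ranks[i] / out_degree[i] for i in in_links_of_D)
print(r_D)  # 0.25/1 + 0.25/4 + 0.25/2 = 0.4375
```

Each vote is diluted by the voter's out-degree, which is why A's single out-link counts four times as much as one of B's four.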

  18. PageRank. How to compute? Treat the flow equations over A, B, C, D as a system of equations?


  20. PageRank. A system of equations provides intuition, but is impractical to solve at scale.

  21. PageRank. The example graph on A, B, C, D as a “Transition Matrix”, M (entry [to, from] is 1/n_from if page "from" links to page "to"):

      to \ from    A     B     C     D
      A            0     1/2   1     0
      B            1/3   0     0     1/2
      C            1/3   0     0     1/2
      D            1/3   1/2   0     0

      Innovation 1: What pages would a “random Web surfer” end up at?

  22. With the transition matrix M: to start, N = 4 nodes, so r = [¼, ¼, ¼, ¼].

  23. After the first iteration: M·r = [3/8, 5/24, 5/24, 5/24].

  24. After the second iteration: M·(M·r) = M²·r = [15/48, 11/48, …]

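The two iterations above can be verified with a short pure-Python sketch (M is the transition matrix from slide 21; `matvec` is just a helper name, not from the slides):

```python
# Transition matrix M: column j holds 1/n_j in each row that j links to.
M = [
    [0,   1/2, 1, 0],    # to A
    [1/3, 0,   0, 1/2],  # to B
    [1/3, 0,   0, 1/2],  # to C
    [1/3, 1/2, 0, 0],    # to D
]

def matvec(M, r):
    # One PageRank step: multiply the transition matrix by the rank vector.
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]

r0 = [1/4] * 4       # uniform start over N = 4 nodes
r1 = matvec(M, r0)   # [3/8, 5/24, 5/24, 5/24]
r2 = matvec(M, r1)   # begins [15/48, 11/48, ...]
```

Note that each iteration preserves the total rank (the columns of M sum to 1), so r stays a probability distribution.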
  25. PageRank: power iteration algorithm.

      Initialize: r[0] = [1/N, …, 1/N], r[-1] = [0, …, 0]
      while err_norm(r[t], r[t-1]) > min_err:
          r[t+1] = M·r[t]
          t += 1
      solution = r[t]

      err_norm(v1, v2) = |v1 - v2|   # L1 norm

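The pseudocode on slide 25 can be made runnable. A minimal sketch in plain Python, assuming the 4-node transition matrix from the slides and the L1-norm stopping test; the converged values quoted in the comment are not on the slides, they come from solving r = M·r by hand for this graph.

```python
def err_norm(v1, v2):
    # L1 norm of the difference, |v1 - v2|
    return sum(abs(a - b) for a, b in zip(v1, v2))

def power_iteration(M, min_err=1e-12):
    n = len(M)
    r = [1.0 / n] * n      # r[0] = [1/N, ..., 1/N]
    prev = [0.0] * n       # r[-1] = [0, ..., 0]
    while err_norm(r, prev) > min_err:
        prev = r
        # r[t+1] = M · r[t]
        r = [sum(M[i][j] * prev[j] for j in range(n)) for i in range(n)]
    return r

M = [[0,   1/2, 1, 0],
     [1/3, 0,   0, 1/2],
     [1/3, 0,   0, 1/2],
     [1/3, 1/2, 0, 0]]

ranks = power_iteration(M)
# For this graph the iteration settles at [1/3, 2/9, 2/9, 2/9]: A collects
# the most rank because C sends it a full vote and B half of one.
```

This graph is strongly connected and aperiodic, so the iteration is guaranteed to converge; the general Web graph needs the fixes (dead ends, spider traps) that motivate teleportation.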

  27. As err_norm gets smaller, we are moving toward r = M·r: the power iteration algorithm is actually just finding the eigenvector of M.

  28. We are actually just finding the eigenvector of M. Definition: x is an eigenvector of a matrix A, with eigenvalue λ, if A·x = λ·x.

  29. Here the eigenvalue is λ = 1, since the columns of M sum to 1; thus 1·r = M·r, i.e. the converged rank vector r is the eigenvector of M with eigenvalue 1.

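The eigenvalue claim on slide 29 can be sanity-checked numerically. A sketch, assuming the 4-node M from the slides; the stationary vector [1/3, 2/9, 2/9, 2/9] used here is derived by hand from r = M·r with the ranks summing to 1, it is not given on the slides.

```python
M = [[0,   1/2, 1, 0],
     [1/3, 0,   0, 1/2],
     [1/3, 0,   0, 1/2],
     [1/3, 1/2, 0, 0]]

# Stationary ranks for this graph (solved by hand from r = M·r, Σ r = 1).
r = [1/3, 2/9, 2/9, 2/9]

Mr = [sum(M[i][j] * r[j] for j in range(4)) for i in range(4)]
# M·r reproduces r component by component, so r is an eigenvector of M
# with eigenvalue λ = 1 (consistent with the columns of M summing to 1).
```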
