  1. CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu

  2. How to organize/navigate it?
     - First try: Human-curated Web directories
       - Yahoo, DMOZ, LookSmart

     11/8/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

  3. SEARCH!
     - Find relevant docs in a small and trusted set:
       - Newspaper articles
       - Patents, etc.
     - Two traditional problems:
       - Synonymy: buy – purchase, sick – ill
       - Polysemy: jaguar

  4. Do more documents mean better results?

  5. What is the “best” answer to the query “Stanford”?
     - Anchor text: “I go to Stanford where I study”
     - What about the query “newspaper”?
       - No single right answer
     - Scarcity (IR) vs. abundance (Web) of information
     - Web: Many sources of information. Who to “trust”?
     - Trick: Pages that actually know about newspapers might all be pointing to many newspapers
     - Ranking!

  6. The “golden triangle”

  7. Web pages are not equally “important”
     - www.joe-schmoe.com vs. www.stanford.edu
     - We already know: since there is large diversity in the connectivity of the web graph, we can rank the pages by the link structure

  8. We will cover the following Link Analysis approaches to computing the importance of nodes in a graph:
     - Hubs and Authorities (HITS)
     - PageRank
     - Topic-Specific (Personalized) PageRank

     Sidenote: various notions of centrality of a node $u$:
     - Degree centrality = degree of $u$
     - Betweenness centrality = number of shortest paths passing through $u$
     - Closeness centrality = average length of shortest paths from $u$ to all other nodes
     - Eigenvector centrality = like PageRank
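
The two simplest centrality notions in the sidenote can be sketched in a few lines of Python. This is a minimal illustration, not course code: the graph `GRAPH` and all function names are hypothetical, the graph is assumed undirected and connected, and betweenness is left out because it needs shortest-path counting.

```python
from collections import deque

# Hypothetical undirected graph as an adjacency list (not from the slides).
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}


def degree_centrality(graph, u):
    """Degree of u: the number of neighbors."""
    return len(graph[u])


def shortest_path_lengths(graph, u):
    """Breadth-first search distances from u to every reachable node."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness_centrality(graph, u):
    """Average shortest-path length from u to all other nodes.

    Assumes the graph is connected, so every other node is reachable.
    """
    dist = shortest_path_lengths(graph, u)
    others = [d for v, d in dist.items() if v != u]
    return sum(others) / len(others)


degrees = {u: degree_centrality(GRAPH, u) for u in GRAPH}
closeness = {u: closeness_centrality(GRAPH, u) for u in GRAPH}
```

With the slide's definition (an average distance), a smaller closeness value means a more central node; here node `b` is adjacent to everything and gets the smallest value.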

  9. Goal (back to the newspaper example):
     - Don’t just find newspapers. Find “experts” – people who link in a coordinated way to good newspapers
     - Idea: Links as votes
       - A page is more important if it has more links
       - In-coming links? Out-going links?
     - Hubs and Authorities. Each page has 2 scores:
       - Quality as an expert (hub): total sum of votes of pages pointed to
       - Quality as content (authority): total sum of votes of experts
     - Principle of repeated improvement
     - Figure labels: NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9

  10. Interesting pages fall into two classes:
      1. Authorities are pages containing useful information
         - Newspaper home pages
         - Course home pages
         - Home pages of auto manufacturers
      2. Hubs are pages that link to authorities
         - List of newspapers
         - Course bulletin
         - List of US auto manufacturers
      - Figure labels: NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9

  14. - A good hub links to many good authorities
      - A good authority is linked from many good hubs
      - Model using two scores for each node:
        - Hub score and Authority score
        - Represented as vectors $h$ and $a$

  15. [Kleinberg ‘98] Each page $i$ has 2 scores:
      - Authority score: $a_i$
      - Hub score: $h_i$

      HITS algorithm:
      - Initialize: $a_i = 1$, $h_i = 1$
      - Then keep iterating:
        - Authority: $a_i = \sum_{j \to i} h_j$
        - Hub: $h_i = \sum_{i \to j} a_j$
        - Normalize both $a$ and $h$
      - Figures: pages $j_1, \dots, j_4$ pointing to $i$ (authority update), and $i$ pointing to $j_1, \dots, j_4$ (hub update)
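
The iteration on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the link graph `LINKS` and the function name `hits` are hypothetical, scores are normalized to sum to 1 (one possible choice, since the slide's normalization is not legible), the hub update uses the freshly computed authority scores, and the graph is assumed to have at least one link so the sums are nonzero.

```python
# Hypothetical link graph: page -> list of pages it links to.
# Every link target must also appear as a key.
LINKS = {
    "hub1": ["nyt", "wsj", "cnn"],
    "hub2": ["nyt", "wsj"],
    "nyt": [],
    "wsj": [],
    "cnn": [],
}


def hits(out_links, iterations=50):
    """Run the hub/authority updates and return (hub, authority) score dicts."""
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: a_i = sum of hub scores of pages j with j -> i.
        new_auth = {p: 0.0 for p in pages}
        for j in pages:
            for i in out_links[j]:
                new_auth[i] += hub[j]
        # Hub update: h_i = sum of (new) authority scores of pages i points to.
        new_hub = {p: sum(new_auth[j] for j in out_links[p]) for p in pages}
        # Normalization (assumed here: each score vector sums to 1).
        a_total = sum(new_auth.values())
        h_total = sum(new_hub.values())
        auth = {p: v / a_total for p, v in new_auth.items()}
        hub = {p: v / h_total for p, v in new_hub.items()}
    return hub, auth


hub_scores, auth_scores = hits(LINKS)
```

In this toy graph the pages linked from both hubs end up with the highest authority, and the hub that links to more authorities gets the higher hub score.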

  16. [Kleinberg ‘98] HITS converges to a single stable point
      - Slightly change the notation:
        - Vectors $a = (a_1, \dots, a_n)$, $h = (h_1, \dots, h_n)$
        - Adjacency matrix ($n \times n$): $M_{ij} = 1$ if $i \to j$
      - Then: $h_i = \sum_{i \to j} a_j \iff h_i = \sum_j M_{ij} a_j$
      - So: $h = M a$
      - And likewise: $a = M^T h$

  17. HITS algorithm in new notation:
      - Set: $a = h = \mathbf{1}^n$
      - Repeat:
        - $h = M a$, $a = M^T h$
        - Normalize
      - Then: $a = M^T (M a)$, where the inner $M a$ is the new $h$ and the result is the new $a$
      - $a$ is updated (in 2 steps): $M^T (M a) = (M^T M)\, a$
      - $h$ is updated (in 2 steps): $M (M^T h) = (M M^T)\, h$
      - Thus, in $2k$ steps: $a = (M^T M)^k a$ and $h = (M M^T)^k h$
      - Repeated matrix powering
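
The claim that two update steps equal one multiplication by $M^T M$ can be checked numerically. The sketch below uses plain Python lists; the 3-node adjacency matrix `M` and the helper names are hypothetical, chosen only for illustration, and normalization is skipped because it only rescales the vectors.

```python
def mat_vec(M, x):
    """Matrix-vector product M x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def transpose(M):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def mat_mul(A, B):
    """Matrix product A B."""
    B_t = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in B_t] for row in A]


# Hypothetical adjacency matrix: M[i][j] = 1 if page i links to page j.
M = [
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
]

a = [1, 1, 1]

# Two-step update: h = M a, then a = M^T h.
h = mat_vec(M, a)
a_two_step = mat_vec(transpose(M), h)

# One-step update with the combined matrix M^T M.
MtM = mat_mul(transpose(M), M)
a_one_step = mat_vec(MtM, a)
```

Both routes give the same vector, and `MtM` is symmetric, which is the property the eigenvector argument on the following slides relies on.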

  18. Definition:
      - Let $A x = \lambda x$ for some scalar $\lambda$, vector $x$, matrix $A$
      - Then $x$ is an eigenvector, and $\lambda$ is its eigenvalue

      Fact:
      - If $A$ is symmetric ($A_{ij} = A_{ji}$) (in our case $M^T M$ and $M M^T$ are symmetric)
      - Then $A$ has $n$ orthogonal unit eigenvectors $w_1, \dots, w_n$ that form a basis (coordinate system), with eigenvalues $\lambda_1, \dots, \lambda_n$ ($|\lambda_i| \ge |\lambda_{i+1}|$)

  19. Let’s write $x$ in the coordinate system $w_1, \dots, w_n$:
      - $x = \sum_i \alpha_i w_i$, so $x$ has coordinates $(\alpha_1, \dots, \alpha_n)$
      - Suppose the eigenvalues $\lambda_1, \dots, \lambda_n$ are ordered so that $|\lambda_1| \ge \dots \ge |\lambda_n|$
      - Then $A^k x = \sum_i \alpha_i \lambda_i^k w_i$
      - As $k \to \infty$, if we normalize, $A^k x$ converges to the direction of $\alpha_1 w_1$, because $\lim_{k \to \infty} \lambda_i^k / \lambda_1^k = 0$ whenever $|\lambda_i| < |\lambda_1|$ (the contribution of all other coordinates $\to 0$). This requires $|\lambda_1| > |\lambda_2|$ and $\alpha_1 \neq 0$.
      - So authority $a$ is the eigenvector of $M^T M$ associated with the largest eigenvalue $\lambda_1$
      - Similarly: hub $h$ is the eigenvector of $M M^T$
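
Written out, the normalization step divides by $\lambda_1^k$. The display below is a worked version of that step, assuming a strictly dominant eigenvalue ($|\lambda_1| > |\lambda_2|$) and a nonzero first coordinate ($\alpha_1 \neq 0$), neither of which is stated explicitly on the slide:

$$
\frac{A^k x}{\lambda_1^k}
= \alpha_1 w_1 + \sum_{i=2}^{n} \alpha_i \left(\frac{\lambda_i}{\lambda_1}\right)^{k} w_i
\;\longrightarrow\; \alpha_1 w_1
\quad \text{as } k \to \infty .
$$

Each ratio $\lambda_i / \lambda_1$ with $i \ge 2$ has absolute value below 1, so its $k$-th power vanishes and only the $w_1$ component survives.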

  20. - A “vote” from an important page is worth more
      - A page is important if it is pointed to by other important pages
      - Define a “rank” $r_j$ for node $j$: $r_j = \sum_{i \to j} \frac{r_i}{d_{\text{out}}(i)}$
      - Figure (“The web in 1839”): pages y, a, m, with edge flows y/2, a/2, and m
      - Flow equations:
        - $r_y = r_y/2 + r_a/2$
        - $r_a = r_y/2 + r_m$
        - $r_m = r_a/2$
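
The three flow equations are linearly dependent, so they only fix the ranks up to a scale factor. As a worked example, assume the additional constraint $r_y + r_a + r_m = 1$ (the normalization used for the rank vector later in the deck):

$$
\begin{aligned}
r_y &= \tfrac{1}{2} r_y + \tfrac{1}{2} r_a \;\Rightarrow\; r_y = r_a, \\
r_m &= \tfrac{1}{2} r_a, \\
r_y + r_a + r_m &= r_a + r_a + \tfrac{1}{2} r_a = \tfrac{5}{2} r_a = 1 \;\Rightarrow\; r_a = \tfrac{2}{5}.
\end{aligned}
$$

So $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$, and the remaining equation holds as well: $r_y/2 + r_m = 1/5 + 1/5 = 2/5 = r_a$.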

  21. Stochastic adjacency matrix $M$
      - Let page $j$ have $d_j$ out-links
      - If $j \to i$, then $M_{ij} = 1/d_j$, else $M_{ij} = 0$
      - $M$ is a column stochastic matrix: columns sum to 1
      - Figure: page $j$ with 3 out-links, so each linked page $i$ gets $M_{ij} = 1/3$
      - Rank vector $r$: vector with an entry per page
        - $r_i$ is the importance score of page $i$
        - $\sum_i r_i = 1$
      - The flow equations can be written $r = M r$
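
As a small sketch of this construction, the code below builds $M$ for the three-page y/a/m example, with out-links read off the flow equations. The function name `stochastic_matrix` is hypothetical, and the sketch assumes every page has at least one out-link and no duplicate links.

```python
from fractions import Fraction

# Out-links of the three-page example (y, a, m), read off the flow equations.
OUT_LINKS = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}


def stochastic_matrix(out_links):
    """Return (pages, M) with M[i][j] = 1/d_j if page j links to page i, else 0."""
    pages = list(out_links)
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for j, targets in out_links.items():
        for i in targets:
            # Column j spreads page j's rank evenly over its d_j out-links.
            M[index[i]][index[j]] = Fraction(1, len(targets))
    return pages, M


pages, M = stochastic_matrix(OUT_LINKS)
column_sums = [sum(M[i][j] for i in range(len(pages))) for j in range(len(pages))]
```

Exact fractions are used so the column sums come out as exactly 1 rather than as a floating-point approximation.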

  22. Imagine a random web surfer:
      - At any time $t$, the surfer is on some page $u$
      - At time $t+1$, the surfer follows an out-link from $u$ uniformly at random
      - Ends up on some page $v$ linked from $u$
      - Process repeats indefinitely

      Let:
      - $p(t)$ be the vector whose $i$-th coordinate is the probability that the surfer is at page $i$ at time $t$
      - $p(t)$ is a probability distribution over pages

  23. Where is the surfer at time $t+1$?
      - Follows a link uniformly at random: $p(t+1) = M\, p(t)$
      - Suppose the random walk reaches a state $p(t+1) = M\, p(t) = p(t)$; then $p(t)$ is a stationary distribution of the random walk
      - Our rank vector $r$ satisfies $r = M r$
      - So it is a stationary distribution for the random walk
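
The random-surfer view can also be tried empirically. The sketch below simulates one long walk on the three-page y/a/m example and records how often each page is visited. The function name, the step count, and the fixed seed are assumptions made for illustration; for this small graph the visit frequencies should land close to the solution of $r = M r$, which is $(2/5, 2/5, 1/5)$.

```python
import random

# Out-links of the three-page example (y, a, m).
OUT_LINKS = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}


def simulate_surfer(out_links, steps, seed=0):
    """Follow random out-links for `steps` steps; return visit frequencies."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    page = rng.choice(sorted(out_links))  # arbitrary starting page
    visits = {p: 0 for p in out_links}
    for _ in range(steps):
        # Follow one out-link of the current page uniformly at random.
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}


frequencies = simulate_surfer(OUT_LINKS, steps=200_000)
```

The simulation only approximates the stationary distribution; the matrix iteration $p(t+1) = M\, p(t)$ gets there deterministically.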

  24. Given a web graph with $n$ nodes, where the nodes are pages and edges are hyperlinks:
      - Assign each node an initial page rank
      - Repeat until convergence:
        - Calculate the page rank of each node: $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
      - $d_i$ is the out-degree of node $i$

  25. Power Iteration:
      - Set $r_j = 1/N$
      - Update $r_j \leftarrow \sum_{i \to j} \frac{r_i}{d_i}$
      - And iterate
      - Matrix $M$ for the example (columns = source page, rows = destination page):

               y    a    m
          y    ½    ½    0
          a    ½    0    1
          m    0    ½    0

      - Flow equations:
        - $r_y = r_y/2 + r_a/2$
        - $r_a = r_y/2 + r_m$
        - $r_m = r_a/2$
      - Example (iterations 0, 1, 2, 3, …, limit):

          r_y   1/3   1/3   5/12   9/24    …   6/15
          r_a   1/3   3/6   1/3    11/24   …   6/15
          r_m   1/3   1/6   3/12   1/6     …   3/15
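
The example table can be reproduced with a few lines of Python. This is a minimal sketch using exact fractions for the first three iterations and floating point for the long run; the function name `power_iteration_step` and the iteration counts are choices made here, not part of the slides.

```python
from fractions import Fraction

# Column-stochastic matrix of the y/a/m example (rows and columns in order y, a, m).
M = [
    [Fraction(1, 2), Fraction(1, 2), Fraction(0)],
    [Fraction(1, 2), Fraction(0), Fraction(1)],
    [Fraction(0), Fraction(1, 2), Fraction(0)],
]


def power_iteration_step(M, r):
    """One update r <- M r."""
    return [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in M]


# Exact arithmetic for iterations 0..3, matching the first columns of the table.
r = [Fraction(1, 3)] * 3
history = [r]
for _ in range(3):
    r = power_iteration_step(M, r)
    history.append(r)

# Floating-point run for many iterations to approach the limit column.
r_float = [1 / 3] * 3
for _ in range(100):
    r_float = power_iteration_step(M, r_float)
```

Note that `Fraction` reduces values automatically, so the table's 9/24 appears as 3/8 and 3/12 as 1/4; the limit 6/15, 6/15, 3/15 corresponds to 0.4, 0.4, 0.2.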
