CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu
How to organize/navigate it? First try: human-curated Web directories (Yahoo, DMOZ, LookSmart) 11/8/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
SEARCH! Find relevant docs in a small and trusted set: newspaper articles, patents, etc.
Two traditional problems:
  Synonymy: buy – purchase, sick – ill
  Polysemy: jaguar
Do more documents mean better results?
What is the “best” answer to the query “Stanford”?
  Anchor text: I go to Stanford where I study
What about the query “newspaper”?
  No single right answer
Scarcity (IR) vs. abundance (Web) of information
  Web: many sources of information. Who to “trust”?
  Trick: pages that actually know about newspapers might all be pointing to many newspapers
Ranking!
[Figure: the “golden triangle”]
Web pages are not equally “important”
  www.joe-schmoe.com vs. www.stanford.edu
We already know: since there is large diversity in the connectivity of the web graph, we can rank the pages by the link structure
We will cover the following link analysis approaches to computing the importance of nodes in a graph:
  Hubs and Authorities (HITS)
  PageRank
  Topic-Specific (Personalized) PageRank
Sidenote: various notions of node centrality for a node u:
  Degree centrality = degree of u
  Betweenness centrality = number of shortest paths passing through u
  Closeness centrality = average length of shortest paths from u to all other nodes
  Eigenvector centrality = like PageRank
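The two simplest centrality notions in the sidenote can be sketched in a few lines. This is only an illustration under stated assumptions: the toy graph and the function names are hypothetical (not from the lecture), the graph is undirected and connected, and closeness follows the slide's definition (average shortest-path length, so smaller means more central).

```python
from collections import deque

# Hypothetical undirected toy graph as an adjacency list (not from the slides).
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}

def degree_centrality(g, u):
    """Degree centrality: the number of neighbours of u."""
    return len(g[u])

def closeness_centrality(g, u):
    """Average shortest-path length from u to every other reachable node."""
    dist = {u: 0}
    queue = deque([u])
    while queue:  # breadth-first search gives shortest paths in unweighted graphs
        v = queue.popleft()
        for w in g[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    others = [d for node, d in dist.items() if node != u]
    return sum(others) / len(others)
```

Under these definitions node "b" has the highest degree and the smallest average distance in the toy graph, while the leaf "d" scores worst on both.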
Goal (back to the newspaper example): Don’t just find newspapers. Find “experts”: people who link in a coordinated way to good newspapers
Idea: links as votes
  A page is more important if it has more links
  In-coming links? Out-going links?
Hubs and Authorities: each page has 2 scores:
  Quality as an expert (hub): total sum of votes of the pages it points to
  Quality as content (authority): total sum of votes of the experts pointing to it
Principle of repeated improvement
[Figure: example vote counts. NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]
Interesting pages fall into two classes:
1. Authorities are pages containing useful information
  Newspaper home pages
  Course home pages
  Home pages of auto manufacturers
2. Hubs are pages that link to authorities
  List of newspapers
  Course bulletin
  List of US auto manufacturers
[Figure: example vote counts. NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]
A good hub links to many good authorities
A good authority is linked from many good hubs
Model using two scores for each node: hub score and authority score
  Represented as vectors h and a
[Kleinberg ‘98]
Each page $i$ has 2 scores:
  Authority score: $a_i$
  Hub score: $h_i$
HITS algorithm:
  Initialize: $a_i = 1$, $h_i = 1$
  Then keep iterating, for every page $i$:
    Authority: $a_i = \sum_{j \to i} h_j$
    Hub: $h_i = \sum_{i \to j} a_j$
    Normalize $a$ and $h$ to a fixed total (e.g. $\sum_i a_i = 1$, $\sum_i h_i = 1$)
[Figure: pages $j_1, \dots, j_4$ pointing to page $i$ (authority update), and page $i$ pointing to pages $j_1, \dots, j_4$ (hub update)]
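The node-wise updates above can be sketched directly on a dictionary of out-links. This is a minimal sketch, not the lecture's implementation: the function name `hits`, the example pages, and the fixed iteration count are hypothetical, and normalizing each score vector to sum to 1 is an assumed choice of normalization.

```python
def hits(out_links, iterations=50):
    """Node-wise HITS on a dict {page: [pages it links to]}."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    a = {u: 1.0 for u in nodes}  # authority scores, initialized to 1
    h = {u: 1.0 for u in nodes}  # hub scores, initialized to 1
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages pointing to i.
        new_a = {u: 0.0 for u in nodes}
        for j, targets in out_links.items():
            for i in targets:
                new_a[i] += h[j]
        # Hub update: sum the (freshly updated) authority scores i points to.
        new_h = {u: sum(new_a[j] for j in out_links.get(u, [])) for u in nodes}
        # Assumed normalization: rescale each vector so it sums to 1.
        sa, sh = sum(new_a.values()), sum(new_h.values())
        a = {u: x / sa for u, x in new_a.items()}
        h = {u: x / sh for u, x in new_h.items()}
    return a, h

# Hypothetical example: two list pages pointing at newspapers.
links = {"list1": ["nyt", "wsj", "cnn"], "list2": ["nyt", "wsj"]}
authority, hub = hits(links)
```

In this toy graph the two newspapers cited by both lists end up with equal authority, above the one cited only once, and the list that links to more good authorities gets the higher hub score.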
[Kleinberg ‘98]
HITS converges to a single stable point
Slightly change the notation:
  Vectors $a = (a_1, \dots, a_n)$, $h = (h_1, \dots, h_n)$
  Adjacency matrix ($n \times n$): $M_{ij} = 1$ if $i \to j$
Then: $h_i = \sum_{i \to j} a_j \iff h_i = \sum_j M_{ij} a_j$
So: $h = M a$
And likewise: $a = M^T h$
HITS algorithm in new notation:
  Set: $a = h = 1^n$
  Repeat: $h = M a$, $a = M^T h$, normalize
Then: $a = M^T (M a)$, where the inner $M a$ is the new $h$ and the result is the new $a$
  $a$ is being updated (in 2 steps): $M^T (M a) = (M^T M)\, a$
  $h$ is updated (in 2 steps): $M (M^T h) = (M M^T)\, h$
Thus, in $2k$ steps: $a = (M^T M)^k a$ and $h = (M M^T)^k h$
Repeated matrix powering
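A small check of the two-step identity, written in plain Python so it has no dependencies. The 3-node adjacency matrix and the helper names (`matvec`, `transpose`, `matmul`) are hypothetical illustrations, and normalization is left out because it only rescales the vectors.

```python
def matvec(A, x):
    """Matrix-vector product A x for a list-of-rows matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Matrix product A B for list-of-rows matrices."""
    Bt = transpose(B)
    return [[sum(x * y for x, y in zip(row, col)) for col in Bt] for row in A]

# Hypothetical 3-node graph: M[i][j] = 1 if i -> j.
M = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
Mt = transpose(M)
a = [1, 1, 1]

two_step = matvec(Mt, matvec(M, a))  # h = M a, then a = M^T h
one_step = matvec(matmul(Mt, M), a)  # a = (M^T M) a
```

Both routes give the same vector, and `matmul(Mt, M)` is symmetric, which is the property the eigenvector argument relies on.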
Definition: Let $A x = \lambda x$ for some scalar $\lambda$, vector $x$, matrix $A$
  Then $x$ is an eigenvector, and $\lambda$ is its eigenvalue
Fact: If $A$ is symmetric ($A_{ij} = A_{ji}$) (in our case $M^T M$ and $M M^T$ are symmetric)
  then $A$ has $n$ orthogonal unit eigenvectors $w_1, \dots, w_n$ that form a basis (coordinate system), with eigenvalues $\lambda_1, \dots, \lambda_n$ ($|\lambda_i| \ge |\lambda_{i+1}|$)
Let’s write $x$ in the coordinate system $w_1, \dots, w_n$: $x = \sum_i \alpha_i w_i$
  $x$ has coordinates $(\alpha_1, \dots, \alpha_n)$
Suppose the eigenvalues are ordered $\lambda_1, \dots, \lambda_n$ with $|\lambda_1| \ge \dots \ge |\lambda_n|$
  $A^k x = \sum_i \alpha_i A^k w_i = \sum_i \lambda_i^k \alpha_i w_i$
As $k \to \infty$, if we normalize: $A^k x / \lambda_1^k \to \alpha_1 w_1$
  since $\lim_{k \to \infty} \lambda_i^k / \lambda_1^k = 0$ whenever $|\lambda_i| < |\lambda_1|$ (contribution of all other coordinates $\to 0$)
  This assumes $\alpha_1 \ne 0$ and $|\lambda_1| > |\lambda_2|$
So authority $a$ is the eigenvector of $M^T M$ associated with the largest eigenvalue $\lambda_1$
Similarly: hub $h$ is the eigenvector of $M M^T$
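The convergence argument can be observed numerically on a tiny symmetric matrix. This is a sketch under stated assumptions: the 2 x 2 matrix is a hypothetical example chosen because its eigenvectors are known in closed form (eigenvalue 3 with eigenvector (1, 1)/sqrt(2), eigenvalue 1 with eigenvector (1, -1)/sqrt(2)), and normalizing to unit Euclidean length is one possible choice.

```python
import math

def power_iteration(A, x, steps=50):
    """Repeatedly apply A and renormalize x to unit Euclidean length."""
    for _ in range(steps):
        x = [sum(a * v for a, v in zip(row, x)) for row in A]
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
    return x

# Hypothetical symmetric matrix with eigenvalues 3 and 1.
A = [[2.0, 1.0], [1.0, 2.0]]
w1 = power_iteration(A, [1.0, 0.0])  # start vector has a nonzero w_1 component
```

The component along the second eigenvector shrinks by a factor of 1/3 per step here, so after a few dozen steps `w1` is numerically indistinguishable from the dominant eigenvector.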
A “vote” from an important page is worth more
A page is important if it is pointed to by other important pages
Define a “rank” $r_j$ for node $j$: $r_j = \sum_{i \to j} \frac{r_i}{d_{\text{out}}(i)}$
[Figure: “The web in 1839”, three pages y, a, m with links y → y, y → a, a → y, a → m, m → a; each link carries its source's rank split evenly over the source's out-links (y/2, a/2, m)]
Flow equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$
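The three flow equations only determine the ranks up to a common scale. Assuming the ranks are additionally normalized so that they sum to 1, the system can be solved by hand:

```latex
\begin{aligned}
r_y &= \tfrac{r_y}{2} + \tfrac{r_a}{2} \;\Rightarrow\; r_y = r_a \\
r_m &= \tfrac{r_a}{2} \\
r_y + r_a + r_m &= r_a + r_a + \tfrac{r_a}{2} = \tfrac{5}{2}\, r_a = 1
  \;\Rightarrow\; r_a = \tfrac{2}{5},\quad r_y = \tfrac{2}{5},\quad r_m = \tfrac{1}{5}
\end{aligned}
```

The remaining equation is consistent with this solution: $r_y/2 + r_m = \tfrac{1}{5} + \tfrac{1}{5} = \tfrac{2}{5} = r_a$.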
Stochastic adjacency matrix $M$
  Let page $j$ have $d_j$ out-links
  If $j \to i$, then $M_{ij} = 1/d_j$, else $M_{ij} = 0$
  $M$ is a column stochastic matrix: columns sum to 1
Rank vector $r$: vector with an entry per page
  $r_i$ is the importance score of page $i$
  $\sum_i r_i = 1$
The flow equations can be written $r = M r$
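As a sketch of this definition, the matrix for the y/a/m example graph can be built from a dictionary of out-links. The function name `stochastic_matrix` and the dictionary representation are hypothetical, and the code assumes every page has at least one out-link. Exact fractions are used so that the column sums can be compared without rounding.

```python
from fractions import Fraction

def stochastic_matrix(out_links, nodes):
    """M[i][j] = 1/d_j if j -> i, else 0 (assumes every node has an out-link)."""
    index = {u: k for k, u in enumerate(nodes)}
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for j, targets in out_links.items():
        for i in targets:
            M[index[i]][index[j]] = Fraction(1, len(targets))
    return M

# The three-page example graph: y -> y, y -> a, a -> y, a -> m, m -> a.
nodes = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(out_links, nodes)
column_sums = [sum(M[i][j] for i in range(len(nodes))) for j in range(len(nodes))]
```

Each column of `M` describes how one page splits its rank among its out-links, which is why the columns (not the rows) sum to 1.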
Imagine a random web surfer:
  At any time $t$, the surfer is on some page $u$
  At time $t+1$, the surfer follows an out-link from $u$ uniformly at random
  Ends up on some page $v$ linked from $u$
  Process repeats indefinitely
Let $p(t)$ be the vector whose $i$-th coordinate is the probability that the surfer is at page $i$ at time $t$
  $p(t)$ is a probability distribution over pages
Where is the surfer at time $t+1$?
  Follows a link uniformly at random: $p(t+1) = M p(t)$
Suppose the random walk reaches a state where $p(t+1) = M p(t) = p(t)$
  Then $p(t)$ is a stationary distribution of the random walk
Our rank vector $r$ satisfies $r = M r$
  So it is a stationary distribution for the random walk
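The random-surfer picture can be simulated directly. This is a rough sketch, assuming the three-page y/a/m example graph; the function name `simulate_surfer`, the step count, and the fixed seed are hypothetical choices. On this particular graph (strongly connected, with a self-loop at y) the long-run visit frequencies approach the stationary distribution, which the flow equations give as 2/5, 2/5, 1/5.

```python
import random

def simulate_surfer(out_links, start, steps, seed=0):
    """Follow random out-links for `steps` steps; return visit frequencies."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = {u: 0 for u in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])  # uniform choice among out-links
        visits[page] += 1
    return {u: count / steps for u, count in visits.items()}

# Example graph: y -> y, y -> a, a -> y, a -> m, m -> a.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(out_links, "y", 200_000)
```

The frequencies are only estimates; solving $r = M r$ or iterating $p(t+1) = M p(t)$ gives the same answer without sampling noise.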
Given a web graph with $n$ nodes, where the nodes are pages and the edges are hyperlinks
  Assign each node an initial page rank
  Repeat until convergence: calculate the page rank of each node
    $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
    where $d_i$ is the out-degree of node $i$
Power Iteration:
  Set $r_j = 1/n$ (here 1/3 for each page)
  Then iterate $r_j \leftarrow \sum_{i \to j} r_i / d_i$

Matrix $M$ for the example graph (column = source page, row = destination page):
       y    a    m
  y    ½    ½    0
  a    ½    0    1
  m    0    ½    0

Flow equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$

Example:
  Iteration   0     1     2      3      …   limit
  r_y         1/3   1/3   5/12   9/24   …   6/15
  r_a         1/3   3/6   1/3    11/24  …   6/15
  r_m         1/3   1/6   3/12   1/6    …   3/15
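The iterations in the example can be reproduced with exact fractions. This is a minimal sketch of the node-wise update on the y/a/m graph; the function name `pagerank_step` and the iteration counts are hypothetical, and the code assumes every page has at least one out-link.

```python
from fractions import Fraction

def pagerank_step(out_links, r):
    """One update: each page i sends r[i] / d_i along each of its out-links."""
    new_r = {u: Fraction(0) for u in r}
    for i, targets in out_links.items():
        share = r[i] / len(targets)
        for j in targets:
            new_r[j] += share
    return new_r

# Example graph: y -> y, y -> a, a -> y, a -> m, m -> a.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
r = {u: Fraction(1, 3) for u in out_links}  # iteration 0: uniform ranks
history = [r]
for _ in range(3):
    r = pagerank_step(out_links, r)
    history.append(r)

# Keep iterating to approach the limit (the iteration count is arbitrary).
final = r
for _ in range(100):
    final = pagerank_step(out_links, final)
```

`history[0]` through `history[3]` hold the ranks at iterations 0 to 3, and `final` is close to the limiting ranks 6/15, 6/15, 3/15.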