CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
How to organize/navigate the Web?
  First try: Web directories
  Yahoo, DMOZ, LookSmart
11/29/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
SEARCH!
  Find relevant docs in a small and trusted set:
    Newspaper articles
    Patents, etc.
  Two traditional problems:
    Synonymy: buy – purchase, sick – ill
    Polysemy: jaguar
Do more documents mean better results?
What is the “best” answer to the query “Stanford”?
  Anchor text: I go to Stanford where I study
What about the query “newspaper”?
  No single right answer
Scarcity (IR) vs. abundance (Web) of information
  Web: Many sources of information. Who to “trust”?
  Trick: Pages that actually know about newspapers might all be pointing to many newspapers
Ranking!
Goal (back to the newspaper example):
  Don’t just find newspapers. Find “experts”: people who link in a coordinated way to good newspapers
Idea: Links as votes
  Page is more important if it has more links
  In-coming links? Out-going links?
Hubs and Authorities
  Quality as an expert (hub): total sum of votes of pages pointed to
  Quality as content (authority): total sum of votes of experts
  Principle of repeated improvement
[Figure: example vote counts. NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]
[Kleinberg ‘98]
Each page $i$ has 2 kinds of scores:
  Hub score: $h_i$
  Authority score: $a_i$
HITS algorithm:
  Initialize: $a_i = h_i = 1$
  Then keep iterating:
    Authority: $a_i = \sum_{j \to i} h_j$
    Hub: $h_i = \sum_{i \to j} a_j$
    Normalize: $\sum_i a_i = 1$, $\sum_i h_i = 1$
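The iteration above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `hits`, the edge-list representation, the fixed number of steps, and the toy graph are all assumptions made for illustration. It also assumes the hub update uses the freshly computed authority scores.

```python
def hits(edges, n, steps=50):
    """Run HITS on a directed graph with nodes 0..n-1 and (source, target) edges."""
    a = [1.0] * n  # authority scores, initialized to 1
    h = [1.0] * n  # hub scores, initialized to 1
    for _ in range(steps):
        # Authority update: sum the hub scores of the pages linking in.
        new_a = [0.0] * n
        for src, dst in edges:
            new_a[dst] += h[src]
        # Hub update: sum the (new) authority scores of the pages linked to.
        new_h = [0.0] * n
        for src, dst in edges:
            new_h[src] += new_a[dst]
        # Normalize so that each score vector sums to 1.
        total_a, total_h = sum(new_a), sum(new_h)
        a = [x / total_a for x in new_a]
        h = [x / total_h for x in new_h]
    return a, h


# Hypothetical toy graph: pages 0-2 act as hubs, pages 3-4 as authorities.
edges = [(0, 3), (0, 4), (1, 3), (2, 3)]
a, h = hits(edges, n=5)
```

In this toy graph page 3 collects three votes and page 4 only one, so page 3 ends up with the higher authority score, and page 0 (which links to both) ends up as the strongest hub.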
[Kleinberg ‘98]
HITS converges to a single stable point
Slightly change the notation:
  Vector $a = (a_1, \ldots, a_n)$, $h = (h_1, \ldots, h_n)$
  Adjacency matrix ($n \times n$): $M_{ij} = 1$ if $i \to j$
Then:
  $h_i = \sum_{i \to j} a_j \iff h_i = \sum_j M_{ij} a_j$
So: $h = Ma$
And likewise: $a = M^T h$
Algorithm in new notation:
  Set: $a = h = \mathbf{1}^n$
  Repeat:
    $h = Ma$, $a = M^T h$
    Normalize
Then: $a = M^T (Ma)$
  $a$ is being updated (in 2 steps): $M^T (Ma) = (M^T M) a$
  $h$ is updated (in 2 steps): $M (M^T h) = (M M^T) h$
Thus, in $2k$ steps:
  $a = (M^T M)^k a$
  $h = (M M^T)^k h$
Repeated matrix powering
Definition:
  Let $Ax = \lambda x$ for some scalar $\lambda$, vector $x$ and matrix $A$
  Then $x$ is an eigenvector, and $\lambda$ is its eigenvalue
Fact:
  If $A$ is symmetric ($A_{ij} = A_{ji}$) (in our case $M^T M$ and $M M^T$ are symmetric)
  Then $A$ has $n$ orthogonal unit eigenvectors $w_1, \ldots, w_n$ that form a basis (coordinate system) with eigenvalues $\lambda_1, \ldots, \lambda_n$ ($|\lambda_i| \ge |\lambda_{i+1}|$)
Write $x$ in coordinate system $w_1, \ldots, w_n$:
  $x = \sum_i \alpha_i w_i$
  $x$ has coordinates $(\alpha_1, \ldots, \alpha_n)$
Suppose: $\lambda_1, \ldots, \lambda_n$ ($|\lambda_1| \ge |\lambda_2| \ge \ldots \ge |\lambda_n|$)
  $A^k x = (\lambda_1^k \alpha_1, \lambda_2^k \alpha_2, \ldots, \lambda_n^k \alpha_n) = \sum_i \lambda_i^k \alpha_i w_i$
As $k \to \infty$, if we normalize, $A^k x$ converges to a multiple of $w_1$ (all other coordinates $\to 0$), provided $|\lambda_1| > |\lambda_2|$ and $\alpha_1 \ne 0$
So authority $a$ is the eigenvector of $M^T M$ associated with the largest eigenvalue $\lambda_1$
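The eigenvector claim can be checked numerically. The sketch below is an illustration under stated assumptions: the 3-node adjacency matrix `M` is a made-up example (edges 0→1, 0→2, 1→2), the helper names are hypothetical, and the vector is normalized to unit Euclidean length at every step. It forms $M^T M$, applies repeated multiplication starting from the all-ones vector, and then measures how close the result is to satisfying $Ax = \lambda x$.

```python
import math


def mat_vec(A, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]


def power_iteration(A, steps=100):
    """Repeatedly apply A to the all-ones vector, renormalizing to unit length."""
    x = [1.0] * len(A)
    for _ in range(steps):
        x = mat_vec(A, x)
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
    return x


# Hypothetical adjacency matrix: M[i][j] = 1 if i links to j.
M = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
n = len(M)

# Entry (i, j) of M^T M counts the pages that link to both i and j.
MtM = [[sum(M[k][i] * M[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

a = power_iteration(MtM)
Aa = mat_vec(MtM, a)
lam = sum(u * v for u, v in zip(a, Aa))  # Rayleigh quotient, since |a| = 1
residual = max(abs(Aa[i] - lam * a[i]) for i in range(n))
```

For this particular matrix the nonzero block of $M^T M$ is $\begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$, whose largest eigenvalue is $(3 + \sqrt{5})/2$, so `lam` should land close to that value and `residual` close to zero.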
The web in 1839
[Figure: three-page graph with nodes y, a, m; edge labels y/2, y/2, a/2, a/2, m]
A vote from an important page is worth more
A page is important if it is pointed to by other important pages
Define a “rank” $r_j$ for node $j$
  $r_j$ should be proportional to:
  $r_j = \sum_{i \to j} \frac{r_i}{\text{outdegree of } i}$
Flow equations:
  $y = y/2 + a/2$
  $a = y/2 + m$
  $m = a/2$
Stochastic adjacency matrix $M$
  Let page $j$ have $d_j$ out-links
  If $j \to i$, then $M_{ij} = 1/d_j$, else $M_{ij} = 0$
  $M$ is a column stochastic matrix
    Columns sum to 1
Rank vector $r$: vector with 1 entry per page
  $r_i$ is the importance score of page $i$
  $|r| = 1$
The flow equations can be written $r = Mr$
11/29/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
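As a small worked check, the matrix $M$ for the three-page y, a, m example can be built from out-link lists and tested against $r = Mr$. This is a sketch under assumptions: the function name `stochastic_matrix` and the dictionary layout are hypothetical, the link structure (y→y, y→a, a→y, a→m, m→a) is read off the example's flow equations, and the candidate vector $r = (2/5, 2/5, 1/5)$ is the example's solution rescaled so that its entries sum to 1. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def stochastic_matrix(nodes, out_links):
    """Return M with M[i][j] = 1/d_j if page j links to page i, else 0."""
    index = {name: k for k, name in enumerate(nodes)}
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for j_name, targets in out_links.items():
        j = index[j_name]
        for i_name in targets:
            # Page j splits its vote evenly over its d_j out-links.
            M[index[i_name]][j] = Fraction(1, len(targets))
    return M


nodes = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(nodes, out_links)

# Each column should sum to 1 (column stochastic).
column_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]

# Candidate rank vector (entries sum to 1) and its image under M.
r = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
Mr = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]
```

Here `Mr` equals `r` entry by entry, which is exactly the statement that the flow equations hold.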
Imagine a random web surfer:
  At any time $t$, surfer is on some page $u$
  At time $t+1$, the surfer follows an out-link from $u$ uniformly at random
  Ends up on some page $v$ linked from $u$
  Process repeats indefinitely
Let:
  $p(t)$ … vector whose $i$th coordinate is the prob. that the surfer is at page $i$ at time $t$
  $p(t)$ is a probability distribution over pages
Where is the surfer at time $t+1$?
  Follows a link uniformly at random
  $p(t+1) = M p(t)$
Suppose the random walk reaches a state $p(t+1) = M p(t) = p(t)$
  then $p(t)$ is a stationary distribution of the random walk
Our rank vector $r$ satisfies $r = Mr$
  So it is a stationary distribution for the random surfer
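The random-surfer view can be illustrated with a short simulation. The sketch below makes several assumptions: it reuses the three-page y, a, m example graph, the function name `simulate_surfer` is hypothetical, and a fixed seed is used only to make runs repeatable. For a graph like this one (every page reachable from every other, with a self-loop on y), the fraction of time spent on each page over a long walk approximates the stationary distribution.

```python
import random

# Out-links of the assumed three-page example graph.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}


def simulate_surfer(out_links, start, steps, seed=0):
    """Follow random out-links for `steps` moves; return visit frequencies."""
    rng = random.Random(seed)
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        # Pick one of the current page's out-links uniformly at random.
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {page: count / steps for page, count in visits.items()}


freq = simulate_surfer(out_links, start="y", steps=200_000)
```

The frequencies in `freq` should come out near 0.4 for y, 0.4 for a, and 0.2 for m, which is the solution of $r = Mr$ for this graph scaled to sum to 1.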
Power Iteration:
  Set $r_i = 1$
  $r_j = \sum_{i \to j} r_i / d_i$
  And iterate

        y    a    m
  y     ½    ½    0
  a     ½    0    1
  m     0    ½    0

Example:
  y     1    1    5/4   9/8    …   6/5
  a  =  1   3/2    1    11/8   …   6/5
  m     1    ½     ¾     ½     …   3/5
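The example iterates can be reproduced with exact arithmetic. This is a minimal sketch: the helper name `step` and the choice of 100 extra iterations are assumptions, while the matrix is the one from the example (rows and columns ordered y, a, m).

```python
from fractions import Fraction

half = Fraction(1, 2)

# Column-stochastic matrix of the example; column j holds page j's votes.
M = [
    [half, half, 0],
    [half, 0, 1],
    [0, half, 0],
]


def step(M, r):
    """One power-iteration step: return M r."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]


r = [Fraction(1)] * 3  # start with r_i = 1 for every page
history = [r]
for _ in range(3):
    r = step(M, r)
    history.append(r)

# Keep iterating to approximate the limit, then convert to floats.
for _ in range(100):
    r = step(M, r)
limit = [float(x) for x in r]
```

The first three steps in `history` are (1, 3/2, 1/2), (5/4, 1, 3/4) and (9/8, 11/8, 1/2), and `limit` approaches (6/5, 6/5, 3/5). No normalization is applied, so the entries keep summing to 3 throughout.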
Some pages are “dead ends” (have no out-links)
  Such pages cause importance to leak out
Spider traps (all out-links are within the group)
  Eventually spider traps absorb all importance
Power Iteration:
  Set $r_i = 1$
  $r_j = \sum_{i \to j} r_i / d_i$
  And iterate

        y    a    m
  y     ½    ½    0
  a     ½    0    0
  m     0    ½    0

Example (m is a dead end: its column is all zeros):
  y     1    1     ¾    5/8   …   0
  a  =  1    ½     ½    3/8   …   0
  m     1    ½     ¼     ¼    …   0
Power Iteration:
  Set $r_i = 1$
  $r_j = \sum_{i \to j} r_i / d_i$
  And iterate

        y    a    m
  y     ½    ½    0
  a     ½    0    0
  m     0    ½    1

Example (m is a spider trap: its only out-link points to itself):
  y     1    1     ¾    5/8   …   0
  a  =  1    ½     ½    3/8   …   0
  m     1   3/2   7/4    2    …   3
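Both failure modes can be seen side by side by running the same iteration on the two matrices. The sketch below is illustrative only: the function name `power_iterate`, the use of floats, and the choice of 200 steps are assumptions, and the matrices are the dead-end and spider-trap examples with rows and columns ordered y, a, m.

```python
def power_iterate(M, steps):
    """Apply r <- M r for `steps` rounds, starting from r_i = 1."""
    r = [1.0] * len(M)
    for _ in range(steps):
        r = [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]
    return r


# Dead end: page m has no out-links, so its column is all zeros.
dead_end = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]

# Spider trap: page m links only to itself.
spider_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]

leaked = power_iterate(dead_end, 200)
trapped = power_iterate(spider_trap, 200)
```

With the dead end, every entry of `leaked` shrinks toward 0 because the matrix is no longer column stochastic and importance drains away. With the spider trap, the total stays at 3 but ends up entirely on m, so `trapped` approaches (0, 0, 3).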