CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu
Course topics:
- High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
- Graph data: PageRank, SimRank; Community Detection; Spam Detection
- Infinite data: Filtering data streams; Web advertising; Queries on streams
- Machine learning: SVM; Decision Trees; Perceptron, kNN
- Apps: Recommender systems; Association Rules; Duplicate document detection
2/5/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets
Facebook social graph: 4 degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
Connections between political blogs: polarization of the network [Adamic-Glance, 2005]
Citation networks and maps of science [Börner et al., 2012]
Internet (figure labels: domain1, domain2, domain3, router)
Seven Bridges of Königsberg [Euler, 1735]: Return to the starting point by traveling each link of the graph once and only once.
Web as a directed graph:
- Nodes: Webpages
- Edges: Hyperlinks
- Example pages (figure): "I teach a class on Networks.", "CS224W: Classes are in the Gates building", "Computer Science Department at Stanford", "Stanford University"
How to organize the Web?
- First try: Human-curated Web directories (Yahoo, DMOZ, LookSmart)
- Second try: Web Search
  - Information Retrieval investigates: find relevant docs in a small and trusted set (newspaper articles, patents, etc.)
  - But: the Web is huge, full of untrusted documents, random things, web spam, etc.
2 challenges of web search:
- (1) The Web contains many sources of information. Who to “trust”?
  - Trick: Trustworthy pages may point to each other!
- (2) What is the “best” answer to the query “newspaper”?
  - No single right answer
  - Trick: Pages that actually know about newspapers might all be pointing to many newspapers
Not all web pages are equally “important”
- www.joe-schmoe.com vs. www.stanford.edu
- There is large diversity in the web-graph node connectivity. Let’s rank the pages by the link structure!
We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
- PageRank
- Hubs and Authorities (HITS)
- Topic-Specific (Personalized) PageRank
- Web Spam Detection Algorithms
Idea: Links as votes
- A page is more important if it has more links
  - In-coming links? Out-going links?
- Think of in-links as votes:
  - www.stanford.edu has 23,400 in-links
  - www.joe-schmoe.com has 1 in-link
- Are all in-links equal?
  - Links from important pages count more
  - Recursive question!
Each link’s vote is proportional to the importance of its source page
- If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
- Page $j$’s own importance is the sum of the votes on its in-links
- Example (figure): page $i$ has 3 out-links and page $k$ has 4 out-links, and both link to $j$, so $r_j = r_i/3 + r_k/4$; page $j$ has 3 out-links, each carrying $r_j/3$
A “vote” from an important page is worth more
- A page is important if it is pointed to by other important pages
- Define a “rank” $r_j$ for page $j$: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$, where $d_i$ is the out-degree of node $i$
- Example graph “The web in 1839” (figure) with pages y, a, m: y links to itself and to a, a links to y and m, m links to a
- “Flow” equations:
  - $r_y = r_y/2 + r_a/2$
  - $r_a = r_y/2 + r_m$
  - $r_m = r_a/2$
Flow equations: 3 equations, 3 unknowns, no constants
- $r_y = r_y/2 + r_a/2$
- $r_a = r_y/2 + r_m$
- $r_m = r_a/2$
- No unique solution: all solutions are equivalent up to a scale factor
- An additional constraint forces uniqueness: $r_y + r_a + r_m = 1$
- Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
- Gaussian elimination works for small examples, but we need a better method for large web-size graphs
- We need a new formulation!
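The small example can be checked by hand or with a few lines of code. The following is a minimal sketch, not part of the original slides: it assumes the unknowns are ordered (r_y, r_a, r_m), replaces the redundant third flow equation with the normalization constraint, and solves the resulting 3x3 system by Gauss-Jordan elimination with exact fractions. The helper name `solve_linear_system` is hypothetical.

```python
from fractions import Fraction


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination using exact fractions."""
    n = len(a)
    # Augmented matrix [a | b] in exact rational arithmetic.
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)]
           for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick a row at or below the diagonal with a nonzero pivot.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


half = Fraction(1, 2)
# Row 1: r_y - r_y/2 - r_a/2 = 0
# Row 2: r_a - r_y/2 - r_m   = 0
# Row 3: r_y + r_a + r_m     = 1  (replaces the redundant flow equation for r_m)
coefficients = [
    [half, -half, 0],
    [-half, 1, -1],
    [1, 1, 1],
]
right_hand_side = [0, 0, 1]

r_y, r_a, r_m = solve_linear_system(coefficients, right_hand_side)
print(r_y, r_a, r_m)  # 2/5 2/5 1/5
```

The dropped equation $r_m = r_a/2$ still holds for the result, which is what "equivalent up to a scale factor" predicts: the three flow equations are linearly dependent, so one of them can be traded for the normalization constraint.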
Stochastic adjacency matrix $M$
- Let page $i$ have $d_i$ out-links
- If $i \to j$, then $M_{ji} = 1/d_i$, else $M_{ji} = 0$
- $M$ is a column stochastic matrix: columns sum to 1
- Rank vector $r$: a vector with an entry per page
  - $r_i$ is the importance score of page $i$
  - $\sum_i r_i = 1$
- The flow equations $r_j = \sum_{i \to j} r_i / d_i$ can be written as $r = M \cdot r$
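A minimal sketch of how $M$ could be built from out-link lists, using the three-page example (y, a, m) from the flow equations. The function name `build_stochastic_matrix` and the dictionary layout are assumptions made for illustration. The sketch also assumes every page has at least one out-link and that no link is listed twice.

```python
from fractions import Fraction


def build_stochastic_matrix(out_links, nodes):
    """Return M as a list of rows, with M[j][i] = 1/d_i if i links to j, else 0."""
    index = {node: position for position, node in enumerate(nodes)}
    n = len(nodes)
    matrix = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in out_links.items():
        out_degree = len(targets)  # d_i, assumed to be at least 1
        for target in targets:
            # Column = source page i, row = destination page j.
            matrix[index[target]][index[source]] = Fraction(1, out_degree)
    return matrix


nodes = ["y", "a", "m"]
# y links to itself and to a; a links to y and m; m links to a.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

M = build_stochastic_matrix(out_links, nodes)
column_sums = [sum(M[row][col] for row in range(len(nodes)))
               for col in range(len(nodes))]

for node, row in zip(nodes, M):
    print(node, [str(entry) for entry in row])
print("column sums:", [str(s) for s in column_sums])
```

Each column sums to 1 because page $i$ splits its vote evenly over its $d_i$ out-links.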
Remember the flow equation: $r_j = \sum_{i \to j} r_i / d_i$
- Flow equation in matrix form: $M \cdot r = r$
- Suppose page $i$ links to 3 pages, including $j$. Then column $i$ of $M$ has the entry $M_{ji} = 1/3$, so page $i$ contributes $r_i/3$ to $r_j$ in the product $M \cdot r$
The flow equations can be written $r = M \cdot r$
- So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$
  - In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
  - NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if $A x = \lambda x$
- The largest eigenvalue of $M$ is 1 since $M$ is column stochastic
  - We know $r$ has unit length in the $L_1$ norm and each column of $M$ sums to one, so $\|M r\|_1 \le 1$
- We can now efficiently solve for $r$! The method is called power iteration
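The eigenvector claim can be checked on the three-page example (pages y, a, m) used throughout this lecture. This is an illustrative sketch: the helper `mat_vec` is a hypothetical name, and the test vector `x` is an arbitrary choice, not from the slides.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Column-stochastic matrix for the example graph; rows and columns ordered y, a, m.
M = [
    [half, half, 0],
    [half, 0, 1],
    [0, half, 0],
]
# Solution of the flow equations with r_y + r_a + r_m = 1.
r = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(entry * value for entry, value in zip(row, vector))
            for row in matrix]


# r is an eigenvector with eigenvalue 1: M r equals r exactly.
is_eigenvector = mat_vec(M, r) == r

# Any nonnegative vector with L1 norm 1 keeps L1 norm 1 under M,
# because every column of M sums to one.
x = [Fraction(1), Fraction(0), Fraction(0)]
l1_norm_after = sum(abs(v) for v in mat_vec(M, x))

print(is_eigenvector, l1_norm_after)
```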
Example: $r = M \cdot r$ for the graph with pages y, a, m
Matrix $M$ (column = source page, row = destination page):
       y    a    m
  y    ½    ½    0
  a    ½    0    1
  m    0    ½    0
- $r_y = r_y/2 + r_a/2$
- $r_a = r_y/2 + r_m$
- $r_m = r_a/2$
Given a web graph with $N$ nodes, where the nodes are pages and edges are hyperlinks
Power iteration: a simple iterative scheme
- Suppose there are $N$ web pages
- Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
- Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} r_i^{(t)} / d_i$, where $d_i$ is the out-degree of node $i$
- Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
  - $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm
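The scheme above can be sketched in a few lines of plain Python. This is a minimal illustration, not a web-scale implementation: it uses a dense matrix, and the default values of `epsilon` and `max_iterations` are assumptions chosen for the example, not values from the slides.

```python
def power_iteration(M, epsilon=1e-10, max_iterations=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below epsilon."""
    n = len(M)
    r = [1.0 / n] * n  # r^(0) = [1/N, ..., 1/N]
    for _ in range(max_iterations):
        # r_next[j] = sum over i of M[j][i] * r[i]
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        # L1 norm of the difference between successive iterates.
        l1_change = sum(abs(new - old) for new, old in zip(r_next, r))
        r = r_next
        if l1_change < epsilon:
            break
    return r


# Column-stochastic matrix of the three-page example (order: y, a, m).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
ranks = power_iteration(M)
print(ranks)  # approximately [0.4, 0.4, 0.2]
```

On this example the iterates settle near (2/5, 2/5, 1/5), which matches the exact solution of the flow equations.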
Power iteration on the example graph with pages y, a, m
Matrix $M$:
       y    a    m
  y    ½    ½    0
  a    ½    0    1
  m    0    ½    0
Power iteration:
- Set $r_j = 1/N$
- 1: $r'_j = \sum_{i \to j} r_i / d_i$
- 2: $r = r'$
- Goto 1
Example (iteration 0, 1, 2, ...):
         t=0    t=1    t=2    t=3     ...   limit
  r_y    1/3    1/3    5/12   9/24    ...   6/15
  r_a    1/3    3/6    1/3    11/24   ...   6/15
  r_m    1/3    1/6    3/12   1/6     ...   3/15
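The first few iterates in the example can be reproduced exactly with rational arithmetic. This is a small sketch under the same setup (pages ordered y, a, m, uniform start). Note that Python's `Fraction` prints values in lowest terms, so 3/6 appears as 1/2, 3/12 as 1/4, and 9/24 as 3/8.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Column-stochastic matrix of the example graph; rows and columns ordered y, a, m.
M = [
    [half, half, 0],
    [half, 0, 1],
    [0, half, 0],
]

r = [Fraction(1, 3)] * 3  # iteration 0: uniform vector
history = [r]
for _ in range(3):
    # One power-iteration step: r'_j = sum over i of M[j][i] * r[i]
    r = [sum(M[j][i] * r[i] for i in range(3)) for j in range(3)]
    history.append(r)

for t, vector in enumerate(history):
    print(t, [str(value) for value in vector])
```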
Power iteration: A method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue)
- $r^{(1)} = M \cdot r^{(0)}$
- $r^{(2)} = M \cdot r^{(1)} = M (M r^{(0)}) = M^2 \cdot r^{(0)}$
- $r^{(3)} = M \cdot r^{(2)} = M (M^2 r^{(0)}) = M^3 \cdot r^{(0)}$
- Claim: The sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \ldots, M^k \cdot r^{(0)}, \ldots$ approaches the dominant eigenvector of $M$