Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
- High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
- Graph data: PageRank, SimRank; Community Detection; Spam Detection
- Infinite data: Filtering data streams, Web advertising, Queries on streams
- Machine learning: SVM; Decision Trees; Perceptron, kNN
- Apps: Recommender systems, Association Rules, Duplicate document detection

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Facebook social graph: 4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
Connections between political blogs: Polarization of the network [Adamic-Glance, 2005]
Citation networks and Maps of science [Börner et al., 2012]
[Figure: the Internet as a graph, with routers connecting domain1, domain2, and domain3]
Seven Bridges of Königsberg [Euler, 1735]: Return to the starting point by traveling each link of the graph once and only once.
- Web as a directed graph:
  - Nodes: Webpages
  - Edges: Hyperlinks

[Figure: example pages connected by hyperlinks: "I teach a class on Networks." / "CS224W: Classes are in the Gates building" / "Computer Science Department at Stanford" / "Stanford University"]
- How to organize the Web?
- First try: Human curated Web directories
  - Yahoo, DMOZ, LookSmart
- Second try: Web Search
  - Information Retrieval investigates: Find relevant docs in a small and trusted set
    - Newspaper articles, Patents, etc.
  - But: Web is huge, full of untrusted documents, random things, web spam, etc.
2 challenges of web search:
- (1) Web contains many sources of information. Who to “trust”?
  - Trick: Trustworthy pages may point to each other!
- (2) What is the “best” answer to query “newspaper”?
  - No single right answer
  - Trick: Pages that actually know about newspapers might all be pointing to many newspapers
- Not all web pages are equally “important”: www.joe-schmoe.com vs. www.stanford.edu
- There is large diversity in the web-graph node connectivity. Let’s rank the pages by the link structure!
- We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
  - PageRank
  - Topic-Specific (Personalized) PageRank
  - Web Spam Detection Algorithms
- Idea: Links as votes
  - Page is more important if it has more links
    - In-coming links? Out-going links?
- Think of in-links as votes:
  - www.stanford.edu has 23,400 in-links
  - www.joe-schmoe.com has 1 in-link
- Are all in-links equal?
  - Links from important pages count more
  - Recursive question!
[Figure: example PageRank scores on a small graph. A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9; five remaining unlabeled nodes: 1.6 each]
- Each link’s vote is proportional to the importance of its source page
- If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
- Page $j$’s own importance is the sum of the votes on its in-links

[Figure: page $i$ (3 out-links) and page $k$ (4 out-links) both link to page $j$; page $j$ has 3 out-links, each carrying $r_j/3$]

$$r_j = r_i/3 + r_k/4$$
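The voting rule above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `importance_from_in_links` and the importance values for pages i and k are hypothetical, chosen just to make the arithmetic visible.

```python
def importance_from_in_links(in_links):
    """Sum the votes arriving on a page's in-links.

    in_links: list of (source_importance, source_out_degree) pairs.
    Each source splits its importance evenly over its out-links.
    """
    return sum(r / d for r, d in in_links)


# Hypothetical importances for the two source pages (not from the slides).
r_i, r_k = 0.3, 0.4

# Page i has 3 out-links and page k has 4, as in the figure.
r_j = importance_from_in_links([(r_i, 3), (r_k, 4)])
print(r_j)
```

With these assumed values, each of the two in-links contributes about 0.1, so page j ends up with an importance of about 0.2.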
- A “vote” from an important page is worth more
- A page is important if it is pointed to by other important pages
- Define a “rank” $r_j$ for page $j$:

$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$

  where $d_i$ is the out-degree of node $i$

[Figure: “The web in 1839”, a graph on three pages y, a, m. Page y links to itself and to a (y/2 each), page a links to y and to m (a/2 each), page m links to a (m).]

“Flow” equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
Flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

- 3 equations, 3 unknowns, no constants
  - No unique solution
  - All solutions equivalent modulo the scale factor
- Additional constraint forces uniqueness:
  - $r_y + r_a + r_m = 1$
  - Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
- Gaussian elimination method works for small examples, but we need a better method for large web-size graphs
- We need a new formulation!
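For a graph this small, the flow equations plus the normalization constraint can be solved directly. The sketch below assumes NumPy is available and uses `np.linalg.solve` (an LU-based direct solver) as a stand-in for hand Gaussian elimination. Because the three flow equations are linearly dependent, one of them is replaced by the constraint that the ranks sum to 1.

```python
import numpy as np

# Unknowns are ordered (r_y, r_a, r_m).
# Flow equations moved to homogeneous form:
#   r_y = r_y/2 + r_a/2   ->   r_y/2 - r_a/2        = 0
#   r_a = r_y/2 + r_m     ->  -r_y/2 + r_a   - r_m  = 0
# The third flow equation is redundant, so it is replaced
# by the constraint r_y + r_a + r_m = 1.
A = np.array([
    [0.5, -0.5, 0.0],
    [-0.5, 1.0, -1.0],
    [1.0, 1.0, 1.0],
])
b = np.array([0.0, 0.0, 1.0])

r = np.linalg.solve(A, b)
print(dict(zip(["r_y", "r_a", "r_m"], r)))
```

The result agrees with the solution on the slide (2/5, 2/5, 1/5) up to floating-point error. A dense direct solve costs on the order of N^3 operations, which is why it does not scale to web-size graphs.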
- Stochastic adjacency matrix $M$
  - Let page $i$ have $d_i$ out-links
  - If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
  - $M$ is a column stochastic matrix
    - Columns sum to 1
- Rank vector $r$: vector with an entry per page
  - $r_i$ is the importance score of page $i$
  - $\sum_i r_i = 1$
- The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written

$$r = M \cdot r$$
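As a rough sketch of the definition above, the matrix M can be built from an out-link list. The helper name `column_stochastic_matrix` and the dictionary representation are assumptions made for illustration. The sketch also assumes every page has at least one out-link and that no target is listed twice.

```python
import numpy as np


def column_stochastic_matrix(out_links, pages):
    """Build M with M[j, i] = 1/d_i if page i links to page j, else 0.

    out_links: dict mapping each page to the list of pages it links to.
    pages: list fixing the row/column order of the matrix.
    """
    index = {page: k for k, page in enumerate(pages)}
    M = np.zeros((len(pages), len(pages)))
    for i, targets in out_links.items():
        for j in targets:
            # Page i splits its vote evenly over its d_i out-links.
            M[index[j], index[i]] = 1.0 / len(targets)
    return M


# The three-page example graph (y, a, m) used in these slides.
pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = column_stochastic_matrix(out_links, pages)
print(M)
print(M.sum(axis=0))  # column sums
```

Each column of the resulting matrix sums to 1, which is the column-stochastic property stated above.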
- Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
- Flow equation in the matrix form: $M \cdot r = r$
  - Suppose page $i$ links to 3 pages, including $j$

[Figure: in the product $M \cdot r = r$, column $i$ of $M$ has the entry $1/3$ in row $j$, so $r_i$ contributes $r_i/3$ to $r_j$]
- The flow equations can be written $r = M \cdot r$
- So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$
  - In fact, its first or principal eigenvector, with corresponding eigenvalue 1
  - Largest eigenvalue of $M$ is 1 since $M$ is column stochastic (with non-negative entries)
    - We know $r$ is unit length (in the $L_1$ norm) and each column of $M$ sums to one, so $|M r|_1 \le 1$
- NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if: $A x = \lambda x$
- We can now efficiently solve for $r$! The method is called Power iteration
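The eigenvector claim can be checked numerically on the three-page example (y, a, m). This is a sketch that assumes NumPy and uses its general dense eigensolver `np.linalg.eig`. It is meant as a sanity check on a tiny matrix, not as a method for web-scale graphs.

```python
import numpy as np

# Column-stochastic matrix of the three-page example, order (y, a, m).
M = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
])

eigenvalues, eigenvectors = np.linalg.eig(M)

# Pick the eigenvalue with the largest real part (expected to be 1).
k = np.argmax(eigenvalues.real)
principal_value = eigenvalues[k].real
principal_vector = eigenvectors[:, k].real

# Eigenvectors are only defined up to scale (and sign); rescale so that
# the entries sum to 1, matching the rank-vector convention.
r = principal_vector / principal_vector.sum()

print(principal_value)
print(r)
```

Dividing by the sum of the entries fixes both the scale and the sign of the vector returned by the solver, so the result can be compared directly with the solution of the flow equations.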
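Matrix $M$ for the example graph (column = source page, row = destination page):

      y     a     m
y    1/2   1/2    0
a    1/2    0     1
m     0    1/2    0

$r = M \cdot r$:

$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$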
- Given a web graph with $N$ nodes, where the nodes are pages and edges are hyperlinks
- Power iteration: a simple iterative scheme
  - Suppose there are $N$ web pages
  - Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
  - Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, where $d_i$ is the out-degree of node $i$
  - Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
- $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm
- Can use any other vector norm, e.g., Euclidean
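The scheme above translates almost line for line into code. The following is a minimal dense-matrix sketch assuming NumPy. The function name `power_iteration`, the default tolerance, and the `max_iter` safety cap are illustrative choices, not part of the slides.

```python
import numpy as np


def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change < eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)           # r^(0) = [1/N, ..., 1/N]^T
    for _ in range(max_iter):         # max_iter is a safety cap (assumption)
        r_next = M @ r                # r^(t+1) = M r^(t)
        if np.abs(r_next - r).sum() < eps:   # L1 norm of the change
            return r_next
        r = r_next
    return r


# Column-stochastic matrix of the three-page example, order (y, a, m).
M = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
])
r = power_iteration(M)
print(r)
```

On this example the iteration settles near (2/5, 2/5, 1/5). For a real web graph M would be stored sparsely, since each page links to only a tiny fraction of all pages.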
- Power Iteration:
  - Set $r_j = 1/N$
  - 1: $r'_j = \sum_{i \to j} \frac{r_i}{d_i}$
  - 2: $r = r'$
  - Goto 1

      y     a     m
y    1/2   1/2    0
a    1/2    0     1
m     0    1/2    0

$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

- Example (values at iteration 0, 1, 2, 3, …, and in the limit):
  - $r_y$: 1/3, 1/3, 5/12, 9/24, …, 6/15
  - $r_a$: 1/3, 3/6, 1/3, 11/24, …, 6/15
  - $r_m$: 1/3, 1/6, 3/12, 1/6, …, 3/15
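The first few iterations of this example can be reproduced exactly with rational arithmetic. The sketch below uses Python's standard `fractions` module. The names `out_links`, `step`, and `history` are illustrative, and the update is written per link to mirror step 1 of the loop above.

```python
from fractions import Fraction

# The three-page example graph: each page maps to the pages it links to.
pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}


def step(r):
    """One update: every page i sends r_i / d_i along each of its out-links."""
    r_new = {page: Fraction(0) for page in pages}
    for i, targets in out_links.items():
        share = r[i] / len(targets)
        for j in targets:
            r_new[j] += share
    return r_new


# Start from the uniform vector r_j = 1/N and record three iterations.
r = {page: Fraction(1, len(pages)) for page in pages}
history = [r]
for _ in range(3):
    r = step(r)
    history.append(r)

for t, r_t in enumerate(history):
    print(t, [str(r_t[page]) for page in pages])
```

`Fraction` always prints values in lowest terms, so the slide's 3/6, 3/12, and 9/24 appear as 1/2, 1/4, and 3/8.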