http://www.mmds.org High dim. High dim. Graph Graph Infinite - PowerPoint PPT Presentation

Note to other teachers and users of these slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org

• High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
• Graph data: PageRank, SimRank; Community Detection; Spam Detection
• Infinite data: Filtering data streams; Web advertising; Queries on streams
• Machine learning: SVM; Decision Trees; Perceptron, kNN
• Apps: Recommender systems; Association Rules; Duplicate document detection
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Facebook social graph: 4 degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

Connections between political blogs: polarization of the network [Adamic-Glance, 2005]

Citation networks and maps of science [Börner et al., 2012]

Internet (figure: routers connecting domain1, domain2 and domain3)

Seven Bridges of Königsberg [Euler, 1735]: Return to the starting point by traveling each link of the graph once and only once.

• Web as a directed graph:
  • Nodes: Webpages
  • Edges: Hyperlinks
Figure: example pages connected by hyperlinks ("I teach a class on Networks.", "CS224W: Classes are in the Gates building", "Computer Science Department at Stanford", "Stanford University")


• How to organize the Web?
• First try: Human curated Web directories
  • Yahoo, DMOZ, LookSmart
• Second try: Web Search
  • Information Retrieval investigates: Find relevant docs in a small and trusted set
    • Newspaper articles, Patents, etc.
  • But: Web is huge, full of untrusted documents, random things, web spam, etc.

2 challenges of web search:
• (1) Web contains many sources of information. Who to “trust”?
  • Trick: Trustworthy pages may point to each other!
• (2) What is the “best” answer to query “newspaper”?
  • No single right answer
  • Trick: Pages that actually know about newspapers might all be pointing to many newspapers

• Not all web pages are equally “important”: www.joe-schmoe.com vs. www.stanford.edu
• There is large diversity in the web-graph node connectivity. Let’s rank the pages by the link structure!

• We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
  • PageRank
  • Topic-Specific (Personalized) PageRank
  • Web Spam Detection Algorithms

• Idea: Links as votes
  • Page is more important if it has more links
    • In-coming links? Out-going links?
• Think of in-links as votes:
  • www.stanford.edu has 23,400 in-links
  • www.joe-schmoe.com has 1 in-link
• Are all in-links equal?
  • Links from important pages count more
  • Recursive question!

Figure: example graph with node scores A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9, and five unlabeled nodes at 1.6 each

• Each link’s vote is proportional to the importance of its source page
• If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
• Page $j$’s own importance is the sum of the votes on its in-links
Figure: page $i$ (3 out-links) and page $k$ (4 out-links) both link to page $j$, so $r_j = r_i/3 + r_k/4$; page $j$ has 3 out-links, each carrying $r_j/3$
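The voting rule above can be sketched in a few lines. The importance values for pages i and k are hypothetical, chosen only to make the arithmetic easy to follow; the out-degrees 3 and 4 come from the figure.

```python
from fractions import Fraction

def vote_per_link(importance, out_degree):
    """Each out-link of a page carries importance / out_degree votes."""
    return importance / out_degree

# Hypothetical importances (assumptions for illustration, not from the slides).
r_i = Fraction(3, 10)  # page i has 3 out-links, one of them to j
r_k = Fraction(2, 5)   # page k has 4 out-links, one of them to j

# Page j's importance is the sum of the votes on its in-links:
# r_j = r_i/3 + r_k/4
r_j = vote_per_link(r_i, 3) + vote_per_link(r_k, 4)
print(r_j)
```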

• A “vote” from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a “rank” $r_j$ for page $j$:
  $$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
  where $d_i$ is the out-degree of node $i$
Figure (“The web in 1839”): pages y, a, m; y links to itself and to a (y/2 each), a links to y and to m (a/2 each), m links to a
“Flow” equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$

Flow equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$
• 3 equations, 3 unknowns, no constants
  • No unique solution
  • All solutions equivalent modulo the scale factor
• Additional constraint forces uniqueness:
  • $r_y + r_a + r_m = 1$
  • Solution: $r_y = \frac{2}{5},\; r_a = \frac{2}{5},\; r_m = \frac{1}{5}$
• Gaussian elimination method works for small examples, but we need a better method for large web-size graphs
• We need a new formulation!
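As a small illustration of the Gaussian elimination approach, the sketch below solves the y/a/m flow equations exactly. It assumes one (redundant) flow equation is replaced by the normalization constraint that the three ranks sum to 1; the `solve` helper is a hypothetical name written for this example, not part of the slides.

```python
from fractions import Fraction as F

def solve(A, b):
    """Gauss-Jordan elimination with exact fractions.

    Assumes A is square and nonsingular.
    """
    n = len(A)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and swap it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# Unknowns ordered (r_y, r_a, r_m).
A = [
    [F(1, 2), F(-1, 2), F(0)],   # r_y - r_y/2 - r_a/2 = 0
    [F(-1, 2), F(1), F(-1)],     # r_a - r_y/2 - r_m   = 0
    [F(1), F(1), F(1)],          # r_y + r_a + r_m     = 1 (replaces 3rd flow eq.)
]
b = [F(0), F(0), F(1)]

r_y, r_a, r_m = solve(A, b)
print(r_y, r_a, r_m)
```

Elimination costs on the order of n³ operations for n pages, which is why it only suits small examples like this one.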

• Stochastic adjacency matrix $M$
  • Let page $i$ have $d_i$ out-links
  • If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
  • $M$ is a column stochastic matrix
    • Columns sum to 1
• Rank vector $r$: vector with an entry per page
  • $r_i$ is the importance score of page $i$
  • $\sum_i r_i = 1$
• The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written $r = M \cdot r$
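A minimal sketch of building the column stochastic matrix from out-link lists, using the three-page y/a/m graph implied by the flow equations. The function name and the dictionary representation are assumptions for illustration; the sketch also assumes every page has at least one out-link and no duplicate links.

```python
from fractions import Fraction

def stochastic_matrix(out_links, pages):
    """Return M with M[j][i] = 1/d_i if page i links to page j, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for i, targets in out_links.items():
        d_i = len(targets)  # out-degree of page i
        for j in targets:
            M[index[j]][index[i]] = Fraction(1, d_i)
    return M

pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(out_links, pages)

# Every column should sum to 1.
column_sums = [sum(M[j][i] for j in range(len(pages))) for i in range(len(pages))]
print(column_sums)
```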

• Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
• Flow equation in the matrix form: $M \cdot r = r$
• Suppose page $i$ links to 3 pages, including $j$: column $i$ of $M$ then has the entry $M_{ji} = 1/3$, so $r_i$ contributes $r_i/3$ to $r_j$ in the product $M \cdot r$

• The flow equations can be written $r = M \cdot r$
• So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$
  • In fact, its first or principal eigenvector, with corresponding eigenvalue 1
  • Largest eigenvalue of $M$ is 1 since $M$ is column stochastic (with non-negative entries)
    • We know $r$ is unit length (its entries sum to 1) and each column of $M$ sums to one, so $\|M r\|_1 \le 1$
• NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if $A x = \lambda x$
• We can now efficiently solve for $r$! The method is called Power iteration
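The eigenvector claim can be checked directly on the y/a/m example used in these slides. The sketch below assumes the matrix from the flow equations (rows and columns ordered y, a, m) and the candidate vector (6/15, 6/15, 3/15) = (2/5, 2/5, 1/5); `matvec` is a helper name introduced here.

```python
from fractions import Fraction as F

# Column stochastic matrix for the y/a/m graph, order (y, a, m).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1)],
    [F(0), F(1, 2), F(0)],
]
r = [F(2, 5), F(2, 5), F(1, 5)]

def matvec(M, x):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

Mr = matvec(M, r)
# M r == 1 * r, so r is an eigenvector with eigenvalue 1.
print(Mr == r)
```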

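Example (rows and columns ordered y, a, m):
$$M = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & 0 & 1 \\ 0 & \tfrac{1}{2} & 0 \end{bmatrix}$$
$r = M \cdot r$:
$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & 0 & 1 \\ 0 & \tfrac{1}{2} & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$
which is the same as the flow equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$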

• Given a web graph with $N$ nodes, where the nodes are pages and edges are hyperlinks
• Power iteration: a simple iterative scheme
  • Suppose there are $N$ web pages
  • Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
  • Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, where $d_i$ is the out-degree of node $i$
  • Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
• $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm. Any other vector norm can be used, e.g., Euclidean
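The scheme above can be sketched as a short function. The default tolerance `eps` and the `max_iter` safety cap are assumptions added for this example (the slides only specify the stopping rule), and the dense list-of-lists matrix is only practical for small graphs.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps.

    M is a column stochastic matrix given as a list of rows.
    max_iter is a safety cap (an assumption, not part of the slides).
    """
    n = len(M)
    r = [1.0 / n] * n  # r^(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        # r_j^(t+1) = sum over i of M[j][i] * r_i^(t)
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        # Stop when the L1 norm of the change is below eps.
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# y/a/m example, rows and columns ordered (y, a, m).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
ranks = power_iteration(M)
print([round(x, 6) for x in ranks])
```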

• Power Iteration:
  • Set $r_j = 1/N$
  • 1: $r'_j = \sum_{i \to j} \frac{r_i}{d_i}$
  • 2: $r = r'$
  • Goto 1
• Matrix $M$ (rows and columns ordered y, a, m), with flow equations $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$:
$$M = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & 0 & 1 \\ 0 & \tfrac{1}{2} & 0 \end{bmatrix}$$
• Example, iterations 0, 1, 2, 3, … and the limit:
  $r_y$: 1/3, 1/3, 5/12, 9/24, …, 6/15
  $r_a$: 1/3, 3/6, 1/3, 11/24, …, 6/15
  $r_m$: 1/3, 1/6, 3/12, 1/6, …, 3/15
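The first few iterations of the example can be reproduced with exact arithmetic. This is a sketch using Python's `Fraction` type, so values print in lowest terms (for instance 1/2 rather than 3/6, and 3/8 rather than 9/24); `step` is a helper name introduced here.

```python
from fractions import Fraction as F

# Column stochastic matrix for the y/a/m graph, order (y, a, m).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1)],
    [F(0), F(1, 2), F(0)],
]

def step(M, r):
    """One power-iteration step: r' = M r."""
    return [sum(m * v for m, v in zip(row, r)) for row in M]

r = [F(1, 3)] * 3          # iteration 0: uniform vector
history = [r]
for _ in range(3):         # iterations 1, 2, 3
    r = step(M, r)
    history.append(r)

for t, vec in enumerate(history):
    print(t, [str(v) for v in vec])
```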

