
Data Mining and Machine Learning: Fundamental Concepts and Algorithms - PowerPoint PPT Presentation

  1. Data Mining and Machine Learning: Fundamental Concepts and Algorithms. dataminingbook.info. Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA) and Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil). Chapter 4: Graph Data.

  2. Graphs

A graph G = (V, E) comprises a finite nonempty set V of vertices or nodes, and a set E ⊆ V × V of edges consisting of unordered pairs of vertices. The number of nodes in the graph G, given as |V| = n, is called the order of the graph, and the number of edges, given as |E| = m, is called the size of G.

A directed graph or digraph has an edge set E consisting of ordered pairs of vertices. A weighted graph consists of a graph together with a weight w_ij for each edge (v_i, v_j) ∈ E.

A graph H = (V_H, E_H) is called a subgraph of G = (V, E) if V_H ⊆ V and E_H ⊆ E.
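
As a concrete illustration (not from the slides), here is a minimal Python sketch of these definitions on a small hypothetical vertex and edge set: order, size, and one way of forming a subgraph.

```python
# Sketch: an undirected graph G = (V, E) as a set of vertices and a set
# of unordered edges (stored as frozensets so order does not matter).
V = {1, 2, 3, 4, 5}                                    # hypothetical vertex set
E = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]}

n = len(V)   # order of the graph, |V|
m = len(E)   # size of the graph,  |E|
print(f"order n = {n}, size m = {m}")                  # order n = 5, size m = 5

# One way to form a subgraph H = (V_H, E_H): keep the edges of E whose
# endpoints both lie in V_H (this particular choice is the induced subgraph).
V_H = {1, 2, 3}
E_H = {e for e in E if e <= V_H}                       # frozenset <= tests subset
```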

  3. Undirected and Directed Graphs

[Figure: an example undirected graph and an example directed graph on vertices v_1 through v_8.]

  4. Degree Distribution

The degree of a node v_i ∈ V is the number of edges incident with it, and is denoted as d(v_i) or just d_i. The degree sequence of a graph is the list of the degrees of the nodes sorted in non-increasing order.

Let N_k denote the number of vertices with degree k. The degree frequency distribution of a graph is given as (N_0, N_1, ..., N_t), where t is the maximum degree for a node in G.

Let X be a random variable denoting the degree of a node. The degree distribution of a graph gives the probability mass function f for X, given as (f(0), f(1), ..., f(t)), where f(k) = P(X = k) = N_k / n is the probability of a node with degree k.
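
A short Python sketch (illustrative, with a made-up adjacency list) that computes the degree sequence, the degree frequency distribution (N_0, ..., N_t), and the degree distribution f(k) = N_k / n as defined above:

```python
from collections import Counter

# Hypothetical undirected graph given as an adjacency list.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}

n = len(adj)
degrees = {v: len(nbrs) for v, nbrs in adj.items()}

# Degree sequence: degrees sorted in non-increasing order.
degree_sequence = sorted(degrees.values(), reverse=True)    # [3, 2, 2, 1]

# Degree frequency distribution (N_0, N_1, ..., N_t).
t = max(degrees.values())
N = Counter(degrees.values())
freq = [N.get(k, 0) for k in range(t + 1)]                  # [0, 1, 2, 1]

# Degree distribution: f(k) = P(X = k) = N_k / n.
f = [N_k / n for N_k in freq]                               # [0.0, 0.25, 0.5, 0.25]
```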

  5. Degree Distribution

[Figure: the example graph on vertices v_1 through v_8.]

The degree sequence of the graph is (4, 4, 4, 3, 2, 2, 2, 1).

Its degree frequency distribution is (N_0, N_1, N_2, N_3, N_4) = (0, 1, 3, 1, 3).

The degree distribution is given as (f(0), f(1), f(2), f(3), f(4)) = (0, 0.125, 0.375, 0.125, 0.375).

  6. Path, Distance and Connectedness

A walk in a graph G between nodes x and y is an ordered sequence of vertices, starting at x and ending at y, x = v_0, v_1, ..., v_{t-1}, v_t = y, such that there is an edge between every pair of consecutive vertices, that is, (v_{i-1}, v_i) ∈ E for all i = 1, 2, ..., t. The length of the walk, t, is measured in hops, that is, the number of edges along the walk.

A path is a walk with distinct vertices (with the exception of the start and end vertices). A path of minimum length between nodes x and y is called a shortest path, and the length of the shortest path is called the distance between x and y, denoted d(x, y). If no path exists between the two nodes, the distance is assumed to be d(x, y) = ∞.

Two nodes v_i and v_j are connected if there exists a path between them. A graph is connected if there is a path between all pairs of vertices. A connected component, or just component, of a graph is a maximal connected subgraph. A directed graph is strongly connected if there is a (directed) path between all ordered pairs of vertices. It is weakly connected if there exists a path between node pairs only when the edges are treated as undirected.
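
In an unweighted graph, the distance d(x, y) can be computed by breadth-first search. The sketch below (not from the slides, using a hypothetical adjacency list) returns infinity for unreachable nodes, which also gives a simple connectedness test:

```python
from collections import deque
import math

def bfs_distances(adj, source):
    """Hop distances d(source, y) in an unweighted graph given as an
    adjacency list; unreachable nodes keep distance infinity."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] == math.inf:          # first time we reach w
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

# Two nodes are connected iff their distance is finite; the graph is
# connected iff every node is reachable from an arbitrary start node.
adj = {1: [2], 2: [1, 3], 3: [2], 4: []}     # hypothetical graph, node 4 isolated
d = bfs_distances(adj, 1)                    # {1: 0, 2: 1, 3: 2, 4: inf}
is_connected = all(v < math.inf for v in d.values())   # False
```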

  7. Adjacency Matrix

A graph G = (V, E), with |V| = n vertices, can be represented as an n × n symmetric binary adjacency matrix A, defined as

A(i, j) = 1 if v_i is adjacent to v_j, and 0 otherwise.

If the graph is directed, then the adjacency matrix A is not symmetric. If the graph is weighted, then we obtain an n × n weighted adjacency matrix A, defined as

A(i, j) = w_ij if v_i is adjacent to v_j, and 0 otherwise,

where w_ij is the weight on edge (v_i, v_j) ∈ E.
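
A small numpy sketch (illustrative edge list and weights) that builds both the binary adjacency matrix A and a weighted adjacency matrix for an undirected graph:

```python
import numpy as np

# Hypothetical undirected graph with 0-indexed vertices.
n = 4
edges = [(0, 1), (0, 2), (2, 3)]
weights = {(0, 1): 0.9, (0, 2): 0.4, (2, 3): 0.7}

A = np.zeros((n, n))                         # binary adjacency matrix
W = np.zeros((n, n))                         # weighted adjacency matrix
for i, j in edges:
    A[i, j] = A[j, i] = 1                    # symmetric: graph is undirected
    W[i, j] = W[j, i] = weights[(i, j)]

# For a directed graph, set only A[i, j] = 1; A is then no longer symmetric.
```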

  8. Graphs from Data Matrix

Many datasets that are not in the form of a graph can still be converted into one. Let D = {x_i}, i = 1, ..., n (with x_i ∈ R^d), be a dataset. Define a weighted graph G = (V, E) with edge weight w_ij = sim(x_i, x_j), where sim(x_i, x_j) denotes the similarity between points x_i and x_j. For instance, using the Gaussian similarity,

w_ij = sim(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) )

where σ is the spread parameter.
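
A possible implementation of this construction is sketched below in Python/numpy; the threshold 0.777 and σ = 1/√2 mirror the Iris graph on the next slide, but the data here is random and purely illustrative:

```python
import numpy as np

def gaussian_similarity_graph(X, sigma, threshold):
    """Weighted adjacency matrix W with w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)),
    keeping an edge only when w_ij >= threshold (0 on the diagonal)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                  # no self-loops
    W[W < threshold] = 0.0                    # drop weak edges
    return W

# Usage with random data standing in for a real dataset D = {x_i}.
X = np.random.default_rng(0).normal(size=(10, 4))
W = gaussian_similarity_graph(X, sigma=1 / np.sqrt(2), threshold=0.777)
m = int(np.count_nonzero(W) / 2)              # number of undirected edges kept
```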

  9. Iris Similarity Graph

[Figure: Iris similarity graph built with Gaussian similarity, σ = 1/√2; an edge exists iff w_ij ≥ 0.777. Order: |V| = n = 150; size: |E| = m = 753.]

  10. Topological Graph Attributes

Graph attributes are local if they apply to only a single node (or an edge), and global if they refer to the entire graph.

Degree: The degree of a node v_i ∈ G is defined as d_i = Σ_j A(i, j). The corresponding global attribute for the entire graph G is the average degree: μ_d = ( Σ_i d_i ) / n.

Average Path Length: The average path length is given as

μ_L = ( Σ_i Σ_{j>i} d(v_i, v_j) ) / (n choose 2) = ( 2 / (n(n − 1)) ) Σ_i Σ_{j>i} d(v_i, v_j).
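
Both attributes can be computed directly from their definitions; the sketch below (hypothetical small graph, pure Python) gets the hop distances from all-pairs breadth-first search:

```python
from collections import deque

def all_pairs_hops(adj):
    """All-pairs hop distances via BFS from every node (unweighted graph)."""
    dist = {}
    for s in adj:
        d = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in d:
                    d[w] = d[u] + 1
                    q.append(w)
        dist[s] = d
    return dist

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # hypothetical connected graph
n = len(adj)

# Average degree: mu_d = (sum_i d_i) / n.
mu_d = sum(len(nbrs) for nbrs in adj.values()) / n            # 2.0

# Average path length: mu_L = (2 / (n (n - 1))) * sum_{i < j} d(v_i, v_j).
dist = all_pairs_hops(adj)
total = sum(dist[i][j] for i in adj for j in adj if j > i)
mu_L = 2 * total / (n * (n - 1))
```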

  11. Iris Graph: Degree Distribution

[Figure: degree distribution f(k) of the Iris similarity graph, plotted against degree k (k ranging from 0 to 43).]

  12. Iris Graph: Path Length Histogram

[Figure: histogram of shortest-path lengths (k = 1 to 11) in the Iris similarity graph.]

  13. Eccentricity, Radius and Diameter

The eccentricity of a node v_i is the maximum distance from v_i to any other node in the graph: e(v_i) = max_j { d(v_i, v_j) }.

The radius of a connected graph, denoted r(G), is the minimum eccentricity of any node in the graph: r(G) = min_i { e(v_i) } = min_i { max_j { d(v_i, v_j) } }.

The diameter, denoted d(G), is the maximum eccentricity of any vertex in the graph: d(G) = max_i { e(v_i) } = max_{i,j} { d(v_i, v_j) }.

For a disconnected graph, these values are computed over the connected components of the graph. The diameter of a graph G is sensitive to outliers. The effective diameter is more robust; it is defined as the minimum number of hops for which a large fraction, typically 90%, of all connected pairs of nodes can reach each other.
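
Given all-pairs hop distances (for example from the all_pairs_hops helper sketched earlier), the eccentricities, radius, diameter, and a 90% effective diameter can be computed as below; this is an illustrative sketch that assumes a connected graph:

```python
import math

def eccentricities(dist):
    """e(v_i): maximum hop distance from v_i to any other node."""
    return {v: max(d_v.values()) for v, d_v in dist.items()}

def radius_and_diameter(dist):
    """r(G) is the minimum eccentricity, d(G) the maximum eccentricity."""
    ecc = eccentricities(dist)
    return min(ecc.values()), max(ecc.values())

def effective_diameter(dist, fraction=0.9):
    """Smallest hop count h such that at least `fraction` of all
    (ordered) node pairs are within h hops of each other."""
    pair_d = sorted(d for v, d_v in dist.items()
                    for u, d in d_v.items() if u != v)
    k = math.ceil(fraction * len(pair_d))     # pairs that must be covered
    return pair_d[k - 1]
```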

  14. Clustering Coefficient

The clustering coefficient of a node v_i is a measure of the density of edges in the neighborhood of v_i. Let G_i = (V_i, E_i) be the subgraph induced by the neighbors of vertex v_i. Note that v_i ∉ V_i, as we assume that G is simple. Let |V_i| = n_i be the number of neighbors of v_i, and |E_i| = m_i be the number of edges among the neighbors of v_i. The clustering coefficient of v_i is defined as

C(v_i) = (number of edges in G_i) / (maximum number of edges in G_i) = m_i / (n_i choose 2) = 2·m_i / (n_i (n_i − 1)).

The clustering coefficient of a graph G is simply the average clustering coefficient over all the nodes, given as

C(G) = (1/n) Σ_i C(v_i).

C(v_i) is well defined only for nodes with degree d(v_i) ≥ 2; thus we define C(v_i) = 0 if d_i < 2.
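
A direct Python sketch of C(v_i) = 2·m_i / (n_i (n_i − 1)) and the graph average C(G), on a made-up adjacency list:

```python
# Hypothetical simple undirected graph given as an adjacency list of sets.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}

def clustering_coefficient(adj, v):
    nbrs = adj[v]
    n_i = len(nbrs)
    if n_i < 2:                        # C(v_i) defined as 0 when d_i < 2
        return 0.0
    # m_i: number of edges among the neighbors of v (each counted once).
    m_i = sum(1 for u in nbrs for w in adj[u] if w in nbrs and u < w)
    return 2 * m_i / (n_i * (n_i - 1))

C = {v: clustering_coefficient(adj, v) for v in adj}
C_G = sum(C.values()) / len(adj)       # average clustering coefficient C(G)
# C == {0: 1/3, 1: 1.0, 2: 1.0, 3: 0.0}, so C_G == (1/3 + 1 + 1 + 0) / 4 = 0.5833...
```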
