Chapter 8: Graph Mining
Jilles Vreeken
Revision 1, December 4th (typos fixed: edge order)
IRDM ‘15/16, 1 Dec 2015
IRDM Chapter 8, overview
1. The basics
2. Properties of Graphs
3. Frequent Subgraphs
4. Community Detection
5. Graph Clustering
You’ll find this covered in: Aggarwal, Ch. 17, 19; Zaki & Meira, Ch. 4, 11, 16
Chapter 7.1: The Basics
Aggarwal Ch. 17.1
Networks are everywhere!
[Figures: Facebook Network [2010]; Gene Regulatory Network [Decourty 2008]; Human Disease Network [Barabasi 2007]; The Internet [2005]]
The Internet
- Skewed degrees
- Robustness
High school dating network
[Figure: blue = male, pink = female]
Interesting observations?
(Bearman et al., Am. Jnl. of Sociology, 2004. Image: Mark Newman)
Karate club network
Friends
How many of you think that your friends have more friends than you?
A recent Facebook study examined all of FB’s users:
- 721 million people with 69 billion friendships, about 10% of the world’s population
- it found that 93 percent of the time a user’s friend count was less than the average friend count of his or her friends
- users had an average of 190 friends, while their friends averaged 635 friends of their own
Reasons?
- You are a loner?
- Your friends are extraverts?
- There are more extraverts than introverts in the world?
Example (four people with 1, 3, 2, and 2 friends)
Average number of friends?
$(1 + 3 + 2 + 2)/4 = 2$
Average number of friends of friends?
$(3 + 1 + 2 + 2 + 3 + 2 + 3 + 2)/8 = (1 \times 1 + 3 \times 3 + 2 \times 2 + 2 \times 2)/8 = 2.25$
(Strogatz, NYT 2012)
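Always true (almost)! Proof?
Let $X$ be the number of friends of a person, with $x_i$ the friend count of person $i$ among $N$ people.
$E[X] = \sum_i x_i / N$
$Var[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2$
The average number of friends of friends is
$\frac{E[X^2]}{E[X]} = E[X] + \frac{Var[X]}{E[X]}$
Essentially, it’s true if there is any spread in the number of friends (i.e. whenever there’s a non-zero variance).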
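As a quick check of the identity above, here is a minimal Python sketch. The four-person graph and the vertex names `a` to `d` are hypothetical; they are chosen only so that the friend counts are 1, 3, 2, and 2, as in the example slide.

```python
# Hypothetical friendship graph (undirected) with friend counts 1, 3, 2, 2.
friends = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"b", "c"},
}

degrees = [len(nbrs) for nbrs in friends.values()]
n = len(degrees)

# Average number of friends, E[X].
mean_friends = sum(degrees) / n

# Average, over all (person, friend) pairs, of the friend's own friend count.
fof_counts = [len(friends[f]) for nbrs in friends.values() for f in nbrs]
mean_friends_of_friends = sum(fof_counts) / len(fof_counts)

# Right-hand side of the identity: E[X] + Var[X] / E[X].
variance = sum(d * d for d in degrees) / n - mean_friends ** 2
identity_rhs = mean_friends + variance / mean_friends

print(mean_friends, mean_friends_of_friends, identity_rhs)
```

Both the direct average and the right-hand side of the identity come out larger than the plain average because the variance of the friend counts is non-zero.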
Why graphs?
Many real-world data sets are in the form of graphs
- social networks
- hyperlinks
- protein–protein interaction
- XML parse trees
- …
Many of these graphs are enormous
- humans cannot understand them → a task for data mining!
What is a graph?
A graph $G$ is a pair $(V, E \subseteq V^2)$
- elements in $V$ are vertices or nodes of the graph
- pairs $(v, u) \in E$ are edges or arcs of the graph
- for undirected graphs pairs are unordered, for directed graphs pairs are ordered
The graphs can be labelled
- vertices can have a labeling $L(v)$
- edges can have a labeling $L(v, u)$
A tree is a rooted, connected, and acyclic graph
Graphs can be represented using adjacency matrices
- a $|V| \times |V|$ matrix $A$ with $A_{ij} = 1$ if $(v_i, v_j) \in E$
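The adjacency-matrix representation can be sketched in a few lines of Python. The helper name `adjacency_matrix`, the integer vertex ids, and the example edge list are assumptions made for illustration only.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 matrix A with A[i][j] = 1 if (i, j) is an edge."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        if not directed:
            # Unordered pairs: store the edge in both directions.
            A[j][i] = 1
    return A

# Hypothetical example graph on vertices 0..3.
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]
A = adjacency_matrix(4, edges)
for row in A:
    print(row)
```

For an undirected graph the matrix is symmetric; passing `directed=True` keeps only the ordered pairs as given.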
Eccentricity, radius, and diameter
The distance $d(v_i, v_j)$ between two vertices is the (weighted) length of the shortest path between them
The eccentricity of a vertex $v_i$, $e(v_i)$, is its maximum distance to any other vertex, $\max_j \{ d(v_i, v_j) \}$
The radius of a connected graph, $r(G)$, is the minimum eccentricity of any vertex, $\min_i \{ e(v_i) \}$
The diameter of a connected graph, $d(G)$, is the maximum eccentricity of any vertex, $\max_i \{ e(v_i) \} = \max_{i,j} \{ d(v_i, v_j) \}$
- the effective diameter of a graph is the smallest number that is larger than the eccentricity of a large fraction of the vertices in the graph
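For an unweighted, connected graph these quantities can be computed with one breadth-first search per vertex. The sketch below assumes an adjacency-list dictionary; the function names and the example graph are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def eccentricities(adj):
    """Eccentricity of each vertex; assumes the graph is connected."""
    return {v: max(bfs_distances(adj, v).values()) for v in adj}

# Hypothetical example graph: edges 0-1, 1-2, 1-3, 2-3.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
ecc = eccentricities(adj)
radius = min(ecc.values())
diameter = max(ecc.values())
print(ecc, radius, diameter)
```

Weighted graphs would need a shortest-path routine such as Dijkstra's algorithm in place of the breadth-first search.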
Clustering Coefficient
The clustering coefficient of vertex $v_i$, $C(v_i)$, tells how clique-like the neighbourhood of $v_i$ is
- let $n_i$ be the number of neighbours of $v_i$ and $m_i$ the number of edges between the neighbours of $v_i$, excluding $v_i$ itself
$C(v_i) = \frac{m_i}{\binom{n_i}{2}} = \frac{2 m_i}{n_i (n_i - 1)}$
- well-defined only for $v_i$ with at least two neighbours
- for others, let $C(v_i) = 0$
The clustering coefficient of the graph is the average clustering coefficient of the vertices: $C(G) = n^{-1} \sum_i C(v_i)$
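A minimal sketch of the clustering coefficient as defined above, assuming an undirected graph stored as a dictionary of neighbour sets. The function names and the example graph are hypothetical.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbours of v that are themselves connected."""
    nbrs = adj[v]
    n_v = len(nbrs)
    if n_v < 2:
        # Not well-defined for fewer than two neighbours; use 0 by convention.
        return 0.0
    m_v = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * m_v / (n_v * (n_v - 1))

def graph_clustering_coefficient(adj):
    """Average of the vertex clustering coefficients."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Hypothetical example graph: edges 0-1, 1-2, 1-3, 2-3.
adj = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
per_vertex = {v: clustering_coefficient(adj, v) for v in adj}
overall = graph_clustering_coefficient(adj)
print(per_vertex, overall)
```

Vertices 2 and 3 sit in a triangle and score 1, vertex 0 has a single neighbour and scores 0, and vertex 1 has only one of its three possible neighbour pairs connected.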
What to do with a graph?
There are many interesting things one can mine from graphs and sets of graphs
- cliques of friends from social networks
- hubs and authorities from link graphs
- who is the centre of Hollywood
- subgraphs that appear frequently in (a set of) graph(s)
- areas with higher intra-connectivity than inter-connectivity
- …
Graph mining is perhaps the most popular topic in contemporary data mining research
- though not necessarily called as such…
Chapter 7.2: Properties of Graphs
Aggarwal Ch. 17.1, 19.2; Zaki & Meira Ch. 4
Centrality
Six degrees of Kevin Bacon
- “Every actor is related to Kevin Bacon by no more than 6 hops”
- Kevin Bacon has acted with many, who have acted with many others, who have acted with many others…
- this makes Kevin Bacon a centre of the co-acting graph
Kevin, however, is not the centre:
- the average distance to him is 2.998, but to Harvey Keitel it is only 2.848
(http://oracleofbacon.org)
Degree and eccentricity centrality
Centrality is a function $c : V \to \mathbb{R}$ inducing a total order on $V$
- the higher the centrality of a vertex, the more important it is
In degree centrality, $c(v_i) = d(v_i)$, the degree of the vertex
In eccentricity centrality, the least eccentric vertex is the most central one, $c(v_i) = \frac{1}{e(v_i)}$
- the least eccentric vertex is central
- the most eccentric vertex is peripheral
Closeness centrality
In closeness centrality the vertex with the least distance to all other vertices is the centre
$c(v_i) = \left( \sum_j d(v_i, v_j) \right)^{-1}$
In eccentricity centrality we aim to minimize the maximum distance
In closeness centrality we aim to minimize the average distance
- this is the distance used to measure the centre of Hollywood
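The contrast between minimizing the maximum distance and minimizing the total distance can be sketched as follows. This assumes an unweighted, connected graph with at least two vertices; the function names and the example graph are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness_centrality(adj):
    """Inverse of the summed distance to all other vertices."""
    return {v: 1.0 / sum(bfs_distances(adj, v).values()) for v in adj}

def eccentricity_centrality(adj):
    """Inverse of the maximum distance to any other vertex."""
    return {v: 1.0 / max(bfs_distances(adj, v).values()) for v in adj}

# Hypothetical example graph: edges 0-1, 1-2, 1-3, 2-3.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
closeness = closeness_centrality(adj)
ecc_centrality = eccentricity_centrality(adj)
centre = max(closeness, key=closeness.get)
print(closeness, ecc_centrality, centre)
```

In this small graph both measures pick vertex 1 as the centre, but closeness also separates vertex 0 from vertices 2 and 3, which eccentricity centrality scores identically.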
Betweenness centrality
Betweenness centrality measures the number of shortest paths that travel through $v_i$
- measures the “monitoring” role of the vertex
- “all roads lead to Rome”
Let $\eta_{jk}$ be the number of shortest paths between $v_j$ and $v_k$, and let $\eta_{jk}(v_i)$ be the number of those that include $v_i$
- let $\gamma_{jk}(v_i) = \eta_{jk}(v_i) / \eta_{jk}$
- betweenness centrality is defined as $c(v_i) = \sum_{j \neq i} \sum_{k \neq i,\, k > j} \gamma_{jk}(v_i)$
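The definition can be evaluated directly on small graphs by counting shortest paths with breadth-first search. The sketch below is a straightforward, unoptimized reading of the formula for unweighted, undirected graphs, not a tuned algorithm; the function names and the example graphs are hypothetical.

```python
from collections import deque

def shortest_path_counts(adj, source):
    """Return (dist, count): hop distance and number of shortest paths from source."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                count[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                # Every shortest path to u extends to a shortest path to w.
                count[w] += count[u]
    return dist, count

def betweenness(adj):
    """Sum over pairs j < k of the fraction of shortest j-k paths through i."""
    info = {v: shortest_path_counts(adj, v) for v in adj}
    nodes = sorted(adj)
    c = {v: 0.0 for v in adj}
    for a, j in enumerate(nodes):
        dist_j, count_j = info[j]
        for k in nodes[a + 1:]:
            if k not in dist_j:
                continue  # j and k are not connected
            for i in nodes:
                if i == j or i == k:
                    continue
                dist_i, count_i = info[i]
                # i lies on a shortest j-k path iff the distances add up.
                if i in dist_j and k in dist_i and dist_j[i] + dist_i[k] == dist_j[k]:
                    c[i] += count_j[i] * count_i[k] / count_j[k]
    return c

# Hypothetical example graphs.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(betweenness(adj))
print(betweenness(cycle))
```

In the first graph every shortest path from vertex 0 to the rest passes through vertex 1, so it alone has non-zero betweenness. In the 4-cycle each opposite pair has two shortest paths, so every vertex receives half a path.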
Prestige
In prestige, the vertex is more central if it has many incoming edges from other vertices of high prestige
- $A$ is the adjacency matrix of the directed graph $G$
- $p$ is an $n$-dimensional vector giving the prestige of the vertices
- $p = A^T p$
- starting from an initial prestige vector $p_0$, we get
$p_k = A^T p_{k-1} = A^T (A^T p_{k-2}) = (A^T)^2 p_{k-2} = (A^T)^3 p_{k-3} = \cdots = (A^T)^k p_0$
Vector $p$ converges to the dominant eigenvector of $A^T$ under some assumptions
(PageRank is based on (normalized) prestige)
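The iteration above is a power iteration, which can be sketched in plain Python. Rescaling the vector to unit length after each step is an assumption added here to keep the numbers bounded. The function name, the iteration count, and the example graph are hypothetical; the graph is strongly connected and aperiodic, which is one setting in which the iteration converges.

```python
def prestige(A, iterations=100):
    """Power iteration p <- A^T p, rescaled to unit Euclidean length each step."""
    n = len(A)
    p = [1.0] * n
    for _ in range(iterations):
        # (A^T p)[v] sums the prestige of all vertices u with an edge u -> v.
        new_p = [sum(A[u][v] * p[u] for u in range(n)) for v in range(n)]
        norm = sum(x * x for x in new_p) ** 0.5
        p = [x / norm for x in new_p]
    return p

# Hypothetical directed graph with edges 0->1, 1->2, 2->0, 2->1.
A = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]
p = prestige(A)
most_prestigious = max(range(len(p)), key=lambda v: p[v])
print(p, most_prestigious)
```

Vertex 1 ends up with the highest prestige because it is the only vertex with two incoming edges.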
Graph properties
Several real-world graphs exhibit certain characteristics
- studying what these are and explaining why they appear is an important area of network research
As data miners, we need to understand the consequences of these characteristics
- finding a result that can be explained merely by one of these characteristics is not interesting
We also want to model graphs with these characteristics
It’s a small world after all
A graph $G$ is said to exhibit the small-world property if its average path length scales logarithmically, $\mu_L \propto \log n$
- six degrees of Kevin Bacon is based on this property
- similarly so for Erdős numbers: how far a mathematician is from the Hungarian combinatorist Paul Erdős
- the radius of a large, connected mathematical co-authorship network (268K authors) is 12 and its diameter 23