Graph Summarisation Jilles Vreeken 10 July 2015
Service Announcement #0 The Case of The Lost Pen -- or – The Case of the Found Pen
Service Announcement #1 Next week, a guest lecture Mining Data that Changes by dr. Pauli Miettinen (MPI-INF)
Service Announcement #2 Exam. Oral. 3rd and 4th of August. Timeslots to be decided. Mail me if you want to participate, and let me know if you have a preferred time/day.
Service Announcement #3 Introduction, Patterns, Correlation and Causation, (Subjective) Interestingness, Graphs, Wrap-up + <ask-me-anything>
Service Announcement #3 <ask-me-anything>? Yes! Prepare questions on anything* you’ve always wanted to ask me. Mail them to me in advance, or have me answer on the spot. * preferably related to TADA, data mining, machine learning, science, the world, etc. Outline: Introduction, Patterns, Correlation and Causation, (Subjective) Interestingness, Graphs, Wrap-up + <ask-us-anything>
Question of the day How can we summarise the main structure of a graph in easily understandable terms?
Graphs Graphs are everywhere ↔ Everything* can be represented as a graph * almost
Graphs, formally We consider graphs $G = (V, E)$ with $V$ the set of $n$ nodes, and $E$ a set of $m$ edges between nodes. In general, nodes can have labels, and edges can have labels and weights and can be directed.
Real world graphs social networks road networks relational databases cellular networks biological networks
Real world graphs the internet
Graphs, formally Today we consider unlabeled, unweighted, undirected graphs. The adjacency matrix $A$ then is an $n \times n$ matrix $A \in \{0,1\}^{n \times n}$ where a cell $a_{i,j} = 1$ iff $(i, j) \in E$ and 0 otherwise. We call the number of edges $d_i$ of a node $i$ its degree.
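To make the definitions concrete, here is a minimal sketch that builds the adjacency matrix of a small undirected graph and reads off the degrees. The function names and the toy edge list are illustrative assumptions, not part of the lecture.

```python
def adjacency_matrix(n, edges):
    """Build the n x n 0/1 adjacency matrix of an undirected, unweighted graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected: the matrix is symmetric
    return A


def degrees(A):
    """Degree d_i of node i is the number of ones in row i."""
    return [sum(row) for row in A]


# Toy graph (assumed for illustration): a triangle 0-1-2 with a pendant node 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
d = degrees(A)
```

Note that the dense matrix needs space quadratic in the number of nodes, which is why the later slides worry about $O(n^2)$.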
Why summarisation? Visualization Guiding attention
Staring at an Adjacency Matrix
Staring at a Hairball I don’t see anything! Nodes: wiki editors. Edges: co-edited.
Example: Wikipedia Controversy Stars: admins, bots, heavy users. Bipartite cores: edit wars (Kiev vs. Kyiv, vandals). Nodes: wiki editors. Edges: co-edited.
Summary Statistics For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs? Average degree. Not very insightful.
Summary Statistics For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs? Degree plots.
Power laws
Summary Statistics For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs? Cluster coefficient (global): How clustered are the nodes in the graph? $C = \frac{\text{number of closed triangles}}{\text{number of connected triplets of vertices}}$. Counting triangles requires matrix multiplication, which takes $O(n^{\omega})$ time where $\omega < 2.376$, but takes $O(n^2)$ space (but fast estimators exist).
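As a small worked example of the global coefficient, the sketch below counts connected triplets and closed ones by enumerating neighbour pairs on an adjacency-list graph. It does not use the matrix-multiplication route mentioned on the slide. It also assumes the common "transitivity" convention in which each triangle is counted once per centre node (three times in total). The function name and the toy graph are assumptions.

```python
from itertools import combinations


def global_clustering(adj):
    """adj: dict node -> set of neighbours (undirected, no self-loops)."""
    closed = 0    # triplets whose two end points are also connected
    triplets = 0  # connected triplets: a centre node plus two distinct neighbours
    for centre, nbrs in adj.items():
        for u, v in combinations(sorted(nbrs), 2):
            triplets += 1
            if v in adj[u]:
                closed += 1
    return closed / triplets if triplets else 0.0


# Toy graph (assumed): triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
C = global_clustering(adj)  # 3 closed triplets out of 5 connected triplets
```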
Summary Statistics For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs? Cluster coefficient (local): How close is the neighborhood of node $i$ to being a clique? $C_i = \frac{2\,|\{(j, k) \in E : j, k \in N_i\}|}{d_i (d_i - 1)}$, which is $O(d_i^2)$ at $O(n^2)$ space.
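The local coefficient can be computed directly from the formula: count the edges among the neighbours of node $i$ and divide by the number of neighbour pairs. This is a minimal sketch. Returning 0 for nodes of degree below 2 is an assumed convention, since the formula is undefined there.

```python
from itertools import combinations


def local_clustering(adj, i):
    """Fraction of pairs of neighbours of i that are themselves connected."""
    nbrs = adj[i]
    d = len(nbrs)
    if d < 2:
        return 0.0  # assumption: the formula is undefined here, so report 0
    links = sum(1 for j, k in combinations(nbrs, 2) if k in adj[j])
    return 2 * links / (d * (d - 1))


# Toy graph (assumed): triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c0 = local_clustering(adj, 0)  # neighbours 1 and 2 are linked
c2 = local_clustering(adj, 2)  # only one of three neighbour pairs is linked
```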
Summary Statistics For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs? Diameter: The longest shortest path between two nodes. Requires calculating all shortest paths. Calculating shortest paths takes $O(n^2)$. So, no.
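To see why the diameter is expensive, here is a sketch of the straightforward exact computation: one breadth-first search per node, keeping the largest distance found. It assumes a connected, unweighted graph, and the function names are illustrative.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest shortest-path distance from source, found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj):
    """Longest shortest path; assumes the graph is connected."""
    return max(eccentricity(adj, s) for s in adj)


# Toy graph (assumed): the path 0-1-2-3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
diam = diameter(path)
```

Every node starts its own full traversal, so the work grows at least quadratically with the number of nodes, which is the problem the slide points at.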
Scalability Many real world graphs are big, with $n$ in the order of millions. $O(n^2)$ is very scary for a graph miner. Current-day graph mining algorithms need to be linear in the number of edges, or else your paper will almost surely be rejected. What are the implications?
Summarising a Graph Given : a graph
Summarising a Graph Given : a graph Find : a succinct summary with possibly overlapping subgraphs
Summarising a Graph Given: a graph. Find: a succinct summary with possibly overlapping subgraphs ≈ important graph structures.
Community Detection Assumed graph Adjacency Matrix
Community Detection Real graph Adjacency Matrix
Summarising a Graph Fully Automatic Cross Associations is a nice MDL-based algorithm to summarise a matrix. 1) ReAssign: Given a grid, assign rows and columns s.t. entropy within the grid is minimal. 2) CrossAssoc: Find the cluster with highest entropy, split it, run ReAssign. Stop when no split reduces the MDL score. (Chakrabarti et al. 2004)
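The quantity that ReAssign tries to lower can be illustrated with a small sketch: given row and column group labels, sum the entropy of every block of the grid, weighted by block size. This covers only the data-cost part. It leaves out the model cost and the search itself, and the function names are assumptions rather than the authors' code.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a 0/1 cell that is 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def block_code_length(A, row_group, col_group):
    """Bits needed to encode the cells of A block by block, given group labels."""
    ones, size = {}, {}
    for i, row in enumerate(A):
        for j, cell in enumerate(row):
            key = (row_group[i], col_group[j])
            ones[key] = ones.get(key, 0) + cell
            size[key] = size.get(key, 0) + 1
    return sum(size[k] * binary_entropy(ones[k] / size[k]) for k in size)


# Toy matrix (assumed): two dense 2x2 blocks on the diagonal.
A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
good = block_code_length(A, [0, 0, 1, 1], [0, 0, 1, 1])  # pure blocks
bad = block_code_length(A, [0, 1, 0, 1], [0, 1, 0, 1])   # every block half full
```

A grouping that makes every block all-ones or all-zeros costs nothing to encode, whereas a grouping that mixes the two costs one bit per cell.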
Beyond Cave-men Communities Traditional community detection algorithms assume that you interact only with people in your ‘cave’. You are assumed not to interact with others, except if you are one of the few ‘messengers’ between ‘caves’. That is not very realistic. (Kang & Faloutsos, ICDM 2011)
Slash’n’Burn Slash’n’Burn finds the node $i$ with highest $d_i$, removes its edges $N_i$, and recurses. SlashBurn: 1. Slash top-$k$ hubs, burn edges. 2. Repeat on the remaining GCC (giant connected component). [Figures: before and after] (Kang & Faloutsos, ICDM 2011)
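The two steps can be sketched in a few lines on an adjacency-list graph. This is a simplified illustration under assumptions, not the published implementation. It breaks degree ties by node id, stops once the giant component has at most $k$ nodes, and the function names and return values are hypothetical.

```python
def connected_components(adj, nodes):
    """Connected components of the subgraph induced by the set `nodes`."""
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def slash_burn(adj, k=1):
    """Return (hubs in removal order, small components, leftover nodes)."""
    alive = set(adj)
    hubs, spokes = [], []
    while len(alive) > k:
        # Slash: take the k highest-degree nodes of the current graph.
        degree = {u: sum(1 for v in adj[u] if v in alive) for u in alive}
        top = sorted(alive, key=lambda u: (-degree[u], u))[:k]
        hubs.extend(top)
        alive -= set(top)  # burn: the hubs' edges disappear with them
        comps = connected_components(adj, alive)
        if not comps:
            break
        gcc = max(comps, key=len)  # giant connected component
        spokes.extend(c for c in comps if c is not gcc)
        alive = gcc  # repeat on the remaining GCC
    return hubs, spokes, alive


# Toy graph (assumed): hub 0 with leaves 1, 2, 3 and a smaller hub 4 with leaves 5, 6.
adj = {u: set() for u in range(7)}
for u, v in [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (4, 6)]:
    adj[u].add(v)
    adj[v].add(u)
hubs, spokes, rest = slash_burn(adj, k=1)
```

Ordering the nodes as hubs first and the small components last is what produces the large empty regions of the reordered adjacency matrix.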
Beyond Cave-men Communities Slash’n’Burn applied on the AS-Oregon graphs shows that real graphs indeed have structure beyond cave-men communities! – but also include those! A nice side-result is that the Slash’n’Burned ordered matrix has lots of ‘empty space’ and can hence be stored efficiently. (Kang & Faloutsos, ICDM 2011)
VoG: Summarizing and Understanding Large Graphs Danai Koutra, U Kang, Jilles Vreeken, Christos Faloutsos (Carnegie Mellon University; Korea Advanced Institute of Science and Technology) SDM, 25 April 2014, Philadelphia, USA
Main Idea Use a graph vocabulary: 1) Best graph summary 2) optimal compression (MDL)
Main Idea Use a graph vocabulary: 1) Shortest lossless description 2) optimal compression (MDL)
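Minimum Description Length Given a set of models $\mathcal{M}$, the best model $M \in \mathcal{M}$ is $\arg\min_{M \in \mathcal{M}} L(M) + L(D \mid M)$, where $L(M)$ is the number of bits for $M$ and $L(D \mid M)$ is the number of bits for the data using $M$.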
MDL example $L(M) + L(D \mid M)$: compare a simple model, $a_1 x + a_0$, with a complex one, $a_{10} x^{10} + a_9 x^9 + \dots + a_0$; $L(D \mid M)$ encodes the errors.
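A toy calculation shows how the two-part score trades model size against errors. The encoding here is an assumption made for illustration: 32 bits per coefficient, and integer residuals encoded with an Elias gamma code plus a sign bit. All function names are hypothetical.

```python
def elias_gamma_bits(k):
    """Length in bits of the Elias gamma code for an integer k >= 1."""
    return 2 * k.bit_length() - 1


def residual_bits(e):
    """Bits for one integer residual: gamma code of |e| + 1, plus a sign bit if e != 0."""
    return elias_gamma_bits(abs(e) + 1) + (1 if e != 0 else 0)


def two_part_cost(xs, ys, coeffs, bits_per_coeff=32):
    """L(M) + L(D | M) for the polynomial sum_p coeffs[p] * x**p."""
    model_bits = bits_per_coeff * len(coeffs)
    data_bits = 0
    for x, y in zip(xs, ys):
        pred = sum(c * x ** p for p, c in enumerate(coeffs))
        data_bits += residual_bits(round(y - pred))
    return model_bits + data_bits


# Toy data (assumed): points that lie exactly on the line y = 3x + 2.
xs = list(range(20))
ys = [3 * x + 2 for x in xs]
cost_const = two_part_cost(xs, ys, [sum(ys) / len(ys)])  # too simple: large errors
cost_line = two_part_cost(xs, ys, [2, 3])                # fits exactly with 2 coefficients
cost_deg10 = two_part_cost(xs, ys, [2, 3] + [0] * 9)     # fits exactly, but pays for 11 coefficients
```

Under these assumptions the line wins: the constant model pays for its errors, and the degree-10 model pays for coefficients it does not need.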
Minimum Graph Description Given: a graph $G$ with adjacency matrix $A$, and a vocabulary $\Omega$. Find: the model $M$ that minimises $L(G, M) = L(M) + L(E)$. [Figure: adjacency $A$, model $M$, error $E$]
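The relation between adjacency, model and error can be illustrated with a toy example. The sketch assumes that the error matrix marks every cell where the model and the graph disagree (an exclusive-or), and it uses a stripped-down, hypothetical vocabulary of cliques only.

```python
def clique_model(n, cliques):
    """Adjacency matrix M implied by a model that is a list of cliques (node sets)."""
    M = [[0] * n for _ in range(n)]
    for nodes in cliques:
        for i in nodes:
            for j in nodes:
                if i != j:
                    M[i][j] = 1
    return M


def error_matrix(A, M):
    """E = A XOR M: the cells where the model and the graph disagree."""
    return [[a ^ m for a, m in zip(row_a, row_m)] for row_a, row_m in zip(A, M)]


# Toy graph (assumed): triangle 0-1-2 with a pendant node 3 attached to node 2.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
M = clique_model(4, [{0, 1, 2}])
E = error_matrix(A, M)
n_errors = sum(map(sum, E)) // 2  # undirected: each edge shows up twice
```

Here the single clique explains the triangle, and the only thing left to encode in the error is the edge between nodes 2 and 3.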
VoG: Overview [Figure sequence; recoverable labels: ‘≈?’, ‘argmin’, ‘some criterion’, ‘Summary’]
We need candidate structures… How can we get them?
Step 1: Graph Decomposition We can use: any decomposition method. We did use/adapt: SlashBurn.
SnB Graph Decomposition Slash top-$k$ hubs, burn edges → candidate structures. Notice that the structures can overlap! Repeat on the remaining GCC. [Figures: before and after]
We got candidate structures. Now, how can we ‘label’ them?