Culprits and Islands. Jilles Vreeken. 4 July 2014 (TADA)
Service Announcement #1. Tensors: Introduction to tensors; Tensors in DM; Special topics in tensors. Introduction: Is DM science?; DM in action. Information Theory: MDL + patterns; Entropy + correlation; MaxEnt + iterative DM. Mixed Grill: Influence Propagation; Redescription Mining; <special request>. <special request>? Let us know (asap, by mail) what topic you would like to see discussed.
Who are the Culprits? B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos. 4 July 2014 (TADA)
First question of the day. How can we find the number and location of starting points for epidemics in graphs? – or – Who are the culprits?
Virus Propagation. Susceptible-Infected (SI) Model. Diseases over contact networks. [AJPH 2007] CDC data: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts.
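A minimal sketch of the SI model named on this slide, assuming the usual discrete-time formulation: in every step, each infected node infects each susceptible neighbor independently with probability beta, and infected nodes never recover. The function name `simulate_si`, the adjacency-dict graph format, and the parameter names are illustrative assumptions, not taken from the slides.

```python
import random

def simulate_si(graph, seeds, beta, steps, rng=None):
    """Simulate a discrete-time SI epidemic (illustrative sketch).

    graph: dict mapping each node to an iterable of its neighbors (assumed format).
    seeds: nodes infected at time 0.
    beta:  per-step, per-edge infection probability.
    """
    rng = rng or random.Random(0)
    infected = set(seeds)
    for _ in range(steps):
        newly_infected = set()
        for node in infected:
            for neighbor in graph[node]:
                # Each infected node tries to infect each susceptible neighbor.
                if neighbor not in infected and rng.random() < beta:
                    newly_infected.add(neighbor)
        # Nodes infected in this step only start spreading in the next step.
        infected |= newly_infected
    return infected

# Usage: a path graph 0 - 1 - 2 - 3 with a single seed.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
snapshot = simulate_si(path, seeds={0}, beta=1.0, steps=2)
```

The returned set plays the role of the infected snapshot from which the culprits have to be recovered.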
Culprits: Problem definition. 2d grid. Question: Who started it? Prior work: [Lappas et al. 2010, Shah et al. 2011]
Culprits: Exoneration
Who are the culprits. Two-step solution: 1) use MDL for the number of seeds; 2) for a given number: exoneration = centrality + penalty. Running time linear! (in edges and nodes). NetSleuth
Modeling using MDL. Minimum Description Length principle: induction by compression; related to Bayesian approaches. MDL = Model + Data. Cost of a Model (scoring the seed-set): encoding the integer |𝑇|, plus the number of possible |𝑇|-sized sets.
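The two model-cost terms on this slide (encode the size of the seed-set, then say which set of that size it is) can be sketched as follows. This is an illustration under assumptions: the slide does not say which integer code is used, so the sketch assumes Rissanen's universal code for integers, and it identifies the set by its index among all equally sized subsets of the nodes. The function names are hypothetical.

```python
import math

def universal_integer_bits(n):
    """Code length in bits of Rissanen's universal code for an integer n >= 1.

    Assumption: the slide only says "encoding integer"; this particular code
    is a common MDL choice, not confirmed by the source.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    bits = math.log2(2.865064)  # normalizing constant of the universal code
    term = math.log2(n)
    while term > 0:
        bits += term            # log2 n + log2 log2 n + ... (positive terms only)
        term = math.log2(term)
    return bits

def seed_set_cost_bits(num_nodes, num_seeds):
    """Model cost: bits for the set size plus bits to pick one set of that size."""
    size_bits = universal_integer_bits(num_seeds)
    choice_bits = math.log2(math.comb(num_nodes, num_seeds))
    return size_bits + choice_bits

# Usage: cost of naming 1, 2 or 3 seeds in a 100-node graph.
costs = [seed_set_cost_bits(100, k) for k in (1, 2, 3)]
```

Larger seed-sets cost more bits to describe, so extra seeds are only worth it when they make the data (the ripples) cheaper to encode.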
Modeling using MDL. Encoding the Data: Propagation Ripples. Figure panels: Original Graph; Infected Snapshot; Ripple R1; Ripple R2.
Modeling using MDL. Ripple cost of ripple R: how long the ripple is, plus how the ‘frontier’ advances. Total MDL cost. Prakash, Vreeken, Faloutsos 2012
How to optimize the score? Two-step process: given k, quickly identify a high-quality set S; given set S, optimize the ripple R.
Optimizing the score. High-quality k-seed-set: exoneration. Best single seed: smallest eigenvector of the Laplacian sub-matrix (analyze a constrained SI epidemic). Exonerate neighbors. Repeat.
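A rough sketch of the seed-selection loop on this slide, under stated assumptions: the Laplacian sub-matrix is taken to be the rows and columns of the full graph Laplacian that belong to infected nodes, the best single seed is the node with the largest absolute entry in the eigenvector of the smallest eigenvalue, and exoneration is approximated by removing the chosen seed from the infected set before repeating. The slides do not spell out these details. The eigenvector is found by plain power iteration on a shifted matrix, which is a stand-in and not the linear-time method the slides refer to. All function names are hypothetical.

```python
def smallest_eigenvector(matrix, iterations=500):
    """Eigenvector of the smallest eigenvalue of a symmetric matrix.

    Uses power iteration on (shift * I - matrix); the shift is a Gershgorin
    bound on the largest eigenvalue, so the order of eigenvalues is reversed.
    """
    n = len(matrix)
    shift = max(sum(abs(x) for x in row) for row in matrix)
    vec = [1.0] * n
    for _ in range(iterations):
        nxt = [shift * vec[i] - sum(matrix[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:  # degenerate case, e.g. a 1x1 sub-matrix
            break
        vec = [x / norm for x in nxt]
    return vec

def pick_seeds(graph, infected, k):
    """Greedily pick up to k seeds from the infected nodes (illustrative only)."""
    remaining = sorted(infected)
    seeds = []
    while remaining and len(seeds) < k:
        # Laplacian sub-matrix: full-graph degree on the diagonal,
        # -1 for every edge between two remaining infected nodes.
        sub = [[float(len(graph[u])) if u == v
                else (-1.0 if v in graph[u] else 0.0)
                for v in remaining] for u in remaining]
        vec = smallest_eigenvector(sub)
        best = max(range(len(remaining)), key=lambda i: abs(vec[i]))
        # Assumed exoneration step: drop the seed and recompute from scratch.
        seeds.append(remaining.pop(best))
    return seeds

# Usage: path graph 0..6 with the middle five nodes infected.
path = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 6} for i in range(7)}
first_seed = pick_seeds(path, {1, 2, 3, 4, 5}, k=1)
```

On the path example the centre of the infected stretch gets the largest eigenvector entry, which matches the intuition that the most central infected node is the most plausible single culprit.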
Optimizing the score. Optimizing R: get the MLE ripple! Finally, use the MDL score to tell us the best set. NetSleuth: linear running time in nodes and edges.
Experiments. How far are they? Evaluation functions: MDL based; overlap based (JD = Jaccard distance). Closer to 1, the better.
Experiments: # of Seeds. Panels: One Seed; Two Seeds; Three Seeds.
Experiments: Quality (MDL and JD). Panels: One Seed; Two Seeds; Three Seeds. Ideal = 1. Prakash, Vreeken, Faloutsos 2012
Experiments: Quality (Jaccard Scores). Panels: One Seed; Two Seeds; Three Seeds. Axes: NetSleuth vs. True. Closer to the diagonal, the better.
Experiments: Scalability
Conclusion. Given: graph and infections. Find: best ‘culprits’. Two-step solution: use MDL for the number of seeds; for a given number: exoneration = centrality + penalty. NetSleuth: linear running time in nodes and edges.
Connection Pathways. Leman Akoglu, Jilles Vreeken, Hanghang Tong, Polo Chau, Nikolaj Tatti, Christos Faloutsos (Akoglu et al. SDM’13)
Question at hand. How can we use a graph to explain a few selected nodes?
Given a ‘list’ of authors… What can we say? Let’s use relational information. Christos Faloutsos, Jeffrey F. Naughton, Surajit Chaudhuri, H. V. Jagadish, Hiroshi Ishii, Scott E. Hudson, David J. DeWitt, Gerhard Weikum, Shumin Zhai, Bonnie E. John, William Buxton, Abigail Sellen, Hector Garcia Molina, Raghu Ramakrishnan, Steve Benford, James A. Landay, Michael J. Carey, Ravin Balakrishnan, Brad A. Myers, Rakesh Agrawal
Using the co-authorship graph… Any structure? Too cluttered.
The Problem. Given a large graph G and a handful of nodes S marked by an external process, what can we say about S? Are they close by? Are they segregated? Do they form groups? Can we connect them with simple paths, maybe using a few connectors?
Our approach. Use the network structure to explain S. Partition S into groups of nodes, such that “simple” paths in G connect the nodes in each group, and nodes in different groups are “not easily reachable”. Use MDL to decide what is ‘simple’ and which partitioning is ‘best’.
Example. Simple connection pathways; good connectors; better sensemaking. Figure labels: CHI, VLDB.
Applications 1. Graph anomaly description/summarization, e.g. gene interaction network, top-k anomalies. Summarize top-k node anomalies by groups. Find connections/connectors among groups.
Applications 2. Query summarization, e.g. Web network, top-ranked pages. Summarize top-k query pages by groups. Find connections/connectors among groups.
Applications 3. Understanding dynamic events in graphs, e.g. social network, affected people. Event spread within groups is explained by the network. Event spread between groups is due to external influence.
Applications 4. Understanding semantic coherence, e.g. ontology network, set of words. Summarize words by semantically coherent groups. Find connectors (other relevant words) per group.
Applications 5. Understanding segregation (social science), e.g. school-children friendship network, students with attributes of interest. Summarize students by their social “circles”. Study groups (and groups within groups).
Problem: Formally. Problem Definition: given a graph G = (V, E) and a set of marked nodes M ⊆ V. Problem 1. Optimal partitioning: find a coherent partitioning P of M; find the optimal number of partitions |P|. Problem 2. Optimal connection subgraphs: efficiently find the minimum-cost set of subgraphs connecting the nodes in each part.
Objective: Informally. Our key idea is to use information theory. Imagine a sender and a receiver: both sender and receiver know the graph structure G, only the sender knows the set of marked nodes M. Goal: transmit M using as few bits as possible. Why would this work? Naïve: encode the ID of each marked node with log |V| bits. Better: exploit “close-by” nodes, restart for farther nodes.
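To see why exploiting close-by nodes can save bits, here is a deliberately simplified sketch. It assumes that naming an arbitrary node costs log2 |V| bits and that hopping from a node to one of its neighbors costs log2 of that node's degree. These cost choices and the function names are illustrative assumptions. The actual encoding in the talk has more terms (for example which visited nodes are marked).

```python
import math

def naive_bits(num_nodes, num_marked):
    """Every marked node is named from scratch with log2 |V| bits."""
    return num_marked * math.log2(num_nodes)

def hop_bits(graph, walk):
    """Name the first node of the walk, then pay log2(degree) per hop.

    graph: dict mapping each node to a collection of neighbors (assumed format).
    walk:  list of nodes in which consecutive nodes are adjacent.
    """
    bits = math.log2(len(graph))  # cost of "flying" to the start node
    for current, nxt in zip(walk, walk[1:]):
        if nxt not in graph[current]:
            raise ValueError(f"{current} and {nxt} are not adjacent")
        bits += math.log2(len(graph[current]))  # choose one of the neighbors
    return bits

# Usage: three adjacent marked nodes on a path graph with 8 nodes.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
naive = naive_bits(len(path), 3)      # 3 * log2(8) = 9 bits
hopping = hop_bits(path, [3, 4, 5])   # 3 + 1 + 1 = 5 bits
```

In low-degree neighborhoods a hop is much cheaper than a fresh node ID, which is what makes tightly connected marked nodes compress well together.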
Objective: Intuition. We think of encoding as hopping from node to node to encode close-by nodes, and flying to a new node to encode farther nodes, until all marked nodes are identified. Simplicity of a connection tree T is determined by: the number of flights we make across the graph; the ease of identifying the edges to follow next; the ease of identifying the marked nodes in our tour.
Objective: Formally. Minimize over P and T_i. Terms of the objective: encode #partitions; encode each part (root node of t; spanning tree of p_i; number of marked nodes in p_i; identities of marked nodes). Encoding of tree per part: recursively encode all tree nodes (#branches of node t; identities of branch nodes).
Solution: Intuition. The problem is NP-hard; it is related to the directed Steiner tree problem. Hence, we resort to heuristics… The general idea: transform G into a directed weighted graph G’; chop G’ into sub-graphs; find low-cost minimal spanning trees per sub-graph (we give 4 efficient algorithms).
Solution: Preliminaries. Graph transformation: given an undirected, unweighted graph G, we transform it into a directed, weighted graph G’. Given G’, the problem becomes: find the set of trees with minimum total cost on the marked nodes. Finding bounded-length paths: find (multiple) short paths of bounded length between marked nodes in G’; employ BFS-like expansion.
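The BFS-like expansion for bounded-length paths can be sketched as below. This is a simplification under assumptions: it counts hops rather than edge weights, since the weights of G’ are not given here, and it keeps one shortest path per ordered pair of marked nodes, whereas the slide mentions multiple paths. The function name and the graph format are hypothetical.

```python
from collections import deque

def bounded_paths(graph, marked, max_hops):
    """Find one shortest path (in hops) between marked nodes, up to max_hops.

    graph: dict mapping each node to an iterable of successors (assumed format).
    Returns a dict {(source, target): [source, ..., target]}.
    """
    marked = set(marked)
    paths = {}
    for source in marked:
        parent = {source: None}
        depth = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if depth[node] == max_hops:
                continue  # do not expand beyond the length bound
            for neighbor in graph[node]:
                if neighbor not in parent:
                    parent[neighbor] = node
                    depth[neighbor] = depth[node] + 1
                    queue.append(neighbor)
        for target in marked:
            if target != source and target in parent:
                # Walk the parent pointers back to the source.
                path, current = [], target
                while current is not None:
                    path.append(current)
                    current = parent[current]
                paths[(source, target)] = path[::-1]
    return paths

# Usage: path graph 0..5 with marked nodes 0, 2 and 5, at most 2 hops.
path_graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
found = bounded_paths(path_graph, marked={0, 2, 5}, max_hops=2)
```

Marked nodes that are not reachable within the bound simply get no path, which is one way the graph could be chopped into separately handled groups.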