
Culprits and Islands, Jilles Vreeken - PowerPoint PPT Presentation



  1. Culprits and Islands. Jilles Vreeken. 4 July 2014 (TADA)

  2. Service Announcement #1. Tensors: Introduction to tensors; Tensors in DM; Special topics in tensors. Introduction: Is DM science?; DM in action. Information Theory: MDL + patterns; Entropy + correlation; MaxEnt + iterative DM. Mixed Grill: Influence Propagation; Redescription Mining; <special request>

  3. Service Announcement #1. <special request>? Let us know (asap, mail) what topic you would like to see discussed. Tensors: Introduction to tensors; Tensors in DM; Special topics in tensors. Introduction: Is DM science?; DM in action. Information Theory: MDL + patterns; Entropy + correlation; MaxEnt + iterative DM. Mixed Grill: Influence Propagation; Redescription Mining; <special request>

  4. Who are the Culprits? B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos. 4 July 2014 (TADA)

  5. First question of the day: How can we find the number and location of starting points for epidemics in graphs? Or: who are the culprits?

  6. Virus Propagation: Susceptible-Infected (SI) Model. Diseases over contact networks. CDC data [AJPH 2007]: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts
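
The SI model named on this slide can be sketched in a few lines. This is a minimal discrete-time simulation, not the speakers' code: the infection probability `beta`, the step count, and the 3x3 grid are illustrative assumptions.

```python
import random


def simulate_si(adj, seeds, beta, steps, rng=None):
    """Run a discrete-time SI epidemic and return the set of infected nodes.

    adj   : dict mapping node -> list of neighbours (undirected graph)
    seeds : nodes infected at time 0
    beta  : per-step probability that an infected node infects a neighbour
            (an assumed parameterisation; the slide does not give one)
    """
    rng = rng or random.Random(0)
    infected = set(seeds)
    for _ in range(steps):
        newly = set()
        for u in infected:
            for v in adj[u]:
                # Infected nodes never recover; susceptible ones may flip.
                if v not in infected and rng.random() < beta:
                    newly.add(v)
        infected |= newly
    return infected


# Hypothetical 3x3 grid, nodes numbered row by row.
grid = {
    0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
    3: [0, 4, 6], 4: [1, 3, 5, 7], 5: [2, 4, 8],
    6: [3, 7], 7: [4, 6, 8], 8: [5, 7],
}
snapshot = simulate_si(grid, seeds=[4], beta=1.0, steps=1)
```

With `beta=1.0` the spread is deterministic, so one step from the centre infects exactly its four neighbours. The culprit problem is the reverse direction: given only `snapshot` and `grid`, recover the seeds.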

  7. Culprits: Problem definition. 2d grid. Question: Who started it?

  8. Culprits: Problem definition. 2d grid. Question: Who started it? Prior work: [Lappas et al. 2010, Shah et al. 2011]

  9. Culprits: Exoneration

  10. Culprits: Exoneration

  11. Who are the culprits: Two-step solution. 1) use MDL for number of seeds; 2) for a given number: exoneration = centrality + penalty. Running time linear! (in edges and nodes). NetSleuth

  12. Modeling using MDL: Minimum Description Length principle. Induction by Compression. Related to Bayesian approaches. MDL = Model + Data. Cost of a Model (scoring the seed-set): encoding the integer |𝑇|, plus the number of possible |𝑇|-sized sets
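
The two terms of the model cost can be sketched as follows. This assumes the integer is encoded with Rissanen's universal code for integers and that the seed set is identified by its index among all equally sized subsets of some candidate pool. The slide does not say which nodes form that pool, so `num_candidates` is a hypothetical parameter.

```python
from math import comb, log2


def universal_int_bits(n):
    """Code length in bits of an integer n >= 1 under Rissanen's universal code."""
    if n < 1:
        raise ValueError("n must be >= 1")
    bits = log2(2.865064)  # normalising constant of the universal code
    term = log2(n)
    while term > 0:        # log n + log log n + ... (positive terms only)
        bits += term
        term = log2(term)
    return bits


def seed_set_cost(num_candidates, k):
    """Bits to state the seed-set size k, then which k-sized subset it is."""
    return universal_int_bits(k) + log2(comb(num_candidates, k))


one_seed = seed_set_cost(100, 1)
two_seeds = seed_set_cost(100, 2)
```

Larger seed sets cost more model bits, so an extra seed only pays off if it saves at least as many bits when encoding the data.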

  13. Modeling using MDL: Encoding the Data: Propagation Ripples. Figure panels: Original Graph, Infected Snapshot, Ripple R1, Ripple R2

  14. Modeling using MDL: Ripple cost for a ripple R: how long the ripple is, and how the ‘frontier’ advances. Total MDL cost. Prakash, Vreeken, Faloutsos 2012

  15. How to optimize the score? Two-step process: given k, quickly identify a high-quality set S; given set S, optimize the ripple R

  16. Optimizing the score: High-quality k-seed-set → exoneration. Best single seed: smallest eigenvector (the one for the smallest eigenvalue) of the Laplacian sub-matrix → analyze a Constrained SI epidemic. Exonerate neighbors. Repeat
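
A rough sketch of the single-seed step, under stated assumptions: the Laplacian sub-matrix is taken over the infected nodes with degrees counted in the full graph, the seed is the node with the largest entry in the eigenvector of the smallest eigenvalue, and "exonerate, repeat" is read as dropping the chosen seed from the infected set before the next round. Shifted power iteration stands in for a proper eigensolver to keep the example dependency-free. It is not the NetSleuth implementation.

```python
def best_single_seed(adj, infected, iters=500):
    """Node with the largest entry in the smallest eigenvector of L_A."""
    nodes = sorted(infected)
    index = {u: i for i, u in enumerate(nodes)}
    # Eigenvalues of L_A are at most 2 * max degree (Gershgorin), so the
    # dominant eigenvector of (shift*I - L_A) is the smallest one of L_A.
    shift = 2 * max(len(adj[u]) for u in nodes) + 1
    x = [1.0] * len(nodes)
    for _ in range(iters):
        y = []
        for u in nodes:
            # Row u of (shift*I - L_A) applied to x.
            val = (shift - len(adj[u])) * x[index[u]]
            val += sum(x[index[v]] for v in adj[u] if v in index)
            y.append(val)
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    best = max(range(len(nodes)), key=lambda i: abs(x[i]))
    return nodes[best]


def exoneration_seeds(adj, infected, k):
    """Greedy k seeds; assumes k <= len(infected)."""
    remaining = set(infected)
    seeds = []
    for _ in range(k):
        s = best_single_seed(adj, remaining)
        seeds.append(s)
        remaining.discard(s)  # assumption: chosen seed no longer counts as infected
    return seeds


# Hypothetical path graph 0-1-...-6 with the middle five nodes infected.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
first = best_single_seed(path, {1, 2, 3, 4, 5})
```

On this path the smallest eigenvector peaks at the centre of the infected stretch, which is the intuitive single culprit.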

  17. Optimizing the score: Optimizing R → get the MLE ripple! Finally, use the MDL score to tell us the best set. NetSleuth: linear running time in nodes and edges

  18. Experiments: How far are they? Evaluation functions: MDL based; overlap based (JD = Jaccard distance). Closer to 1, the better

  19. Experiments: # of Seeds. Panels: One Seed, Two Seeds, Three Seeds

  20. Experiments: Quality (MDL and JD). Panels: One Seed, Two Seeds, Three Seeds. Ideal = 1. Prakash, Vreeken, Faloutsos 2012

  21. Experiments: Quality (Jaccard Scores). Panels: One Seed, Two Seeds, Three Seeds. Axes: NetSleuth, True. Closer to the diagonal, the better

  22. Experiments: Scalability

  23. Conclusion. Given: Graph and Infections. Find: Best ‘Culprits’. Two-step solution: use MDL for number of seeds; for a given number: exoneration = centrality + penalty. NetSleuth: linear running time in nodes and edges

  24. Connection Pathways. Leman Akoglu, Jilles Vreeken, Hanghang Tong, Polo Chau, Nikolaj Tatti, Christos Faloutsos (Akoglu et al. SDM’13)

  25. Question at hand: How can we use a graph to explain a few selected nodes?

  26. Given a ‘list’ of authors… What can we say? → let’s use relational information. Christos Faloutsos, Jeffrey F. Naughton, Surajit Chaudhuri, H. V. Jagadish, Hiroshi Ishii, Scott E. Hudson, David J. DeWitt, Gerhard Weikum, Shumin Zhai, Bonnie E. John, William Buxton, Abigail Sellen, Hector Garcia Molina, Raghu Ramakrishnan, Steve Benford, James A. Landay, Michael J. Carey, Ravin Balakrishnan, Brad A. Myers, Rakesh Agrawal

  27. Given a ‘list’ of authors… What can we say? → let’s use relational information

  28. Using the co-authorship graph… Any structure? → too cluttered

  29. The Problem. Given a large graph G and a handful of nodes S marked by an external process. What can we say about S? Are they close by? Are they segregated? Do they form groups? Can we connect them? With simple paths? Maybe using a few connectors?

  30. Our approach: Use the network structure to explain S. Partition S into groups of nodes, such that “simple” paths in G connect the nodes in each group, and nodes in different groups are “not easily reachable”. Use MDL to define ‘simple’ and to pick the ‘best’ partitioning

  31. Example: Simple connection pathways → good connectors → better sensemaking. Figure labels: CHI, VLDB

  32. Applications: 1. Graph anomaly description/summarization, e.g. gene interaction network, top-k anomalies. Summarize top-k node anomalies by groups; find connections/connectors among groups

  33. Applications: 2. Query summarization, e.g. Web network, top-ranked pages. Summarize top-k query pages by groups; find connections/connectors among groups

  34. Applications: 3. Understanding dynamic events in graphs, e.g. social network, affected people. Event spread within groups is explained by the network; event spread between groups is due to external influence

  35. Applications: 4. Understanding semantic coherence, e.g. ontology network, set of words. Summarize words by semantically coherent groups; find connectors (other relevant words) per group

  36. Applications: 5. Understanding segregation (social science), e.g. school-children friendship network, students with attributes of interest. Summarize students by their social “circles”; study groups (and groups within groups)

  37. Problem: Formally. Problem Definition: Given a graph G = (V, E) and a set of marked nodes M ⊆ V. Problem 1. Optimal partitioning: find a coherent partitioning P of M; find the optimal number of partitions |P|. Problem 2. Optimal connection subgraphs: efficiently find the minimum cost set of subgraphs connecting the nodes in each part

  38. Objective: Informally. Our key idea is to use information theory. Imagine a sender and a receiver: both sender and receiver know the graph structure G; only the sender knows the set of marked nodes M; goal: transmit M using as few bits as possible. Why would this work? Naïve: encode the ID of each marked node with log |V| bits. Better: exploit “close-by” nodes, restart for farther nodes

  39. Objective: Intuition. We think of encoding as hopping from node to node to encode close-by nodes, and flying to a new node to encode farther nodes, until all marked nodes are identified. Simplicity of a connection tree T is determined by: the number of flights we make across the graph; the ease of identifying the edges to follow next; the ease of identifying the marked nodes in our tour
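
The hop-versus-fly intuition can be made concrete with a toy bit count. This is only an illustration under simplifying assumptions: a flight costs log2 |V| bits (a full node ID), a hop costs log2 of the current node's degree (choosing one outgoing edge), and the costs of flagging marked nodes and branch counts are ignored. The function names and the tour format are hypothetical.

```python
from math import log2


def naive_bits(num_nodes, num_marked):
    """Send every marked node as a full node ID."""
    return num_marked * log2(num_nodes)


def tour_bits(num_nodes, degrees, tour):
    """Cost of a tour given as ("fly", node) or ("hop", from_node, to_node) steps."""
    bits = 0.0
    for step in tour:
        if step[0] == "fly":
            bits += log2(num_nodes)         # name any node in the graph
        else:
            bits += log2(degrees[step[1]])  # pick one edge of from_node
    return bits


# Hypothetical graph with 1024 nodes; marked nodes a, b, c lie on a short path.
degrees = {"a": 4, "b": 2, "c": 8}
tour = [("fly", "a"), ("hop", "a", "b"), ("hop", "b", "c")]
naive = naive_bits(1024, 3)
hopping = tour_bits(1024, degrees, tour)
```

Here the naive code needs 30 bits while one flight plus two hops needs 13, which is why close-by marked nodes end up in the same part.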

  40. Objective: Formally. Minimize over P and T_i the total cost: encode #partitions; encode each part p_i (root node; spanning tree t of p_i; number of marked nodes in p_i; identities of marked nodes); encoding of the tree per part: recursively encode all tree nodes (#branches of node t; identities of branch nodes)

  41. Solution: Intuition. The problem is NP-hard → related to the directed Steiner tree problem. Hence, we resort to heuristics. The general idea: transform G into a directed weighted graph G’; chop G’ into sub-graphs; find low-cost minimal spanning trees per sub-graph (we give 4 efficient algorithms)

  42. Solution: Preliminaries. Graph transformation: given undirected unweighted G → we transform it into directed weighted G’. Given G’, the problem becomes: find the set of trees with minimum total cost on the marked nodes. Finding bounded-length paths: (multiple) short paths of bounded length between marked nodes in G’; employ BFS-like expansion
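
The BFS-like expansion can be sketched as a depth-limited breadth-first search from each marked node. This simplified version works on hop counts in an unweighted graph and records only the shortest distance to each reachable marked node, whereas the slides operate on the weighted graph G’ and collect multiple paths. The bound name `max_hops` is a hypothetical stand-in for the length limit.

```python
from collections import deque


def bounded_reach(adj, marked, max_hops):
    """For each marked node, find the marked nodes within max_hops and their distance."""
    marked = set(marked)
    reach = {}
    for src in marked:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue  # do not expand beyond the length bound
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        reach[src] = {v: d for v, d in dist.items() if v in marked and v != src}
    return reach


# Hypothetical path graph 0-1-2-3-4 with marked nodes 0, 2 and 4.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
pairs = bounded_reach(line, [0, 2, 4], max_hops=2)
```

Marked nodes that never reach each other within the bound are natural candidates for separate parts.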
