rumours in graphs

Rumours in Graphs, Jilles Vreeken, 24 July 2015



  1. Rumours in Graphs Jilles Vreeken 24 July 2015

  2. Service Announcement #1: The Exam. 1) 20 minutes per person. 2) Questions can be on any topic covered in a) the lectures, b) the required reading, c) the assignments (1 topic per assignment, your choice). 3) Grade will be based on your performance in the exam, minus any bonus points you may have acquired. 4) Timeslots will be mailed today.

  3. Service Announcement #2: Introduction; Patterns; Correlation and Causation; (Subjective) Interestingness; Graphs; Wrap-up + <ask-me-anything>.

  4. Service Announcement #2: <ask-me-anything>? Yes! Prepare questions on anything* you’ve always wanted to ask me. Mail them to me in advance, or have me answer on the spot. (* preferably related to TADA, data mining, machine learning, science, the world, etc.) Outline: Introduction; Patterns; Correlation and Causation; (Subjective) Interestingness; Graphs; Wrap-up + <ask-us-anything>.

  5. Service Announcement #3: Next week there is a high chance of chocolate or, if the weather permits, ice cream.

  6. Who are the Culprits? B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos

  7. First question of the day: How can we find the number and location of starting points for epidemics in graphs? (Prakash, Vreeken & Faloutsos, ICDM 2012)

  8. First question of the day: Who are the culprits? (Prakash, Vreeken & Faloutsos, ICDM 2012)

  9. Virus Propagation: Susceptible-Infected (SI) Model. Diseases over contact networks. [AJPH 2007] CDC data: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts.
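
To make the SI model concrete, here is a minimal sketch of a discrete-time SI epidemic on a 2D grid. The function names, the per-step infection probability `beta`, and the grid helper are illustrative assumptions, not code from the lecture.

```python
import random


def grid_graph(width, height):
    """Build a 2D grid as an adjacency dict {node: [neighbours]}."""
    adjacency = {(x, y): [] for x in range(width) for y in range(height)}
    for (x, y) in adjacency:
        for dx, dy in ((1, 0), (0, 1)):
            neighbour = (x + dx, y + dy)
            if neighbour in adjacency:
                adjacency[(x, y)].append(neighbour)
                adjacency[neighbour].append((x, y))
    return adjacency


def simulate_si(adjacency, seeds, beta, steps, rng=None):
    """Run an SI epidemic and return {node: time step of infection}.

    In every step each infected node infects each susceptible neighbour
    independently with probability beta; infected nodes never recover.
    """
    rng = rng or random.Random(0)
    infected_at = {seed: 0 for seed in seeds}
    for step in range(1, steps + 1):
        newly_infected = set()
        for node in infected_at:
            for neighbour in adjacency[node]:
                if neighbour not in infected_at and rng.random() < beta:
                    newly_infected.add(neighbour)
        for node in newly_infected:
            infected_at[node] = step
    return infected_at


if __name__ == "__main__":
    grid = grid_graph(5, 5)
    snapshot = simulate_si(grid, [(2, 2)], beta=0.5, steps=3)
    print(sorted(snapshot.items(), key=lambda item: item[1]))
```

The returned infection times are one possible "ripple"; the culprit problem observes only the set of infected nodes and has to work backwards to the seeds.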

  10. Culprits: Problem definition 2d grid Question: Who started it?

  11. Related Work: Culprits (Partial). Shah and Zaman, IEEE TIT, 2011: one seed; provably finds the MLE seed for d-regular trees; SI process. Lappas et al., KDD, 2010: k seeds (takes k as input); infected graph assumed to be in steady state; IC model.

  12. Culprits: Problem definition 2d grid Question: Who started it?

  13. Culprits: Exoneration

  14. Culprits: Exoneration

  15. Who are the culprits? NetSleuth: a two-step solution. 1) Use MDL for the number of seeds. 2) For a given number: exoneration = centrality + penalty. Running time linear (in edges and nodes)!

  16. Modeling using MDL. Minimum Description Length principle: induction by compression; related to Bayesian approaches. MDL = Model + Data. Cost of a model (scoring the seed set S): the cost of encoding the integer |S|, plus the cost of picking one of the possible |S|-sized sets.
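
The two components of the seed-set cost can be sketched in a few lines. This is a sketch under assumptions: the integer is encoded with Rissanen's universal code for integers, and the set is identified by an index over all equally sized subsets of the nodes. The slide names only the two components, so both encodings and the function names are illustrative.

```python
import math


def universal_integer_cost(n):
    """Bits needed for an integer n >= 1 under Rissanen's universal code."""
    if n < 1:
        raise ValueError("n must be at least 1")
    cost = math.log2(2.865064)  # normalising constant of the code
    term = math.log2(n)
    while term > 0:  # add log n + log log n + ... while the terms are positive
        cost += term
        term = math.log2(term)
    return cost


def seed_set_cost(num_nodes, num_seeds):
    """Bits to state how many seeds there are, and which nodes they are."""
    count_cost = universal_integer_cost(num_seeds)
    choice_cost = math.log2(math.comb(num_nodes, num_seeds))
    return count_cost + choice_cost


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(seed_set_cost(1000, k), 2))
```

Every extra seed makes the model more expensive, so a larger seed set only pays off if it makes the observed infections sufficiently cheaper to describe.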

  17. Modeling using MDL. Encoding the data: propagation ripples. [Figure: original graph, infected snapshot, and two ripples, R1 and R2]

  18. Modeling using MDL. Ripple cost of a ripple R: how long the ripple is, plus how the ‘frontier’ advances. Total MDL cost: the cost of the seed set plus the cost of the ripple.

  19. How to optimize the score? Two-step process: 1) given k, quickly identify a high-quality seed set S; 2) given the set S, optimize the ripple R.

  20. Optimizing the score. High-quality k-seed-set: exoneration. Best single seed: smallest eigenvector of the Laplacian sub-matrix (analyze a constrained SI epidemic). Exonerate neighbors. Repeat.
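
The seed-selection loop above can be sketched as follows. This is an illustration under assumptions: the Laplacian sub-matrix is taken to be the graph Laplacian restricted to the infected nodes, its smallest eigenvector is approximated by power iteration on a shifted matrix, and "exoneration" is read as dropping a chosen seed from the infected set before repeating. All function names are hypothetical.

```python
import math


def best_single_seed(adjacency, infected, iterations=500):
    """Return (seed, scores) for the infected set.

    scores approximates the eigenvector belonging to the smallest eigenvalue
    of the Laplacian sub-matrix L_sub (rows and columns of infected nodes).
    Power iteration runs on shift*I - L_sub, whose dominant eigenvector is
    the smallest eigenvector of L_sub.
    """
    nodes = sorted(infected)
    inside = set(nodes)
    shift = 2 * max(len(adjacency[u]) for u in nodes) + 1
    scores = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        updated = {}
        for u in nodes:
            # Diagonal entry of L_sub is the degree of u in the full graph,
            # off-diagonal entries are -1 for infected neighbours.
            value = (shift - len(adjacency[u])) * scores[u]
            value += sum(scores[v] for v in adjacency[u] if v in inside)
            updated[u] = value
        norm = math.sqrt(sum(x * x for x in updated.values()))
        scores = {u: x / norm for u, x in updated.items()}
    seed = max(nodes, key=lambda u: scores[u])
    return seed, scores


def top_k_seeds(adjacency, infected, k):
    """Pick k seeds greedily, removing each chosen seed before repeating."""
    remaining = set(infected)
    seeds = []
    for _ in range(k):
        seed, _scores = best_single_seed(adjacency, remaining)
        seeds.append(seed)
        remaining.discard(seed)  # assumed form of exoneration
    return seeds


if __name__ == "__main__":
    # Path 0-1-2-3-4-5-6 with nodes 1..5 infected.
    path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
    print(top_k_seeds(path, {1, 2, 3, 4, 5}, 1))
```

On the path example the middle infected node gets the highest score, which matches the intuition that the most central infected node is the most plausible single culprit.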

  21. Optimizing the score. Optimizing R: get the MLE ripple! Finally, use the MDL score to tell us the best set. NetSleuth: linear running time in nodes and edges.

  22. Experiments. How far are they? Evaluation functions: MDL based; overlap based (JD = Jaccard distance). The closer to 1, the better.

  23. Experiments: # of Seeds. [Figure panels: One Seed, Two Seeds, Three Seeds]

  24. Experiments: Quality (MDL and JD). Ideal = 1. [Figure panels: One Seed, Two Seeds, Three Seeds]

  25. Experiments: Quality (Jaccard Scores). NetSleuth: the closer to the true diagonal, the better. [Figure panels: One Seed, Two Seeds, Three Seeds]

  26. Experiments: Scalability

  27. Intermediate Conclusion. Given: graph and infections. Find: best ‘culprits’. Two-step solution: use MDL for the number of seeds; for a given number: exoneration = centrality + penalty. NetSleuth: linear running time in nodes and edges.

  28. Hidden Hazards. Shashidhar Sundareisan, Jilles Vreeken, B. Aditya Prakash

  29. But: Real data is noisy! We don’t know exactly who is infected. Epidemiology: public-health surveillance. Each level of the surveillance pyramid has a certain probability to miss some truly infected people [Nishiura+, PLoS ONE 2011]. [Figure: surveillance pyramid; labels include CDC, Lab, Hospital, CNN headlines, and ‘Not sure’]

  30. Real data is noisy! Correcting missing data is by itself very important. Social media: on Twitter, due to uniform sampling [Morstatter+ 2013], the relevant ‘infected’ tweets may be missed. [Figure: tweets pass through sampling; some sampled tweets are marked as missing]

  31. Third question of the day: Given a sample of the infectees, how can we find out the number and location of starting points of the epidemic, as well as the missing nodes? (Sundareisan et al., SDM’15)

  32. The Problem. GIVEN: graph G(V, E) from historical data; infected set D ⊂ V, sampled (p%) and incomplete; infectivity β of the virus (assumed to follow the SI model). FIND: seed set, i.e. patient zeros/culprits; set C− (the missing infected nodes); ripple R (the order of infections).
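
The "sampled and incomplete" input can be illustrated with a small sketch: every truly infected node is observed independently with a fixed sampling probability, and the rest end up as missing nodes. The function name and the use of a probability in [0, 1] (rather than a percentage) are assumptions made for illustration.

```python
import random


def sample_infected(true_infected, sampling_probability, rng=None):
    """Split the truly infected nodes into (observed, missing) sets.

    Each node is kept with probability sampling_probability, independently
    of all others; nodes that are not kept are the hidden, missing nodes.
    """
    rng = rng or random.Random(0)
    observed = set()
    for node in sorted(true_infected):  # sorted for reproducible draws
        if rng.random() < sampling_probability:
            observed.add(node)
    missing = set(true_infected) - observed
    return observed, missing


if __name__ == "__main__":
    observed, missing = sample_infected(range(20), 0.7)
    print(len(observed), "observed,", len(missing), "missing")
```

The task on the slide starts from the observed set only and has to recover the seeds, the missing set, and the infection order.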

  33. Related Work: Missing Nodes (Partial). Costenbader & Valente 2003; Kossinets 2006; Borgatti et al. 2006: study the effect of sampling on macro-level network statistics. Adiga et al. 2013: sensitivity of total infections to noise in network structure. Sadikov et al., WSDM, 2011: correct for sampling for macro-level cascade statistics.

  34. Outline: Motivation/Introduction; Problem Definition; Our Approach (MDL; Decoupling; Finding S given C; Finding C given S); Experiments; Conclusion

  35. MDL Encoding For Our Problem. Sender and receiver both know the graph G(V, E), the infectivity (β), and the sampling rate (p). The model: seeds (S), ripple (R). Data given model: missing nodes (C−). [Figure labels: seeds (S), infected set (D), ripple (R), missing nodes (C−)]

  36. Model (S, R) Cost. How to score a seed set (S): the cost of encoding the integer |S|, plus the cost of picking one of the possible |S|-sized sets. How to score the ripple?

  37. Model (S, R) Cost. Scoring a ripple (R). [Figure: original graph, infected snapshot, and two ripples, R1 and R2]

  38. Model (S, R) Cost. Ripple cost of a ripple R: how long the ripple is, plus how the ‘frontier’ advances.

  39. Cost of the data (C−). Now you know too much: for you to know what D was, we need to transmit which nodes are the missed nodes C− (green nodes). Detail: γ = 1 − p, i.e. the probability of a node to be truly missing.
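
One simple way to price this transmission is sketched below. It assumes every truly infected node is missed independently with the miss probability (one minus the sampling rate), so the cost is the negative log-likelihood of the observed/missed split. This is an illustrative encoding, not necessarily the one used in the lecture, and the function name is hypothetical.

```python
import math


def missing_nodes_cost(num_observed, num_missing, sampling_probability):
    """Bits to tell the receiver which infected nodes were missed.

    Assumes independent misses: each missed node costs -log2(miss
    probability) bits and each observed node -log2(1 - miss probability).
    """
    if not 0.0 < sampling_probability < 1.0:
        raise ValueError("sampling_probability must lie strictly in (0, 1)")
    miss_probability = 1.0 - sampling_probability
    missed_bits = -num_missing * math.log2(miss_probability)
    observed_bits = -num_observed * math.log2(1.0 - miss_probability)
    return missed_bits + observed_bits


if __name__ == "__main__":
    # With a high sampling rate, every extra hypothesised missing node is costly.
    print(round(missing_nodes_cost(10, 1, 0.9), 2))
    print(round(missing_nodes_cost(10, 2, 0.9), 2))
```

Under this reading, a high sampling rate makes each additional hypothesised missing node expensive, which keeps the method from "filling in" half the graph.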

  40. Total MDL Cost. Finally, we have $L(D, S, R) = L(S) + L(R \mid S) + L(D \mid S, R)$. Our problem is now to find those S, R, C− that minimize it.

  41. Outline: Motivation/Introduction; Problem Definition; Our Approach (MDL; Decoupling; Finding S given C; Finding C given S); Experiments; Conclusion

  42. Our Approach: Decoupling. The two problems are 1) finding the seeds and ripple (S, R), and 2) finding the missing nodes (C−). Can we decouple these problems?

  43. Decoupling the problems (contd.). Finding seeds depends on missing nodes. NetSleuth: no missing nodes as input, no missing nodes as output. NetFill: correctly fills in the nodes missing from the input. [Figure legend: missing nodes, seed, infected node]

  44. Decoupling the problems (contd.). Finding missing nodes also depends on seeds. Most probably A was missed. [Figure: seed S, nodes A and B; legend: infected, not infected]

  45. Outline: Motivation/Introduction; Problem Definition; Our Approach (MDL; Decoupling; Finding S given C; Finding C given S); Experiments; Conclusion

  46. Finding missing nodes (C−) and culprits (S). 1) Suppose an oracle gives us the missing nodes (C−). 2) We then have the complete infected set (D ∪ C−). 3) Apply NetSleuth directly: NO SAMPLING INVOLVED, and it will give us the seed set! [Figure: applying NetSleuth* on the oracle’s answer; legend: missing nodes, seed, infected node]

  47. Outline: Motivation/Introduction; Problem Definition; Our Approach (MDL; Decoupling; Finding S given C; Finding C given S); Experiments; Conclusion

  48. Missing Nodes (C−) given (S). Oracle gives us S, find C−. The naïve approach: find all possible C−; pick the best one according to MDL. Sadly, this is infeasible in practice, as we would have to score $2^{|V \setminus D|}$ sets.
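
The naïve approach is easy to write down, which also makes its cost visible: every subset of the unobserved nodes is a candidate. In the sketch below the `score` callable stands in for the MDL score and is a hypothetical placeholder, as are the function names.

```python
from itertools import combinations


def candidate_missing_sets(all_nodes, observed):
    """Yield every subset of the nodes that were not observed as infected."""
    hidden = sorted(set(all_nodes) - set(observed))
    for size in range(len(hidden) + 1):
        for subset in combinations(hidden, size):
            yield set(subset)


def naive_best_missing_set(all_nodes, observed, score):
    """Score every candidate set and return the cheapest one."""
    return min(candidate_missing_sets(all_nodes, observed), key=score)


if __name__ == "__main__":
    nodes = range(12)
    observed = {0, 1}
    count = sum(1 for _ in candidate_missing_sets(nodes, observed))
    print(count)  # doubles with every additional unobserved node
```

With ten unobserved nodes there are already 1024 candidates, and the count doubles with each further node, so exhaustive scoring is only feasible on toy graphs.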

  49. Our Approach. Sub-problem 1: |Seeds| = 1, |Missing nodes| = 1. Sub-problem 2: finding the right number of missing nodes. Sub-problem 3: |Seeds| > 1.

  50. Sub-problem 1: Best hidden hazard given one seed. The best node is the one that makes the seed S most likely; we use empirical risk as the measure. Sanity check: ideally, risk should be 0. So, the best hidden hazard is
