


  1. CS224W: Machine Learning with Graphs Jure Leskovec, Stanford University http://cs224w.stanford.edu

  2. (1) New problem: outbreak detection.
     (2) Develop an approximation algorithm.
         - It is a submodular optimization problem!
     (3) Speed up greedy hill-climbing.
         - Valid for optimizing general submodular functions (i.e., also works for influence maximization)
     (4) Prove a new "data-dependent" bound on the solution quality.
         - Valid for optimizing any submodular function (i.e., also works for influence maximization)

     11/12/19, Jure Leskovec, Stanford CS224W: Machine Learning with Graphs, http://cs224w.stanford.edu

  3. Given a real city water distribution network and data on how contaminants spread in the network, detect the contaminant as quickly as possible. Problem posed by the US Environmental Protection Agency.

  4. Which users/news sites should one follow to detect cascades as effectively as possible? [Figure: posts by users/blogs connected by time-ordered hyperlinks, forming an information cascade.]

  5. Want to read things before others do. [Figure: one choice detects the blue and yellow stories soon but misses the red story; another detects all stories, but late.]

  6. - Both of these are instances of the same underlying problem!
     - Given a dynamic process spreading over a network, we want to select a set of nodes to detect the process effectively
     - Many other applications:
       - Epidemics
       - Influence propagation
       - Network security

  7. - Utility of placing sensors:
       - Water flow dynamics, demands of households, …
     - For each subset $S \subseteq V$ compute utility $f(S)$
     - [Figure: set $V$ of all network junctions with candidate sensors $S_1$ to $S_4$ and contamination outbreaks of high, medium, and low impact. A sensor reduces impact through early detection. One placement has low sensing "quality" (e.g., $f(S) = 0.01$), another has high sensing "quality" (e.g., $f(S) = 0.9$).]

  8. Given:
     - Graph $G(V, E)$
     - Data about how outbreaks spread over $G$:
       - For each outbreak $i$ we know the time $T(u, i)$ when outbreak $i$ contaminates node $u$
     - Water distribution network (physical pipes and junctions) and a simulator of water consumption and flow (built by Mech. Eng. people)
     - We simulate the contamination spread for every possible location.
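
As a rough illustration of what "simulate the spread for every possible location" produces, the sketch below fills a table of contamination times for every (source, node) pair. It is not the hydraulic simulator mentioned on the slide: the toy pipe graph, the travel times, and the use of shortest-path arrival times are all assumptions made for the example.

```python
import heapq
import math

# Hypothetical directed pipe network: edges[u] = [(v, travel_time), ...].
edges = {
    "a": [("b", 2.0), ("c", 5.0)],
    "b": [("c", 1.0), ("d", 4.0)],
    "c": [("d", 1.0)],
    "d": [],
}

def arrival_times(source):
    """Earliest arrival time at every node for an outbreak starting at source
    (Dijkstra's algorithm; assumes non-negative travel times)."""
    times = {u: math.inf for u in edges}
    times[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if t > times[u]:
            continue  # stale queue entry, a shorter route was already found
        for v, w in edges[u]:
            if t + w < times[v]:
                times[v] = t + w
                heapq.heappush(heap, (t + w, v))
    return times

# One outbreak per possible source node: T[i][u] is the time at which
# outbreak i contaminates node u (math.inf if it never does).
T = {source: arrival_times(source) for source in edges}
```

Nodes that an outbreak never reaches keep an infinite contamination time, which is how "never detected" is represented in this sketch.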

  9. Given:
     - Graph $G(V, E)$
     - Data about how outbreaks spread over $G$:
       - For each outbreak $i$ we know the time $T(u, i)$ when outbreak $i$ contaminates node $u$
     - The network of news media, and traces of the information flow used to identify influence sets
     - Collect lots of articles and trace them to obtain data about information flow from a given news site.

  10. Given:
      - Graph $G(V, E)$
      - Data on how outbreaks spread over $G$:
        - For each outbreak $i$ we know the time $T(u, i)$ when outbreak $i$ contaminates node $u$
      - Goal: Select a subset of nodes $S$ that maximizes the expected reward:

        $$\max_{S \subseteq V} f(S) = \sum_i P(i)\, f_i(S) \quad \text{subject to: } \mathrm{cost}(S) < B$$

        - $P(i)$: probability of outbreak $i$ occurring
        - $f_i(S)$: reward for detecting outbreak $i$ using sensors $S$
        - Each term $P(i)\, f_i(S)$ is the expected reward for detecting outbreak $i$
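
To make the objective and the budget constraint concrete, here is a minimal brute-force sketch on a toy instance. The outbreaks, probabilities, costs, and the 0/1 reward are hypothetical; the strict inequality follows the constraint as written on the slide. Enumerating all subsets is exponential and only meant to show what is being optimized.

```python
from itertools import combinations

# Hypothetical toy data (not from the lecture).
reached = {"i1": {"a", "b"}, "i2": {"b", "c"}, "i3": {"d"}}  # nodes each outbreak contaminates
prob = {"i1": 0.5, "i2": 0.3, "i3": 0.2}                      # P(i)
cost = {"a": 1, "b": 2, "c": 1, "d": 1}                       # per-node sensor cost
BUDGET = 3                                                    # B

def f_i(S, i):
    """Assumed reward for outbreak i: 1 if some sensor in S gets contaminated, else 0."""
    return 1.0 if reached[i] & set(S) else 0.0

def f(S):
    """Expected reward: sum over outbreaks of P(i) * f_i(S)."""
    return sum(prob[i] * f_i(S, i) for i in prob)

def brute_force(nodes, budget):
    """Try every subset with cost(S) < budget and keep the best one."""
    best, best_val = (), 0.0
    ordered = sorted(nodes)
    for k in range(len(ordered) + 1):
        for S in combinations(ordered, k):
            if sum(cost[s] for s in S) < budget and f(S) > best_val:
                best, best_val = S, f(S)
    return best, best_val

best_set, best_value = brute_force(cost.keys(), BUDGET)
```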

  11. - Reward (one of the following three):
        - (1) Minimize time to detection
        - (2) Maximize number of detected propagations
        - (3) Minimize number of infected people
      - Cost (context dependent):
        - Reading big blogs is more time consuming
        - Placing a sensor in a remote location is expensive
      - [Figure: outbreak $i$ spreading over a numbered network, with reward $f(S)$. Monitoring the blue node saves more people than monitoring the green node.]

  12. Penalty $\pi_i(t)$ for detecting outbreak $i$ at time $t$:
      - (1) Time to detection (DT)
        - How long does it take to detect a contamination?
        - Penalty for detecting at time $t$: $\pi_i(t) = t$
      - (2) Detection likelihood (DL)
        - How many contaminations do we detect?
        - Penalty for detecting at time $t$: $\pi_i(t) = 0$, $\pi_i(\infty) = 1$
        - Note that this is a binary outcome: we either detect or not
      - (3) Population affected (PA)
        - How many people drank contaminated water?
        - Penalty for detecting at time $t$: $\pi_i(t) = \{\#\text{ of infected nodes in outbreak } i \text{ by time } t\}$
      - Observation: In all cases, detecting sooner does not hurt!
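
The three penalties can be written down directly. The sketch below uses a hypothetical single outbreak and two assumptions that the slide does not state: a finite horizon `T_MAX` caps the time-to-detection penalty so that "never detected" has a finite value, and "by time t" is read as "at or before t".

```python
import math

# Hypothetical toy outbreak: infection_time[u] is when this outbreak reaches node u.
infection_time = {"a": 1.0, "b": 2.0, "c": 5.0}
T_MAX = 10.0  # assumed finite horizon standing in for the penalty at t = infinity

def penalty_dt(t):
    """Time to detection: penalty equals the detection time, capped at T_MAX."""
    return min(t, T_MAX)

def penalty_dl(t):
    """Detection likelihood: 0 if detected at any finite time, 1 if never detected."""
    return 1.0 if math.isinf(t) else 0.0

def penalty_pa(t):
    """Population affected: number of nodes infected at or before time t."""
    return sum(1 for tu in infection_time.values() if tu <= t)
```

Each of these is non-decreasing in the detection time, which is the "detecting sooner does not hurt" observation.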

  13. We define $f_i(S)$ as penalty reduction: $f_i(S) = \pi_i(\infty) - \pi_i(T(S, i))$
      - Observation: Diminishing returns
      - [Figure: with placement $S = \{x_1, x_2\}$, adding a new sensor $x'$ helps a lot; with placement $S' = \{x_1, x_2, x_3, x_4\}$, adding $x'$ helps very little.]
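
A small numeric sketch of penalty reduction and diminishing returns, under stated assumptions: a set of sensors detects an outbreak at the earliest time any of its members is contaminated, the penalty is time to detection capped at a hypothetical horizon `T_MAX`, and the detection times are made up for the example.

```python
import math

# Hypothetical detection times for a single outbreak.
T = {"x1": 6.0, "x2": 4.0, "x3": 3.0, "x4": 2.5, "x_new": 2.0}
T_MAX = 10.0  # assumed horizon: penalty for never detecting

def detect_time(S):
    """Earliest detection time over the sensors in S (infinity for the empty set)."""
    return min((T[s] for s in S), default=math.inf)

def f_i(S):
    """Penalty reduction: penalty when never detecting minus penalty at detect_time(S)."""
    return T_MAX - min(detect_time(S), T_MAX)

small = {"x1", "x2"}
large = {"x1", "x2", "x3", "x4"}

gain_small = f_i(small | {"x_new"}) - f_i(small)  # 8.0 - 6.0
gain_large = f_i(large | {"x_new"}) - f_i(large)  # 8.0 - 7.5
```

The same new sensor is worth 2.0 when added to the small placement and only 0.5 when added to the larger one.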

  14. - Claim: For all $A \subseteq B \subseteq V$ and sensor $x \in V \setminus B$:

        $$f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B)$$

      - Proof that all our objectives are submodular:
        - Fix outbreak $i$
        - Show $f_i(A) = \pi_i(\infty) - \pi_i(T(A, i))$ is submodular
        - Consider $A \subseteq B \subseteq V$ and sensor $x \in V \setminus B$
        - When does sensor $x$ detect outbreak $i$?
        - We analyze 3 cases based on when $x$ detects outbreak $i$
        - (1) $T(B, i) \le T(A, i) < T(x, i)$: $x$ detects late, nobody benefits: $f_i(A \cup \{x\}) = f_i(A)$, also $f_i(B \cup \{x\}) = f_i(B)$, and so $f_i(A \cup \{x\}) - f_i(A) = 0 = f_i(B \cup \{x\}) - f_i(B)$

  15. Proof (contd.), remembering that $A \subseteq B$:
      - (2) $T(B, i) \le T(x, i) \le T(A, i)$: $x$ detects after $B$ but before $A$. That is, $x$ detects sooner than any node in $A$ but not before the earliest detection in $B$. So $x$ only helps improve the solution $A$ (but not $B$):

        $$f_i(A \cup \{x\}) - f_i(A) \ge 0 = f_i(B \cup \{x\}) - f_i(B)$$

      - (3) $T(x, i) < T(B, i) \le T(A, i)$: $x$ detects early:

        $$f_i(A \cup \{x\}) - f_i(A) = \left[\pi_i(\infty) - \pi_i(T(x, i))\right] - f_i(A) \ge \left[\pi_i(\infty) - \pi_i(T(x, i))\right] - f_i(B) = f_i(B \cup \{x\}) - f_i(B)$$

        - The inequality is due to the non-decreasingness of $f_i(\cdot)$, i.e., $f_i(A) \le f_i(B)$
      - So, $f_i(\cdot)$ is submodular!
      - So, $f(\cdot)$ is also submodular, since it is a non-negative weighted sum of the submodular functions $f_i$: $f(S) = \sum_i P(i)\, f_i(S)$
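
The diminishing-returns inequality can also be checked exhaustively on a tiny instance. This is a sanity check, not a proof: the detection times, probabilities, horizon `T_MAX`, and the capped time-to-detection reward are all hypothetical choices for the example.

```python
import math
from itertools import combinations

# Hypothetical data: T[i][u] is the time outbreak i reaches node u.
T = {
    "i1": {"a": 1.0, "b": 3.0, "c": math.inf, "d": 2.0},
    "i2": {"a": math.inf, "b": 1.0, "c": 2.0, "d": 5.0},
}
P = {"i1": 0.6, "i2": 0.4}   # outbreak probabilities
T_MAX = 10.0                  # assumed horizon for "never detected"
V = ["a", "b", "c", "d"]

def f(S):
    """Expected penalty reduction under the capped time-to-detection penalty."""
    total = 0.0
    for i in P:
        t = min((T[i][s] for s in S), default=math.inf)
        total += P[i] * (T_MAX - min(t, T_MAX))
    return total

def subsets(items):
    """Yield every subset of items as a set."""
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield set(combo)

def is_submodular(tol=1e-9):
    """Check gain(A, x) >= gain(B, x) for all A subset of B and x outside B."""
    for B in subsets(V):
        for A in subsets(sorted(B)):
            for x in set(V) - B:
                if f(A | {x}) - f(A) < f(B | {x}) - f(B) - tol:
                    return False
    return True
```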

  16. - What do we know about optimizing submodular functions?
        - Hill-climbing (i.e., greedy) is near optimal: it achieves at least $(1 - \frac{1}{e}) \cdot \mathrm{OPT}$
      - But:
        - (1) This only works for the unit-cost case! (each sensor costs the same)
          - For us each sensor $s$ has cost $c(s)$
        - (2) The hill-climbing algorithm is slow
          - At each iteration we need to re-evaluate the marginal gains of all nodes
          - Runtime $O(|V| \cdot K)$ for placing $K$ sensors
      - [Figure: hill-climbing reward for candidates a to e; add the sensor with the highest marginal gain.]
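
A minimal sketch of unit-cost hill-climbing, with a counter that shows where the $|V| \cdot K$ marginal-gain evaluations come from. The coverage-style objective and the node names are hypothetical; any monotone submodular set function could be passed in instead.

```python
def greedy(V, f, k):
    """Pick k elements, each time adding the one with the largest marginal gain.
    Returns the chosen set and the number of marginal-gain evaluations."""
    S, evals = set(), 0
    for _ in range(k):
        base = f(S)
        best, best_gain = None, float("-inf")
        for u in sorted(V - S):          # re-evaluate every remaining node
            evals += 1
            gain = f(S | {u}) - base
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:                 # nothing left to add
            break
        S.add(best)
    return S, evals

# Hypothetical coverage instance: each node "detects" a set of outbreaks.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1}}

def coverage(S):
    """Number of distinct outbreaks detected by the sensors in S."""
    return len(set().union(*(covers[u] for u in S)))

chosen, n_evals = greedy(set(covers), coverage, k=2)
```

With 4 candidate nodes and 2 picks, the loop performs 4 + 3 = 7 evaluations, which is within the $|V| \cdot K = 8$ bound.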

  17. - Consider the following algorithm to solve the outbreak detection problem: hill-climbing that ignores cost
        - Ignore sensor cost $c(s)$
        - Repeatedly select the sensor with the highest marginal gain
        - Do this until the budget is exhausted
      - Q: How well does this work?
      - A: It can fail arbitrarily badly!
        - There exists a problem setting where the hill-climbing solution is arbitrarily far from OPT
        - Next we come up with an example

  18. - Bad example when we ignore cost:
        - $n$ sensors, budget $B$
        - $s_1$: reward $r$, cost $B$
        - $s_2 \dots s_n$: reward $r - \varepsilon$, cost $1$
        - Hill-climbing always prefers the more expensive sensor $s_1$ with reward $r$ (and exhausts the budget). It never selects the cheaper sensors with reward $r - \varepsilon$. So for variable cost it can fail arbitrarily badly!
      - Idea: What if we optimize the benefit-cost ratio? Greedily pick the sensor $s_i$ that maximizes the benefit-to-cost ratio:

        $$s_i = \arg\max_{s \in V \setminus A_{i-1}} \frac{f(A_{i-1} \cup \{s\}) - f(A_{i-1})}{c(s)}$$
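
The bad example can be run end to end. The sketch below compares the cost-ignoring rule with the benefit-cost rule on that one instance, under assumptions not stated on the slide: rewards are additive across sensors, the budget is inclusive (total cost may equal $B$), only sensors that still fit in the remaining budget are considered, and the concrete numbers are made up. It says nothing about how the ratio rule behaves on other instances.

```python
def budgeted_greedy(reward, cost, budget, use_ratio):
    """Greedy selection under a budget, assuming additive rewards.
    use_ratio=False ranks by marginal gain only (ignores cost);
    use_ratio=True ranks by marginal gain divided by cost."""
    chosen, spent = [], 0.0
    while True:
        affordable = [s for s in sorted(reward)
                      if s not in chosen and spent + cost[s] <= budget]
        if not affordable:
            return chosen, sum(reward[s] for s in chosen)
        if use_ratio:
            best = max(affordable, key=lambda s: reward[s] / cost[s])
        else:
            best = max(affordable, key=lambda s: reward[s])
        chosen.append(best)
        spent += cost[best]

# Hypothetical numbers for the bad example: n = 6 sensors, budget B = 5.
n, B, r, eps = 6, 5.0, 1.0, 0.01
reward = {"s1": r, **{f"s{j}": r - eps for j in range(2, n + 1)}}
cost = {"s1": B, **{f"s{j}": 1.0 for j in range(2, n + 1)}}

ignore_chosen, ignore_total = budgeted_greedy(reward, cost, B, use_ratio=False)
ratio_chosen, ratio_total = budgeted_greedy(reward, cost, B, use_ratio=True)
```

Ignoring cost spends the whole budget on `s1` for a total reward of `r`, while the ratio rule buys the five cheap sensors for a total of `5 * (r - eps)`. Growing `B` and `n` together widens the gap without limit.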
