CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu
(1) New problem: Outbreak detection
(2) Develop an approximation algorithm: it is a submodular optimization problem!
(3) Speed up greedy hill-climbing. Valid for optimizing general submodular functions (i.e., also works for influence maximization)
(4) Prove a new “data dependent” bound on the solution quality. Valid for optimizing general submodular functions (i.e., also works for influence maximization)
10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
[Leskovec et al., KDD ’07]
Given a real city water distribution network and data on how contaminants spread in the network, detect the contaminant as quickly as possible.
Problem posed by the US Environmental Protection Agency.
[Figure: water distribution network with sensor locations marked S.]
[Leskovec et al., KDD ’07]
[Figure: blogs and their posts, linked by time-ordered hyperlinks that form an information cascade.]
Which blogs should one read to detect cascades as effectively as possible?
[Leskovec et al., KDD ’07]
Want to read things before others do.
Detect blue and yellow soon but miss red.
Detect all stories, but late.
Both of these are instances of the same underlying problem!
Given a dynamic process spreading over a network, we want to select a set of nodes to detect the process effectively.
Many other applications: epidemics, influence propagation, network security.
Utility of placing sensors depends on water flow dynamics, demands of households, …
For each subset $S \subseteq V$ compute utility $f(S)$, where $V$ is the set of all network junctions.
A sensor reduces impact through early detection!
[Figure: a contamination spreading through the network with sensors $s_1, \dots, s_4$; depending on where it starts, an outbreak has high, medium, or low impact. One placement has high sensing quality, $f(S) = 0.9$; another has low sensing quality, $f(S) = 0.01$.]
[Leskovec et al., KDD ’07]
Given:
Graph $G(V, E)$
Data on how outbreaks spread over $G$: for each outbreak $i$ we know the time $T(i, u)$ when outbreak $i$ contaminates node $u$
Water distribution network (physical pipes and junctions)
Simulator of water consumption and flow (built by Mech. Eng. people)
We simulate the contamination spread for every possible location.
[Leskovec et al., KDD ’07]
Given:
Graph $G(V, E)$
Data on how outbreaks spread over $G$: for each outbreak $i$ we know the time $T(i, u)$ when outbreak $i$ contaminates node $u$
The network of the blogosphere
Traces of the information flow: collect lots of blog posts and trace hyperlinks to obtain data about information flow from a given blog.
[Figure: blogs a, b, c and the cascade traces through them.]
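One way to picture the outbreak data for the blog setting is as a table of first-appearance times built from trace records. The sketch below is only an illustration: the record format `(cascade, blog, time)`, the sample values, and the helper names `build_T` and `lookup` are assumptions, not part of the original dataset or paper.

```python
import math
from collections import defaultdict

# Hypothetical trace records: (cascade id, blog, time the blog posted in that cascade).
# All values are made up for illustration.
records = [
    ("i1", "a", 0),
    ("i1", "b", 3),
    ("i1", "a", 5),   # a later post by the same blog in the same cascade
    ("i2", "c", 1),
    ("i2", "b", 2),
]


def build_T(records):
    """Return T with T[i][u] = earliest time blog u appears in cascade i."""
    T = defaultdict(dict)
    for cascade, blog, t in records:
        T[cascade][blog] = min(t, T[cascade].get(blog, math.inf))
    return T


def lookup(T, i, u):
    """T(i, u); infinity if cascade i never reaches blog u."""
    return T.get(i, {}).get(u, math.inf)


T = build_T(records)
print(lookup(T, "i1", "a"), lookup(T, "i1", "c"))
```

Using infinity for "never reached" keeps later minimum computations over sensor sets simple.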
[Leskovec et al., KDD ’07]
Given:
Graph $G(V, E)$
Data on how outbreaks spread over $G$: for each outbreak $i$ we know the time $T(i, u)$ when outbreak $i$ contaminates node $u$
Goal: Select a subset of nodes $S$ that maximizes the expected reward:
$\max_{S \subseteq V} f(S) = \sum_i P(i) \, f_i(S)$ subject to $\mathrm{cost}(S) \le B$
Here $P(i)$ is the probability of outbreak $i$ and $f_i(S)$ is the reward for detecting outbreak $i$ with sensors at $S$.
$f_i(S)$ is the penalty reduction: $f_i(S) = \pi_i(\emptyset) - \pi_i(S)$, where $\pi_i(S)$ is the penalty incurred on outbreak $i$ with sensors placed at $S$ (with no sensors, the outbreak goes undetected).
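The objective can be made concrete with a small sketch. It assumes the time-to-detection penalty $\min\{t, T_{max}\}$ and uses made-up detection times, probabilities, and a horizon `T_MAX`; none of these numbers come from the original data.

```python
import math

# Hypothetical toy data: T[i][u] is the time outbreak i reaches node u
# (math.inf if it never does). All numbers are assumptions for illustration.
T = {
    "i1": {"a": 1.0, "b": 4.0, "c": math.inf},
    "i2": {"a": math.inf, "b": 2.0, "c": 3.0},
}
P = {"i1": 0.5, "i2": 0.5}   # assumed outbreak probabilities P(i)
T_MAX = 10.0                 # assumed time horizon


def detection_time(i, S):
    """Earliest time any sensor in S sees outbreak i (inf if none does)."""
    return min((T[i][u] for u in S), default=math.inf)


def penalty(t):
    """Time-to-detection penalty, capped at T_MAX."""
    return min(t, T_MAX)


def f_i(i, S):
    """Penalty reduction for outbreak i: penalty with no sensors minus penalty with S."""
    return penalty(math.inf) - penalty(detection_time(i, S))


def f(S):
    """Expected penalty reduction over all outbreaks."""
    return sum(P[i] * f_i(i, S) for i in T)


print(f(set()), f({"a"}), f({"b"}), f({"a", "b"}))
```

In this toy instance node b alone is worth more than node a alone, because b sees both outbreaks while a sees only one.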
Reward:
(1) Minimize time to detection
(2) Maximize number of detected propagations
(3) Minimize number of infected people
Cost (node/location dependent):
Reading big blogs is more time consuming
Placing a sensor in a remote location is expensive
[Figure: outbreak $i$ and reward $f(S)$; monitoring the blue node saves more people than monitoring the green node.]
Objective functions:
1) Time to detection (DT)
How long does it take to detect a contamination?
Penalty: $\pi_i(t) = \min\{t, T_{max}\}$
2) Detection likelihood (DL)
How many contaminations do we detect?
We incur a penalty if we don’t detect: $\pi_i(t) = 0$, $\pi_i(\infty) = 1$
3) Population affected (PA)
How many people drank contaminated water?
Penalty: $\pi_i(t) = \{\text{\# of blogs in cascade } i \text{ at time } t\}$
Observation: In all cases detecting sooner does not hurt!
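The three penalties can be written as functions of the detection time $t$. The sketch below is illustrative only: the horizon `T_MAX`, the cascade's infection times, and the function names are assumptions. It also checks numerically that each penalty is non-decreasing in $t$, which is one way to read the observation that detecting sooner does not hurt.

```python
import math

T_MAX = 10.0  # assumed time horizon

# Hypothetical cascade: times at which successive nodes become infected.
infection_times = [0.0, 1.0, 1.5, 4.0, 7.0]


def penalty_dt(t):
    """Time to detection: the detection time, capped at T_MAX."""
    return min(t, T_MAX)


def penalty_dl(t):
    """Detection likelihood: 0 if detected at any finite time, 1 if never detected."""
    return 1.0 if math.isinf(t) else 0.0


def penalty_pa(t):
    """Population affected: number of nodes infected by time t."""
    return sum(1 for x in infection_times if x <= t)


def is_non_decreasing(penalty, times):
    """Check that the penalty never drops as the detection time grows."""
    values = [penalty(t) for t in sorted(times)]
    return all(x <= y for x, y in zip(values, values[1:]))


grid = [0.0, 0.5, 1.0, 2.0, 5.0, 12.0, math.inf]
for p in (penalty_dt, penalty_dl, penalty_pa):
    print(p.__name__, is_non_decreasing(p, grid))
```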
[Leskovec et al., KDD ’07]
Observation: Diminishing returns
[Figure: a new sensor $s'$ is added to two different placements.]
Placement $S = \{s_1, s_2\}$: adding $s'$ helps a lot.
Placement $S' = \{s_1, s_2, s_3, s_4\}$: adding $s'$ helps very little.
Claim: For all $A \subseteq B \subseteq V$ and sensors $s \in V \setminus B$:
$f(A \cup \{s\}) - f(A) \ge f(B \cup \{s\}) - f(B)$
Proof: Fix cascade $i$.
Show that $f_i(A) = \pi_i(\infty) - \pi_i(T(i, A))$ is submodular, where $T(i, A) = \min_{u \in A} T(i, u)$ is the time at which placement $A$ detects cascade $i$. Note that $A \subseteq B$ implies $T(i, B) \le T(i, A)$.
Consider $A \subseteq B \subseteq V$ and sensor $s \in V \setminus B$.
When does node $s$ detect cascade $i$? 3 cases:
(1) $T(i, s) \ge T(i, A)$: then $f_i(A \cup \{s\}) = f_i(A)$ and $f_i(B \cup \{s\}) = f_i(B)$, and so
$f_i(A \cup \{s\}) - f_i(A) = 0 = f_i(B \cup \{s\}) - f_i(B)$
Since $s$ detects too late, nobody benefits.
Proof (contd.): 3 cases:
(2) $T(i, B) \le T(i, s) < T(i, A)$: then
$f_i(A \cup \{s\}) - f_i(A) \ge 0 = f_i(B \cup \{s\}) - f_i(B)$
$s$ detects sooner than any node in $A$ but no sooner than the earliest node in $B$. So $s$ only helps improve the solution $A$.
(3) $T(i, s) < T(i, B)$: then
$f_i(A \cup \{s\}) - f_i(A) = [\pi_i(\infty) - \pi_i(T(i, s))] - f_i(A) \ge [\pi_i(\infty) - \pi_i(T(i, s))] - f_i(B) = f_i(B \cup \{s\}) - f_i(B)$
The inequality is due to $f_i(\cdot)$ being non-decreasing, i.e., $f_i(A) \le f_i(B)$.
So, $f_i(\cdot)$ is submodular!
So, $f(\cdot)$ is also submodular, since a non-negative weighted sum of submodular functions is submodular:
$f(S) = \sum_i P(i) \, f_i(S)$
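The claim can also be checked by brute force on a small instance. The sketch below is a sanity check rather than a proof: the detection times, probabilities, and horizon are made up, and `is_submodular` is a hypothetical helper that enumerates every $A \subseteq B \subseteq V$ and $s \in V \setminus B$.

```python
import math
from itertools import combinations

T_MAX = 10.0  # assumed time horizon
# Hypothetical detection times T[i][u] and outbreak probabilities P[i].
T = {
    "i1": {"a": 1.0, "b": 4.0, "c": math.inf, "d": 6.0},
    "i2": {"a": math.inf, "b": 2.0, "c": 3.0, "d": 2.5},
    "i3": {"a": 5.0, "b": math.inf, "c": 0.5, "d": math.inf},
}
P = {"i1": 0.5, "i2": 0.3, "i3": 0.2}
V = ["a", "b", "c", "d"]


def f(S):
    """Expected penalty reduction under the time-to-detection penalty."""
    total = 0.0
    for i, times in T.items():
        t = min((times[u] for u in S), default=math.inf)
        total += P[i] * (T_MAX - min(t, T_MAX))
    return total


def subsets(items):
    """All subsets of items, as frozensets."""
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def is_submodular(f, V, tol=1e-12):
    """Check f(A+s) - f(A) >= f(B+s) - f(B) for all A <= B <= V and s not in B."""
    for B in subsets(V):
        for A in subsets(sorted(B)):
            for s in set(V) - B:
                gain_A = f(A | {s}) - f(A)
                gain_B = f(B | {s}) - f(B)
                if gain_A < gain_B - tol:
                    return False
    return True


print(is_submodular(f, V))                      # penalty reduction
print(is_submodular(lambda S: len(S) ** 2, V))  # a function with increasing returns
```

The second function, $|S|^2$, has marginal gains that grow with the set size, so it serves as a counterexample that the check should reject.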
What do we know about optimizing submodular functions?
Hill-climbing (i.e., greedy) is near optimal: it achieves at least $(1 - 1/e) \cdot \mathrm{OPT}$.
[Figure: hill-climbing over candidate sensors a, b, c, d, e and their rewards; at each step, add the sensor with the highest marginal gain.]
But:
(1) This only works for the unit cost case (each sensor costs the same)! For us, each sensor $s$ has cost $c(s)$.
(2) The hill-climbing algorithm is slow: at each iteration we need to re-evaluate the marginal gains of all nodes. Runtime is $O(|V| \cdot K)$ marginal-gain evaluations for placing $K$ sensors.
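A minimal sketch of unit-cost hill-climbing makes the evaluation count visible. It assumes a toy detection-count objective (each sensor detects a made-up set of outbreaks, and $f$ counts the outbreaks detected by at least one chosen sensor); the data and the `greedy` helper are illustrative assumptions.

```python
# Hypothetical data: which outbreaks each candidate sensor would detect.
detects = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 7},
}


def f(S):
    """Number of outbreaks detected by at least one sensor in S."""
    return len(set().union(*(detects[s] for s in S)))


def greedy(f, V, K):
    """Pick K sensors, each time adding the one with the highest marginal gain."""
    A = []
    evaluations = 0
    for _ in range(K):
        base = f(A)
        best, best_gain = None, float("-inf")
        for s in V:
            if s in A:
                continue
            gain = f(A + [s]) - base   # marginal gain of adding s
            evaluations += 1
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:               # no candidates left
            break
        A.append(best)
    return A, evaluations


placement, evaluations = greedy(f, list(detects), K=2)
print(placement, f(placement), evaluations)
```

Every round re-scores all remaining candidates, so the number of marginal-gain evaluations is bounded by $|V| \cdot K$ (here 4 + 3 = 7, at most 8).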
[Leskovec et al., KDD ’07]
Consider: Hill-climbing that ignores cost
Ignore sensor cost
Repeatedly select the sensor with the highest marginal gain
Do this until the budget is exhausted
How well does this work?
[Leskovec et al., KDD ’07]
Bad example: $n$ sensors, budget $B$
$s_1$: reward $r$, cost $B$
$s_2 \dots s_n$: reward $r - \epsilon$, cost 1
Hill-climbing always prefers the more expensive sensor $s_1$ with reward $r$ (and exhausts the budget)
It never selects the cheaper sensors with reward $r - \epsilon$
→ For variable cost it can fail arbitrarily badly!
Idea: What if we optimize the benefit-cost ratio?
Greedily pick the sensor $s_i$ that maximizes the benefit to cost ratio:
$s_i = \arg\max_{s \in V} \frac{f(A_{i-1} \cup \{s\}) - f(A_{i-1})}{c(s)}$
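The bad example can be replayed in a few lines. This sketch assumes rewards simply add up (a modular, hence submodular, objective), which the slide does not state explicitly, and picks concrete values $n = 6$, $B = 5$, $r = 10$, $\epsilon = 1$ for illustration. The `budgeted_greedy` helper is hypothetical and only considers sensors that still fit in the budget.

```python
# Assumed instance of the bad example: n = 6 sensors, budget B = 5, r = 10, eps = 1.
BUDGET = 5
reward = {"s1": 10, "s2": 9, "s3": 9, "s4": 9, "s5": 9, "s6": 9}
cost = {"s1": 5, "s2": 1, "s3": 1, "s4": 1, "s5": 1, "s6": 1}


def f(S):
    """Assumed additive reward of a placement."""
    return sum(reward[s] for s in S)


def budgeted_greedy(f, cost, V, budget, use_ratio):
    """Greedy selection under a budget, scoring by gain or by gain / cost."""
    chosen, spent = [], 0
    while True:
        current = f(chosen)
        best, best_score = None, 0.0
        for s in V:
            if s in chosen or spent + cost[s] > budget:
                continue               # already picked or no longer affordable
            gain = f(chosen + [s]) - current
            score = gain / cost[s] if use_ratio else gain
            if score > best_score:
                best, best_score = s, score
        if best is None:               # nothing affordable improves the reward
            return chosen
        chosen.append(best)
        spent += cost[best]


V = list(reward)
ignore_cost = budgeted_greedy(f, cost, V, BUDGET, use_ratio=False)
ratio = budgeted_greedy(f, cost, V, BUDGET, use_ratio=True)
print(ignore_cost, f(ignore_cost))
print(ratio, f(ratio))
```

On this instance the cost-ignoring variant spends the whole budget on `s1` for reward $r$, while the ratio variant buys five cheap sensors for reward $B(r - \epsilon)$. The gap grows with $B$; the sketch makes no claim about how the ratio rule behaves on other instances.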