Tutorial Overview
Examples and properties of submodular functions
  Many problems submodular (mutual information, influence, …)
  SFs closed under positive linear combinations; not under min, max
Submodularity and convexity
Minimizing submodular functions
Maximizing submodular functions
Extensions and research directions
Submodularity and Convexity
Submodularity and convexity
For V = {1,…,n} and A ⊆ V, let w_A = (w_1^A,…,w_n^A) with w_i^A = 1 if i ∈ A, 0 otherwise
Key result [Lovasz '83]: Every submodular function F induces a function g on R^n_+ such that
  F(A) = g(w_A) for all A ⊆ V
  g(w) is convex
  min_A F(A) = min_w g(w) s.t. w ∈ [0,1]^n
Let's see how one can define g(w)
The submodular polyhedron P_F
P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}, where x(A) = ∑_{i ∈ A} x_i
Example: V = {a,b}
  A      F(A)
  ∅       0
  {a}    -1
  {b}     2
  {a,b}   0
Constraints: x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b})
[Figure: P_F drawn in the (x({a}), x({b})) plane]
Lovasz extension
Claim: g(w) = max_{x ∈ P_F} w^T x
P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}
x_w = argmax_{x ∈ P_F} w^T x, so g(w) = w^T x_w
[Figure: the weight vector w and its maximizer x_w on the polyhedron P_F]
Evaluating g(w) requires solving a linear program with exponentially many constraints
Evaluating the Lovasz extension
g(w) = max_{x ∈ P_F} w^T x, where P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}
Theorem [Edmonds '71, Lovasz '83]: For any given w, can get an optimal solution x_w to the LP using the following greedy algorithm:
  1. Order V = {e_1,…,e_n} so that w(e_1) ≥ … ≥ w(e_n)
  2. Let x_w(e_i) = F({e_1,…,e_i}) – F({e_1,…,e_{i-1}})
Then w^T x_w = g(w) = max_{x ∈ P_F} w^T x
Sanity check: If w = w_A and A = {e_1,…,e_k}, then w_A^T x_w = ∑_{i=1}^k [F({e_1,…,e_i}) – F({e_1,…,e_{i-1}})] = F(A)
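As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this greedy rule; F is assumed to be an oracle taking a Python set, and the toy instance is the two-element example used on the next slide.

```python
def lovasz_extension(F, V, w):
    """Evaluate g(w) = max {w^T x : x in P_F} via Edmonds' greedy algorithm."""
    order = sorted(V, key=lambda e: w[e], reverse=True)   # w(e_1) >= ... >= w(e_n)
    x, prefix, prev = {}, set(), F(set())
    for e in order:
        prefix.add(e)
        cur = F(prefix)
        x[e] = cur - prev        # x_w(e_i) = F({e_1..e_i}) - F({e_1..e_{i-1}})
        prev = cur
    return sum(w[e] * x[e] for e in V), x

# Example: V = {a,b}, F(∅)=0, F({a})=-1, F({b})=2, F({a,b})=0
F = lambda A: {frozenset(): 0, frozenset('a'): -1,
               frozenset('b'): 2, frozenset('ab'): 0}[frozenset(A)]
print(lovasz_extension(F, ['a', 'b'], {'a': 0, 'b': 1}))  # (2, {'b': 2, 'a': -2})
```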
Example: Lovasz extension
g(w) = max {w^T x : x ∈ P_F}
  A      F(A)
  ∅       0
  {a}    -1
  {b}     2
  {a,b}   0
Want g(w) for w = [0,1]
Greedy ordering: e_1 = b, e_2 = a, since w(e_1) = 1 > w(e_2) = 0
  x_w(e_1) = F({b}) – F(∅) = 2
  x_w(e_2) = F({b,a}) – F({b}) = -2
  so x_w = [-2,2]
g([0,1]) = [0,1]^T [-2,2] = 2 = F({b})
g([1,1]) = [1,1]^T [-1,1] = 0 = F({a,b})
[Figure: vertices [-2,2] and [-1,1] of P_F, with the sets {}, {a}, {b}, {a,b} labeling the corresponding corners of [0,1]^2]
Why is this useful?
Theorem [Lovasz '83]: g(w) attains its minimum in [0,1]^n at a corner!
If we can minimize g on [0,1]^n, we can minimize F… (at corners, g and F take the same values)
F(A) submodular (and efficient to evaluate) ⇒ g(w) convex (and efficient to evaluate)
Does the converse also hold? No: consider g(w_1,w_2,w_3) = max(w_1, w_2 + w_3) with V = {a,b,c}
  g is convex, but the induced F is not submodular: F({a,b}) – F({a}) = 0 < F({a,b,c}) – F({a,c}) = 1
Tutorial Overview
Examples and properties of submodular functions
  Many problems submodular (mutual information, influence, …)
  SFs closed under positive linear combinations; not under min, max
Submodularity and convexity
  Every SF induces a convex function with SAME minimum
  Special properties: Greedy solves LP over exponential polytope
Minimizing submodular functions
Maximizing submodular functions
Extensions and research directions
Minimization of submodular functions
Overview minimization
Minimizing general submodular functions
Minimizing symmetric submodular functions
Applications to Machine Learning
Minimizing a submodular function
Want to solve A* = argmin_A F(A)
Need to solve min_w max_x w^T x s.t. w ∈ [0,1]^n, x ∈ P_F   (recall g(w) = max_x w^T x)
Equivalently: min_{c,w} c s.t. c ≥ w^T x for all x ∈ P_F, w ∈ [0,1]^n
This is an LP with infinitely many constraints!
Ellipsoid algorithm [Grötschel, Lovasz, Schrijver '81]
min_{c,w} c s.t. c ≥ w^T x for all x ∈ P_F, w ∈ [0,1]^n
[Figure: feasible region and optimality direction for the ellipsoid method]
Separation oracle: find the most violated constraint: max_x w^T x – c s.t. x ∈ P_F
Can solve separation using the greedy algorithm!!
⇒ Ellipsoid algorithm minimizes SFs in poly-time!
Minimizing submodular functions
Ellipsoid algorithm not very practical
Want a combinatorial algorithm for minimization!
Theorem [Iwata (2001)]: There is a fully combinatorial, strongly polynomial algorithm for minimizing SFs that runs in time O(n^8 log^2 n)
Polynomial-time = Practical ???
A more practical alternative? [Fujishige '91, Fujishige et al '06]
Base polytope: B_F = P_F ∩ {x : x(V) = F(V)}, i.e., in the running example add the constraint x({a,b}) = F({a,b})
Example: F(∅)=0, F({a})=-1, F({b})=2, F({a,b})=0
[Figure: B_F is the face of P_F passing through [-1,1]]
Minimum norm algorithm:
  1. Find x* = argmin ||x||_2 s.t. x ∈ B_F        (here: x* = [-1,1])
  2. Return A* = {i : x*(i) < 0}                  (here: A* = {a})
Theorem [Fujishige '91]: A* is an optimal solution!
Note: Can solve step 1 using Wolfe's algorithm
Runtime finite but unknown!!
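The slides use Wolfe's algorithm for step 1; as a rough, hedged sketch, one can instead approximate the minimum-norm point with plain Frank-Wolfe, since all it needs is the greedy linear oracle over B_F (this substitution and the fixed iteration count are my assumptions, not the method from [Fujishige et al '06]).

```python
import numpy as np

def greedy_vertex(F, V, w):
    """Maximize w^T x over the base polytope B_F (Edmonds' greedy algorithm)."""
    order = sorted(range(len(V)), key=lambda i: w[i], reverse=True)
    x, prefix, prev = np.zeros(len(V)), set(), F(set())
    for i in order:
        prefix.add(V[i])
        cur = F(prefix)
        x[i] = cur - prev
        prev = cur
    return x

def min_norm_minimizer(F, V, iters=500):
    x = greedy_vertex(F, V, np.zeros(len(V)))        # start at some vertex of B_F
    for t in range(1, iters + 1):
        s = greedy_vertex(F, V, -x)                  # linear oracle: argmin_{s in B_F} <x, s>
        x = x + (2.0 / (t + 2)) * (s - x)            # standard Frank-Wolfe step size
    return {V[i] for i in range(len(V)) if x[i] < 0}, x

# Toy example from the slide: expect A* = {'a'} with x* ≈ [-1, 1]
F = lambda A: {frozenset(): 0, frozenset('a'): -1,
               frozenset('b'): 2, frozenset('ab'): 0}[frozenset(A)]
print(min_norm_minimizer(F, ['a', 'b']))
```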
Empirical comparison [Fujishige et al '06]
[Figure: running time (seconds, log-scale) vs. problem size (log-scale) on cut functions from the DIMACS Challenge, comparing the minimum norm algorithm against other SF minimization algorithms; lower is better]
Minimum norm algorithm orders of magnitude faster!
Our implementation can solve n = 10k in < 6 minutes!
Checking optimality (duality)
Base polytope: B_F = P_F ∩ {x : x(V) = F(V)}
Theorem [Edmonds '70]: min_A F(A) = max_x {x^-(V) : x ∈ B_F}, where x^-(s) = min {x(s), 0}
Example: F(∅)=0, F({a})=-1, F({b})=2, F({a,b})=0
  A = {a}, F(A) = -1
  w = [1,0], x_w = [-1,1], x_w^- = [-1,0]
  x_w^-(V) = -1 ⇒ A optimal!
Testing how close A' is to min_A F(A):
  1. Run the greedy algorithm for w = w_{A'} to get x_w
  2. F(A') ≥ min_A F(A) ≥ x_w^-(V)
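A small sketch of this certificate check (assuming, as before, that F is a set-function oracle; the helper name optimality_gap is made up for illustration):

```python
def optimality_gap(F, V, A_prime):
    """Return F(A') minus the duality lower bound x_w^-(V); 0 certifies optimality."""
    w = {e: 1.0 if e in A_prime else 0.0 for e in V}
    order = sorted(V, key=lambda e: w[e], reverse=True)
    x, prefix, prev = {}, set(), F(set())
    for e in order:                                  # greedy: x_w is a vertex of B_F
        prefix.add(e)
        x[e] = F(prefix) - prev
        prev = F(prefix)
    lower_bound = sum(min(x[e], 0.0) for e in V)     # x_w^-(V) <= min_A F(A)
    return F(set(A_prime)) - lower_bound

F = lambda A: {frozenset(): 0, frozenset('a'): -1,
               frozenset('b'): 2, frozenset('ab'): 0}[frozenset(A)]
print(optimality_gap(F, ['a', 'b'], {'a'}))          # 0.0 -> {'a'} is optimal
```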
Overview minimization
Minimizing general submodular functions
  Can minimize in poly-time using the ellipsoid method
  Combinatorial, strongly polynomial algorithm: O(n^8)
  Practical alternative: minimum norm algorithm?
Minimizing symmetric submodular functions
Applications to Machine Learning
What if we have special structure?
Worst-case complexity of best known algorithm: O(n^8 log^2 n)
Can we do better for special cases?
Example (again): Given RVs X_1,…,X_n, F(A) = I(X_A; X_{V\A}) = I(X_{V\A}; X_A) = F(V\A)
Functions F with F(A) = F(V\A) for all A are symmetric
Another example: Cut functions
V = {a,b,c,d,e,f,g,h}, with edge weights w_{s,t}
[Figure: weighted graph on these eight nodes]
F(A) = ∑ {w_{s,t} : s ∈ A, t ∈ V \ A}
Example: F({a}) = 6; F({c,d}) = 10; F({a,b,c,d}) = 2
Cut function is symmetric and submodular!
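For concreteness, a minimal sketch of a cut-function oracle; the edge weights below are made-up placeholders rather than the ones in the figure.

```python
def cut_value(edges, A):
    """edges: dict mapping (u, v) to weight; A: set of nodes. Sum of crossing edges."""
    return sum(w for (u, v), w in edges.items() if (u in A) != (v in A))

# hypothetical weights for illustration only
edges = {('a', 'b'): 2, ('b', 'c'): 3, ('c', 'd'): 2, ('a', 'e'): 4,
         ('e', 'f'): 1, ('f', 'g'): 5, ('g', 'h'): 2, ('d', 'h'): 3}
print(cut_value(edges, {'a'}))                 # weight of edges leaving {a}
print(cut_value(edges, {'a', 'b', 'c', 'd'}))  # weight of the cut around {a,b,c,d}
```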
Minimizing symmetric functions
For symmetric F, submodularity implies, for any A,
  2 F(A) = F(A) + F(V \ A) ≥ F(A ∩ (V \ A)) + F(A ∪ (V \ A)) = F(∅) + F(V) = 2 F(∅) = 0
Hence, any symmetric SF attains its minimum at ∅
In practice, want a nontrivial partition of V into A and V \ A, i.e., require that A is neither ∅ nor V
Want A* = argmin F(A) s.t. 0 < |A| < n
There is an efficient algorithm for doing that! ☺
Queyranne's algorithm (overview) [Queyranne '98]
Theorem: There is a fully combinatorial, strongly polynomial algorithm for solving
  A* = argmin_A F(A) s.t. 0 < |A| < n
for symmetric submodular functions
Runs in time O(n^3) [instead of O(n^8)…]
Note: also works for "posimodular" functions:
  F posimodular ⟺ for all A,B ⊆ V: F(A) + F(B) ≥ F(A \ B) + F(B \ A)
Gomory-Hu trees
[Figure: a weighted graph G and a corresponding Gomory-Hu tree T]
A tree T is called a Gomory-Hu (GH) tree for SF F if for any s, t ∈ V it holds that
  min {F(A) : s ∈ A and t ∉ A} = min {w_{i,j} : (i,j) is an edge on the s-t path in T}
"min s-t-cut in T = min s-t-cut in G"
Theorem [Queyranne '93]: GH-trees exist for any symmetric SF F!
But expensive to find one in general!
Pendent pairs
For a function F on V and s, t ∈ V: (s,t) is a pendent pair if {s} ∈ argmin_A F(A) s.t. s ∈ A, t ∉ A
Pendent pairs always exist:
  Take any leaf s of a Gomory-Hu tree T and its neighbor t; then (s,t) is pendent!
  E.g., (a,c), (b,c), (f,e), …
Theorem [Queyranne '95]: Can find pendent pairs in O(n^2) (without needing a GH-tree!)
Why are pendent pairs useful?
Key idea: Let (s,t) be pendent and A* = argmin F(A). Then EITHER
  s and t are separated by A*, e.g., s ∈ A*, t ∉ A*. But then A* = {s}!!
OR
  s and t are not separated by A*
  Then we can merge s and t…
Merging
Suppose F is a symmetric SF on V, and we want to merge the pendent pair (s,t)
Key idea: "If we pick s, we get t for free"
  V' = V \ {t}
  F'(A) = F(A ∪ {t}) if s ∈ A, and F'(A) = F(A) if s ∉ A
Lemma: F' is still symmetric and submodular!
[Figure: the graph before and after merging s and t into a single node]
Queyranne's algorithm
Input: symmetric SF F on V, |V| = n
Output: A* = argmin F(A) s.t. 0 < |A| < n
Initialize F' ← F and V' ← V
For i = 1:n-1
  (s,t) ← pendentPair(F',V')
  A_i = {s}
  (F',V') ← merge(F',V',s,t)
Return argmin_i F(A_i)
Running time: O(n^3) function evaluations
Note: Finding pendent pairs
1. Initialize v_1 ← x (x is an arbitrary element of V)
2. For i = 1 to n-1 do
   a. W_i ← {v_1,…,v_i}
   b. v_{i+1} ← argmin_v F(W_i ∪ {v}) - F({v}) s.t. v ∈ V \ W_i
3. Return pendent pair (v_{n-1}, v_n)
Requires O(n^2) evaluations of F
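Putting the last two slides together, here is a hedged Python sketch of Queyranne's algorithm: the inner loop is the pendent-pair ordering above, supernodes are represented as frozensets of merged elements, and (following the standard statement of the result) the last element of the ordering is taken as the pendent element s.

```python
def queyranne(F, V):
    """argmin F(A) over 0 < |A| < |V| for a symmetric submodular oracle F on sets."""
    nodes = [frozenset([v]) for v in V]               # current (merged) ground set
    union = lambda groups: set().union(*groups) if groups else set()
    best_set, best_val = None, float('inf')
    while len(nodes) > 1:
        # pendentPair: greedy ordering v_1, ..., v_k of the current supernodes
        order, rest = [nodes[0]], list(nodes[1:])
        while rest:
            v_next = min(rest, key=lambda v: F(union(order) | set(v)) - F(set(v)))
            order.append(v_next)
            rest.remove(v_next)
        s, t = order[-1], order[-2]                   # pendent pair (s, t)
        if F(set(s)) < best_val:                      # candidate cut: elements merged into s
            best_set, best_val = set(s), F(set(s))
        nodes = [v for v in nodes if v not in (s, t)] + [s | t]   # merge s and t
    return best_set, best_val

# e.g. with the cut-function sketch from earlier (hypothetical weights):
# print(queyranne(lambda A: cut_value(edges, A), list('abcdefgh')))
```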
Overview minimization
Minimizing general submodular functions
  Can minimize in poly-time using the ellipsoid method
  Combinatorial, strongly polynomial algorithm: O(n^8)
  Practical alternative: minimum norm algorithm?
Minimizing symmetric submodular functions
  Many useful submodular functions are symmetric
  Queyranne's algorithm minimizes symmetric SFs in O(n^3)
Applications to Machine Learning
Application: Clustering [Narasimhan, Jojic, Bilmes NIPS '05]
Group data points V into "homogeneous clusters"
Find a partition V = A_1 ∪ … ∪ A_k that minimizes F(A_1,…,A_k) = ∑_i E(A_i), where E(A_i) is the "inhomogeneity of A_i"
Examples for E(A): entropy H(A); cut function
Special case: k = 2. Then F(A) = E(A) + E(V \ A) is symmetric!
If E is submodular, can use Queyranne's algorithm! ☺
What if we want k > 2 clusters? [Zhao et al '05, Narasimhan et al '05]
Greedy Splitting algorithm
  Start with partition P = {V}
  For i = 1 to k-1
    For each member C_j ∈ P do
      split cluster C_j: A* = argmin E(A) + E(C_j \ A) s.t. 0 < |A| < |C_j|
      P_j ← P \ {C_j} ∪ {A*, C_j \ A*}   (the partition obtained by splitting the j-th cluster)
    P ← argmin_j F(P_j)
Theorem: F(P) ≤ (2 - 2/k) F(P_opt)
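A sketch of this loop, under the assumption that each 2-way split is done with the queyranne() sketch from earlier, applied to the symmetric function A ↦ E(A) + E(C \ A); E is an inhomogeneity oracle on sets.

```python
def greedy_splitting(E, V, k):
    """Split V into k clusters by repeatedly making the cheapest 2-way split."""
    partition = [set(V)]
    for _ in range(k - 1):
        best = None                                  # (total inhomogeneity, partition)
        for j, C in enumerate(partition):
            if len(C) < 2:
                continue                             # cannot split a singleton
            sym = lambda A, C=C: E(set(A)) + E(C - set(A))
            A, _ = queyranne(sym, list(C))           # best nontrivial split of C
            candidate = partition[:j] + partition[j+1:] + [A, C - A]
            score = sum(E(B) for B in candidate)
            if best is None or score < best[0]:
                best = (score, candidate)
        if best is None:
            break                                    # nothing left to split
        partition = best[1]
    return partition
```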
Example: Clustering species [Narasimhan et al '05]
Common genetic information = # of common substrings
Can easily extend to sets of species
Example: Clustering species [Narasimhan et al '05]
The common genetic information I_CG
  does not require alignment
  captures genetic similarity
  is smallest for maximally evolutionarily diverged species
  is a symmetric submodular function! ☺
Greedy splitting algorithm yields a phylogenetic tree!
Example: SNPs [Narasimhan et al '05]
Study human genetic variation (for personalized medicine, …)
Most human variation is due to point mutations that occur once in human history at that base location: Single Nucleotide Polymorphisms (SNPs)
Cataloging all variation too expensive ($10K-$100K per individual!!)
SNPs in the ACE gene [Narasimhan et al '05]
[Figure: matrix with individuals as rows and SNPs as columns]
Which columns should we pick to reconstruct the rest?
Can find near-optimal clustering (Queyranne's algorithm)
Reconstruction accuracy [Narasimhan et al '05]
[Figure: prediction accuracy vs. # of clusters, comparing against clustering based on entropy, pairwise correlation, and PCA]
Example: Speaker segmentation [Reyes-Gomez, Jojic '07]
[Figure: spectrogram (frequency vs. time) of mixed waveforms from two speakers ("Alice", "Fiona"), partitioned into regions]
E(A) = -log p(X_A): likelihood of "region" A of the spectrogram
F(A) = E(A) + E(V \ A) is symmetric & posimodular; partition found using Queyranne's algorithm
Example: Image denoising
Example: Image denoising
Pairwise Markov Random Field
P(x_1,…,x_n, y_1,…,y_n) = ∏_{i,j} ψ_{i,j}(y_i,y_j) ∏_i φ_i(x_i,y_i)
X_i: noisy pixels; Y_i: "true" pixels
Want argmax_y P(y | x) = argmax_y log P(x,y)
  = argmin_y ∑_{i,j} E_{i,j}(y_i,y_j) + ∑_i E_i(y_i), where E_{i,j}(y_i,y_j) = -log ψ_{i,j}(y_i,y_j)
When is this MAP inference efficiently solvable (in high-treewidth graphical models)?
MAP inference in Markov Random Fields [Kolmogorov et al, PAMI '04; see also: Hammer, Ops Res '65]
Energy E(y) = ∑_{i,j} E_{i,j}(y_i,y_j) + ∑_i E_i(y_i)
Suppose the y_i are binary; define F(A) = E(y^A) where y^A_i = 1 iff i ∈ A
Then min_y E(y) = min_A F(A)
Theorem: MAP inference problem solvable by graph cuts
  ⟺ for all i,j: E_{i,j}(0,0) + E_{i,j}(1,1) ≤ E_{i,j}(0,1) + E_{i,j}(1,0)
  ⟺ each E_{i,j} is submodular
"Efficient if we prefer that neighboring pixels have the same color"
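A tiny sketch of the condition in the theorem (the Potts-style example term is made up for illustration):

```python
def pairwise_submodular(E_ij):
    """Check E(0,0) + E(1,1) <= E(0,1) + E(1,0) for a binary pairwise energy term."""
    return E_ij[(0, 0)] + E_ij[(1, 1)] <= E_ij[(0, 1)] + E_ij[(1, 0)]

# e.g. a smoothness term that prefers equal neighboring labels
potts = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 1.0, (1, 0): 1.0}
print(pairwise_submodular(potts))   # True -> this term is graph-cut solvable
```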
Constrained minimization
Have seen: if F is submodular on V, can solve A* = argmin F(A) s.t. A ⊆ V
What about A* = argmin F(A) s.t. A ⊆ V and |A| ≤ k ?
  E.g., clustering with a minimum # of points per cluster, …
In general, not much is known about constrained minimization
However, can do:
  A* = argmin F(A) s.t. 0 < |A| < n
  A* = argmin F(A) s.t. |A| is odd/even [Goemans & Ramakrishnan '95]
  A* = argmin F(A) s.t. A ∈ argmin G(A) for G submodular [Fujishige '91]
Overview minimization
Minimizing general submodular functions
  Can minimize in poly-time using the ellipsoid method
  Combinatorial, strongly polynomial algorithm: O(n^8)
  Practical alternative: minimum norm algorithm?
Minimizing symmetric submodular functions
  Many useful submodular functions are symmetric
  Queyranne's algorithm minimizes symmetric SFs in O(n^3)
Applications to Machine Learning
  Clustering [Narasimhan et al '05]
  Speaker segmentation [Reyes-Gomez & Jojic '07]
  MAP inference [Kolmogorov et al '04]
Tutorial Overview
Examples and properties of submodular functions
  Many problems submodular (mutual information, influence, …)
  SFs closed under positive linear combinations; not under min, max
Submodularity and convexity
  Every SF induces a convex function with SAME minimum
  Special properties: Greedy solves LP over exponential polytope
Minimizing submodular functions
  Minimization possible in polynomial time (but O(n^8)…)
  Queyranne's algorithm minimizes symmetric SFs in O(n^3)
  Useful for clustering, MAP inference, structure learning, …
Maximizing submodular functions
Extensions and research directions
Maximizing submodular functions
Maximizing submodular functions
Minimizing convex functions: polynomial time solvable!
Minimizing submodular functions: polynomial time solvable!
Maximizing convex functions: NP hard!
Maximizing submodular functions: NP hard!
  But can get approximation guarantees ☺
Maximizing influence [Kempe, Kleinberg, Tardos KDD '03]
[Figure: social network on Alice, Bob, Charlie, Dorothy, Eric, Fiona with influence probabilities on the edges]
F(A) = expected # of people influenced when targeting A
F monotonic: if A ⊆ B then F(A) ≤ F(B)
Hence V = argmax_A F(A)
More interesting: argmax_A F(A) – Cost(A)
Maximizing non-monotonic functions
Suppose we want, for non-monotonic F, A* = argmax F(A) s.t. A ⊆ V
[Figure: non-monotonic F(A) vs. |A|, with the maximum at an interior set size]
Example: F(A) = U(A) – C(A), where U(A) is a submodular utility and C(A) is a supermodular cost function
E.g.: trading off utility and privacy in personalized search [Krause & Horvitz AAAI '08]
In general: NP hard. Moreover:
If F(A) can take negative values, it is as hard to approximate as maximum independent set (i.e., NP hard to get an O(n^{1-ε}) approximation)
Maximizing positive submodular functions [Feige, Mirrokni, Vondrak FOCS '07]
Theorem: There is an efficient randomized local search procedure that, given a positive submodular function F with F(∅) = 0, returns a set A_LS such that
  F(A_LS) ≥ (2/5) max_A F(A)
Picking a random set gives a ¼ approximation (½ approximation if F is symmetric!)
We cannot get better than a ¾ approximation unless P = NP
Scalarization vs. constrained maximization
Given monotonic utility F(A) and cost C(A), optimize:
Option 1 ("scalarization"): max_A F(A) – C(A) s.t. A ⊆ V
  Can get a 2/5 approximation… if F(A) – C(A) ≥ 0 for all A ⊆ V; positiveness is a strong requirement
Option 2 ("constrained maximization"): max_A F(A) s.t. C(A) ≤ B
  coming up…
Constrained maximization: Outline
max_A F(A) s.t. C(A) ≤ B, where A is the selected set, F is monotonic submodular, C(A) is the selection cost, and B is the budget
Subset selection: C(A) = |A|
Complex constraints
Robust optimization
Monotonicity
A set function is called monotonic if A ⊆ B ⊆ V ⇒ F(A) ≤ F(B)
Examples:
  Influence in social networks [Kempe et al KDD '03]
  For discrete RVs, entropy F(A) = H(X_A) is monotonic:
    Suppose B = A ∪ C. Then F(B) = H(X_A, X_C) = H(X_A) + H(X_C | X_A) ≥ H(X_A) = F(A)
  Information gain: F(A) = H(Y) - H(Y | X_A)
  Set cover
  Matroid rank functions (dimension of vector spaces, …)
  …
Subset selection
Given: finite set V, monotonic submodular function F, F(∅) = 0
Want: A* = argmax_{|A| ≤ k} F(A)
NP-hard!
Exact maximization of monotonic submodular functions
1) Mixed integer programming [Nemhauser et al '81]
   max η
   s.t. η ≤ F(B) + ∑_{s ∈ V\B} α_s δ_s(B) for all B ⊆ V
        ∑_s α_s ≤ k
        α_s ∈ {0,1}
   where δ_s(B) = F(B ∪ {s}) – F(B)
   Solved using constraint generation
2) Branch-and-bound: "Data-correcting algorithm" [Goldengorin et al '99]
Both algorithms worst-case exponential!
Approximate maximization
Given: finite set V, monotonic submodular function F(A)
Want: A* = argmax_{|A| ≤ k} F(A)   — NP-hard!
[Figure: naive Bayes model with class Y = "Sick" and features X_1 = "Fever", X_2 = "Rash", X_3 = "Male"]
Greedy algorithm:
  Start with A_0 = ∅
  For i = 1 to k:
    s_i := argmax_s F(A_{i-1} ∪ {s}) - F(A_{i-1})
    A_i := A_{i-1} ∪ {s_i}
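A minimal runnable sketch of this greedy loop; the toy coverage utility is an assumption used only for illustration.

```python
def greedy_maximize(F, V, k):
    """Pick k elements, each time adding the one with the largest marginal gain."""
    A = set()
    for _ in range(k):
        gains = {s: F(A | {s}) - F(A) for s in V if s not in A}
        if not gains:
            break
        A.add(max(gains, key=gains.get))
    return A

# toy monotone submodular utility: coverage of made-up ground elements
sets = {'s1': {1, 2, 3}, 's2': {3, 4}, 's3': {4, 5, 6}}
coverage = lambda A: len(set().union(*(sets[s] for s in A))) if A else 0
print(greedy_maximize(coverage, list(sets), 2))      # e.g. {'s1', 's3'}
```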
Performance of greedy algorithm
Theorem [Nemhauser et al '78]: Given a monotonic submodular function F with F(∅) = 0, the greedy maximization algorithm returns A_greedy with
  F(A_greedy) ≥ (1 - 1/e) max_{|A| ≤ k} F(A)
  (1 - 1/e ≈ 0.63)
Sidenote: the greedy algorithm gives a 1/2 approximation for maximization over any matroid C! [Fisher et al '78]
An "elementary" counterexample
X_1, X_2 ~ Bernoulli(0.5), Y = X_1 XOR X_2
Let F(A) = IG(X_A; Y) = H(Y) – H(Y | X_A)
Y | X_1 and Y | X_2 ~ Bernoulli(0.5) (entropy 1)
Y | X_1, X_2 is deterministic! (entropy 0)
Hence F({1,2}) – F({1}) = 1, but F({2}) – F(∅) = 0 ⇒ F is not submodular in general
F(A) is submodular under some conditions! (later)
Example: Submodularity of info-gain
Y_1,…,Y_m, X_1,…,X_n discrete RVs
F(A) = IG(Y; X_A) = H(Y) - H(Y | X_A)
F(A) is always monotonic
However, NOT always submodular
Theorem [Krause & Guestrin UAI '05]: If the X_i are all conditionally independent given Y, then F(A) is submodular!
[Figure: graphical model in which the X_i are conditionally independent given Y_1,…,Y_m]
Hence, the greedy algorithm works!
In fact, NO algorithm can do better than a (1-1/e) approximation!
Building a Sensing Chair [Mutlu, Krause, Forlizzi, Guestrin, Hodgins UIST '07]
People sit a lot
Activity recognition in assistive technologies
Seating pressure as user interface
[Figure: pressure maps for postures "Lean left", "Lean forward", "Slouch"]
Equipped with 1 sensor per cm^2! Costs $16,000!
82% accuracy on 10 postures! [Tan et al]
Can we get similar accuracy with fewer, cheaper sensors?
How to place sensors on a chair?
Sensor readings at locations V as random variables
Predict posture Y using probabilistic model P(Y,V)
Pick sensor locations A* ⊆ V (among the possible locations) to minimize entropy
Placed sensors, did a user study:
          Accuracy   Cost
  Before    82%      $16,000
  After     79%      $100 ☺
Similar accuracy at < 1% of the cost!
Variance reduction (a.k.a. Orthogonal Matching Pursuit, Forward Regression)
Let Y = ∑_i α_i X_i + ε, and (X_1,…,X_n, ε) ∼ N(·; µ, Σ)
Want to pick a subset X_A to predict Y
Var(Y | X_A = x_A): conditional variance of Y given X_A = x_A
Expected variance: Var(Y | X_A) = ∫ p(x_A) Var(Y | X_A = x_A) dx_A
Variance reduction: F_V(A) = Var(Y) – Var(Y | X_A)
F_V(A) is always monotonic
Theorem [Das & Kempe, STOC '08]: under some conditions on Σ, F_V(A) is submodular
⇒ Orthogonal Matching Pursuit near optimal! [see other analyses by Tropp, Donoho et al., and Temlyakov]
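As a sketch (assuming the joint covariance Σ is known and the variables are jointly Gaussian, so Var(Y | X_A) has a closed form), F_V(A) can be computed as follows; the 3×3 covariance below is made up.

```python
import numpy as np

def variance_reduction(Sigma, y_idx, A):
    """F_V(A) = Var(Y) - Var(Y | X_A) for jointly Gaussian variables.
    Sigma: joint covariance; y_idx: index of Y; A: list of X indices."""
    if not A:
        return 0.0
    S_yy = Sigma[y_idx, y_idx]
    S_ya = Sigma[y_idx, A]
    S_aa = Sigma[np.ix_(A, A)]
    cond_var = S_yy - S_ya @ np.linalg.solve(S_aa, S_ya)   # Var(Y | X_A)
    return S_yy - cond_var

# toy 3-variable example: Y = index 0, X_1 = 1, X_2 = 2 (hypothetical covariance)
Sigma = np.array([[1.0, 0.6, 0.5],
                  [0.6, 1.0, 0.2],
                  [0.5, 0.2, 1.0]])
print(variance_reduction(Sigma, 0, [1]), variance_reduction(Sigma, 0, [1, 2]))
```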
Batch mode active learning [Hoi et al, ICML '06]
Which data points should we label to minimize error?
Want a batch A of k points to show an expert for labeling
[Figure: labeled and unlabeled points on either side of a decision boundary]
F(A) selects examples that are
  uncertain [σ^2(s) = π(s) (1 - π(s)) is large]
  diverse (points in A are as different as possible)
  relevant (as close to V \ A as possible, s^T s' large)
F(A) is submodular and monotonic! [approximation to the improvement in Fisher information]
Results about Active Learning [Hoi et al, ICML '06]
Batch mode Active Learning performs better than
  picking k points at random
  picking the k points of highest entropy
Monitoring water networks [Krause et al, J Wat Res Mgt 2008]
Contamination of drinking water could affect millions of people
[Figure: water distribution network with a contamination source and sensors; simulator from EPA; Hach sensor: ~$14K]
Place sensors to detect contaminations
"Battle of the Water Sensor Networks" competition
Where should we place sensors to quickly detect contamination?
Model-based sensing
Utility of placing sensors based on a model of the world
For water networks: water flow simulator from EPA
F(A) = expected impact reduction when placing sensors at A
[Figure: for a given contamination location, the model predicts high/medium/low impact over the set V of all network junctions; a sensor reduces impact through early detection; low impact reduction F(A) = 0.01 vs. high impact reduction F(A) = 0.9]
Theorem [Krause et al., J Wat Res Mgt '08]: Impact reduction F(A) in water networks is submodular!
Battle of the Water Sensor Networks Competition
Real metropolitan area network (12,527 nodes)
Water flow simulator provided by EPA
3.6 million contamination events
Multiple objectives: detection time, affected population, …
Place sensors that detect well "on average"
Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]
[Figure: population protected F(A) (higher is better) vs. number of sensors placed, on water networks data; greedy solution compared with the offline (Nemhauser) bound]
(1-1/e) bound quite loose… can we get better bounds?
Data-dependent bounds [Minoux '78]
Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k, and let A* = {s_1,…,s_k} be an optimal solution
Then F(A*) ≤ F(A ∪ A*)
  = F(A) + ∑_i [F(A ∪ {s_1,…,s_i}) - F(A ∪ {s_1,…,s_{i-1}})]
  ≤ F(A) + ∑_i (F(A ∪ {s_i}) - F(A)) = F(A) + ∑_i δ_{s_i}
For each s ∈ V \ A, let δ_s = F(A ∪ {s}) - F(A), and order so that δ_1 ≥ δ_2 ≥ … ≥ δ_n
Then: F(A*) ≤ F(A) + ∑_{i=1}^k δ_i
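A short sketch of this bound as code; F is a set-function oracle and the helper name is invented.

```python
def data_dependent_bound(F, V, A, k):
    """Upper bound on max_{|S| <= k} F(S), computed from any candidate set A."""
    deltas = sorted((F(A | {s}) - F(A) for s in V if s not in A), reverse=True)
    return F(A) + sum(deltas[:k])

# e.g., with the toy coverage utility and greedy solution sketched earlier:
# data_dependent_bound(coverage, list(sets), greedy_maximize(coverage, list(sets), 2), 2)
```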
Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]
[Figure: sensing quality F(A) (higher is better) vs. number of sensors placed, on water networks data; the data-dependent bound is much tighter than the offline (Nemhauser) bound on the greedy solution]
Submodularity gives data-dependent bounds on the performance of any algorithm
BWSN Competition results [Ostfeld et al., J Wat Res Mgt 2008]
13 participants
Performance measured in 30 different criteria
[Figure: total score (higher is better) per entry, with entries labeled G: genetic algorithm, D: domain knowledge, H: other heuristic, E: "exact" method (MIP)]
24% better performance than runner-up! ☺
What was the trick?
Simulated all 3.6M contaminations in 2 weeks on 40 processors
152 GB data on disk, 16 GB in main memory (compressed)
⇒ Very accurate computation of F(A), but very slow evaluation of F(A): 30 hours / 20 sensors; 6 weeks for all 30 settings
[Figure: running time in minutes (lower is better) vs. number of sensors selected, for exhaustive search (all subsets) and naive greedy]
Submodularity to the rescue…
Scaling up the greedy algorithm [Minoux '78]
In round i+1, having picked A_i = {s_1,…,s_i}, pick s_{i+1} = argmax_s F(A_i ∪ {s}) - F(A_i)
I.e., maximize the "marginal benefit" δ_s(A_i) = F(A_i ∪ {s}) - F(A_i)
Key observation: submodularity implies
  i ≤ j ⇒ δ_s(A_i) ≥ δ_s(A_j)
Marginal benefits can never increase!
"Lazy" greedy algorithm [Minoux '78]
Lazy greedy algorithm:
  First iteration as usual
  Keep an ordered list of marginal benefits δ_i from the previous iteration
  Re-evaluate δ_i only for the top element
  If δ_i stays on top, use it; otherwise re-sort
[Figure: priority list of benefits δ_s(A) for elements a, b, c, d, e being lazily updated]
Note: very easy to compute online bounds, lazy evaluations, etc. [Leskovec et al. '07]
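A hedged sketch of this idea using a heap of stale marginal gains (standard lazy/accelerated greedy, not necessarily the exact bookkeeping of [Minoux '78]); F is again a set-function oracle.

```python
import heapq, itertools

def lazy_greedy(F, V, k):
    """Greedy maximization with lazy re-evaluation of marginal gains."""
    A, fA = set(), F(set())
    counter = itertools.count()                       # tie-breaker for heap entries
    heap = [(-(F({s}) - fA), next(counter), s) for s in V]
    heapq.heapify(heap)
    for _ in range(min(k, len(V))):
        while True:
            neg_stale, _, s = heapq.heappop(heap)
            fresh = F(A | {s}) - fA                   # re-evaluate only the top element
            if not heap or fresh >= -heap[0][0]:      # still on top -> take it
                A.add(s)
                fA += fresh
                break
            heapq.heappush(heap, (-fresh, next(counter), s))  # otherwise re-insert lazily
    return A
```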
Result of lazy evaluation
[Figure: same running-time plot as before, now also showing fast greedy with lazy evaluations]
Using "lazy evaluations": 1 hour / 20 sensors
Done after 2 days! ☺
What about worst-case? [Krause et al., NIPS '07]
Knowing the sensor locations, an adversary contaminates here!
[Figure: two placements with very different average-case impact but the same worst-case impact]
Placement detects well on "average-case" (accidental) contamination
Where should we place sensors to quickly detect in the worst case?
Constrained maximization: Outline
max_A F(A) s.t. C(A) ≤ B, where F is the utility function, A is the selected set, C(A) is the selection cost, and B is the budget
Subset selection
Complex constraints
Robust optimization