

  1. Active Learning and Optimized Information Gathering Lecture 12 – Submodularity CS 101.2 Andreas Krause

  2. Announcements Homework 2: Due Thursday Feb 19. Project milestone due: Feb 24. 4 pages, NIPS format: http://nips.cc/PaperInformation/StyleFiles Should contain preliminary results (model, experiments, proofs, …) as well as a timeline for the remaining work. Come to office hours to discuss projects! Office hours (come before your presentation!): Andreas: Monday 3pm-4:30pm, 260 Jorgensen; Ryan: Wednesday 4:00-6:00pm, 109 Moore

  3. Course outline 1. Online decision making 2. Statistical active learning 3. Combinatorial approaches

  4. Medical diagnosis Want to predict the medical condition of a patient given noisy symptoms / tests: body temperature, rash on skin, cough, increased antibodies in blood, abnormal MRI. Utilities of the decision: treatment is -$$ if healthy and $ if sick; no treatment is 0 if healthy and -$$$ if sick. Treating a healthy patient is bad, not treating a sick patient is terrible. Each test has a (potentially different) cost. Which tests should we perform to make the most effective decisions?

  5. Value of information Prior P(Y), observe X_i = x_i, posterior P(Y | x_i), reward. Value of information: Reward[ P(Y | x_i) ] = max_a EU(a | x_i). The reward can be any function of the distribution P(Y | x_i). Important examples: posterior variance of Y, posterior entropy of Y.
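To make the reward concrete, here is a minimal Python sketch of Reward[ P(Y | x_i) ] = max_a EU(a | x_i) for the treat/no-treat decision of slide 4; the utilities, prior, and test characteristics are hypothetical numbers chosen only for illustration.

def posterior(prior_sick, p_pos_given_sick, p_pos_given_healthy, positive):
    # Bayes update: P(sick | test outcome)
    p_pos = prior_sick * p_pos_given_sick + (1 - prior_sick) * p_pos_given_healthy
    if positive:
        return prior_sick * p_pos_given_sick / p_pos
    return prior_sick * (1 - p_pos_given_sick) / (1 - p_pos)

def reward(p_sick, utility):
    # Reward[ P(Y | x) ] = max_a EU(a | x): best expected utility given the posterior
    return max(p_sick * utility[(a, "sick")] + (1 - p_sick) * utility[(a, "healthy")]
               for a in ("treat", "no_treat"))

# Hypothetical utilities: treating a healthy patient is bad, missing a sick one is terrible
U = {("treat", "healthy"): -10, ("treat", "sick"): 5,
     ("no_treat", "healthy"): 0, ("no_treat", "sick"): -100}

prior = 0.1              # hypothetical P(Y = sick)
sens, fpr = 0.9, 0.2     # hypothetical sensitivity / false-positive rate of one test
p_pos = prior * sens + (1 - prior) * fpr

# Value of observing the test: expected reward after observing it, minus reward with the prior alone
voi = (p_pos * reward(posterior(prior, sens, fpr, True), U)
       + (1 - p_pos) * reward(posterior(prior, sens, fpr, False), U)
       - reward(prior, U))
print("value of information of this test:", round(voi, 2))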

  6. Optimal value of information Can we efficiently optimize value of information? The answer depends on properties of the distribution P(X_1,…,X_n,Y). Theorem [Krause & Guestrin IJCAI '05]: If the random variables form a Markov chain, we can find the optimal (exponentially large!) decision tree in polynomial time. However, there exists a class of distributions for which we can perform efficient inference (i.e., compute P(Y | X_i)) but for which finding the optimal decision tree is NP^PP-hard.

  7. Approximating value of information? If we can't find an optimal solution, can we find provably near-optimal approximations?

  8. Feature selection Given random variables Y, X_1, …, X_n, we want to predict Y from a subset X_A = (X_{i_1},…,X_{i_k}). [Figure: Naïve Bayes model with class Y = "Sick" and features X_1 = "Fever", X_2 = "Rash", X_3 = "Male".] Want the k most informative features: A* = argmax_A IG(X_A; Y) s.t. |A| ≤ k, where IG(X_A; Y) = H(Y) - H(Y | X_A), i.e., the uncertainty about Y before knowing X_A minus the uncertainty after knowing X_A.

  9. Example: Greedy algorithm for feature selection Given: finite set V of features, utility function F(A) = IG(X_A; Y). Want: A* = argmax_{|A| ≤ k} F(A) over A ⊆ V (NP-hard!). Greedy algorithm: start with A = ∅; for i = 1 to k: s* := argmax_s F(A ∪ {s}); A := A ∪ {s*}. How well can this simple heuristic do?
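A minimal Python sketch of this greedy loop, assuming F is given as a black box on Python sets with F(set()) = 0; the utility and probabilities in the usage example are made up.

def greedy_maximize(F, V, k):
    # Start with A = empty set; k times, add the element with the largest marginal gain
    A = set()
    for _ in range(k):
        s_star = max(V - A, key=lambda s: F(A | {s}) - F(A))
        A.add(s_star)
    return A

# Usage with a toy submodular utility: F(A) = probability that at least one test in A fires
p = {"temp": 0.5, "rash": 0.3, "cough": 0.4, "mri": 0.8}   # hypothetical numbers
def F(A):
    miss = 1.0
    for s in A:
        miss *= 1 - p[s]
    return 1 - miss

print(greedy_maximize(F, set(p), k=2))   # picks "mri" first, then "temp"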

  10. Key property: Diminishing returns Compare selection A = {} with selection B = {X_2, X_3}: adding the new feature X_1 to A will help a lot, while adding X_1 to B doesn't help much. Theorem [Krause, Guestrin UAI '05]: Information gain F(A) in Naïve Bayes models is submodular! Submodularity: adding s to the small set A gives a large improvement, adding s to the large set B gives a small improvement; for A ⊆ B, F(A ∪ {s}) – F(A) ≥ F(B ∪ {s}) – F(B)

  11. Why is submodularity useful? Theorem [Nemhauser et al '78]: The greedy maximization algorithm returns A_greedy with F(A_greedy) ≥ (1-1/e) max_{|A| ≤ k} F(A), i.e., at least about 63% of the optimal value. The greedy algorithm gives a near-optimal solution! For information gain, this guarantee is the best possible unless P = NP [Krause, Guestrin UAI '05]. Submodularity is an incredibly useful and powerful concept!

  12. Set functions Finite set V = {1,2,…,n}, function F: 2^V → R. We will always assume F(∅) = 0 (w.l.o.g.), and that we have a black box that can evaluate F for any input A; approximate (noisy) evaluation of F is ok. Example: F(A) = IG(X_A; Y) = H(Y) – H(Y | X_A) = ∑_{y, x_A} P(y, x_A) [log P(y | x_A) – log P(y)]
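As a sketch, this information-gain set function can be evaluated directly from an explicit joint distribution table; the toy joint P over two binary features and a binary Y below is made up for illustration.

from collections import defaultdict
from itertools import product
from math import log2

# Hypothetical joint distribution P(x1, x2, y) over binary variables (sums to 1)
P = dict(zip(product([0, 1], repeat=3),
             [0.20, 0.05, 0.05, 0.20, 0.05, 0.20, 0.20, 0.05]))

def entropy(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

def info_gain(A):
    # F(A) = H(Y) - H(Y | X_A), where A is a subset of the feature indices {0, 1}
    pY, pXA, pXAY = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x1, x2, y), q in P.items():
        xa = tuple((x1, x2)[i] for i in sorted(A))
        pY[y] += q
        pXA[xa] += q
        pXAY[(xa, y)] += q
    H_Y_given_XA = sum(pXA[xa] * entropy({y: pXAY[(xa, y)] / pXA[xa] for y in pY})
                       for xa in pXA)
    return entropy(pY) - H_Y_given_XA

print(info_gain(set()), info_gain({0}), info_gain({0, 1}))   # nondecreasing in A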

  13. Submodular set functions A set function F on V is called submodular if for all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B). Equivalent diminishing returns characterization: adding s to the small set A gives a large improvement, adding s to the large set B gives a small improvement; for A ⊆ B, s ∉ B, F(A ∪ {s}) – F(A) ≥ F(B ∪ {s}) – F(B)
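For small ground sets, the diminishing-returns characterization can be checked by brute force; a sketch (the example functions of |A| anticipate the concavity slide below):

from itertools import combinations

def is_submodular(F, V):
    # Check F(A ∪ {s}) - F(A) >= F(B ∪ {s}) - F(B) for all A ⊆ B ⊆ V and s ∉ B
    V = frozenset(V)
    subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
    return all(F(A | {s}) - F(A) >= F(B | {s}) - F(B) - 1e-12
               for B in subsets for A in subsets if A <= B
               for s in V - B)

V = {1, 2, 3, 4}
print(is_submodular(lambda A: len(A) ** 0.5, V))   # True: sqrt is concave
print(is_submodular(lambda A: len(A) ** 2, V))     # False: squaring is convex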

  14. Submodularity and supermodularity A set function F on V is called submodular if 1) for all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B), or equivalently 2) for all A ⊆ B, s ∉ B: F(A ∪ {s}) – F(A) ≥ F(B ∪ {s}) – F(B). F is called supermodular if –F is submodular. F is called modular if F is both sub- and supermodular; for modular ("additive") F, F(A) = ∑_{i ∈ A} w(i)

  15. Example: Set cover [Figure: sensor placement; each node predicts values of positions within some radius.] For A ⊆ V: F(A) = "area covered by sensors placed at A". Formally: W is a finite set with a collection of n subsets S_i ⊆ W; for A ⊆ V = {1,…,n} define F(A) = |∪_{i ∈ A} S_i|
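A sketch of this coverage function with made-up subsets S_i, showing diminishing returns directly:

S = {1: {"w1", "w2", "w3"}, 2: {"w3", "w4"}, 3: {"w4", "w5", "w6"}}  # hypothetical S_i ⊆ W

def coverage(A):
    # F(A) = |union of S_i for i in A|
    covered = set()
    for i in A:
        covered |= S[i]
    return len(covered)

# Adding sensor 2 helps the empty selection more than it helps the larger selection {1, 3}
print(coverage({2}) - coverage(set()))         # 2
print(coverage({1, 3, 2}) - coverage({1, 3}))  # 0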

  16. Set cover is submodular [Figure: the new area covered by an added sensor s is larger when s is added to the smaller selection A than when it is added to the larger selection B ⊇ A.] F(A ∪ {s}) – F(A) ≥ F(B ∪ {s}) – F(B)

  17. Example: Mutual information Given random variables X_1,…,X_n, let F(A) = I(X_A; X_{V \ A}) = H(X_{V \ A}) – H(X_{V \ A} | X_A). Lemma: Mutual information F(A) is submodular. Indeed, F(A ∪ {s}) – F(A) = H(X_s | X_A) – H(X_s | X_{V \ (A ∪ {s})}); as A grows, the first term can only decrease ("information never hurts") and the second can only increase, so δ_s(A) = F(A ∪ {s}) – F(A) is monotonically nonincreasing and F is submodular.

  18. Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD '03] [Figure: social network over Alice, Bob, Charlie, Dorothy, Eric, Fiona; each edge is labeled with a probability of influencing.] Who should get free cell phones? V = {Alice, Bob, Charlie, Dorothy, Eric, Fiona}, F(A) = expected number of people influenced when targeting A

  19. Influence in social networks is submodular [Kempe, Kleinberg, Tardos KDD '03] Key idea: Flip the coins c in advance, yielding the "live" edges. F_c(A) = number of people influenced under outcome c (a set cover function!). Therefore F(A) = ∑_c P(c) F_c(A) is submodular as well!
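The live-edge argument also suggests a simple Monte Carlo estimator for F(A): sample the coin flips c, count who is reachable from A along live edges, and average. A sketch; the graph below and its probabilities are hypothetical, only loosely modeled on the figure.

import random

edges = {  # (u, v): probability that u influences v (made-up values)
    ("Alice", "Bob"): 0.5, ("Bob", "Charlie"): 0.5, ("Charlie", "Fiona"): 0.3,
    ("Alice", "Dorothy"): 0.4, ("Dorothy", "Eric"): 0.2, ("Eric", "Fiona"): 0.5,
}

def influenced(A, live_edges):
    # F_c(A): people reachable from A along the live edges of outcome c (a set-cover-like count)
    reached, frontier = set(A), set(A)
    while frontier:
        frontier = {v for (u, v) in live_edges if u in frontier and v not in reached}
        reached |= frontier
    return len(reached)

def expected_influence(A, samples=10000):
    # Estimate F(A) = sum_c P(c) F_c(A) by sampling live-edge outcomes c
    total = 0
    for _ in range(samples):
        live = {e for e, q in edges.items() if random.random() < q}
        total += influenced(A, live)
    return total / samples

print(expected_influence({"Alice"}))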

  20. Closedness properties If F_1,…,F_m are submodular functions on V and λ_1,…,λ_m > 0, then F(A) = ∑_i λ_i F_i(A) is submodular! Submodularity is closed under nonnegative linear combinations, an extremely useful fact: F_θ(A) submodular ⇒ ∑_θ P(θ) F_θ(A) submodular! Multicriterion optimization: F_1,…,F_m submodular, λ_i ≥ 0 ⇒ ∑_i λ_i F_i(A) submodular

  21. Submodularity and Concavity Suppose g: N → R and F(A) = g(|A|). Then F(A) is submodular if and only if g is concave! E.g., g could say "buying in bulk is cheaper".

  22. Maximum of submodular functions Suppose F_1(A) and F_2(A) are submodular. Is F(A) = max(F_1(A), F_2(A)) submodular? No: max(F_1, F_2) is not submodular in general! For instance, with F_i(A) = g_i(|A|), the functions g_1(k) = k and g_2(k) = min(2k, 3) are both concave, but their pointwise maximum is not, so by the previous slide the maximum of these two submodular functions is not submodular (on a ground set with at least 4 elements).

  23. Minimum of submodular functions Well, maybe F(A) = min(F_1(A), F_2(A)) instead? Counterexample on V = {a, b}:
      A        F_1(A)   F_2(A)   F(A)
      ∅        0        0        0
      {a}      1        0        0
      {b}      0        1        0
      {a,b}    1        1        1
  min(F_1, F_2) is not submodular in general: the marginal gain of a is 0 given ∅ but 1 given {b}.

  24. Maximizing submodular functions Minimizing convex functions: polynomial time solvable! Minimizing submodular functions: polynomial time solvable! Maximizing convex functions: NP-hard! Maximizing submodular functions: NP-hard! But we can get approximation guarantees ☺

  25. Maximizing influence [Kempe, Kleinberg, Tardos KDD '03] [Figure: the social network from slide 18 with edge influence probabilities 0.2, 0.5, 0.4, 0.2, 0.3, 0.5, 0.5.] F(A) = expected number of people influenced when targeting A. F is monotonic: if A ⊆ B then F(A) ≤ F(B). Hence V = argmax_A F(A), i.e., targeting everyone is trivially optimal. More interesting: argmax_A F(A) – Cost(A)

  26. Maximizing non-monotonic functions Suppose we want A* = argmax_A F(A) s.t. A ⊆ V for a non-monotonic F. Example: F(A) = U(A) – C(A), where U(A) is a submodular utility and C(A) is a supermodular cost function. In general this is NP-hard. Moreover, if F(A) can take negative values, it is as hard to approximate as maximum independent set (i.e., NP-hard to get an O(n^{1-ε}) approximation).
