  1. Two Useful Arrows Darts in that Quiver
     Clément Canonne, FOCS Workshop, November 9, 2019

  2. Averaging, Bucketing, and Investing arguments

  3. Suppose you have a : X → [0,1] such that E[a(x)] ≥ ε. (Let's say you already proved that.)
     We think of a(x) as the quality of x, and "using" x has cost cost(a(x)).
     For instance: a population of coins, each with its own bias. The expected bias is ε; for any given coin, checking bias 0 vs. bias α takes 1/α² tosses. Goal: find a biased coin.
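To make the coin example concrete, here is a minimal sketch of such a population and of the Θ(1/α²)-toss bias check. The population model (half fair coins, half with bias 2ε) and all constants are illustrative assumptions, not part of the talk.

```python
import random

# Hypothetical population: half the coins are fair, half have bias
# 2*eps, so the expected bias is exactly eps. Purely illustrative.
def draw_coin(eps):
    return 2 * eps if random.random() < 0.5 else 0.0

# Distinguishing bias 0 from bias >= alpha (with constant confidence)
# takes Theta(1/alpha^2) tosses: toss that many times, then accept if
# the empirical bias clears alpha/2. The factor 16 is an arbitrary
# confidence knob.
def looks_biased(bias, alpha, factor=16):
    tosses = int(factor / alpha ** 2)
    heads = sum(random.random() < 0.5 + bias for _ in range(tosses))
    return heads / tosses - 0.5 >= alpha / 2
```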

  4. How... to convert this into a useful thing? How to find an x with small cost? That is, can we get
         Pr_x[a(x) ≥ blah(ε)] ≥ bluh(ε)
     for some "good" functions blah, bluh?

  5. "By a standard averaging argument..." First attempt: Markov

     Lemma (Markov). We have
         Pr_x[a(x) ≥ ε/2] ≥ ε/2.    (1)

     Proof. Since Pr_x[a(x) < ε/2] ≤ 1,
         ε ≤ E[a(x)] ≤ (ε/2) · Pr_x[a(x) < ε/2] + 1 · Pr_x[a(x) ≥ ε/2] ≤ ε/2 + Pr_x[a(x) ≥ ε/2],
     and (1) follows.

     Strategy. Sample O(1/ε) x's to find a "good" one; for each, pay cost(ε/2).

     Yes, but... the total cost is typically at least quadratic in 1/ε, since cost(α) = Ω(1/α): we draw O(1/ε) samples and pay cost(ε/2) = Ω(1/ε) for each. We should not have to pay the worst of both worlds, many samples *and* a high per-sample cost.
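As a sketch, the Markov strategy on the coin example, reusing the hypothetical draw_coin and looks_biased helpers from above (constants again illustrative):

```python
# Markov strategy: O(1/eps) candidate coins, each tested at quality
# level eps/2. On the coin example cost(eps/2) = Theta(1/eps^2) tosses
# per candidate, so the total cost here is Theta(1/eps^3) tosses.
def markov_strategy(eps):
    for _ in range(int(4 / eps) + 1):
        coin = draw_coin(eps)
        if looks_biased(coin, eps / 2):
            return coin
    return None  # fails only with small probability
```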

  6. "By a standard bucketing argument..." Second attempt: my bucket list

     Lemma (Bucketing). There exists 1 ≤ j ≤ ⌈log(2/ε)⌉ =: L such that
         Pr_x[a(x) ≥ 2^{−j}] ≥ 2^j ε / (4L).    (2)

     Proof. Define buckets B_0 := {x : a(x) ≤ ε/2} and B_j := {x : 2^{−j} ≤ a(x) ≤ 2^{−j+1}} for 1 ≤ j ≤ L. Then, since Pr[x ∈ B_0] ≤ 1,
         ε ≤ E[a(x)] ≤ (ε/2) · Pr[x ∈ B_0] + Σ_{j=1}^{L} 2^{−j+1} · Pr[x ∈ B_j] ≤ ε/2 + Σ_{j=1}^{L} 2^{−j+1} · Pr[x ∈ B_j],
     so (averaging!) there exists j* such that 2^{−j*+1} · Pr[x ∈ B_{j*}] ≥ ε/(2L); since B_{j*} ⊆ {x : a(x) ≥ 2^{−j*}}, this gives (2).

     Strategy. For each j ∈ [L], in case it's the good bucket:
       - sample O(log(1/ε)/(2^j ε)) x's to find a "good" one in B_j;
       - for each such x, pay cost(2^{−j}).

     Total cost (examples):
         Σ_{j=1}^{L} (log(1/ε)/(2^j ε)) · cost(2^{−j}) ≍ log²(1/ε)/ε   if cost(α) ≍ 1/α,
                                                       ≍ log(1/ε)/ε²   if cost(α) ≍ 1/α².

     Yes, but... we lose log factors. Do we have to lose log factors?
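A corresponding sketch of the bucketed search, with the same hypothetical helpers and illustrative constants:

```python
import math

# Bucketing strategy: for each bucket j = 1..L, test
# O(log(1/eps)/(2^j * eps)) candidates at threshold 2^{-j}; each test
# costs Theta(2^{2j}) tosses, matching the totals computed above.
def bucketing_strategy(eps):
    L = math.ceil(math.log2(2 / eps))
    for j in range(1, L + 1):
        for _ in range(int(4 * L / (2 ** j * eps)) + 1):
            coin = draw_coin(eps)
            if looks_biased(coin, 2.0 ** (-j)):
                return coin, j
    return None
```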

  7. "By a refined averaging argument..." Third (and last) attempt: strategic investment

     Assume that cost(α) is superlinear, e.g., cost(α) = 1/α².

     Lemma (Levin's Economical Work Investment Strategy). There exists 1 ≤ j ≤ ⌈log(2/ε)⌉ =: L such that
         Pr_x[a(x) ≥ 2^{−j}] ≥ 2^j ε / (8(L+1−j)²).    (3)

     Proof. By contradiction: if (3) fails for every j, then
         ε ≤ E[a(x)] ≤ ε/2 + Σ_{j=1}^{L} 2^{−j+1} · Pr[x ∈ B_j]
                     ≤ ε/2 + Σ_{j=1}^{L} 2^{−j+1} · Pr[a(x) ≥ 2^{−j}]
                     < ε/2 + Σ_{j=1}^{L} 2^{−j+1} · 2^j ε / (8(L+1−j)²)
                     = ε/2 + (ε/4) Σ_{ℓ=1}^{L} 1/ℓ²
                     < ε/2 + (ε/4) Σ_{ℓ=1}^{∞} 1/ℓ² < ε,
     since Σ_{ℓ=1}^{∞} 1/ℓ² = π²/6 < 2. "Oops."

     Strategy. For each j ∈ [L]:
       - sample O((L+1−j)²/(2^j ε)) x's to find a "good" one in B_j;
       - for each such x, pay cost(2^{−j}) ≍ 2^{2j}.

     Total cost:
         Σ_{j=1}^{L} ((L+1−j)²/(2^j ε)) · 2^{2j} = (1/ε) Σ_{j=1}^{L} (L+1−j)² · 2^j = (2^{L+1}/ε) Σ_{ℓ=1}^{L} ℓ² · 2^{−ℓ}
                                                 < (4/ε²) Σ_{ℓ=1}^{∞} ℓ² · 2^{−ℓ} = O(1/ε²)
     (the last series converges; it's 6).

     Yes, but... no, actually, nothing. This works for any cost(α) ≫ 1/α^{1+δ}. For cost(α) ≍ 1/α, it is not so easy, but some results exist.
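And a sketch of the work investment schedule, same hypothetical helpers as before; the point is only how the candidate budget (L+1−j)²/(2^j ε) shrinks as the per-test cost 2^{2j} grows:

```python
# Levin-style work investment: bucket j gets O((L+1-j)^2 / (2^j eps))
# candidates at threshold 2^{-j}, so expensive tests (large j, cost
# ~2^{2j} tosses each) get sharply fewer candidates and the grand
# total stays O(1/eps^2) tosses.
def work_investment_strategy(eps):
    L = math.ceil(math.log2(2 / eps))
    for j in range(1, L + 1):
        for _ in range(int(4 * (L + 1 - j) ** 2 / (2 ** j * eps)) + 1):
            coin = draw_coin(eps)
            if looks_biased(coin, 2.0 ** (-j)):
                return coin, j
    return None
```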

  8. Thomas’ Favorite Lemma

  9. Kullback–Leibler Divergence

     Recall the definition of the Kullback–Leibler divergence (a.k.a. relative entropy) between two discrete distributions p, q:
         D(p‖q) = Σ_ω p(ω) log(p(ω)/q(ω)).
     It has some issues (no symmetry, no triangle inequality), yes, but it is everywhere (for a reason). It also has many nice properties.
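A direct transcription of the definition, on toy distributions (NumPy used only for convenience; natural logarithm throughout):

```python
import numpy as np

# D(p || q) = sum_w p(w) * log(p(w) / q(w)), with the usual convention
# that terms with p(w) = 0 contribute 0.
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
print(kl_divergence(p, q))   # nonnegative, zero iff p == q
print(kl_divergence(q, p))   # a different value: D is not symmetric
```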

  10. Kullback–Leibler Divergence: the dual characterization

     Theorem (First). For every q ≪ p,
         D(p‖q) = sup_f { E_{x∼p}[f(x)] − log E_{x∼q}[e^{f(x)}] }.    (4)

     Theorem (Second). For every p and every λ,
         log E_{x∼p}[e^{λx}] = max_{q≪p} { λ · E_{x∼q}[x] − D(q‖p) }.    (5)

     Known as: the Gibbs variational principle (1902?), Donsker–Varadhan (1975), a special case of Fenchel duality, ...
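A numerical sanity check of (4) on the toy distributions above (a sketch, reusing kl_divergence, p, q from the previous block): the supremum is attained at f = log(p/q), and any other f gives a smaller value.

```python
# Right-hand side of (4) for a given f:
#   E_{x~p}[f(x)] - log E_{x~q}[e^{f(x)}].
def dual_objective(f, p, q):
    p, q, f = (np.asarray(v, dtype=float) for v in (p, q, f))
    return float(np.sum(p * f) - np.log(np.sum(q * np.exp(f))))

f_star = np.log(np.asarray(p) / np.asarray(q))   # the optimal f
print(dual_objective(f_star, p, q))              # equals D(p || q)

rng = np.random.default_rng(0)
for _ in range(3):                               # any other f does worse
    f = rng.normal(size=3)
    assert dual_objective(f, p, q) <= kl_divergence(p, q)
```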

  11. An application

     Theorem. Suppose p is subgaussian on R^d. For every function a : R^d → [0,1] with α := E_{x∼p}[a(x)] > 0,
         ‖E_{x∼p}[x · a(x)]‖₂ ≤ C_p · α √(log(1/α)),    (6)
     where the constant C_p depends on the subgaussian parameter, not on d.

     The proof that follows was communicated to me by Himanshu Tyagi.
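The proof itself is not in this excerpt, but the statement is easy to sanity-check by Monte Carlo. A sketch for the standard Gaussian in d = 50 with a(x) = 1{x₁ > 2}; this choice of a and the value 3 standing in for C_p are illustrative assumptions, not claims about the optimal constant.

```python
import numpy as np

# Monte Carlo check of (6): ||E[x a(x)]||_2 vs. C_p * alpha * sqrt(log(1/alpha)).
rng = np.random.default_rng(1)
d, n = 50, 200_000
x = rng.standard_normal((n, d))
a = (x[:, 0] > 2.0).astype(float)            # a(x) = indicator{x_1 > 2}
alpha = a.mean()                              # empirical E[a(x)]
lhs = np.linalg.norm((x * a[:, None]).mean(axis=0))
rhs = 3.0 * alpha * np.sqrt(np.log(1 / alpha))
print(f"lhs={lhs:.4f}  rhs={rhs:.4f}  holds={lhs <= rhs}")
```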
