Central limit theorem: variants and applications


Anindya De, University of Pennsylvania

Introduction: The central limit theorem is one of the cornerstone results in probability theory and statistics.


  1. Showing Ŝ(ξ) is close to Ẑ(ξ)

Note: the Taylor expansion of X̂(ξ/√n) is valid only if |ξ| is small.

Claim: For |ξ| ≤ √n/100, we have

  |Ŝ(ξ) − Ẑ(ξ)| = O( (1/√n) · |ξ|³ · e^{−ξ²/3} ).

Proof technique: Split |ξ| into the low-|ξ| regime and the high-|ξ| regime. In particular, define

  Γ_low = { ξ : |ξ| ≤ n^{1/6} }  and  Γ_high = { ξ : n^{1/6} < |ξ| ≤ √n/100 }.

When ξ ∈ Γ_low, we apply Taylor's expansion: recall

  Ŝ(ξ) = ( 1 − ξ²/(2n) + o(|ξ|²/n) )^n.

When ξ ∈ Γ_high, it is not difficult to show that |Ŝ(ξ)| ≤ e^{−2ξ²/3}; combined with the fact that |Ẑ(ξ)| = e^{−ξ²/2}, this is enough.

  2. Finishing the proof of Berry-Esseen

For all |ξ| ≤ √n/100, we have |Ŝ(ξ) − Ẑ(ξ)| = O( (1/√n) · |ξ|³ · e^{−ξ²/3} ). Plugging this back into the approximate Fourier inversion formula

  |Pr[S ≤ x] − Pr[Z ≤ x]| ≤ (1/2π) ∫_{−T}^{T} ( |Ŝ(ξ) − Ẑ(ξ)| / |ξ| ) dξ + O(1/T),

with T = √n/100, we get |Pr[S ≤ x] − Pr[Z ≤ x]| = O(n^{−1/2}).
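The O(n^{−1/2}) rate is easy to check numerically. The sketch below is an illustration rather than part of the talk: it assumes Rademacher summands, for which S = (2B − n)/√n with B ∼ Binomial(n, 1/2), so the Kolmogorov distance to N(0, 1) can be computed exactly at the jump points of the CDF. The product √n · d_K stays bounded, matching the theorem.

```python
import numpy as np
from scipy.stats import binom, norm

def dK_rademacher_vs_gaussian(n):
    """Exact Kolmogorov distance between S = (x_1 + ... + x_n)/sqrt(n),
    with x_i i.i.d. uniform on {-1, +1}, and a standard Gaussian Z.
    S = (2B - n)/sqrt(n) for B ~ Binomial(n, 1/2), so the CDF of S is a
    step function and the supremum is attained at its jump points."""
    k = np.arange(n + 1)
    x = (2 * k - n) / np.sqrt(n)                 # support of S, increasing
    F = binom.cdf(k, n, 0.5)                     # Pr[S <= x_k]
    F_minus = np.concatenate(([0.0], F[:-1]))    # Pr[S <  x_k]
    Phi = norm.cdf(x)
    return max(np.abs(F - Phi).max(), np.abs(F_minus - Phi).max())

for n in [10, 100, 1000, 10000]:
    d = dK_rademacher_vs_gaussian(n)
    print(f"n = {n:>6}   d_K = {d:.5f}   sqrt(n) * d_K = {np.sqrt(n) * d:.3f}")
```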

  3. Application of the Berry-Esseen theorem: stochastic knapsack

Suppose you have n items, each with a profit c_i and a stochastic weight X_i, where each X_i is a positive-valued random variable.

Goal: Given a knapsack with capacity θ and error tolerance probability p, pack a subset S of items such that

  Pr[ Σ_{j ∈ S} X_j ≤ θ ] ≥ 1 − p,

so that the profit Σ_{j ∈ S} c_j is maximized.

  4. Berry-Esseen theorem for stochastic knapsack

Suppose each weight X_i is of the form

  X_i = w_{ℓ,i} with probability 1/2,   X_i = w_{h,i} with probability 1/2.

Here all w_{ℓ,i} ∈ [1, ..., M/4] and w_{h,i} ∈ [3M/4, ..., M], where M = poly(n). Further, all profits c_i ∈ [1, ..., M].
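To make the model concrete, here is a small Monte Carlo check of the feasibility constraint Pr[Σ_{j ∈ S} X_j ≤ θ] for the two-point weights above. All instance parameters (M = 100, 20 items, the choice of θ) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def feasibility_prob(w_lo, w_hi, theta, trials=200_000):
    """Monte Carlo estimate of Pr[sum_j X_j <= theta], where item j weighs
    w_lo[j] or w_hi[j] with probability 1/2 each, independently."""
    w_lo = np.asarray(w_lo, dtype=float)
    w_hi = np.asarray(w_hi, dtype=float)
    coins = rng.integers(0, 2, size=(trials, len(w_lo)))   # 1 -> heavy weight
    totals = w_lo.sum() + coins @ (w_hi - w_lo)
    return float(np.mean(totals <= theta))

M, n_items = 100, 20
w_lo = rng.integers(1, M // 4 + 1, size=n_items)       # low weights in [1, M/4]
w_hi = rng.integers(3 * M // 4, M + 1, size=n_items)   # high weights in [3M/4, M]
theta = 0.55 * (w_lo.sum() + w_hi.sum())               # capacity a bit above the mean
print(feasibility_prob(w_lo, w_hi, theta))
```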

  5. Algorithmic result for stochastic knapsack

Result: There is an algorithm which, for any error parameter ε > 0, runs in time poly(M, n^{1/ε²}) and outputs a set S* such that

  Pr[ Σ_{j ∈ S*} X_j ≤ θ ] ≥ 1 − p − ε   and   Σ_{j ∈ S*} c_j = OPT.

Key feature: We do not relax the knapsack capacity θ. See the SODA 2018 paper for the most general version of the results.

  6. Main idea behind the algorithm

Observation I: If we "center" the random variable X_i, i.e., set Y_i = X_i − E[X_i], then it satisfies

  E[ |Y_i|³ ] ≤ ( E[ |Y_i|² ] )^{3/2}.

Thus, we can potentially apply Berry-Esseen to a sum of the X_i. (A numerical check follows this slide.)

Observation II: Consider any subset of items S with |S| ≥ 100/ε². Then

  max_{j ∈ S} Var(X_j) ≤ ε² · Σ_{i ∈ S} Var(X_i).
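For the fair-coin two-point items of this talk, the centered variable Y_i takes only the two values ±(w_{h,i} − w_{ℓ,i})/2, so |Y_i| is constant and Observation I in fact holds with equality. A quick check (with hypothetical weights 20 and 90):

```python
import numpy as np

def centered_moments(values, probs):
    """Return (E|Y|^3, (E[Y^2])^{3/2}) for Y = X - E[X], X a discrete r.v."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    y = values - np.dot(probs, values)
    return np.dot(probs, np.abs(y) ** 3), np.dot(probs, y ** 2) ** 1.5

# Two-point item with w_l = 20, w_h = 90, each with probability 1/2:
# |Y| = 35 is constant, so E|Y|^3 = (E[Y^2])^{3/2} exactly.
third, second_pow = centered_moments([20.0, 90.0], [0.5, 0.5])
print(third, second_pow)   # both print 42875.0
```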

  7. Algorithmic idea

Step 1: Either the optimum solution S_opt satisfies |S_opt| ≤ 100/ε². In this case, we can brute-force search for S_opt; the running time is n^{Θ(1/ε²)}.

Step 2: Otherwise, |S_opt| > 100/ε². In this case, define μ_opt, σ²_opt and C_opt as

  (i) μ_opt = Σ_{j ∈ S_opt} E[X_j];  (ii) σ²_opt = Σ_{j ∈ S_opt} Var(X_j);  (iii) C_opt = Σ_{j ∈ S_opt} c_j.

  8. Algorithmic idea (continued)

Observe that μ_opt, σ²_opt and C_opt are all integral multiples of 1/4, bounded by M². We use dynamic programming to find S* such that C* = C_opt, μ* = μ_opt and σ²* = σ²_opt. (A sketch of such a dynamic program follows this slide.)

Consequence of the Berry-Esseen theorem:

  Pr[ Σ_{j ∈ S*} X_j ≤ θ ] ≥ Pr[ Σ_{j ∈ S_opt} X_j ≤ θ ] − ε.

This is because, by Berry-Esseen, the distributions of Σ_{j ∈ S*} X_j and Σ_{j ∈ S_opt} X_j are both essentially Gaussian (and their means and variances match).
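Below is a minimal sketch of a profile dynamic program in the spirit of the one just described; it is my reconstruction, not the paper's algorithm. Instead of matching the optimum's profile exactly, this version enumerates all reachable profiles (C, μ, σ²), with moments scaled by 4 so all table keys are integers, and returns the maximum-profit profile whose Gaussian estimate of Pr[Σ X_j ≤ θ] clears 1 − p − ε.

```python
from math import erf, sqrt

def knapsack_profile_dp(items, theta, p, eps):
    """Sketch: items is a list of (c_j, E[X_j], Var(X_j)), with means and
    variances integral multiples of 1/4.  Profiles are keyed by
    (total profit, 4 * total mean, 4 * total variance); each key stores one
    witness subset.  All keys are bounded, so the table has poly size."""
    profiles = {(0, 0, 0): []}
    for j, (c, mu, var) in enumerate(items):
        updated = dict(profiles)
        for (C, m4, v4), subset in profiles.items():
            key = (C + c, m4 + round(4 * mu), v4 + round(4 * var))
            updated.setdefault(key, subset + [j])
        profiles = updated

    def gaussian_ok(m4, v4):
        mu, var = m4 / 4.0, v4 / 4.0
        if var == 0:
            return mu <= theta
        # Berry-Esseen: Pr[sum <= theta] is close to Phi((theta - mu)/sigma)
        return 0.5 * (1.0 + erf((theta - mu) / sqrt(2.0 * var))) >= 1.0 - p - eps

    feasible = [k for k in profiles if gaussian_ok(k[1], k[2])]
    best = max(feasible, key=lambda k: k[0], default=None)
    return None if best is None else profiles[best]

# Hypothetical toy instance: (profit, mean, variance) per item.
items = [(5, 10.0, 4.0), (3, 6.0, 2.25), (4, 8.0, 6.25), (6, 12.0, 9.0)]
print(knapsack_profile_dp(items, theta=25.0, p=0.2, eps=0.05))
```

Note the Gaussian test is only justified for large subsets (Observations I and II); the small-subset case is handled by the brute-force step.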

  9. General algorithmic result for stochastic optimization

Suppose the item sizes {X_i}_{i=1}^n are all hypercontractive, i.e.,

  E[ |X_i|³ ] ≤ O(1) · ( E[ |X_i|² ] )^{3/2}.

Theorem: When the item sizes are hypercontractive, there is an algorithm running in time n^{O(1/ε²)} such that the output set S* satisfies

  1. Σ_{j ∈ S*} c_j ≥ (1 − ε) · Σ_{j ∈ S_opt} c_j;
  2. Pr[ Σ_{j ∈ S*} X_j ≤ θ ] ≥ Pr[ Σ_{j ∈ S_opt} X_j ≤ θ ] − ε.

See the SODA 2018 paper for more details.

  10. Central limit theorems: Citius, Altius, Fortius

Let's do Altius, as in higher-degree polynomials. Berry-Esseen says that a sum of independent random variables, under mild conditions, converges to a Gaussian. What if we replace the sum by a polynomial? Let us think of the easy case where the degree is 2.

  11. Central limit theorem for low-degree polynomials

Consider p(x) = ( (x₁ + ... + x_n)/√n )², where x₁, ..., x_n are i.i.d. copies of an unbiased ±1 random variable. As n → ∞, the distribution of p(x) tends to a χ² distribution.

In fact, suppose p(x) is of degree 2 and of the following form:

  p(x) = λ · ℓ²(x) + q(x),

where ℓ(x) is a linear form and λ = E[ p(x) · ℓ²(x) ]. If λ is large, then p(x) is very far from a Gaussian.
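A simulation (my illustration, not from the slides) makes the χ² limit visible: sample p(x) for large n and compare its quantiles with those of the χ² distribution with one degree of freedom.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, trials = 10_000, 1_000_000

# x_1 + ... + x_n = 2B - n for B ~ Binomial(n, 1/2), so sample p(x) directly.
s = (2.0 * rng.binomial(n, 0.5, size=trials) - n) / np.sqrt(n)
p = s ** 2

for q in (0.5, 0.9, 0.99):                   # empirical vs chi^2_1 quantiles
    print(q, np.quantile(p, q), chi2.ppf(q, df=1))
```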

  12. Central limit theorem for quadratic polynomials

Theorem: Let p(x): ℝ^n → ℝ with E[p(x)] = μ and Var(p(x)) = 1. Express p(x) = xᵀAx + ⟨b, x⟩ + c, where A ∈ ℝ^{n×n} and b ∈ ℝ^n. Let ‖A‖_op ≤ ε and ‖b‖_∞ ≤ ε, and suppose x ∼ {−1, 1}^n. Then

  d_K( p(x), N(μ, 1) ) = O(√ε).

In other words, if p(x) is not correlated with a product of two linear forms, then it is distributed close to a Gaussian.
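The hypothesis of the theorem is easy to test numerically. In the sketch below (an illustration under assumed parameters, not code from the talk), a random quadratic with every coefficient of order 1/n has small ‖A‖_op and ‖b‖_∞, and the empirical Kolmogorov distance of the standardized p(x) to N(0, 1) comes out small.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, trials = 200, 50_000

# Random multilinear quadratic with spread-out coefficients:
A = np.triu(rng.normal(scale=1.0 / n, size=(n, n)), k=1)   # no diagonal terms
b = rng.normal(scale=1.0 / np.sqrt(n), size=n)

print("||A||_op =", np.linalg.norm(A, 2), "  ||b||_inf =", np.abs(b).max())

x = rng.choice([-1.0, 1.0], size=(trials, n))
p = ((x @ A) * x).sum(axis=1) + x @ b        # p(x) = x^T A x + <b, x>

z = (p - p.mean()) / p.std()                 # standardize to mean 0, variance 1
grid = np.linspace(-3.0, 3.0, 121)
emp_cdf = (z[:, None] <= grid).mean(axis=0)
print("empirical d_K ~", np.abs(emp_cdf - norm.cdf(grid)).max())
```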

  13. Central limit theorem for higher-degree polynomials

Corresponding to any multilinear polynomial p: ℝ^n → ℝ of degree d, we have a sequence of tensors (A_d, ..., A_0), where A_i ∈ ℝ^{n^i} is a tensor of order i. For a tensor A_i (where i > 1), we use σ_max(A_i) to denote the "maximum singular value" obtained by a non-trivial flattening.

Theorem: Let p: ℝ^n → ℝ be a degree-d polynomial with Var(p(x)) = 1 and E[p(x)] = μ, and let (A_d, ..., A_0) denote the tensors corresponding to p. Then, for x ∼ {−1, 1}^n,

  d_K( p(x), N(μ, 1) ) = O_d(√ε),

where ε ≥ max_{j>1} σ_max(A_j) and ε ≥ ‖A₁‖_∞.
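The quantity σ_max(A_i) is computable: enumerate the ways of splitting the i modes into two non-empty groups, flatten the tensor into a matrix accordingly, and take the largest spectral norm. A small sketch (my rendering of the definition as stated; the talk does not fix a convention):

```python
import numpy as np
from itertools import combinations

def sigma_max(T):
    """Max singular value over all non-trivial flattenings of the tensor T:
    split the modes into two non-empty groups, reshape into a matrix, and
    take the spectral norm.  Transposed splits give the same value, so it
    suffices to let the row group contain at most half of the modes."""
    d = T.ndim
    best = 0.0
    for r in range(1, d // 2 + 1):
        for rows in combinations(range(d), r):
            cols = tuple(i for i in range(d) if i not in rows)
            rows_dim = int(np.prod([T.shape[i] for i in rows]))
            M = np.transpose(T, rows + cols).reshape(rows_dim, -1)
            best = max(best, np.linalg.norm(M, 2))
    return best

rng = np.random.default_rng(3)
T = rng.normal(size=(10, 10, 10)) / 10 ** 1.5   # random order-3 tensor, unit-ish norm
print(sigma_max(T))
```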

  14. Features of the central limit theorem

1. Qualitatively tight: in particular, if max_{j>1} σ_max(A_j) is large for a polynomial p, then the distribution of p(x) does not look like a Gaussian.
2. max_{j>1} σ_max(A_j) essentially captures the correlation of p(x) with a product of two lower-degree polynomials.
3. The condition for convergence to a normal distribution is efficiently checkable.

  15. Proof of the central limit theorem

• The first step is to go from x ∼ {−1, 1}^n to x ∼ N(0, 1)^n. This is accomplished via the invariance principle (an empirical illustration follows this slide).
• Once in the Gaussian domain, the question is: when does a polynomial of a Gaussian look like a Gaussian?
• Proof technique: Stein's method + Malliavin calculus.
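A quick empirical illustration of the invariance principle (mine, with arbitrary parameters): for a low-influence multilinear quadratic, the distribution of p(x) is nearly the same whether x has i.i.d. ±1 or i.i.d. Gaussian coordinates.

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 200, 50_000

A = np.triu(rng.normal(scale=1.0 / n, size=(n, n)), k=1)  # low-influence quadratic

def sample_p(draw_x):
    x = draw_x((trials, n))
    return ((x @ A) * x).sum(axis=1)

p_boolean = sample_p(lambda shape: rng.choice([-1.0, 1.0], size=shape))
p_gaussian = sample_p(lambda shape: rng.normal(size=shape))

for q in (0.1, 0.5, 0.9):      # the two quantile functions nearly coincide
    print(q, np.quantile(p_boolean, q), np.quantile(p_gaussian, q))
```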

  16. Central limit theorems: application in derandomization

• Derandomization is fertile ground both for applications and for the discovery of central limit theorems (in computer science).
• Why is the central limit theorem useful for derandomization?
• Example: Suppose we are given a halfspace f: {−1, 1}^n → {−1, 1}, where f(x) = sign( Σ_{i=1}^n w_i x_i − θ ). Deterministically compute Pr_{x ∈ {−1,1}^n}[ f(x) = 1 ].

  17. Derandomizing halfspaces

• For f(x) = sign( Σ_{i=1}^n w_i x_i − θ ), exactly computing Pr_{x ∈ {−1,1}^n}[ f(x) = 1 ] is #P-hard.
• Computing Pr_{x ∈ {−1,1}^n}[ f(x) = 1 ] to additive error ε is trivial using randomness (a sampling sketch follows this slide).
• What can we do deterministically? And how are CLTs going to be useful?
• [Servedio 2007]: Suppose all the |w_i| ≤ ε/100 (where ‖w‖₂ = 1).
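The randomized baseline in the second bullet is just sampling; by Hoeffding's inequality, O(log(1/δ)/ε²) samples give a ±ε estimate with probability 1 − δ. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(5)

def mc_halfspace_prob(w, theta, eps, delta=1e-3):
    """Estimate Pr_{x in {-1,1}^n}[<w, x> - theta >= 0] to additive error eps
    with failure probability delta, using Hoeffding's sample bound."""
    w = np.asarray(w, dtype=float)
    m = int(np.ceil(np.log(2.0 / delta) / (2.0 * eps ** 2)))
    x = rng.choice([-1.0, 1.0], size=(m, len(w)))
    return float(np.mean(x @ w - theta >= 0.0))

w = rng.normal(size=100)
w /= np.linalg.norm(w)
print(mc_halfspace_prob(w, theta=0.5, eps=0.02))
```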

  18. Berry-Esseen in action

  Pr_{x ∈ {−1,1}^n}[ Σ_{i=1}^n w_i x_i − θ ≥ 0 ]  ≈_ε  Pr_{g ∼ N(0,1)}[ g − θ ≥ 0 ].

• However, Pr_{g ∼ N(0,1)}[ g − θ ≥ 0 ] can be computed in O_ε(1) time.
• Thus, when all |w_i| ≤ ε/100, Pr_{x ∈ {−1,1}^n}[ Σ_{i=1}^n w_i x_i − θ ≥ 0 ] can be computed to within ±ε in O_ε(1) · poly(n) time.
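Concretely, the deterministic algorithm in the regular case is one Gaussian CDF evaluation. The sketch below is illustrative (n is kept tiny so the true probability can be brute-forced) and compares the Gaussian estimate with exact enumeration:

```python
import numpy as np
from itertools import product
from scipy.stats import norm

def halfspace_prob_exact(w, theta):
    """Exact Pr_{x in {-1,1}^n}[<w, x> - theta >= 0] by enumeration (tiny n)."""
    n = len(w)
    return np.mean([np.dot(w, x) >= theta for x in product((-1, 1), repeat=n)])

def halfspace_prob_gaussian(w, theta):
    """Berry-Esseen estimate: <w, x> behaves like ||w||_2 * g, g ~ N(0, 1),
    so Pr[<w, x> >= theta] ~ Pr[g >= theta / ||w||_2]."""
    return norm.sf(theta / np.linalg.norm(w))

rng = np.random.default_rng(6)
w = rng.normal(size=16)
w /= np.linalg.norm(w)           # regular-ish: no single |w_i| dominates
print(halfspace_prob_exact(w, 0.3), halfspace_prob_gaussian(w, 0.3))
```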

  19. What if max |w_i| ≥ ε/100?

• Suppose |w₁| ≥ ε/100. We recurse on the variable x₁:

  f_{x₁=1} = sign( Σ_{i=2}^n w_i x_i − θ + w₁ );   f_{x₁=−1} = sign( Σ_{i=2}^n w_i x_i − θ − w₁ ).

• Observe that it suffices to compute

  (1/2) · ( Pr_{x ∈ {−1,1}^{n−1}}[ f_{x₁=1}(x) = 1 ] + Pr_{x ∈ {−1,1}^{n−1}}[ f_{x₁=−1}(x) = 1 ] ).

  20. Berry-Esseen in recursive action

• Either max_{j ≥ 2} |w_j| ≤ (ε/100) · sqrt( Σ_{i=2}^n w_i² ).
• If yes, we can apply the Berry-Esseen theorem.
• Else, we restrict x₂, the variable with the largest remaining weight. Note: every time we restrict a variable, we capture an ε-fraction of the remaining ℓ₂ mass.
• Suppose the process goes on for j iterations. Either j ≤ ε⁻¹ log(1/ε) or j > ε⁻¹ log(1/ε).

  21. Berry-Esseen in recursive action (continued)

• If j ≤ ε⁻¹ log(1/ε), then this reduces the problem to exp(1/ε) subproblems, each of which can be solved using Berry-Esseen.
• If j > ε⁻¹ log(1/ε), we simply stop at j = ε⁻¹ log(1/ε). The top ε⁻¹ log(1/ε) weights capture most of the ℓ₂ mass.
• Non-trivial: Since the top ε⁻¹ log(1/ε) weights capture most of the mass of the vector w, we can just consider the halfspace over these variables. (A sketch combining these steps follows.)
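Putting the pieces together, here is a compact sketch of the whole recursion. It is my reconstruction of the scheme just described, minus some bookkeeping: a faithful version also caps the recursion depth at about ε⁻¹ log(1/ε) and argues that the remaining ℓ₂ mass is then negligible, whereas this sketch simply falls back to the Gaussian estimate at a fixed depth.

```python
import numpy as np
from scipy.stats import norm

def halfspace_prob(w, theta, eps, depth=0, max_depth=25):
    """Estimate Pr_{x in {-1,1}^n}[<w, x> >= theta] to additive O(eps).

    Regular case: if no weight exceeds (eps/100) * ||w||_2, Berry-Esseen
    lets us replace <w, x> by ||w||_2 * g with g ~ N(0, 1).  Otherwise,
    restrict the largest-weight variable to +1 and -1 and average the
    two subproblems, adjusting the threshold accordingly."""
    w = np.asarray(w, dtype=float)
    if len(w) == 0:
        return 1.0 if theta <= 0 else 0.0
    i = int(np.argmax(np.abs(w)))
    regular = abs(w[i]) <= (eps / 100.0) * np.linalg.norm(w)
    if regular or depth >= max_depth:   # fall back to the Gaussian estimate
        return float(norm.sf(theta / np.linalg.norm(w)))
    rest = np.delete(w, i)
    return 0.5 * (halfspace_prob(rest, theta - w[i], eps, depth + 1, max_depth)
                  + halfspace_prob(rest, theta + w[i], eps, depth + 1, max_depth))

# One dominant weight on top of many small regular ones forces a recursion step.
w = np.concatenate(([0.5], np.full(4_000_000, 1e-3)))
print(halfspace_prob(w, theta=0.8, eps=0.1))
```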
