
Optimal Distribution Testing via Reductions Ilias Diakonikolas USC



  1. Optimal Distribution Testing via Reductions Ilias Diakonikolas USC Joint work with Daniel Kane (UCSD)

  2. Distribution Testing Given samples from one or more unknown probability distributions, decide whether they satisfy a certain property. • Introduced by Karl Pearson (1899). • Classical Problem in Statistics [Neyman-Pearson’33, Lehmann-Romano’05] • Last fifteen years (TCS): property testing [Goldreich-Ron’00, Batu et al. FOCS’00/JACM’13]

  3. Notation Basic object of study: Probability distributions over a finite domain, $[n]$ or $[n]^d$. Notation: $p, q$: probability mass functions

  4. Example: Testing Closeness • Let $D$ be a family of probability distributions. Unknown $p \in D$ (samples: 1, 2, 2, 4, 3,…). Unknown $q \in D$ (samples: 2, 1, 2, 3, 1,…). Example: Total Variation Distance $d_{TV}(p, q) = (1/2) \|p - q\|_1$. Testing Closeness Problem: − Distinguish between the cases $p = q$ and $\mathrm{dist}(p, q) > \epsilon$ − Minimize sample size, computation time
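As a concrete illustration of the total variation distance defined on this slide, here is a minimal Python sketch. The function name `total_variation` and the two example distributions are illustrative assumptions, not taken from the talk.

```python
def total_variation(p, q):
    """d_TV(p, q) = (1/2) * sum_i |p_i - q_i| for two pmfs over the same domain."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same domain")
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical example: a skewed distribution versus the uniform one on [4].
p = [0.5, 0.25, 0.25, 0.0]
q = [0.25, 0.25, 0.25, 0.25]
print(total_variation(p, q))  # 0.25
```

In the testing problem the pmfs are of course unknown; a tester only sees samples, which is why sample size is the quantity to minimize.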

  5. This Work Simple Framework for Distribution Testing: Leads to sample-optimal and computationally efficient estimators for a variety of properties Primarily based on: A New Approach for Testing Properties of Discrete Distributions (I. Diakonikolas and D. Kane, FOCS’16)

  6. Outline § Related and Prior Work § Framework Overview and Statement of Results § Case Study: Testing Identity, Closeness, and Independence § Future Directions and Concluding Remarks

  7. Outline § Related and Prior Work § Framework Overview and Statement of Results § Case Study: Testing Identity, Closeness, and Independence § Future Directions and Concluding Remarks

  8. Prior Work: Identity Testing Focus has been on arbitrary distributions over support of size $n$. Testing Identity to a known Distribution: • [Goldreich-Ron’00]: $O(\sqrt{n}/\epsilon^4)$ upper bound for uniformity testing (collision statistics) • [Batu et al., FOCS’01]: $\tilde{O}(\sqrt{n}) \cdot \mathrm{poly}(1/\epsilon)$ upper bound for testing identity to any known distribution. • [Paninski ’03]: upper bound of $O(\sqrt{n}/\epsilon^2)$ for uniformity testing, assuming $\epsilon = \Omega(n^{-1/4})$. Lower bound of $\Omega(\sqrt{n}/\epsilon^2)$. • [Valiant-Valiant, FOCS’14, D-Kane-Nikishkin, SODA’15]: upper bound of $O(\sqrt{n}/\epsilon^2)$ for identity testing to any known distribution. • [D-Gouleakis-Peebles-Price’16]: [GR’00] tester is optimal!

  9. Prior Work: Closeness Testing Focus has been on arbitrary distributions over support of size $n$. Testing Closeness between two unknown distributions: • [Batu et al., FOCS’00]: $O(n^{2/3} \log n / \epsilon^{8/3})$ upper bound for testing closeness between two unknown discrete distributions. • [P. Valiant, STOC’08]: lower bound of $\Omega(n^{2/3})$ for constant error. • [Chan-D-Valiant-Valiant, SODA’14]: tight upper and lower bound of $O(\max\{n^{2/3}/\epsilon^{4/3}, n^{1/2}/\epsilon^2\})$ • [Bhattacharya-Valiant, NIPS’15]: tight bounds for different sample sizes (assuming $\epsilon > n^{-1/12}$).

  10. Prior Work: Testing Independence Focus has been on arbitrary distributions over support of size $n$. Testing Independence of a distribution on $[n] \times [m]$: • [Batu et al., FOCS’01]: $\tilde{O}(n^{2/3} m^{1/3} \cdot \mathrm{poly}(1/\epsilon))$ upper bound. • [Levi-Ron-Rubinfeld, ICS’11]: lower bounds for constant error $\Omega(m^{1/2} n^{1/2})$ and $\Omega(n^{2/3} m^{1/3})$, for $n = \Omega(m \log m)$ • [Acharya-Daskalakis-Kamath, NIPS’15]: upper bound of $O(n/\epsilon^2)$ for $n = m$.

  11. Outline § Related and Prior Work § Framework Overview and Statement of Results § Case Study: Testing Identity, Closeness, and Independence § Future Directions and Concluding Remarks

  12. L2 Closeness Testing Lemma 1: Let $p, q$ be unknown distributions on a domain of size $n$. There is an algorithm that uses $O(\min\{\|p\|_2, \|q\|_2\} \, n/\epsilon^2)$ samples from each of $p, q$, and with probability at least 2/3 distinguishes between the cases that $p = q$ and $\|p - q\|_1 \geq \epsilon$. Basic Tester [Chan-D-Valiant-Valiant’14]: • Calculate $Z = \sum_i \{(X_i - Y_i)^2 - X_i - Y_i\}$ • If $Z > \epsilon^2 m^2$ then output “No” (different), otherwise, output “Yes” (same) Collision-based estimator also works [D-Gouleakis-Peebles-Price’16]
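The basic tester on this slide can be sketched in Python as follows. The slide does not define $X_i$, $Y_i$ or $m$; this sketch assumes $X_i$ and $Y_i$ are the occurrence counts of element $i$ among $\mathrm{Poi}(m)$ samples from $p$ and from $q$ respectively. The helper names (`poisson`, `poissonized_counts`, `z_statistic`, `basic_l2_tester`) are hypothetical, the pmfs are passed in only to simulate sample access, and the threshold is copied from the slide as written.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw Poi(lam) by Knuth's method, in chunks so exp(-step) never underflows."""
    total = 0
    while lam > 0:
        step = min(lam, 500.0)
        limit, prod = math.exp(-step), rng.random()
        while prod > limit:
            total += 1
            prod *= rng.random()
        lam -= step
    return total


def poissonized_counts(pmf, m, rng):
    """Simulate Poi(m) samples from pmf; return the count of each domain element."""
    n_samples = poisson(float(m), rng)
    draws = rng.choices(range(len(pmf)), weights=pmf, k=n_samples)
    counts = Counter(draws)
    return [counts[i] for i in range(len(pmf))]


def z_statistic(x, y):
    """Z = sum_i ((X_i - Y_i)^2 - X_i - Y_i), as written on the slide."""
    return sum((xi - yi) ** 2 - xi - yi for xi, yi in zip(x, y))


def basic_l2_tester(p, q, m, eps, rng):
    """Output "No" (different) if Z > eps^2 * m^2, otherwise "Yes" (same)."""
    x = poissonized_counts(p, m, rng)  # assumption: X_i = count of i in samples from p
    y = poissonized_counts(q, m, rng)  # assumption: Y_i = count of i in samples from q
    return "No" if z_statistic(x, y) > (eps * m) ** 2 else "Yes"
```

Under the Poissonization assumption the counts are independent across elements, and each summand of $Z$ has expectation $m^2 (p_i - q_i)^2$, which is what makes a threshold on $Z$ meaningful.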

  13. Main New Idea Solve all problems by reducing to this as a black-box.

  14. Framework and Results • Approach : Reduction of L1 Testing to L2 testing 1) Transform given distribution(s) to new distribution(s) (over potentially larger domain) with small L2 norm. 2) Use standard L2 tester as a black-box. • Circumvents method of explicitly learning heavy elements [Batu et al., FOCS’00]

  15. Algorithmic Applications Sample Optimal Testers for: • Identity to a Fixed Distribution • Closeness between two Unknown Distributions • (Nearly) Instance-optimal Identity Testing • Closeness with unequal sample size • Adaptive Closeness Testing • Independence (in any dimension) • Properties of Collections of Distributions (Sample & Query model) • Testing Histograms • Other Metrics (chi-squared, Hellinger) Side labels on the slide: “Simpler Proofs of Known Results” (beside the first items) and “New Results” (beside the later items). All algorithms follow same pattern. Very simple analysis.

  16. Outline § Related and Prior Work § Framework Overview and Statement of Results § Case Study: Testing Identity, Closeness, and Independence § Future Directions and Concluding Remarks

  17. Warm-up: Testing Identity to Fixed Distribution (I) Let $p$ be unknown distribution and $q$ known distribution on $[n]$. Main Idea: “Stretch” the domain size to make L2 norm of $q$ small. • For every bin $i \in [n]$ create set $S_i$ of $\lceil n q_i \rceil$ new bins. • Subdivide the probability mass of bin $i$ equally within $S_i$. Let $S$ be the new domain and $p', q'$ the resulting distributions over $S$. (Figure: $q$ on $[n]$ and the stretched $q'$ on $S$.)

  18. Warm-up: Testing Identity to Fixed Distribution (II) Let $p$ be unknown distribution and $q$ known distribution on $[n]$. L1 Identity Tester • Given $q$, construct new domain $S$. • Use basic tester to distinguish between $p' = q'$ and $\|p' - q'\|_1 \geq \epsilon$. We construct $q'$ explicitly. Can sample from $p'$ given sample from $p$. Analysis: Observation 1: $\|p' - q'\|_1 = \|p - q\|_1$. Observation 2: $\|q'\|_2 = O(1/\sqrt{n})$ and $|S| \leq 2n$. By Lemma 1, we can test identity between $p'$ and $q'$ with sample size $O(\|q'\|_2 |S| / \epsilon^2) = O(\sqrt{n}/\epsilon^2)$.
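The domain-stretching step of the two warm-up slides can be sketched as below. All function names are hypothetical. One deviation from the slide is an assumption of this sketch: each $S_i$ gets at least one bin (`max(1, ...)`), so that a sample from $p$ landing on a bin with $q_i = 0$ still has somewhere to go; the bound $|S| \leq 2n$ is unaffected.

```python
import math
import random


def stretch_domain(q):
    """Build S_i with |S_i| = max(1, ceil(n * q_i)); return (bins, |S|)."""
    n = len(q)
    bins, next_id = [], 0
    for qi in q:
        size = max(1, math.ceil(n * qi))  # max(1, .) is this sketch's assumption
        bins.append(list(range(next_id, next_id + size)))
        next_id += size
    return bins, next_id


def stretched_pmf(dist, bins, size):
    """Spread the mass of bin i equally over S_i (gives q' from q, or p' from p)."""
    new = [0.0] * size
    for mass, s_i in zip(dist, bins):
        for j in s_i:
            new[j] = mass / len(s_i)
    return new


def map_sample(i, bins, rng):
    """Turn a sample i drawn from p into a sample from p': a uniform bin of S_i."""
    return rng.choice(bins[i])


# Hypothetical example on [4].
q = [0.5, 0.25, 0.125, 0.125]
bins, size = stretch_domain(q)
q_new = stretched_pmf(q, bins, size)
print(size, q_new)  # 5 [0.25, 0.25, 0.25, 0.125, 0.125]
```

Every entry of `q_new` is at most $1/n$, which is where Observation 2 comes from, and `map_sample` needs no knowledge of $p$, only of the bins built from $q$.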

  19. Identity Reduces to Uniformity • Summary of Previous Slides: Identity reduces to its special case when the explicit distribution has max probability $O(1/n)$. • Recent Improvement: [Oded Goldreich’16]: Identity Reduces to Uniformity.

  20. Testing Closeness (I) Let $p, q$ be unknown distributions on $[n]$. Main Idea: Use samples from $q$ to “stretch” the domain size. • Draw a set $S$ of $\mathrm{Poi}(k)$ samples from $q$. • Let $a_i$ be the number of times we see $i \in [n]$ in $S$. • Subdivide the mass of bin $i$ equally within $a_i + 1$ new bins. Let $S'$ be the new domain and $p', q'$ the resulting distributions over $S'$. We can sample from $p', q'$. Observation: $\|p' - q'\|_1 = \|p - q\|_1$

  21. Testing Closeness (II) Let $p, q$ be unknown distributions on $[n]$. L1 Closeness Tester • Draw a set $S$ of $\mathrm{Poi}(k)$ samples from $q$, construct new domain $S'$. • Use basic tester to distinguish between $p' = q'$ and $\|p' - q'\|_1 \geq \epsilon$. Claim: Whp $|S'| \leq n + O(k)$ and $\|q'\|_2 = O(1/\sqrt{k})$. Proof: $\|q'\|_2^2 = \sum_{i=1}^{n} q_i^2/(1 + a_i)$, $E[1/(1 + a_i)] \leq 1/(k q_i)$. □ By Lemma 1, we can test identity between $p'$ and $q'$ with sample size $O(\|q'\|_2 |S'| / \epsilon^2) = O(k^{-1/2} \cdot (n + k)/\epsilon^2)$. Total sample size $O(k + k^{-1/2} \cdot (n + k)/\epsilon^2)$. Set $k := \min\{n, n^{2/3} \epsilon^{-4/3}\}$.
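A minimal Python sketch of the sample-driven flattening used by the closeness tester follows. The function names are hypothetical, and the caller is assumed to supply the $\mathrm{Poi}(k)$ samples from $q$ as a list of domain elements in `range(n)`.

```python
import random
from collections import Counter


def flatten_with_samples(n, q_samples):
    """Split bin i into a_i + 1 new bins, where a_i counts i among q_samples."""
    counts = Counter(q_samples)
    bins, next_id = [], 0
    for i in range(n):
        size = counts[i] + 1
        bins.append(list(range(next_id, next_id + size)))
        next_id += size
    return bins, next_id  # |S'| = n + len(q_samples)


def map_to_flattened(i, bins, rng):
    """Turn a fresh sample i from p (or q) into a sample from p' (or q')."""
    return rng.choice(bins[i])


def choose_k(n, eps):
    """The slide's setting k := min{n, n^(2/3) * eps^(-4/3)}, as a real number."""
    return min(n, n ** (2 / 3) * eps ** (-4 / 3))


# Hypothetical example: three flattening samples on the domain [4].
bins, size = flatten_with_samples(4, [0, 0, 2])
print([len(b) for b in bins], size)  # [3, 1, 2, 1] 7
```

Heavy elements of $q$ are likely to appear often among the flattening samples, so their mass is split across many bins without the tester ever having to learn which elements are heavy.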

  22. Closeness with Unequal Samples Let $p, q$ be unknown distributions on $[n]$. Have $m_1 + m_2$ samples from $q$ and $m_2$ samples from $p$. L1 Closeness Tester Unequal • Set $k := \min\{n, m_1\}$. • Draw $\mathrm{Poi}(k)$ samples from $q$, construct new domain $S'$. • Use basic tester to distinguish between $p' = q'$ and $\|p' - q'\|_1 \geq \epsilon$. Claim: Whp $|S'| \leq n + O(k)$ and $\|q'\|_2 = O(1/\sqrt{k})$. By Lemma 1, we can test identity between $p'$ and $q'$ with sample size $m_2 = O(\|q'\|_2 |S'| / \epsilon^2) = O(k^{-1/2} \cdot (n + k)/\epsilon^2)$. By our choice of $k$, it follows $m_2 = O(\max\{n m_1^{-1/2}/\epsilon^2, n^{1/2}/\epsilon^2\})$.
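The final bound on $m_2$ follows from a two-case check on the choice of $k$. The short derivation below is a sketch that only substitutes the slide's $k$ into the slide's expression for $m_2$; nothing beyond those two formulas is assumed.

```latex
\begin{align*}
k = m_1 \le n:&\quad
  m_2 = O\!\left(\frac{n + m_1}{\sqrt{m_1}\,\epsilon^2}\right)
      = O\!\left(\frac{n}{\sqrt{m_1}\,\epsilon^2}\right)
  \quad\text{since } n + m_1 \le 2n, \\
k = n < m_1:&\quad
  m_2 = O\!\left(\frac{2n}{\sqrt{n}\,\epsilon^2}\right)
      = O\!\left(\frac{\sqrt{n}}{\epsilon^2}\right).
\end{align*}
```

Taking the larger of the two cases gives the stated maximum.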
