Optimal Distribution Testing via Reductions
Ilias Diakonikolas (USC)
Joint work with Daniel Kane (UCSD)
Distribution Testing
Given samples from one or more unknown probability distributions, decide whether they satisfy a certain property.
• Introduced by Karl Pearson (1899).
• Classical problem in statistics [Neyman-Pearson'33, Lehmann-Romano'05].
• Last fifteen years (TCS): property testing [Goldreich-Ron'00, Batu et al. FOCS'00/JACM'13].
Notation
Basic object of study: probability distributions over a finite domain, [n] or [n]^d.
Notation: p, q denote probability mass functions.
Example: Testing Closeness
• Let D be a family of probability distributions.
• Unknown p ∈ D: samples 1, 2, 2, 4, 3, …
• Unknown q ∈ D: samples 2, 1, 2, 3, 1, …
Example: total variation distance, d_TV(p, q) = (1/2)·||p − q||_1.
Testing Closeness problem:
− Distinguish between the cases p = q and dist(p, q) > ε.
− Minimize sample size and computation time.
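The total variation distance above is straightforward to compute when both distributions are given explicitly as probability mass functions; a minimal sketch (function name is ours):

```python
def total_variation(p, q):
    """Total variation distance between two pmfs over the same finite
    domain: d_TV(p, q) = (1/2) * ||p - q||_1."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Example: distributions on a domain of size 2.
print(total_variation([0.5, 0.5], [1.0, 0.0]))  # 0.5
```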
This Work
A simple framework for distribution testing that leads to sample-optimal and computationally efficient estimators for a variety of properties.
Primarily based on: "A New Approach for Testing Properties of Discrete Distributions" (I. Diakonikolas and D. Kane, FOCS'16).
Outline
§ Related and Prior Work
§ Framework Overview and Statement of Results
§ Case Study: Testing Identity, Closeness, and Independence
§ Future Directions and Concluding Remarks
Prior Work: Identity Testing
Focus has been on arbitrary distributions over support of size n.
Testing identity to a known distribution:
• [Goldreich-Ron'00]: Õ(√n/ε⁴) upper bound for uniformity testing (collision statistics).
• [Batu et al., FOCS'01]: O(√n)·poly(1/ε) upper bound for testing identity to any known distribution.
• [Paninski'03]: upper bound of O(√n/ε²) for uniformity testing, assuming ε = Ω(n^{−1/4}). Lower bound of Ω(√n/ε²).
• [Valiant-Valiant, FOCS'14; D-Kane-Nikishkin, SODA'15]: upper bound of O(√n/ε²) for identity testing to any known distribution.
• [D-Gouleakis-Peebles-Price'16]: the [GR'00] tester is optimal!
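The collision statistic behind the [GR'00] uniformity tester can be sketched as follows. This is an illustration, not the paper's algorithm: the threshold constant and the function name are our choices. The idea is that the expected number of pairwise collisions among m samples is C(m,2)·||p||_2², which equals C(m,2)/n for the uniform distribution and is at least (1 + ε²)·C(m,2)/n when ||p − u||_1 ≥ ε.

```python
import itertools

def collision_uniformity_test(samples, n, epsilon):
    """Collision-based uniformity tester in the spirit of [GR'00] (sketch).

    Counts pairwise collisions among the samples and compares against a
    threshold halfway between the uniform and far-from-uniform expectations.
    The constant 1/2 in the threshold is illustrative."""
    m = len(samples)
    collisions = sum(1 for a, b in itertools.combinations(samples, 2) if a == b)
    threshold = (1 + epsilon ** 2 / 2) * (m * (m - 1) / 2) / n
    return "uniform" if collisions <= threshold else "far from uniform"

# A point mass collides constantly; distinct samples never collide.
print(collision_uniformity_test([1] * 20, 10, 0.5))        # far from uniform
print(collision_uniformity_test(list(range(20)), 100, 0.5))  # uniform
```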
Prior Work: Closeness Testing
Focus has been on arbitrary distributions over support of size n.
Testing closeness between two unknown distributions:
• [Batu et al., FOCS'00]: O(n^{2/3}·log n/ε^{8/3}) upper bound for testing closeness between two unknown discrete distributions.
• [P. Valiant, STOC'08]: lower bound of Ω(n^{2/3}) for constant error.
• [Chan-D-Valiant-Valiant, SODA'14]: tight upper and lower bound of Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε²}).
• [Bhattacharya-Valiant, NIPS'15]: tight bounds for different sample sizes (assuming ε > n^{−1/12}).
Prior Work: Testing Independence
Focus has been on arbitrary distributions over support of size n.
Testing independence of a distribution on [n] × [m]:
• [Batu et al., FOCS'01]: Õ(n^{2/3}·m^{1/3}·poly(1/ε)) upper bound.
• [Levi-Ron-Rubinfeld, ICS'11]: lower bounds for constant error: Ω(n^{2/3}·m^{1/3}) for n = Ω(m log m), and Ω(m^{1/2}·n^{1/2}).
• [Acharya-Daskalakis-Kamath, NIPS'15]: upper bound of O(n/ε²) for n = m.
Outline
§ Related and Prior Work
§ Framework Overview and Statement of Results
§ Case Study: Testing Identity, Closeness, and Independence
§ Future Directions and Concluding Remarks
L2 Closeness Testing
Lemma 1: Let p, q be unknown distributions on a domain of size n. There is an algorithm that uses O(min{||p||_2, ||q||_2}·n/ε²) samples from each of p, q and, with probability at least 2/3, distinguishes between the cases p = q and ||p − q||_1 ≥ ε.
Basic tester [Chan-D-Valiant-Valiant'14]:
• Calculate Z = Σ_i {(X_i − Y_i)² − X_i − Y_i}, where X_i, Y_i are the counts of bin i among the samples from p and q.
• If Z > ε²m², output "No" (different); otherwise, output "Yes" (same).
A collision-based estimator also works [D-Gouleakis-Peebles-Price'16].
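The basic tester above admits a very short implementation; a sketch, assuming X_i and Y_i are the bin counts from Poi(m) samples of p and q, and using the slide's threshold ε²m² (function and parameter names are ours):

```python
def l2_closeness_test(x_counts, y_counts, epsilon, m):
    """Basic L2 tester of [CDVV'14] (sketch): compute the statistic
    Z = sum_i (X_i - Y_i)^2 - X_i - Y_i over the bin counts.
    The -X_i - Y_i correction makes E[Z] = 0 when p = q, while Z
    concentrates well above the threshold when ||p - q||_2 is large."""
    z = sum((x - y) ** 2 - x - y for x, y in zip(x_counts, y_counts))
    return "No" if z > epsilon ** 2 * m ** 2 else "Yes"

# Identical histograms give a non-positive Z; disjoint ones a huge Z.
print(l2_closeness_test([5, 5], [5, 5], 0.1, 10))      # Yes
print(l2_closeness_test([100, 0], [0, 100], 0.5, 100))  # No
```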
Main New Idea
Solve all of these problems by reducing them to this L2 tester as a black box.
Framework and Results
• Approach: reduction of L1 testing to L2 testing.
1) Transform the given distribution(s) into new distribution(s) (over a potentially larger domain) with small L2 norm.
2) Use the standard L2 tester as a black box.
• Circumvents the method of explicitly learning the heavy elements [Batu et al., FOCS'00].
Algorithmic Applications
Sample-optimal testers for:
Simpler proofs of known results:
• Identity to a fixed distribution
• Closeness between two unknown distributions
• (Nearly) instance-optimal identity testing
New results:
• Closeness with unequal sample sizes
• Adaptive closeness testing
• Independence (in any dimension)
• Properties of collections of distributions (sample & query model)
• Testing histograms
• Other metrics (chi-squared, Hellinger)
All algorithms follow the same pattern, with a very simple analysis.
Outline
§ Related and Prior Work
§ Framework Overview and Statement of Results
§ Case Study: Testing Identity, Closeness, and Independence
§ Future Directions and Concluding Remarks
Warm-up: Testing Identity to a Fixed Distribution (I)
Let p be an unknown distribution and q a known distribution on [n].
Main Idea: "stretch" the domain to make the L2 norm of q small.
• For every bin i ∈ [n], create a set S_i of ⌈n·q_i⌉ new bins.
• Subdivide the probability mass of bin i equally within S_i.
Let S be the new domain and p′, q′ the resulting distributions over S.
Warm-up: Testing Identity to a Fixed Distribution (II)
Let p be an unknown distribution and q a known distribution on [n].
L1 Identity Tester:
• Given q, construct the new domain S.
• Use the basic tester to distinguish between p′ = q′ and ||p′ − q′||_1 ≥ ε.
We construct q′ explicitly, and we can sample from p′ given a sample from p.
Analysis:
Observation 1: ||p′ − q′||_1 = ||p − q||_1.
Observation 2: ||q′||_2 = O(1/√n) and |S| ≤ 2n.
By Lemma 1, we can test identity between p′ and q′ with sample size O(||q′||_2·|S|/ε²) = O(√n/ε²).
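The stretching construction of the two previous slides can be sketched as follows. The function names are ours, and the `max(1, ·)` guard, which keeps a piece for zero-mass bins, is our assumption rather than the talk's; each new bin gets mass q_i/⌈n·q_i⌉ ≤ 1/n, which is what drives ||q′||_2 = O(1/√n).

```python
import math
import random

def stretch_domain(q):
    """Split bin i of a known pmf q over [n] into ceil(n * q_i) equal-mass
    pieces (at least 1). The flattened q' has max mass <= 1/n and at most
    2n bins, so ||q'||_2 = O(1/sqrt(n))."""
    n = len(q)
    pieces = [max(1, math.ceil(n * qi)) for qi in q]
    q_prime = [qi / c for qi, c in zip(q, pieces) for _ in range(c)]
    return pieces, q_prime

def map_sample(i, pieces, rng=random):
    """Map a sample i from p to a uniformly random piece of bin i, giving a
    sample from p'; this preserves ||p' - q'||_1 = ||p - q||_1."""
    offset = sum(pieces[:i])
    return offset + rng.randrange(pieces[i])

pieces, q_prime = stretch_domain([0.5, 0.3, 0.2])
print(pieces)  # [2, 1, 1]
```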
Identity Reduces to Uniformity
• Summary of the previous slides: identity reduces to its special case in which the explicit distribution has max probability O(1/n).
• Recent improvement [Oded Goldreich'16]: identity reduces to uniformity.
Testing Closeness (I)
Let p, q be unknown distributions on [n].
Main Idea: use samples from q to "stretch" the domain.
• Draw a set S of Poi(k) samples from q.
• Let a_i be the number of times we see i ∈ [n] in S.
• Subdivide the mass of bin i equally within a_i + 1 new bins.
Let S′ be the new domain and p′, q′ the resulting distributions over S′. We can sample from p′, q′.
Observation: ||p′ − q′||_1 = ||p − q||_1.
Testing Closeness (II)
Let p, q be unknown distributions on [n].
L1 Closeness Tester:
• Draw a set S of Poi(k) samples from q, and construct the new domain S′.
• Use the basic tester to distinguish between p′ = q′ and ||p′ − q′||_1 ≥ ε.
Claim: Whp, |S′| ≤ n + O(k) and ||q′||_2 = O(1/√k).
Proof: ||q′||_2² = Σ_{i=1}^n q_i²/(1 + a_i), and E[1/(1 + a_i)] ≤ 1/(k·q_i). ∎
By Lemma 1, we can test closeness between p′ and q′ with sample size O(||q′||_2·|S′|/ε²) = O(k^{−1/2}·(n + k)/ε²).
Total sample size: O(k + k^{−1/2}·(n + k)/ε²). Set k := min{n, n^{2/3}·ε^{−4/3}}.
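A sketch of the splitting step. For simplicity this draws exactly k samples from q rather than Poi(k), and the function and parameter names are ours; note that the L1 distance is preserved exactly, whatever counts are drawn, since each bin's mass is divided by the same number of pieces in p′ and q′.

```python
import random
from collections import Counter

def closeness_reduction(p, q, k, seed=0):
    """Split bin i into a_i + 1 equal pieces, where a_i is the count of i
    among k samples drawn from q. Returns p', q' over the new domain;
    ||p' - q'||_1 = ||p - q||_1, and whp ||q'||_2 = O(1/sqrt(k))."""
    rng = random.Random(seed)
    n = len(p)
    counts = Counter(rng.choices(range(n), weights=q, k=k))
    pieces = [counts[i] + 1 for i in range(n)]
    p_prime = [pi / c for pi, c in zip(p, pieces) for _ in range(c)]
    q_prime = [qi / c for qi, c in zip(q, pieces) for _ in range(c)]
    return p_prime, q_prime

p, q = [0.5, 0.5], [0.9, 0.1]
pp, qp = closeness_reduction(p, q, k=20)
print(sum(abs(a - b) for a, b in zip(pp, qp)))  # 0.8, same as ||p - q||_1
```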
Closeness with Unequal Samples
Let p, q be unknown distributions on [n]. We have m₁ + m₂ samples from p and m₂ samples from q.
L1 Closeness Tester (unequal samples):
• Set k := min{n, m₁}.
• Draw Poi(k) samples from p, and construct the new domain S′.
• Use the basic tester to distinguish between p′ = q′ and ||p′ − q′||_1 ≥ ε.
Claim: Whp, |S′| ≤ n + O(k) and ||p′||_2 = O(1/√k).
By Lemma 1, we can test closeness between p′ and q′ with sample size m₂ = O(||p′||_2·|S′|/ε²) = O(k^{−1/2}·(n + k)/ε²).
By our choice of k, it follows that m₂ = O(max{n·m₁^{−1/2}/ε², n^{1/2}/ε²}).