Optimal Distribution Testing via Reductions
Ilias Diakonikolas (USC)
Joint work with Daniel Kane (UCSD)
Distribution Testing
Given samples from one or more unknown probability distributions, decide whether they satisfy a certain property.
• Introduced by Karl Pearson (1899).
• Classical problem in statistics [Neyman-Pearson'33, Lehmann-Romano'05].
• Last fifteen years (TCS): property testing [Goldreich-Ron'00, Batu et al. FOCS'00/JACM'13].
Notation
Basic object of study: probability distributions over a finite domain, [n] or [n]^d.
Notation: p, q denote probability mass functions.
Example: Testing Closeness
• Let D be a family of probability distributions.
• Unknown p ∈ D: samples 1, 2, 2, 4, 3, …
• Unknown q ∈ D: samples 2, 1, 2, 3, 1, …
Example: total variation distance, d_TV(p, q) = (1/2)·||p − q||_1.
Testing Closeness problem:
− Distinguish between the cases p = q and dist(p, q) > ε.
− Minimize sample size and computation time.
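The total variation distance above is straightforward to compute when both distributions are given explicitly as probability mass functions; a minimal sketch (function name is ours):

```python
def total_variation(p, q):
    """Total variation distance between two pmfs over the same finite
    domain: d_TV(p, q) = (1/2) * ||p - q||_1."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Example: distributions on a domain of size 2.
print(total_variation([0.5, 0.5], [1.0, 0.0]))  # 0.5
```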
This Work
A simple framework for distribution testing that leads to sample-optimal and computationally efficient estimators for a variety of properties.
Primarily based on: "A New Approach for Testing Properties of Discrete Distributions" (I. Diakonikolas and D. Kane, FOCS'16).
Outline
§ Related and Prior Work
§ Framework Overview and Statement of Results
§ Case Study: Testing Identity, Closeness, and Independence
§ Future Directions and Concluding Remarks
Prior Work: Identity Testing
Focus has been on arbitrary distributions over support of size n.
Testing identity to a known distribution:
• [Goldreich-Ron'00]: Õ(√n/ε⁴) upper bound for uniformity testing (collision statistics).
• [Batu et al., FOCS'01]: O(√n)·poly(1/ε) upper bound for testing identity to any known distribution.
• [Paninski'03]: upper bound of O(√n/ε²) for uniformity testing, assuming ε = Ω(n^{−1/4}). Lower bound of Ω(√n/ε²).
• [Valiant-Valiant, FOCS'14; D-Kane-Nikishkin, SODA'15]: upper bound of O(√n/ε²) for identity testing to any known distribution.
• [D-Gouleakis-Peebles-Price'16]: the [GR'00] tester is optimal!
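The collision statistic behind the [GR'00] uniformity tester can be sketched as follows. This is an illustration, not the paper's algorithm: the threshold constant and the function name are our choices. The idea is that the expected number of pairwise collisions among m samples is C(m,2)·||p||_2², which equals C(m,2)/n for the uniform distribution and is at least (1 + ε²)·C(m,2)/n when ||p − u||_1 ≥ ε.

```python
import itertools

def collision_uniformity_test(samples, n, epsilon):
    """Collision-based uniformity tester in the spirit of [GR'00] (sketch).

    Counts pairwise collisions among the samples and compares against a
    threshold halfway between the uniform and far-from-uniform expectations.
    The constant 1/2 in the threshold is illustrative."""
    m = len(samples)
    collisions = sum(1 for a, b in itertools.combinations(samples, 2) if a == b)
    threshold = (1 + epsilon ** 2 / 2) * (m * (m - 1) / 2) / n
    return "uniform" if collisions <= threshold else "far from uniform"

# A point mass collides constantly; distinct samples never collide.
print(collision_uniformity_test([1] * 20, 10, 0.5))        # far from uniform
print(collision_uniformity_test(list(range(20)), 100, 0.5))  # uniform
```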
Prior Work: Closeness Testing
Focus has been on arbitrary distributions over support of size n.
Testing closeness between two unknown distributions:
• [Batu et al., FOCS'00]: O(n^{2/3}·log n/ε^{8/3}) upper bound for testing closeness between two unknown discrete distributions.
• [P. Valiant, STOC'08]: lower bound of Ω(n^{2/3}) for constant error.
• [Chan-D-Valiant-Valiant, SODA'14]: tight upper and lower bound of Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε²}).
• [Bhattacharya-Valiant, NIPS'15]: tight bounds for different sample sizes (assuming ε > n^{−1/12}).
Prior Work: Testing Independence
Focus has been on arbitrary distributions over support of size n.
Testing independence of a distribution on [n] × [m]:
• [Batu et al., FOCS'01]: Õ(n^{2/3}·m^{1/3}·poly(1/ε)) upper bound.
• [Levi-Ron-Rubinfeld, ICS'11]: lower bounds for constant error: Ω(n^{2/3}·m^{1/3}) for n = Ω(m log m), and Ω(m^{1/2}·n^{1/2}).
• [Acharya-Daskalakis-Kamath, NIPS'15]: upper bound of O(n/ε²) for n = m.
Outline
§ Related and Prior Work
§ Framework Overview and Statement of Results
§ Case Study: Testing Identity, Closeness, and Independence
§ Future Directions and Concluding Remarks
L2 Closeness Testing
Lemma 1: Let p, q be unknown distributions on a domain of size n. There is an algorithm that uses O(min{||p||_2, ||q||_2}·n/ε²) samples from each of p, q and, with probability at least 2/3, distinguishes between the cases p = q and ||p − q||_1 ≥ ε.
Basic tester [Chan-D-Valiant-Valiant'14]:
• Calculate Z = Σ_i {(X_i − Y_i)² − X_i − Y_i}, where X_i, Y_i are the counts of bin i among the samples from p and q.
• If Z > ε²m², output "No" (different); otherwise, output "Yes" (same).
A collision-based estimator also works [D-Gouleakis-Peebles-Price'16].
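The basic tester above admits a very short implementation; a sketch, assuming X_i and Y_i are the bin counts from Poi(m) samples of p and q, and using the slide's threshold ε²m² (function and parameter names are ours):

```python
def l2_closeness_test(x_counts, y_counts, epsilon, m):
    """Basic L2 tester of [CDVV'14] (sketch): compute the statistic
    Z = sum_i (X_i - Y_i)^2 - X_i - Y_i over the bin counts.
    The -X_i - Y_i correction makes E[Z] = 0 when p = q, while Z
    concentrates well above the threshold when ||p - q||_2 is large."""
    z = sum((x - y) ** 2 - x - y for x, y in zip(x_counts, y_counts))
    return "No" if z > epsilon ** 2 * m ** 2 else "Yes"

# Identical histograms give a non-positive Z; disjoint ones a huge Z.
print(l2_closeness_test([5, 5], [5, 5], 0.1, 10))      # Yes
print(l2_closeness_test([100, 0], [0, 100], 0.5, 100))  # No
```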
Main New Idea
Solve all of these problems by reducing them to this L2 tester as a black box.
Framework and Results
• Approach: reduction of L1 testing to L2 testing.
1) Transform the given distribution(s) into new distribution(s) (over a potentially larger domain) with small L2 norm.
2) Use the standard L2 tester as a black box.
• Circumvents the method of explicitly learning the heavy elements [Batu et al., FOCS'00].
Algorithmic Applications
Sample-optimal testers for:
Simpler proofs of known results:
• Identity to a fixed distribution
• Closeness between two unknown distributions
• (Nearly) instance-optimal identity testing
New results:
• Closeness with unequal sample sizes
• Adaptive closeness testing
• Independence (in any dimension)
• Properties of collections of distributions (sample & query model)
• Testing histograms
• Other metrics (chi-squared, Hellinger)
All algorithms follow the same pattern, with a very simple analysis.
Outline
§ Related and Prior Work
§ Framework Overview and Statement of Results
§ Case Study: Testing Identity, Closeness, and Independence
§ Future Directions and Concluding Remarks
Warm-up: Testing Identity to a Fixed Distribution (I)
Let p be an unknown distribution and q a known distribution on [n].
Main Idea: "stretch" the domain to make the L2 norm of q small.
• For every bin i ∈ [n], create a set S_i of ⌈n·q_i⌉ new bins.
• Subdivide the probability mass of bin i equally within S_i.
Let S be the new domain and p′, q′ the resulting distributions over S.
Warm-up: Testing Identity to a Fixed Distribution (II)
Let p be an unknown distribution and q a known distribution on [n].
L1 Identity Tester:
• Given q, construct the new domain S.
• Use the basic tester to distinguish between p′ = q′ and ||p′ − q′||_1 ≥ ε.
We construct q′ explicitly, and we can sample from p′ given a sample from p.
Analysis:
Observation 1: ||p′ − q′||_1 = ||p − q||_1.
Observation 2: ||q′||_2 = O(1/√n) and |S| ≤ 2n.
By Lemma 1, we can test identity between p′ and q′ with sample size O(||q′||_2·|S|/ε²) = O(√n/ε²).
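The stretching construction of the two previous slides can be sketched as follows. The function names are ours, and the `max(1, ·)` guard, which keeps a piece for zero-mass bins, is our assumption rather than the talk's; each new bin gets mass q_i/⌈n·q_i⌉ ≤ 1/n, which is what drives ||q′||_2 = O(1/√n).

```python
import math
import random

def stretch_domain(q):
    """Split bin i of a known pmf q over [n] into ceil(n * q_i) equal-mass
    pieces (at least 1). The flattened q' has max mass <= 1/n and at most
    2n bins, so ||q'||_2 = O(1/sqrt(n))."""
    n = len(q)
    pieces = [max(1, math.ceil(n * qi)) for qi in q]
    q_prime = [qi / c for qi, c in zip(q, pieces) for _ in range(c)]
    return pieces, q_prime

def map_sample(i, pieces, rng=random):
    """Map a sample i from p to a uniformly random piece of bin i, giving a
    sample from p'; this preserves ||p' - q'||_1 = ||p - q||_1."""
    offset = sum(pieces[:i])
    return offset + rng.randrange(pieces[i])

pieces, q_prime = stretch_domain([0.5, 0.3, 0.2])
print(pieces)  # [2, 1, 1]
```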
Identity Reduces to Uniformity
• Summary of the previous slides: identity reduces to its special case in which the explicit distribution has max probability O(1/n).
• Recent improvement [Oded Goldreich'16]: identity reduces to uniformity.
Testing Closeness (I)
Let p, q be unknown distributions on [n].
Main Idea: use samples from q to "stretch" the domain.
• Draw a set S of Poi(k) samples from q.
• Let a_i be the number of times we see i ∈ [n] in S.
• Subdivide the mass of bin i equally within a_i + 1 new bins.
Let S′ be the new domain and p′, q′ the resulting distributions over S′. We can sample from p′, q′.
Observation: ||p′ − q′||_1 = ||p − q||_1.
Testing Closeness (II)
Let p, q be unknown distributions on [n].
L1 Closeness Tester:
• Draw a set S of Poi(k) samples from q, and construct the new domain S′.
• Use the basic tester to distinguish between p′ = q′ and ||p′ − q′||_1 ≥ ε.
Claim: Whp, |S′| ≤ n + O(k) and ||q′||_2 = O(1/√k).
Proof: ||q′||_2² = Σ_{i=1}^n q_i²/(1 + a_i), and E[1/(1 + a_i)] ≤ 1/(k·q_i). ∎
By Lemma 1, we can test closeness between p′ and q′ with sample size O(||q′||_2·|S′|/ε²) = O(k^{−1/2}·(n + k)/ε²).
Total sample size: O(k + k^{−1/2}·(n + k)/ε²). Set k := min{n, n^{2/3}·ε^{−4/3}}.
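A sketch of the splitting step. For simplicity this draws exactly k samples from q rather than Poi(k), and the function and parameter names are ours; note that the L1 distance is preserved exactly, whatever counts are drawn, since each bin's mass is divided by the same number of pieces in p′ and q′.

```python
import random
from collections import Counter

def closeness_reduction(p, q, k, seed=0):
    """Split bin i into a_i + 1 equal pieces, where a_i is the count of i
    among k samples drawn from q. Returns p', q' over the new domain;
    ||p' - q'||_1 = ||p - q||_1, and whp ||q'||_2 = O(1/sqrt(k))."""
    rng = random.Random(seed)
    n = len(p)
    counts = Counter(rng.choices(range(n), weights=q, k=k))
    pieces = [counts[i] + 1 for i in range(n)]
    p_prime = [pi / c for pi, c in zip(p, pieces) for _ in range(c)]
    q_prime = [qi / c for qi, c in zip(q, pieces) for _ in range(c)]
    return p_prime, q_prime

p, q = [0.5, 0.5], [0.9, 0.1]
pp, qp = closeness_reduction(p, q, k=20)
print(sum(abs(a - b) for a, b in zip(pp, qp)))  # 0.8, same as ||p - q||_1
```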
Closeness with Unequal Samples
Let p, q be unknown distributions on [n]. We have m₁ + m₂ samples from p and m₂ samples from q.
L1 Closeness Tester (unequal samples):
• Set k := min{n, m₁}.
• Draw Poi(k) samples from p, and construct the new domain S′.
• Use the basic tester to distinguish between p′ = q′ and ||p′ − q′||_1 ≥ ε.
Claim: Whp, |S′| ≤ n + O(k) and ||p′||_2 = O(1/√k).
By Lemma 1, we can test closeness between p′ and q′ with sample size m₂ = O(||p′||_2·|S′|/ε²) = O(k^{−1/2}·(n + k)/ε²).
By our choice of k, it follows that m₂ = O(max{n·m₁^{−1/2}/ε², n^{1/2}/ε²}).