Learning Statistical Property Testers
Sreeram Kannan, University of Washington, Seattle
Collaborators: Rajat Sen (UT Austin), Karthikeyan Shanmugam (IBM Research), Sudipto Mukherjee, Himanshu Asnani, and Arman Rahimzamani (University of Washington, Seattle)
Statistical Property Testing
✤ Closeness testing
✤ Independence testing
✤ Conditional independence testing
✤ Information estimation
Testing Total Variation Distance

[Figure: histograms of two distributions P and Q]

Given n samples from P and n samples from Q, estimate $D_{TV}(P, Q)$?

P and Q can be arbitrary, so we must search beyond traditional density estimation methods.
Testing Total Variation: Prior Art
✤ Lots of work in CS theory on $D_{TV}$ testing
✤ Based on closeness testing between P and Q
✤ Sample complexity = $O(n^a)$, where n = alphabet size
✤ Curse of dimensionality: if $n = 2^d$, the complexity is $O(2^{ad})$

* Chan et al., Optimal Algorithms for Testing Closeness of Discrete Distributions, SODA 2014
* Sriperumbudur et al., Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions, NIPS 2009
Classifiers Beat the Curse of Dimensionality
✤ Deep NNs and boosted random forests achieve state-of-the-art performance.
✤ They work well in practice even when X is high dimensional.
✤ They exploit generic inductive biases: invariance, hierarchical structure, symmetry.

Theoretical guarantees lag severely behind practice!
Distance Estimation via Classification

[Figure: histograms of P and Q]

Draw n samples $\sim P$ (Label 0) and n samples $\sim Q$ (Label 1), and train a classifier (deep NN, boosted trees, etc.) to distinguish them.

Classification error of the optimal Bayes classifier $= \frac{1}{2} - \frac{1}{2} D_{TV}(P, Q)$.

Classification error of any classifier $\geq \frac{1}{2} - \frac{1}{2} D_{TV}(P, Q)$, so the held-out error of any learned classifier yields a lower bound on $D_{TV}(P, Q)$, and p-value control is possible.

* Sriperumbudur et al., Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions, NIPS 2009
* Lopez-Paz et al., Revisiting Classifier Two-Sample Tests, ICLR 2017
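A minimal sketch of this classifier-based $D_{TV}$ lower bound, assuming the two samples arrive as 2-D NumPy arrays and using a boosted-tree classifier from scikit-learn; the function name estimate_tv and the 50/50 train/test split are illustrative choices, not from the talk.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def estimate_tv(xs_p, xs_q, seed=0):
    """Lower-bound D_TV(P, Q) from samples via a two-sample classifier."""
    # Label the samples: P -> 0, Q -> 1.
    X = np.vstack([xs_p, xs_q])
    y = np.concatenate([np.zeros(len(xs_p)), np.ones(len(xs_q))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    clf = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    # Held-out error err satisfies err >= 1/2 - (1/2) D_TV(P, Q),
    # so 1 - 2 * err lower-bounds D_TV(P, Q).
    err = 1.0 - clf.score(X_te, y_te)
    return max(0.0, 1.0 - 2.0 * err)
```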
Independence Testing

Given n samples $\{x_i, y_i\}_{i=1}^n$, test
$H_0$: $X \perp Y$ ($P_{CI}$, density $p(x)p(y)$) vs. $H_1$: $X \not\perp Y$ ($P$, density $p(x, y)$).

Idea: classify samples from $P$ (density $p(x, y)$) against samples from $P_{CI}$ (density $p(x)p(y)$), where the $P_{CI}$ samples are obtained by permutation.
Independence Testing: Procedure

Split the n samples $\{x_i, y_i\}_{i=1}^n$ equally into two halves:
✤ Half 1 (Label 0): the $y_i$'s are permuted, turning samples from $P$ (density $p(x, y)$) into samples from $P_{CI}$ (density $p(x)p(y)$).
✤ Half 2 (Label 1): kept intact, samples $\sim P$ with density $p(x, y)$.

Classify the two halves; p-value control is available.

* Lopez-Paz et al., Revisiting Classifier Two-Sample Tests, ICLR 2017
* Sriperumbudur et al., Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions, NIPS 2009
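A minimal sketch of this permute-then-classify independence test, assuming paired samples xs, ys as 2-D NumPy arrays; returning held-out accuracy as the test statistic follows the classifier two-sample test recipe, and all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def independence_test_stat(xs, ys, seed=0):
    """Classifier accuracy for p(x, y) vs. p(x)p(y); near 1/2 supports H0."""
    rng = np.random.default_rng(seed)
    n = len(xs) // 2
    # Half 1: permute the y's -> samples from p(x)p(y) (Label 0).
    perm = rng.permutation(n)
    product = np.hstack([xs[:n], ys[:n][perm]])
    # Half 2: keep (x, y) pairs intact -> samples from p(x, y) (Label 1).
    joint = np.hstack([xs[n:2 * n], ys[n:2 * n]])
    X = np.vstack([product, joint])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    clf = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```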
Conditional Independence Testing

Given n samples $\{x_i, y_i, z_i\}_{i=1}^n$, test
$H_0$: $X \perp Y \mid Z$ ($P_{CI}$, density $p(z)p(x|z)p(y|z)$) vs. $H_1$: $X \not\perp Y \mid Z$ ($P$, density $p(x, y, z)$).

Again we want to classify $P$ against $P_{CI}$, but how do we get samples from $P_{CI}$? Given samples $\sim p(x, z)$, how do we emulate $p(y|z)$?

Answer: emulate $p(y|z)$ with a learned $q(y|z)$ and classify $P$ against $\tilde{P}_{CI}$ (density $p(z)p(x|z)q(y|z)$). Prior work builds $q(y|z)$ with KNN-based and kernel methods:
✤ [KCIT] Zhang et al., Kernel-based Conditional Independence Test and Application in Causal Discovery, UAI 2011
✤ [KCIPT] Doran et al., A Permutation-based Kernel Conditional Independence Test, UAI 2014
✤ [CCIT] Sen et al., Model-Powered Conditional Independence Test, NIPS 2017
✤ [RCIT] Strobl et al., Approximate Kernel-based Conditional Independence Tests for Fast Non-Parametric Causal Discovery, arXiv

Limitation: these are limited to low-dimensional Z, whereas in practice Z is often high dimensional (e.g., in a graphical model, the conditioning set can be the entire rest of the graph).
Generative Models Beat the Curse of Dimensionality

[Figure: Generator mapping z (low-dimensional latent space) to x (high-dimensional data space)]

✤ Trained on real samples of x
✤ Can generate any number of new samples
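A minimal sketch of drawing fresh samples from such a model, assuming some already-trained mapping generator (hypothetical here) from latent z to data x, with a standard-normal latent prior as is common for GANs.

```python
import numpy as np

def draw_samples(generator, n, latent_dim, seed=0):
    """Generate n new data points from a trained generator z -> x."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, latent_dim))  # low-dimensional latents
    return generator(z)                       # high-dimensional samples
```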
How loose can the estimate $\tilde{P}_{CI}$, i.e. $q(y|z)$, be?

Mimic-and-Classify works as long as the density $q(y|z) > 0$ whenever $p(y, z) > 0$, thanks to a novel bias-cancellation method.

Mimic functions: GANs, regressors, etc.
Mimic and Classify

Mimic step: split the data $D \sim p(x, y, z)$ equally into $D_1$ and $D_2$. For each $(x_i, y_i, z_i)$ in $D_2$, feed $z_i$ to the mimic function to draw $y'_i \sim q(\cdot \mid z_i)$ and form $(x_i, y'_i, z_i)$, giving a dataset $D' \sim p(z)p(x|z)q(y|z)$.

Classify step: classify $D_1 \sim p(x, y, z)$ against $D' \sim p(z)p(x|z)q(y|z)$.
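A minimal end-to-end sketch of Mimic-and-Classify, assuming xs, ys, zs are 2-D NumPy arrays; here a nearest-neighbour resampler serves as a crude stand-in for the mimic function $q(y|z)$ (the talk's mimic functions are GANs or regressors), and all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def mimic_and_classify_stat(xs, ys, zs, seed=0):
    """Classifier accuracy: D1 ~ p(x,y,z) vs. D' ~ p(z)p(x|z)q(y|z)."""
    n = len(xs) // 2
    # D1: kept intact -> samples from p(x, y, z) (Label 1).
    d1 = np.hstack([xs[:n], ys[:n], zs[:n]])
    # Mimic step on D2: replace each y_i by the y of a nearby z, a crude
    # stand-in for drawing y'_i ~ q(. | z_i) (Label 0).
    nn = NearestNeighbors(n_neighbors=2).fit(zs[n:2 * n])
    _, idx = nn.kneighbors(zs[n:2 * n])
    y_mimic = ys[n:2 * n][idx[:, 1]]  # 2nd neighbour: skip the point itself
    d_prime = np.hstack([xs[n:2 * n], y_mimic, zs[n:2 * n]])
    # Classify step.
    X = np.vstack([d1, d_prime])
    y = np.concatenate([np.ones(n), np.zeros(n)])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    clf = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    # Accuracy near 1/2 supports H0: X independent of Y given Z.
    return clf.score(X_te, y_te)
```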