

  1. Failing Loudly: Detecting Dataset Shift. Stephan Rabanser¹ (rabans@amazon.com), Prof. Dr. Stephan Günnemann² (guennemann@in.tum.de), Prof. Dr. Zachary C. Lipton³ (zlipton@cmu.edu). ¹ Amazon, AWS AI Labs; ² Technical University of Munich, Department of Informatics, Data Analytics and Machine Learning; ³ Carnegie Mellon University, Tepper School of Business, Machine Learning Department. January 24, 2020

  2. Table of contents: Motivation & Overview, Methods, Experiments, Conclusion

  3. Motivation & Overview

  4. Motivation
  • The reliable functioning of software depends crucially on tests.
  • Despite their power, ML models are sensitive to shifts in the data distribution.
  • ML pipelines rarely inspect incoming data for signs of distribution shift.
  • Best practices for testing equivalence of the source distribution p and the target distribution q in real-life, high-dimensional data settings have not yet been established.
  • Existing solutions for addressing covariate shift q(x, y) = q(x) p(y | x) or label shift q(x, y) = q(y) p(x | y) often rely on strict preconditions and produce wrong predictions when these are not met.

  5. Shift Detection Overview
  Faced with distribution shift, our goals are three-fold:
  • detect when distribution shift occurs from as few examples as possible;
  • characterize the shift (e.g. by identifying those samples from the test set that appear over-represented in the target data); and
  • provide some guidance on whether the shift is harmful or not.
  [Pipeline diagram: x_source and x_target → Dimensionality Reduction → Two-Sample Test(s) → Combined Test Statistic → Shift Detection]

  6. Methods

  7. Our Framework
  Given labeled data (x_1, y_1), ..., (x_n, y_n) ~ p and unlabeled data x'_1, ..., x'_m ~ q, our task is to determine whether p(x) equals q(x'):
      H_0: p(x) = q(x')  vs  H_A: p(x) ≠ q(x').
  We explore the following design choices (a generic sketch of the resulting pipeline is given below):
  • what representation to run the test on;
  • which two-sample test to run;
  • when the representation is multidimensional, whether to run a single multivariate test or multiple univariate two-sample tests; and
  • how to combine their results.
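As a rough illustration of this framework (not the authors' implementation), the sketch below wires the design choices together: a placeholder `reduce` callable maps raw inputs to a representation, a placeholder `two_sample_test` returns a p-value, and H_0 is rejected at level alpha. All names are illustrative; concrete options for both callables are sketched under the following slides.

```python
def detect_shift(x_source, x_target, reduce, two_sample_test, alpha=0.05):
    """Generic detection loop: reduce -> test -> compare the p-value to alpha.

    reduce:           callable mapping an (n, d) array to an (n, k) representation
    two_sample_test:  callable returning a p-value for H_0: p(x) = q(x')
    """
    r_source = reduce(x_source)
    r_target = reduce(x_target)
    p_value = two_sample_test(r_source, r_target)
    return p_value, bool(p_value < alpha)  # True => shift detected
```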

  8. Dimensionality Reduction Techniques: NoRed & PCA
  • No Reduction (NoRed): to justify the use of any DR technique, our default baseline is to run tests on the original raw features.
  • Principal Components Analysis (PCA): find an optimal orthogonal transformation matrix such that points are linearly uncorrelated after the transformation.
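A minimal PCA-based `reduce`, using scikit-learn. Fitting the projection on the source sample only (so the target sample cannot leak into the representation) and K = 32 components are assumptions borrowed from the experimental setup later in the deck, not requirements of the method.

```python
from sklearn.decomposition import PCA

def pca_reduce(x_source, x_target, k=32):
    # Fit the orthogonal projection on source data, then apply it to both samples.
    pca = PCA(n_components=k).fit(x_source)
    return pca.transform(x_source), pca.transform(x_target)
```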

  9. Dimensionality Reduction Techniques: SRP & AE
  • Sparse Random Projection (SRP): project to K dimensions with a sparse random matrix R whose entries are
      R_ij = +sqrt(v/K) with probability 1/(2v), 0 with probability 1 − 1/v, −sqrt(v/K) with probability 1/(2v), with v = sqrt(D).
  • Autoencoders (TAE and UAE): encoder φ : X → L and decoder ψ : L → X, with
      φ, ψ = argmin_{φ, ψ} ‖X − (ψ ∘ φ)(X)‖².
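A sketch of the SRP matrix as defined above; the function name and the use of numpy's Generator.choice are illustrative. In practice scikit-learn's SparseRandomProjection provides an equivalent (and sparser) implementation.

```python
import numpy as np

def srp_reduce(x, k, rng=None):
    """Sparse random projection of (n, D) data to k dimensions.
    Entries of R are +/- sqrt(v / k) with probability 1 / (2v) each and
    0 with probability 1 - 1/v, where v = sqrt(D) as on the slide."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = x.shape
    v = np.sqrt(d)
    values = [np.sqrt(v / k), 0.0, -np.sqrt(v / k)]
    probs = [1 / (2 * v), 1 - 1 / v, 1 / (2 * v)]
    r = rng.choice(values, size=(d, k), p=probs)
    return x @ r
```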

  10. Dimensionality Reduction Techniques: BBSD & Classif
  • Label Classifiers (BBSDs ⊳ and BBSDh ⊲): use a label classifier over C-many classes, taking either its softmax outputs (BBSDs ⊳) or its hard-thresholded predictions (BBSDh ⊲) as the representation.
  • Domain Classifier (Classif ×): explicitly train a domain classifier to discriminate between data from the source and target domains.
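The BBSD variants need no new model: they reuse a label classifier trained on source data as the dimensionality reduction. A sketch, assuming `predict_proba` is any callable returning C class probabilities per input (e.g. a softmaxed network forward pass):

```python
import numpy as np

def bbsds_repr(predict_proba, x):
    # BBSDs: the C-dimensional softmax output is the representation.
    return predict_proba(x)

def bbsdh_repr(predict_proba, x):
    # BBSDh: hard-threshold to the arg-max predicted label (1-dimensional).
    return np.argmax(predict_proba(x), axis=1)
```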

  11. Statistical Hypothesis Testing: Maximum Mean Discrepancy (MMD)
  • Popular kernel-based technique for multivariate two-sample testing.
  • Distinguishes two distributions based on their mean embeddings μ_p and μ_q in a reproducing kernel Hilbert space F:
      MMD(F, p, q) = ‖μ_p − μ_q‖²_F
  • Empirical estimate:
      MMD² = 1/(m(m−1)) Σ_{i=1}^{m} Σ_{j≠i} κ(x_i, x_j) + 1/(n(n−1)) Σ_{i=1}^{n} Σ_{j≠i} κ(x'_i, x'_j) − 2/(mn) Σ_{i=1}^{m} Σ_{j=1}^{n} κ(x_i, x'_j)
  • Kernel: κ(x_1, x_2) = e^{−(1/σ) ‖x_1 − x_2‖²}
  • Used with NoRed, PCA, SRP, TAE, UAE, and BBSDs ⊳.
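A self-contained sketch of the unbiased MMD² estimate above plus a permutation test to obtain a p-value. The permutation procedure and the median-distance heuristic for σ are common practice but assumptions here; they are not spelled out on the slide.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    # kappa(x1, x2) = exp(-||x1 - x2||^2 / sigma), as defined on the slide
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-np.maximum(d2, 0) / sigma)

def mmd2_unbiased(x, y, sigma):
    m, n = len(x), len(y)
    kxx, kyy, kxy = rbf_kernel(x, x, sigma), rbf_kernel(y, y, sigma), rbf_kernel(x, y, sigma)
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
            + (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
            - 2 * kxy.mean())

def mmd_permutation_test(x, y, sigma=None, n_permutations=1000, rng=None):
    """p-value for H_0: p = q by re-splitting the pooled sample at random."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.vstack([x, y])
    if sigma is None:  # median pairwise squared-distance heuristic (an assumption)
        d2 = (np.sum(pooled**2, 1)[:, None] + np.sum(pooled**2, 1)[None, :]
              - 2 * pooled @ pooled.T)
        sigma = np.median(d2)
    observed = mmd2_unbiased(x, y, sigma)
    count = sum(
        mmd2_unbiased(pooled[p[:len(x)]], pooled[p[len(x):]], sigma) >= observed
        for p in (rng.permutation(len(pooled)) for _ in range(n_permutations))
    )
    return (count + 1) / (n_permutations + 1)
```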

  12. Statistical Hypothesis Testing: Kolmogorov-Smirnov + Bonferroni
  • Test each of the K dimensions separately (instead of jointly) using the Kolmogorov-Smirnov (KS) test.
  • The test statistic is the largest difference S between the cumulative distribution functions over all values z:
      S = sup_z |F_p(z) − F_q(z)|
  • Multiple hypothesis testing: we must subsequently combine the p-values from the K-many tests.
  • Problem: we cannot make strong assumptions about the (in)dependence among the tests.
  • Solution: Bonferroni correction:
      • does not assume (in)dependence;
      • bounds the family-wise error rate, i.e. it is a conservative aggregation;
      • rejects H_0 if p_min ≤ α/K.
  • Used with NoRed, PCA, SRP, TAE, UAE, and BBSDs ⊳.
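A sketch of the aggregated univariate test, using SciPy's two-sample KS test per dimension and the Bonferroni threshold α/K; function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_bonferroni(r_source, r_target, alpha=0.05):
    """KS test on each of the K dimensions; reject if the smallest p-value <= alpha / K."""
    k = r_source.shape[1]
    p_values = np.array([ks_2samp(r_source[:, j], r_target[:, j]).pvalue
                         for j in range(k)])
    return p_values.min(), bool(p_values.min() <= alpha / k)
```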

  13. Statistical Hypothesis Testing: Chi-Squared Test
  • Evaluates whether the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.
  • Observed counts form a 2 × C contingency table:

      Sample | Cat 1 | ... | Cat C | Total
      p      | n_p1  | ... | n_pC  | n_p•
      q      | n_q1  | ... | n_qC  | n_q•
      Total  | n_•1  | ... | n_•C  | N_sum

  • The difference is calculated as
      X² = Σ_{i=1}^{2} Σ_{j=1}^{C} (O_ij − E_ij)² / E_ij
    with observed counts O_ij and expected counts E_ij = N_sum · p_i• · p_•j, where p_i• = n_i•/N_sum = (Σ_{j=1}^{C} n_ij)/N_sum and p_•j = n_•j/N_sum = (Σ_{i} n_ij)/N_sum.
  • Under H_0, X² ~ χ²_{C−1}.
  • Used with BBSDh ⊲.
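A sketch for BBSDh: build the 2 × C contingency table of predicted class frequencies and let SciPy compute X² and its p-value. Dropping classes predicted by neither sample is an assumption added here to keep the expected counts positive.

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_label_test(preds_source, preds_target, n_classes, alpha=0.05):
    """Chi-squared test on hard-thresholded predictions (BBSDh)."""
    table = np.vstack([np.bincount(preds_source, minlength=n_classes),
                       np.bincount(preds_target, minlength=n_classes)])
    table = table[:, table.sum(axis=0) > 0]  # drop categories observed in neither sample
    _, p_value, _, _ = chi2_contingency(table)
    return p_value, bool(p_value < alpha)
```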

  14. Statistical Hypothesis Testing: Binomial Test
  • Compare the domain classifier's accuracy (acc) on held-out data to random chance via a binomial test:
      H_0: acc = 0.5  vs  H_A: acc > 0.5
  • Under H_0, the accuracy follows a binomial distribution acc ~ Bin(N_hold, 0.5), where N_hold is the number of held-out samples.
  • Used with Classif ×.
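A sketch of the binomial test on the domain classifier's held-out accuracy; `scipy.stats.binomtest` (SciPy >= 1.7; older versions expose `binom_test`) performs the one-sided test directly.

```python
from scipy.stats import binomtest

def domain_classifier_test(n_correct, n_holdout, alpha=0.05):
    """One-sided test of H_0: acc = 0.5 vs H_A: acc > 0.5."""
    result = binomtest(n_correct, n=n_holdout, p=0.5, alternative="greater")
    return result.pvalue, bool(result.pvalue < alpha)
```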

  15. Obtaining Most Anomalous Samples
  • Recall: our detection framework does not detect outliers but rather aims at capturing top-level shift dynamics.
  • We cannot decide whether any given sample is in- or out-of-distribution.
  • But: we can harness the domain assignments from the domain classifier.
  • It is easy to identify the exemplars which the domain classifier was most confident in assigning to the target domain.
  • Other shift detectors compare entire distributions against each other.
  • Identifying samples whose removal would lead to a large increase in the overall p-value was not successful.
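A sketch of the ranking step, assuming `target_probs` holds the domain classifier's estimate of P(domain = target | x) for each target example; names are illustrative.

```python
import numpy as np

def most_anomalous(x_target, target_probs, top_k=10):
    """Return the top_k target samples the domain classifier assigns to the
    target domain most confidently (the exemplars mentioned above)."""
    order = np.argsort(-np.asarray(target_probs))
    return x_target[order[:top_k]], np.asarray(target_probs)[order[:top_k]]
```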

  16. Determining the Malignancy of a Shift
  • Distribution shifts can cause arbitrarily severe degradation in performance.
  • In practice, distributions shift constantly, and often these changes are benign.
  • Goal: distinguish malignant shifts from benign shifts.
  • Problem: although prediction quality can be assessed easily on source data, we cannot compute the target error directly without labels.
  • Heuristic methods for approximating the target performance:
      • Domain classifier assignments: assess the black-box model's accuracy on the labeled top anomalous samples (implicit shift characterization).
      • Domain expert: get hints on the target accuracy by evaluating the classifier on held-out source data that has been explicitly perturbed by a function determined by a domain expert.

  17. Experiments

  18. Experimental Setup
  • Core experiments: synthetic shifts on the MNIST and CIFAR-10 image datasets.
  • Autoencoders: convolutional architecture with 3 convolutional layers.
  • BBSD and Classif: ResNet-18 architecture.
  • Network training (TAE, BBSDs ⊳, BBSDh ⊲, Classif ×): SGD with momentum in batches of 128 examples over 200 epochs with early stopping.
  • Dimensionality reduction to K = 32 (PCA, SRP, UAE, and TAE), C = 10 (BBSDs ⊳), and 1 (BBSDh ⊲ and Classif ×).
  • Evaluate shift detection at a significance level of α = 0.05.
  • Shift detection performance is averaged over a total of 5 random splits.
  • Randomly split the data into training, validation, and test sets and then apply a particular shift to the test set only.
  • Evaluate the models with various numbers of samples from the test set, s ∈ {10, 20, 50, 100, 200, 500, 1000, 10000}.
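A simplified sketch of this protocol (a two-way split instead of the train/validation/test split used in the deck; `apply_shift` and `detect` are placeholders for a chosen synthetic shift and any reduction-plus-test pipeline returning (p_value, detected)):

```python
import numpy as np

SAMPLE_SIZES = [10, 20, 50, 100, 200, 500, 1000, 10000]

def evaluate_detector(x_clean, apply_shift, detect, n_splits=5, rng=None):
    """Average detection rate per target sample size over random splits."""
    rng = np.random.default_rng() if rng is None else rng
    rates = {s: [] for s in SAMPLE_SIZES}
    for _ in range(n_splits):
        perm = rng.permutation(len(x_clean))
        half = len(x_clean) // 2
        x_source, x_test = x_clean[perm[:half]], x_clean[perm[half:]]
        x_target = apply_shift(x_test)  # shift is applied to the test portion only
        for s in SAMPLE_SIZES:
            idx = rng.choice(len(x_target), size=min(s, len(x_target)), replace=False)
            _, detected = detect(x_source, x_target[idx])
            rates[s].append(detected)
    return {s: float(np.mean(d)) for s, d in rates.items()}
```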
