

  1. Robust Statistics and Generative Adversarial Networks Yuan YAO HKUST 1

  2. Chao GAO (U. Chicago), Jiyi LIU (Yale U.), Weizhi ZHU (HKUST)

  3. Deep Learning is Notoriously Not Robust! • Imperceptible adversarial examples are ubiquitous and can fool neural networks • How can one achieve robustness?

  4. Robust Optimization
     • Traditional training: $\min_\theta J_n\big(\theta, z = (x_i, y_i)_{i=1}^n\big)$
     • e.g., squared or cross-entropy loss as the negative log-likelihood of a logit model
     • Robust optimization (Madry et al., ICLR 2018): $\min_\theta \max_{\|\epsilon_i\| \le \delta} J_n\big(\theta, z = (x_i + \epsilon_i, y_i)_{i=1}^n\big)$
     • robust to any perturbation distribution, yet computationally hard
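
     As a concrete illustration of the inner maximization, here is a minimal projected-gradient (PGD) sketch in PyTorch; the framework, the l-infinity perturbation set, and the names `model`, `loss_fn`, `optimizer` are illustrative assumptions not fixed by the slides, and `eps` plays the role of the perturbation radius above.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=0.3, alpha=0.01, steps=40):
    """Approximate max over ||delta||_inf <= eps of the loss by projected
    gradient ascent on the input perturbation delta."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascent step on the loss
            delta.clamp_(-eps, eps)              # project back onto the eps-ball
        delta.grad.zero_()
    return (x + delta).detach()

def robust_training_step(model, loss_fn, optimizer, x, y):
    """One outer step: minimize the loss at the worst-case perturbation."""
    x_adv = pgd_attack(model, loss_fn, x, y)
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```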

  5. Distributionally Robust Optimization (DRO)
     • Distributionally robust optimization: $\min_\theta \max_{P_\epsilon \in \mathcal{D}} \mathbb{E}_{z \sim P_\epsilon}[J_n(\theta, z)]$
     • $\mathcal{D}$ is a set of ambiguous distributions, e.g. the Wasserstein ambiguity set $\mathcal{D} = \{P_\epsilon : W_2(P_\epsilon, P_n) \le \epsilon\}$, where $P_n$ is the uniform (empirical) distribution of the data
     • here DRO may be reduced to regularized maximum likelihood estimates (Shafieezadeh-Abadeh, Esfahani, Kuhn, NIPS 2015), which are convex optimizations and hence tractable

  6. Wasserstein DRO and Sqrt-Lasso (Jose Blanchet et al. 2016)
     Theorem (Blanchet, Kang, Murthy (2016)). Suppose that
     $c\big((x,y),(x',y')\big) = \|x - x'\|_q^2$ if $y = y'$, and $\infty$ if $y \ne y'$.
     Then, if $1/p + 1/q = 1$,
     $\max_{P:\, D_c(P, P_n) \le \delta} \mathbb{E}_P^{1/2}\big[(Y - \beta^T X)^2\big] = \mathbb{E}_{P_n}^{1/2}\big[(Y - \beta^T X)^2\big] + \sqrt{\delta}\, \|\beta\|_p.$
     Remark 1: This is sqrt-Lasso (Belloni et al. (2011)).
     Remark 2: Uses the RoPA duality theorem and a "judicious choice of $c(\cdot)$".

  7. Certified Robustness of Lasso
     Take $q = \infty$ and $p = 1$, with
     $c\big((x,y),(x',y')\big) = \|x - x'\|_\infty^2$ if $y = y'$, and $\infty$ if $y \ne y'$.
     Then for $P_n' = \frac{1}{n}\sum_i \delta_{(x_i', y_i)}$ with $\|x_i - x_i'\|_\infty \le \delta$,
     $D_c(P_n', P_n) = \int c\big((x,y),(x',y')\big)\, d\pi\big((x,y),(x',y')\big) \le \delta,$
     for small enough $\delta$ and well-separated $x$'s. Sqrt-Lasso
     $\min_\beta \Big\{ \mathbb{E}_{P_n}^{1/2}\big[(Y - \beta^T X)^2\big] + \sqrt{\delta}\, \|\beta\|_1 \Big\} = \min_\beta \max_{P:\, D_c(P, P_n) \le \delta} \mathbb{E}_P^{1/2}\big[(Y - \beta^T X)^2\big]$
     provides a certified robust estimate in terms of Madry's adversarial training, using a convex Wasserstein relaxation.
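
     A minimal sketch of the sqrt-Lasso objective above on synthetic data; cvxpy is assumed as the convex solver (the slides do not prescribe one), and all problem sizes and the radius `delta` are illustrative.

```python
import numpy as np
import cvxpy as cp

# synthetic sparse regression data (illustrative sizes)
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + 0.5 * rng.standard_normal(n)

delta = 0.1                         # radius of the Wasserstein ball
beta = cp.Variable(p)
# sqrt of the mean squared residual plus the sqrt(delta)-weighted l1 penalty
objective = cp.norm(y - X @ beta, 2) / np.sqrt(n) + np.sqrt(delta) * cp.norm(beta, 1)
cp.Problem(cp.Minimize(objective)).solve()
print(np.round(beta.value[:8], 3))  # first entries of the estimate
```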

  8. TV-neighborhood
     • Now how about the TV-uncertainty set $\mathcal{D} = \{P_\epsilon : \mathrm{TV}(P_\epsilon, P_n) \le \epsilon\}$?
     • an example from robust statistics ...

  9. Huber’s Model X 1 , ..., X n ⇠ (1 � ✏ ) P ✓ + ✏ Q contamination proportion arbitrary contamination parameter of interest [Huber 1964] 9

  10. An Example
     $X_1, \ldots, X_n \sim (1 - \epsilon)\, N(\theta, I_p) + \epsilon\, Q$. How to estimate $\theta$?

  11. Robust Maximum-Likelihood Does Not Work!
     $X_1, \ldots, X_n \sim (1 - \epsilon)\, N(\theta, I_p) + \epsilon\, Q$. How to estimate $\theta$?
     Negative log-likelihood: $\ell(\theta, Q) = \sum_{i=1}^n (\theta - X_i)^2 \;\sim\; (1-\epsilon)\, \mathbb{E}_{N(\theta)}(\theta - X)^2 + \epsilon\, \mathbb{E}_Q (\theta - X)^2$
     The sample mean: $\hat\theta_{\mathrm{mean}} = \frac{1}{n}\sum_{i=1}^n X_i = \arg\min_\theta \ell(\theta, Q)$
     $\min_\theta \max_Q \ell(\theta, Q) \ge \max_Q \min_\theta \ell(\theta, Q) = \max_Q \ell(\hat\theta_{\mathrm{mean}}, Q) = \infty$

  12. Medians
     1. Coordinatewise median: $\hat\theta = (\hat\theta_j)$, where $\hat\theta_j = \mathrm{Median}(\{X_{ij}\}_{i=1}^n)$;
     2. Tukey's median: $\hat\theta = \arg\max_{\eta \in \mathbb{R}^p} \min_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{u^T X_i > u^T \eta\}$.
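
     A rough numerical sketch of both estimators on contaminated data, assuming numpy: the coordinatewise median is exact, while the Tukey median is only approximated by scoring the depth of each observed point over random unit directions, since the exact maximization over $\eta$ is hard in high dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 1000, 5, 0.2
X = rng.standard_normal((n, p)) + 2.0           # clean part: N(2*1_p, I_p)
X[rng.random(n) < eps] = 10.0                   # contaminated rows set to a far point

def tukey_depth(eta, X, U):
    """Two-sided depth of eta: for each direction u, take the smaller of the
    fractions {u^T X_i > u^T eta} and {u^T X_i <= u^T eta}; minimize over u."""
    above = (X @ U.T > eta @ U.T).mean(axis=0)
    return np.minimum(above, 1.0 - above).min()

U = rng.standard_normal((500, p))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # random unit directions

coord_median = np.median(X, axis=0)             # coordinatewise median (exact)
depths = np.array([tukey_depth(x, X, U) for x in X])
tukey_est = X[np.argmax(depths)]                # deepest observed point
print(coord_median, tukey_est)
```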

  13. Comparisons
     Coordinatewise median vs. Tukey's median:
     • breakdown point: 1/2 vs. 1/3
     • statistical precision (no contamination): $p/n$ vs. $p/n$
     • statistical precision (with contamination): $p/n + p\,\epsilon^2$ vs. $p/n + \epsilon^2$, the latter being minimax [Chen-Gao-Ren'15]
     • computational complexity: polynomial vs. NP-hard [Amenta et al. '00]
     Note: the R-package for the Tukey median cannot deal with more than 10 dimensions! [https://github.com/ChenMengjie/DepthDescent]

  14. Depth and Statistical Properties 14

  15. Multivariate Location Depth
     $\hat\theta = \arg\max_{\eta \in \mathbb{R}^p} \min_{\|u\|=1} \left\{ \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{u^T X_i > u^T \eta\} \wedge \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{u^T X_i \le u^T \eta\} \right\}$
     Estimator 2: $\hat\theta = \arg\max_{\eta \in \mathbb{R}^p} \min_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{u^T X_i > u^T \eta\}$.
     [Tukey, 1975]

  16. Regression Depth
     model: $y \mid X \sim N(X^T \beta, \sigma^2)$
     embedding: $Xy \mid X \sim N(XX^T \beta, \sigma^2 XX^T)$
     projection: $u^T X y \mid X \sim N(u^T XX^T \beta, \sigma^2 u^T XX^T u)$
     $\hat\beta = \arg\max_{\eta \in \mathbb{R}^p} \min_{u \in \mathbb{R}^p} \left\{ \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{u^T X_i (y_i - X_i^T \eta) > 0\} \wedge \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{u^T X_i (y_i - X_i^T \eta) \le 0\} \right\}$
     [Rousseeuw & Hubert, 1999]
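
     In the same spirit as the location sketch above, here is a small numpy sketch of the empirical regression depth of a candidate coefficient vector, again approximating the minimum over $u$ by random unit directions; the data and the candidate vectors are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + 0.1 * rng.standard_normal(n)

def regression_depth(eta, X, y, U):
    """For each direction u, take the smaller of the fractions of points with
    u^T X_i (y_i - X_i^T eta) > 0 and <= 0; minimize over directions."""
    s = (X @ U.T) * (y - X @ eta)[:, None]      # signed products, shape (n, k)
    pos = (s > 0).mean(axis=0)
    return np.minimum(pos, 1.0 - pos).min()

U = rng.standard_normal((300, p))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # random unit directions

print(regression_depth(beta, X, y, U))          # the true beta is deep (near 1/2)
print(regression_depth(np.zeros(p), X, y, U))   # a poor fit is much shallower
```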

  17. Tukey’s depth is not a special case of regression depth. 17

  18. Multi-task Regression Depth
     $(X, Y) \in \mathbb{R}^p \times \mathbb{R}^m \sim P$; depth of $B \in \mathbb{R}^{p \times m}$:
     population version: $\mathcal{D}_{\mathcal{U}}(B, P) = \inf_{U \in \mathcal{U}} P\big( \langle U^T X,\, Y - B^T X \rangle \ge 0 \big)$
     empirical version: $\mathcal{D}_{\mathcal{U}}\big(B, \{(X_i, Y_i)\}_{i=1}^n\big) = \inf_{U \in \mathcal{U}} \frac{1}{n}\sum_{i=1}^n \mathbb{I}\big\{ \langle U^T X_i,\, Y_i - B^T X_i \rangle \ge 0 \big\}$
     [Mizera, 2002]

  19. Multi-task Regression Depth
     $\mathcal{D}_{\mathcal{U}}(B, P) = \inf_{U \in \mathcal{U}} P\big( \langle U^T X,\, Y - B^T X \rangle \ge 0 \big)$
     • $p = 1$, $X = 1 \in \mathbb{R}$: $\mathcal{D}_{\mathcal{U}}(b, P) = \inf_{u \in \mathcal{U}} P\big( u^T (Y - b) \ge 0 \big)$ (Tukey's location depth)
     • $m = 1$: $\mathcal{D}_{\mathcal{U}}(\beta, P) = \inf_{u \in \mathcal{U}} P\big( u^T X (y - \beta^T X) \ge 0 \big)$ (regression depth)

  20. Multi-task Regression Depth
     Estimation error: for any $\delta > 0$,
     $\sup_{B \in \mathbb{R}^{p \times m}} \big|\mathcal{D}(B, P_n) - \mathcal{D}(B, P)\big| \le C\sqrt{\frac{pm}{n}} + \sqrt{\frac{\log(1/\delta)}{2n}},$
     with probability at least $1 - 2\delta$.
     Contamination error:
     $\sup_{B, Q} \big|\mathcal{D}\big(B, (1-\epsilon)P_{B^*} + \epsilon Q\big) - \mathcal{D}(B, P_{B^*})\big| \le \epsilon$

  21. Multi-task Regression Depth
     $(X, Y) \sim P_B$: $X \sim N(0, \Sigma)$, $Y \mid X \sim N(B^T X, \sigma^2 I_m)$
     $(X_1, Y_1), \ldots, (X_n, Y_n) \sim (1 - \epsilon) P_B + \epsilon\, Q$
     Theorem [G17]. For some $C > 0$,
     $\mathrm{Tr}\big( (\hat B - B)^T \Sigma (\hat B - B) \big) \le C \sigma^2 \Big( \frac{pm}{n} \vee \epsilon^2 \Big),$
     $\|\hat B - B\|_F^2 \le C \frac{\sigma^2}{\kappa^2} \Big( \frac{pm}{n} \vee \epsilon^2 \Big),$
     with high probability, uniformly over $B, Q$.

  22. Covariance Matrix
     $X_1, \ldots, X_n \sim (1 - \epsilon)\, N(0, \Sigma) + \epsilon\, Q$. How to estimate $\Sigma$?

  23. Covariance Matrix 23

  24. Covariance Matrix
     $\mathcal{D}\big(\Gamma, \{X_i\}_{i=1}^n\big) = \min_{\|u\|=1} \min\Big\{ \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^T X_i|^2 \ge u^T \Gamma u\},\ \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^T X_i|^2 < u^T \Gamma u\} \Big\}$
     $\hat\Gamma = \arg\max_{\Gamma \succeq 0} \mathcal{D}\big(\Gamma, \{X_i\}_{i=1}^n\big), \qquad \hat\Sigma = \hat\Gamma / \beta$
     Theorem [CGR15]. For some $C > 0$,
     $\|\hat\Sigma - \Sigma\|_{op}^2 \le C \Big( \frac{p}{n} \vee \epsilon^2 \Big)$
     with high probability, uniformly over $\Sigma, Q$.
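
     A sketch of evaluating this matrix depth for a candidate $\Gamma$ over random unit directions (the maximization over $\Gamma \succeq 0$ is not attempted here). The scaling constant below is the median of $\chi^2_1$, which is one natural choice for $\beta$ under Gaussian data; the slide itself does not spell this out, so treat it as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 5
Sigma = np.diag(np.arange(1.0, p + 1.0))              # true covariance (illustrative)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

def matrix_depth(Gamma, X, U):
    """For each direction u, take the smaller of the fractions of points with
    |u^T X_i|^2 >= u^T Gamma u and < u^T Gamma u; minimize over directions."""
    proj_sq = (X @ U.T) ** 2                          # (n, k)
    thresh = np.einsum('kp,pq,kq->k', U, Gamma, U)    # u^T Gamma u per direction
    ge = (proj_sq >= thresh).mean(axis=0)
    return np.minimum(ge, 1.0 - ge).min()

U = rng.standard_normal((400, p))
U /= np.linalg.norm(U, axis=1, keepdims=True)         # random unit directions

beta_scale = 0.4549                                   # median of chi^2_1 (approx.)
print(matrix_depth(beta_scale * Sigma, X, U))         # a deep candidate: near 1/2
print(matrix_depth(np.eye(p), X, U))                  # a shallower candidate
```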

  25. Summary
     • mean: $\|\cdot\|_2^2$ loss, rate $\frac{p}{n} \vee \epsilon^2$
     • reduced rank regression: $\|\cdot\|_F^2$ loss, rate $\frac{\sigma^2}{\kappa^2}\frac{r(p+m)}{n} \vee \frac{\sigma^2}{\kappa^2}\epsilon^2$
     • Gaussian graphical model: $\|\cdot\|_{\ell_1}^2$ loss, rate $\frac{s^2\log(ep/s)}{n} \vee s\,\epsilon^2$
     • covariance matrix: $\|\cdot\|_{op}^2$ loss, rate $\frac{p}{n} \vee \epsilon^2$
     • sparse PCA: $\|\cdot\|_F^2$ loss, rate $\frac{s\log(ep/s)}{n} \vee \epsilon^2$

  26. Computation 26

  27. Computational Challenges
     $X_1, \ldots, X_n \sim (1 - \epsilon)\, N(\theta, I_p) + \epsilon\, Q$. How to estimate $\theta$?
     Related work: Lai, Rao, Vempala; Diakonikolas, Kamath, Kane, Li, Moitra, Stewart; Balakrishnan, Du, Singh; Dalalyan, Carpentier, Collier, Verzelen
     • Polynomial-time algorithms with minimax optimal statistical precision have been proposed [Diakonikolas et al.'16, Lai et al.'16]
     • but they need information on second or higher order moments
     • and some a priori knowledge about $\epsilon$

  28. Advantages of the Tukey Median
     • A well-defined objective function
     • Adaptive to $\epsilon$ and $\Sigma$
     • Optimal for any elliptical distribution

  29. A practically good algorithm? 29

  30. Generative Adversarial Networks [Goodfellow et al. 2014]
     Note: the R-package for the Tukey median cannot deal with more than 10 dimensions [https://github.com/ChenMengjie/DepthDescent]

  31. Robust Learning of Cauchy Distributions
     Table 4: Comparison of various methods of robust location estimation under Cauchy distributions. Samples are drawn from $(1-\epsilon)\,\mathrm{Cauchy}(0_p, I_p) + \epsilon\, Q$ with $\epsilon = 0.2$, $p = 50$ and various choices of $Q$. Sample size: 50,000. Discriminator net structure: 50-50-25-1. Generator $g_\omega(\xi)$ structure: 48-48-32-24-12-1 with absolute value activation function in the output layer.
     Contamination Q            | JS-GAN (G1)     | JS-GAN (G2)     | Dimension Halving | Iterative Filtering
     Cauchy(1.5*1_p, I_p)       | 0.0664 (0.0065) | 0.0743 (0.0103) | 0.3529 (0.0543)   | 0.1244 (0.0114)
     Cauchy(5.0*1_p, I_p)       | 0.0480 (0.0058) | 0.0540 (0.0064) | 0.4855 (0.0616)   | 0.1687 (0.0310)
     Cauchy(1.5*1_p, 5*I_p)     | 0.0754 (0.0135) | 0.0742 (0.0111) | 0.3726 (0.0530)   | 0.1220 (0.0112)
     Normal(1.5*1_p, 5*I_p)     | 0.0702 (0.0064) | 0.0713 (0.0088) | 0.3915 (0.0232)   | 0.1048 (0.0288)
     • Dimension Halving: [Lai et al.'16] https://github.com/kal2000/AgnosticMeanAndCovarianceCode
     • Iterative Filtering: [Diakonikolas et al.'17] https://github.com/hoonose/robust-filter

  32. f-GAN
     Given a strictly convex function $f$ that satisfies $f(1) = 0$, the $f$-divergence between two probability distributions $P$ and $Q$ is defined by
     $D_f(P \| Q) = \int f\!\left( \frac{p}{q} \right) dQ.$  (8)
     Let $f^*$ be the convex conjugate of $f$. A variational lower bound of (8) is
     $D_f(P \| Q) \ge \sup_{T \in \mathcal{T}} \big[ \mathbb{E}_P\, T(X) - \mathbb{E}_Q\, f^*(T(X)) \big],$  (9)
     where equality holds whenever the class $\mathcal{T}$ contains the function $f'(p/q)$. [Nowozin-Cseke-Tomioka'16]
     $f$-GAN minimizes the variational lower bound (9):
     $\hat{P} = \arg\min_{Q \in \mathcal{Q}} \sup_{T \in \mathcal{T}} \Big[ \frac{1}{n}\sum_{i=1}^n T(X_i) - \mathbb{E}_Q\, f^*(T(X)) \Big],$  (10)
     with i.i.d. observations $X_1, \ldots, X_n \sim P$.
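
     To make the connection to robust estimation concrete, here is a minimal JS-GAN sketch in PyTorch for robust location estimation: the "generator" is just the location parameter $\theta$ of $N(\theta, I_p)$ and the discriminator is a tiny network trained with the logistic (Jensen-Shannon) loss. The architecture, step sizes, contamination, and iteration count are illustrative assumptions, not the settings behind Table 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
p = 10
X = torch.randn(5000, p) + 2.0                       # clean part: N(2*1_p, I_p)
X[torch.rand(5000) < 0.2] = 10.0                     # 20% contamination at a far point

theta = torch.zeros(p, requires_grad=True)           # "generator": location parameter only
D = nn.Sequential(nn.Linear(p, 20), nn.LeakyReLU(), nn.Linear(20, 1))

opt_theta = torch.optim.SGD([theta], lr=0.02)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(2000):
    real = X[torch.randint(0, X.shape[0], (256,))]
    fake = torch.randn(256, p) + theta               # samples from N(theta, I_p)
    # discriminator step: ascend the JS variational lower bound (logistic loss)
    loss_D = F.binary_cross_entropy_with_logits(D(real), torch.ones(256, 1)) + \
             F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(256, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # generator step: move theta so that N(theta, I_p) looks "real" to D
    fake = torch.randn(256, p) + theta
    loss_G = F.binary_cross_entropy_with_logits(D(fake), torch.ones(256, 1))
    opt_theta.zero_grad(); loss_G.backward(); opt_theta.step()

print(theta.detach())                                # estimate of the location (true value: 2*1_p)
```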
