
Robust Statistics and Generative Adversarial Networks
Yuan Yao (HKUST), joint with Chao Gao (Chicago), Jiyu Liu (Yale), and Weizhi Zhu (HKUST)

Deep learning is notoriously not robust: imperceptible adversarial examples are ubiquitous and can fool neural networks.


  1. Multi-task Regression Depth
  Model: $(X, Y) \sim P_B$ with $X \sim N(0, \Sigma)$ and $Y \mid X \sim N(B^T X, \sigma^2 I_m)$; observations $(X_1, Y_1), \ldots, (X_n, Y_n) \sim (1 - \epsilon) P_B + \epsilon Q$.
  Theorem [G17]. For some $C > 0$,
  $$\mathrm{Tr}\big((\hat{B} - B)^T \Sigma (\hat{B} - B)\big) \le C \sigma^2 \Big( \frac{pm}{n} \vee \epsilon^2 \Big), \qquad \|\hat{B} - B\|_F^2 \le C \sigma^2 \kappa^2 \Big( \frac{pm}{n} \vee \epsilon^2 \Big),$$
  with high probability, uniformly over $B, Q$.

  2. Covariance Matrix
  $X_1, \ldots, X_n \sim (1 - \epsilon)\, N(0, \Sigma) + \epsilon Q$. How to estimate $\Sigma$?
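
As a concrete illustration of Huber's contamination model above, the following NumPy sketch draws such a sample; the covariance $\Sigma$ and the contamination $Q$ chosen here are arbitrary illustrative assumptions, not the ones used in the talk.

```python
import numpy as np

def sample_contaminated(n, p, eps, rng=None):
    """Draw n points from (1 - eps) * N(0, Sigma) + eps * Q."""
    rng = np.random.default_rng(rng)
    Sigma = np.eye(p)                              # illustrative covariance
    from_Q = rng.random(n) < eps                   # which points come from Q
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    # Illustrative contamination Q: a far-away Gaussian cluster.
    X[from_Q] = rng.multivariate_normal(5.0 * np.ones(p), Sigma,
                                        size=from_Q.sum())
    return X, from_Q

X, is_outlier = sample_contaminated(n=1000, p=10, eps=0.2, rng=0)
```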

  9. Covariance Matrix: Matrix Depth
  $$D\big(\Gamma, \{X_i\}_{i=1}^n\big) = \min_{\|u\|=1} \min\left\{ \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^T X_i|^2 \ge u^T \Gamma u\},\ \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^T X_i|^2 < u^T \Gamma u\} \right\}$$
  $$\hat{\Sigma} = \hat{\Gamma} / \beta, \qquad \hat{\Gamma} = \arg\max_{\Gamma \succeq 0} D\big(\Gamma, \{X_i\}_{i=1}^n\big),$$
  where $\beta$ is a fixed scaling constant (the median of $\chi^2_1$).
  Theorem [CGR15]. For some $C > 0$,
  $$\|\hat{\Sigma} - \Sigma\|_{\mathrm{op}}^2 \le C \Big( \frac{p}{n} \vee \epsilon^2 \Big)$$
  with high probability, uniformly over $\Sigma, Q$.
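
Evaluating the depth $D(\Gamma, \{X_i\})$ exactly requires minimizing over all unit directions $u$, which is already nontrivial; the sketch below is only a Monte Carlo approximation over random directions, an illustrative simplification rather than the algorithm used in the talk. Maximizing this depth over $\Gamma$, i.e. computing $\hat{\Gamma}$, is the hard part that motivates the "Computation" section below.

```python
import numpy as np

def matrix_depth(Gamma, X, n_directions=2000, rng=None):
    """Monte Carlo approximation of the matrix depth D(Gamma, {X_i}).

    For each unit direction u it takes
        min( mean 1{|u^T X_i|^2 >= u^T Gamma u},
             mean 1{|u^T X_i|^2 <  u^T Gamma u} )
    and then minimizes over the sampled directions.  Sampling only a finite
    set of directions gives an upper bound on the exact depth.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    U = rng.standard_normal((n_directions, p))
    U /= np.linalg.norm(U, axis=1, keepdims=True)     # unit directions
    proj2 = (X @ U.T) ** 2                            # |u^T X_i|^2, shape (n, m)
    thresh = np.einsum('mp,pq,mq->m', U, Gamma, U)    # u^T Gamma u, shape (m,)
    frac_above = (proj2 >= thresh).mean(axis=0)
    return np.minimum(frac_above, 1.0 - frac_above).min()
```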

  12. Summary
  Loss and error rate under $\epsilon$-contamination:
  • mean: $\|\cdot\|_2^2$, rate $\sigma^2\big(\frac{p}{n} \vee \epsilon^2\big)$
  • reduced-rank regression: $\|\cdot\|_F^2$, rate $\sigma^2 \kappa^2\big(\frac{r(p+m)}{n} \vee \epsilon^2\big)$
  • Gaussian graphical model: $\|\cdot\|_{\ell_1}^2$, rate $\frac{s^2 \log(ep/s)}{n} \vee s\epsilon^2$
  • covariance matrix: $\|\cdot\|_{\mathrm{op}}^2$, rate $\frac{p}{n} \vee \epsilon^2$
  • sparse PCA: $\|\cdot\|_F^2$, rate $\frac{s \log(ep/s)}{n} \vee \epsilon^2$

  14. Computation

  15. Computational Challenges
  $X_1, \ldots, X_n \sim (1 - \epsilon)\, N(\theta, I_p) + \epsilon Q$. How to estimate $\theta$?
  Prior work: Lai, Rao, Vempala; Diakonikolas, Kamath, Kane, Li, Moitra, Stewart; Balakrishnan, Du, Singh; Dalalyan, Carpentier, Collier, Verzelen.
  • Polynomial-time algorithms with minimax-optimal statistical precision have been proposed [Diakonikolas et al. '16, Lai et al. '16],
  • but they need information on second- or higher-order moments,
  • and some prior knowledge of $\epsilon$.

  18. Advantages of Tukey Median
  • A well-defined objective function
  • Adaptive to $\epsilon$ and $\Sigma$
  • Optimal for any elliptical distribution

  22. A practically good algorithm?

  23. Generative Adversarial Networks [Goodfellow et al. 2014]
  Note: the R package for the Tukey median cannot handle more than 10 dimensions [https://github.com/ChenMengjie/DepthDescent].

  24. Robust Learning of Cauchy Distributions
  Table 4: Comparison of various methods of robust location estimation under Cauchy distributions. Samples are drawn from $(1 - \epsilon)\,\mathrm{Cauchy}(0_p, I_p) + \epsilon Q$ with $\epsilon = 0.2$, $p = 50$ and various choices of $Q$. Sample size: 50,000. Discriminator net structure: 50-50-25-1. Generator $g_\omega(\xi)$ structure: 48-48-32-24-12-1 with absolute-value activation function in the output layer.

  Contamination Q               | JS-GAN (G1)     | JS-GAN (G2)     | Dimension Halving | Iterative Filtering
  Cauchy(1.5 * 1_p, I_p)        | 0.0664 (0.0065) | 0.0743 (0.0103) | 0.3529 (0.0543)   | 0.1244 (0.0114)
  Cauchy(5.0 * 1_p, I_p)        | 0.0480 (0.0058) | 0.0540 (0.0064) | 0.4855 (0.0616)   | 0.1687 (0.0310)
  Cauchy(1.5 * 1_p, 5 * I_p)    | 0.0754 (0.0135) | 0.0742 (0.0111) | 0.3726 (0.0530)   | 0.1220 (0.0112)
  Normal(1.5 * 1_p, 5 * I_p)    | 0.0702 (0.0064) | 0.0713 (0.0088) | 0.3915 (0.0232)   | 0.1048 (0.0288)

  • Dimension Halving: [Lai et al. '16] https://github.com/kal2000/AgnosticMeanAndCovarianceCode
  • Iterative Filtering: [Diakonikolas et al. '17] https://github.com/hoonose/robust-filter
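
For concreteness, here is a minimal PyTorch sketch of networks with the layer widths quoted in the table caption. The ReLU hidden activations and the sigmoid output of the discriminator are assumptions made only for illustration; the slide specifies the widths and the absolute-value output activation of the generator, nothing more.

```python
import torch.nn as nn

class Abs(nn.Module):
    def forward(self, x):
        return x.abs()

def mlp(widths, out_act=None):
    """Fully connected net with the given layer widths, e.g. [50, 50, 25, 1]."""
    layers = []
    for i in range(len(widths) - 1):
        layers.append(nn.Linear(widths[i], widths[i + 1]))
        if i < len(widths) - 2:
            layers.append(nn.ReLU())          # hidden activation: an assumption
    if out_act is not None:
        layers.append(out_act())
    return nn.Sequential(*layers)

discriminator = mlp([50, 50, 25, 1], out_act=nn.Sigmoid)   # 50-50-25-1
generator = mlp([48, 48, 32, 24, 12, 1], out_act=Abs)      # 48-48-32-24-12-1, |.| output
```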

  25. f-GAN
  Given a strictly convex function $f$ that satisfies $f(1) = 0$, the $f$-divergence between two probability distributions $P$ and $Q$ is defined by
  $$D_f(P \| Q) = \int f\Big(\frac{p}{q}\Big)\, dQ. \quad (8)$$
  Let $f^*$ be the convex conjugate of $f$. A variational lower bound of (8) is
  $$D_f(P \| Q) \ge \sup_{T \in \mathcal{T}} \big[ \mathbb{E}_P\, T(X) - \mathbb{E}_Q\, f^*(T(X)) \big], \quad (9)$$
  where equality holds whenever the class $\mathcal{T}$ contains the function $f'(p/q)$. [Nowozin-Cseke-Tomioka '16]
  $f$-GAN minimizes the variational lower bound (9):
  $$\hat{P} = \arg\min_{Q \in \mathcal{Q}} \sup_{T \in \mathcal{T}} \Big[ \frac{1}{n} \sum_{i=1}^n T(X_i) - \mathbb{E}_Q\, f^*(T(X)) \Big], \quad (10)$$
  with i.i.d. observations $X_1, \ldots, X_n \sim P$.
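
As a connecting step (standard from Nowozin-Cseke-Tomioka, not spelled out on the slide), specializing (9) to the Jensen-Shannon choice of $f$ used later in the talk,
$$f(x) = x \log x - (x+1)\log(x+1), \qquad f'(x) = \log\frac{x}{x+1}, \qquad f^*(t) = -\log(1 - e^t) \ \ (t < 0),$$
and writing $T(x) = \log D(x)$ with $D(x) \in (0, 1)$, the bound becomes
$$D_f(P \| Q) \ge \sup_{D} \big[ \mathbb{E}_P \log D(X) + \mathbb{E}_Q \log(1 - D(X)) \big],$$
which is exactly the GAN discriminator objective of Goodfellow et al.; its supremum equals $2\,\mathrm{JS}(P, Q) - \log 4$.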

  26. From f-GAN to Tukey's Median: f-Learning
  Consider the special case
  $$\mathcal{T} = \Big\{ f'\Big(\frac{\tilde{q}}{q}\Big) : \tilde{q} \in \widetilde{\mathcal{Q}} \Big\}, \quad (11)$$
  which is tight if $P \in \widetilde{\mathcal{Q}}$. The sample version leads to the following $f$-learning:
  $$\hat{P} = \arg\min_{Q \in \mathcal{Q}} \sup_{\widetilde{Q} \in \widetilde{\mathcal{Q}}} \Big[ \frac{1}{n} \sum_{i=1}^n f'\Big(\frac{\tilde{q}(X_i)}{q(X_i)}\Big) - \mathbb{E}_Q\, f^*\Big( f'\Big(\frac{\tilde{q}(X)}{q(X)}\Big)\Big) \Big]. \quad (12)$$
  • If $f(x) = x \log x$ and $\widetilde{\mathcal{Q}} = \mathcal{Q}$, (12) $\Rightarrow$ Maximum Likelihood Estimate.
  • If $f(x) = (x - 1)_+$, then $D_f(P \| Q) = \frac{1}{2}\int |p - q|$ is the TV-distance, $f^*(t) = t\,\mathbb{I}\{0 \le t \le 1\}$, and $f$-GAN $\Rightarrow$ TV-GAN.
  • If $\mathcal{Q} = \{N(\eta, I_p) : \eta \in \mathbb{R}^p\}$ and $\widetilde{\mathcal{Q}} = \{N(\tilde{\eta}, I_p) : \|\tilde{\eta} - \eta\| \le r\}$ with $r \to 0$, (12) $\Rightarrow$ Tukey's Median.
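
A quick check of the first bullet (the computation is not spelled out on the slide): for $f(x) = x\log x$ one has $f'(x) = 1 + \log x$ and $f^*(t) = e^{t-1}$, hence $f^*(f'(x)) = x$, and the bracket in (12) equals
$$1 + \frac{1}{n}\sum_{i=1}^n \log\frac{\tilde{q}(X_i)}{q(X_i)} - \mathbb{E}_Q\,\frac{\tilde{q}(X)}{q(X)} = \frac{1}{n}\sum_{i=1}^n \log \tilde{q}(X_i) - \frac{1}{n}\sum_{i=1}^n \log q(X_i),$$
since $\mathbb{E}_Q[\tilde{q}/q] = \int \tilde{q} = 1$. With $\widetilde{\mathcal{Q}} = \mathcal{Q}$, the inner supremum is attained by the maximum likelihood fit over $\mathcal{Q}$, and the outer minimization over $q$ then also returns the MLE.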

  27. f-Learning
  $f$-divergence: $D_f(P \| Q) = \int f\big(\frac{p}{q}\big)\, dQ$, where $f(u) = \sup_t \big( tu - f^*(t) \big)$.
  Variational representation:
  $$D_f(P \| Q) = \sup_T \big[ \mathbb{E}_{X \sim P}\, T(X) - \mathbb{E}_{X \sim Q}\, f^*(T(X)) \big],$$
  with optimal $T(x) = f'\big(\frac{p(x)}{q(x)}\big)$. In particular, taking $T$ of the form $f'\big(\frac{d\widetilde{Q}}{dQ}\big)$,
  $$D_f(P \| Q) = \sup_{\widetilde{Q}} \Big\{ \mathbb{E}_{X \sim P}\, f'\Big(\frac{d\widetilde{Q}(X)}{dQ(X)}\Big) - \mathbb{E}_{X \sim Q}\, f^*\Big( f'\Big(\frac{d\widetilde{Q}(X)}{dQ(X)}\Big) \Big) \Big\}.$$

  33. f-Learning
  $f$-GAN: $\displaystyle \min_{Q \in \mathcal{Q}} \max_{T \in \mathcal{T}} \Big\{ \frac{1}{n}\sum_{i=1}^n T(X_i) - \int f^*(T)\, dQ \Big\}$
  $f$-Learning: $\displaystyle \min_{Q \in \mathcal{Q}} \max_{\widetilde{Q} \in \widetilde{\mathcal{Q}}} \Big\{ \frac{1}{n}\sum_{i=1}^n f'\Big(\frac{\tilde{q}(X_i)}{q(X_i)}\Big) - \int f^*\Big(f'\Big(\frac{\tilde{q}}{q}\Big)\Big)\, dQ \Big\}$
  [Nowozin, Cseke, Tomioka]

  37. f-Learning
  • Jensen-Shannon, $f(x) = x\log x - (x+1)\log(x+1)$: GAN [Goodfellow et al.]
  • Kullback-Leibler, $f(x) = x\log x$: MLE
  • Squared Hellinger, $f(x) = 2 - 2\sqrt{x}$: $\rho$-estimation [Baraud and Birgé]
  • Total Variation, $f(x) = (x - 1)_+$: depth

  42. TV-Learning
  $$\min_{Q \in \mathcal{Q}} \max_{\widetilde{Q} \in \widetilde{\mathcal{Q}}} \Big\{ \frac{1}{n}\sum_{i=1}^n \mathbb{I}\Big\{\frac{\tilde{q}(X_i)}{q(X_i)} \ge 1\Big\} - Q\Big(\frac{\tilde{q}}{q} \ge 1\Big) \Big\}$$
  With $\mathcal{Q} = \big\{N(\theta, I_p) : \theta \in \mathbb{R}^p\big\}$ and $\widetilde{\mathcal{Q}} = \big\{N(\tilde{\theta}, I_p) : \tilde{\theta} \in \mathcal{N}_r(\theta)\big\}$, letting $r \to 0$ recovers the Tukey depth (Tukey's median):
  $$\max_{\theta \in \mathbb{R}^p} \min_{\|u\|=1} \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{u^T X_i \ge u^T \theta\}.$$
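
Exact computation of the Tukey depth is combinatorial, and the R package mentioned earlier cannot handle more than 10 dimensions. The sketch below is only a random-direction Monte Carlo approximation, included for illustration; it is not the talk's algorithm.

```python
import numpy as np

def tukey_depth(theta, X, n_directions=2000, rng=None):
    """Approximate Tukey (halfspace) depth of theta w.r.t. the sample X.

    depth(theta) = min over unit u of (1/n) * #{i : u^T X_i >= u^T theta}.
    Random unit directions only upper-bound the exact minimum.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    U = rng.standard_normal((n_directions, p))
    U /= np.linalg.norm(U, axis=1, keepdims=True)          # unit directions
    frac_above = ((X - theta) @ U.T >= 0).mean(axis=0)     # per direction
    return frac_above.min()

# Tukey's median maximizes this depth over theta; even approximately, that
# outer maximization is the expensive step.
```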

  46. TV-Learning
  The same objective with $\mathcal{Q} = \big\{N(0, \Sigma) : \Sigma \in \mathbb{R}^{p \times p}\big\}$ and $\widetilde{\mathcal{Q}} = \big\{N(0, \widetilde{\Sigma}) : \widetilde{\Sigma} = \Sigma + r u u^T, \|u\| = 1\big\}$, letting $r \to 0$, yields a quantity related to the matrix depth:
  $$\max_{\Sigma} \min_{\|u\|=1} \left[ \left( \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^T X_i|^2 \le u^T \Sigma u\} - P(\chi_1^2 \le 1) \right) \wedge \left( \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{|u^T X_i|^2 > u^T \Sigma u\} - P(\chi_1^2 > 1) \right) \right].$$

  50. [Diagram] f-Learning / f-GAN as a bridge between the robust statistics community and the deep learning community: practically good algorithms flow from the deep learning side, and a theoretical foundation from the robust statistics side.

  54. TV-GAN
  $$\hat{\theta} = \arg\min_{\eta} \sup_{w, b} \Big[ \frac{1}{n}\sum_{i=1}^n \frac{1}{1 + e^{-w^T X_i - b}} - \mathbb{E}_{X \sim N(\eta, I_p)} \frac{1}{1 + e^{-w^T X - b}} \Big]$$
  The discriminator $\frac{1}{1 + e^{-w^T x - b}}$ is a logistic regression classifier.
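
A minimal sketch of how this min-sup problem could be attacked with alternating gradient steps is given below (PyTorch). The warm start, the reparameterized sampling from $N(\eta, I_p)$, and all hyperparameters are illustrative assumptions, not the talk's settings; the experiments in Table 4 above use JS-GAN rather than this TV objective.

```python
import torch

def tv_gan_mean(X, n_steps=2000, d_steps=5, lr=0.02, seed=0):
    """TV-GAN sketch for robust mean estimation (objective of slide 54):
    min over eta, sup over (w, b) of
        mean_i sigmoid(w^T X_i + b) - E_{X ~ N(eta, I_p)} sigmoid(w^T X + b).
    """
    torch.manual_seed(seed)
    n, p = X.shape
    eta = X.median(dim=0).values.clone().requires_grad_(True)  # coordinatewise-median warm start
    w = torch.zeros(p, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt_eta = torch.optim.SGD([eta], lr=lr)
    opt_wb = torch.optim.SGD([w, b], lr=lr)

    def objective(center):
        Z = center + torch.randn(n, p)       # reparameterized draws from N(center, I_p)
        return torch.sigmoid(X @ w + b).mean() - torch.sigmoid(Z @ w + b).mean()

    for _ in range(n_steps):
        for _ in range(d_steps):             # inner sup over the logistic discriminator
            loss_d = -objective(eta.detach())
            opt_wb.zero_grad()
            loss_d.backward()
            opt_wb.step()
        loss_g = objective(eta)              # outer min over eta
        opt_eta.zero_grad()
        loss_g.backward()
        opt_eta.step()
    return eta.detach()
```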
