Robust Estimation and Generative Adversarial Networks

Weizhi ZHU, Hong Kong University of Science and Technology (wzhuai@ust.hk)
April 3, 2019

Based on:
[GLYZ18] Robust Estimation and Generative Adversarial Nets
[GYZ19] Generative Adversarial Nets for Robust Scatter Estimation: A Proper Scoring Rule Perspective
Huber's Contamination Model

Huber's contamination model [Huber, 1964]:
\[ P = (1 - \epsilon) P_\theta + \epsilon Q. \]
Strong contamination model [Diakonikolas et al., 2016a]:
\[ \mathrm{TV}(P, P_\theta) \le \epsilon. \]
Can we recover $\theta$ from data drawn from $P$ under arbitrary, unknown contamination $(\epsilon, Q)$?
Example: Robust Mean Estimation

Let us first consider robust estimation of the location parameter $\theta$ of a normal distribution,
\[ X_1, \dots, X_n \sim (1 - \epsilon) N(\theta, I_p) + \epsilon Q. \]
Coordinate-wise median.
Tukey median [Tukey, 1978]:
\[ \hat{\theta} = \operatorname*{argmax}_{\eta \in \mathbb{R}^p} \min_{\|u\|_2 = 1} \left\{ \sum_{i=1}^n \mathbb{1}\{u^T X_i > u^T \eta\} \wedge \sum_{i=1}^n \mathbb{1}\{u^T X_i \le u^T \eta\} \right\}. \]
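To make the two estimators concrete, here is a small numerical sketch (my own illustration, not from the slides): it draws contaminated data, computes the coordinate-wise median, and evaluates a Monte-Carlo approximation of the Tukey depth objective at two candidate points; the exact maximizer is NP-hard to compute (see the next slide).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 1000, 5, 0.2
theta = np.ones(p)

# Huber contamination: (1 - eps) N(theta, I_p) + eps Q, with Q a far-away Gaussian.
m = rng.binomial(n, eps)
X = np.vstack([rng.normal(theta, 1.0, size=(n - m, p)),
               rng.normal(10.0, 1.0, size=(m, p))])

coord_median = np.median(X, axis=0)

def tukey_depth(eta, X, n_dirs=2000, rng=rng):
    """Approximate min over random unit directions u of
    (# of i with u^T X_i > u^T eta) AND-min (# of i with u^T X_i <= u^T eta)."""
    U = rng.normal(size=(n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    proj = (X - eta) @ U.T                      # shape (n, n_dirs)
    above = (proj > 0).sum(axis=0)
    below = (proj <= 0).sum(axis=0)
    return int(np.minimum(above, below).min())

print("coordinate-wise median:", np.round(coord_median, 2))
print("approx. Tukey depth of the median:", tukey_depth(coord_median, X))
print("approx. Tukey depth of the sample mean:", tukey_depth(X.mean(axis=0), X))
```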
Comparison

                                                   Coordinate-wise median        Tukey median
Statistical rate (no contamination)                $p/n$                          $p/n$
Statistical rate (Huber's $\epsilon$-contamination)  $p/n \vee p\epsilon^2$         $p/n \vee \epsilon^2$  [minimax]
Computational complexity                           Polynomial                     NP-hard
Example: Robust Covariance Estimation

We can also estimate the covariance matrix $\Sigma$ of a normal distribution,
\[ X_1, \dots, X_n \sim (1 - \epsilon) N(0, \Sigma) + \epsilon Q. \]
Covariance depth [Chen-Gao-Ren, 2017]:
\[ \hat{\Gamma} = \operatorname*{argmax}_{\Gamma \succ 0} \min_{\|u\|_2 = 1} \left\{ \sum_{i=1}^n \mathbb{1}\{|u^T X_i|^2 \le u^T \Gamma u\} \wedge \sum_{i=1}^n \mathbb{1}\{|u^T X_i|^2 > u^T \Gamma u\} \right\}, \tag{1} \]
\[ \hat{\Sigma} = \hat{\Gamma}/\beta, \qquad P\big(N(0,1) < \sqrt{\beta}\big) = 3/4. \]
Then $\|\hat{\Sigma} - \Sigma\|_{\mathrm{op}}^2 \le C\left(\frac{p}{n} + \epsilon^2\right)$ with high probability, uniformly over $\Sigma$ and $Q$.
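Analogously, here is a Monte-Carlo sketch of the covariance-depth objective (1) (my own illustration, not from the slides; the $3/4$-quantile rescaling follows the display above, and scipy is assumed to be available for the normal quantile).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, p, eps = 2000, 5, 0.1
Sigma = np.diag(np.arange(1.0, p + 1.0))
m = rng.binomial(n, eps)
X = np.vstack([rng.multivariate_normal(np.zeros(p), Sigma, size=n - m),
               rng.multivariate_normal(np.zeros(p), 100 * np.eye(p), size=m)])

def cov_depth(Gamma, X, n_dirs=2000, rng=rng):
    """min over random unit u of
    #{|u^T X_i|^2 <= u^T Gamma u}  vs  #{|u^T X_i|^2 > u^T Gamma u}, take the smaller."""
    U = rng.normal(size=(n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    proj2 = (X @ U.T) ** 2                             # (n, n_dirs)
    thresh = np.einsum('dp,pq,dq->d', U, Gamma, U)     # u^T Gamma u per direction
    below = (proj2 <= thresh).sum(axis=0)
    above = (proj2 > thresh).sum(axis=0)
    return int(np.minimum(below, above).min())

beta = norm.ppf(0.75) ** 2     # P(N(0,1) < sqrt(beta)) = 3/4, the median of chi^2_1
print("depth at beta * Sigma (population maximizer):", cov_depth(beta * Sigma, X))
print("depth at beta * sample covariance:           ", cov_depth(beta * np.cov(X.T), X))
```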
Computational Complexity

Polynomial-time algorithms with nearly minimax optimal statistical precision have been proposed [Lai et al., 2016; Diakonikolas et al., 2018].
- They require prior knowledge of $\epsilon$.
- They need some moment constraints.

Advantages of the depth estimator:
- Does not need prior knowledge of $\epsilon$.
- Adaptive to any elliptical distribution.
- A well-defined objective function.
- But is there any feasible algorithm in practice?
f-divergence

Given a convex function $f$ satisfying $f(1) = 0$, the $f$-divergence of $P$ from $Q$ is defined as
\[ D_f(P \| Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx. \tag{2} \]
Let $f^*$ be the convex conjugate of $f$; then a variational lower bound of (2) is given by
\[ D_f(P \| Q) = \int q(x) \sup_{t \in \mathrm{dom}\, f^*} \left\{ t\, \frac{p(x)}{q(x)} - f^*(t) \right\} dx \ge \sup_{T \in \mathcal{T}} \left\{ \mathbb{E}_{X \sim P}[T(X)] - \mathbb{E}_{X \sim Q}[f^*(T(X))] \right\}. \tag{3} \]
Equality holds in (3) if $f'\!\left(\frac{p}{q}\right) \in \mathcal{T}$.
Restricting $T$ to likelihood-ratio discriminators $f'(\tilde{q}/q)$ with $\tilde{Q}$ ranging over a family $\tilde{\mathcal{Q}}$, and replacing $\mathbb{E}_P$ by the empirical average,
\[ D_f(P \| Q) \ge \max_{\tilde{Q} \in \tilde{\mathcal{Q}}} \left\{ \frac{1}{n} \sum_{i=1}^n f'\!\left(\frac{\tilde{q}(X_i)}{q(X_i)}\right) - \mathbb{E}_{X \sim Q}\, f^*\!\left(f'\!\left(\frac{\tilde{q}(X)}{q(X)}\right)\right) \right\}. \tag{4} \]
f-GAN and f-Learning

f-Learning. Let $\tilde{\mathcal{Q}}$ be a distribution family,
\[ \hat{P} = \operatorname*{argmin}_{Q \in \mathcal{Q}} \max_{\tilde{Q} \in \tilde{\mathcal{Q}}} \left\{ \frac{1}{n} \sum_{i=1}^n f'\!\left(\frac{\tilde{q}(X_i)}{q(X_i)}\right) - \mathbb{E}_{X \sim Q}\, f^*\!\left(f'\!\left(\frac{\tilde{q}(X)}{q(X)}\right)\right) \right\}. \]
f-GAN [Nowozin et al., 2016],
\[ \hat{P} = \operatorname*{argmin}_{Q \in \mathcal{Q}} \max_{T \in \mathcal{T}} \left\{ \frac{1}{n} \sum_{i=1}^n T(X_i) - \mathbb{E}_{X \sim Q}[f^*(T(X))] \right\}, \]
where $\mathcal{T}$ is usually parametrized by a neural network.
- f-GAN smooths the objective function of f-Learning.
- The f-divergence is robust.
- Practical, efficient algorithms exist to solve it (see the sketch below).
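As a minimal illustration of the f-GAN objective (my own sketch, not from the papers; the JS conjugate $f^*(t) = -\log(2 - e^t)$ follows Nowozin et al., 2016), the snippet below evaluates $\frac{1}{n}\sum_i T(X_i) - \mathbb{E}_{X \sim Q} f^*(T(X))$ by Monte Carlo for a fixed, hand-picked discriminator $T$; a GAN would maximize this over $T$ and minimize over $Q$.

```python
import numpy as np

def f_gan_objective(X, sample_Q, T, f_star, n_mc=10000, rng=None):
    """(1/n) sum_i T(X_i) - E_{X ~ Q}[f*(T(X))], with E_Q estimated by Monte Carlo."""
    rng = rng or np.random.default_rng(0)
    Xq = sample_Q(n_mc, rng)                 # draws from the generator Q
    return T(X).mean() - f_star(T(Xq)).mean()

# JS choice: f(x) = x log x - (x+1) log((1+x)/2), with conjugate
# f*(t) = -log(2 - exp(t)) on the domain t < log 2.
js_f_star = lambda t: -np.log(2.0 - np.exp(t))

rng = np.random.default_rng(0)
X = rng.normal(1.0, 1.0, size=(500, 1))                    # data roughly from N(1, 1)
sample_Q = lambda m, r: r.normal(0.0, 1.0, size=(m, 1))    # generator Q = N(0, 1)

# A fixed (not optimized) discriminator T(x) = log sigmoid(2x - 1), which stays in dom f*.
T = lambda Z: np.log(1.0 / (1.0 + np.exp(-(2.0 * Z[:, 0] - 1.0))))

print(f_gan_objective(X, sample_Q, T, js_f_star))
```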
Example

$f(x) = x \log x$ (KL-divergence): if $p \in \tilde{\mathcal{Q}}$ (or $f'(p/q) \in \mathcal{T}$), then KL-Learning (or KL-GAN) becomes the maximum likelihood estimator.

$f(x) = x \log x - (x+1) \log\frac{1+x}{2}$ (JS-divergence), which leads to the original JS-GAN [Goodfellow et al., 2014],
\[ \hat{P} = \operatorname*{argmin}_{Q \in \mathcal{Q}} \max_{T \in \mathcal{T}} \left\{ \frac{1}{n} \sum_{i=1}^n \log\big(\mathrm{sigmoid}(T(X_i))\big) + \mathbb{E}_{X \sim Q} \log\big(1 - \mathrm{sigmoid}(T(X))\big) \right\}. \]
Example (Continued)

$f(x) = (x-1)_+$ (TV-divergence) and $f^*(t) = t$, $0 \le t \le 1$.
- Taking $\mathcal{Q} = \{N(\theta, I_p) : \theta \in \mathbb{R}^p\}$ and $\tilde{\mathcal{Q}}(\theta, r) = \{N(\tilde{\theta}, I_p) : \|\tilde{\theta} - \theta\|_2 \le r\}$, TV-Learning is defined as
\[ \min_{Q \in \mathcal{Q}} \max_{\tilde{Q} \in \tilde{\mathcal{Q}}(\theta, r)} \left\{ \frac{1}{n} \sum_{i=1}^n \mathbb{1}\!\left\{\frac{\tilde{q}(X_i)}{q(X_i)} \ge 1\right\} - Q\!\left(\frac{\tilde{q}}{q} \ge 1\right) \right\}. \]
- As $r \to 0$, TV-Learning recovers the Tukey median,
\[ \max_{\eta \in \mathbb{R}^p} \min_{\|u\|_2 = 1} \sum_{i=1}^n \mathbb{1}\{u^T X_i > u^T \eta\}. \]
- With $T$ parametrized by a class of neural networks, TV-GAN is defined as (see the sketch below)
\[ \hat{P} = \operatorname*{argmin}_{Q \in \mathcal{Q}} \max_{T \in \mathcal{T}} \left\{ \frac{1}{n} \sum_{i=1}^n \mathrm{sigmoid}(T(X_i)) - \mathbb{E}_{X \sim Q}[\mathrm{sigmoid}(T(X))] \right\}. \]
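For comparison with JS-GAN, the TV-GAN inner objective only differs in how the discriminator output is scored; a minimal sketch (my own hypothetical helper, not the papers' code):

```python
import torch

def tv_gan_objective(D, X_real, X_fake):
    """(1/n) sum_i sigmoid(T(X_i)) - E_{X ~ Q}[sigmoid(T(X))], with D = sigmoid(T(.))."""
    return D(X_real).mean() - D(X_fake).mean()

# Tiny usage example with an untrained sigmoid-output discriminator.
D = torch.nn.Sequential(torch.nn.Linear(3, 8), torch.nn.ReLU(),
                        torch.nn.Linear(8, 1), torch.nn.Sigmoid())
print(tv_gan_objective(D, torch.randn(100, 3) + 1.0, torch.randn(100, 3)))
```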
Proper Scoring Rule

$\{S(\cdot, 1), S(\cdot, 0)\}$ is the forecaster's reward if the player quotes $t$ when event 1 or 0 occurs.
$S(t; p) = p\, S(t, 1) + (1-p)\, S(t, 0)$ is the expected reward when the event occurs with probability $p$.
$\{S(\cdot, 1), S(\cdot, 0)\}$ is a proper scoring rule if $S(p; p) \ge S(t; p)$ for all $t \in [0, 1]$.
(Savage representation) $S$ is proper iff there exists a convex function $G(\cdot)$ such that
\[ S(t, 1) = G(t) + (1-t) G'(t), \qquad S(t, 0) = G(t) - t G'(t). \]
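For instance (a standard check, not on the original slide), the log score of the next slide arises from the negative entropy:
\[ G(t) = t \log t + (1-t) \log(1-t), \qquad G'(t) = \log\frac{t}{1-t}, \]
\[ S(t, 1) = G(t) + (1-t) G'(t) = \log t, \qquad S(t, 0) = G(t) - t G'(t) = \log(1-t). \]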
Proper Scoring Rule and f-divergence

We consider a natural cost function under the assumption $X \mid y=1 \sim P$ and $X \mid y=0 \sim Q$ with prior $P(y=1) = 1/2$, that is,
\[ \frac{1}{2} \mathbb{E}_{X \sim P} S(T(X), 1) + \frac{1}{2} \mathbb{E}_{X \sim Q} S(T(X), 0). \]
One can then find a good classification rule $T(\cdot)$ by maximizing the above objective over $T \in \mathcal{T}$:
\[ D_{\mathcal{T}}(P, Q) = \max_{T \in \mathcal{T}} \left\{ \frac{1}{2} \mathbb{E}_{X \sim P} S(T(X), 1) + \frac{1}{2} \mathbb{E}_{X \sim Q} S(T(X), 0) \right\} - G(1/2). \]
Log score (JS-divergence): $S(t, 1) = \log t$, $S(t, 0) = \log(1-t)$.
Zero-one score (TV-divergence): $S(t, 1) = \mathbb{1}\{t \ge 1/2\}$, $S(t, 0) = \mathbb{1}\{t < 1/2\}$.
(Multi-layer) JS-GAN is Statistically Optimal

\[ \hat{\theta} = \operatorname*{argmin}_{\eta \in \mathbb{R}^p} \max_{T \in \mathcal{T}} \left\{ \frac{1}{n} \sum_{i=1}^n \log T(X_i) + \mathbb{E}_{X \sim N(\eta, I_p)} \log(1 - T(X)) + \log 4 \right\}. \]

Theorem (Gao-Liu-Yao-Zhu, 2018)
With i.i.d. observations $X_1, \dots, X_n \sim (1-\epsilon) N(\theta, I_p) + \epsilon Q$ and some regularization on the weight matrices, we have
\[ \|\hat{\theta} - \theta\|_2^2 \lesssim \begin{cases} \frac{p}{n} \vee \epsilon^2, & \text{at least one bounded activation}, \\ \frac{p \log p}{n} \vee \epsilon^2, & \text{ReLU}, \end{cases} \]
with high probability, uniformly over all $\theta \in \mathbb{R}^p$ and all $Q$.

The result can be generalized to elliptical distributions $\mu + \Sigma^{1/2} \xi U$ and to the strong contamination model. Covariance and mean can be estimated simultaneously.
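A minimal PyTorch sketch of this min-max program (my own illustration under simplifying assumptions — a small fixed architecture and plain alternating gradient steps — not the authors' released implementation):

```python
import math
import torch

torch.manual_seed(0)
n, p, eps = 2000, 10, 0.2
theta = torch.ones(p)
n_bad = int(eps * n)
X = torch.cat([theta + torch.randn(n - n_bad, p),      # clean part N(theta, I_p)
               5.0 + torch.randn(n_bad, p)], dim=0)    # contamination Q

# Discriminator T with a bounded (sigmoid) hidden activation, as in the theorem.
T = torch.nn.Sequential(torch.nn.Linear(p, 20), torch.nn.Sigmoid(),
                        torch.nn.Linear(20, 1), torch.nn.Sigmoid())
eta = X.median(dim=0).values.clone().requires_grad_(True)   # generator location, warm start

opt_T = torch.optim.Adam(T.parameters(), lr=1e-2)
opt_eta = torch.optim.SGD([eta], lr=5e-2)

def js_objective():
    fake = eta + torch.randn(n, p)                      # samples from N(eta, I_p)
    return (torch.log(T(X).clamp(1e-6, 1.0)).mean()
            + torch.log((1.0 - T(fake)).clamp(1e-6, 1.0)).mean()
            + math.log(4.0))

for it in range(300):
    for _ in range(5):                                  # inner maximization over T
        opt_T.zero_grad()
        (-js_objective()).backward()
        opt_T.step()
    opt_eta.zero_grad()                                 # outer minimization over eta
    js_objective().backward()
    opt_eta.step()

print("||eta_hat - theta||_2 =", float(torch.norm(eta.detach() - theta)))
```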
Proof Sketch

\[ \sup_{D \in \mathcal{D}} \big|\mathbb{E}_{P_n} D(X) - \mathbb{E}_P D(X)\big| \le C\left(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right). \]
\[ \sup_{D \in \mathcal{D}} \big|\mathbb{E}_{P_\theta} D(X) - \mathbb{E}_{P_{\hat{\theta}}} D(X)\big| \le 2C\left(\sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right) + 2\epsilon. \]
$|f(t) - f(0)| \ge c'|t|$ for $|t| < \tau$, for some $\tau > 0$, where $f(t) = \mathbb{E}_{z \sim N(0,1)}\, \mathrm{sigmoid}(z - t)$ satisfies, with $\|w\|_2 = 1$ and $b = -w^T \theta$,
\[ \mathbb{E}_{P_\theta} D(X) = f(0), \qquad \mathbb{E}_{P_{\hat{\theta}}} D(X) = f\big(w^T(\theta - \hat{\theta})\big). \]
Covariance Matrix Estimation: Improper Network Structure

\[ \mathcal{T}_1 = \left\{ T(x) = \mathrm{sigmoid}\Big(\sum_{j \ge 1} w_j\, \mathrm{sigmoid}(u_j^T x)\Big) : \sum_{j \ge 1} |w_j| \le \kappa,\ u_j \in \mathbb{R}^p \right\}. \]
\[ \mathcal{T}_2 = \left\{ T(x) = \mathrm{sigmoid}\Big(\sum_{j \ge 1} w_j\, \mathrm{ReLU}(u_j^T x)\Big) : \sum_{j \ge 1} |w_j| \le \kappa,\ \|u_j\| \le 1 \right\}. \]
Covariance Matrix Estimation: Proper Network Structure

\[ \mathcal{T}_3 = \left\{ T(x) = \mathrm{sigmoid}\Big(\sum_{j \ge 1} w_j\, \mathrm{sigmoid}(u_j^T x + b_j)\Big) : \sum_{j \ge 1} |w_j| \le \kappa,\ u_j \in \mathbb{R}^p,\ b_j \in \mathbb{R} \right\}. \]
\[ \mathcal{T}_4 = \left\{ T(x) = \mathrm{sigmoid}\Big(\sum_{j \ge 1} w_j\, \mathrm{sigmoid}\Big(\sum_{l=1}^H v_{jl}\, \mathrm{ReLU}(u_l^T x)\Big)\Big) : \sum_{j \ge 1} |w_j| \le \kappa_1,\ \sum_{l=1}^H |v_{jl}| \le \kappa_2,\ \|u_l\| \le 1 \right\}. \]
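A PyTorch reading of $\mathcal{T}_2$ and $\mathcal{T}_4$ (a sketch under my own assumptions, not the papers' code; the $\ell_1$/$\ell_2$ weight constraints appear in the definitions but would have to be enforced separately, e.g. by projection or clipping after each gradient step):

```python
import torch
import torch.nn as nn

class T2(nn.Module):
    """sigmoid( sum_j w_j ReLU(u_j^T x) ): improper structure, no bias terms."""
    def __init__(self, p, width=20):
        super().__init__()
        self.u = nn.Linear(p, width, bias=False)   # rows u_j, constraint ||u_j|| <= 1
        self.w = nn.Linear(width, 1, bias=False)   # weights w_j, constraint sum_j |w_j| <= kappa
    def forward(self, x):
        return torch.sigmoid(self.w(torch.relu(self.u(x))))

class T4(nn.Module):
    """sigmoid( sum_j w_j sigmoid( sum_l v_jl ReLU(u_l^T x) ) ): proper structure."""
    def __init__(self, p, H=20, width=20):
        super().__init__()
        self.u = nn.Linear(p, H, bias=False)       # shared ReLU layer, ||u_l|| <= 1
        self.v = nn.Linear(H, width, bias=False)   # sum_l |v_jl| <= kappa_2
        self.w = nn.Linear(width, 1, bias=False)   # sum_j |w_j| <= kappa_1
    def forward(self, x):
        return torch.sigmoid(self.w(torch.sigmoid(self.v(torch.relu(self.u(x))))))

x = torch.randn(8, 5)
print(T2(5)(x).shape, T4(5)(x).shape)   # both (8, 1), with values in (0, 1)
```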
\[ \hat{\Sigma} = \operatorname*{argmin}_{\Gamma \in \mathcal{E}_p(M)} \max_{T \in \mathcal{T}} \left\{ \frac{1}{n} \sum_{i=1}^n S(T(X_i), 1) + \mathbb{E}_{X \sim N(0, \Gamma)} S(T(X), 0) \right\}. \]

Theorem (Gao-Yao-Zhu, 2019)
With i.i.d. observations $X_1, \dots, X_n \sim (1-\epsilon) N(0, \Sigma) + \epsilon Q$ and some regularization on the network weight matrices, we have
\[ \|\hat{\Sigma} - \Sigma\|_{\mathrm{op}}^2 \lesssim \frac{p}{n} \vee \epsilon^2 \]
with high probability, uniformly over all $\|\Sigma\|_{\mathrm{op}} \le M = O(1)$ and all $Q$.
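A minimal PyTorch sketch of this program with the log score (my own illustration, not the authors' implementation): the generator is $N(0, \Gamma)$ with $\Gamma = AA^T$ reparametrized through a matrix $A$, and, for brevity, a generic sigmoid-output network stands in for the $\mathcal{T}_3$/$\mathcal{T}_4$ structures of the previous slide.

```python
import torch

torch.manual_seed(0)
n, p, eps = 2000, 5, 0.1
Sigma = torch.diag(torch.arange(1.0, p + 1.0))
L = torch.linalg.cholesky(Sigma)
n_bad = int(eps * n)
X = torch.cat([torch.randn(n - n_bad, p) @ L.T,                 # clean N(0, Sigma)
               10.0 * torch.randn(n_bad, p)], dim=0)            # contamination Q

T = torch.nn.Sequential(torch.nn.Linear(p, 20), torch.nn.ReLU(),
                        torch.nn.Linear(20, 20), torch.nn.Sigmoid(),
                        torch.nn.Linear(20, 1), torch.nn.Sigmoid())
A = torch.eye(p, requires_grad=True)                             # Gamma = A A^T

opt_T = torch.optim.Adam(T.parameters(), lr=1e-2)
opt_A = torch.optim.Adam([A], lr=1e-2)

def objective():
    fake = torch.randn(n, p) @ A.T                               # draws from N(0, A A^T)
    real_term = torch.log(T(X).clamp(1e-6, 1.0)).mean()          # S(T(X_i), 1) = log T(X_i)
    fake_term = torch.log((1.0 - T(fake)).clamp(1e-6, 1.0)).mean()
    return real_term + fake_term

for it in range(500):
    for _ in range(5):                                           # maximize over T
        opt_T.zero_grad()
        (-objective()).backward()
        opt_T.step()
    opt_A.zero_grad()                                            # minimize over Gamma
    objective().backward()
    opt_A.step()

Sigma_hat = (A @ A.T).detach()
print("operator-norm error:", float(torch.linalg.matrix_norm(Sigma_hat - Sigma, ord=2)))
```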