Advanced Stochastic Gradient with Variance Reduction
Jingchang Liu
December 7, 2017
University of Science and Technology of China
Table of Contents
• Introduction
• Control Variates
• Antithetic Sampling
• Stratified Sampling
• Importance Sampling
• Experiments
• Conclusions
• Q & A
Introduction
Formulations

Optimization problem
$$\min_w f(w), \qquad f(w) := \frac{1}{n}\sum_{i=1}^n f_i(w)$$

Stochastic gradient descent: at each iteration $t = 1, 2, \cdots$, draw $i_t$ randomly from $\{1, \cdots, n\}$ and update
$$w_{t+1} = w_t - \eta_t \nabla f_{i_t}(w_t)$$

Unified formulation: with $\zeta_t$ a random variable,
$$w_{t+1} = w_t - \eta_t\, g(w_t, \zeta_t)$$
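A minimal sketch of the SGD update above in Python/NumPy. The least-squares components $f_i(w) = \frac{1}{2}(x_i'w - y_i)^2$, the synthetic data, and the stepsize $\eta_t = 1/t$ are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
for t in range(1, 2001):
    i = rng.integers(n)                  # draw i_t uniformly from {1, ..., n}
    grad_i = (X[i] @ w - y[i]) * X[i]    # stochastic gradient g(w_t, zeta_t) = grad f_{i_t}(w_t)
    w = w - (1.0 / t) * grad_i           # w_{t+1} = w_t - eta_t * grad f_{i_t}(w_t)
```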
Estimation

Stochastic gradient
$$\nabla f_{i_t}(w_t) \;\approx\; \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_t)$$

Unbiasedness
$$\mathbb{E}\big[\nabla f_{i_t}(w_t)\big] = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_t)$$

Variance reduction (VR) techniques:
• control variates
• antithetic variates
• importance sampling
• stratified sampling
Control Variates
Control variates

Introduction: to estimate an unknown parameter $\mu$, assume we have a statistic $X$ with $\mathbb{E}X = \mu$ and another r.v. $Y$ whose mean $\mathbb{E}Y = \tau$ is known. Define a new r.v.
$$\bar{X} = X + c\,(Y - \tau)$$

Properties
• Unbiased: $\mathbb{E}\bar{X} = \mathbb{E}X = \mu$
• Variance: $\mathrm{Var}(\bar{X}) = \mathrm{Var}(X) + c^2\,\mathrm{Var}(Y) + 2c\,\mathrm{Cov}(X, Y)$, minimized at the optimal coefficient $c^* = -\mathrm{Cov}(X, Y)/\mathrm{Var}(Y)$
• Simple choices:
  • $\bar{X} = X - Y + \tau$ if $\mathrm{Cov}(X, Y) > 0$
  • $\bar{X} = X + Y - \tau$ if $\mathrm{Cov}(X, Y) < 0$
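A small Monte Carlo illustration of the control-variate construction $\bar{X} = X + c(Y - \tau)$. The target $\mu = \mathbb{E}[e^U]$ with $U \sim \mathrm{Uniform}(0,1)$ and the control $Y = U$ (so $\tau = 0.5$) are assumptions chosen only for this example:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(size=100_000)
x = np.exp(u)                          # X, with E[X] = e - 1 (the unknown mu)
y = u                                  # Y, with known mean tau = 0.5

c = -np.cov(x, y)[0, 1] / y.var()      # c* = -Cov(X, Y) / Var(Y), estimated from the samples
x_bar = x + c * (y - 0.5)              # control-variate estimator, still unbiased

print(x.mean(), x_bar.mean())          # both close to e - 1 ~ 1.718
print(x.var(), x_bar.var())            # Var(X_bar) is much smaller than Var(X)
```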
Control variates for stochastic gradient

VR gradient
• Former: $v_k = \nabla f_{i_k}(w_{k-1})$
• Case 1: $v_k = \nabla f_{i_k}(w_{k-1}) - \nabla h_{i_k}(w_{k-1}) + \mathbb{E}\,\nabla h_{i_k}(w_{k-1})$
• Case 2: $v_k = \nabla f_{i_k}(w_{k-1}) - \nabla f_{i_k}(\tilde{w}) + \tilde{v}$

Methods
• SAGA: $\nabla f_{i_k}(\tilde{w})$ is stored in a table.
• SVRG: $\nabla f_{i_k}(\tilde{w})$ is recalculated after a fixed number of iterations.
• $\lim_{k \to \infty} \mathbb{E}\|v_k\|^2 = 0$, so SAGA and SVRG converge under a fixed stepsize.
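A sketch of the Case 2 (SVRG-style) variance-reduced gradient, reusing the assumed least-squares $f_i$ from the earlier SGD example; the epoch length and stepsize are illustrative choices, not values from the slides:

```python
import numpy as np

def svrg(X, y, eta=0.1, epochs=20, m=None, seed=0):
    n, d = X.shape
    m = m or n
    rng = np.random.default_rng(seed)
    grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]        # grad f_i(w)
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()                                  # snapshot w~
        v_snap = X.T @ (X @ w_snap - y) / n                # full gradient v~ = grad f(w~)
        for _ in range(m):
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_snap, i) + v_snap  # v_k: unbiased, vanishing variance
            w = w - eta * v
    return w
```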
Antithetic Sampling
Antithetic variates

Two r.v. $X_i, X_j$, identically distributed with $\mathbb{E}X_i = \mathbb{E}X_j = \mu$. Since $\mathbb{E}\big[\tfrac{1}{2}(X_i + X_j)\big] = \mu$, use $\tfrac{1}{2}(X_i + X_j)$ to estimate $\mu$.

Formulations
• If $X_i$ and $X_j$ are independent,
$$\mathrm{Var}\big(\tfrac{1}{2}(X_i + X_j)\big) = \tfrac{1}{4}\{\mathrm{Var}(X_i) + \mathrm{Var}(X_j)\} = \tfrac{1}{4}\times 2\,\mathrm{Var}(X_i) = \tfrac{1}{2}\mathrm{Var}(X_i)$$
• If $X_i$ and $X_j$ are negatively correlated,
$$\mathrm{Var}\big(\tfrac{1}{2}(X_i + X_j)\big) = \tfrac{1}{4}\{\mathrm{Var}(X_i) + \mathrm{Var}(X_j) + 2\,\mathrm{Cov}(X_i, X_j)\} \le \tfrac{1}{2}\mathrm{Var}(X_i)$$
• If $X_j = 2\mu - X_i$, then $\mathrm{Var}\big(\tfrac{1}{2}(X_i + X_j)\big) = \mathrm{Var}(\mu) = 0$
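A toy numerical check of the antithetic idea: estimate $\mu = \mathbb{E}[e^U]$ with $U \sim \mathrm{Uniform}(0,1)$ using the negatively correlated pair $(e^U, e^{1-U})$; this setup is an assumption made for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
u1 = rng.uniform(size=50_000)
u2 = rng.uniform(size=50_000)

indep = 0.5 * (np.exp(u1) + np.exp(u2))      # average of two independent draws
anti  = 0.5 * (np.exp(u1) + np.exp(1 - u1))  # antithetic pair: exp(1 - U) has the same
                                             # distribution as exp(U) but is negatively correlated

print(indep.mean(), anti.mean())             # both close to e - 1
print(indep.var(), anti.var())               # the antithetic average has much smaller variance
```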
Antithetic variates for stochastic gradient

Logistic regression
$$\nabla f_i(w) = \frac{e^{-y_i x_i' w}}{1 + e^{-y_i x_i' w}}\, y_i x_i$$

Formulations
$$\mathbb{E}\|\nabla f_i(w) + \nabla f_j(w)\|^2 = \mathbb{E}\|\nabla f_i(w)\|^2 + \mathbb{E}\|\nabla f_j(w)\|^2 + 2\,\mathbb{E}\langle \nabla f_i(w), \nabla f_j(w)\rangle$$
$$\mathbb{E}\langle \nabla f_i(w), \nabla f_j(w)\rangle = \mathbb{E}\left\langle \frac{e^{-y_i x_i' w}}{1 + e^{-y_i x_i' w}}\, y_i x_i,\; \frac{e^{-y_j x_j' w}}{1 + e^{-y_j x_j' w}}\, y_j x_j \right\rangle \ge -\,\mathbb{E}\left\|\frac{e^{-y_i x_i' w}}{1 + e^{-y_i x_i' w}}\, y_i x_i\right\| \left\|\frac{e^{-y_j x_j' w}}{1 + e^{-y_j x_j' w}}\, y_j x_j\right\|$$
with equality if and only if $y_i x_i$ and $y_j x_j$ are anti-parallel (i.e., $y_i x_i \propto -\,y_j x_j$).
SDCA

Derivation: minimizing $f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w) + \frac{\lambda}{2}\|w\|^2$ is equivalent to
$$\min_{y,z}\; P(y, z) = \frac{1}{n}\sum_{i=1}^n f_i(z_i) + \frac{\lambda}{2}\|y\|^2 \qquad \text{s.t. } y = z_i,\; i = 1, 2, \cdots, n$$
$$L(y, z, \alpha) = P(y, z) + \frac{1}{n}\sum_{i=1}^n \alpha_i'(z_i - y)$$
$$D(\alpha) = \inf_{y,z} L(y, z, \alpha) = \inf_z \frac{1}{n}\sum_{i=1}^n \{f_i(z_i) + \alpha_i' z_i\} + \inf_y \left\{\frac{\lambda}{2}\|y\|^2 - \frac{1}{n}\sum_{i=1}^n \alpha_i' y\right\} = \frac{1}{n}\sum_{i=1}^n -f_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i\right\|^2$$
SDCA

Formulation and relationships
$$\min_w f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w) + 0.5\,\lambda\, w'w$$
$$\alpha_i^* = -\frac{1}{\lambda n}\nabla f_i(w^*), \qquad w^t = \sum_{i=1}^n \alpha_i^t$$

Update
$$\alpha_l^t = \begin{cases} \alpha_l^{t-1} - \eta_t\big(\nabla f_l(w^{t-1}) + \lambda n\, \alpha_l^{t-1}\big) & l = i \\ \alpha_l^{t-1} & l \ne i \end{cases}$$
$$w^t = w^{t-1} + \big(\alpha_i^t - \alpha_i^{t-1}\big) = w^{t-1} - \eta_t\big(\nabla f_i(w^{t-1}) + \lambda n\, \alpha_i^{t-1}\big)$$

Here $\lambda n\, \alpha_i^{t-1}$ is antithetic to $\nabla f_i(w^{t-1})$, and $\nabla f_i(w^{t-1}) + \lambda n\, \alpha_i^{t-1} \to 0$ as $t \to \infty$.
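A sketch of the tabulated update above (one dual-like vector $\alpha_i$ per sample, $w^t = \sum_i \alpha_i^t$), again with the assumed least-squares $f_i$ plus the ridge term; the stepsize and data are illustrative assumptions:

```python
import numpy as np

def sdca_style(X, y, lam=0.1, eta=0.01, iters=20_000, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    alpha = np.zeros((n, d))                     # alpha_i, one vector per sample
    w = alpha.sum(axis=0)                        # w^t = sum_i alpha_i^t
    for _ in range(iters):
        i = rng.integers(n)
        g = (X[i] @ w - y[i]) * X[i]             # grad f_i(w^{t-1})
        delta = -eta * (g + lam * n * alpha[i])  # only coordinate l = i changes
        alpha[i] += delta
        w += delta                               # w^t = w^{t-1} + (alpha_i^t - alpha_i^{t-1})
    return w
```

As the slide notes, $\lambda n\,\alpha_i$ tracks $-\nabla f_i(w)$, so the bracketed term (and hence the variance of each step) shrinks as the iterates approach the optimum.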
Stratified Sampling
Stratified sampling

Figure 1: Stratified sampling. The gradients are partitioned into $L$ groups: Group 1: $\nabla f_{11}, \cdots, \nabla f_{1 n_1}$; Group 2: $\nabla f_{21}, \cdots, \nabla f_{2 n_2}$; $\cdots$; Group $L$: $\nabla f_{L1}, \cdots, \nabla f_{L n_L}$.
Stratified sampling

Principles
• homogeneous within groups
• heterogeneous between groups

Stratified sample size
• Proportional allocation: $b_h = \frac{n_h}{n}\, b = W_h\, b$
• Neyman allocation: $b_h = \frac{W_h S_h}{\sum_{h=1}^L W_h S_h}\, b = \frac{N_h S_h}{\sum_{h=1}^L N_h S_h}\, b$

Apply to stochastic gradient (see the sketch below)
• for samples with the same label $y$, cluster the $x$'s to form strata
• $(x_i, y_i) \to \nabla f_i(w; x_i, y_i)$
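A hedged sketch of a stratified gradient estimator under proportional allocation. The strata are assumed given (e.g., from clustering the $x$'s within each label, as suggested above), and `grad` is a hypothetical per-sample gradient callback:

```python
import numpy as np

def stratified_gradient(grad, strata, b, rng):
    """grad(i) -> grad f_i(w); strata: list of index arrays; b: total mini-batch size."""
    n = sum(len(s) for s in strata)
    est = 0.0
    for s in strata:
        w_h = len(s) / n                               # stratum weight W_h = n_h / n
        b_h = max(1, round(w_h * b))                   # proportional allocation b_h ~ W_h * b
        idx = rng.choice(s, size=b_h, replace=True)
        g_h = np.mean([grad(i) for i in idx], axis=0)  # within-stratum average gradient
        est = est + w_h * g_h                          # recombine strata with weights W_h
    return est                                         # estimates (1/n) * sum_i grad f_i(w)
```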
Importance Sampling
Importance sampling

• Uniform sampling: $\nabla f(w^t) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w^t)$
• Importance sampling: $\nabla f(w^t) = \sum_{i=1}^n p_i^t\, \dfrac{\nabla f_i(w^t)}{n\, p_i^t}$, with $\sum_{i=1}^n p_i^t = 1$, $t = 1, 2, \cdots$

Figure 2: Importance sampling
Importance sampling for stochastic gradient
$$\min_{p^t}\; \mathbb{E}\left\|\frac{\nabla f_{i_t}(w^t)}{n\, p_{i_t}^t}\right\|^2 = \min_{p^t}\; \sum_{i=1}^n \frac{\|\nabla f_i(w^t)\|^2}{n^2\, p_i^t} \ge \frac{1}{n^2}\left(\sum_{i=1}^n \|\nabla f_i(w^t)\|\right)^2$$
with the minimum attained at
$$p_i^t = \frac{\|\nabla f_i(w^t)\|}{\sum_{j=1}^n \|\nabla f_j(w^t)\|}$$
If $f_i(w)$ is $L_i$-Lipschitz, then $\|\nabla f_i(w)\| \le L_i$, so one can instead use
$$p_i^t = \frac{L_i}{\sum_{j=1}^n L_j}$$
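A sketch of one importance-sampled step for the logistic loss, using the Lipschitz-based probabilities $p_i = L_i / \sum_j L_j$; taking $L_i = \|x_i\|$ as the per-sample bound is an assumption stated here, not on the slide:

```python
import numpy as np

def importance_sgd_step(w, X, y, eta, rng):
    n = X.shape[0]
    L = np.linalg.norm(X, axis=1)             # assumed bounds L_i = ||x_i|| (||grad f_i|| <= L_i)
    p = L / L.sum()                           # p_i = L_i / sum_j L_j
    i = rng.choice(n, p=p)                    # sample index i with probability p_i
    s = -y[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
    grad_i = s * X[i]                         # grad f_i(w) for the logistic loss
    return w - eta * grad_i / (n * p[i])      # reweight by 1/(n p_i) so the step stays unbiased
```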
Experiments
Stratified Sampling

Figure 3: multi-class logistic regression (convex) on letter, mnist, pendigits, and usps.
Importance Sampling

Figure 4: SVM on several datasets.
Conclusions
Conclusions

• VR based on optimization variables (e.g., SDCA, SVRG) can drive the variance of the stochastic gradient to 0.
• VR based on sampling can significantly reduce the variance.
• Constructing correlated variates is crucial.
• Different VR methods can be combined, but how to do so effectively still requires further work.
Q & A