  1. Advanced Stochastic Gradient with Variance Reduction. Jingchang Liu, December 7, 2017, University of Science and Technology of China.

  2. Table of Contents: Introduction, Control Variates, Antithetic Sampling, Stratified Sampling, Importance Sampling, Experiments, Conclusions, Q & A.

  3. Introduction

  4. Formulations
  Optimization problem:
  $\min_w f(w), \qquad f(w) := \frac{1}{n}\sum_{i=1}^n f_i(w)$
  Stochastic gradient descent: at each iteration $t = 1, 2, \cdots$, draw $i_t$ uniformly at random from $\{1, \cdots, n\}$ and update
  $w_{t+1} = w_t - \eta_t \nabla f_{i_t}(w_t)$
  Unified formulation, where $\zeta_t$ is a random variable:
  $w_{t+1} = w_t - \eta_t\, g(w_t, \zeta_t)$
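  A minimal sketch of this update on a synthetic least-squares problem; the data, loss, and step size are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_fi(w, i):
    # gradient of f_i(w) = 0.5 * (x_i^T w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
eta = 0.01
for t in range(5000):
    i = rng.integers(n)            # draw i_t uniformly from {1, ..., n}
    w = w - eta * grad_fi(w, i)    # w_{t+1} = w_t - eta_t * grad f_{i_t}(w_t)

print(0.5 * np.mean((X @ w - y) ** 2))   # full objective f(w) after training
```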

  5. Estimation
  Stochastic gradient: $\nabla f_{i_t}(w_t)$ approximates the full gradient $\frac{1}{n}\sum_{i=1}^n \nabla f_i(w_t)$.
  Unbiased: $\mathbb{E}\left[\nabla f_{i_t}(w_t)\right] = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_t)$
  Variance reduction (VR) techniques: control variates, antithetic variates, importance sampling, stratified sampling.
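  A quick numerical check of the unbiasedness claim and of the variance term that the following methods try to reduce, on an assumed synthetic squared-loss problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

grads = (X @ w - y)[:, None] * X      # row i is grad f_i(w) for f_i(w) = 0.5*(x_i^T w - y_i)^2
full_grad = grads.mean(axis=0)        # (1/n) * sum_i grad f_i(w)

# Monte Carlo check of unbiasedness: average many uniformly drawn per-example gradients.
idx = rng.integers(n, size=200_000)
print(np.linalg.norm(grads[idx].mean(axis=0) - full_grad))    # close to 0

# The quantity the VR techniques below try to shrink: E || grad f_i(w) - grad f(w) ||^2
print(np.mean(np.sum((grads - full_grad) ** 2, axis=1)))
```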

  6. Control Variates

  7. Control variates
  Introduction: to estimate an unknown parameter $\mu$, assume we have a statistic $X$ with $\mathbb{E}[X] = \mu$ and another r.v. $Y$ whose mean $\mathbb{E}[Y] = \tau$ is known. Define a new r.v. $\bar{X} = X + c\,(Y - \tau)$.
  Properties:
  • Unbiased: $\mathbb{E}[\bar{X}] = \mathbb{E}[X] = \mu$
  • Variance: $\mathrm{Var}(\bar{X}) = \mathrm{Var}(X) + c^2\,\mathrm{Var}(Y) + 2c\,\mathrm{Cov}(X, Y)$, minimized at the optimal coefficient $c^* = -\frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)}$
  • Simple choices: $\bar{X} = X - (Y - \tau)$ if $\mathrm{Cov}(X, Y) > 0$; $\bar{X} = X + (Y - \tau)$ if $\mathrm{Cov}(X, Y) < 0$
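  A small sketch of the construction above on a toy Monte Carlo problem; the integrand exp(U) and the control variate U are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000

# Toy problem: estimate mu = E[exp(U)] with U ~ Uniform(0, 1).
# Control variate Y = U has known mean tau = 0.5 and is positively correlated with X.
U = rng.uniform(size=m)
X = np.exp(U)
Y = U
tau = 0.5

c_opt = -np.cov(X, Y)[0, 1] / np.var(Y)   # c* = -Cov(X, Y) / Var(Y)
X_bar = X + c_opt * (Y - tau)             # X_bar = X + c*(Y - tau), still unbiased for mu

print(X.mean(), X_bar.mean())             # both approximate mu = e - 1 ~ 1.718
print(X.var(), X_bar.var())               # the control variate cuts the variance sharply
```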

  8. Control variates for stochastic gradient
  VR gradient:
  • Plain SGD: $v_k = \nabla f_{i_k}(w_{k-1})$
  • Case 1: $v_k = \nabla f_{i_k}(w_{k-1}) - \nabla h_{i_k}(w_{k-1}) + \mathbb{E}\left[\nabla h_{i_k}(w_{k-1})\right]$
  • Case 2: $v_k = \nabla f_{i_k}(w_{k-1}) - \nabla f_{i_k}(\tilde{w}) + \tilde{v}$
  Methods:
  • SAGA: $\nabla f_{i_k}(\tilde{w})$ is stored in a table.
  • SVRG: $\nabla f_{i_k}(\tilde{w})$ is recalculated after a fixed number of iterations, at each snapshot $\tilde{w}$.
  • $\lim_{k \to \infty} \mathbb{E}\|v_k\|^2 = 0$, so SAGA and SVRG converge under a fixed step size.
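  A rough SVRG-style sketch of Case 2 on a synthetic least-squares problem; the problem, snapshot schedule, and step size are assumptions for illustration, not the presenter's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def grad_fi(w, i):
    return (X[i] @ w - y[i]) * X[i]      # grad of f_i(w) = 0.5*(x_i^T w - y_i)^2

def full_grad(w):
    return X.T @ (X @ w - y) / n

w, eta, m = np.zeros(d), 0.005, 2 * n
for epoch in range(30):
    w_snap = w.copy()                    # snapshot point, playing the role of w~
    v_snap = full_grad(w_snap)           # v~ = full gradient at the snapshot
    for _ in range(m):
        i = rng.integers(n)
        # Case 2 control-variate gradient: unbiased, and its variance shrinks as w -> w*.
        v = grad_fi(w, i) - grad_fi(w_snap, i) + v_snap
        w = w - eta * v                  # a fixed step size suffices here
    print(epoch, 0.5 * np.mean((X @ w - y) ** 2))
```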

  9. Antithetic Sampling

  10. Antithetic variates
  Two identically distributed r.v. $X_i, X_j$ with $\mathbb{E}[X_i] = \mathbb{E}[X_j] = \mu$. Since $\mathbb{E}\left[\tfrac{1}{2}(X_i + X_j)\right] = \mu$, use $\tfrac{1}{2}(X_i + X_j)$ to estimate $\mu$.
  Formulations:
  • If $X_i$ and $X_j$ are independent: $\mathrm{Var}\left(\tfrac{1}{2}(X_i + X_j)\right) = \tfrac{1}{4}\{\mathrm{Var}(X_i) + \mathrm{Var}(X_j)\} = \tfrac{1}{2}\mathrm{Var}(X_i)$
  • If $X_i$ and $X_j$ are negatively correlated: $\mathrm{Var}\left(\tfrac{1}{2}(X_i + X_j)\right) = \tfrac{1}{4}\{\mathrm{Var}(X_i) + \mathrm{Var}(X_j) + 2\,\mathrm{Cov}(X_i, X_j)\} \le \tfrac{1}{2}\mathrm{Var}(X_i)$
  • If $X_j = 2\mu - X_i$: $\mathrm{Var}\left(\tfrac{1}{2}(X_i + X_j)\right) = \mathrm{Var}(\mu) = 0$
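  A toy sketch of an antithetic pair (U, 1 - U) for a monotone integrand; the setup is illustrative and not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000

# Estimate mu = E[exp(U)] with U ~ Uniform(0, 1).  exp is monotone, so exp(U) and
# exp(1 - U) are identically distributed but negatively correlated: an antithetic pair.
U = rng.uniform(size=m)
X_i = np.exp(U)
X_j = np.exp(1.0 - U)

plain = X_i                        # one independent draw
anti = 0.5 * (X_i + X_j)           # antithetic estimator, same mean mu

print(plain.mean(), anti.mean())   # both approximate mu = e - 1
print(plain.var() / 2, anti.var()) # compare at equal cost: two draws per estimate
```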

  11. Antithetic variates for stochastic gradient
  Logistic regression:
  $\nabla f_i(w) = -\frac{e^{-y_i x_i^\top w}}{1 + e^{-y_i x_i^\top w}}\, y_i x_i$
  Formulations:
  $\mathbb{E}\|\nabla f_i(w) + \nabla f_j(w)\|^2 = \mathbb{E}\|\nabla f_i(w)\|^2 + \mathbb{E}\|\nabla f_j(w)\|^2 + 2\,\mathbb{E}\langle \nabla f_i(w), \nabla f_j(w)\rangle$
  $\mathbb{E}\langle \nabla f_i(w), \nabla f_j(w)\rangle = \mathbb{E}\left\langle \frac{e^{-y_i x_i^\top w}}{1 + e^{-y_i x_i^\top w}}\, y_i x_i,\; \frac{e^{-y_j x_j^\top w}}{1 + e^{-y_j x_j^\top w}}\, y_j x_j \right\rangle \ge -\,\mathbb{E}\left[\frac{e^{-y_i x_i^\top w}}{1 + e^{-y_i x_i^\top w}}\, \frac{e^{-y_j x_j^\top w}}{1 + e^{-y_j x_j^\top w}}\, \|y_i x_i\|\,\|y_j x_j\|\right]$
  Equality holds if and only if $y_i x_i$ and $y_j x_j$ are anti-parallel, so pairing examples whose $y_i x_i$ point in opposite directions gives negatively correlated gradients.
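  A sketch of this equality case on synthetic logistic-regression data: the dataset is built to contain exact antithetic partners (x, y) and (-x, y), an illustrative construction rather than the presenter's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
half, d = 1000, 5
Xh = rng.normal(size=(half, d))
yh = np.where(rng.uniform(size=half) < 0.5, 1.0, -1.0)

# Build a dataset containing exact antithetic partners: example i + half has
# x = -x_i and the same label, so y_j * x_j = -(y_i * x_i), the equality case above.
X = np.vstack([Xh, -Xh])
y = np.concatenate([yh, yh])
n = 2 * half
w = rng.normal(size=d)

sigma = 1.0 / (1.0 + np.exp(y * (X @ w)))     # equals e^{-y x^T w} / (1 + e^{-y x^T w})
grads = -(sigma * y)[:, None] * X             # logistic-loss gradients grad f_i(w)
full_grad = grads.mean(axis=0)

def pair_variance(pairs):
    avg = 0.5 * (grads[pairs[:, 0]] + grads[pairs[:, 1]])
    return np.mean(np.sum((avg - full_grad) ** 2, axis=1))

rand_pairs = rng.permutation(n).reshape(half, 2)
anti_pairs = np.stack([np.arange(half), np.arange(half) + half], axis=1)
print(pair_variance(rand_pairs), pair_variance(anti_pairs))   # antithetic pairs are far tighter
```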

  12. SDCA: derivation
  The problem $f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w) + \frac{\lambda}{2}\|w\|^2$ is equivalent to
  $P(y, z) = \frac{1}{n}\sum_{i=1}^n f_i(z_i) + \frac{\lambda}{2}\|y\|^2 \quad \text{s.t.}\ \ y = z_i,\ i = 1, 2, \cdots, n$
  Lagrangian: $L(y, z, \alpha) = P(y, z) + \frac{1}{n}\sum_{i=1}^n \alpha_i^\top (z_i - y)$
  Dual:
  $D(\alpha) = \inf_{y, z} L(y, z, \alpha) = \frac{1}{n}\sum_{i=1}^n \inf_{z_i}\left\{ f_i(z_i) + \alpha_i^\top z_i \right\} + \inf_y \left\{ \frac{\lambda}{2}\|y\|^2 - \frac{1}{n}\sum_{i=1}^n \alpha_i^\top y \right\} = \frac{1}{n}\sum_{i=1}^n \left(-f_i^*(-\alpha_i)\right) - \frac{\lambda}{2}\left\| \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i \right\|^2$

  13. SDCA: formulation and relationships
  $\min_w f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w) + \frac{\lambda}{2}\, w^\top w$
  Relationships: $\alpha_i^* = -\frac{1}{\lambda n}\nabla f_i(w^*), \qquad w^t = \sum_{i=1}^n \alpha_i^t$
  Update (draw index $i$):
  $\alpha_l^t = \begin{cases} \alpha_l^{t-1} - \eta_t\left(\nabla f_i(w^{t-1}) + \lambda n\, \alpha_l^{t-1}\right) & l = i \\ \alpha_l^{t-1} & l \ne i \end{cases}$
  $w^t = w^{t-1} + \left(\alpha_i^t - \alpha_i^{t-1}\right) = w^{t-1} - \eta_t\left(\nabla f_i(w^{t-1}) + \lambda n\, \alpha_i^{t-1}\right)$
  Here $\lambda n\, \alpha_i^{t-1}$ is antithetic to $\nabla f_i(w^{t-1})$, and $\nabla f_i(w^{t-1}) + \lambda n\, \alpha_i^{t-1} \to 0$ as $t \to \infty$.
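  A rough sketch of this update for ridge-regularized least squares; the problem, step size, and iteration count are illustrative assumptions, not the talk's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 500, 10, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def grad_fi(w, i):
    return (X[i] @ w - y[i]) * X[i]               # grad of f_i(w) = 0.5*(x_i^T w - y_i)^2

alpha = np.zeros((n, d))                           # one dual-style vector per example
w = alpha.sum(axis=0)                              # maintain w^t = sum_l alpha_l^t
eta = 0.5 / (lam * n + np.max(np.sum(X ** 2, axis=1)))   # conservative fixed step size

for t in range(40 * n):
    i = rng.integers(n)
    step = eta * (grad_fi(w, i) + lam * n * alpha[i])
    alpha[i] -= step                               # update alpha_i as on the slide
    w -= step                                      # w^t = w^{t-1} + (alpha_i^t - alpha_i^{t-1})

obj = 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w
resid = np.mean(np.linalg.norm((X @ w - y)[:, None] * X + lam * n * alpha, axis=1))
print(obj, resid)   # resid shrinks: lam*n*alpha_i tracks -grad f_i(w), so the step noise vanishes
```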

  14. Stratified Sampling

  15. Stratified sampling
  Figure 1: Stratified sampling. The $n$ per-example gradients are partitioned into $L$ groups: Group 1: $\nabla f_{11}, \cdots, \nabla f_{1 n_1}$; Group 2: $\nabla f_{21}, \cdots, \nabla f_{2 n_2}$; $\cdots$; Group $L$: $\nabla f_{L1}, \cdots, \nabla f_{L n_L}$.

  16. Stratified sampling
  Principles:
  • Homogeneous within groups.
  • Heterogeneous between groups.
  Stratified sample size ($b$ samples in total, $b_h$ from stratum $h$; $W_h = n_h / n$ is the stratum weight and $S_h$ the within-stratum standard deviation):
  • Proportional: $\frac{b_h}{b} = \frac{n_h}{n} = W_h$
  • Neyman: $b_h = \frac{W_h S_h}{\sum_{l=1}^{L} W_l S_l}\, b = \frac{N_h S_h}{\sum_{l=1}^{L} N_l S_l}\, b$
  Apply to stochastic gradient (see the sketch below):
  • For examples with the same label $y$, cluster the features $x$ to form the strata.
  • Each example maps to a gradient: $(x_i, y_i) \to \nabla f_i(w; x_i, y_i)$
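  A small sketch comparing uniform, proportional, and Neyman allocation when estimating a population mean; the strata and budget are made-up numbers, and the same allocations would be applied to per-example gradients after clustering:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population split into 3 strata with very different spreads.
strata = [rng.normal(0.0, 0.1, size=6000),
          rng.normal(1.0, 1.0, size=3000),
          rng.normal(5.0, 4.0, size=1000)]
pop = np.concatenate(strata)
N = len(pop)
W = np.array([len(s) / N for s in strata])          # stratum weights W_h = N_h / N
S = np.array([s.std() for s in strata])             # within-stratum std S_h
b = 300                                              # total sample budget

def estimate(alloc):
    # Stratified estimator: sum_h W_h * (mean of b_h draws from stratum h).
    return sum(w * rng.choice(s, size=k).mean() for w, s, k in zip(W, strata, alloc))

prop = np.round(W * b).astype(int)                          # proportional: b_h = W_h * b
neyman = np.round(W * S / (W * S).sum() * b).astype(int)    # Neyman: b_h proportional to W_h * S_h

uniform = [rng.choice(pop, size=b).mean() for _ in range(2000)]
prop_est = [estimate(prop) for _ in range(2000)]
ney_est = [estimate(neyman) for _ in range(2000)]
print(pop.mean())
print(np.var(uniform), np.var(prop_est), np.var(ney_est))   # Neyman has the smallest variance
```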

  17. Importance Sampling

  18. Importance sampling
  • Uniform sampling: $\nabla f(w_t) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_t)$, estimated by $\nabla f_{i_t}(w_t)$ with $i_t$ drawn uniformly.
  • Importance sampling: $\nabla f(w_t) = \sum_{i=1}^n p_i^t \cdot \frac{\nabla f_i(w_t)}{n\, p_i^t}, \quad \sum_{i=1}^n p_i^t = 1,\ t = 1, 2, \cdots$, estimated by $\frac{\nabla f_{i_t}(w_t)}{n\, p_{i_t}^t}$ with $i_t \sim p^t$.
  Figure 2: Importance sampling.

  19. Importance sampling for stochastic gradient
  $\min_{p^t} \mathbb{E}\left\| \frac{\nabla f_{i_t}(w_t)}{n\, p_{i_t}^t} \right\|^2 = \min_{p^t} \sum_{i=1}^n \frac{\|\nabla f_i(w_t)\|^2}{n^2\, p_i^t} \ge \frac{1}{n^2}\left( \sum_{i=1}^n \|\nabla f_i(w_t)\| \right)^2$
  with the minimum attained at
  $p_i^t = \frac{\|\nabla f_i(w_t)\|}{\sum_{j=1}^n \|\nabla f_j(w_t)\|}$
  If $f_i(w)$ is $L_i$-Lipschitz, then $\|\nabla f_i(w)\| \le L_i$, and one can use
  $p_i^t = \frac{L_i}{\sum_{j=1}^n L_j}$
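  A sketch of the optimal weights $p_i^t \propto \|\nabla f_i(w_t)\|$ on a synthetic squared-loss problem with deliberately uneven gradient norms; this is an illustrative setup, not the talk's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
# Scale the rows so that per-example gradient magnitudes are very uneven.
X = rng.normal(size=(n, d)) * rng.lognormal(0.0, 1.5, size=(n, 1))
y = rng.normal(size=n)
w = rng.normal(size=d)

grads = (X @ w - y)[:, None] * X                 # grad f_i(w) for the squared loss
full_grad = grads.mean(axis=0)

norms = np.linalg.norm(grads, axis=1)
p = norms / norms.sum()                          # p_i proportional to ||grad f_i(w)||

# Uniform estimator: grad f_i, i uniform.   Importance estimator: grad f_i / (n p_i), i ~ p.
var_uniform = np.mean(np.sum((grads - full_grad) ** 2, axis=1))
weighted = grads / (n * p[:, None])
var_imp = np.sum(p[:, None] * (weighted - full_grad) ** 2)

idx = rng.choice(n, size=100_000, p=p)           # Monte Carlo check of unbiasedness
print(np.linalg.norm(weighted[idx].mean(axis=0) - full_grad))   # close to 0
print(var_uniform, var_imp)                      # importance sampling: much smaller variance
```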

  20. Experiments

  21. Stratified Sampling
  Figure 3: multi-class logistic regression (convex) on letter, mnist, pendigits, and usps.

  22. Importance sampling
  Figure 4: SVM on several datasets.

  23. Conclusions

  24. Conclusions
  • VR based on the optimization variables (e.g. SDCA, SVRG) can make the variance converge to 0.
  • VR based on the samples can significantly reduce the variance.
  • Constructing correlated variates is crucial.
  • Different VR methods can be combined, but how to combine them well still requires effort.

  25. Q & A
