Advanced Stochastic Gradient with Variance Reduction
Jingchang Liu
December 7, 2017
University of Science and Technology of China
Table of Contents
• Introduction
• Control Variates
• Antithetic Sampling
• Stratified Sampling
• Importance Sampling
• Experiments
• Conclusions
• Q & A
Introduction
Formulations

Optimization problem
$$\min_w f(w), \qquad f(w) := \frac{1}{n}\sum_{i=1}^n f_i(w)$$

Stochastic gradient descent: at each iteration $t = 1, 2, \cdots$, draw $i_t$ randomly from $\{1, \cdots, n\}$ and update
$$w_{t+1} = w_t - \eta_t \nabla f_{i_t}(w_t)$$

Unified formulation: with $\zeta_t$ a random variable,
$$w_{t+1} = w_t - \eta_t\, g(w_t, \zeta_t)$$
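A minimal sketch of the SGD update above in Python/NumPy. The least-squares components $f_i(w) = \frac{1}{2}(x_i'w - y_i)^2$, the synthetic data, and the stepsize $\eta_t = 1/t$ are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
for t in range(1, 2001):
    i = rng.integers(n)                  # draw i_t uniformly from {1, ..., n}
    grad_i = (X[i] @ w - y[i]) * X[i]    # stochastic gradient g(w_t, zeta_t) = grad f_{i_t}(w_t)
    w = w - (1.0 / t) * grad_i           # w_{t+1} = w_t - eta_t * grad f_{i_t}(w_t)
```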
Estimation

Stochastic gradient
$$\nabla f_{i_t}(w_t) \;\approx\; \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_t)$$

Unbiasedness
$$\mathbb{E}\big[\nabla f_{i_t}(w_t)\big] = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_t)$$

Variance reduction (VR) techniques:
• control variates
• antithetic variates
• importance sampling
• stratified sampling
Control Variates
Control variates

Introduction: to estimate an unknown parameter $\mu$, assume we have a statistic $X$ with $\mathbb{E}X = \mu$ and another r.v. $Y$ whose mean $\mathbb{E}Y = \tau$ is known. Define a new r.v.
$$\bar{X} = X + c\,(Y - \tau)$$

Properties
• Unbiased: $\mathbb{E}\bar{X} = \mathbb{E}X = \mu$
• Variance: $\mathrm{Var}(\bar{X}) = \mathrm{Var}(X) + c^2\,\mathrm{Var}(Y) + 2c\,\mathrm{Cov}(X, Y)$, minimized at the optimal coefficient $c^* = -\mathrm{Cov}(X, Y)/\mathrm{Var}(Y)$
• Simple choices:
  • $\bar{X} = X - Y + \tau$ if $\mathrm{Cov}(X, Y) > 0$
  • $\bar{X} = X + Y - \tau$ if $\mathrm{Cov}(X, Y) < 0$
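A small Monte Carlo illustration of the control-variate construction $\bar{X} = X + c(Y - \tau)$. The target $\mu = \mathbb{E}[e^U]$ with $U \sim \mathrm{Uniform}(0,1)$ and the control $Y = U$ (so $\tau = 0.5$) are assumptions chosen only for this example:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(size=100_000)
x = np.exp(u)                          # X, with E[X] = e - 1 (the unknown mu)
y = u                                  # Y, with known mean tau = 0.5

c = -np.cov(x, y)[0, 1] / y.var()      # c* = -Cov(X, Y) / Var(Y), estimated from the samples
x_bar = x + c * (y - 0.5)              # control-variate estimator, still unbiased

print(x.mean(), x_bar.mean())          # both close to e - 1 ~ 1.718
print(x.var(), x_bar.var())            # Var(X_bar) is much smaller than Var(X)
```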
Control variates for stochastic gradient

VR gradient
• Former: $v_k = \nabla f_{i_k}(w_{k-1})$
• Case 1: $v_k = \nabla f_{i_k}(w_{k-1}) - \nabla h_{i_k}(w_{k-1}) + \mathbb{E}\,\nabla h_{i_k}(w_{k-1})$
• Case 2: $v_k = \nabla f_{i_k}(w_{k-1}) - \nabla f_{i_k}(\tilde{w}) + \tilde{v}$

Methods
• SAGA: $\nabla f_{i_k}(\tilde{w})$ is stored in a table.
• SVRG: $\nabla f_{i_k}(\tilde{w})$ is recalculated after a fixed number of iterations.
• $\lim_{k \to \infty} \mathbb{E}\|v_k\|^2 = 0$, so SAGA and SVRG converge under a fixed stepsize.
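A sketch of the Case 2 (SVRG-style) variance-reduced gradient, reusing the assumed least-squares $f_i$ from the earlier SGD example; the epoch length and stepsize are illustrative choices, not values from the slides:

```python
import numpy as np

def svrg(X, y, eta=0.1, epochs=20, m=None, seed=0):
    n, d = X.shape
    m = m or n
    rng = np.random.default_rng(seed)
    grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]        # grad f_i(w)
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()                                  # snapshot w~
        v_snap = X.T @ (X @ w_snap - y) / n                # full gradient v~ = grad f(w~)
        for _ in range(m):
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_snap, i) + v_snap  # v_k: unbiased, vanishing variance
            w = w - eta * v
    return w
```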
Antithetic Sampling
Antithetic variates

Two r.v. $X_i, X_j$, identically distributed with $\mathbb{E}X_i = \mathbb{E}X_j = \mu$. Since $\mathbb{E}\big[\tfrac{1}{2}(X_i + X_j)\big] = \mu$, use $\tfrac{1}{2}(X_i + X_j)$ to estimate $\mu$.

Formulations
• If $X_i$ and $X_j$ are independent,
$$\mathrm{Var}\big(\tfrac{1}{2}(X_i + X_j)\big) = \tfrac{1}{4}\{\mathrm{Var}(X_i) + \mathrm{Var}(X_j)\} = \tfrac{1}{4}\times 2\,\mathrm{Var}(X_i) = \tfrac{1}{2}\mathrm{Var}(X_i)$$
• If $X_i$ and $X_j$ are negatively correlated,
$$\mathrm{Var}\big(\tfrac{1}{2}(X_i + X_j)\big) = \tfrac{1}{4}\{\mathrm{Var}(X_i) + \mathrm{Var}(X_j) + 2\,\mathrm{Cov}(X_i, X_j)\} \le \tfrac{1}{2}\mathrm{Var}(X_i)$$
• If $X_j = 2\mu - X_i$, then $\mathrm{Var}\big(\tfrac{1}{2}(X_i + X_j)\big) = \mathrm{Var}(\mu) = 0$
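A toy numerical check of the antithetic idea: estimate $\mu = \mathbb{E}[e^U]$ with $U \sim \mathrm{Uniform}(0,1)$ using the negatively correlated pair $(e^U, e^{1-U})$; this setup is an assumption made for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
u1 = rng.uniform(size=50_000)
u2 = rng.uniform(size=50_000)

indep = 0.5 * (np.exp(u1) + np.exp(u2))      # average of two independent draws
anti  = 0.5 * (np.exp(u1) + np.exp(1 - u1))  # antithetic pair: exp(1 - U) has the same
                                             # distribution as exp(U) but is negatively correlated

print(indep.mean(), anti.mean())             # both close to e - 1
print(indep.var(), anti.var())               # the antithetic average has much smaller variance
```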
Antithetic variates for stochastic gradient

Logistic regression
$$\nabla f_i(w) = \frac{e^{-y_i x_i' w}}{1 + e^{-y_i x_i' w}}\, y_i x_i$$

Formulations
$$\mathbb{E}\|\nabla f_i(w) + \nabla f_j(w)\|^2 = \mathbb{E}\|\nabla f_i(w)\|^2 + \mathbb{E}\|\nabla f_j(w)\|^2 + 2\,\mathbb{E}\langle \nabla f_i(w), \nabla f_j(w)\rangle$$
$$\mathbb{E}\langle \nabla f_i(w), \nabla f_j(w)\rangle = \mathbb{E}\left\langle \frac{e^{-y_i x_i' w}}{1 + e^{-y_i x_i' w}}\, y_i x_i,\; \frac{e^{-y_j x_j' w}}{1 + e^{-y_j x_j' w}}\, y_j x_j \right\rangle \ge -\,\mathbb{E}\left\|\frac{e^{-y_i x_i' w}}{1 + e^{-y_i x_i' w}}\, y_i x_i\right\| \left\|\frac{e^{-y_j x_j' w}}{1 + e^{-y_j x_j' w}}\, y_j x_j\right\|$$
with equality if and only if $y_i x_i$ and $y_j x_j$ are anti-parallel (i.e., $y_i x_i \propto -\,y_j x_j$).
SDCA

Derivation: minimizing $f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w) + \frac{\lambda}{2}\|w\|^2$ is equivalent to
$$\min_{y,z}\; P(y, z) = \frac{1}{n}\sum_{i=1}^n f_i(z_i) + \frac{\lambda}{2}\|y\|^2 \qquad \text{s.t. } y = z_i,\; i = 1, 2, \cdots, n$$
$$L(y, z, \alpha) = P(y, z) + \frac{1}{n}\sum_{i=1}^n \alpha_i'(z_i - y)$$
$$D(\alpha) = \inf_{y,z} L(y, z, \alpha) = \inf_z \frac{1}{n}\sum_{i=1}^n \{f_i(z_i) + \alpha_i' z_i\} + \inf_y \left\{\frac{\lambda}{2}\|y\|^2 - \frac{1}{n}\sum_{i=1}^n \alpha_i' y\right\} = \frac{1}{n}\sum_{i=1}^n -f_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i\right\|^2$$
SDCA

Formulation and relationships
$$\min_w f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w) + 0.5\,\lambda\, w'w$$
$$\alpha_i^* = -\frac{1}{\lambda n}\nabla f_i(w^*), \qquad w^t = \sum_{i=1}^n \alpha_i^t$$

Update
$$\alpha_l^t = \begin{cases} \alpha_l^{t-1} - \eta_t\big(\nabla f_l(w^{t-1}) + \lambda n\, \alpha_l^{t-1}\big) & l = i \\ \alpha_l^{t-1} & l \ne i \end{cases}$$
$$w^t = w^{t-1} + \big(\alpha_i^t - \alpha_i^{t-1}\big) = w^{t-1} - \eta_t\big(\nabla f_i(w^{t-1}) + \lambda n\, \alpha_i^{t-1}\big)$$

Here $\lambda n\, \alpha_i^{t-1}$ is antithetic to $\nabla f_i(w^{t-1})$, and $\nabla f_i(w^{t-1}) + \lambda n\, \alpha_i^{t-1} \to 0$ as $t \to \infty$.
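A sketch of the tabulated update above (one dual-like vector $\alpha_i$ per sample, $w^t = \sum_i \alpha_i^t$), again with the assumed least-squares $f_i$ plus the ridge term; the stepsize and data are illustrative assumptions:

```python
import numpy as np

def sdca_style(X, y, lam=0.1, eta=0.01, iters=20_000, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    alpha = np.zeros((n, d))                     # alpha_i, one vector per sample
    w = alpha.sum(axis=0)                        # w^t = sum_i alpha_i^t
    for _ in range(iters):
        i = rng.integers(n)
        g = (X[i] @ w - y[i]) * X[i]             # grad f_i(w^{t-1})
        delta = -eta * (g + lam * n * alpha[i])  # only coordinate l = i changes
        alpha[i] += delta
        w += delta                               # w^t = w^{t-1} + (alpha_i^t - alpha_i^{t-1})
    return w
```

As the slide notes, $\lambda n\,\alpha_i$ tracks $-\nabla f_i(w)$, so the bracketed term (and hence the variance of each step) shrinks as the iterates approach the optimum.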
Stratified Sampling
Stratified sampling

Figure 1: Stratified sampling. The gradients are partitioned into $L$ groups: Group 1: $\nabla f_{11}, \cdots, \nabla f_{1 n_1}$; Group 2: $\nabla f_{21}, \cdots, \nabla f_{2 n_2}$; $\cdots$; Group $L$: $\nabla f_{L1}, \cdots, \nabla f_{L n_L}$.
Stratified sampling

Principles
• homogeneous within groups
• heterogeneous between groups

Stratified sample size
• Proportional allocation: $b_h = \frac{n_h}{n}\, b = W_h\, b$
• Neyman allocation: $b_h = \frac{W_h S_h}{\sum_{h=1}^L W_h S_h}\, b = \frac{N_h S_h}{\sum_{h=1}^L N_h S_h}\, b$

Apply to stochastic gradient (see the sketch below)
• for samples with the same label $y$, cluster the $x$'s to form strata
• $(x_i, y_i) \to \nabla f_i(w; x_i, y_i)$
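A hedged sketch of a stratified gradient estimator under proportional allocation. The strata are assumed given (e.g., from clustering the $x$'s within each label, as suggested above), and `grad` is a hypothetical per-sample gradient callback:

```python
import numpy as np

def stratified_gradient(grad, strata, b, rng):
    """grad(i) -> grad f_i(w); strata: list of index arrays; b: total mini-batch size."""
    n = sum(len(s) for s in strata)
    est = 0.0
    for s in strata:
        w_h = len(s) / n                               # stratum weight W_h = n_h / n
        b_h = max(1, round(w_h * b))                   # proportional allocation b_h ~ W_h * b
        idx = rng.choice(s, size=b_h, replace=True)
        g_h = np.mean([grad(i) for i in idx], axis=0)  # within-stratum average gradient
        est = est + w_h * g_h                          # recombine strata with weights W_h
    return est                                         # estimates (1/n) * sum_i grad f_i(w)
```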
Importance Sampling
Importance sampling

• Uniform sampling: $\nabla f(w^t) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w^t)$
• Importance sampling: $\nabla f(w^t) = \sum_{i=1}^n p_i^t\, \dfrac{\nabla f_i(w^t)}{n\, p_i^t}$, with $\sum_{i=1}^n p_i^t = 1$, $t = 1, 2, \cdots$

Figure 2: Importance sampling
Importance sampling for stochastic gradient
$$\min_{p^t}\; \mathbb{E}\left\|\frac{\nabla f_{i_t}(w^t)}{n\, p_{i_t}^t}\right\|^2 = \min_{p^t}\; \sum_{i=1}^n \frac{\|\nabla f_i(w^t)\|^2}{n^2\, p_i^t} \ge \frac{1}{n^2}\left(\sum_{i=1}^n \|\nabla f_i(w^t)\|\right)^2$$
with the minimum attained at
$$p_i^t = \frac{\|\nabla f_i(w^t)\|}{\sum_{j=1}^n \|\nabla f_j(w^t)\|}$$
If $f_i(w)$ is $L_i$-Lipschitz, then $\|\nabla f_i(w)\| \le L_i$, so one can instead use
$$p_i^t = \frac{L_i}{\sum_{j=1}^n L_j}$$
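A sketch of one importance-sampled step for the logistic loss, using the Lipschitz-based probabilities $p_i = L_i / \sum_j L_j$; taking $L_i = \|x_i\|$ as the per-sample bound is an assumption stated here, not on the slide:

```python
import numpy as np

def importance_sgd_step(w, X, y, eta, rng):
    n = X.shape[0]
    L = np.linalg.norm(X, axis=1)             # assumed bounds L_i = ||x_i|| (||grad f_i|| <= L_i)
    p = L / L.sum()                           # p_i = L_i / sum_j L_j
    i = rng.choice(n, p=p)                    # sample index i with probability p_i
    s = -y[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
    grad_i = s * X[i]                         # grad f_i(w) for the logistic loss
    return w - eta * grad_i / (n * p[i])      # reweight by 1/(n p_i) so the step stays unbiased
```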
Experiments
Stratified Sampling

Figure 3: multi-class logistic regression (convex) on letter, mnist, pendigits, and usps.
Importance Sampling

Figure 4: SVM on several datasets.
Conclusions
Conclusions

• VR based on optimization variables (e.g., SDCA, SVRG) can drive the variance of the stochastic gradient to 0.
• VR based on sampling can significantly reduce the variance.
• Constructing correlated variates is crucial.
• Different VR methods can be combined, but how to do so effectively still requires further work.
Q & A