On the steplength selection in Stochastic Gradient Methods
Giorgia Franchini, giorgia.franchini@unimore.it
Università degli studi di Modena e Reggio Emilia
Como, 16-18 July, 2018

Outline: Introduction; Stochastic Gradient Methods and their properties; A numerical experiment: the test problem; Future developments.
Optimization problem in machine learning

The following optimization problem, which minimizes the sum of cost functions over samples from a finite training set composed of sample data $a_i \in \mathbb{R}^d$ and class labels $b_i \in \{\pm 1\}$ for $i \in \{1, \dots, n\}$, appears frequently in machine learning:

$$\min_x F(x) \equiv \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$

where $n$ is the sample size and each $f_i : \mathbb{R}^d \to \mathbb{R}$ is the cost function corresponding to a training set element. For example, in the logistic regression case we have

$$f_i(x) = \log\left(1 + \exp(-b_i a_i^T x)\right).$$

We are interested in finding $x$ that minimizes (1).
Stochastic Gradient Descent (SGD)

For a given $x$, computing $F(x)$ and $\nabla F(x)$ is prohibitively expensive due to the large size of the training set; when $n$ is large, the Stochastic Gradient Descent (SGD) method and its variants have been the main approaches for solving (1).

At the $t$-th iteration of SGD, a random index $i_t$ of a training sample is chosen from $\{1, 2, \dots, n\}$ and the iterate $x_t$ is updated by

$$x_{t+1} = x_t - \eta_t \nabla f_{i_t}(x_t),$$

where $\nabla f_{i_t}(x_t)$ denotes the gradient of the $i_t$-th component function at $x_t$, and $\eta_t > 0$ is the step size or learning rate.
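As a minimal sketch of this update for the logistic loss above (assuming NumPy arrays A of shape (n, d) holding the samples $a_i$ as rows and b of shape (n,) holding the labels; the name sgd_step is illustrative, not from the talk):

import numpy as np

def sgd_step(x, A, b, eta, rng):
    # Draw a random index i_t and take one stochastic gradient step
    i = rng.integers(len(b))
    a_i, b_i = A[i], b[i]
    # Gradient of f_i(x) = log(1 + exp(-b_i * a_i^T x)) at x
    grad_i = -b_i * a_i / (1.0 + np.exp(b_i * (a_i @ x)))
    return x - eta * grad_i

# Example usage with a fixed step size:
# rng = np.random.default_rng(0)
# x = sgd_step(x, A, b, eta=0.01, rng=rng)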
SGD properties

Theorem (Strongly Convex Objective, Fixed Step size). Suppose that the SGD method is run with a fixed step size, $\eta_t = \bar{\eta}$ for all $t \in \mathbb{N}$, satisfying

$$0 < \bar{\eta} \le \frac{\mu}{L M_G}.$$

Then the expected optimality gap satisfies the following relation:

$$\mathbb{E}[F(x_t) - F_*] \xrightarrow[t \to \infty]{} \frac{\bar{\eta} L M}{2 c \mu}.$$

Here $L > 0$ is the Lipschitz constant of the gradient of $F(x)$; $c > 0$ is the strong convexity constant of $F(x)$; $\mu$ and $M$ are related to the first and second moment of the stochastic gradient.
Notation and assumptions

$L > 0$: Lipschitz constant of the gradient; $c > 0$: strong convexity constant;

there exist scalars $\mu_G \ge \mu > 0$ such that, for all $t \in \mathbb{N}$,

$$\nabla F(x_t)^T \, \mathbb{E}_{\xi_t}[g(x_t, \xi_t)] \ge \mu \, \|\nabla F(x_t)\|_2^2,$$
$$\|\mathbb{E}_{\xi_t}[g(x_t, \xi_t)]\|_2 \le \mu_G \, \|\nabla F(x_t)\|_2;$$

there exist scalars $M \ge 0$ and $M_V \ge 0$ such that, for all $t \in \mathbb{N}$,

$$\mathbb{V}_{\xi_t}[g(x_t, \xi_t)] \le M + M_V \, \|\nabla F(x_t)\|_2^2;$$

$$M_G := M_V + \mu_G^2 \ge \mu^2 > 0.$$
SGD properties, Diminishing Step sizes

Theorem (Strongly Convex Objective, Diminishing Step sizes). Suppose that the SGD method is run with a step size sequence such that

$$\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty.$$

Then the expected optimality gap satisfies

$$\mathbb{E}[F(x_t) - F_*] \le \frac{\nu}{\gamma + t},$$

where $\nu$ and $\gamma$ are constants.
Notation

$$\eta_t = \frac{\beta}{\gamma + t} \quad \text{for some } \beta > \frac{1}{c\mu} \text{ and } \gamma > 0 \text{ such that } \eta_1 \le \frac{\mu}{L M_G};$$

$$\nu := \max\left\{ \frac{\beta^2 L M}{2(\beta c \mu - 1)}, \; (\gamma + 1)\left(F(x_1) - F_*\right) \right\}.$$
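A small sketch of this diminishing schedule driving the SGD update of the earlier slide (the values of beta and gamma below are placeholders; in practice they must satisfy the conditions above):

def diminishing_step(t, beta=1.0, gamma=10.0):
    # eta_t = beta / (gamma + t): the sum of eta_t diverges while the sum of eta_t^2 converges
    return beta / (gamma + t)

# for t in range(1, T + 1):
#     x = sgd_step(x, A, b, diminishing_step(t), rng)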
A numerical experiment: the test problem

Logistic regression with $\ell_2$-norm regularization:

$$\min_x F(x) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-b_i a_i^T x)\right) + \frac{\lambda}{2} \|x\|^2,$$

where $a_i \in \mathbb{R}^d$ and $b_i \in \{\pm 1\}$ are the feature vector and class label of the $i$-th sample, respectively, and $\lambda > 0$ is a regularization parameter;

database: MNIST, digits 8 and 9; dimension: $11800 \times 784$.
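A minimal NumPy sketch of this regularized objective and its full gradient (the matrix A, vector b, and function name are illustrative; log(1 + exp(-m)) is computed via logaddexp for numerical stability):

import numpy as np

def objective_and_grad(x, A, b, lam):
    # F(x) = (1/n) sum_i log(1 + exp(-b_i a_i^T x)) + (lam/2) ||x||^2
    n = A.shape[0]
    margins = b * (A @ x)                               # b_i * a_i^T x
    F = np.logaddexp(0.0, -margins).mean() + 0.5 * lam * (x @ x)
    # grad F(x) = -(1/n) sum_i b_i a_i / (1 + exp(b_i a_i^T x)) + lam * x
    weights = -b / (1.0 + np.exp(margins))
    grad = (A.T @ weights) / n + lam * x
    return F, grad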
MNIST

[Figure: sample images from the MNIST handwritten digit dataset.]
A numerical experiment: the algorithms

The two full gradient BB rules:
- BB1 full gradient: a nonmonotone gradient method with the first Barzilai-Borwein step size rule;
- Adaptive BB (ABB) full gradient: a nonmonotone gradient method with a step size rule that alternates between the two BB rules.

Stochastic:
- ADAM: stochastic gradient method based on adaptive moment estimation;
- ADAM ABB.

Behaviour with respect to the epochs: one epoch = 11800 SGD steps.
Deterministic cases

$$\eta_t^{BB1} = \frac{s_{t-1}^T s_{t-1}}{s_{t-1}^T v_{t-1}}, \qquad \eta_t^{BB2} = \frac{s_{t-1}^T v_{t-1}}{v_{t-1}^T v_{t-1}},$$

where $s_{t-1} = x_t - x_{t-1}$ and $v_{t-1} = \nabla F(x_t) - \nabla F(x_{t-1})$, with $\nabla F(x) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x)$.

$$\eta_t^{ABB} = \begin{cases} \min\left\{ \eta_j^{BB2} : j = \max\{1, t - m_a\}, \dots, t \right\}, & \text{if } \dfrac{\eta_t^{BB2}}{\eta_t^{BB1}} < \tau, \\[1ex] \eta_t^{BB1}, & \text{otherwise,} \end{cases}$$

where $m_a$ is a nonnegative integer and $\tau \in (0, 1)$.

[Di Serafino, Ruggiero, Toraldo, Zanni, On the steplength selection in gradient methods for unconstrained optimization, AMC 318, 2018]
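A compact sketch of the two BB step sizes and the adaptive switch (the values tau = 0.5 and m_a = 3 are placeholders, as is the plain list used to store the BB2 history):

def bb_steps(x_prev, x_curr, g_prev, g_curr):
    # s_{t-1} = x_t - x_{t-1},  v_{t-1} = grad F(x_t) - grad F(x_{t-1})
    s = x_curr - x_prev
    v = g_curr - g_prev
    bb1 = (s @ s) / (s @ v)
    bb2 = (s @ v) / (v @ v)
    return bb1, bb2

def abb_step(bb1, bb2, bb2_history, tau=0.5, m_a=3):
    # Alternate: take the smallest recent BB2 value when BB2/BB1 < tau, else BB1
    bb2_history.append(bb2)
    recent = bb2_history[-(m_a + 1):]
    return min(recent) if bb2 / bb1 < tau else bb1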
Kingma, Lei Ba, Adam: a method for stochastic optimization, ArXiv, 2017

Algorithm 1: Adam
1: Choose maxit, $\eta$, $\epsilon$, $\beta_1, \beta_2 \in [0, 1)$, $x_0$;
2: initialize $m_0 \leftarrow 0$, $v_0 \leftarrow 0$, $t \leftarrow 0$
3: for $t \in \{0, \dots, \text{maxit}\}$ do
4:   $t \leftarrow t + 1$
5:   $g_t \leftarrow \nabla f_{i_t}(x_{t-1})$
6:   $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$
7:   $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$
8:   $\eta_t = \eta \, \sqrt{1 - \beta_2^t} \, / \, (1 - \beta_1^t)$
9:   $x_t \leftarrow x_{t-1} - \eta_t \cdot m_t / (\sqrt{v_t} + \epsilon)$
10: end for
11: Result: $x_t$
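A runnable NumPy sketch of Algorithm 1 applied to a generic stochastic gradient oracle (the name stoch_grad and the default values of eta, beta1, beta2 and eps are illustrative choices, not taken from the slides):

import numpy as np

def adam(x0, stoch_grad, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, maxit=1000):
    # stoch_grad(x, t) returns a stochastic gradient g_t at x
    x = x0.copy()
    m = np.zeros_like(x0)
    v = np.zeros_like(x0)
    for t in range(1, maxit + 1):
        g = stoch_grad(x, t)
        m = beta1 * m + (1 - beta1) * g               # first moment estimate
        v = beta2 * v + (1 - beta2) * g**2            # second moment estimate
        eta_t = eta * np.sqrt(1 - beta2**t) / (1 - beta1**t)   # bias-corrected step size
        x = x - eta_t * m / (np.sqrt(v) + eps)
    return x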
Behaviour of the deterministic and the stochastic methods

[Figure: optimality gap $F - F_*$ (log scale) versus epochs (0-100) for ADAM, BB1 full gradient, and ABB full gradient.]
Comparison between different SGD types

[Figure: optimality gap $F - F_*$ (log scale) versus epochs (0-20) for ADAM, SGD, and momentum.]