Convergence of a Stochastic Gradient Method with Momentum for Non-Smooth Non-Convex Optimization
Vien V. Mai and Mikael Johansson
KTH - Royal Institute of Technology
Stochastic optimization

Stochastic optimization problem:

    \underset{x \in \mathcal{X}}{\mathrm{minimize}} \quad f(x) := \mathbb{E}_P[f(x; S)] = \int_{\mathcal{S}} f(x; s)\, dP(s)

Stochastic gradient descent (SGD):

    x_{k+1} = x_k - \alpha_k g_k, \qquad g_k \in \partial f(x_k; S_k)

SGD with momentum:

    x_{k+1} = x_k - \alpha_k z_k, \qquad z_{k+1} = \beta_k g_{k+1} + (1 - \beta_k) z_k

Includes Polyak's heavy ball, Nesterov's fast gradient, and more
• widespread empirical success
• theory less clear than for the deterministic counterpart
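As a concrete illustration of the momentum recursion above, here is a minimal NumPy sketch; the toy least-squares objective and the constant parameter values are assumptions chosen for illustration, not the settings analyzed in the talk.

```python
import numpy as np

def sgd_momentum(grad_sample, x0, n_steps, alpha=0.01, beta=0.1, seed=0):
    """SGD with momentum: x_{k+1} = x_k - alpha*z_k, z_{k+1} = beta*g_{k+1} + (1-beta)*z_k."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    z = grad_sample(x, rng)            # initialize the direction with a first stochastic (sub)gradient
    for _ in range(n_steps):
        x = x - alpha * z              # step along the averaged direction z_k
        g = grad_sample(x, rng)        # fresh stochastic (sub)gradient at the new iterate
        z = beta * g + (1 - beta) * z  # exponential averaging of past gradients
    return x

# Toy stochastic objective (an assumption for illustration): f(x) = 0.5 * E ||x - S||^2, S ~ N(0, I)
grad = lambda x, rng: x - rng.normal(size=x.shape)
print(sgd_momentum(grad, x0=np.ones(5), n_steps=1000))
```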
Stochastic optimization: sample complexity

For SGD, sample complexity is known under various assumptions
• convexity [Nemirovski et al., 2009]
• smoothness [Ghadimi-Lan, 2013]
• weak convexity [Davis-Drusvyatskiy, 2019]

Much less is known for momentum-based methods, in particular in the
• constrained
• non-smooth non-convex
settings.
Our contributions

Novel Lyapunov analysis for (projected) stochastic heavy ball (SHB):
• sample complexity of SHB for stochastic weakly convex minimization
• analysis of the smooth non-convex case under less restrictive assumptions
Outline

• Background and motivation
• SHB for non-smooth non-convex optimization
• Sharper results for smooth non-convex optimization
• Numerical examples
• Summary and conclusions
Problem formulation

Problem:

    \underset{x \in \mathcal{X}}{\mathrm{minimize}} \quad f(x) := \mathbb{E}_P[f(x; S)] = \int_{\mathcal{S}} f(x; s)\, dP(s)

\mathcal{X} is closed and convex; f is ρ-weakly convex, meaning that

    x \mapsto f(x) + \frac{\rho}{2} \|x\|_2^2 \quad \text{is convex.}

Weak convexity is easy to recognize, e.g., for convex compositions

    f(x) = h(c(x))

with h convex and L_h-Lipschitz, and c smooth with an L_c-Lipschitz Jacobian (ρ = L_h L_c).
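As a concrete instance (the specific loss below is an illustrative assumption, chosen to match the robust phase-retrieval experiments later in the talk): for a measurement pair (a, b),

    f(x; (a, b)) = |\langle a, x \rangle^2 - b| = h(c(x)), \qquad h(t) = |t|, \qquad c(x) = \langle a, x \rangle^2 - b,

where h is convex and 1-Lipschitz and \nabla c(x) = 2 \langle a, x \rangle a is Lipschitz with constant L_c = 2\|a\|_2^2, so f(\cdot\,; (a, b)) is ρ-weakly convex with ρ = L_h L_c = 2\|a\|_2^2.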
Algorithm

Consider

    \underset{x \in \mathcal{X}}{\mathrm{minimize}} \quad f(x) := \mathbb{E}_P[f(x; S)] = \int_{\mathcal{S}} f(x; s)\, dP(s)

Algorithm (projected SHB):

    x_{k+1} = \underset{x \in \mathcal{X}}{\arg\min} \left\{ \langle z_k, x - x_k \rangle + \frac{1}{2\alpha} \|x - x_k\|_2^2 \right\}

    z_{k+1} = \beta g_{k+1} + (1 - \beta) \frac{x_k - x_{k+1}}{\alpha}

Recovers SHB when \mathcal{X} = \mathbb{R}^n; setting β = 1 gives (projected) SGD.

Goal: establish sample complexity
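A minimal sketch of the projected iteration, assuming for concreteness that \mathcal{X} is a Euclidean ball (so the argmin step reduces to projecting x_k - αz_k onto \mathcal{X}); the constraint, oracle, and parameter values are placeholders, not the paper's implementation.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius} (one possible choice of X)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def projected_shb(subgrad_sample, x0, n_steps, alpha, beta, project, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    z = subgrad_sample(x, rng)                            # z_0: a first stochastic subgradient
    for _ in range(n_steps):
        x_next = project(x - alpha * z)                   # the prox/projection step of the algorithm
        g = subgrad_sample(x_next, rng)                   # g_{k+1} in the subdifferential of f(.; S_{k+1})
        z = beta * g + (1 - beta) * (x - x_next) / alpha  # momentum update with the implicit direction
        x = x_next
    return x

# Example usage with the l1-norm subgradient sign(x) as a stand-in oracle:
x_hat = projected_shb(lambda x, rng: np.sign(x), np.ones(3), 100, 0.01, 0.1, project_ball)
print(x_hat)
```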
Roadmap and challenges

Most complexity results for subgradient-based methods rely on forming

    \mathbb{E}[V_{k+1}] \le \mathbb{E}[V_k] - \alpha\, \mathbb{E}[e_k] + \alpha^2 C^2,

which immediately yields O(1/\epsilon^2) complexity for \mathbb{E}[e_k] (see the derivation below).

Stationarity measure:
• f convex        ⇒  e_k = f(x_k) - f(x^\star)
• f smooth        ⇒  e_k = \|\nabla f(x_k)\|_2^2
• f weakly convex ⇒  e_k = \|\nabla F_\lambda(x_k)\|_2^2

Lyapunov analysis (for SGD):
• f convex        ⇒  V_k = \|x_k - x^\star\|_2^2   [Shor, 1964]
• f smooth        ⇒  V_k = f(x_k)                  [Ghadimi-Lan, 2013]
• f weakly convex ⇒  V_k = F_\lambda(x_k)          [Davis-Drusvyatskiy, 2019]
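To spell out the step behind that claim (a short derivation added for completeness; it assumes V_k is bounded below, which holds for the choices listed above): summing the recursion from k = 0 to K and dividing by α(K+1) gives

    \frac{1}{K+1} \sum_{k=0}^{K} \mathbb{E}[e_k] \le \frac{\mathbb{E}[V_0] - \inf_k \mathbb{E}[V_{k+1}]}{\alpha (K+1)} + \alpha C^2,

so with α ∝ 1/\sqrt{K+1} the best (or a randomly sampled) iterate satisfies \mathbb{E}[e_{k^*}] = O(1/\sqrt{K+1}), i.e., \mathbb{E}[e_{k^*}] \le \epsilon after K = O(1/\epsilon^2) iterations, one sample per iteration.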
Convergence to stationarity in weakly convex cases

Moreau envelope:

    F_\lambda(x) = \inf_{y} \left\{ F(y) + \frac{1}{2\lambda} \|x - y\|_2^2 \right\}

Proximal mapping:

    \hat{x} := \underset{y \in \mathbb{R}^n}{\arg\min} \left\{ F(y) + \frac{1}{2\lambda} \|x - y\|_2^2 \right\}

Connection to near-stationarity:

    \|x - \hat{x}\|_2 = \lambda \|\nabla F_\lambda(x)\|_2, \qquad \nabla F_\lambda(x) = \lambda^{-1}(x - \hat{x}), \qquad \mathrm{dist}(0, \partial F(\hat{x})) \le \|\nabla F_\lambda(x)\|_2

Small \|\nabla F_\lambda(x)\|_2  ⇒  x is close to a near-stationary point
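A quick numerical sanity check of the identity \nabla F_\lambda(x) = \lambda^{-1}(x - \hat{x}), using the one-dimensional example F(y) = |y| (chosen here for illustration; its proximal map is soft-thresholding):

```python
import numpy as np

def prox_abs(x, lam):
    """Proximal map of F(y) = |y|: soft-thresholding."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def moreau_env(x, lam):
    """F_lambda(x) = min_y |y| + (x - y)^2 / (2*lam), evaluated at the prox point."""
    y = prox_abs(x, lam)
    return abs(y) + (x - y) ** 2 / (2 * lam)

x, lam, h = 1.7, 0.5, 1e-6
grad_numeric = (moreau_env(x + h, lam) - moreau_env(x - h, lam)) / (2 * h)
grad_formula = (x - prox_abs(x, lam)) / lam   # lambda^{-1} (x - x_hat)
print(grad_numeric, grad_formula)             # both ~1.0: the two expressions agree
```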
Lyapunov analysis for SHB

Recall that we wanted

    \mathbb{E}[V_{k+1}] \le \mathbb{E}[V_k] - \alpha\, \mathbb{E}[e_k] + \alpha^2 C^2

SGD works with e_k = \|\nabla F_\lambda(x_k)\|_2^2 and V_k = F_\lambda(x_k).

It seems natural to take e_k = \|\nabla F_\lambda(\cdot)\|_2^2. Two questions:
• at which point should we evaluate \nabla F_\lambda(\cdot)?
• can we find a corresponding Lyapunov function V_k?
Lyapunov analysis for SHB

Our approach: take \nabla F_\lambda(\cdot) at the extrapolated iterate

    \bar{x}_k := x_k + \frac{1 - \beta}{\beta} (x_k - x_{k-1})

Define the corresponding proximal point

    \hat{x}_k = \underset{y \in \mathbb{R}^n}{\arg\min} \left\{ F(y) + \frac{1}{2\lambda} \|y - \bar{x}_k\|_2^2 \right\}

This gives

    e_k = \|\nabla F_\lambda(\bar{x}_k)\|_2^2, \qquad \nabla F_\lambda(\bar{x}_k) = \lambda^{-1}(\bar{x}_k - \hat{x}_k)
Lyapunov analysis for SHB

Let β = να so that β ∈ (0, 1], and define ξ = (1 - β)/ν. Consider the function

    V_k = F_\lambda(\bar{x}_k) + \frac{(1-\beta)\xi^2 + \xi}{2\lambda} \|p_k\|_2^2 + \frac{\nu \xi^2}{2\lambda} \|d_k\|_2^2 + \frac{\alpha \xi^2}{4\lambda} f(x_{k-1}),

    where  p_k = \frac{1-\beta}{\beta}(x_k - x_{k-1})  and  d_k = (x_{k-1} - x_k)/\alpha.

Theorem. For any k ∈ \mathbb{N}, it holds that

    \mathbb{E}[V_{k+1}] \le \mathbb{E}[V_k] - \alpha\, \mathbb{E}\big[\|\nabla F_\lambda(\bar{x}_k)\|_2^2\big] + \frac{\alpha^2 C L^2}{2\lambda}.
Main result: sample complexity

Taking α = \alpha_0/\sqrt{K} and β = O(1/\sqrt{K}) ∈ (0, 1] yields

    \mathbb{E}\big[\|\nabla F_{1/(2\rho)}(\bar{x}_{k^*})\|_2^2\big] \le O\!\left(\frac{\rho \Delta + L^2}{\sqrt{K+1}}\right), \qquad \Delta = f(x_0) - \inf_{x \in \mathcal{X}} f(x)

Note:
• same worst-case complexity as SGD (β = 1)
• β can be as small as O(1/\sqrt{K})
• (much) more weight on the momentum term than on the fresh subgradient

This rate cannot, in general, be improved [Arjevani et al., 2019].
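Spelled out as a sample complexity (an added restatement, using the convention that ε-stationarity means \mathbb{E}\big[\|\nabla F_{1/(2\rho)}(\bar{x}_{k^*})\|_2\big] \le \epsilon): the bound falls below \epsilon^2 once

    K + 1 \ge O\!\left(\frac{(\rho \Delta + L^2)^2}{\epsilon^4}\right),

i.e., after O\big((\rho\Delta + L^2)^2/\epsilon^4\big) stochastic subgradient evaluations, matching the known O(1/\epsilon^4) sample complexity of SGD in this setting.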
Outline

• Background and motivation
• SHB for non-smooth non-convex optimization
• Sharper results for smooth non-convex optimization
• Numerical examples
• Summary and conclusions
Smooth and non-convex optimization

Problem:

    \underset{x \in \mathcal{X}}{\mathrm{minimize}} \quad f(x) := \mathbb{E}_P[f(x; S)] = \int_{\mathcal{S}} f(x; s)\, dP(s)

\mathcal{X} is closed and convex; f is ρ-smooth:

    \|\nabla f(x) - \nabla f(y)\|_2 \le \rho \|x - y\|_2, \qquad \forall x, y \in \mathrm{dom}\, f.

Assumption. There exists a real σ > 0 such that for all x ∈ \mathcal{X}:

    \mathbb{E}\big[\|f'(x; S) - \nabla f(x)\|_2^2\big] \le \sigma^2.

Note:
• the complexity of SHB is not known (even in the deterministic case)
• when \mathcal{X} = \mathbb{R}^n, O(1/\epsilon^2) is obtained under a bounded-gradients assumption [Yan et al., 2018]
Improved complexities on smooth non-convex problems

Constrained case: suppose that \|\nabla f(x)\|_2 \le G for all x ∈ \mathcal{X}. If we set α = \frac{\alpha_0}{\sqrt{K+1}}, then

    \mathbb{E}\big[\|\nabla F_\lambda(\bar{x}_{k^*})\|_2^2\big] \le O\!\left(\frac{\rho \Delta + \sigma^2 + G^2}{\sqrt{K+1}}\right).

Unconstrained case: if we set α = \frac{\alpha_0}{\sqrt{K+1}} with \alpha_0 \in (0, 1/(4\rho)], then

    \mathbb{E}\big[\|\nabla F_\lambda(\bar{x}_{k^*})\|_2^2\big] \le O\!\left(\frac{(1 + 8\rho^2\alpha_0^2)\,\Delta + (\rho + 16\alpha_0\rho^2)\,\sigma^2 \alpha_0^3}{\alpha_0 \sqrt{K+1}}\right).
Experiments: convergence behavior on phase retrieval

[Figure: function gap vs. #iterations for phase retrieval with p_fail = 0.2 and β = 10/\sqrt{K}; panels (a) κ = 1, α_0 = 0.1 and (b) κ = 1, α_0 = 0.15]

• Exponential growth before eventual convergence (observed also in [Asi-Duchi, 2019]) is not shown.
• SGD is competitive if well tuned, but sensitive to the stepsize choice.
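For readers who want to reproduce a setup of this flavor, here is a minimal sketch of a robust phase-retrieval instance and its stochastic subgradient; the data-generation details (Gaussian measurements, the corruption model, problem sizes) are assumptions for illustration, not the exact settings of the experiments. The subgradient oracle can be plugged into the projected SHB sketch given earlier (or into plain SGD by setting β = 1).

```python
import numpy as np

def make_phase_retrieval(m=300, n=50, p_fail=0.2, seed=0):
    """Synthetic robust phase retrieval: b_i = <a_i, x_true>^2, with a fraction p_fail corrupted."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(m, n))
    x_true = rng.normal(size=n)
    b = (A @ x_true) ** 2
    corrupt = rng.random(m) < p_fail
    b[corrupt] = np.abs(rng.normal(size=corrupt.sum())) * b.mean()  # arbitrary corruptions
    return A, b, x_true

def subgrad_sample(x, A, b, rng):
    """Stochastic subgradient of f(x) = (1/m) sum_i |<a_i, x>^2 - b_i| from one random index."""
    i = rng.integers(len(b))
    a = A[i]
    r = (a @ x) ** 2 - b[i]
    return np.sign(r) * 2 * (a @ x) * a

A, b, x_true = make_phase_retrieval()
print(np.mean(np.abs((A @ x_true) ** 2 - b)))  # residual caused purely by the corrupted entries
```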
Experiments: sensitivity to initial stepsize

[Figure: #epochs to achieve ε-accuracy vs. initial stepsize α_0 with κ = 10; panels (a) β = 1/\sqrt{K} and (b) β = 1/(α_0 \sqrt{K})]
Experiments: popular momentum parameter

[Figure: #epochs to achieve ε-accuracy vs. initial stepsize α_0 with κ = 10; panels (a) 1 - β = 0.9 and (b) 1 - β = 0.99]
Conclusion

SGD with momentum:
• a simple modification of SGD
• good performance and less sensitivity to algorithm parameters

Novel Lyapunov analysis:
• sample complexity of SHB for weakly convex and constrained optimization
• improved rates on smooth non-convex problems