Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning
Jacob Rafati (http://rafati.net, jrafatiheravi@ucmerced.edu)
Ph.D. Candidate, Electrical Engineering and Computer Science, University of California, Merced
Agenda
• Introduction, Problem Statement and Motivations
• Overview of Quasi-Newton Optimization Methods
• L-BFGS Trust-Region Optimization Method
• Proposed Methods for Initialization of L-BFGS
• Application in Deep Learning (Image Classification Task)
Introduction, Problem Statement and Motivations
Unconstrained Optimization Problem

\min_{w \in \mathbb{R}^n} \mathcal{L}(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w), \qquad \mathcal{L} : \mathbb{R}^n \to \mathbb{R}
Optimization Algorithms

Bottou et al. (2016). Optimization methods for large-scale machine learning. Preprint arXiv:1606.04838.
Optimization Algorithms
1. Start from a random point w_0.
2. Repeat for each iteration k = 0, 1, 2, \ldots
3. Choose a search direction p_k.
4. Choose a step size \alpha_k.
5. Update parameters: w_{k+1} \leftarrow w_k + \alpha_k p_k.
6. Until \|\nabla\mathcal{L}\| < \epsilon.
Properties of Objective Function

\min_{w \in \mathbb{R}^n} \mathcal{L}(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w)

• n and N are both large in modern applications.
• \mathcal{L}(w) is a non-convex and nonlinear function.
• \nabla^2\mathcal{L}(w) is ill-conditioned.
• Computing the full gradient \nabla\mathcal{L} is expensive.
• Computing the Hessian \nabla^2\mathcal{L} is not practical.
Stochastic Gradient Descent
1. Sample indices S_k \subset \{1, 2, \ldots, N\}.
2. Compute the stochastic (subsampled) gradient
   \nabla\mathcal{L}(w_k) \approx \nabla\mathcal{L}(w_k)^{(S_k)} \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla\ell_i(w_k).
3. Assign a learning rate \alpha_k and set p_k = -\nabla\mathcal{L}(w_k)^{(S_k)}.
4. Update parameters: w_{k+1} \leftarrow w_k - \alpha_k \nabla\mathcal{L}(w_k)^{(S_k)}.

H. Robbins and D. Siegmund (1971). "A convergence theorem for non negative almost supermartingales and some applications." Optimizing Methods in Statistics.
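A minimal NumPy sketch of one SGD iteration over a sampled minibatch, assuming a hypothetical per-example gradient callable grad_fn(w, i):

```python
import numpy as np

def sgd_step(w, grad_fn, N, batch_size, lr, rng):
    """One SGD iteration: sample a minibatch S_k of indices, average the
    per-example gradients, and step in the negative gradient direction."""
    S_k = rng.choice(N, size=batch_size, replace=False)    # sample indices
    g = np.mean([grad_fn(w, i) for i in S_k], axis=0)      # subsampled gradient
    return w - lr * g                                      # w_{k+1} = w_k - alpha_k * g

# usage (grad_fn is user-supplied):
# rng = np.random.default_rng(0)
# w = sgd_step(w, grad_fn, N=60000, batch_size=128, lr=0.01, rng=rng)
```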
Advantages of SGD
• SGD algorithms are very easy to implement.
• SGD requires only computing the gradient.
• SGD has a low cost per iteration.

Bottou et al. (2016). Optimization methods for large-scale machine learning. Preprint arXiv:1606.04838.
Disadvantages of SGD
• Very sensitive to ill-conditioning and scaling.
• Requires fine-tuning of many hyper-parameters.
• Unlikely to exhibit acceptable performance on the first try; requires many trials and errors.
• Can get stuck at a saddle point instead of a local minimum.
• Sublinear and slow rate of convergence.

Bottou et al. (2016). Optimization methods for large-scale machine learning. Preprint arXiv:1606.04838.
J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
Second-Order Methods
1. Sample indices S_k \subset \{1, 2, \ldots, N\}.
2. Compute the stochastic (subsampled) gradient
   \nabla\mathcal{L}(w_k) \approx \nabla\mathcal{L}(w_k)^{(S_k)} \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla\ell_i(w_k).
3. Compute the subsampled Hessian
   \nabla^2\mathcal{L}(w_k) \approx \nabla^2\mathcal{L}(w_k)^{(S_k)} \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla^2\ell_i(w_k).
Second-Order Methods
4. Compute Newton's direction: p_k = -\nabla^2\mathcal{L}(w_k)^{-1} \nabla\mathcal{L}(w_k).
5. Find a proper step length: \alpha_k = \arg\min_{\alpha} \mathcal{L}(w_k + \alpha p_k).
6. Update parameters: w_{k+1} \leftarrow w_k + \alpha_k p_k.
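A sketch of one subsampled Newton iteration under the same assumptions (hypothetical grad_fn and hess_fn callables); it is only practical when n is small enough to form and factor the minibatch Hessian:

```python
import numpy as np

def subsampled_newton_step(w, grad_fn, hess_fn, N, batch_size, alpha, rng):
    """One subsampled Newton iteration: average per-example gradients and
    Hessians over a minibatch S_k, then solve H p = -g for the direction."""
    S_k = rng.choice(N, size=batch_size, replace=False)
    g = np.mean([grad_fn(w, i) for i in S_k], axis=0)
    H = np.mean([hess_fn(w, i) for i in S_k], axis=0)
    p = np.linalg.solve(H, -g)    # Newton direction without forming H^{-1}
    return w + alpha * p
```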
Second-Order Methods: Advantages
• The rate of convergence is super-linear (quadratic for Newton's method).
• They are resilient to problem ill-conditioning.
• They involve less parameter tuning.
• They are less sensitive to the choice of hyper-parameters.

J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
Second-Order Methods: Disadvantages
• Computing the Hessian matrix is very expensive and requires massive storage.
• Computing the inverse of the Hessian is not practical.

J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
Bottou et al. (2016). Optimization methods for large-scale machine learning. Preprint arXiv:1606.04838.
Quasi-Newton Methods
1. Construct a low-rank approximation of the Hessian: B_k \approx \nabla^2\mathcal{L}(w_k).
2. Find the search direction by minimizing the quadratic model of the objective function:
   p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p.
Quasi-Newton Matrices
• Symmetric
• Easy and fast computation
• Satisfies the secant condition B_{k+1} s_k = y_k, where
  s_k \triangleq w_{k+1} - w_k, \qquad y_k \triangleq \nabla\mathcal{L}(w_{k+1}) - \nabla\mathcal{L}(w_k).
Broyden-Fletcher-Goldfarb-Shanno (BFGS)

J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
Broyden-Fletcher-Goldfarb-Shanno (BFGS)

B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k},

where s_k \triangleq w_{k+1} - w_k, \quad y_k \triangleq \nabla\mathcal{L}(w_{k+1}) - \nabla\mathcal{L}(w_k), \quad B_0 = \gamma_k I.

J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
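As an illustration, a dense NumPy version of the BFGS update above (only a sketch: it assumes the curvature condition s^T y > 0 holds, and a dense B_k is impractical for large n):

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation:
    B_{k+1} = B_k - (B_k s s^T B_k)/(s^T B_k s) + (y y^T)/(y^T s)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```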
Quasi-Newton Methods: Advantages
• The rate of convergence is super-linear.
• They are resilient to problem ill-conditioning.
• The second derivative is not required.
• They only use gradient information to construct the quasi-Newton matrices.

J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
Quasi-Newton Methods: Disadvantages
• The cost of storing the gradient information can be expensive.
• The quasi-Newton matrix can be dense.
• The quasi-Newton matrix grows in size and rank in large-scale problems.

J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
Limited-Memory BFGS

Limited-memory storage:
S_k = [\, s_{k-m} \;\; \ldots \;\; s_{k-1} \,], \qquad Y_k = [\, y_{k-m} \;\; \ldots \;\; y_{k-1} \,]

L-BFGS compact representation:
B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I,

where

\Psi_k = [\, B_0 S_k \;\; Y_k \,], \qquad M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1},

and L_k, D_k, U_k are the strictly lower-triangular, diagonal, and strictly upper-triangular parts of S_k^T Y_k = L_k + D_k + U_k.
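A sketch of how the compact representation might be assembled, assuming S and Y are n-by-m matrices whose columns are the m most recent pairs; the explicit inversion of the small 2m-by-2m block is shown only for readability:

```python
import numpy as np

def lbfgs_compact(S, Y, gamma):
    """Build Psi_k and M_k with B_k = gamma*I + Psi_k @ M_k @ Psi_k.T."""
    SY = S.T @ Y
    D = np.diag(np.diag(SY))          # diagonal part of S^T Y
    L = np.tril(SY, k=-1)             # strictly lower-triangular part of S^T Y
    Psi = np.hstack([gamma * S, Y])   # [B_0 S_k  Y_k] with B_0 = gamma*I
    M = -np.linalg.inv(np.block([[gamma * (S.T @ S), L],
                                 [L.T,               -D]]))
    return Psi, M
```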
Limited-Memory Quasi-Newton Methods
• Low-rank approximation.
• Small memory of recent gradients.
• Low cost of computing the search direction.
• A linear or superlinear convergence rate can be achieved.

J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
Objectives

B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I

What is the best choice for the initialization \gamma_k?
Overview of Quasi-Newton Optimization Strategies
Line Search Method

[Figure: search direction p_k and steepest-descent direction -\nabla\mathcal{L}(w_k) at w_k]

Quadratic model:
p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p

If B_k is positive definite: p_k = -B_k^{-1} g_k.

J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
Line Search Method

[Figure: step from w_k to w_{k+1} = w_k + \alpha_k p_k along the search direction p_k]

Wolfe conditions:
\mathcal{L}(w_k + \alpha_k p_k) \le \mathcal{L}(w_k) + c_1 \alpha_k \nabla\mathcal{L}(w_k)^T p_k
\nabla\mathcal{L}(w_k + \alpha_k p_k)^T p_k \ge c_2 \nabla\mathcal{L}(w_k)^T p_k

J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
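A small helper that checks the two Wolfe conditions for a candidate step length, assuming hypothetical callables L(w) and grad(w) and typical constants c1 = 1e-4, c2 = 0.9:

```python
def wolfe_conditions_hold(L, grad, w, p, alpha, c1=1e-4, c2=0.9):
    """Return True if alpha satisfies both the sufficient-decrease and
    the curvature (Wolfe) conditions along the search direction p."""
    g0 = grad(w)
    w_new = w + alpha * p
    sufficient_decrease = L(w_new) <= L(w) + c1 * alpha * (g0 @ p)
    curvature = grad(w_new) @ p >= c2 * (g0 @ p)
    return sufficient_decrease and curvature
```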
Trust Region Method

p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \quad \text{s.t.} \quad \|p\|_2 \le \delta_k

[Figure: trust region of radius \delta_k around w_k containing the step p_k]

Conn, Gould, and Toint (2000). Trust-Region Methods. SIAM.
Trust Region Method

[Figure: contour plot showing the Newton step, a local minimum, and the global minimum]

J. J. Moré and D. C. Sorensen (1984). Newton's method. In Studies in Numerical Analysis (Studies in Mathematics, Vol. 24), Math. Assoc. of America, pp. 29-82.
L-BFGS Trust Region Optimization Method
L-BFGS in Trust Region

B_k = B_0 + \Psi_k M_k \Psi_k^T

Eigen-decomposition:
B_k = P \begin{bmatrix} \Lambda + \gamma_k I & 0 \\ 0 & \gamma_k I \end{bmatrix} P^T

Sherman-Morrison-Woodbury formula:
p^* = -\frac{1}{\tau^*} \left[ I - \Psi_k \left( \tau^* M_k^{-1} + \Psi_k^T \Psi_k \right)^{-1} \Psi_k^T \right] g_k, \qquad \tau^* = \gamma_k + \sigma^*

L. Adhikari et al. (2017). "Limited-memory trust-region methods for sparse relaxation." In Proc. SPIE, vol. 10394.
Brust et al. (2017). "On solving L-SR1 trust-region subproblems." Computational Optimization and Applications, vol. 66, pp. 245-266.
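A sketch of the search-direction computation via the Sherman-Morrison-Woodbury identity, assuming Psi, M, gamma come from the compact representation and sigma >= 0 has already been found (e.g., by 1-D root finding on the trust-region constraint):

```python
import numpy as np

def smw_direction(Psi, M, gamma, sigma, g):
    """Compute p* = -(B_k + sigma*I)^{-1} g_k without forming any n-by-n
    matrix, where B_k = gamma*I + Psi M Psi^T and tau = gamma + sigma."""
    tau = gamma + sigma
    inner = tau * np.linalg.inv(M) + Psi.T @ Psi      # small (2m)-by-(2m) system
    correction = Psi @ np.linalg.solve(inner, Psi.T @ g)
    return -(g - correction) / tau
```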
L-BFGS in Trust Region vs. Line Search

26th European Signal Processing Conference (EUSIPCO), Rome, Italy, September 2018.
Proposed Methods for Initialization of L-BFGS
Initialization Method I

B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I

Spectral estimate of the Hessian:
\gamma_k = \frac{y_{k-1}^T y_{k-1}}{s_{k-1}^T y_{k-1}} = \arg\min_{\gamma} \| B_0^{-1} y_{k-1} - s_{k-1} \|_2^2, \qquad B_0 = \gamma I

J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
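A one-line sketch of this spectral choice, given the most recent pair (s_{k-1}, y_{k-1}) as NumPy vectors:

```python
def spectral_gamma(s_prev, y_prev):
    """gamma_k = (y^T y) / (s^T y): the minimizer of ||B_0^{-1} y - s||_2^2
    over B_0 = gamma * I (well defined when s^T y > 0)."""
    return (y_prev @ y_prev) / (s_prev @ y_prev)
```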
Initialization Method II

Consider a quadratic function \mathcal{L}(w) = \frac{1}{2} w^T H w + g^T w, so that \nabla^2\mathcal{L}(w) = H.

We have H S_k = Y_k, and therefore S_k^T H S_k = S_k^T Y_k.

Erway et al. (2018). "Trust-Region Algorithms for Training Responses: Machine Learning Methods Using Indefinite Hessian Approximations." ArXiv e-prints.
Initialization Method II

Since B_k = B_0 + \Psi_k M_k \Psi_k^T with B_0 = \gamma_k I,
\Psi_k = [\, B_0 S_k \;\; Y_k \,], \qquad M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1},

and the secant condition B_k S_k = Y_k holds, we have

S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k.
Initialization Method II

S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k

Generalized eigenvalue problem:
(L_k + D_k + L_k^T) \, z = \lambda \, S_k^T S_k \, z

Upper bound on the initial value to avoid false curvature information:
\gamma_k \in (0, \lambda_{\min})
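A sketch of how \lambda_{\min} might be computed with SciPy's generalized symmetric eigensolver, assuming the columns of S are linearly independent so that S^T S is positive definite:

```python
import numpy as np
from scipy.linalg import eigh

def gamma_upper_bound(S, Y):
    """Smallest eigenvalue of (L_k + D_k + L_k^T) z = lambda (S_k^T S_k) z;
    choosing gamma_k in (0, lambda_min) avoids false curvature information."""
    SY = S.T @ Y
    L = np.tril(SY, k=-1)                      # strictly lower-triangular part
    A = L + np.diag(np.diag(SY)) + L.T         # L_k + D_k + L_k^T
    lam = eigh(A, S.T @ S, eigvals_only=True)  # generalized symmetric problem
    return lam.min()
```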
Initialization Method III

B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I

S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k

Note that the compact-representation matrices themselves contain \gamma_k:
\Psi_k = [\, B_0 S_k \;\; Y_k \,], \qquad M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}