Optimization for Machine Learning
Lecture 4: Quasi-Newton Methods
S.V.N. (Vishy) Vishwanathan, Purdue University
vishy@purdue.edu
July 11, 2012
The Story So Far

Two Different Philosophies
- Online Algorithms: use a small subset of the data at a time and cycle through it repeatedly.
- Batch Optimization: use the entire dataset to compute gradients and function values.

Gradient-Based Approaches
- Bundle Methods: lower-bound the objective function using gradients.
- Quasi-Newton algorithms: use the gradients to estimate the Hessian, i.e. build a quadratic approximation of the objective.
Outline
1. Classical Quasi-Newton Algorithms
2. Non-smooth Problems
3. BFGS with Subgradients
4. Experiments
Classical Quasi-Newton Algorithms
Broyden, Fletcher, Goldfarb, Shanno (the B, F, G, S of BFGS)
Standard BFGS - I

Locally Quadratic Approximation
- $\nabla J(w_t)$ is the gradient of $J$ at $w_t$
- $H_t$ is an $n \times n$ estimate of the Hessian of $J$
- Model: $m_t(w) = J(w_t) + \langle \nabla J(w_t), w - w_t \rangle + \tfrac{1}{2}(w - w_t)^\top H_t (w - w_t)$

Parameter Update
- $w_{t+1} = \operatorname{argmin}_w \, m_t(w)$
- Minimizing the quadratic model gives $w_{t+1} = w_t - \eta_t B_t \nabla J(w_t)$ (see the sketch below)
- $\eta_t$ is a step size, usually found via a line search
- $B_t = H_t^{-1}$ is a symmetric PSD matrix
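As a concrete illustration of the parameter update, here is a minimal NumPy sketch of one quasi-Newton step. The names (`quasi_newton_step`, `grad_J`, and the toy quadratic in the usage lines) are my own illustrative choices, not from the lecture.

```python
import numpy as np

def quasi_newton_step(w_t, grad_J, B_t, eta_t):
    """One parameter update: w_{t+1} = w_t - eta_t * B_t * grad J(w_t)."""
    g_t = grad_J(w_t)          # gradient at the current iterate
    d_t = -B_t @ g_t           # quasi-Newton search direction
    return w_t + eta_t * d_t   # step along the direction with step size eta_t

# Toy usage: one step on J(w) = 0.5 w^T A w, starting with B_t = I (reduces to gradient descent).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
w0 = np.array([1.0, -2.0])
w1 = quasi_newton_step(w0, lambda w: A @ w, np.eye(2), eta_t=0.1)
```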
Standard BFGS - II

B Matrix Update
- Update $B$ by $B_{t+1} = \operatorname{argmin}_B \|B - B_t\|_W$ s.t. $s_t = B y_t$
- $y_t = \nabla J(w_{t+1}) - \nabla J(w_t)$ is the difference of gradients
- $s_t = w_{t+1} - w_t$ is the difference in parameters
- This yields the update formula (see the sketch below):
  $B_{t+1} = \left(I - \frac{s_t y_t^\top}{\langle s_t, y_t \rangle}\right) B_t \left(I - \frac{y_t s_t^\top}{\langle s_t, y_t \rangle}\right) + \frac{s_t s_t^\top}{\langle s_t, y_t \rangle}$
- Limited memory variant: use a low-rank approximation to $B$
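The update formula above translates almost line-for-line into NumPy. The sketch below is an illustration under stated assumptions: the curvature safeguard (skipping the update when $\langle s_t, y_t \rangle$ is not safely positive) is my addition, not part of the slide.

```python
import numpy as np

def bfgs_update(B_t, s_t, y_t, eps=1e-10):
    """BFGS update of the inverse-Hessian estimate.

    s_t : w_{t+1} - w_t                      (difference in parameters)
    y_t : grad J(w_{t+1}) - grad J(w_t)      (difference of gradients)
    """
    sy = float(s_t @ y_t)              # curvature <s_t, y_t>
    if sy <= eps:                      # safeguard (my assumption): keep B_t if curvature is tiny
        return B_t
    I = np.eye(len(s_t))
    V = I - np.outer(s_t, y_t) / sy    # I - s_t y_t^T / <s_t, y_t>
    # B_{t+1} = V B_t V^T + s_t s_t^T / <s_t, y_t>
    return V @ B_t @ V.T + np.outer(s_t, s_t) / sy
```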
Line Search

Wolfe Conditions
- Sufficient decrease: $J(w_t + \eta_t d_t) \le J(w_t) + c_1 \eta_t \langle \nabla J(w_t), d_t \rangle$
- Curvature condition: $\langle \nabla J(w_t + \eta_t d_t), d_t \rangle \ge c_2 \langle \nabla J(w_t), d_t \rangle$
where $0 < c_1 < c_2 < 1$ (a line-search sketch follows below).
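A hedged sketch of a line search that backtracks until both Wolfe conditions hold. The constants and the halving schedule are illustrative defaults; production line searches instead bracket and zoom, since plain backtracking cannot always satisfy the curvature condition.

```python
def wolfe_line_search(J, grad_J, w_t, d_t, eta0=1.0, c1=1e-4, c2=0.9, max_halvings=50):
    """Backtrack from eta0 until both Wolfe conditions hold (or the budget runs out)."""
    f0 = J(w_t)
    g0 = float(grad_J(w_t) @ d_t)       # directional derivative <grad J(w_t), d_t>; should be < 0
    eta = eta0
    for _ in range(max_halvings):
        w_new = w_t + eta * d_t
        sufficient_decrease = J(w_new) <= f0 + c1 * eta * g0
        curvature = float(grad_J(w_new) @ d_t) >= c2 * g0
        if sufficient_decrease and curvature:
            return eta
        eta *= 0.5                      # shrink the trial step and try again
    return eta                          # fall back to the last trial step
```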
Outline (now: Non-smooth Problems)
1. Classical Quasi-Newton Algorithms
2. Non-smooth Problems
3. BFGS with Subgradients
4. Experiments
Non-smooth Convex Optimization
- BFGS assumes that the objective function is smooth.
- But some of the losses we care about are non-smooth, with kinks where the gradient does not exist (the slide shows plots of such losses; the hinge loss sketched below is one example).
- Houston, we have a problem!
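To make "losses like this" concrete (the slide's plots are not reproduced here), the hinge loss is a standard example of a convex but non-smooth loss; the numeric check of the one-sided slopes is my own illustration.

```python
import numpy as np

def hinge_loss(margin):
    """Hinge loss max(0, 1 - margin): convex, piecewise linear, with a kink at margin = 1."""
    return np.maximum(0.0, 1.0 - margin)

# The one-sided slopes at the kink disagree, so no gradient exists there:
eps = 1e-6
left_slope  = (hinge_loss(1.0) - hinge_loss(1.0 - eps)) / eps   # ~ -1
right_slope = (hinge_loss(1.0 + eps) - hinge_loss(1.0)) / eps   # ~  0
```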
Subgradients
- A subgradient of $f$ at $x'$ is any vector $s$ which satisfies $f(x) \ge f(x') + \langle x - x', s \rangle$ for all $x$.
- The set of all subgradients at $w$ is denoted $\partial f(w)$ (a small sketch follows below).
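A minimal sketch that returns one subgradient of $f(x) = |x|$ and spot-checks the defining inequality numerically. Returning 0 at the kink is an arbitrary but valid choice, since any value in $[-1, 1]$ is a subgradient there.

```python
import numpy as np

def subgradient_abs(x):
    """Return one element of the subdifferential of f(x) = |x|."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0   # at x = 0 the subdifferential is the whole interval [-1, 1]

# Spot-check f(x) >= f(x0) + <x - x0, s> at a few points
x0 = 0.5
s = subgradient_abs(x0)
for x in np.linspace(-3.0, 3.0, 13):
    assert abs(x) >= abs(x0) + (x - x0) * s - 1e-12
```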
Why is Non-Smooth Optimization Hard?

The Key Difficulties
- A negative subgradient direction need not be a descent direction (see the numeric check below).
- Abrupt changes in the function value can occur.
- It is difficult to detect convergence.

[Figure: $f(x) = |x|$, with $\partial f(0) = [-1, 1]$]
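A small numeric check of the first difficulty, using $f(x, y) = |x| + 2|y|$ (my own example, not from the slides): at $(1, 0)$ the vector $(1, 2)$ is a valid subgradient, yet stepping along its negative increases $f$ for every small step size.

```python
import numpy as np

def f(v):
    """f(x, y) = |x| + 2|y|: convex but non-smooth along both axes."""
    return abs(v[0]) + 2.0 * abs(v[1])

w = np.array([1.0, 0.0])
g = np.array([1.0, 2.0])   # a subgradient at (1, 0): any (1, s) with s in [-2, 2] qualifies
for eta in (0.1, 0.01, 0.001):
    print(eta, f(w), f(w - eta * g))   # f(w - eta*g) > f(w): not a descent direction
```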