Local Function Optimization
COMPSCI 371D — Machine Learning
Outline
1 Gradient, Hessian, and Convexity
2 A Local, Unconstrained Optimization Template
3 Steepest Descent
4 Termination
5 Convergence Speed of Steepest Descent
6 Newton's Method
7 Convergence Speed of Newton's Method
8 Counting Steps versus Clocking
Motivation and Scope
• Parametric predictor: h(x; v) : R^d × R^m → Y
• As a predictor: h(x; v) = h_v(x) : R^d → Y
• Risk: L_T(v) = (1/N) Σ_{n=1}^{N} ℓ(y_n, h(x_n; v)) : R^m → R
• For risk minimization, h(x_n; v) = h_{x_n}(v) : R^m → Y
• Training a parametric predictor with m real parameters is function optimization: v̂ ∈ arg min_{v ∈ R^m} L_T(v)
• Some v may be subject to constraints. We ignore those ML problems for now.
• Other v may be integer-valued. We ignore those, too (combinatorial optimization).
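A minimal numeric sketch of L_T(v) as a function of the parameters alone, for a toy linear predictor with quadratic loss; the data, the predictor h(x; v) = v[0] + v[1]·x, and the loss are illustrative choices, not the course's.

import numpy as np

X = np.array([[0.0], [1.0], [2.0]])        # training inputs x_n (here d = 1)
y = np.array([1.0, 3.0, 5.0])              # training labels y_n

def risk(v):
    """L_T(v) = (1/N) * sum_n loss(y_n, h(x_n; v)) for h(x; v) = v[0] + v[1] * x."""
    predictions = v[0] + v[1] * X[:, 0]
    return np.mean((y - predictions) ** 2)  # quadratic loss as one example of ell

print(risk(np.array([1.0, 2.0])))           # 0.0: this v fits the data exactly
print(risk(np.array([0.0, 0.0])))           # larger risk for a worse v

Training means searching R^m (here m = 2) for a v that makes risk(v) small.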
Example
• A binary linear classifier has decision boundary b + w^T x = 0
• So v = (b, w) ∈ R^{d+1}
• m = d + 1
• Counterexample: Can you think of an ML method that does not involve v ∈ R^m?
Warning: Change of Notation
• Optimization is used for much more than ML
• Even in ML, there is more than risk to optimize
• So we use "generic notation" for optimization
• Function to be minimized: f(u) : R^m → R
• More in keeping with the literature... except that we use u instead of x (too loaded for us!)
• Minimizing f(u) is the same as maximizing −f(u)
Only Local Minimization
• All we know about f is a "black box" (think Python function)
• For many problems, f has many local minima
• Start somewhere (u_0), and take steps "down": f(u_{k+1}) < f(u_k)
• When we get stuck at a local minimum, we declare success
• We would like global minima, but all we get is local ones
• For some problems, f has a unique minimum...
• ... or at least a single connected set of minima
Gradient, Hessian, and Convexity
Gradient
∇f(u) = ∂f/∂u = [∂f/∂u_1, ..., ∂f/∂u_m]^T
• ∇f(u) is the direction of fastest growth of f at u
• If ∇f(u) exists everywhere, the condition ∇f(u) = 0 is necessary and sufficient for a stationary point (max, min, or saddle)
• Warning: only necessary for a minimum!
• Reduces to the first derivative for f : R → R
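A small numerical sketch of the gradient, handy for checking a hand-derived ∇f against the black-box f; the central-difference scheme, step size, and test function are illustrative, not part of the course material.

import numpy as np

def numerical_gradient(f, u, h=1e-5):
    """Central-difference estimate of the gradient of f at u (a sketch)."""
    u = np.asarray(u, dtype=float)
    g = np.zeros_like(u)
    for i in range(u.size):
        e = np.zeros_like(u)
        e[i] = h
        g[i] = (f(u + e) - f(u - e)) / (2 * h)
    return g

# At a stationary point the gradient vanishes; elsewhere it points uphill.
f = lambda u: (u[0] - 1) ** 2 + 3 * (u[1] + 2) ** 2
print(numerical_gradient(f, [1.0, -2.0]))   # ~ [0, 0] at the minimum
print(numerical_gradient(f, [2.0, 0.0]))    # nonzero: [2, 12], direction of fastest growth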
Gradient, Hessian, and Convexity
First Order Taylor Expansion
f(u) ≈ g_1(u) = f(u_0) + [∇f(u_0)]^T (u − u_0)
approximates f(u) near u_0 with a (hyper)plane through u_0
[Figure: graph of f(u) over (u_1, u_2) with the tangent plane g_1 at u_0]
∇f(u_0) points in the direction of steepest increase of f at u_0
• If we want to find u_1 where f(u_1) < f(u_0), going along −∇f(u_0) seems promising
• This is the general idea of steepest descent
Gradient, Hessian, and Convexity
Hessian
H(u) = [∂²f/∂u_i ∂u_j]_{i,j = 1, ..., m}, the m × m matrix of second derivatives, with entries from ∂²f/∂u_1², ..., ∂²f/∂u_1∂u_m in the first row to ∂²f/∂u_m∂u_1, ..., ∂²f/∂u_m² in the last
• Symmetric matrix because of Schwarz's theorem: ∂²f/∂u_i∂u_j = ∂²f/∂u_j∂u_i
• Eigenvalues are real because of symmetry
• Reduces to d²f/du² for f : R → R
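A finite-difference sketch of the Hessian, used here only to illustrate symmetry and real eigenvalues; the difference scheme, step size, and test function are illustrative assumptions.

import numpy as np

def numerical_hessian(f, u, h=1e-4):
    """Finite-difference estimate of the Hessian of f at u (a sketch)."""
    u = np.asarray(u, dtype=float)
    m = u.size
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei = np.zeros(m); ei[i] = h
            ej = np.zeros(m); ej[j] = h
            H[i, j] = (f(u + ei + ej) - f(u + ei - ej)
                       - f(u - ei + ej) + f(u - ei - ej)) / (4 * h * h)
    return H

f = lambda u: u[0] ** 2 + u[0] * u[1] + 2 * u[1] ** 2   # Hessian is [[2, 1], [1, 4]]
H = numerical_hessian(f, [0.3, -0.7])
print(np.allclose(H, H.T, atol=1e-3))   # True: symmetric (Schwarz's theorem)
print(np.linalg.eigvalsh(H))            # real eigenvalues, ~ [1.59, 4.41]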
Gradient, Hessian, and Convexity
Convexity
[Figure: the chord value z f(u) + (1 − z) f(u′) lies above the graph value f(z u + (1 − z) u′) between u and u′]
• Convex everywhere: For all u, u′ in the (open) domain of f and for all z ∈ [0, 1], f(z u + (1 − z) u′) ≤ z f(u) + (1 − z) f(u′)
• Convex at u_0: The function f is convex everywhere in some open neighborhood of u_0
Gradient, Hessian, and Convexity
Convexity and Hessian
• If H(u) is defined at a stationary point u of f, then H(u) ⪰ 0 is necessary for u to be a minimum, and H(u) ≻ 0 (positive definite) is sufficient
• "⪰" means positive semidefinite: u^T H u ≥ 0 for all u ∈ R^m
• The above is the definition of H(u) ⪰ 0
• To check computationally: all eigenvalues are nonnegative
• H(u) ⪰ 0 reduces to d²f/du² ≥ 0 for f : R → R
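A one-function computational check of H ⪰ 0 via the eigenvalue criterion above; the example matrices and tolerance are arbitrary choices.

import numpy as np

def is_psd(H, tol=1e-10):
    """A symmetric H is positive semidefinite iff all its eigenvalues are >= 0."""
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

print(is_psd(np.array([[2.0, 1.0], [1.0, 4.0]])))   # True
print(is_psd(np.array([[1.0, 3.0], [3.0, 1.0]])))   # False (eigenvalues -2 and 4)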
Gradient, Hessian, and Convexity
Second Order Taylor Expansion
f(u) ≈ g_2(u) = f(u_0) + [∇f(u_0)]^T (u − u_0) + (1/2) (u − u_0)^T H(u_0) (u − u_0)
approximates f(u) near u_0 with a quadratic function through u_0
• For minimization, this is useful only when H(u_0) ⪰ 0
• The function then looks locally like a bowl
[Figure: the quadratic bowl g_2 fitted to f at u_0, with its minimum at u_1]
• If we want to find u_1 where f(u_1) < f(u_0), going to the bottom of the bowl seems promising
• This is the general idea of Newton's method
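A minimal sketch of "going to the bottom of the bowl": minimize g_2 exactly by solving a linear system. The quadratic test function and its hand-coded derivatives are illustrative assumptions.

import numpy as np

def newton_step(grad, hess, u0):
    """One step to the minimum of g_2: solve H(u0) p = -grad(u0), then u0 + p."""
    return u0 + np.linalg.solve(hess(u0), -grad(u0))

# For a strictly convex quadratic, a single such step lands on the true minimum.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
a = np.array([1.0, -1.0])
grad = lambda u: Q @ u + a          # gradient of 0.5 u^T Q u + a^T u
hess = lambda u: Q                  # constant Hessian
print(newton_step(grad, hess, np.array([5.0, 5.0])))   # equals -Q^{-1} a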
A Local, Unconstrained Optimization Template
A Template
• Regardless of method, most local unconstrained optimization methods fit the following template, given a starting point u_0:
k = 0
while u_k is not a minimum
    compute step direction p_k
    compute step-size multiplier α_k > 0
    u_{k+1} = u_k + α_k p_k
    k = k + 1
end
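A Python rendering of this template, with the design decisions passed in as functions; the function names, arguments, and the simple stopping test are placeholders rather than prescribed choices.

import numpy as np

def minimize(f, u0, direction, step_size, delta=1e-6, max_iter=10000):
    """Generic local-minimization loop: u_{k+1} = u_k + alpha_k * p_k."""
    u = np.asarray(u0, dtype=float)
    for k in range(max_iter):
        p = direction(f, u)            # e.g. the negative gradient
        alpha = step_size(f, u, p)     # e.g. a line search
        u_new = u + alpha * p
        if np.linalg.norm(u_new - u) < delta:   # stand-in for "u_k is a minimum"
            return u_new
        u = u_new
    return u

Plugging in different direction and step_size functions yields steepest descent, Newton's method, and other variants discussed next.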
A Local, Unconstrained Optimization Template
Design Decisions
• Whether to stop ("while u_k is not a minimum")
• In what direction to proceed (p_k)
• How long a step to take in that direction (α_k)
• Different decisions for the last two lead to different methods with very different behaviors and computational costs
Steepest Descent
Steepest Descent: Follow the Gradient
• In what direction to proceed: p_k = −∇f(u_k)
• "Steepest descent" or "gradient descent"
• The problem reduces to one dimension: h(α) = f(u_k + α p_k)
• α = 0 ⇒ u = u_k
• Find α = α_k > 0 s.t. f(u_k + α_k p_k) is a local minimum along the line
• Line search (search along a line)
• Q1: How to find α_k?
• Q2: Is this a good strategy?
Steepest Descent
Line Search
• Bracketing triple: a < b < c with h(a) ≥ h(b) and h(b) ≤ h(c)
• Contains a (local) minimum!
• Split the bigger of [a, b] and [b, c] in half with a point z
• Find a new, narrower bracketing triple involving z and two out of a, b, c
• Stop when the bracket is narrow enough (say, 10⁻⁶)
• Pinned down a minimum to within 10⁻⁶
Steepest Descent
Phase 1: Find a Bracketing Triple
[Figure: plot of h(α) versus α]
Steepest Descent
Phase 2: Shrink the Bracketing Triple
[Figure: plot of h(α) versus α]
Steepest Descent
if b − a > c − b
    z = (a + b) / 2
    if h(z) > h(b)
        (a, b, c) = (z, b, c)
    otherwise
        (a, b, c) = (a, z, b)
    end
otherwise
    z = (b + c) / 2
    if h(z) > h(b)
        (a, b, c) = (a, b, z)
    otherwise
        (a, b, c) = (b, z, c)
    end
end
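A runnable Python version of the two phases. Phase 2 follows the case analysis above; the expansion strategy in Phase 1 is not spelled out on the slides, so the version below is an assumption (it relies on h decreasing initially, which holds along −∇f for small enough steps).

def bracket(h, step=1.0, grow=2.0):
    """Phase 1 (assumed strategy): return a < b < c with h(a) >= h(b) <= h(c)."""
    a, b = 0.0, step
    while h(b) > h(a):          # initial step too long: shrink until we go downhill
        b /= 2.0
    c = grow * b
    while h(c) < h(b):          # expand until the function turns back up
        a, b, c = b, c, grow * c
    return a, b, c

def shrink(h, a, b, c, tol=1e-6):
    """Phase 2: halve the bigger half until the bracket is narrower than tol."""
    while c - a > tol:
        if b - a > c - b:
            z = (a + b) / 2
            if h(z) > h(b):
                a, b, c = z, b, c
            else:
                a, b, c = a, z, b
        else:
            z = (b + c) / 2
            if h(z) > h(b):
                a, b, c = a, b, z
            else:
                a, b, c = b, z, c
    return b

# alpha_k = shrink(h, *bracket(h))  for  h(alpha) = f(u_k + alpha * p_k)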
Termination
Termination
• Are we still making "significant progress"?
• Check f(u_{k−1}) − f(u_k)? (We want this to be strictly positive)
• Check ‖u_{k−1} − u_k‖? (We want this to be large enough)
• The second check is more stringent close to the minimum, because there ∇f(u) ≈ 0 and the values f(u_k) barely change even while u_k is still moving
• Stop when ‖u_{k−1} − u_k‖ < δ
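Putting the pieces together, a sketch of steepest descent with the step-based stopping test; scipy's bounded scalar minimizer stands in for the bracketing line search of the previous slides, and the 0-to-10 search range is an arbitrary cap. The test function is illustrative.

import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, u0, delta=1e-6, max_iter=10000):
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        p = -grad(u)                                   # steepest-descent direction
        h = lambda alpha, u=u, p=p: f(u + alpha * p)   # restrict f to the line
        alpha = minimize_scalar(h, bounds=(0.0, 10.0), method='bounded').x
        u_new = u + alpha * p
        if np.linalg.norm(u_new - u) < delta:          # stop: step too small
            return u_new
        u = u_new
    return u

print(steepest_descent(lambda u: (u[0] - 1) ** 2 + 10 * u[1] ** 2,
                       lambda u: np.array([2 * (u[0] - 1), 20 * u[1]]),
                       np.array([5.0, 3.0])))          # ~ [1, 0]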
Termination
Is Steepest Descent a Good Strategy?
• "We are going in the direction of fastest descent"
• "We choose an optimal step by line search"
• "Must be good, no?" Not so fast! (Pun intended)
• An example for which we know the answer: f(u) = c + a^T u + (1/2) u^T Q u with Q ≻ 0 (a convex paraboloid)
• All smooth functions look like this close enough to u*
[Figure: elliptical isocontours of f around the minimum u*]
Termination
Skating to a Minimum
[Figure: the steepest-descent path starting at u_0 along p_0 and zigzagging toward u*]
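A small demonstration of the skating behavior, assuming the convex paraboloid of the previous slide with c = 0 and a = 0; Q, the starting point, and the iteration count are illustrative. For a quadratic, the exact line-search step has the closed form α = (gᵀg)/(gᵀQg), which the code uses directly.

import numpy as np

Q = np.diag([1.0, 20.0])              # elongated bowl (condition number 20)
u = np.array([20.0, 1.0])             # the true minimum is u* = 0
for k in range(15):
    g = Q @ u                         # gradient of 0.5 u^T Q u
    alpha = (g @ g) / (g @ (Q @ g))   # exact line-search step for a quadratic
    u = u - alpha * g
    print(k, np.linalg.norm(u))       # error shrinks by a roughly constant factor

Each iteration cuts the error by about the same factor, and that factor approaches 1 as the bowl gets more elongated: many small zigzag steps, despite "optimal" directions and step sizes.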
Termination
How to Measure Convergence Speed
• Asymptotics (k → ∞) are what matters
• If u* is the true solution, how does ‖u_{k+1} − u*‖ compare with ‖u_k − u*‖ for large k?
• Which converges faster: ‖u_{k+1} − u*‖ ≈ β ‖u_k − u*‖^1 or ‖u_{k+1} − u*‖ ≈ β ‖u_k − u*‖^2 ?
• Close to convergence these distances are small numbers, so ‖u_k − u*‖^2 ≪ ‖u_k − u*‖^1 [Example: (0.001)^2 ≪ (0.001)^1]
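A tiny numeric illustration of the two rates; β and the initial error are arbitrary choices.

beta, e_lin, e_quad = 0.5, 0.1, 0.1   # e_k stands for ||u_k - u*||
for k in range(6):
    print(k, e_lin, e_quad)
    e_lin = beta * e_lin              # order 1: e_{k+1} ≈ beta * e_k
    e_quad = beta * e_quad ** 2       # order 2: e_{k+1} ≈ beta * e_k^2

After a few iterations the order-2 error is many digits smaller than the order-1 error, which is why the exponent, not β, dominates close to convergence.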