Introduction to Machine Learning 5. Optimization Geoff Gordon and Alex Smola Carnegie Mellon University http://alex.smola.org/teaching/cmu2013-10-701 10-701x
Optimization • Basic Techniques • Gradient descent • Newton's method • Constrained Convex Optimization • Properties • Lagrange function • Wolfe dual • Batch methods • Distributed subgradient • Bundle methods • Online methods • Unconstrained subgradient • Gradient projections • Parallel optimization
Why
Parameter Estimation
• Maximum a Posteriori with Gaussian Prior
  −log p(θ | X) = (1/2σ²) ‖θ‖² + Σᵢ₌₁ᵐ [g(θ) − ⟨φ(xᵢ), θ⟩] + const.
  (first term: prior; sum: data)
• We have lots of data
• Does not fit on a single machine
• Bandwidth constraints
• May grow in real time
• Regularized Risk Minimization yields similar problems (more on this in a later lecture)
Batch and Online • Batch • Very large dataset available • Require parameter only at the end • optical character recognition • speech recognition • image annotation / categorization • machine translation • Online • Spam filtering • Computational advertising • Content recommendation / collaborative filtering
Many parameters
• 100 million to 1 billion users: personalized content provision - impossible to adjust all parameters heuristically/manually
• 1,000-10,000 computers: cannot exchange all data between machines; distributed optimization, multicore
• Large networks: nontrivial parameter dependence structure
4.1 Unconstrained Problems
Convexity 101
Convexity 101
(figures: a convex set and a convex function)
• Convex set: for x, x′ ∈ X it follows that λx + (1 − λ)x′ ∈ X for λ ∈ [0, 1]
• Convex function: λ f(x) + (1 − λ) f(x′) ≥ f(λx + (1 − λ)x′) for λ ∈ [0, 1]
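The convex-function inequality above can be checked numerically. A minimal sketch; the function f(x) = x² and the test points are illustrative assumptions, not from the slides:

```python
# Check lambda*f(x) + (1-lambda)*f(x') >= f(lambda*x + (1-lambda)*x')
# for the convex function f(x) = x**2 at a few interpolation weights.
f = lambda x: x ** 2

x, x0 = -2.0, 3.0
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    lhs = lam * f(x) + (1 - lam) * f(x0)
    rhs = f(lam * x + (1 - lam) * x0)
    assert lhs >= rhs  # the chord lies above the function
```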
Convexity 101
• Below-set X = {x | f(x) ≤ c} of a convex function is convex:
  f(λx + (1 − λ)x′) ≤ λ f(x) + (1 − λ) f(x′) ≤ c, hence λx + (1 − λ)x′ ∈ X for x, x′ ∈ X
• Convex functions don’t have local minima
  Proof by contradiction: linear interpolation breaks the local-minimum condition
Convexity 101
• Vertex of a convex set: a point which cannot be extrapolated within the convex set, i.e.
  λx + (1 − λ)x′ ∉ X for λ > 1, for all x′ ∈ X
• Convex hull
  co X := { x̄ | x̄ = Σᵢ₌₁ⁿ αᵢxᵢ where n ∈ ℕ, αᵢ ≥ 0 and Σᵢ₌₁ⁿ αᵢ ≤ 1 }
• The convex hull of a set is a convex set (proof trivial)
Convexity 101
• Supremum on the convex hull:
  sup_{x ∈ X} f(x) = sup_{x ∈ co X} f(x)
  Proof by contradiction
• Maximum of a convex function over a convex set is attained at a vertex
  • Assume the maximum lies inside a line segment
  • Then the function cannot be convex
  • Hence the maximum must be at a vertex
Gradient descent
One-dimensional problems
Require: a, b, precision ε
Set A = a, B = b
repeat
  if f′((A + B)/2) > 0 then
    B = (A + B)/2   (solution is on the left)
  else
    A = (A + B)/2
  end if
until (B − A) · min(|f′(A)|, |f′(B)|) ≤ ε
Output: x = (A + B)/2
• Key idea
  • For differentiable f, search for x with f′(x) = 0
  • Interval bisection (the derivative is monotonic)
  • Need log(b − a) − log ε iterations to converge
  • Can be extended to nondifferentiable problems (exploit convexity in an upper bound and keep 5 points)
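The bisection routine from this slide can be sketched directly in Python; the example function and interval below are my own, not from the slides:

```python
# Minimize a differentiable convex f on [a, b] by bisecting on the sign
# of f'(x), following the slide's stopping criterion.
def bisect_min(df, a, b, eps=1e-10):
    A, B = a, b
    while (B - A) * min(abs(df(A)), abs(df(B))) > eps:
        mid = (A + B) / 2
        if df(mid) > 0:      # derivative positive: minimum lies to the left
            B = mid
        else:
            A = mid
    return (A + B) / 2

# Example: f(x) = (x - 1)**2, so f'(x) = 2*(x - 1), minimum at x = 1
x_star = bisect_min(lambda x: 2 * (x - 1), -5.0, 5.0)
```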
Gradient descent
• Key idea
  • The negative gradient is a descent direction
  • Locally, the gradient gives a good approximation of the objective function
• GD with Line Search
  • Get a descent direction
  • Unconstrained line search
  • Exponential convergence for strongly convex objectives
given a starting point x ∈ dom f
repeat
  1. Δx := −∇f(x)
  2. Line search: choose step size t via exact or backtracking line search
  3. Update: x := x + t Δx
until stopping criterion is satisfied
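The loop on this slide can be sketched with a backtracking (Armijo) line search; the test function, starting point, and constants below are illustrative assumptions:

```python
import numpy as np

# Gradient descent with backtracking line search: shrink the step t
# until the sufficient-decrease condition holds, then update x.
def gradient_descent(f, grad, x, alpha=0.3, beta=0.5, tol=1e-8, max_iter=1000):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stopping criterion
            break
        dx = -g                       # descent direction
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta                 # backtrack
        x = x + t * dx
    return x

# Example: f(x) = x0^2 + 10*x1^2, minimum at the origin
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
x_min = gradient_descent(f, grad, np.array([3.0, -2.0]))
```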
Convergence Analysis
• Strongly convex function
  f(y) ≥ f(x) + ⟨y − x, ∂ₓf(x)⟩ + (m/2) ‖y − x‖²
• Progress guarantee (minimum x*)
  f(x) − f(x*) ≥ (m/2) ‖x − x*‖²
• Lower bound on the minimum (set y = x*)
  f(x) − f(x*) ≤ ⟨x − x*, ∂ₓf(x)⟩ − (m/2) ‖x* − x‖²
             ≤ sup_y ⟨x − y, ∂ₓf(x)⟩ − (m/2) ‖y − x‖² = (1/2m) ‖∂ₓf(x)‖²
Convergence Analysis
• Bounded Hessian (with descent direction gₓ = −∂ₓf(x))
  f(y) ≤ f(x) + ⟨y − x, ∂ₓf(x)⟩ + (M/2) ‖y − x‖²
  ⇒ f(x + t gₓ) ≤ f(x) − t ‖gₓ‖² + (M/2) t² ‖gₓ‖²; minimizing over t (at t = 1/M) gives
  f(x + t gₓ) ≤ f(x) − (1/2M) ‖gₓ‖²
• Using strong convexity, ‖gₓ‖² ≥ 2m [f(x) − f(x*)], hence
  f(x + t gₓ) − f(x*) ≤ f(x) − f(x*) − (1/2M) ‖gₓ‖² ≤ [1 − m/M] [f(x) − f(x*)]
• Iteration bound: (M/m) log([f(x) − f(x*)] / ε)
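The per-step contraction factor [1 − m/M] can be verified numerically on a quadratic; the Hessian eigenvalues m = 1, M = 10 and the starting point are illustrative assumptions:

```python
import numpy as np

# Check f(x_next) <= (1 - m/M) * f(x) for gradient steps with t = 1/M
# on f(x) = 0.5 x^T H x, whose minimum is f* = 0 at x* = 0.
m, M = 1.0, 10.0
H = np.diag([m, M])
f = lambda x: 0.5 * x @ H @ x

x = np.array([1.0, 1.0])
for _ in range(20):
    gap_before = f(x)
    x = x - (1.0 / M) * (H @ x)        # gradient step with step size 1/M
    assert f(x) <= (1 - m / M) * gap_before + 1e-12
```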
Newton’s Method
Newton Method
• Convex objective function f
• Nonnegative second derivative: ∂²ₓf(x) ⪰ 0
• Taylor expansion
  f(x + δ) = f(x) + ⟨δ, ∂ₓf(x)⟩ + (1/2) δ⊤ ∂²ₓf(x) δ + O(‖δ‖³)
  (gradient and Hessian terms)
• Minimize the approximation and iterate until converged:
  x ← x − [∂²ₓf(x)]⁻¹ ∂ₓf(x)
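The update on this slide can be sketched as follows; the example objective f(x) = Σᵢ (exp(xᵢ) − xᵢ), which is convex with minimum at x = 0, is my own choice:

```python
import numpy as np

# Newton's method: repeatedly solve the Hessian system instead of
# inverting the Hessian explicitly.
def newton(grad, hess, x, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)   # x <- x - H^{-1} g
    return x

# Gradient and Hessian of f(x) = sum(exp(x_i) - x_i)
grad = lambda x: np.exp(x) - 1
hess = lambda x: np.diag(np.exp(x))
x_star = newton(grad, hess, np.array([1.0, -0.5]))
```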
Convergence Analysis
• There exists a region around the optimum where Newton’s method converges quadratically if f is twice continuously differentiable
• For some region around x*, the gradient is well approximated by the Taylor expansion:
  ‖∂ₓf(x*) − ∂ₓf(x) − ⟨x* − x, ∂²ₓf(x)⟩‖ ≤ γ ‖x* − x‖²
• Expand the Newton update:
  ‖xₙ₊₁ − x*‖ = ‖xₙ − x* − [∂²ₓf(xₙ)]⁻¹ [∂ₓf(xₙ) − ∂ₓf(x*)]‖
             = ‖[∂²ₓf(xₙ)]⁻¹ [∂²ₓf(xₙ)(xₙ − x*) − ∂ₓf(xₙ) + ∂ₓf(x*)]‖
             ≤ γ ‖[∂²ₓf(xₙ)]⁻¹‖ ‖xₙ − x*‖²
Convergence Analysis
• Two convergence regimes
  • As slow as gradient descent outside the region where the Taylor expansion is good:
    ‖∂ₓf(x*) − ∂ₓf(x) − ⟨x* − x, ∂²ₓf(x)⟩‖ ≤ γ ‖x* − x‖²
  • Quadratic convergence once the bound holds:
    ‖xₙ₊₁ − x*‖ ≤ γ ‖[∂²ₓf(xₙ)]⁻¹‖ ‖xₙ − x*‖²
• Newton’s method is affine invariant (proof by chain rule)
See Boyd and Vandenberghe, Chapter 9.5 for much more
Newton method rescales space
(figure: gradient descent in the wrong metric, iterates x⁽⁰⁾, x⁽¹⁾, x⁽²⁾; from Boyd & Vandenberghe)
Newton method rescales space
(figure: locally adaptive metric comparing the steepest-descent step x + Δx_nsd with the Newton step x + Δx_nt; from Boyd & Vandenberghe)
Parallel Newton Method • Good rate of convergence • Few passes through data needed • Parallel aggregation of gradient and Hessian • Gradient requires O(d) data • Hessian requires O(d 2 ) data • Update step is O(d 3 ) & nontrivial to parallelize • Use it only for low dimensional problems
BFGS algorithm Broyden-Fletcher-Goldfarb-Shanno
Basic Idea
• Newton-like method to compute a descent direction
  δᵢ = Bᵢ⁻¹ ∂ₓf(xᵢ)
• Line search on f in that direction:
  xᵢ₊₁ = xᵢ − αᵢ δᵢ
• Update B with a rank-2 matrix:
  Bᵢ₊₁ = Bᵢ + uᵢuᵢ⊤ + vᵢvᵢ⊤
• Require that the quasi-Newton (secant) condition holds:
  Bᵢ₊₁ (xᵢ₊₁ − xᵢ) = ∂ₓf(xᵢ₊₁) − ∂ₓf(xᵢ)
  With gᵢ = ∂ₓf(xᵢ₊₁) − ∂ₓf(xᵢ) this yields
  Bᵢ₊₁ = Bᵢ + gᵢgᵢ⊤ / (αᵢ δᵢ⊤gᵢ) − Bᵢδᵢδᵢ⊤Bᵢ / (δᵢ⊤Bᵢδᵢ)
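The rank-2 update can be checked against the quasi-Newton condition. A sketch written in terms of the step s = xᵢ₊₁ − xᵢ and gradient change y (rather than the slide's δ and α); the quadratic test function and starting matrix are my own assumptions:

```python
import numpy as np

# BFGS rank-2 update of B; by construction the updated matrix
# satisfies the secant condition B_new @ s == y.
def bfgs_update(B, s, y):
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

H = np.array([[3.0, 1.0], [1.0, 2.0]])    # true Hessian of a quadratic
grad = lambda x: H @ x                    # its gradient

x0, x1 = np.array([1.0, 0.0]), np.array([0.2, -0.3])
s = x1 - x0                               # step taken
y = grad(x1) - grad(x0)                   # observed gradient change
B1 = bfgs_update(np.eye(2), s, y)
```

Note the curvature condition y⊤s > 0 must hold for the update to stay positive definite; a line search normally guarantees this.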
Properties
• Simple rank-2 update for B
• Use the matrix inversion lemma to update the inverse directly
• Memory-limited version: L-BFGS
• Use a toolbox if possible (TAO, MATLAB); typically slower if you implement it yourself
• Works well for nonlinear nonconvex objectives (often even for nonsmooth objectives)
4.2 Constrained Convex Problems
Basic Convexity
Constrained Convex Minimization
• Optimization problem
  minimize_x f(x) subject to cᵢ(x) ≤ 0 for all i
  (equality constraints are a special case — why?)
• Common constraints
  • linear inequality constraints: ⟨wᵢ, x⟩ + bᵢ ≤ 0
  • quadratic cone constraints: x⊤Qx + b⊤x ≤ c with Q ⪰ 0
  • semidefinite constraints: M ⪰ 0 or M₀ + Σᵢ xᵢMᵢ ⪰ 0
Example - Support Vectors
(figure: margin hyperplanes {x | ⟨w, x⟩ + b = +1} and {x | ⟨w, x⟩ + b = −1}, decision boundary {x | ⟨w, x⟩ + b = 0}, points labeled yᵢ = ±1)
Note: ⟨w, x₁⟩ + b = +1 and ⟨w, x₂⟩ + b = −1,
hence ⟨w, x₁ − x₂⟩ = 2 and therefore ⟨w/‖w‖, x₁ − x₂⟩ = 2/‖w‖ (the margin)
minimize_{w,b} (1/2)‖w‖² subject to yᵢ[⟨w, xᵢ⟩ + b] ≥ 1
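The margin identity on this slide can be verified numerically; the vectors w, b, x₁, x₂ below are illustrative values chosen to lie on the two margin hyperplanes:

```python
import numpy as np

# Check: for x1 on the +1 hyperplane and x2 on the -1 hyperplane,
# <w/||w||, x1 - x2> = 2/||w||.
w = np.array([2.0, 0.0])
b = -1.0
x1 = np.array([1.0, 0.5])     # <w, x1> + b = +1
x2 = np.array([0.0, 0.5])     # <w, x2> + b = -1

assert np.isclose(w @ x1 + b, 1.0) and np.isclose(w @ x2 + b, -1.0)
margin = (w / np.linalg.norm(w)) @ (x1 - x2)
assert np.isclose(margin, 2.0 / np.linalg.norm(w))
```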
Lagrange Multipliers
• Lagrange function
  L(x, α) := f(x) + Σᵢ₌₁ⁿ αᵢcᵢ(x) where αᵢ ≥ 0
• Saddlepoint condition
  If there exist x* and nonnegative α* such that
  L(x*, α) ≤ L(x*, α*) ≤ L(x, α*) for all x and all α ≥ 0,
  then x* is an optimal solution to the constrained optimization problem
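The saddlepoint condition can be checked on a tiny problem: minimize f(x) = x² subject to c(x) = 1 − x ≤ 0 (i.e. x ≥ 1). The solution x* = 1 with multiplier α* = 2 is my own worked example, not from the slides:

```python
# Lagrangian L(x, a) = x^2 + a*(1 - x); check both saddlepoint inequalities.
L = lambda x, a: x ** 2 + a * (1 - x)
x_star, a_star = 1.0, 2.0

# L(x*, a) <= L(x*, a*) for any a >= 0 (here c(x*) = 0, so both sides equal 1)
for a in [0.0, 1.0, 5.0]:
    assert L(x_star, a) <= L(x_star, a_star) + 1e-12

# L(x*, a*) <= L(x, a*) for any x: L(x, 2) = (x - 1)^2 + 1 is minimized at x* = 1
for x in [-1.0, 0.0, 0.5, 2.0, 3.0]:
    assert L(x_star, a_star) <= L(x, a_star)
```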