Introduction to Machine Learning 5. Optimization Geoff Gordon and Alex Smola Carnegie Mellon University http://alex.smola.org/teaching/cmu2013-10-701 10-701x
Optimization • Basic Techniques • Gradient descent • Newton's method • Constrained Convex Optimization • Properties • Lagrange function • Wolfe dual • Batch methods • Distributed subgradient • Bundle methods • Online methods • Unconstrained subgradient • Gradient projections • Parallel optimization
Why
Parameter Estimation
• Maximum a Posteriori with Gaussian Prior
  −log p(θ | X) = (1/2σ²) ‖θ‖² + Σᵢ₌₁ᵐ [g(θ) − ⟨φ(xᵢ), θ⟩] + const.
  (first term: prior; sum: data)
• We have lots of data
• Does not fit on a single machine
• Bandwidth constraints
• May grow in real time
• Regularized Risk Minimization yields similar problems (more on this in a later lecture)
Batch and Online • Batch • Very large dataset available • Require parameter only at the end • optical character recognition • speech recognition • image annotation / categorization • machine translation • Online • Spam filtering • Computational advertising • Content recommendation / collaborative filtering
Many parameters
• 100 million to 1 billion users: personalized content provision - impossible to adjust all parameters heuristically/manually
• 1,000-10,000 computers: cannot exchange all data between machines; distributed optimization, multicore
• Large networks: nontrivial parameter dependence structure
4.1 Unconstrained Problems
Convexity 101
Convexity 101
(figures: a convex set and a convex function)
• Convex set: for x, x′ ∈ X it follows that λx + (1 − λ)x′ ∈ X for λ ∈ [0, 1]
• Convex function: λ f(x) + (1 − λ) f(x′) ≥ f(λx + (1 − λ)x′) for λ ∈ [0, 1]
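The convex-function inequality above can be checked numerically. A minimal sketch; the function f(x) = x² and the test points are illustrative assumptions, not from the slides:

```python
# Check lambda*f(x) + (1-lambda)*f(x') >= f(lambda*x + (1-lambda)*x')
# for the convex function f(x) = x**2 at a few interpolation weights.
f = lambda x: x ** 2

x, x0 = -2.0, 3.0
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    lhs = lam * f(x) + (1 - lam) * f(x0)
    rhs = f(lam * x + (1 - lam) * x0)
    assert lhs >= rhs  # the chord lies above the function
```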
Convexity 101
• Below-set X = {x | f(x) ≤ c} of a convex function is convex:
  f(λx + (1 − λ)x′) ≤ λ f(x) + (1 − λ) f(x′) ≤ c, hence λx + (1 − λ)x′ ∈ X for x, x′ ∈ X
• Convex functions don’t have local minima
  Proof by contradiction: linear interpolation breaks the local-minimum condition
Convexity 101
• Vertex of a convex set: a point which cannot be extrapolated within the convex set, i.e.
  λx + (1 − λ)x′ ∉ X for λ > 1, for all x′ ∈ X
• Convex hull
  co X := { x̄ | x̄ = Σᵢ₌₁ⁿ αᵢxᵢ where n ∈ ℕ, αᵢ ≥ 0 and Σᵢ₌₁ⁿ αᵢ ≤ 1 }
• The convex hull of a set is a convex set (proof trivial)
Convexity 101
• Supremum on the convex hull:
  sup_{x ∈ X} f(x) = sup_{x ∈ co X} f(x)
  Proof by contradiction
• Maximum of a convex function over a convex set is attained at a vertex
  • Assume the maximum lies inside a line segment
  • Then the function cannot be convex
  • Hence the maximum must be at a vertex
Gradient descent
One-dimensional problems
Require: a, b, precision ε
Set A = a, B = b
repeat
  if f′((A + B)/2) > 0 then
    B = (A + B)/2   (solution is on the left)
  else
    A = (A + B)/2
  end if
until (B − A) · min(|f′(A)|, |f′(B)|) ≤ ε
Output: x = (A + B)/2
• Key idea
  • For differentiable f, search for x with f′(x) = 0
  • Interval bisection (the derivative is monotonic)
  • Need log(b − a) − log ε iterations to converge
  • Can be extended to nondifferentiable problems (exploit convexity in an upper bound and keep 5 points)
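The bisection routine from this slide can be sketched directly in Python; the example function and interval below are my own, not from the slides:

```python
# Minimize a differentiable convex f on [a, b] by bisecting on the sign
# of f'(x), following the slide's stopping criterion.
def bisect_min(df, a, b, eps=1e-10):
    A, B = a, b
    while (B - A) * min(abs(df(A)), abs(df(B))) > eps:
        mid = (A + B) / 2
        if df(mid) > 0:      # derivative positive: minimum lies to the left
            B = mid
        else:
            A = mid
    return (A + B) / 2

# Example: f(x) = (x - 1)**2, so f'(x) = 2*(x - 1), minimum at x = 1
x_star = bisect_min(lambda x: 2 * (x - 1), -5.0, 5.0)
```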
Gradient descent
• Key idea
  • The negative gradient is a descent direction
  • Locally, the gradient gives a good approximation of the objective function
• GD with Line Search
  • Get a descent direction
  • Unconstrained line search
  • Exponential convergence for strongly convex objectives
given a starting point x ∈ dom f
repeat
  1. Δx := −∇f(x)
  2. Line search: choose step size t via exact or backtracking line search
  3. Update: x := x + t Δx
until stopping criterion is satisfied
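The loop on this slide can be sketched with a backtracking (Armijo) line search; the test function, starting point, and constants below are illustrative assumptions:

```python
import numpy as np

# Gradient descent with backtracking line search: shrink the step t
# until the sufficient-decrease condition holds, then update x.
def gradient_descent(f, grad, x, alpha=0.3, beta=0.5, tol=1e-8, max_iter=1000):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stopping criterion
            break
        dx = -g                       # descent direction
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta                 # backtrack
        x = x + t * dx
    return x

# Example: f(x) = x0^2 + 10*x1^2, minimum at the origin
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
x_min = gradient_descent(f, grad, np.array([3.0, -2.0]))
```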
Convergence Analysis
• Strongly convex function
  f(y) ≥ f(x) + ⟨y − x, ∂ₓf(x)⟩ + (m/2) ‖y − x‖²
• Progress guarantee (minimum x*)
  f(x) − f(x*) ≥ (m/2) ‖x − x*‖²
• Lower bound on the minimum (set y = x*)
  f(x) − f(x*) ≤ ⟨x − x*, ∂ₓf(x)⟩ − (m/2) ‖x* − x‖²
             ≤ sup_y ⟨x − y, ∂ₓf(x)⟩ − (m/2) ‖y − x‖² = (1/2m) ‖∂ₓf(x)‖²
Convergence Analysis
• Bounded Hessian (with descent direction gₓ = −∂ₓf(x))
  f(y) ≤ f(x) + ⟨y − x, ∂ₓf(x)⟩ + (M/2) ‖y − x‖²
  ⇒ f(x + t gₓ) ≤ f(x) − t ‖gₓ‖² + (M/2) t² ‖gₓ‖²; minimizing over t (at t = 1/M) gives
  f(x + t gₓ) ≤ f(x) − (1/2M) ‖gₓ‖²
• Using strong convexity, ‖gₓ‖² ≥ 2m [f(x) − f(x*)], hence
  f(x + t gₓ) − f(x*) ≤ f(x) − f(x*) − (1/2M) ‖gₓ‖² ≤ [1 − m/M] [f(x) − f(x*)]
• Iteration bound: (M/m) log([f(x) − f(x*)] / ε)
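The per-step contraction factor [1 − m/M] can be verified numerically on a quadratic; the Hessian eigenvalues m = 1, M = 10 and the starting point are illustrative assumptions:

```python
import numpy as np

# Check f(x_next) <= (1 - m/M) * f(x) for gradient steps with t = 1/M
# on f(x) = 0.5 x^T H x, whose minimum is f* = 0 at x* = 0.
m, M = 1.0, 10.0
H = np.diag([m, M])
f = lambda x: 0.5 * x @ H @ x

x = np.array([1.0, 1.0])
for _ in range(20):
    gap_before = f(x)
    x = x - (1.0 / M) * (H @ x)        # gradient step with step size 1/M
    assert f(x) <= (1 - m / M) * gap_before + 1e-12
```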
Newton’s Method
Newton Method
• Convex objective function f
• Nonnegative second derivative: ∂²ₓf(x) ⪰ 0
• Taylor expansion
  f(x + δ) = f(x) + ⟨δ, ∂ₓf(x)⟩ + (1/2) δ⊤ ∂²ₓf(x) δ + O(‖δ‖³)
  (gradient and Hessian terms)
• Minimize the approximation and iterate until converged:
  x ← x − [∂²ₓf(x)]⁻¹ ∂ₓf(x)
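The update on this slide can be sketched as follows; the example objective f(x) = Σᵢ (exp(xᵢ) − xᵢ), which is convex with minimum at x = 0, is my own choice:

```python
import numpy as np

# Newton's method: repeatedly solve the Hessian system instead of
# inverting the Hessian explicitly.
def newton(grad, hess, x, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)   # x <- x - H^{-1} g
    return x

# Gradient and Hessian of f(x) = sum(exp(x_i) - x_i)
grad = lambda x: np.exp(x) - 1
hess = lambda x: np.diag(np.exp(x))
x_star = newton(grad, hess, np.array([1.0, -0.5]))
```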
Convergence Analysis
• There exists a region around the optimum where Newton’s method converges quadratically if f is twice continuously differentiable
• For some region around x*, the gradient is well approximated by the Taylor expansion:
  ‖∂ₓf(x*) − ∂ₓf(x) − ⟨x* − x, ∂²ₓf(x)⟩‖ ≤ γ ‖x* − x‖²
• Expand the Newton update:
  ‖xₙ₊₁ − x*‖ = ‖xₙ − x* − [∂²ₓf(xₙ)]⁻¹ [∂ₓf(xₙ) − ∂ₓf(x*)]‖
             = ‖[∂²ₓf(xₙ)]⁻¹ [∂²ₓf(xₙ)(xₙ − x*) − ∂ₓf(xₙ) + ∂ₓf(x*)]‖
             ≤ γ ‖[∂²ₓf(xₙ)]⁻¹‖ ‖xₙ − x*‖²
Convergence Analysis
• Two convergence regimes
  • As slow as gradient descent outside the region where the Taylor expansion is good:
    ‖∂ₓf(x*) − ∂ₓf(x) − ⟨x* − x, ∂²ₓf(x)⟩‖ ≤ γ ‖x* − x‖²
  • Quadratic convergence once the bound holds:
    ‖xₙ₊₁ − x*‖ ≤ γ ‖[∂²ₓf(xₙ)]⁻¹‖ ‖xₙ − x*‖²
• Newton’s method is affine invariant (proof by chain rule)
See Boyd and Vandenberghe, Chapter 9.5 for much more
Newton method rescales space
(figure: gradient descent in the wrong metric, iterates x⁽⁰⁾, x⁽¹⁾, x⁽²⁾; from Boyd & Vandenberghe)
Newton method rescales space
(figure: locally adaptive metric comparing the steepest-descent step x + Δx_nsd with the Newton step x + Δx_nt; from Boyd & Vandenberghe)
Parallel Newton Method • Good rate of convergence • Few passes through data needed • Parallel aggregation of gradient and Hessian • Gradient requires O(d) data • Hessian requires O(d 2 ) data • Update step is O(d 3 ) & nontrivial to parallelize • Use it only for low dimensional problems
BFGS algorithm Broyden-Fletcher-Goldfarb-Shanno
Basic Idea
• Newton-like method to compute a descent direction
  δᵢ = Bᵢ⁻¹ ∂ₓf(xᵢ)
• Line search on f in that direction:
  xᵢ₊₁ = xᵢ − αᵢ δᵢ
• Update B with a rank-2 matrix:
  Bᵢ₊₁ = Bᵢ + uᵢuᵢ⊤ + vᵢvᵢ⊤
• Require that the quasi-Newton (secant) condition holds:
  Bᵢ₊₁ (xᵢ₊₁ − xᵢ) = ∂ₓf(xᵢ₊₁) − ∂ₓf(xᵢ)
  With gᵢ = ∂ₓf(xᵢ₊₁) − ∂ₓf(xᵢ) this yields
  Bᵢ₊₁ = Bᵢ + gᵢgᵢ⊤ / (αᵢ δᵢ⊤gᵢ) − Bᵢδᵢδᵢ⊤Bᵢ / (δᵢ⊤Bᵢδᵢ)
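The rank-2 update can be checked against the quasi-Newton condition. A sketch written in terms of the step s = xᵢ₊₁ − xᵢ and gradient change y (rather than the slide's δ and α); the quadratic test function and starting matrix are my own assumptions:

```python
import numpy as np

# BFGS rank-2 update of B; by construction the updated matrix
# satisfies the secant condition B_new @ s == y.
def bfgs_update(B, s, y):
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

H = np.array([[3.0, 1.0], [1.0, 2.0]])    # true Hessian of a quadratic
grad = lambda x: H @ x                    # its gradient

x0, x1 = np.array([1.0, 0.0]), np.array([0.2, -0.3])
s = x1 - x0                               # step taken
y = grad(x1) - grad(x0)                   # observed gradient change
B1 = bfgs_update(np.eye(2), s, y)
```

Note the curvature condition y⊤s > 0 must hold for the update to stay positive definite; a line search normally guarantees this.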
Properties
• Simple rank-2 update for B
• Use the matrix inversion lemma to update the inverse directly
• Memory-limited version: L-BFGS
• Use a toolbox if possible (TAO, MATLAB); typically slower if you implement it yourself
• Works well for nonlinear nonconvex objectives (often even for nonsmooth objectives)
4.2 Constrained Convex Problems
Basic Convexity
Constrained Convex Minimization
• Optimization problem
  minimize_x f(x) subject to cᵢ(x) ≤ 0 for all i
  (equality constraints are a special case — why?)
• Common constraints
  • linear inequality constraints: ⟨wᵢ, x⟩ + bᵢ ≤ 0
  • quadratic cone constraints: x⊤Qx + b⊤x ≤ c with Q ⪰ 0
  • semidefinite constraints: M ⪰ 0 or M₀ + Σᵢ xᵢMᵢ ⪰ 0
Example - Support Vectors
(figure: margin hyperplanes {x | ⟨w, x⟩ + b = +1} and {x | ⟨w, x⟩ + b = −1}, decision boundary {x | ⟨w, x⟩ + b = 0}, points labeled yᵢ = ±1)
Note: ⟨w, x₁⟩ + b = +1 and ⟨w, x₂⟩ + b = −1,
hence ⟨w, x₁ − x₂⟩ = 2 and therefore ⟨w/‖w‖, x₁ − x₂⟩ = 2/‖w‖ (the margin)
minimize_{w,b} (1/2)‖w‖² subject to yᵢ[⟨w, xᵢ⟩ + b] ≥ 1
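The margin identity on this slide can be verified numerically; the vectors w, b, x₁, x₂ below are illustrative values chosen to lie on the two margin hyperplanes:

```python
import numpy as np

# Check: for x1 on the +1 hyperplane and x2 on the -1 hyperplane,
# <w/||w||, x1 - x2> = 2/||w||.
w = np.array([2.0, 0.0])
b = -1.0
x1 = np.array([1.0, 0.5])     # <w, x1> + b = +1
x2 = np.array([0.0, 0.5])     # <w, x2> + b = -1

assert np.isclose(w @ x1 + b, 1.0) and np.isclose(w @ x2 + b, -1.0)
margin = (w / np.linalg.norm(w)) @ (x1 - x2)
assert np.isclose(margin, 2.0 / np.linalg.norm(w))
```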
Lagrange Multipliers
• Lagrange function
  L(x, α) := f(x) + Σᵢ₌₁ⁿ αᵢcᵢ(x) where αᵢ ≥ 0
• Saddlepoint condition
  If there exist x* and nonnegative α* such that
  L(x*, α) ≤ L(x*, α*) ≤ L(x, α*) for all x and all α ≥ 0,
  then x* is an optimal solution to the constrained optimization problem
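The saddlepoint condition can be checked on a tiny problem: minimize f(x) = x² subject to c(x) = 1 − x ≤ 0 (i.e. x ≥ 1). The solution x* = 1 with multiplier α* = 2 is my own worked example, not from the slides:

```python
# Lagrangian L(x, a) = x^2 + a*(1 - x); check both saddlepoint inequalities.
L = lambda x, a: x ** 2 + a * (1 - x)
x_star, a_star = 1.0, 2.0

# L(x*, a) <= L(x*, a*) for any a >= 0 (here c(x*) = 0, so both sides equal 1)
for a in [0.0, 1.0, 5.0]:
    assert L(x_star, a) <= L(x_star, a_star) + 1e-12

# L(x*, a*) <= L(x, a*) for any x: L(x, 2) = (x - 1)^2 + 1 is minimized at x* = 1
for x in [-1.0, 0.0, 0.5, 2.0, 3.0]:
    assert L(x_star, a_star) <= L(x, a_star)
```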