

  1. Optimization CMPUT 296: Basics of Machine Learning Textbook §4.1-4.4

  2. Logistics Reminders: • Thought Question 1 due TODAY, September 17, by 11:59pm • To be handed in via eClass • Assignment 1 (due Thursday, September 24) Tutorial: • Python tutorial from yesterday is available on eClass

  3. Recap: Estimators • An estimator is a random variable representing a procedure for estimating the value of an unobserved quantity based on observed data • Concentration inequalities let us bound the probability of a given estimator being at least ϵ away from the estimated quantity • An estimator is consistent if it converges in probability to the estimated quantity

  4. Recap: Sample Complexity • Sample complexity is the number of samples needed to attain a desired error bound ϵ with a desired probability 1 − δ • The mean squared error of an estimator decomposes into bias (squared) and variance • Using a biased estimator can have lower error than an unbiased estimator • Bias the estimator based on some prior information • But this only helps if the prior information is correct • We cannot reduce error by adding in arbitrary bias

  5. Outline 1. Recap & Logistics 2. Optimization by Gradient Descent 3. Multivariate Gradient Descent 4. Adaptive Step Sizes 5. Optimization Properties

  6. Optimization We often want to find the argument w* that minimizes an objective function c: w* = arg min_w c(w) Example: Using linear regression to fit a dataset {(x_i, y_i)}_{i=1}^n • Estimate the targets by ŷ = f(x) = w_0 + w_1 x • Each vector w specifies a particular f • Objective is the total error c(w) = ∑_{i=1}^n (f(x_i) − y_i)² [Figure: a fitted line through the data points, with each error e_i = f(x_i) − y_i drawn as a vertical offset]
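As a concrete sketch, this objective takes a few lines of Python (the toy dataset and weights below are my own illustration, not from the course):

    import numpy as np

    def objective(w, x, y):
        # c(w) = sum_i (f(x_i) - y_i)^2 for the linear model f(x) = w[0] + w[1]*x
        predictions = w[0] + w[1] * x
        return np.sum((predictions - y) ** 2)

    # Hypothetical toy data
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.1, 1.9, 3.2, 3.8])
    print(objective(np.array([1.0, 1.0]), x, y))  # total squared error for w = (1, 1)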

  7. Stationary Points • Recall that every minimum of an everywhere-differentiable function c(w) must* occur at a stationary point: a point at which c′(w) = 0 ✴ Question: What is the exception? • However, not every stationary point is a minimum • Every stationary point is either: • A local minimum • A local maximum • A saddlepoint • The global minimum is either a local minimum or a boundary point [Figure: a curve with its local minima, a saddlepoint, and the global minimum labelled]
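A worked example (my own, not from the slides): for c(w) = w⁴ − 2w², we get c′(w) = 4w³ − 4w = 4w(w − 1)(w + 1), so the stationary points are w ∈ {−1, 0, 1}. Since c″(w) = 12w² − 4, we have c″(0) = −4 < 0 (a local maximum), while c″(±1) = 8 > 0, so w = ±1 are the two global minima.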

  8. Numerical Optimization • So a simple recipe for optimizing a function is to find its stationary points; one of those must be the minimum (as long as the domain is unbounded) • Question: Why don't we always just do that? • We will almost never be able to analytically compute the minimum of the functions that we want to optimize ✴ (Linear regression is an important exception) • Instead, we must try to find the minimum numerically • Main techniques: First-order and second-order gradient descent

  9. Taylor Series Definition: A Taylor series is a way of approximating a function c in a small neighbourhood around a point a: c(w) ≈ c(a) + c′(a)(w − a) + (c″(a)/2)(w − a)² + ⋯ + (c⁽ᵏ⁾(a)/k!)(w − a)ᵏ = c(a) + ∑_{i=1}^k (c⁽ⁱ⁾(a)/i!)(w − a)ⁱ • Intuition: Following the tangent line of the function approximates how it changes • i.e., following a function with the same first derivative • Following a function with the same first and second derivatives is a better approximation; with the same first, second, and third derivatives is even better; etc.
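A quick numerical illustration of that intuition (my own example): first- and second-order Taylor approximations of c(w) = eʷ around a = 0, where every derivative equals eʷ:

    import math

    def c(w):
        return math.exp(w)  # c'(w) = c''(w) = e^w

    a = 0.0
    for w in [0.1, 0.5, 1.0]:
        first = c(a) + c(a) * (w - a)              # matches the first derivative at a
        second = first + c(a) / 2 * (w - a) ** 2   # also matches the second derivative
        print(f"w={w}: true={c(w):.4f}, 1st-order={first:.4f}, 2nd-order={second:.4f}")

The second-order approximation stays closer to the true function as w moves away from a.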

  10. Second-Order Gradient Descent (Newton-Raphson Method) 1. Approximate the target function with a second-order Taylor series around the current guess w_t: ĉ(w) = c(w_t) + c′(w_t)(w − w_t) + (c″(w_t)/2)(w − w_t)² 2. Find the stationary point of the approximation: 0 = d/dw [c(a) + c′(a)(w − a) + (c″(a)/2)(w − a)²] = c′(a) + c″(a)(w − a) ⟺ −c′(a) = c″(a)(w − a) ⟺ w − a = −c′(a)/c″(a) ⟺ w = a − c′(a)/c″(a), giving the update w_{t+1} ← w_t − c′(w_t)/c″(w_t) 3. If the stationary point of the approximation is a (good enough) stationary point of the objective, then stop. Else, goto 1.
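A minimal sketch of this loop in Python (the quartic objective, starting point, and tolerance are my own illustration choices):

    def newton_raphson(c_prime, c_double_prime, w0, tol=1e-8, max_iters=100):
        # Each iteration jumps to the stationary point of the local quadratic approximation
        w = w0
        for _ in range(max_iters):
            w = w - c_prime(w) / c_double_prime(w)
            if abs(c_prime(w)) < tol:  # good-enough stationary point of the objective
                break
        return w

    # Example: c(w) = w^4 - 2w^2, so c'(w) = 4w^3 - 4w and c''(w) = 12w^2 - 4
    print(newton_raphson(lambda w: 4*w**3 - 4*w, lambda w: 12*w**2 - 4, w0=2.0))  # -> 1.0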

  11. (First-Order) Gradient Descent • We can run Newton-Raphson whenever we have access to both the first and second derivatives of the target function • Often we want to only use the first derivative (why?) • First-order gradient descent: Replace the second derivative with a constant 1/η (where η is the step size) in the approximation: ĉ(w) = c(w_t) + c′(w_t)(w − w_t) + (c″(w_t)/2)(w − w_t)² becomes ĉ(w) = c(w_t) + c′(w_t)(w − w_t) + (1/(2η))(w − w_t)² • By exactly the same derivation as before: w_{t+1} ← w_t − η c′(w_t)
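The corresponding update, sketched in Python on the same toy objective (the fixed step size η = 0.05 is an arbitrary choice of mine):

    def gradient_descent(c_prime, w0, eta=0.05, tol=1e-8, max_iters=10000):
        # Step against the derivative, with the constant eta standing in for 1/c''(w)
        w = w0
        for _ in range(max_iters):
            w = w - eta * c_prime(w)
            if abs(c_prime(w)) < tol:
                break
        return w

    # Same objective as before: c(w) = w^4 - 2w^2
    print(gradient_descent(lambda w: 4*w**3 - 4*w, w0=2.0))  # -> 1.0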

  12. Partial Derivatives • So far: Optimizing a univariate function c : ℝ → ℝ • But actually: Optimizing a multivariate function c : ℝᵈ → ℝ • d is typically huge (d ≫ 10,000 is not uncommon) • The first derivative of a multivariate function is a vector of partial derivatives Definition: The partial derivative ∂f/∂x_i of a function f(x_1, …, x_d) at (x_1, …, x_d) with respect to x_i is g′(x_i), where g(y) = f(x_1, …, x_{i−1}, y, x_{i+1}, …, x_d)

  13. Gradients The multivariate analog to a first derivative is called a gradient. Definition: The gradient ∇f(x) of a function f : ℝᵈ → ℝ at x ∈ ℝᵈ is the vector of all the partial derivatives of f at x: ∇f(x) = (∂f/∂x_1(x), ∂f/∂x_2(x), …, ∂f/∂x_d(x))ᵀ
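To make the definition concrete, here is a sketch (my own, not course code) that approximates each partial derivative with a central finite difference, varying one coordinate while holding the rest fixed:

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        # Build the gradient one partial derivative at a time
        grad = np.zeros_like(x)
        for i in range(len(x)):
            e_i = np.zeros_like(x)
            e_i[i] = h
            grad[i] = (f(x + e_i) - f(x - e_i)) / (2 * h)  # derivative of g(y) at y = x_i
        return grad

    # Example: f(x) = x_1^2 + 3*x_2 has gradient (2*x_1, 3)
    f = lambda x: x[0]**2 + 3 * x[1]
    print(numerical_gradient(f, np.array([1.0, 2.0])))  # approx [2.0, 3.0]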

  14. Multivariate Gradient Descent First-order gradient descent for a multivariate function c : ℝᵈ → ℝ is just: w_{t+1} ← w_t − η_t ∇c(w_t) • Notice the subscript t on η: we can choose a different η_t for each iteration • Indeed, for univariate functions, Newton-Raphson can be understood as first-order gradient descent that chooses a step size of η_t = 1/c″(w_t) at each iteration • Choosing a good step size is crucial to efficiently using first-order gradient descent
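Tying this back to the earlier linear regression example (toy data, step size, and iteration count are my own illustration choices):

    import numpy as np

    def grad_c(w, x, y):
        # Gradient of c(w) = sum_i (w[0] + w[1]*x_i - y_i)^2
        residuals = w[0] + w[1] * x - y
        return np.array([2 * np.sum(residuals), 2 * np.sum(residuals * x)])

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.1, 1.9, 3.2, 3.8])
    w = np.zeros(2)
    eta = 0.01  # fixed step size, for simplicity
    for t in range(5000):
        w = w - eta * grad_c(w, x, y)
    print(w)  # fitted intercept w_0 and slope w_1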

  15. Adaptive Step Sizes • If the step size is too small, gradient descent will "work", but take forever • Too big, and we can overshoot the optimum • Ideally, we would choose η_t = arg min_{η ∈ ℝ⁺} c(w_t − η∇c(w_t)) • But that's another optimization! • There are some heuristics that we can use to adaptively guess good values for η_t [Figure: (a) step size too small (b) step size too big (c) adaptive step size]

  16. Line Search Intuition: • Big step sizes are better so long as they don't overshoot • Try a big step size! If it increases the objective, try a smaller one • Keep trying smaller ones until you decrease the objective; then start iteration t + 1 from η_max again A simple heuristic: line search 1. Try some largest-reasonable step size η_t⁽⁰⁾ = η_max 2. Is c(w_t − η_t⁽ˢ⁾ ∇c(w_t)) < c(w_t)? If yes, w_{t+1} ← w_t − η_t⁽ˢ⁾ ∇c(w_t) 3. Otherwise, try η_t⁽ˢ⁺¹⁾ = τ η_t⁽ˢ⁾ (for τ < 1) and goto 2 • Typically τ ∈ [0.5, 0.9]
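A sketch of one gradient step with this heuristic (η_max = 1.0 and τ = 0.5 are arbitrary choices of mine within the typical range):

    import numpy as np

    def line_search_step(c, grad_c, w, eta_max=1.0, tau=0.5, max_tries=50):
        # Shrink the step size by tau until the objective actually decreases
        g = grad_c(w)
        eta = eta_max
        for _ in range(max_tries):
            if c(w - eta * g) < c(w):
                return w - eta * g  # accepted: this step decreases the objective
            eta = tau * eta
        return w  # no acceptable step found (gradient is near zero)

    # Example: c(w) = ||w||^2, with gradient 2w
    c = lambda w: np.sum(w ** 2)
    grad_c = lambda w: 2 * w
    w = np.array([3.0, -2.0])
    for t in range(20):
        w = line_search_step(c, grad_c, w)
    print(w)  # approaches the minimum at the origin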

  17. Optimization Properties 1. Maximizing c(w) is the same as minimizing −c(w): arg max_w c(w) = arg min_w −c(w) 2. Convex functions have a global minimum at every stationary point: c is convex ⟺ c(tw_1 + (1 − t)w_2) ≤ tc(w_1) + (1 − t)c(w_2) for all w_1, w_2 and all t ∈ [0, 1] 3. Identifiability: Sometimes we want the actual global minimum; other times we want a good-enough minimizer (i.e., a local minimum might be OK) 4. Equivalence under constant shifts: Adding, subtracting, or multiplying by a positive constant does not change the minimizer of a function: ∀k ∈ ℝ⁺, arg min_w c(w) = arg min_w c(w) + k = arg min_w c(w) − k = arg min_w kc(w)

  18. Summary • We often want to find the argument w* that minimizes an objective function c: w* = arg min_w c(w) • Every interior minimum is a stationary point, so check the stationary points • Stationary points are usually identified numerically • Typically, by gradient descent • Choosing the step size is important for efficiency and correctness • Common approach: Adaptive step size • E.g., by line search
