15-780: Optimization
J. Zico Kolter
March 14-16, 2015
Outline
- Introduction to optimization
- Types of optimization problems
- Unconstrained optimization
- Constrained optimization
- Practical optimization
Beyond linear programming

Linear programming:

minimize_x   c^T x
subject to   Gx ≤ h
             Ax = b

General (continuous) optimization:

minimize_x   f(x)
subject to   g_i(x) ≤ 0,  i = 1, ..., m
             h_i(x) = 0,  i = 1, ..., p

where x ∈ R^n is the optimization variable, f : R^n → R is the objective function, the g_i : R^n → R are inequality constraints, and the h_i : R^n → R are equality constraints
Example: image deblurring

Given a corrupted m × n image represented as a vector y ∈ R^{m·n}, find x ∈ R^{m·n} by solving the optimization problem

minimize_x   ‖K ∗ x − y‖_2^2 + λ ( ∑_{i=1}^{n−1} |x_{mi} − x_{m(i+1)}| + ∑_{i=1}^{m−1} |x_{ni} − x_{n(i+1)}| )

where K ∗ denotes 2D convolution with some filter K

Figures (from Wang et al., 2009): original image, blurred image, reconstruction
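As a hedged illustration (not the paper's actual solver), the objective above can be evaluated directly in NumPy. The helper name `tv_objective` is my own, and the kernel is assumed to be odd-sized so a "same"-size convolution is well defined.

```python
import numpy as np

def tv_objective(x, y, K, lam):
    """Value of ||K * x - y||_2^2 + lam * TV(x) for a 2D image x, where
    TV(x) is the sum of absolute differences of adjacent pixels.
    Assumes an odd-sized kernel K ("same"-size 2D convolution)."""
    m, n = x.shape
    km, kn = K.shape
    pad = np.pad(x, ((km // 2, km // 2), (kn // 2, kn // 2)))
    Kx = np.empty_like(x, dtype=float)
    for i in range(m):
        for j in range(n):
            # 2D convolution flips the kernel before the sliding dot product
            Kx[i, j] = np.sum(K[::-1, ::-1] * pad[i:i + km, j:j + kn])
    data_term = np.sum((Kx - y) ** 2)
    tv_term = np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()
    return data_term + lam * tv_term
```

With the identity kernel and y = x, the data term vanishes and only the total-variation penalty remains, which makes the two terms easy to check separately.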
Example: machine learning

Virtually all machine learning algorithms can be expressed as minimizing a loss function over observed data

Given inputs x^{(i)} ∈ X, desired outputs y^{(i)} ∈ Y, a hypothesis function h_θ : X → Y defined by parameters θ ∈ R^n, and a loss function ℓ : Y × Y → R_+

Machine learning algorithms solve the optimization problem

minimize_θ   ∑_{i=1}^{m} ℓ( h_θ(x^{(i)}), y^{(i)} )
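As a concrete instance of this template (an illustrative sketch, not from the slides): least-squares regression takes h_θ(x) = θ^T x with squared loss, and the resulting problem has a closed-form minimizer via the normal equations.

```python
import numpy as np

# Least-squares regression as an instance of the generic setup:
# hypothesis h_theta(x) = theta^T x, loss l(h, y) = (h - y)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # stacked inputs x^(i)
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                      # noiseless outputs y^(i)

# Minimizing sum_i (theta^T x^(i) - y^(i))^2 reduces to the
# normal equations X^T X theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With noiseless data, the recovered parameters match the true ones up to numerical precision.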
Example: robot trajectory planning

Robot state x_t and control inputs u_t

minimize_{x_{1:T}, u_{1:T−1}}   ∑_{t=1}^{T−1} ( ‖x_t − x_{t+1}‖_2^2 + ‖u_t‖_2^2 )
subject to   x_{t+1} = f_dynamics(x_t, u_t)   (robot dynamics)
             f_collision(x_t) ≥ 0.1           (avoid collisions)
             x_1 = x_init,  x_T = x_goal

Figure from (Schulman et al., 2013)
Many other applications

We've already seen many applications (e.g., any linear programming setting is also an example of continuous optimization), but there are many other nonlinear problems

Applications in control, machine learning, finance, forecasting, signal processing, communications, structural design, and many others

The move to optimization-based formalisms has been one of the primary trends in AI in the past 15+ years
Classes of optimization problems

Many different classifications for (continuous) optimization problems (linear programming, nonlinear programming, quadratic programming, semidefinite programming, second order cone programming, geometric programming, etc.) can get overwhelming

We focus on three distinctions: unconstrained vs. constrained, convex vs. nonconvex, and (less so) smooth vs. nonsmooth
Unconstrained vs. constrained optimization

minimize_x  f(x)        vs.        minimize_x  f(x)
                                   subject to  g_i(x) ≤ 0,  i = 1, ..., m
                                               h_i(x) = 0,  i = 1, ..., p

In unconstrained optimization, every point x ∈ R^n is “feasible”, so the singular focus is on finding a low value of f(x)

In constrained optimization (where constraints truly need to hold exactly) it may be difficult to find an initial feasible point, and to maintain feasibility during optimization

Typically leads to different classes of algorithms
Convex vs. nonconvex optimization

Originally researchers distinguished between linear (easy) and nonlinear (hard) optimization problems

But in the 80s and 90s, it became clear that this wasn't the right line: the real distinction is between convex (easy) and nonconvex (hard) problems

The optimization problem

minimize_x   f(x)
subject to   g_i(x) ≤ 0,  i = 1, ..., m
             h_i(x) = 0,  i = 1, ..., p

is convex if f and the g_i's are all convex functions and the h_i's are affine functions
Convex functions

A function f : R^n → R is convex if, for any x, y ∈ R^n and θ ∈ [0, 1],

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)

f is concave if −f is convex

f is affine if it is both convex and concave; it must then take the form f(x) = a^T x + b for a ∈ R^n, b ∈ R
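One way to build intuition for the definition (an illustrative sketch, not from the slides) is to spot-check the inequality numerically on sampled points; passing such a check is only a necessary condition, not a proof of convexity.

```python
import numpy as np

def convex_on_samples(f, points, thetas):
    """Spot-check f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) over all
    sampled pairs; a necessary condition only, not a proof."""
    for x in points:
        for y in points:
            for t in thetas:
                if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-12:
                    return False
    return True

pts = np.linspace(-3.0, 3.0, 13)
ts = np.linspace(0.0, 1.0, 11)
```

For example, x² passes the check everywhere, while sin(x) fails it (sin is neither convex nor concave on [-3, 3]).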
Why is convex optimization easy?

(Figure: a convex function and a nonconvex function)

Convex functions “curve upward everywhere”, and convex constraints define a convex set (for any feasible x, y, the point θx + (1 − θ)y is also feasible for θ ∈ [0, 1])

Together, these properties imply that any local optimum must also be a global optimum

Thus, for convex problems we can use local methods to find the globally optimal solution (cf. linear programming vs. integer programming)
Smooth vs. nonsmooth optimization

(Figure: a smooth function and a nonsmooth function)

In optimization, we care about smoothness in terms of whether functions are (first or second order) continuously differentiable

A function f is first order continuously differentiable if its derivative f′ exists and is continuous; the Lipschitz constant of its derivative is a constant L such that for all x, y

|f′(x) − f′(y)| ≤ L|x − y|

In the next section, we will use first and second derivative information to optimize functions, so whether or not these exist affects which methods we can apply
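As a small numerical illustration (my own example, not from the slides): for f(x) = sin(x) the derivative is cos(x), and the mean value theorem gives |cos(x) − cos(y)| ≤ |x − y|, so L = 1 is a Lipschitz constant of the derivative. The ratio can be spot-checked on a grid.

```python
import numpy as np

# Estimate the Lipschitz constant of f'(x) = cos(x) by taking the
# largest difference quotient |cos(a) - cos(b)| / |a - b| on a grid.
xs = np.linspace(-5.0, 5.0, 201)
L_est = max(abs(np.cos(a) - np.cos(b)) / abs(a - b)
            for a in xs for b in xs if a != b)
```

The estimate approaches 1 from below (the supremum is attained in the limit of nearby points around x = π/2).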
Solving optimization problems

Starting with the unconstrained, smooth, one dimensional case

minimize_x  f(x)

To find a minimum point x⋆, we can look at the derivative of the function f′(x): any location where f′(x) = 0 will be a “flat” point in the function

For convex problems, this is guaranteed to be a minimum (instead of a maximum)
The gradient

For a multivariate function f : R^n → R, its gradient is an n-dimensional vector containing the partial derivatives with respect to each dimension:

∇_x f(x) = [ ∂f(x)/∂x_1, ..., ∂f(x)/∂x_n ]^T

For continuously differentiable f and unconstrained optimization, an optimal point must have ∇_x f(x⋆) = 0
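A standard sanity check when working with gradients (a sketch; the function names are my own): compare an analytic gradient against central finite differences.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Approximate the gradient of f at x by central differences."""
    g = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# The quadratic example used later in the lecture, and its analytic gradient
f = lambda x: 2 * x[0] ** 2 + x[1] ** 2 + x[0] * x[1] - 6 * x[0] - 5 * x[1]
grad_f = lambda x: np.array([4 * x[0] + x[1] - 6, x[0] + 2 * x[1] - 5])
```

The two should agree closely at any test point; central differences are exact for quadratics up to floating-point rounding.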
Properties of the gradient

The gradient defines the first order Taylor approximation to the function f around a point x_0:

f(x) ≈ f(x_0) + ∇_x f(x_0)^T (x − x_0)

For convex f, the first order Taylor approximation is always an underestimate:

f(x) ≥ f(x_0) + ∇_x f(x_0)^T (x − x_0)
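The underestimate property can be spot-checked numerically for a convex quadratic (an illustrative sketch using the example function from later in the lecture):

```python
import numpy as np

# Spot-check that the first-order Taylor approximation at x0 underestimates
# the convex quadratic f at randomly sampled points.
f = lambda x: 2 * x[0] ** 2 + x[1] ** 2 + x[0] * x[1] - 6 * x[0] - 5 * x[1]
grad_f = lambda x: np.array([4 * x[0] + x[1] - 6, x[0] + 2 * x[1] - 5])

x0 = np.array([1.0, 0.0])
rng = np.random.default_rng(1)
underestimates = all(
    f(x) >= f(x0) + grad_f(x0) @ (x - x0) - 1e-9
    for x in rng.normal(size=(100, 2))
)
```

For this quadratic the gap is exactly ½(x − x0)^T H (x − x0) with a positive definite Hessian H, so the inequality holds at every point, not just the sampled ones.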
Some common gradients

For f(x) = a^T x, the gradient is given by ∇_x f(x) = a, since

∂f(x)/∂x_i = ∂/∂x_i ∑_{j=1}^{n} a_j x_j = a_i

For f(x) = ½ x^T Q x, the gradient is given by ∇_x f(x) = ½(Q + Q^T)x, or just ∇_x f(x) = Qx if Q is symmetric (Q = Q^T)
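The quadratic-form identity is easy to verify numerically with a deliberately non-symmetric Q (an illustrative check, not from the slides):

```python
import numpy as np

# Verify that the gradient of f(x) = 0.5 x^T Q x equals 0.5 (Q + Q^T) x,
# using central differences and a deliberately non-symmetric Q.
rng = np.random.default_rng(2)
Q = rng.normal(size=(4, 4))
x = rng.normal(size=4)
f = lambda v: 0.5 * v @ Q @ v

eps = 1e-6
numeric = np.array([
    (f(x + eps * np.eye(4)[i]) - f(x - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])
analytic = 0.5 * (Q + Q.T) @ x
```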
How do we find ∇_x f(x) = 0?

Direct solution: In some cases, it is possible to analytically compute the x⋆ such that ∇_x f(x⋆) = 0

Example: f(x) = 2x_1^2 + x_2^2 + x_1 x_2 − 6x_1 − 5x_2

∇_x f(x) = [ 4x_1 + x_2 − 6 ;  x_1 + 2x_2 − 5 ]

⇒ x⋆ = [ 4  1 ;  1  2 ]^{−1} [ 6 ;  5 ] = [ 1 ;  2 ]

Iterative methods: more commonly, the condition that the gradient equal zero will not have an analytical solution, requiring iterative methods
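Since the gradient of this quadratic is linear in x, the direct solution is just a 2×2 linear solve (a quick sketch):

```python
import numpy as np

# Setting grad f(x) = 0 for f(x) = 2 x1^2 + x2^2 + x1 x2 - 6 x1 - 5 x2
# gives the linear system [4 1; 1 2] x = [6; 5].
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 5.0])
x_star = np.linalg.solve(A, b)   # x* = (1, 2)
```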
Gradient descent

The gradient doesn't just give us the optimality condition, it also points in the direction of “steepest ascent” for the function f

This motivates the gradient descent algorithm, which repeatedly takes steps in the direction of the negative gradient:

Repeat:  x ← x − α ∇_x f(x)

for some step size α > 0
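The update above can be sketched as a minimal implementation, applied to the quadratic example (the helper name is my own):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, iters):
    """Repeatedly step in the direction of the negative gradient."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad_f(x)
    return x

# Gradient of f(x) = 2 x1^2 + x2^2 + x1 x2 - 6 x1 - 5 x2
grad_f = lambda x: np.array([4 * x[0] + x[1] - 6, x[0] + 2 * x[1] - 5])
x_opt = gradient_descent(grad_f, [0.0, 0.0], alpha=0.2, iters=200)
```

With this step size the iterates converge linearly to the minimizer (1, 2).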
Figure: 100 iterations of gradient descent on the function f(x) = 2x_1^2 + x_2^2 + x_1 x_2 − 6x_1 − 5x_2 (left: iterates in the (x_1, x_2) plane; right: f − f⋆ versus iteration, log scale)
How do we choose step size α?

The choice of α plays a big role in the convergence of the algorithm

Figure: gradient descent iterates in the (x_1, x_2) plane for α = 0.05 and α = 0.42
Figure: convergence of gradient descent (f − f⋆ versus iteration, log scale) for step sizes α = 0.2, α = 0.42, and α = 0.05
If we know the gradient is Lipschitz continuous with constant L, the step size α = 1/L is good in theory and practice

But what if we don't know the Lipschitz constant, or the derivative has unbounded Lipschitz constant?

Idea #1 (“exact” line search): choose α to minimize f(x_0 − α∇f(x_0)) for the current iterate x_0; this is just another optimization problem, but with a single variable α

Idea #2 (backtracking line search): try a few α's on each iteration until we get one that causes a suitable decrease in the function
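Idea #2 can be sketched as follows, using the Armijo sufficient-decrease condition (one common formulation; the constants and helper name are my own choices, not from the slides):

```python
import numpy as np

def backtracking_gd(f, grad_f, x0, alpha0=1.0, beta=0.5, c=1e-4, iters=100):
    """Gradient descent with backtracking line search: start from alpha0
    and shrink alpha by beta until the Armijo sufficient-decrease
    condition f(x - a*g) <= f(x) - c * a * ||g||^2 holds."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        alpha = alpha0
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
            alpha *= beta
        x = x - alpha * g
    return x

f = lambda x: 2 * x[0] ** 2 + x[1] ** 2 + x[0] * x[1] - 6 * x[0] - 5 * x[1]
grad_f = lambda x: np.array([4 * x[0] + x[1] - 6, x[0] + 2 * x[1] - 5])
x_opt = backtracking_gd(f, grad_f, [0.0, 0.0])
```

No Lipschitz constant is needed: the inner loop finds a workable step size automatically, converging to the minimizer (1, 2) on this example.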