Convex Optimization by Stephen Boyd and Lieven Vandenberghe; Optimization for Machine Learning by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright.
Convex Optimization - Chapter 1: Introduction
- mathematical optimization
- least-squares and linear programming
- convex optimization
- nonlinear optimization
Mathematical optimization
A (mathematical) optimization problem has the form
    minimize    f_0(x)
    subject to  f_i(x) ≤ b_i,  i = 1, ..., m
- x = (x_1, ..., x_n): optimization variables
- f_0 : R^n → R: objective function
- f_i : R^n → R, i = 1, ..., m: constraint functions
An optimal solution x* has the smallest value of f_0 among all vectors that satisfy the constraints.
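As a tiny illustration of the general form (a sketch assuming SciPy; the objective, constraint, and starting point are made up for the example):

```python
from scipy.optimize import minimize

# A small instance of: minimize f_0(x) subject to f_1(x) <= b_1
f0 = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2

# SciPy's "ineq" convention is fun(x) >= 0, so x_1^2 + x_2^2 <= 1 becomes 1 - (x_1^2 + x_2^2) >= 0
cons = [{"type": "ineq", "fun": lambda x: 1.0 - (x[0] ** 2 + x[1] ** 2)}]

res = minimize(f0, x0=[0.0, 0.0], constraints=cons)
print(res.x)  # optimal point on the boundary of the unit disk
```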
Examples
Portfolio optimization
- variables: amounts invested in different assets
- constraints: budget, max./min. investment per asset, minimum return
- objective: overall risk or return variance
Data fitting
- variables: model parameters
- constraints: prior information, parameter limits
- objective: measure of misfit or prediction error
Solving optimization problems
General optimization problems are very difficult to solve; methods involve some compromise, e.g., very long computation time, or not always finding the solution.
Exceptions: certain problem classes can be solved efficiently and reliably
- least-squares problems
- linear programming problems
- convex optimization problems
Least-squares
    minimize  ||Ax − b||_2^2
Solving least-squares problems
- analytical solution: x* = (A^T A)^{-1} A^T b
- reliable and efficient algorithms and software
- computation time proportional to n^2 k (A ∈ R^{k×n}); less if structured
- a mature technology
Using least-squares
- least-squares problems are easy to recognize
- a few standard techniques increase flexibility (e.g., including weights, adding regularization terms)
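A sketch in NumPy (the data are random placeholders): the closed-form solution via the normal equations, checked against a library least-squares solver, which is what one would use in practice for numerical stability:

```python
import numpy as np

# Placeholder data: A in R^{k x n}, b in R^k
rng = np.random.default_rng(0)
k, n = 100, 5
A = rng.standard_normal((k, n))
b = rng.standard_normal(k)

# Analytical solution x* = (A^T A)^{-1} A^T b via the normal equations
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Numerically preferable: SVD-based least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_normal, x_lstsq)
```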
Linear programming
    minimize    c^T x
    subject to  a_i^T x ≤ b_i,  i = 1, ..., m
Solving linear programs
- no analytical formula for the solution
- reliable and efficient algorithms and software
- computation time proportional to n^2 m if m ≥ n; less with structure
- a mature technology
Using linear programming
- a few standard tricks are used to convert problems into linear programs (e.g., problems involving l_1- or l_∞-norms, piecewise-linear functions)
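A sketch of solving a small LP of this form with SciPy's linprog (the data are placeholders; note the explicit free-variable bounds, since linprog assumes nonnegative variables by default):

```python
import numpy as np
from scipy.optimize import linprog

# Placeholder LP data: minimize c^T x subject to a_i^T x <= b_i
c = np.array([1.0, 2.0])
A_ub = np.array([[-1.0,  1.0],
                 [ 1.0,  1.0],
                 [-1.0, -2.0]])
b_ub = np.array([1.0, 3.0, 2.0])

# bounds=(None, None) makes the variables free, matching the formulation above
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
print(res.x, res.fun)
```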
Chebyshev approximation problem
    minimize  max_{i=1,...,k} |a_i^T x − b_i|
can be solved as an equivalent linear program (with variables x and t):
    minimize    t
    subject to   a_i^T x − t ≤  b_i,  i = 1, ..., k
                −a_i^T x − t ≤ −b_i,  i = 1, ..., k
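The reformulation can be passed to an LP solver directly; a sketch with SciPy, stacking the variables as z = (x, t) (the data are placeholders):

```python
import numpy as np
from scipy.optimize import linprog

# Placeholder data for min_x max_i |a_i^T x - b_i|, with rows a_i^T stacked in A
rng = np.random.default_rng(1)
k, n = 50, 3
A = rng.standard_normal((k, n))
b = rng.standard_normal(k)

# Stack variables z = (x, t); the objective picks out t
c = np.r_[np.zeros(n), 1.0]
#  a_i^T x - t <= b_i   and   -a_i^T x - t <= -b_i
A_ub = np.block([[ A, -np.ones((k, 1))],
                 [-A, -np.ones((k, 1))]])
b_ub = np.r_[b, -b]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
x_cheb, t_opt = res.x[:n], res.x[n]
print("optimal Chebyshev error:", t_opt)
```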
Convex optimization problem
    minimize    f_0(x)
    subject to  f_i(x) ≤ b_i,  i = 1, ..., m
- objective and constraint functions are convex:
    f_i(αx + βy) ≤ α f_i(x) + β f_i(y)   if α + β = 1, α ≥ 0, β ≥ 0
- includes least-squares problems and linear programs as special cases
Convex optimization
Solving convex optimization problems
- no analytical solution
- reliable and efficient algorithms
- computation time (roughly) proportional to max{n^3, n^2 m, F}, where F is the cost of evaluating the f_i's and their first and second derivatives
- almost a technology
Using convex optimization
- often difficult to recognize
- many tricks for transforming problems into convex form
- surprisingly many problems can be solved via convex optimization
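A sketch of the modeling-language workflow, assuming the CVXPY package is available (the particular problem, a constrained least-squares instance, is just an illustration):

```python
import numpy as np
import cvxpy as cp

# Placeholder convex problem: minimize ||Ax - b||_2^2 subject to sum(x) == 1, x >= 0
rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

x = cp.Variable(5)
objective = cp.Minimize(cp.sum_squares(A @ x - b))
constraints = [cp.sum(x) == 1, x >= 0]
prob = cp.Problem(objective, constraints)
prob.solve()
print(prob.value, x.value)
```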
Nonlinear optimization
Traditional techniques for general nonconvex problems involve compromises.
Local optimization methods (nonlinear programming)
- find a point that minimizes f_0 among feasible points near it
- fast, can handle large problems
- require an initial guess
- provide no information about the distance to the (global) optimum
Global optimization methods
- find the (global) solution
- worst-case complexity grows exponentially with problem size
These algorithms are often based on solving convex subproblems.
Optimization and Machine Learning
The (soft-margin) support vector machine problem:
    minimize_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^m ξ_i
    subject to        y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  1 ≤ i ≤ m.
Its dual:
    minimize_α   (1/2) α^T Y X^T X Y α − α^T 1
    subject to   Σ_i y_i α_i = 0,  0 ≤ α_i ≤ C,
where Y = Diag(y_1, ..., y_m) and X = [x_1, ..., x_m] ∈ R^{n×m}.
The primal solution is recovered as w = Σ_{i=1}^m α_i y_i x_i, and the classifier is f(x) = sgn(w^T x + b).
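As a rough sketch of solving the dual above with an off-the-shelf modeling tool (assuming CVXPY is installed; the toy data, the value of C, and the tolerances are all hypothetical choices for illustration):

```python
import numpy as np
import cvxpy as cp

# Toy two-class data: columns of X are samples, labels y in {-1, +1}
rng = np.random.default_rng(3)
n, m, C = 2, 40, 1.0
X = np.hstack([rng.standard_normal((n, m // 2)) + 2.0,
               rng.standard_normal((n, m // 2)) - 2.0])
y = np.r_[np.ones(m // 2), -np.ones(m // 2)]

# Dual: minimize (1/2) a^T Y X^T X Y a - 1^T a  s.t.  y^T a = 0, 0 <= a <= C
# Note (1/2) a^T Y X^T X Y a = (1/2) ||X Y a||_2^2, which avoids forming the Hessian
a = cp.Variable(m)
objective = cp.Minimize(0.5 * cp.sum_squares(X @ cp.multiply(y, a)) - cp.sum(a))
prob = cp.Problem(objective, [y @ a == 0, a >= 0, a <= C])
prob.solve()

# Recover w = sum_i alpha_i y_i x_i and the bias b from margin support vectors
alpha = a.value
w = X @ (alpha * y)
margin_sv = np.where((alpha > 1e-4) & (alpha < C - 1e-4))[0]  # assumes some 0 < alpha_i < C
b = float(np.mean(y[margin_sv] - X[:, margin_sv].T @ w))
print("training accuracy:", np.mean(np.sign(X.T @ w + b) == y))
```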
More powerful classifiers are obtained by allowing a kernel: K_ij := ⟨φ(x_i), φ(x_j)⟩ is the kernel matrix. Then w = Σ_{i=1}^m α_i y_i φ(x_i) and f(x) = sgn[Σ_{i=1}^m α_i y_i K(x_i, x) + b].
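A minimal sketch of the kernelized decision function, assuming an RBF kernel and that the dual variables alpha and bias b have already been computed (e.g. from a dual QP as above); the function names and the gamma parameter are illustrative, not from the source:

```python
import numpy as np

def rbf_kernel(U, V, gamma=0.5):
    # K_ij = exp(-gamma * ||u_i - v_j||^2); columns of U and V are samples
    d2 = (np.sum(U ** 2, axis=0)[:, None] + np.sum(V ** 2, axis=0)[None, :]
          - 2.0 * U.T @ V)
    return np.exp(-gamma * d2)

def decision_function(x_new, X, y, alpha, b, gamma=0.5):
    # f(x) = sgn( sum_i alpha_i y_i K(x_i, x) + b )
    k = rbf_kernel(X, x_new[:, None], gamma).ravel()
    return np.sign(np.sum(alpha * y * k) + b)
```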
Themes of algorithms
General techniques for convex quadratic programming have limited appeal here because of (1) large problem size and (2) an ill-conditioned Hessian. Two themes recur:
1. Decomposition: rather than computing a step in all components of α at once, these methods focus on a relatively small subset and fix the other components.
2. Regularized solutions.
Decomposition approach
Early approach: works with a subset B ⊂ {1, 2, ..., s}, whose size is assumed to exceed the number of nonzero components of α; it replaces one element of B at each iteration and then re-solves the reduced problem.
Sequential minimal optimization (SMO): works with just two components of α at each iteration, reducing each QP subproblem to triviality.
Decomposition approach
SVM light: uses a linearization of the objective around the current point to choose the working set B as the indices most likely to give descent, subject to a fixed size limit on B. Shrinking reduces the workload further by eliminating computation associated with components of α that appear to be at their lower or upper bounds. The computation per iteration is more complex, however, so further computational savings are needed.
Interior-point methods: hardly efficient on large problems (due to ill-conditioning of the kernel matrix); one remedy is to replace the Hessian with a low-rank matrix (VV^T, where V ∈ R^{m×r} for r ≪ m).
Coordinate relaxation procedures (see the sketch below).
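The coordinate relaxation idea can be sketched for the linear-kernel dual as follows. This is a simplified sketch in the style of dual coordinate descent: each step minimizes over a single α_i with the others fixed; for brevity it drops the equality constraint Σ_i y_i α_i = 0 (i.e. it omits the bias term), which is an assumption not made in the slides:

```python
import numpy as np

def dual_coordinate_descent(X, y, C=1.0, n_epochs=20):
    """Coordinate relaxation sketch for the SVM dual: each step solves exactly
    for one alpha_i, keeping the other components fixed (decomposition with |B| = 1)."""
    n, m = X.shape
    alpha = np.zeros(m)
    w = np.zeros(n)                      # maintained as w = sum_i alpha_i y_i x_i
    Qii = np.sum(X ** 2, axis=0)         # diagonal of Y X^T X Y (y_i^2 = 1)
    for _ in range(n_epochs):
        for i in np.random.permutation(m):
            grad = y[i] * (w @ X[:, i]) - 1.0            # partial derivative wrt alpha_i
            new_ai = np.clip(alpha[i] - grad / Qii[i], 0.0, C)
            w += (new_ai - alpha[i]) * y[i] * X[:, i]    # keep w consistent with alpha
            alpha[i] = new_ai
    return w, alpha
```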
Regularized solutions
1. Regularized solutions generalize better.
2. Regularized solutions provide simplicity (w is sparse).
    minimize_w  φ_γ(w) = f(w) + γ r(w)
with f(w) = Σ_i ξ_i, r(w) = (1/2) w^T w, and γ = 1/C: a trade-off between minimizing the misclassification error and reducing ||w||^2.
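For concreteness, a small sketch of evaluating φ_γ, assuming the ξ_i are the usual hinge-loss slacks at their optimal values (this assumption about the ξ_i is implicit in the SVM formulation above):

```python
import numpy as np

def phi_gamma(w, b, X, y, gamma):
    # phi_gamma(w) = f(w) + gamma * r(w), with f(w) = sum_i xi_i (the slacks)
    # and r(w) = (1/2) w^T w; gamma = 1/C sets the trade-off
    xi = np.maximum(0.0, 1.0 - y * (X.T @ w + b))   # slack values at their optimum
    return np.sum(xi) + gamma * 0.5 * float(w @ w)
```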
Applications
Image denoising
- r: total-variation (TV) norm
- result: large areas of constant intensity (a cartoon-like appearance)
Matrix completion
- W is the matrix variable
- regularizer: nuclear norm (the sum of the singular values of W)
- this regularizer favors matrices with low rank
Lasso procedure
- r: l_1-norm
- f is least squares
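As an illustration of the lasso case (least-squares f plus an l_1 regularizer), a sketch assuming scikit-learn is available; the data and the regularization weight are hypothetical:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Placeholder problem: sparse ground truth recovered from noisy linear measurements
rng = np.random.default_rng(4)
A = rng.standard_normal((100, 20))
x_true = np.zeros(20)
x_true[:3] = [1.5, -2.0, 0.7]
b = A @ x_true + 0.01 * rng.standard_normal(100)

# Lasso: least-squares fit plus an l_1 penalty, which favors sparse coefficients
model = Lasso(alpha=0.1).fit(A, b)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```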
Algorithms
Gradient and subgradient methods:
    w_{k+1} ← w_k − δ_k g_k
This method ensures sublinear convergence: φ_γ(w_k) − φ_γ(w*) ≤ O(1/k^{1/2}).
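A sketch of the (sub)gradient iteration applied to φ_γ with the hinge-loss f and quadratic r from before (the bias term is omitted and the diminishing step δ_k = 1/k is an illustrative choice, not prescribed by the slides):

```python
import numpy as np

def subgradient_method(X, y, gamma=0.1, n_iters=500):
    """w_{k+1} <- w_k - delta_k g_k for phi_gamma(w) = sum_i xi_i + (gamma/2) w^T w."""
    n, m = X.shape
    w = np.zeros(n)
    for k in range(1, n_iters + 1):
        margins = y * (X.T @ w)
        # subgradient of the hinge-loss term (active examples) plus gradient of (gamma/2)||w||^2
        g = -X[:, margins < 1.0] @ y[margins < 1.0] + gamma * w
        w -= (1.0 / k) * g               # delta_k = 1/k, a standard diminishing step size
    return w
```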
Algorithms
Second approach:
    w_{k+1} := arg min_w  (w − w_k)^T ∇f(w_k) + γ r(w) + (1/(2μ)) ||w − w_k||_2^2
It works well for f with a Lipschitz continuous gradient. Sublinear rate of convergence O(1/k); in special cases O(1/k^2). Some methods use second-order information.
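When r is the l_1-norm, the subproblem above has a closed-form soft-thresholding solution; a sketch for least-squares f (the step size 1/L and the iteration count are illustrative choices):

```python
import numpy as np

def prox_gradient_lasso(A, b, gamma=0.1, n_iters=200):
    """Proximal-gradient sketch for f(w) = 0.5*||Aw - b||^2 and r(w) = ||w||_1."""
    mu = 1.0 / np.linalg.norm(A, 2) ** 2        # step 1/L, L = Lipschitz constant of grad f
    w = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ w - b)
        z = w - mu * grad
        # closed-form minimizer of the subproblem: soft-thresholding at mu*gamma
        w = np.sign(z) * np.maximum(np.abs(z) - mu * gamma, 0.0)
    return w
```

Accelerated (momentum) variants of this step are what attain the O(1/k^2) rate mentioned above.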