Convex Optimization
9. Unconstrained minimization
Prof. Ying Cui, Department of Electrical Engineering, Shanghai Jiao Tong University
2017 Autumn Semester
Outline
◮ Unconstrained minimization problems
◮ Descent methods
◮ Gradient descent method
◮ Steepest descent method
◮ Newton's method
◮ Self-concordance
◮ Implementation
Unconstrained minimization

    minimize f(x)

assumptions:
◮ f : R^n → R is convex and twice continuously differentiable (implying that dom f is open)
◮ there exists an optimal point x* (the optimal value p* = inf_x f(x) is attained and finite)

a necessary and sufficient condition for optimality: ∇f(x*) = 0
◮ solving the unconstrained minimization problem is the same as finding a solution of the optimality equation
◮ in a few special cases, it can be solved analytically
◮ usually, it must be solved by an iterative algorithm
  ◮ produces a sequence of points x^(k) ∈ dom f, k = 0, 1, ..., with f(x^(k)) → p* as k → ∞
  ◮ terminated when f(x^(k)) − p* ≤ ε for some tolerance ε > 0
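One of the special cases where the optimality equation ∇f(x*) = 0 can be solved analytically is the convex quadratic. A minimal sketch (the matrix P and vector q are my own example data, not from the slides):

```python
import numpy as np

# For a convex quadratic f(x) = (1/2) x^T P x + q^T x with P positive
# definite, the optimality equation grad f(x*) = P x* + q = 0 is a
# linear system solved directly.
P = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # assumed positive definite Hessian
q = np.array([-1.0, 2.0])

x_star = np.linalg.solve(P, -q)     # solves P x = -q, i.e. grad f(x) = 0
grad_at_opt = P @ x_star + q
print(np.linalg.norm(grad_at_opt))  # essentially zero: x_star is optimal
```

For non-quadratic f no such closed form exists in general, which is why the iterative algorithms of this chapter are needed.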
Initial point and sublevel set

algorithms in this chapter require a starting point x^(0) such that
◮ x^(0) ∈ dom f
◮ the sublevel set S = {x | f(x) ≤ f(x^(0))} is closed (hard to verify)

the 2nd condition is satisfied for all x^(0) ∈ dom f if f is closed, i.e., all sublevel sets are closed, equivalent to epi f being closed
◮ true if f is continuous and dom f = R^n
◮ true if f(x) → ∞ as x → bd dom f

examples of differentiable functions with closed sublevel sets:

    f(x) = log(∑_{i=1}^m exp(a_i^T x + b_i)),    f(x) = −∑_{i=1}^m log(b_i − a_i^T x)
Strong convexity and implications

f is strongly convex on S if there exists an m > 0 such that ∇²f(x) ⪰ mI for all x ∈ S

implications
◮ for x, y ∈ S,  f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)||x − y||2²
  ◮ m = 0: recovers the basic inequality characterizing convexity
  ◮ m > 0: a better lower bound than follows from convexity alone
◮ implies that S is bounded
◮ p* > −∞ and, for x ∈ S,  f(x) − p* ≤ (1/(2m))||∇f(x)||2²
  ◮ if the gradient is small at a point, then the point is nearly optimal
  ◮ a condition for suboptimality generalizing the optimality condition:
      ||∇f(x)||2 ≤ (2mε)^{1/2}  ⟹  f(x) − p* ≤ ε
  ◮ useful as a stopping criterion if m is known
◮ upper bound on ∇²f(x): there exists an M > 0 such that ∇²f(x) ⪯ MI for all x ∈ S
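The suboptimality bound f(x) − p* ≤ (1/(2m))||∇f(x)||2² can be checked numerically on a strongly convex quadratic, where m is the smallest Hessian eigenvalue and p* is known exactly (the test data below is my own assumption):

```python
import numpy as np

# For f(x) = (1/2) x^T P x with P positive definite, p* = 0 at x* = 0
# and the strong convexity constant m is the smallest eigenvalue of P.
rng = np.random.default_rng(0)
P = np.diag([1.0, 4.0, 10.0])       # assumed Hessian; m = 1, M = 10
m = np.linalg.eigvalsh(P).min()

for _ in range(100):
    x = rng.normal(size=3)
    f_gap = 0.5 * x @ P @ x         # f(x) - p*, since p* = 0
    grad = P @ x
    bound = (grad @ grad) / (2 * m) # (1/(2m)) ||grad f(x)||_2^2
    assert f_gap <= bound + 1e-12   # the suboptimality bound holds
print("bound holds at all sampled points")
```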
Condition number of matrix and convex set

◮ condition number of a matrix: the ratio of its largest eigenvalue to its smallest eigenvalue
◮ condition number of a convex set: the square of the ratio of its maximum width to its minimum width
  ◮ width of a convex set C in the direction q, with ||q||2 = 1:
      W(C, q) = sup_{z∈C} q^T z − inf_{z∈C} q^T z
  ◮ minimum width and maximum width of C:
      W_min = inf_{||q||2=1} W(C, q),   W_max = sup_{||q||2=1} W(C, q)
  ◮ condition number of C:  cond(C) = W_max² / W_min²
  ◮ a measure of its anisotropy or eccentricity: cond(C) small means C has approximately the same width in all directions (nearly spherical); cond(C) large means that C is far wider in some directions than in others
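The two notions agree for an ellipsoid: for C = {x | x^T A x ≤ 1} with A positive definite, the width in a unit direction q is 2√(q^T A^{-1} q), so cond(C) equals the eigenvalue ratio of A. A small numerical sketch (the matrix A is an assumed example):

```python
import numpy as np

# Ellipsoid C = {x | x^T A x <= 1}; its width in unit direction q is
# W(C, q) = 2 * sqrt(q^T A^{-1} q), so W_max^2 / W_min^2 = eig ratio of A.
A = np.diag([1.0, 10.0])            # assumed shape matrix
A_inv = np.linalg.inv(A)

# sample many unit directions q = (cos t, sin t) and compute widths
thetas = np.linspace(0, np.pi, 10001)
qs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
widths = 2 * np.sqrt(np.einsum('ij,jk,ik->i', qs, A_inv, qs))

cond_C = widths.max() ** 2 / widths.min() ** 2
eigs = np.linalg.eigvalsh(A)
print(cond_C, eigs.max() / eigs.min())   # both approximately 10
```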
Condition number of sublevel sets

mI ⪯ ∇²f(x) ⪯ MI for all x ∈ S
◮ upper bound on the condition number of ∇²f(x): cond(∇²f(x)) ≤ M/m
◮ upper bound on the condition number of the sublevel set C_α = {x | f(x) ≤ α}, p* < α ≤ f(x^(0)): cond(C_α) ≤ M/m
◮ geometric interpretation:
    lim_{α→p*} cond(C_α) = cond(∇²f(x*))
◮ the condition number of the sublevel sets of f (which is bounded by M/m) has a strong effect on the efficiency of some common methods for unconstrained minimization
Descent methods

algorithms described in this chapter produce a minimizing sequence x^(k), k = 1, ..., where
    x^(k+1) = x^(k) + t^(k) ∆x^(k)  with f(x^(k+1)) < f(x^(k)) and t^(k) > 0
◮ other notations: x+ = x + t∆x,  x := x + t∆x
◮ ∆x is the step (or search direction); t is the step size (or step length)
◮ convexity of f implies ∇f(x^(k))^T ∆x^(k) < 0 (i.e., ∆x^(k) is a descent direction)

General descent method.
given a starting point x ∈ dom f.
repeat
    1. Determine a descent direction ∆x.
    2. Line search. Choose a step size t > 0.
    3. Update. x := x + t∆x.
until stopping criterion is satisfied.
Line search types

exact line search:  t = argmin_{t>0} f(x + t∆x)
◮ minimize f along the ray {x + t∆x | t ≥ 0}
◮ used when the cost of the minimization problem with one variable is low compared to the cost of computing the search direction itself
◮ in some special cases the minimizer can be found analytically, and in others it can be computed efficiently
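One case where the exact line search minimizer has an analytic form is the convex quadratic: for f(x) = (1/2)x^T P x + q^T x and descent direction ∆x, minimizing f(x + t∆x) over t gives t* = −(∇f(x)^T ∆x)/(∆x^T P ∆x). A sketch with assumed example data:

```python
import numpy as np

# Exact line search for a quadratic: phi(t) = f(x + t*dx) is itself a
# 1-D quadratic, so phi'(t*) = 0 gives t* in closed form.
P = np.array([[2.0, 0.0],
              [0.0, 8.0]])          # assumed positive definite
q = np.zeros(2)
x = np.array([4.0, 1.0])

grad = P @ x + q
dx = -grad                          # gradient descent direction
t_star = -(grad @ dx) / (dx @ P @ dx)

# verify against a fine grid search over t
f = lambda z: 0.5 * z @ P @ z + q @ z
ts = np.linspace(0, 2 * t_star, 1001)
numeric_best = min(f(x + t * dx) for t in ts)
assert abs(f(x + t_star * dx) - numeric_best) < 1e-5
```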
Line search types

backtracking line search (with parameters α ∈ (0, 1/2), β ∈ (0, 1))
◮ reduce f enough along the ray {x + t∆x | t ≥ 0}
◮ starting at t = 1, repeat t := βt until f(x + t∆x) < f(x) + αt∇f(x)^T ∆x
◮ convexity of f:  f(x + t∆x) ≥ f(x) + t∇f(x)^T ∆x
◮ the constant α can be interpreted as the fraction of the decrease in f predicted by linear extrapolation that we will accept
◮ graphical interpretation: backtrack until t ≤ t0

Figure 9.1 Backtracking line search. The curve shows f, restricted to the line over which we search. The lower dashed line shows the linear extrapolation of f, and the upper dashed line has a slope a factor of α smaller. The backtracking condition is that f lies below the upper dashed line, i.e., 0 ≤ t ≤ t0.
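The backtracking rule above translates directly into a short loop; the sketch below uses my own function names and an assumed quadratic test objective:

```python
import numpy as np

# Minimal backtracking line search: shrink t by beta until the
# sufficient-decrease condition f(x+t*dx) < f(x) + alpha*t*grad^T dx holds.
def backtrack(f, grad_f, x, dx, alpha=0.25, beta=0.5):
    t = 1.0
    while f(x + t * dx) > f(x) + alpha * t * grad_f(x) @ dx:
        t *= beta
    return t

# assumed example: f(x) = (1/2)(x1^2 + 10 x2^2)
f = lambda x: 0.5 * (x[0]**2 + 10 * x[1]**2)
grad_f = lambda x: np.array([x[0], 10 * x[1]])

x = np.array([10.0, 1.0])
dx = -grad_f(x)
t = backtrack(f, grad_f, x, dx)
assert f(x + t * dx) < f(x)         # the accepted step decreases f
```

Termination is guaranteed for a descent direction: convexity implies the condition holds for all sufficiently small t.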
Gradient descent method

general descent method with ∆x = −∇f(x)

Gradient descent method.
given a starting point x ∈ dom f.
repeat
    1. ∆x := −∇f(x).
    2. Line search. Choose step size t via exact or backtracking line search.
    3. Update. x := x + t∆x.
until stopping criterion is satisfied.

◮ stopping criterion usually of the form ||∇f(x)||2 ≤ ε
◮ convergence result: for strongly convex f,
    f(x^(k)) − p* ≤ c^k (f(x^(0)) − p*)
  ◮ exact line search: c = 1 − m/M < 1
  ◮ backtracking line search: c = 1 − min{2mα, 2βαm/M} < 1
◮ linear convergence: the error lies below a line on a log-linear plot of error versus iteration number
◮ very simple, but often very slow; rarely used in practice
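Putting the pieces together, a minimal sketch of the method with backtracking line search and the gradient-norm stopping criterion (function names and the test problem are my own assumptions):

```python
import numpy as np

# Gradient descent with backtracking line search; stops when
# ||grad f(x)||_2 <= eps, per the stopping criterion above.
def gradient_descent(f, grad_f, x0, eps=1e-6, alpha=0.25, beta=0.5,
                     max_iter=10000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            return x, k
        dx = -g                                   # descent direction
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta                             # backtracking
        x = x + t * dx
    return x, max_iter

# assumed test problem: f(x) = (1/2)(x1^2 + 10 x2^2), minimizer at 0
f = lambda x: 0.5 * (x[0]**2 + 10 * x[1]**2)
grad_f = lambda x: np.array([x[0], 10 * x[1]])
x_opt, iters = gradient_descent(f, grad_f, [10.0, 1.0])
assert np.linalg.norm(x_opt) < 1e-5
```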
Examples

a quadratic problem in R²:  f(x) = (1/2)(x1² + γx2²),  γ > 0

with exact line search, starting at x^(0) = (γ, 1): closed-form expressions for the iterates

    x1^(k) = γ((γ−1)/(γ+1))^k,   x2^(k) = (−(γ−1)/(γ+1))^k,   f(x^(k)) = ((γ−1)/(γ+1))^{2k} f(x^(0))

◮ exact solution found in one iteration if γ = 1; convergence rapid if γ not far from 1; convergence very slow if γ ≫ 1 or γ ≪ 1

Figure 9.2 Some contour lines of the function f(x) = (1/2)(x1² + 10x2²). The condition number of the sublevel sets, which are ellipsoids, is exactly 10. The figure shows the iterates of the gradient method with exact line search, started at x^(0) = (10, 1).
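The closed-form iterates above can be verified numerically by running gradient descent with exact line search (which, for a quadratic, has the closed form t = g^T g / (g^T P g)) and comparing against the formulas:

```python
import numpy as np

# Verify the closed-form iterates for f(x) = (1/2)(x1^2 + gamma*x2^2),
# exact line search, starting at x0 = (gamma, 1).
gamma = 10.0
P = np.diag([1.0, gamma])           # Hessian of f
x = np.array([gamma, 1.0])
r = (gamma - 1) / (gamma + 1)

for k in range(1, 6):
    g = P @ x                       # gradient at x
    t = (g @ g) / (g @ P @ g)       # exact line search for a quadratic
    x = x - t * g
    predicted = np.array([gamma * r**k, (-r)**k])
    assert np.allclose(x, predicted)
print("closed-form iterates confirmed")
```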
Examples

a nonquadratic problem in R²:

    f(x1, x2) = e^{x1+3x2−0.1} + e^{x1−3x2−0.1} + e^{−x1−0.1}

◮ backtracking line search: approximately linear convergence (sublevel sets of f not too badly conditioned, M/m not too large)
◮ exact line search: approximately linear convergence, about twice as fast as with backtracking line search

Figure 9.3 Iterates of the gradient method with backtracking line search, for the problem in R² with objective f given in (9.20). The dashed curves are level curves of f, and the small circles are the iterates of the gradient method. The solid lines, which connect successive iterates, show the scaled steps t^(k)∆x^(k).

Figure 9.4 Error f(x^(k)) − p* versus iteration k of the gradient method with backtracking and exact line search, for the problem in R² with objective f given in (9.20). The plot shows nearly linear convergence, with the error reduced approximately by the factor 0.4 in each iteration of the gradient method with backtracking line search, and by the factor 0.2 in each iteration of the gradient method with exact line search.

Figure 9.5 Iterates of the gradient method with exact line search for the problem in R² with objective f given in (9.20).
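This example can be reproduced with the gradient method and backtracking line search; for this objective, setting the gradient to zero gives the analytic minimizer x2* = 0, e^{2x1*} = 1/2. The starting point and line-search parameters below are my own assumptions:

```python
import numpy as np

# Gradient descent with backtracking on
# f(x) = exp(x1+3x2-0.1) + exp(x1-3x2-0.1) + exp(-x1-0.1).
def f(x):
    return (np.exp(x[0] + 3*x[1] - 0.1) + np.exp(x[0] - 3*x[1] - 0.1)
            + np.exp(-x[0] - 0.1))

def grad_f(x):
    e1 = np.exp(x[0] + 3*x[1] - 0.1)
    e2 = np.exp(x[0] - 3*x[1] - 0.1)
    e3 = np.exp(-x[0] - 0.1)
    return np.array([e1 + e2 - e3, 3*e1 - 3*e2])

x = np.array([-1.0, 1.0])               # assumed starting point
for _ in range(500):
    g = grad_f(x)
    t, dx = 1.0, -g
    while f(x + t*dx) > f(x) + 0.1 * t * (g @ dx):  # alpha = 0.1
        t *= 0.5                                     # beta = 0.5
    x = x + t*dx

# grad f = 0 gives x2* = 0 and exp(2*x1*) = 1/2, i.e. x1* = -log(2)/2
x_star = np.array([-np.log(2) / 2, 0.0])
assert np.allclose(x, x_star, atol=1e-4)
```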
Examples

a problem in R^100:

    f(x) = c^T x − ∑_{i=1}^{500} log(b_i − a_i^T x)

◮ backtracking line search: approximately linear convergence
◮ exact line search: approximately linear convergence, only a bit faster than with backtracking line search

Figure 9.6 Error f(x^(k)) − p* versus iteration k for the gradient method with backtracking and exact line search, for a problem in R^100.