Modern Computational Statistics
Lecture 2: Optimization
Cheng Zhang, School of Mathematical Sciences, Peking University
September 11, 2019
Least Square Regression Models 2/38
◮ Consider the following least square problem:
minimize $L(\beta) = \frac{1}{2} \| Y - X\beta \|^2$
◮ Note that this is a quadratic problem, which can be solved by setting the gradient to zero:
$\nabla_\beta L(\hat\beta) = -X^T (Y - X\hat\beta) = 0 \;\Rightarrow\; \hat\beta = (X^T X)^{-1} X^T Y$
given that the Hessian is positive definite:
$\nabla^2 L(\beta) = X^T X \succ 0$
which is true iff $X$ has independent columns.
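The closed-form solution $\hat\beta = (X^T X)^{-1} X^T Y$ can be checked numerically; a minimal NumPy sketch (the simulated data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # design matrix with independent columns
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=100)

# Normal equations: solve (X^T X) beta = X^T Y.
# np.linalg.solve is preferred over explicitly inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check with NumPy's built-in least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Solving the linear system is both faster and numerically more stable than forming $(X^T X)^{-1}$ explicitly.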
Regularized Regression Models 3/38
◮ In practice, we would like to solve the least square problem with some constraints on the parameters to control the complexity of the resulting model.
◮ One common approach is to use bridge regression models (Frank and Friedman, 1993):
minimize $L(\beta) = \frac{1}{2} \| Y - X\beta \|^2$
subject to $\sum_{j=1}^{p} |\beta_j|^\gamma \le s$
◮ Two important special cases are ridge regression (Hoerl and Kennard, 1970), $\gamma = 2$, and the Lasso (Tibshirani, 1996), $\gamma = 1$.
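For the ridge case ($\gamma = 2$), the constrained problem has an equivalent penalized form with the closed-form solution $\hat\beta = (X^T X + \lambda I)^{-1} X^T Y$; a small sketch (the penalty strength and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=50)

lam = 1.0  # penalty strength, corresponding to some constraint level s
# Ridge (gamma = 2) in penalized form:
# minimize 0.5 ||Y - X b||^2 + 0.5 * lam * ||b||^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
# The penalty shrinks the coefficient vector toward zero.
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols)
```

Each constraint level $s$ corresponds to some penalty $\lambda$; larger $\lambda$ means a tighter constraint and stronger shrinkage.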
General Optimization Problems 4/38
◮ In general, optimization problems take the following form:
minimize $f_0(x)$
subject to $f_i(x) \le 0, \quad i = 1, \dots, m$
$h_j(x) = 0, \quad j = 1, \dots, p$
◮ We are mostly interested in convex optimization problems, where the objective function $f_0(x)$ and the inequality constraint functions $f_i(x)$ are convex, and the equality constraint functions $h_j(x)$ are affine.
Convex Sets 5/38
◮ A set $C$ is convex if the line segment between any two points in $C$ also lies in $C$, i.e.,
$\theta x_1 + (1 - \theta) x_2 \in C, \quad \forall x_1, x_2 \in C, \; 0 \le \theta \le 1$
◮ If $C$ is a convex set in $\mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}^m$ is an affine function, then $f(C)$, i.e., the image of $C$, is also a convex set.
Convex Functions 6/38
◮ A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if its domain $D_f$ is a convex set and, $\forall x, y \in D_f$ and $0 \le \theta \le 1$,
$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$
◮ For example, many norms are convex functions:
$\| x \|_p = \left( \sum_i |x_i|^p \right)^{1/p}, \quad p \ge 1$
Convex Functions 7/38
◮ First order conditions. Suppose $f$ is differentiable. Then $f$ is convex iff $D_f$ is convex and
$f(y) \ge f(x) + \nabla f(x)^T (y - x), \quad \forall x, y \in D_f$
Corollary (Jensen's inequality): for a convex function $f$, $f(E(X)) \le E(f(X))$.
◮ Second order conditions. Suppose $f$ is twice differentiable. Then $f$ is convex iff $D_f$ is convex and
$\nabla^2 f(x) \succeq 0, \quad \forall x \in D_f$
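The first-order condition says the tangent plane at any point lies below the graph everywhere. A quick numerical sanity check for one smooth convex function (the choice $f(x) = \sum_i e^{x_i}$ is just an example):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return np.sum(np.exp(x))      # a smooth convex function on R^n

def grad_f(x):
    return np.exp(x)

# First-order condition: f(y) >= f(x) + grad_f(x)^T (y - x) for all x, y.
for _ in range(1000):
    x, y = rng.normal(size=(2, 5))
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-9  # small tolerance for rounding
```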
Basic Terminology and Notations 8/38
◮ Optimal value: $p^* = \inf \{ f_0(x) \mid f_i(x) \le 0, \; h_j(x) = 0 \}$
◮ $x$ is feasible if $x \in D = \bigcap_{i=0}^{m} D_{f_i} \cap \bigcap_{j=1}^{p} D_{h_j}$ and satisfies the constraints.
◮ A feasible $x^*$ is optimal if $f_0(x^*) = p^*$.
◮ Optimality criterion. Assuming $f_0$ is convex and differentiable, a feasible $x$ is optimal iff
$\nabla f_0(x)^T (y - x) \ge 0, \quad \forall$ feasible $y$
Remark: for unconstrained problems, $x$ is optimal iff $\nabla f_0(x) = 0$.
Basic Terminology and Notations 9/38
Local optimality: $x$ is locally optimal if, for some $R > 0$, it is optimal for
minimize $f_0(z)$
subject to $f_i(z) \le 0, \quad i = 1, \dots, m$
$h_j(z) = 0, \quad j = 1, \dots, p$
$\| z - x \| \le R$
In convex optimization problems, any locally optimal point is also globally optimal.
The Lagrangian 10/38
◮ Consider a general optimization problem:
minimize $f_0(x)$
subject to $f_i(x) \le 0, \quad i = 1, \dots, m$
$h_j(x) = 0, \quad j = 1, \dots, p$
◮ To take the constraints into account, we augment the objective function with a weighted sum of the constraints and define the Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ as
$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{j=1}^{p} \nu_j h_j(x)$
where $\lambda$ and $\nu$ are the dual variables or Lagrange multipliers.
The Lagrangian Dual Function 11/38
◮ We define the Lagrangian dual function as
$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu)$
◮ As the pointwise infimum of a family of affine functions of $(\lambda, \nu)$, the dual function is concave, even when the original problem is not convex.
◮ If $\lambda \ge 0$, for each feasible point $\tilde{x}$,
$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) \le L(\tilde{x}, \lambda, \nu) \le f_0(\tilde{x})$
◮ Therefore, $g(\lambda, \nu)$ is a lower bound for the optimal value:
$g(\lambda, \nu) \le p^*, \quad \forall \lambda \ge 0, \; \nu \in \mathbb{R}^p$
The Lagrangian Dual Problem 12/38
◮ Finding the best lower bound leads to the Lagrangian dual problem:
maximize $g(\lambda, \nu)$ subject to $\lambda \ge 0$
◮ The above problem is a convex optimization problem.
◮ We denote the optimal value by $d^*$ and call the corresponding solution $(\lambda^*, \nu^*)$ the dual optimal point.
◮ In contrast, the original problem is called the primal problem, whose solution $x^*$ is called the primal optimal point.
Weak vs. Strong Duality 13/38
◮ $d^*$ is the best lower bound for $p^*$ that can be obtained from the Lagrangian dual function.
◮ Weak duality: $d^* \le p^*$.
◮ The difference $p^* - d^*$ is called the optimal duality gap.
◮ Strong duality: $d^* = p^*$.
Slater’s Condition 14/38
◮ Strong duality doesn’t hold in general, but if the primal problem is convex, it usually holds under some conditions called constraint qualifications.
◮ A simple and well-known constraint qualification is Slater’s condition: there exists an $x$ in the relative interior of $D$ such that
$f_i(x) < 0, \quad i = 1, \dots, m, \qquad Ax = b$
Complementary Slackness 15/38
◮ Consider primal optimal $x^*$ and dual optimal $(\lambda^*, \nu^*)$.
◮ If strong duality holds,
$f_0(x^*) = g(\lambda^*, \nu^*) = \inf_x \left( f_0(x) + \sum_{i=1}^{m} \lambda_i^* f_i(x) + \sum_{j=1}^{p} \nu_j^* h_j(x) \right)$
$\le f_0(x^*) + \sum_{i=1}^{m} \lambda_i^* f_i(x^*) + \sum_{j=1}^{p} \nu_j^* h_j(x^*)$
$\le f_0(x^*)$
◮ Therefore, these are all equalities.
Complementary Slackness 16/38
◮ Important conclusions:
◮ $x^*$ minimizes $L(x, \lambda^*, \nu^*)$
◮ $\lambda_i^* f_i(x^*) = 0, \quad i = 1, \dots, m$
◮ The latter is called complementary slackness, which indicates
$\lambda_i^* > 0 \Rightarrow f_i(x^*) = 0, \qquad f_i(x^*) < 0 \Rightarrow \lambda_i^* = 0$
◮ When the dual problem is easier to solve, we can find $(\lambda^*, \nu^*)$ and then minimize $L(x, \lambda^*, \nu^*)$. If the resulting solution is primal feasible, then it is primal optimal.
Entropy Maximization 17/38
◮ Consider the entropy maximization problem:
minimize $f_0(x) = \sum_{i=1}^{n} x_i \log x_i$
subject to $-x_i \le 0, \quad i = 1, \dots, n$
$\sum_{i=1}^{n} x_i = 1$
◮ Lagrangian:
$L(x, \lambda, \nu) = \sum_{i=1}^{n} x_i \log x_i - \sum_{i=1}^{n} \lambda_i x_i + \nu \left( \sum_{i=1}^{n} x_i - 1 \right)$
◮ We minimize $L(x, \lambda, \nu)$ by setting $\partial L / \partial x_i$ to zero:
$\log \hat{x}_i + 1 - \lambda_i + \nu = 0 \;\Rightarrow\; \hat{x}_i = \exp(\lambda_i - \nu - 1)$
Entropy Maximization 18/38
◮ The dual function is
$g(\lambda, \nu) = -\sum_{i=1}^{n} \exp(\lambda_i - \nu - 1) - \nu$
◮ Dual problem:
maximize $g(\lambda, \nu) = -\exp(-\nu - 1) \sum_{i=1}^{n} \exp(\lambda_i) - \nu$ subject to $\lambda \ge 0$
◮ We find the dual optimal point:
$\lambda_i^* = 0, \quad i = 1, \dots, n, \qquad \nu^* = -1 + \log n$
Entropy Maximization 19/38
◮ We now minimize $L(x, \lambda^*, \nu^*)$:
$\log x_i^* + 1 - \lambda_i^* + \nu^* = 0 \;\Rightarrow\; x_i^* = \frac{1}{n}$
◮ Therefore, the discrete probability distribution that has maximum entropy is the uniform distribution.
Exercise: Show that $X \sim N(\mu, \sigma^2)$ is the maximum entropy distribution such that $EX = \mu$ and $EX^2 = \mu^2 + \sigma^2$. How about fixing the first $k$ moments at $EX^i = m_i, \; i = 1, \dots, k$?
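The conclusion that the uniform distribution minimizes $\sum_i x_i \log x_i$ over the probability simplex can be checked empirically; a small sketch using random probability vectors (the Dirichlet sampler is just a convenient way to draw points from the simplex):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5

def neg_entropy(x):
    # f0(x) = sum_i x_i log x_i, with the convention 0 log 0 = 0
    return np.sum(np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0))

uniform = np.full(n, 1.0 / n)
# The uniform distribution should attain the smallest value of f0
# (i.e., the largest entropy) among distributions on n points.
for _ in range(1000):
    p = rng.dirichlet(np.ones(n))     # random probability vector on the simplex
    assert neg_entropy(uniform) <= neg_entropy(p) + 1e-12
```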
Karush-Kuhn-Tucker (KKT) conditions 20/38
◮ Suppose the functions $f_0, f_1, \dots, f_m, h_1, \dots, h_p$ are all differentiable, and $x^*$ and $(\lambda^*, \nu^*)$ are primal and dual optimal points with zero duality gap.
◮ Since $x^*$ minimizes $L(x, \lambda^*, \nu^*)$, the gradient vanishes at $x^*$:
$\nabla f_0(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla f_i(x^*) + \sum_{j=1}^{p} \nu_j^* \nabla h_j(x^*) = 0$
◮ Additionally,
$f_i(x^*) \le 0, \quad i = 1, \dots, m$
$h_j(x^*) = 0, \quad j = 1, \dots, p$
$\lambda_i^* \ge 0, \quad i = 1, \dots, m$
$\lambda_i^* f_i(x^*) = 0, \quad i = 1, \dots, m$
◮ These are called the Karush-Kuhn-Tucker (KKT) conditions.
KKT conditions for convex problems 21/38
◮ When the primal problem is convex, the KKT conditions are also sufficient for the points to be primal and dual optimal with zero duality gap.
◮ Let $(\tilde{x}, \tilde{\lambda}, \tilde{\nu})$ be any points that satisfy the KKT conditions. Then $\tilde{x}$ is primal feasible and minimizes $L(x, \tilde{\lambda}, \tilde{\nu})$, so
$g(\tilde{\lambda}, \tilde{\nu}) = L(\tilde{x}, \tilde{\lambda}, \tilde{\nu}) = f_0(\tilde{x}) + \sum_{i=1}^{m} \tilde{\lambda}_i f_i(\tilde{x}) + \sum_{j=1}^{p} \tilde{\nu}_j h_j(\tilde{x}) = f_0(\tilde{x})$
◮ Therefore, for convex optimization problems with differentiable functions that satisfy Slater’s condition, the KKT conditions are necessary and sufficient.
Example 22/38
◮ Consider the following problem:
minimize $\frac{1}{2} x^T P x + q^T x + r, \quad P \succeq 0$
subject to $Ax = b$
◮ KKT conditions:
$P x^* + q + A^T \nu^* = 0$
$A x^* = b$
◮ To find $x^*$ and $\nu^*$, we can solve the above system of linear equations.
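The two KKT conditions can be stacked into one linear system in $(x^*, \nu^*)$ and solved directly; a minimal sketch with illustrative problem data:

```python
import numpy as np

# Equality-constrained QP: minimize 0.5 x^T P x + q^T x + r  s.t.  A x = b
P = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite
q = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

# Stack the KKT conditions  P x + A^T nu = -q,  A x = b
# into one block linear system.
n, m = P.shape[0], A.shape[0]
KKT = np.block([[P, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
x_star, nu_star = sol[:n], sol[n:]

# Verify stationarity and primal feasibility.
assert np.allclose(P @ x_star + q + A.T @ nu_star, 0)
assert np.allclose(A @ x_star, b)
```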
Descent Methods 23/38
◮ We now focus on numerical solutions for unconstrained optimization problems:
minimize $f(x)$
where $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable.
◮ Descent method. We can set up a sequence
$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)}, \quad t^{(k)} > 0$
such that $f(x^{(k+1)}) < f(x^{(k)}), \quad k = 0, 1, \dots$
◮ $\Delta x^{(k)}$ is called the search direction; $t^{(k)}$ is called the step size, or learning rate in machine learning.
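A descent method with the negative gradient as search direction and a fixed step size can be sketched as follows (the quadratic objective and step size are illustrative; a fixed step only converges when it is small enough relative to the curvature):

```python
import numpy as np

def gradient_descent(grad, x0, t=0.1, iters=500):
    """Descent method with search direction -grad(x) and fixed step size t."""
    x = x0
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# Minimize f(x) = 0.5 x^T A x - b^T x; the unique minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = gradient_descent(lambda x: A @ x - b, np.zeros(2))
assert np.allclose(x_min, np.linalg.solve(A, b), atol=1e-6)
```

In practice the step size is usually chosen adaptively, e.g. by a backtracking line search, rather than fixed in advance.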