Elements of differential calculus and optimization. Joan Alexis Glaunès, October 24, 2019
Differential Calculus in R^n
Partial derivatives

Partial derivatives of a real-valued function defined on R^n: f : R^n → R.

◮ Example: f : R^2 → R,
  f(x_1, x_2) = 2(x_1 - 1)^2 + x_1 x_2 + x_2^2
  ⇒ ∂f/∂x_1(x_1, x_2) = 4(x_1 - 1) + x_2,   ∂f/∂x_2(x_1, x_2) = x_1 + 2 x_2.

◮ Example: f : R^n → R,
  f(x) = f(x_1, ..., x_n) = (x_2 - x_1)^2 + (x_3 - x_2)^2 + ... + (x_n - x_{n-1})^2
  ⇒
  ∂f/∂x_1(x) = 2(x_1 - x_2)
  ∂f/∂x_2(x) = 2(x_2 - x_1) + 2(x_2 - x_3)
  ∂f/∂x_3(x) = 2(x_3 - x_2) + 2(x_3 - x_4)
  ...
  ∂f/∂x_{n-1}(x) = 2(x_{n-1} - x_{n-2}) + 2(x_{n-1} - x_n)
  ∂f/∂x_n(x) = 2(x_n - x_{n-1})
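Such formulas can be sanity-checked numerically: a partial derivative is approximated by a finite difference along one coordinate. The snippet below is a minimal Matlab sketch for the first example above; the test point x and the step delta are illustrative choices, not part of the slides.

% Finite-difference check of the partial derivatives of
% f(x1,x2) = 2*(x1-1)^2 + x1*x2 + x2^2 (illustrative sketch)
f = @(x) 2*(x(1)-1)^2 + x(1)*x(2) + x(2)^2;
x = [0.5; -1.0];          % arbitrary test point
delta = 1e-6;             % finite-difference step
e1 = [1; 0]; e2 = [0; 1];
d1 = (f(x + delta*e1) - f(x)) / delta;   % should be close to 4*(x(1)-1) + x(2)
d2 = (f(x + delta*e2) - f(x)) / delta;   % should be close to x(1) + 2*x(2)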
Differential Calculus in R^n
Directional derivatives

◮ Let x, h ∈ R^n. We can look at the derivative of f at x in the direction h. It is defined as
  f'_h(x) := lim_{ε→0} (f(x + εh) - f(x)) / ε,
i.e. f'_h(x) = g'(0) where g(ε) = f(x + εh) (the restriction of f along the line passing through x with direction h).

◮ The partial derivatives are in fact the directional derivatives in the directions of the canonical basis vectors e_i = (0, ..., 1, 0, ..., 0):
  ∂f/∂x_i(x) = f'_{e_i}(x).
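A directional derivative can be approximated the same way, by differentiating g(ε) = f(x + εh) numerically at 0. This is a minimal sketch under illustrative choices (the function f, the point x and the direction h are not prescribed by the slides):

% Directional derivative as the 1D derivative of g(eps) = f(x + eps*h)
f = @(x) sum((x(2:end) - x(1:end-1)).^2);   % example function used later in the slides
x = randn(5,1); h = randn(5,1);             % arbitrary point and direction
g = @(t) f(x + t*h);
delta = 1e-6;
dfh = (g(delta) - g(0)) / delta;            % approximates f'_h(x) = g'(0)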
Differential Calculus in R^n
Differential form and Jacobian matrix

◮ The map that sends any direction h to f'_h(x) is a linear map from R^n to R. It is called the differential form of f at x, and denoted f'(x) or Df(x). Its matrix in the canonical basis is called the Jacobian matrix at x. It is a 1 × n matrix whose coefficients are simply the partial derivatives:
  Jf(x) = ( ∂f/∂x_1(x), ..., ∂f/∂x_n(x) ).

◮ Hence one gets the expression of the directional derivative in any direction h = (h_1, ..., h_n) by multiplying this Jacobian matrix with the column vector of the h_i:
  f'_h(x) = f'(x).h = Jf(x) × h = ∂f/∂x_1(x) h_1 + ... + ∂f/∂x_n(x) h_n = Σ_{i=1}^n ∂f/∂x_i(x) h_i.
Differential Calculus in R^n
Differential form and Jacobian matrix

◮ More generally, if f : R^n → R^p, f = (f_1, ..., f_p), one defines the differential of f at x, f'(x) or Df(x), as the linear map from R^n to R^p whose matrix in the canonical basis is the p × n Jacobian matrix
  Jf(x) =
  [ ∂f_1/∂x_1(x)   ...   ∂f_1/∂x_n(x) ]
  [      ...       ...        ...     ]
  [ ∂f_p/∂x_1(x)   ...   ∂f_p/∂x_n(x) ]
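In other words, the Jacobian stacks the partial derivatives of each component f_k as a row. The sketch below uses an illustrative map f : R^3 → R^2 (not from the slides) and checks one column of its Jacobian by a finite difference.

% Illustrative map f : R^3 -> R^2 and its 2 x 3 Jacobian (sketch)
f  = @(x) [x(1)*x(2); x(2) + x(3)^2];
Jf = @(x) [x(2), x(1), 0;
           0,    1,    2*x(3)];
x = [1; 2; 3];
delta = 1e-6; e1 = [1; 0; 0];
col1 = (f(x + delta*e1) - f(x)) / delta;   % should be close to the first column of Jf(x)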
Differential Calculus in R^n
Differential form and Jacobian matrix

Some rules of differentiation:

◮ Linearity: if f(x) = a u(x) + b v(x), with u and v two functions and a, b two real numbers, then
  f'(x).h = a u'(x).h + b v'(x).h.

◮ The chain rule: if f : R^n → R is the composition of two functions v : R^n → R^p and u : R^p → R, i.e. f(x) = u(v(x)), then one has
  f'(x).h = (u ∘ v)'(x).h = u'(v(x)).(v'(x).h).
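The chain rule can also be checked numerically: apply the differential of v to h first, then the differential of u at v(x). The sketch below uses illustrative functions u and v (not from the slides) and compares the chain-rule value with a finite difference.

% Chain-rule sketch: f(x) = u(v(x)) with illustrative u and v
v  = @(x) [x(1)*x(2); x(1) + x(2)^2];
Jv = @(x) [x(2), x(1);
           1,    2*x(2)];          % Jacobian of v
u  = @(y) y(1)^2 + 3*y(2);
gu = @(y) [2*y(1); 3];             % gradient of u
fc = @(x) u(v(x));
x = [1.0; 2.0]; h = [0.1; -0.2];
dfh_chain = gu(v(x))' * (Jv(x) * h);          % u'(v(x)).(v'(x).h)
delta = 1e-6;
dfh_fd = (fc(x + delta*h) - fc(x)) / delta;   % should be close to dfh_chain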
Differential Calculus in R^n
Gradient

◮ If f : R^n → R, the matrix product Jf(x) × h can also be viewed as a scalar product between the vector h and the vector of partial derivatives. We call this vector of partial derivatives the gradient of f at x, denoted ∇f(x):
  f'(x).h = Σ_{i=1}^n ∂f/∂x_i(x) h_i = ⟨∇f(x), h⟩.

◮ Hence we have three equivalent ways of computing the derivative of a function: as a directional derivative, using the differential form notation, or using the partial derivatives.
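For instance, for the 2D example of the first slide the gradient can be written explicitly and paired with any direction h. A minimal sketch (the test point and direction are illustrative):

% Gradient of f(x1,x2) = 2*(x1-1)^2 + x1*x2 + x2^2 and a directional derivative
gradf = @(x) [4*(x(1)-1) + x(2); x(1) + 2*x(2)];
x = [0.5; -1.0]; h = [1; 2];
dfh = dot(gradf(x), h);   % equals f'(x).h = Jf(x)*h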
Differential Calculus in R^n
Example

Example with f(x) = Σ_{i=1}^{n-1} (x_{i+1} - x_i)^2:

◮ Using directional derivatives: we write
  g(ε) = f(x + εh) = Σ_{i=1}^{n-1} (x_{i+1} - x_i + ε(h_{i+1} - h_i))^2
  g'(ε) = 2 Σ_{i=1}^{n-1} (x_{i+1} - x_i + ε(h_{i+1} - h_i)) (h_{i+1} - h_i)
  f'(x).h = g'(0) = 2 Σ_{i=1}^{n-1} (x_{i+1} - x_i) (h_{i+1} - h_i)
Differential Calculus in R^n
Example

◮ Using differential forms: we write
  f(x) = Σ_{i=1}^{n-1} (x_{i+1} - x_i)^2
  f'(x) = 2 Σ_{i=1}^{n-1} (x_{i+1} - x_i) (dx_{i+1} - dx_i)
where dx_i denotes the differential form of the coordinate function x ↦ x_i, which is simply dx_i.h = h_i.

◮ Applying this differential form to a vector h, we retrieve
  f'(x).h = 2 Σ_{i=1}^{n-1} (x_{i+1} - x_i) (h_{i+1} - h_i)
Differential Calculus in R^n
Example

◮ Using partial derivatives: we write
  f'(x).h = f'_h(x) = Σ_{i=1}^n ∂f/∂x_i(x) h_i
          = 2(x_1 - x_2) h_1 + (2(x_2 - x_1) + 2(x_2 - x_3)) h_2 + ... + 2(x_n - x_{n-1}) h_n
Arranging terms differently, we finally get the same formula:
  f'(x).h = 2 Σ_{i=1}^{n-1} (x_{i+1} - x_i) (h_{i+1} - h_i)

◮ This computation is less straightforward because we first identified the terms corresponding to each h_i to compute the partial derivatives, and then grouped the terms back into the original summation.
Differential Calculus in R^n
Example

Corresponding Matlab codes: these two codes compute the gradient of f (they give exactly the same result).

◮ Code following the partial-derivative calculus: we compute the partial derivative ∂f/∂x_i(x) for each i and put it in coefficient i of the gradient.

function G = gradientf(x)
    % Gradient of f(x) = sum_i (x(i+1)-x(i))^2, one partial derivative per entry
    n = length(x);
    G = zeros(n,1);
    G(1) = 2*(x(1) - x(2));
    for i = 2:n-1
        G(i) = 2*(x(i) - x(i-1)) + 2*(x(i) - x(i+1));
    end
    G(n) = 2*(x(n) - x(n-1));
end
Differential Calculus in R^n
Example

◮ Code following the differential-form calculus: we compute each coefficient appearing in the summation and incrementally fill the corresponding coefficients of the gradient.

function G = gradientf(x)
    % Same gradient, accumulated term by term from the differential form
    n = length(x);
    G = zeros(n,1);
    for i = 1:n-1
        c = 2*(x(i+1) - x(i));   % coefficient of (h(i+1) - h(i)) in f'(x).h
        G(i+1) = G(i+1) + c;
        G(i) = G(i) - c;
    end
end

◮ This second code is better because it only requires the differential form, and also because it is faster: at each step of the loop, only one coefficient 2(x_{i+1} - x_i) is computed instead of two.
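As a side note, the same accumulation can be written without an explicit loop. The following vectorized variant is a sketch (not taken from the slides) and produces the same gradient:

function G = gradientf_vec(x)
    % Vectorized variant: same gradient as above, computed without a loop
    x = x(:);                        % make sure x is a column vector
    n = length(x);
    c = 2*(x(2:n) - x(1:n-1));       % coefficients 2*(x(i+1) - x(i))
    G = zeros(n,1);
    G(2:n)   = G(2:n)   + c;
    G(1:n-1) = G(1:n-1) - c;
end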
Gradient descent
Gradient descent algorithm

◮ Let f : R^n → R be a function. The gradient of f gives the direction in which the function increases the most. Conversely, the opposite of the gradient gives the direction in which the function decreases the most.

◮ Hence the idea of gradient descent is to start from a given vector x^0 = (x^0_1, x^0_2, ..., x^0_n), move from x^0 with a small step in the direction -∇f(x^0), recompute the gradient at the new position x^1 and move again in the direction -∇f(x^1), and repeat this process a large number of times to finally reach a position at which f has a minimal value.

◮ Gradient descent algorithm: choose an initial position x^0 ∈ R^n and a stepsize η > 0, and compute iteratively the sequence
  x^{k+1} = x^k - η ∇f(x^k).

◮ The convergence of the sequence to a minimizer of the function depends on properties of the function and on the choice of η (see later).
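Written in Matlab, the whole algorithm is a few lines. The sketch below reuses the gradientf routine from the previous example; the initialization, stepsize and number of iterations are illustrative choices (a stopping criterion on the norm of the gradient could be used instead of a fixed iteration count).

% Minimal gradient descent loop (sketch)
n = 10;
x = randn(n,1);                   % initial position x^0
eta = 0.1;                        % stepsize
for k = 1:200
    x = x - eta * gradientf(x);   % x^{k+1} = x^k - eta * grad f(x^k)
end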
Gradient descent
Gradient descent algorithm [figure]
Taylor expansion
First-order Taylor expansion of a function

◮ Let f : R^n → R. The first-order Taylor expansion at a point x ∈ R^n writes
  f(x + h) = f(x) + ⟨h, ∇f(x)⟩ + o(‖h‖),
or equivalently
  f(x + h) = f(x) + Σ_{i=1}^n h_i ∂f/∂x_i(x) + o(‖h‖).

◮ This means that, locally around the point x, f is approximated by an affine map (a constant plus a linear map applied to the displacement h).
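The o(‖h‖) behaviour can be observed numerically: for a small displacement h, the gap between f(x + h) and the first-order approximation should be of the order of ‖h‖². A minimal sketch, reusing the example function and the gradientf routine above (the test point and h are illustrative):

% First-order Taylor check on f(x) = sum((x(i+1)-x(i))^2)
f = @(x) sum((x(2:end) - x(1:end-1)).^2);
x = randn(5,1);
h = 1e-3 * randn(5,1);            % small displacement
G = gradientf(x);                 % gradient routine defined earlier
gap = f(x + h) - (f(x) + G'*h);   % should be of order norm(h)^2, hence o(norm(h))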
Taylor expansion
Hessian and second-order Taylor expansion

◮ The Hessian matrix of a function f is the matrix of second-order partial derivatives:
  Hf(x) =
  [ ∂²f/∂x_1²(x)      ...   ∂²f/∂x_1∂x_n(x) ]
  [      ...          ...         ...       ]
  [ ∂²f/∂x_n∂x_1(x)   ...   ∂²f/∂x_n²(x)    ]

◮ The second-order Taylor expansion writes
  f(x + h) = f(x) + ⟨h, ∇f(x)⟩ + (1/2) h^T Hf(x) h + o(‖h‖²),
where h is taken as a column vector and h^T is its transpose (row vector).

◮ Developing this formula gives
  f(x + h) = f(x) + Σ_{i=1}^n h_i ∂f/∂x_i(x) + (1/2) Σ_{i=1}^n Σ_{j=1}^n h_i h_j ∂²f/∂x_i∂x_j(x) + o(‖h‖²).
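For the running example f(x) = Σ (x_{i+1} - x_i)², the Hessian is constant: writing f(x) = ‖Dx‖² with D the forward-difference matrix, one gets ∇f(x) = 2DᵀDx and Hf(x) = 2DᵀD. The sketch below (dimensions and test vectors are illustrative) builds this matrix and checks that the second-order expansion is exact here, since f is quadratic.

% Hessian of f(x) = sum((x(i+1)-x(i))^2) and second-order Taylor check
n = 5;
D = zeros(n-1, n);                 % forward-difference matrix: (D*x)(i) = x(i+1) - x(i)
for i = 1:n-1
    D(i,i) = -1; D(i,i+1) = 1;
end
H = 2 * (D' * D);                  % constant Hessian matrix
f = @(x) sum((x(2:end) - x(1:end-1)).^2);
x = randn(n,1); h = 1e-2 * randn(n,1);
G = 2 * (D' * (D * x));            % gradient, written as 2*D'*D*x
gap = f(x + h) - (f(x) + G'*h + 0.5 * h' * H * h);   % ~0 up to rounding: f is quadratic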
Taylor expansion
Taylor expansion [figure]
Optimality conditions
1st-order optimality condition

◮ If x is a local minimizer of f, i.e. f(x) ≤ f(y) for any y in a small neighbourhood of x, then
  ∇f(x) = 0.

◮ A point x that satisfies ∇f(x) = 0 is called a critical point. So every local minimizer is a critical point, but the converse is false.

◮ In fact we distinguish three types of critical points: local minimizers, local maximizers, and saddle points (saddle points are simply critical points that are neither local minimizers nor local maximizers).

◮ Generally, the analysis of the Hessian matrix allows one to distinguish between these three types (see next slide).
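The standard second-order criterion alluded to here looks at the eigenvalues of the Hessian at the critical point: all positive implies a local minimizer, all negative a local maximizer, eigenvalues of both signs a saddle point, and zero eigenvalues leave the test inconclusive. A minimal sketch, assuming H holds the Hessian at a critical point (for instance the matrix built in the previous sketch):

% Classify a critical point from the eigenvalues of its Hessian H (sketch)
lambda = eig(H);
if all(lambda > 0)
    disp('local minimizer');
elseif all(lambda < 0)
    disp('local maximizer');
elseif any(lambda > 0) && any(lambda < 0)
    disp('saddle point');
else
    disp('inconclusive (some zero eigenvalues)');
end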