Pattern Recognition
Introduction to Gradient Descent

Ad Feelders
Universiteit Utrecht
Optimization (single variable)

Suppose we want to find the value of x for which the function y = f(x) is minimized (or maximized). From calculus we know that a necessary condition for a minimum is:

    df/dx = 0    (1)

This condition is not sufficient, since maxima and points of inflection also satisfy equation (1). Together with the second-order condition

    d²f/dx² > 0    (2)

we have a sufficient condition for a local minimum.
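As a small illustration (not part of the original slides), the sketch below checks conditions (1) and (2) symbolically for one chosen function; the example function f(x) = (x − 2)² + 1 and the names in the code are assumptions made for this sketch only.

    # Sketch: verify the first- and second-order conditions for a local minimum
    # of the example function f(x) = (x - 2)**2 + 1 (chosen for illustration).
    import sympy as sp

    x = sp.symbols('x')
    f = (x - 2)**2 + 1

    df = sp.diff(f, x)                       # df/dx
    d2f = sp.diff(f, x, 2)                   # d^2 f / dx^2

    candidates = sp.solve(sp.Eq(df, 0), x)   # points satisfying condition (1)
    for c in candidates:
        if d2f.subs(x, c) > 0:               # condition (2): local minimum
            print(f"local minimum at x = {c}, f(x) = {f.subs(x, c)}")

Here the stationary point x = 2 satisfies both conditions, so it is a local (in fact global) minimum.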
Optimization (single variable)

The equation df/dx = 0 may not have a closed-form solution, however. In such cases we have to resort to iterative numerical procedures such as gradient descent.
Optimization (single variable)

[Figure: the graph of f(x), with the tangent line at x = x^*, whose slope is df/dx(x = x^*).]

The derivative at x = x^* is positive, so to increase the function value we should increase the value of x, i.e. make a step in the direction of the derivative. To decrease the function value, we should make a step in the opposite direction.
Optimization (single variable)

Also, the tangent line to the graph at x = x^* is a local linear approximation to f:

    ∆f ≈ df/dx(x = x^*) ∆x

The closer we are to x^*, the better the approximation.
Gradient Descent Algorithm (single variable)

The basic gradient-descent algorithm is:

1. Set i ← 0, and choose an initial value x^(0).
2. Determine the derivative df/dx(x = x^(i)) of f(x) at x^(i) and update
       x^(i+1) ← x^(i) − η df/dx(x = x^(i))
   Set i ← i + 1.
3. Repeat the previous step until df/dx = 0 and check whether a (local) minimum has been reached.

η > 0 is the step size (or learning rate).
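A minimal sketch of this loop in Python (my own illustration; the function names, tolerance and default step size are assumptions, not part of the slides):

    # Single-variable gradient descent: repeat x <- x - eta * df/dx(x).
    def gradient_descent_1d(df, x0, eta=0.1, tol=1e-8, max_iter=10_000):
        x = x0
        for _ in range(max_iter):
            d = df(x)                 # df/dx at the current x
            if abs(d) < tol:          # derivative ~ 0: candidate (local) minimum
                break
            x = x - eta * d           # x(i+1) <- x(i) - eta * df/dx(x(i))
        return x

    # Example: f(x) = (x - 2)**2 + 1, so df/dx = 2(x - 2); the minimum is at x = 2.
    print(gradient_descent_1d(lambda x: 2 * (x - 2), x0=0.0))   # approximately 2.0

In practice the derivative rarely becomes exactly zero, so the sketch stops once it is smaller than a tolerance.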
Optimization (multiple variables)

Suppose we want to find the values of x_1, ..., x_m for which the function y = f(x_1, ..., x_m) is minimized (or maximized). Analogous to the single-variable case, a necessary condition for a minimum is:

    ∂f/∂x_j = 0,   j = 1, ..., m    (3)

Again this condition is not sufficient, since maxima and saddle points also satisfy (3). For the second-order condition, define the Hessian matrix H, with

    H_ij = ∂²f/∂x_i ∂x_j

Together with the second-order condition that H is positive definite, i.e.

    z⊤Hz > 0  for all z ≠ 0,    (4)

we have a sufficient condition for a local minimum.
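In code, condition (4) can be checked for a concrete Hessian by testing whether all of its eigenvalues are positive, which is equivalent to positive definiteness for a symmetric matrix. A small sketch, using a 2 × 2 matrix chosen for illustration:

    # Check positive definiteness of a symmetric matrix via its eigenvalues.
    import numpy as np

    H = np.array([[10.0, 20.0],
                  [20.0, 60.0]])           # example Hessian (2 x 2, symmetric)

    eigenvalues = np.linalg.eigvalsh(H)    # eigenvalues of a symmetric matrix
    print(eigenvalues)                     # both positive here
    print(bool(np.all(eigenvalues > 0)))   # True -> H is positive definite

(This particular matrix happens to be the Hessian of the regression example that appears later in these slides.)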
Linear Functions

Consider a linear function

    f(x) = a + ∑_{i=1}^m b_i x_i = a + b⊤x

The contour lines of f are given by f(x) = a + b⊤x = c, for different values of the constant c. For linear functions the contours are parallel straight lines.
The Gradient

The gradient of f(x_1, x_2, ..., x_m) is the vector of partial derivatives

    ∇f = (∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_m)⊤
Gradient of a Linear Function

The gradient of a linear function f(x) = a + b⊤x is given by

    ∇f = b

Furthermore, for linear functions we have:

    ∆f = b⊤∆x = ∇f⊤∆x

In which direction should we move to maximize ∆f?
The direction of steepest ascent (descent) is perpendicular to the contour line

The direction of steepest ascent (descent) is an increasing (decreasing) direction perpendicular to the contour line. The direction of steepest ascent (descent) from the point x^* is found where the contour line is tangent to a circle of radius one around x^*.
The gradient is also perpendicular to the contour line

Consider two points x_A and x_B, both of which lie on the same contour line. Because f(x_A) = f(x_B) = c, we have f(x_A) − f(x_B) = 0. Therefore

    (a + b⊤x_A) − (a + b⊤x_B) = b⊤(x_A − x_B) = 0

and so the gradient is perpendicular to the contour line, because

1. the vector x_A − x_B runs parallel to the contour line, and
2. vectors are perpendicular if their dot product is zero.
The gradient is also perpendicular to the contour line

[Figure illustrating the gradient drawn perpendicular to the contour lines.]
The gradient is perpendicular to the contour line

For linear functions the direction of steepest increase is perpendicular to the contour line, as is the gradient. From

    ∆f = b⊤∆x = ∇f⊤∆x

we conclude that the gradient points in an increasing direction, since filling in ∇f for ∆x gives

    ∆f = ∇f⊤∇f = ‖∇f‖²

which is positive whenever ∇f ≠ 0. Therefore:

1. The gradient points in the direction of fastest increase of f.
2. Minus the gradient points in the direction of fastest decrease of f.
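A small numeric illustration of this claim (my own sketch, not from the slides): among all unit-length steps ∆x, the step in the direction of the gradient gives the largest increase of a linear function.

    # Compare the increase b^T dx over many unit-length directions dx.
    import numpy as np

    b = np.array([3.0, 4.0])            # f(x) = a + b^T x, so grad f = b

    angles = np.linspace(0.0, 2 * np.pi, 360, endpoint=False)
    directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # unit vectors

    increases = directions @ b          # delta f = grad f^T dx for each unit step
    best = directions[np.argmax(increases)]

    print(best)                         # approximately b / ||b|| = [0.6, 0.8]
    print(b / np.linalg.norm(b))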
Linear Approximation

This reasoning works for arbitrary functions by considering a local linear approximation to the function at x^* by the tangent plane:

    y − y^* = ∂f/∂x_1(x^*)(x_1 − x_1^*) + ∂f/∂x_2(x^*)(x_2 − x_2^*),

and using the linear approximation

    ∆f ≈ ∂f/∂x_1(x^*) ∆x_1 + ∂f/∂x_2(x^*) ∆x_2 = ∇f(x^*)⊤∆x.

Here ∂f/∂x_1(x^*) and ∂f/∂x_2(x^*) are the slopes of the tangent lines in the direction of x_1 resp. x_2 at the point x = x^*.
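A quick numeric check of this approximation (an illustration with a function of my own choosing, not from the slides):

    # For small steps dx, delta f is close to grad f(x*)^T dx.
    import numpy as np

    def f(x):
        return x[0]**2 + 3.0 * x[1]**2              # example function

    def grad_f(x):
        return np.array([2.0 * x[0], 6.0 * x[1]])   # its gradient (by hand)

    x_star = np.array([1.0, 2.0])
    dx = np.array([0.01, -0.02])

    exact = f(x_star + dx) - f(x_star)              # true delta f
    approx = grad_f(x_star) @ dx                    # linear approximation
    print(exact, approx)                            # the two values are close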
Local Linear Approximation by Tangent Plane

[Figure: the surface of f with its tangent plane at x^*; the white dot represents the point (x^*, f(x^*)).]
Gradient Descent Algorithm (multivariable)

The basic gradient-descent algorithm is:

1. Set i ← 0, and choose an initial value x^(0).
2. Determine the gradient ∇f(x^(i)) of f(x) at x^(i) and update
       x^(i+1) ← x^(i) − η ∇f(x^(i))
   Set i ← i + 1.
3. Repeat the previous step until ∇f(x^(i)) = 0 and check whether a (local) minimum has been reached.

η > 0 is the step size (or learning rate).
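A minimal sketch of this algorithm in Python (my own illustration; the gradient is supplied as a function, and the tolerance and iteration limit are assumptions):

    # Multivariable gradient descent: repeat x <- x - eta * grad_f(x).
    import numpy as np

    def gradient_descent(grad_f, x0, eta=0.01, tol=1e-8, max_iter=100_000):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)                   # gradient at the current point
            if np.linalg.norm(g) < tol:     # gradient ~ 0: candidate (local) minimum
                break
            x = x - eta * g                 # x(i+1) <- x(i) - eta * grad f(x(i))
        return x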
Example of gradient descent for linear regression

Note: w_0 and w_1 are the variables here!

    n   x   t   y = w_0 + w_1 x    e = t − y
    1   0   1   w_0                1 − w_0
    2   1   3   w_0 + w_1          3 − w_0 − w_1
    3   2   4   w_0 + 2w_1         4 − w_0 − 2w_1
    4   3   3   w_0 + 3w_1         3 − w_0 − 3w_1
    5   4   5   w_0 + 4w_1         5 − w_0 − 4w_1

    SSE(w_0, w_1) = (1 − w_0)² + (3 − w_0 − w_1)² + (4 − w_0 − 2w_1)²
                    + (3 − w_0 − 3w_1)² + (5 − w_0 − 4w_1)²
Example of gradient descent

    ∂SSE/∂w_0 = [2(1 − w_0)(−1)] + [2(3 − w_0 − w_1)(−1)] + [2(4 − w_0 − 2w_1)(−1)]
                + [2(3 − w_0 − 3w_1)(−1)] + [2(5 − w_0 − 4w_1)(−1)]
              = −32 + 10w_0 + 20w_1

    ∂SSE/∂w_1 = 0 + [2(3 − w_0 − w_1)(−1)] + [2(4 − w_0 − 2w_1)(−2)]
                + [2(3 − w_0 − 3w_1)(−3)] + [2(5 − w_0 − 4w_1)(−4)]
              = −80 + 20w_0 + 60w_1
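These two partial derivatives can be double-checked with symbolic differentiation; a small verification sketch (my own, not part of the slides):

    # Verify the partial derivatives of SSE(w0, w1) for the five data points.
    import sympy as sp

    w0, w1 = sp.symbols('w0 w1')
    xs = [0, 1, 2, 3, 4]
    ts = [1, 3, 4, 3, 5]

    SSE = sum((t - w0 - w1 * x)**2 for x, t in zip(xs, ts))

    print(sp.expand(sp.diff(SSE, w0)))   # 10*w0 + 20*w1 - 32
    print(sp.expand(sp.diff(SSE, w1)))   # 20*w0 + 60*w1 - 80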
Example of gradient descent

So the gradient is:

    ∇SSE = (∂SSE/∂w_0, ∂SSE/∂w_1)⊤ = (−32 + 10w_0 + 20w_1, −80 + 20w_0 + 60w_1)⊤

Let w^(0) = (0, 0). Then the gradient evaluated in the point w^(0) is:

    ∇SSE(w^(0)) = (−32 + 10 × 0 + 20 × 0, −80 + 20 × 0 + 60 × 0)⊤ = (−32, −80)⊤
Example of gradient descent

Let η = 1/50. Then we get the following update:

    w_0^(1) = w_0^(0) − η ∂SSE/∂w_0 = 0 − (1/50) × (−32) = 0.64
    w_1^(1) = w_1^(0) − η ∂SSE/∂w_1 = 0 − (1/50) × (−80) = 1.6

Or both at once:

    w^(1) = w^(0) − η ∇SSE(w^(0)) = (0, 0)⊤ − (1/50)(−32, −80)⊤ = (0.64, 1.6)⊤
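The whole example can also be run numerically; the self-contained sketch below (my own illustration) reproduces the first step and then keeps applying the same update until the estimates stop changing noticeably.

    # Gradient descent on SSE(w0, w1) with eta = 0.02, starting from w = (0, 0).
    import numpy as np

    def grad_sse(w):
        w0, w1 = w
        return np.array([-32 + 10 * w0 + 20 * w1,
                         -80 + 20 * w0 + 60 * w1])

    eta = 0.02
    w = np.array([0.0, 0.0])

    w = w - eta * grad_sse(w)
    print(w)                          # first step: [0.64, 1.6], as computed above

    for _ in range(5000):             # many further steps of the same update
        w = w - eta * grad_sse(w)
    print(w)                          # approximately [1.6, 0.8], where the gradient is zero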
Gradient Descent with step size η = 0.02

[Figure: the sequence of gradient descent iterates plotted over the contour lines of SSE(b0, b1), with b0 on the horizontal axis and b1 on the vertical axis, both ranging from 0 to 2.]