CS 6316 Machine Learning
Gradient Descent
Yangfeng Ji
Department of Computer Science, University of Virginia
Overview
1. Gradient Descent
2. Stochastic Gradient Descent
3. SGD with Momentum
4. Adaptive Learning Rates
Gradient Descent
Learning as Optimization

As discussed before, learning can be viewed as an optimization problem.
◮ Training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$
◮ Empirical risk
$$L(h_\theta, S) = \frac{1}{m} \sum_{i=1}^{m} R(h_\theta(x_i), y_i) \qquad (1)$$
where $R$ is the risk function
◮ Learning: minimize the empirical risk
$$\theta \leftarrow \operatorname*{argmin}_{\theta'} L(h_{\theta'}, S) \qquad (2)$$
Learning as Optimization (II)

Some examples of risk functions:
◮ Logistic regression
$$R(h_\theta(x_i), y_i) = -\log p(y_i \mid x_i; \theta) \qquad (3)$$
◮ Linear regression
$$R(h_\theta(x_i), y_i) = \| h_\theta(x_i) - y_i \|_2^2 \qquad (4)$$
◮ Neural network
$$R(h_\theta(x_i), y_i) = \text{Cross-entropy}(h_\theta(x_i), y_i) \qquad (5)$$
◮ Perceptron and AdaBoost can also be viewed as minimizing certain loss functions
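To make the definitions concrete, here is a minimal sketch (not from the slides) of two of these per-example risk functions and how they plug into the empirical risk of Equation (1); the prediction and label encodings are illustrative assumptions:

```python
import numpy as np

def logistic_risk(p_pred, y):
    """Negative log-likelihood -log p(y | x; theta), where p_pred is the
    predicted probability of the positive class and y is in {0, 1}."""
    return -np.log(p_pred if y == 1 else 1.0 - p_pred)

def squared_risk(h_x, y):
    """Squared error ||h_theta(x) - y||_2^2 for linear regression."""
    return np.sum((np.asarray(h_x) - np.asarray(y)) ** 2)

def empirical_risk(risk_fn, predictions, labels):
    """Average per-example risk over the training set, as in Equation (1)."""
    m = len(labels)
    return sum(risk_fn(predictions[i], labels[i]) for i in range(m)) / m

# Example: empirical risk of a classifier that predicts probability 0.8
# for every example (hypothetical values).
print(empirical_risk(logistic_risk, [0.8, 0.8, 0.8], [1, 0, 1]))
```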
Constrained Optimization

The dual optimization problem for SVMs in the separable case is
$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \qquad (6)$$
$$\text{s.t.} \quad \alpha_i \geq 0 \quad \forall i \in [m] \qquad (7)$$
$$\sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (8)$$

◮ The Lagrange multiplier $\alpha$ is also called the dual variable
◮ This is an optimization problem over $\alpha$ only
◮ The dual problem is defined via the inner products $\langle x_i, x_j \rangle$
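For concreteness, a constrained problem like this can be handed to an off-the-shelf solver. A minimal sketch, assuming the cvxpy package and a small hypothetical separable dataset (neither is part of the slides); it rewrites the quadratic term as a sum of squares so the solver accepts it directly:

```python
import numpy as np
import cvxpy as cp

# Toy separable data (hypothetical values, for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# With Z = diag(y) X, the quadratic term in (6) equals ||Z^T alpha||_2^2,
# since sum_ij alpha_i alpha_j y_i y_j <x_i, x_j> = alpha^T Z Z^T alpha.
Z = y[:, None] * X

alpha = cp.Variable(m)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Z.T @ alpha))
constraints = [alpha >= 0, y @ alpha == 0]  # constraints (7) and (8)
cp.Problem(objective, constraints).solve()

print(np.round(alpha.value, 4))  # nonzero entries correspond to support vectors
```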
Optimization via Gradient Descent

The basic form of an optimization problem is
$$\min f(\theta) \quad \text{s.t.} \;\; \theta \in B \qquad (9)$$
where $f : \mathbb{R}^d \to \mathbb{R}$ is the objective function and $B \subseteq \mathbb{R}^d$ is the constraint set for $\theta$, which usually can be formulated as a set of inequalities (e.g., SVM).

In this lecture, we
◮ focus only on unconstrained optimization problems, i.e., $\theta \in \mathbb{R}^d$
◮ assume $f$ is convex and differentiable
Review: Gradient of a 1-D Function

Consider the gradient of this 1-dimensional function:
$$y = f(x) = x^2 - x - 2 \qquad (10)$$
Its gradient (here, just the derivative) is $f'(x) = 2x - 1$, which is negative to the left of the minimizer $x = 1/2$ and positive to the right.
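A minimal sketch (not from the slides) of gradient descent on this function; the initial point and learning rate are hypothetical choices:

```python
def f(x):
    return x**2 - x - 2

def grad_f(x):
    return 2 * x - 1  # derivative of f

x = 3.0      # initial point
eta = 0.1    # fixed learning rate
for t in range(100):
    x = x - eta * grad_f(x)
print(x, f(x))  # x approaches 0.5, where f attains its minimum -2.25
```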
Review: Gradient of a 2-D Function

Now consider a 2-dimensional function with $x = (x_1, x_2)$:
$$y = f(x) = x_1^2 + 10 x_2^2 \qquad (11)$$
Its gradient is $\nabla f(x) = (2x_1, 20x_2)$.
[Figure: contour plot of this function.]
We are going to use this as our running example.
Gradient Descent

To learn the parameter $\theta$, the learning algorithm updates it iteratively using the following steps:
1. Choose an initial point $\theta^{(0)} \in \mathbb{R}^d$
2. Update
$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta_t \cdot \nabla f(\theta) \big|_{\theta = \theta^{(t)}} \qquad (12)$$
where $\eta_t$ is the learning rate at time $t$
3. Go back to step 2 until convergence

$\nabla f(\theta)$ is defined as
$$\nabla f(\theta) = \Big( \frac{\partial f(\theta)}{\partial \theta_1}, \cdots, \frac{\partial f(\theta)}{\partial \theta_d} \Big) \qquad (13)$$
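A minimal sketch (not from the slides) of this procedure on the running example $f(x) = x_1^2 + 10x_2^2$; the initial point, learning rate, and stopping tolerance are hypothetical choices:

```python
import numpy as np

def f(theta):
    return theta[0]**2 + 10 * theta[1]**2

def grad_f(theta):
    return np.array([2 * theta[0], 20 * theta[1]])

theta = np.array([2.0, 1.0])  # step 1: choose an initial point
eta = 0.05                    # fixed learning rate
for t in range(200):          # steps 2-3: update until convergence
    g = grad_f(theta)
    if np.linalg.norm(g) < 1e-8:  # simple convergence test
        break
    theta = theta - eta * g
print(theta)  # approaches the minimizer (0, 0)
```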
Gradient Descent Interpretation

An intuitive justification of the gradient descent algorithm is to consider the following plot:
[Figure omitted.]
The direction of the gradient is the direction in which the function has the “fastest increase”.
Gradient Descent Interpretation (II)

Theoretical justification:
◮ First-order Taylor approximation
$$f(\theta + \Delta\theta) \approx f(\theta) + \langle \Delta\theta, \nabla f|_{\theta} \rangle \qquad (14)$$
◮ In gradient descent, $\Delta\theta = -\eta \nabla f|_{\theta}$
◮ Therefore, we have
$$f(\theta + \Delta\theta) \approx f(\theta) + \langle \Delta\theta, \nabla f|_{\theta} \rangle = f(\theta) - \eta \langle \nabla f|_{\theta}, \nabla f|_{\theta} \rangle = f(\theta) - \eta \, \| \nabla f|_{\theta} \|_2^2 \leq f(\theta) \qquad (15)$$
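As a quick numerical sanity check (not part of the slides), for a small step the actual decrease should nearly match the $\eta \| \nabla f \|_2^2$ predicted by Equation (15); the evaluation point and step size below are hypothetical:

```python
import numpy as np

f = lambda th: th[0]**2 + 10 * th[1]**2
grad_f = lambda th: np.array([2 * th[0], 20 * th[1]])

theta = np.array([2.0, 1.0])
eta = 1e-4                      # small enough for the approximation to be tight
g = grad_f(theta)

actual = f(theta) - f(theta - eta * g)
predicted = eta * np.dot(g, g)  # eta * ||grad f||^2, from Equation (15)
print(actual, predicted)        # the two values nearly agree
```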
Gradient Descent Interpretation (III)

Consider the second-order Taylor approximation of $f$:
$$f(\theta') \approx f(\theta) + \nabla f(\theta)^T (\theta' - \theta) + \frac{1}{2} (\theta' - \theta)^T \nabla^2 f(\theta) (\theta' - \theta)$$
◮ Replacing the Hessian $\nabla^2 f(\theta)$ with $\frac{1}{\eta} I$ gives the quadratic approximation
$$f(\theta') \approx f(\theta) + \nabla f(\theta)^T (\theta' - \theta) + \frac{1}{2\eta} (\theta' - \theta)^T (\theta' - \theta)$$
◮ Minimize $f(\theta')$ w.r.t. $\theta'$:
$$\frac{\partial f(\theta')}{\partial \theta'} \approx \nabla f(\theta) + \frac{1}{\eta} (\theta' - \theta) = 0 \;\Rightarrow\; \theta' = \theta - \eta \cdot \nabla f(\theta) \qquad (16)$$
◮ Gradient descent chooses the next point $\theta'$ to minimize this quadratic approximation
Step Size

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta_t \cdot \frac{\partial f(\theta)}{\partial \theta} \Big|_{\theta = \theta^{(t)}} \qquad (17)$$

If we choose a fixed step size $\eta_t = \eta_0$, consider the following function:
$$f(\theta) = (10\theta_1^2 + \theta_2^2)/2$$
[Figure: gradient descent trajectories with a step size that is (a) too small, (b) too large, and (c) just right.]
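A minimal sketch (not from the slides) reproducing the three regimes in code. For this function $\nabla f(\theta) = (10\theta_1, \theta_2)$, so each update multiplies $\theta_1$ by $(1 - 10\eta)$, and any $\eta > 0.2$ diverges along that coordinate:

```python
import numpy as np

def grad_f(theta):
    # Gradient of f(theta) = (10 * theta_1^2 + theta_2^2) / 2
    return np.array([10 * theta[0], theta[1]])

def run_gd(eta, steps=50):
    theta = np.array([1.0, 1.0])  # hypothetical starting point
    for _ in range(steps):
        theta = theta - eta * grad_f(theta)
    return theta

print(run_gd(0.01))  # too small: theta_2 is still far from 0 after 50 steps
print(run_gd(0.25))  # too large: theta_1 oscillates with growing magnitude
print(run_gd(0.1))   # just right: close to the minimizer (0, 0)
```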
Optimal Step Sizes

◮ Exact line search: solve the one-dimensional subproblem
$$\eta_t \leftarrow \operatorname*{argmin}_{s \geq 0} f(\theta - s \nabla f(\theta)) \qquad (18)$$
◮ Backtracking line search: with parameters $0 < \beta < 1$, $0 < \alpha \leq 1/2$, and a large initial value of $\eta_t$, while
$$f(\theta - \eta_t \nabla f(\theta)) > f(\theta) - \alpha \eta_t \, \| \nabla f(\theta) \|_2^2 \qquad (19)$$
shrink $\eta_t \leftarrow \beta \eta_t$
◮ Usually, this is not worth the effort, since the computational cost may be too high (e.g., when $f$ is a neural network)
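A minimal sketch (not from the slides) of one gradient step with backtracking line search; the values of $\alpha$, $\beta$, and the initial step size are hypothetical defaults:

```python
import numpy as np

def backtracking_step(f, grad_f, theta, alpha=0.3, beta=0.8, eta0=1.0):
    """One gradient step, shrinking eta per Equation (19) until the step
    achieves at least an alpha-fraction of the first-order predicted decrease."""
    g = grad_f(theta)
    eta = eta0
    while f(theta - eta * g) > f(theta) - alpha * eta * np.dot(g, g):
        eta = beta * eta  # shrink and test again
    return theta - eta * g

# Usage on the example f(theta) = (10 * theta_1^2 + theta_2^2) / 2
f = lambda th: (10 * th[0]**2 + th[1]**2) / 2
grad_f = lambda th: np.array([10 * th[0], th[1]])

theta = np.array([1.0, 1.0])
for _ in range(20):
    theta = backtracking_step(f, grad_f, theta)
print(theta)  # approaches (0, 0) without hand-tuning the step size
```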
Convergence Analysis

◮ Assume $f$ is convex and differentiable, and additionally its gradient is Lipschitz continuous:
$$\| \nabla f(\theta) - \nabla f(\theta') \|_2 \leq L \cdot \| \theta - \theta' \|_2 \qquad (20)$$
for any $\theta, \theta' \in \mathbb{R}^d$, where $L$ is a fixed positive value
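For the step-size example $f(\theta) = (10\theta_1^2 + \theta_2^2)/2$, the gradient $\nabla f(\theta) = (10\theta_1, \theta_2)$ satisfies condition (20) with $L = 10$. A standard consequence (stated here without proof, not derived on these slides) is that the fixed step size $\eta = 1/L$ guarantees convergence for such functions; a quick check:

```python
import numpy as np

grad_f = lambda th: np.array([10 * th[0], th[1]])

L = 10.0       # Lipschitz constant of grad f for this example
eta = 1.0 / L  # the classical "safe" fixed step size

theta = np.array([1.0, 1.0])
for t in range(100):
    theta = theta - eta * grad_f(theta)
print(theta)   # converges to the minimizer (0, 0)
```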