Machine Learning, Lecture 5
Chenhao Tan, University of Colorado Boulder
Slides adapted from Jordan Boyd-Graber, Tom Mitchell, and Ziv Bar-Joseph
Quiz question
For a test instance (x, y) and a naïve Bayes classifier P̂, which of the following statements is true?
• (A) Σ_c P̂(y = c | x) = 1
• (B) Σ_c P̂(x | c) = 1
• (C) Σ_c P̂(x | c) P̂(c) = 1
Overview
• Objective function
• Gradient Descent
• Stochastic Gradient Descent
Reminder: Logistic Regression

P(Y = 0 | X) = 1 / (1 + exp[β_0 + Σ_i β_i X_i])                              (1)
P(Y = 1 | X) = exp[β_0 + Σ_i β_i X_i] / (1 + exp[β_0 + Σ_i β_i X_i])         (2)

• Discriminative prediction: P(y | x)
• Classification uses: sentiment analysis, spam detection
• What we didn't talk about is how to learn β from data
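To make Eqs. (1)-(2) concrete, here is a minimal sketch (not part of the original slides) of the prediction step in Python; the weights and feature values are made-up numbers purely for illustration.

```python
import numpy as np

def predict_proba(beta0, beta, x):
    """Return (P(Y=0|x), P(Y=1|x)) for one instance, following Eqs. (1)-(2)."""
    z = beta0 + np.dot(beta, x)          # linear score beta_0 + sum_i beta_i x_i
    p1 = np.exp(z) / (1.0 + np.exp(z))   # Eq. (2); for large |z| a numerically
                                         # safer form is the sigmoid 1/(1+exp(-z))
    return 1.0 - p1, p1

# Toy example (made-up numbers): two features
beta0, beta = -1.0, np.array([2.0, -0.5])
x = np.array([1.0, 3.0])
print(predict_proba(beta0, beta, x))
```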
Outline (current section: Objective function)
• Objective function
• Gradient Descent
• Stochastic Gradient Descent
Logistic Regression: Objective Function

Maximize the log likelihood:

Obj ≡ log P(Y | X, β) = Σ_j log P(y^(j) | x^(j), β)
    = Σ_j [ y^(j) (β_0 + Σ_i β_i x_i^(j)) − log(1 + exp(β_0 + Σ_i β_i x_i^(j))) ]
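One way to see how the second line follows from Eqs. (1)-(2) (not spelled out on the slide), assuming y^(j) ∈ {0, 1} and writing z^(j) for the linear score:

```latex
% Per-example term of the log likelihood, with z^{(j)} = \beta_0 + \sum_i \beta_i x_i^{(j)}.
\log P(y^{(j)} \mid x^{(j)}, \beta)
  = y^{(j)} \log P(Y{=}1 \mid x^{(j)}) + (1 - y^{(j)}) \log P(Y{=}0 \mid x^{(j)})
  = y^{(j)} z^{(j)} - \log\!\left(1 + \exp z^{(j)}\right).
```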
Logistic Regression: Objective Function

Minimize the negative log likelihood (loss):

L ≡ −log P(Y | X, β) = −Σ_j log P(y^(j) | x^(j), β)
  = Σ_j [ −y^(j) (β_0 + Σ_i β_i x_i^(j)) + log(1 + exp(β_0 + Σ_i β_i x_i^(j))) ]

The training data {(x, y)} are fixed. The objective is a function of β . . . what values of β give a good value?

β* = arg min_β L(β)
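A minimal sketch of the loss as a function of β, assuming a design matrix X with one row per example; the data below are made-up toy numbers.

```python
import numpy as np

def neg_log_likelihood(beta0, beta, X, y):
    """Loss from the slide: sum_j [ -y_j * z_j + log(1 + exp(z_j)) ],
    where z_j = beta_0 + beta . x^(j)."""
    z = beta0 + X @ beta                       # linear scores for all examples
    return np.sum(-y * z + np.log1p(np.exp(z)))

# Toy data (made-up): 3 examples, 2 features
X = np.array([[1.0, 0.0], [0.5, 2.0], [3.0, 1.0]])
y = np.array([1, 0, 1])
print(neg_log_likelihood(-0.5, np.array([1.0, -1.0]), X, y))
```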
Convexity

L(β) is convex for logistic regression. Proof sketch:
• The logistic loss −yv + log(1 + exp(v)) is convex in v.
• Composition with a linear function (v = β_0 + Σ_i β_i x_i) preserves convexity.
• A sum of convex functions is convex.
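A quick check of the first bullet (not on the original slide): the second derivative of the per-example logistic loss in the score v is nonnegative, which gives convexity in v.

```latex
% f(v) = -y v + \log(1 + e^v), with \sigma(v) = e^v / (1 + e^v) the logistic function.
f'(v)  = -y + \sigma(v), \qquad
f''(v) = \sigma(v)\,\bigl(1 - \sigma(v)\bigr) \ge 0.
```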
Outline (current section: Gradient Descent)
• Objective function
• Gradient Descent
• Stochastic Gradient Descent
Convexity
• Convex function
• It doesn't matter where you start, as long as you go down along the gradient.
• Gradient!
Convexity
• It would have been much harder if the objective were not convex.
Gradient Descent (non-convex)

Goal: Optimize the loss function with respect to the variables β.

[Figure: a non-convex objective plotted against a parameter; successive iterates 0, 1, 2, 3 move downhill from the starting point, with the unexplored region labeled "Undiscovered Country".]
Gradient Descent (non-convex)

Goal: Optimize the loss function with respect to the variables β.

Update rule:  β_j^(l+1) = β_j^(l) − η ∂L/∂β_j

Luckily, (vanilla) logistic regression is convex.
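A minimal sketch of the update rule as a loop, assuming a caller-supplied gradient function; the step size, iteration count, and the toy quadratic objective are arbitrary illustrative choices.

```python
import numpy as np

def gradient_descent(grad_fn, beta_init, eta=0.1, num_iters=100):
    """Repeatedly apply beta^(l+1) = beta^(l) - eta * dL/dbeta from the slide.
    grad_fn(beta) must return the gradient of the loss at beta."""
    beta = beta_init.copy()
    for _ in range(num_iters):
        beta = beta - eta * grad_fn(beta)
    return beta

# Toy example (made-up): minimize L(beta) = ||beta - 3||^2, whose gradient is 2*(beta - 3)
print(gradient_descent(lambda b: 2 * (b - 3.0), np.zeros(2)))
```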
Gradient for Logistic Regression

To ease notation, let's define

π_i = exp(β^T x_i) / (1 + exp(β^T x_i))                                      (3)

Our objective function is

L = −Σ_i log p(y_i | x_i) = Σ_i L_i,  where
L_i = −log π_i        if y_i = 1
L_i = −log(1 − π_i)   if y_i = 0                                             (4)
Taking the Derivative

Apply the chain rule:

∂L_i(β)/∂β_j = −(1/π_i) ∂π_i/∂β_j            if y_i = 1
∂L_i(β)/∂β_j = (1/(1 − π_i)) ∂π_i/∂β_j       if y_i = 0                     (5)

If we plug in the derivative (here x_j is the j-th feature of x_i),

∂π_i/∂β_j = π_i (1 − π_i) x_j,                                               (6)

we can merge the two cases:

∂L_i/∂β_j = −(y_i − π_i) x_j.                                                (7)
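Summing Eq. (7) over examples gives the full gradient −Σ_i (y_i − π_i) x_i, which in matrix form is −Xᵀ(y − π). A minimal sketch, assuming X carries a leading column of ones for the bias term; the toy data are made up.

```python
import numpy as np

def logistic_gradient(beta, X, y):
    """Gradient of the loss from Eq. (7): dL/dbeta_j = -sum_i (y_i - pi_i) x_ij."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))   # pi_i from Eq. (3)
    return -X.T @ (y - pi)                   # one entry per beta_j

# Toy data (made-up): 3 examples, bias column + 1 feature
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1, 0, 1])
print(logistic_gradient(np.zeros(2), X, y))
```

Plugged into the gradient_descent sketch above, gradient_descent(lambda b: logistic_gradient(b, X, y), np.zeros(2)) fits β on this toy data.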
Gradient for Logistic Regression

Gradient:

∇_β L(β) = [ ∂L(β)/∂β_0, . . . , ∂L(β)/∂β_n ]                                (8)

Update:

Δβ ≡ η ∇_β L(β)                                                              (9)
β_i′ ← β_i − η ∂L(β)/∂β_i                                                    (10)

Why are we subtracting? What would we do if we wanted to do ascent?

η: step size, must be greater than zero.
Choosing Step Size

[Figure: the same objective plotted against a parameter, showing gradient steps taken with different step sizes.]
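A tiny experiment (made-up quadratic objective, arbitrary step sizes) illustrating the trade-off the figure shows: a step size that is too small makes slow progress, while one that is too large oscillates or diverges.

```python
def run(eta, steps=20):
    """Minimize the toy objective L(b) = b^2 (gradient 2b) starting from b = 5."""
    b = 5.0
    for _ in range(steps):
        b = b - eta * 2 * b
    return b

for eta in [0.01, 0.1, 0.9, 1.1]:   # from overly cautious to divergent
    print(eta, run(eta))
```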
Remaining issues
• When to stop?
• What if β keeps getting bigger?
Regularized Conditional Log Likelihood

Unregularized:
β* = arg min_β [ −Σ_j ln p(y^(j) | x^(j), β) ]                               (11)

Regularized:
β* = arg min_β [ −Σ_j ln p(y^(j) | x^(j), β) + (1/(2µ)) Σ_i β_i² ]           (12)

µ is a "regularization" parameter that trades off between fitting the likelihood and keeping the parameters small.
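A minimal sketch of the regularized loss in Eq. (12) and its gradient, assuming the penalty (1/(2µ)) Σ_i β_i² is applied to every coefficient; whether to exempt the intercept is a modeling choice not specified on the slide.

```python
import numpy as np

def regularized_loss_and_grad(beta, X, y, mu=1.0):
    """Negative log likelihood plus (1/(2*mu)) * sum_i beta_i^2, as in Eq. (12),
    together with its gradient."""
    z = X @ beta
    pi = 1.0 / (1.0 + np.exp(-z))
    loss = np.sum(-y * z + np.log1p(np.exp(z))) + np.sum(beta ** 2) / (2.0 * mu)
    grad = -X.T @ (y - pi) + beta / mu        # penalty adds beta_i / mu to each entry
    return loss, grad
```

The extra β_i/µ term in the gradient keeps the weights from growing without bound, which addresses the "What if β keeps getting bigger?" issue raised on the previous slide.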