Machine Learning: Chenhao Tan, University of Colorado Boulder - PowerPoint PPT Presentation (Lecture 5)


  1. Machine Learning: Chenhao Tan
     University of Colorado Boulder, Lecture 5
     Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph

  2. Quiz question
     For a test instance (x, y) and a naïve Bayes classifier \hat{P}, which of the following statements is true?
     • (A) \sum_c \hat{P}(y = c \mid x) = 1
     • (B) \sum_c \hat{P}(x \mid c) = 1
     • (C) \sum_c \hat{P}(x \mid c)\,\hat{P}(c) = 1

  3. Overview
     Objective function
     Gradient Descent
     Stochastic Gradient Descent

  4. Reminder: Logistic Regression
     P(Y = 0 \mid X) = \frac{1}{1 + \exp(\beta_0 + \sum_i \beta_i X_i)}    (1)
     P(Y = 1 \mid X) = \frac{\exp(\beta_0 + \sum_i \beta_i X_i)}{1 + \exp(\beta_0 + \sum_i \beta_i X_i)}    (2)
     • Discriminative prediction: P(y \mid x)
     • Classification uses: sentiment analysis, spam detection
     • What we didn’t talk about is how to learn \beta from data
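
A minimal sketch (not from the slides) of equation (2) in Python; the function name predict_proba and the argument names are illustrative:

import numpy as np

def predict_proba(x, beta0, beta):
    """P(Y = 1 | x) for logistic regression, as in equation (2)."""
    z = beta0 + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-z))   # algebraically equal to exp(z) / (1 + exp(z))

P(Y = 0 | x) is then simply 1 - predict_proba(x, beta0, beta).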

  5. Outline (Objective function)
     Objective function
     Gradient Descent
     Stochastic Gradient Descent

  6. Logistic Regression: Objective Function
     Maximize likelihood:
     \mathrm{Obj} \equiv \log P(Y \mid X, \beta) = \sum_j \log P(y^{(j)} \mid x^{(j)}, \beta)
     = \sum_j \left[ y^{(j)} \Bigl( \beta_0 + \sum_i \beta_i x_i^{(j)} \Bigr) - \log\Bigl( 1 + \exp\bigl( \beta_0 + \sum_i \beta_i x_i^{(j)} \bigr) \Bigr) \right]

  7. Logistic Regression: Objective Function
     Minimize negative log likelihood (loss):
     L \equiv -\log P(Y \mid X, \beta) = -\sum_j \log P(y^{(j)} \mid x^{(j)}, \beta)
     = \sum_j \left[ -y^{(j)} \Bigl( \beta_0 + \sum_i \beta_i x_i^{(j)} \Bigr) + \log\Bigl( 1 + \exp\bigl( \beta_0 + \sum_i \beta_i x_i^{(j)} \bigr) \Bigr) \right]
     Training data \{(x, y)\} are fixed. The objective function is a function of \beta ... what values of \beta give a good value?
     \beta^* = \arg\min_\beta L(\beta)
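
As a sketch (my own, assuming a design matrix X with one row per example and a 0/1 label vector y), the loss above can be written directly in NumPy; np.logaddexp(0, z) computes log(1 + exp(z)) without overflow:

import numpy as np

def neg_log_likelihood(beta0, beta, X, y):
    """L(beta) = sum_j [ -y_j * z_j + log(1 + exp(z_j)) ], with z_j = beta0 + beta . x_j."""
    z = beta0 + X @ beta
    return np.sum(-y * z + np.logaddexp(0.0, z))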

  8. Convexity
     L(\beta) is convex for logistic regression.
     Proof.
     • The logistic loss -yv + \log(1 + \exp(v)) is convex in v.
     • Composition with a linear function maintains convexity.
     • A sum of convex functions is convex.
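
To spell out the first bullet (my addition, not on the slide): with \sigma(v) = \exp(v) / (1 + \exp(v)), the per-example logistic loss has a nonnegative second derivative in v, which is convexity in one dimension:

\frac{d}{dv}\left[-yv + \log(1 + e^{v})\right] = -y + \sigma(v),
\qquad
\frac{d^{2}}{dv^{2}}\left[-yv + \log(1 + e^{v})\right] = \sigma(v)\bigl(1 - \sigma(v)\bigr) \ge 0 .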

  9. Outline (Gradient Descent)
     Objective function
     Gradient Descent
     Stochastic Gradient Descent

  10. Convexity
      • Convex function
      • Doesn’t matter where you start, if you go down along the gradient
      • Gradient!

  11. Convexity
      • It would have been much harder if this were not convex.

  12. Gradient Descent (non-convex)
      Goal: Optimize loss function with respect to variables \beta.
      [Figure: a non-convex objective plotted against the parameter; successive gradient steps labeled 0, 1, 2, 3 move downhill, with an unexplored region of the curve labeled "Undiscovered Country".]

  13. Gradient Descent (non-convex)
      Goal: Optimize loss function with respect to variables \beta.
      \beta_j^{l+1} = \beta_j^{l} - \eta \frac{\partial L}{\partial \beta_j}
      Luckily, (vanilla) logistic regression is convex.
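
A compact sketch of this update rule (my illustration, not from the slides); grad_L, beta_init, eta, and n_steps are assumed names:

import numpy as np

def gradient_descent(grad_L, beta_init, eta=0.1, n_steps=100):
    """Repeatedly apply beta <- beta - eta * grad_L(beta)."""
    beta = np.array(beta_init, dtype=float)
    for _ in range(n_steps):
        beta = beta - eta * grad_L(beta)   # step against the gradient
    return beta

# Toy usage on the convex loss L(beta) = sum(beta_j^2), whose gradient is 2 * beta:
print(gradient_descent(lambda b: 2 * b, beta_init=[3.0, -2.0]))   # approaches [0, 0]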

  14. Gradient for Logistic Regression
      To ease notation, let's define
      \pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}    (3)
      Our objective function is
      L = -\sum_i \log p(y_i \mid x_i) = \sum_i L_i, \quad
      L_i = \begin{cases} -\log \pi_i & \text{if } y_i = 1 \\ -\log(1 - \pi_i) & \text{if } y_i = 0 \end{cases}    (4)

  15. Taking the Derivative
      Apply the chain rule:
      \frac{\partial L(\beta)}{\partial \beta_j} = \sum_i \frac{\partial L_i}{\partial \beta_j},
      \quad
      \frac{\partial L_i}{\partial \beta_j} = \begin{cases} -\frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ \frac{1}{1 - \pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 0 \end{cases}    (5)
      If we plug in the derivative
      \frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i)\, x_j ,    (6)
      we can merge these two cases:
      \frac{\partial L_i}{\partial \beta_j} = -(y_i - \pi_i)\, x_j .    (7)
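
Filling in the merge step (my addition): substituting (6) into the two cases of (5),

y_i = 1:\quad \frac{\partial L_i}{\partial \beta_j} = -\frac{1}{\pi_i}\,\pi_i(1-\pi_i)\,x_j = -(1-\pi_i)\,x_j = -(y_i - \pi_i)\,x_j ,
\qquad
y_i = 0:\quad \frac{\partial L_i}{\partial \beta_j} = \frac{1}{1-\pi_i}\,\pi_i(1-\pi_i)\,x_j = \pi_i\,x_j = -(y_i - \pi_i)\,x_j .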

  16. Gradient for Logistic Regression
      Gradient:
      \nabla_\beta L(\beta) = \left[ \frac{\partial L(\beta)}{\partial \beta_0}, \ldots, \frac{\partial L(\beta)}{\partial \beta_n} \right]    (8)
      Update:
      \Delta\beta \equiv \eta \nabla_\beta L(\beta)    (9)
      \beta_i' \leftarrow \beta_i - \eta \frac{\partial L(\beta)}{\partial \beta_i}    (10)
      • Why are we subtracting? What would we do if we wanted to do ascent?
      • \eta: step size, must be greater than zero
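
Putting (7), (8), and (10) together, a full-batch sketch (my illustration; the names are assumptions, and the intercept \beta_0 is handled by prepending a constant-1 column to X so that it becomes beta[0]):

import numpy as np

def lr_gradient(beta, X, y):
    """Gradient of the negative log likelihood: dL/dbeta_j = -sum_i (y_i - pi_i) * x_ij."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))   # pi_i = P(y_i = 1 | x_i)
    return -X.T @ (y - pi)

def gd_step(beta, X, y, eta=0.1):
    """One gradient descent update, equation (10): beta <- beta - eta * grad L(beta)."""
    return beta - eta * lr_gradient(beta, X, y)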

  17. Choosing Step Size
      [Figure: the objective plotted against the parameter; successive frames illustrate gradient steps taken with different step sizes.]
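
To make the trade-off concrete, a small experiment of my own (not from the slides) on the toy loss L(b) = b^2, whose gradient is 2b:

def run(eta, n_steps=25):
    """Gradient descent on L(b) = b**2, starting from b = 3.0."""
    b = 3.0
    for _ in range(n_steps):
        b = b - eta * 2 * b
    return b

for eta in [0.01, 0.1, 1.1]:
    print(eta, run(eta))
# eta = 0.01: still far from the minimum after 25 steps (too small)
# eta = 0.1 : close to 0 (reasonable)
# eta = 1.1 : |b| grows every step, the iterates diverge (too large)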

  18. Remaining issues
      • When to stop?
      • What if \beta keeps getting bigger?

  19. Regularized Conditional Log Likelihood
      Unregularized:
      \beta^* = \arg\min_\beta \left[ -\sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \right]    (11)
      Regularized:
      \beta^* = \arg\min_\beta \left[ -\sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) + \frac{1}{2}\mu \sum_i \beta_i^2 \right]    (12)
      \mu is a "regularization" parameter that trades off between likelihood and having small parameters.
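
A sketch of (12) and its gradient (my own; mu and the function names are illustrative, and I assume every \beta_i, including the intercept, is penalized exactly as written):

import numpy as np

def regularized_loss(beta, X, y, mu):
    """Negative log likelihood plus (1/2) * mu * sum_i beta_i**2, as in equation (12)."""
    z = X @ beta
    nll = np.sum(-y * z + np.logaddexp(0.0, z))   # log(1 + exp(z)) computed stably
    return nll + 0.5 * mu * np.sum(beta ** 2)

def regularized_gradient(beta, X, y, mu):
    """Gradient of the regularized loss: -X^T (y - pi) + mu * beta."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return -X.T @ (y - pi) + mu * beta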
