Machine Learning
Lecture 03: Logistic Regression and Gradient Descent

Nevin L. Zhang
lzhang@cse.ust.hk
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

This set of notes is based on internet resources and:
K. P. Murphy (2012). Machine Learning: A Probabilistic Perspective. MIT Press. (Chapter 8)
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. www.deeplearningbook.org (Chapter 4)
Andrew Ng. Lecture Notes on Machine Learning. Stanford.
Hal Daume. A Course on Machine Learning. http://ciml.info/
Outline
1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification
Recap of Probabilistic Linear Regression

Training set D = \{x_i, y_i\}_{i=1}^N, where y_i \in \mathbb{R}.

Probabilistic model:
p(y | x, \theta) = \mathcal{N}(y | \mu(x), \sigma^2) = \mathcal{N}(y | w^\top x, \sigma^2)

Learning: determining w by minimizing the cross-entropy loss:
J(w) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i | x_i, w)

Point estimation of y:
\hat{y} = \mu(x) = w^\top x
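As a concrete illustration, here is a minimal NumPy sketch of this recap; the synthetic dataset, shapes, and true weights are hypothetical, not from the lecture. Since minimizing J(w) under Gaussian noise is equivalent to least squares, the MLE of w is available in closed form.

    import numpy as np

    # Synthetic 1-D regression data (hypothetical example).
    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(50), rng.uniform(-2, 2, size=50)])  # x_0 = 1 is the bias term
    y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.5, size=50)

    # MLE of w: minimizing J(w) above is equivalent to least squares,
    # so the normal equations give it in closed form.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Point estimate of y at a new input x.
    x_new = np.array([1.0, 0.7])
    y_hat = w @ x_new
    print(w, y_hat)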
Logistic Regression (for Classification)

Training set D = \{x_i, y_i\}_{i=1}^N, where y_i \in \{0, 1\}.

Probabilistic model:
p(y | x, w) = \mathrm{Ber}(y | \sigma(w^\top x))

\sigma(z) is the sigmoid/logistic function (its inverse is the logit function):
\sigma(z) = \frac{1}{1 + \exp(-z)} = \frac{e^z}{e^z + 1}

It maps the real line \mathbb{R} to (0, 1). It is not to be confused with the variance \sigma^2 of the Gaussian distribution.
Logistic Regression

The model: p(y | x, w) = \mathrm{Ber}(y | \sigma(w^\top x))

\sigma(w^\top x) = p(y = 1 | x, w) = \frac{1}{1 + \exp(-w^\top x)}
1 - \sigma(w^\top x) = p(y = 0 | x, w) = \frac{\exp(-w^\top x)}{1 + \exp(-w^\top x)}

Consider the logit of p(y = 1 | x, w):
\log \frac{p(y = 1 | x, w)}{1 - p(y = 1 | x, w)} = \log \frac{p(y = 1 | x, w)}{p(y = 0 | x, w)} = \log \exp(w^\top x) = w^\top x

So, a linear model for the logit. Hence called logistic regression.
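A minimal NumPy sketch of these quantities, assuming a hypothetical weight vector and feature vector; it also confirms numerically that the log-odds equal w^\top x.

    import numpy as np

    def sigmoid(z):
        # sigma(z) = 1 / (1 + exp(-z)); maps R to (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical weights and one feature vector (x_0 = 1 is the bias term).
    w = np.array([-1.0, 2.0, 0.5])
    x = np.array([1.0, 0.3, -1.2])

    p1 = sigmoid(w @ x)            # p(y = 1 | x, w)
    p0 = 1.0 - p1                  # p(y = 0 | x, w)
    print(p1, p0)
    print(np.log(p1 / p0), w @ x)  # the log-odds equals w^T x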
Logistic Regression: Decision Rule

To classify instances, we obtain a point estimate of y:
\hat{y} = \arg\max_y p(y | x, w)

In other words, the decision/classification rule is:
\hat{y} = 1 iff p(y = 1 | x, w) > 0.5

This is called the optimal Bayes classifier: suppose the same situation occurs many times. You will make some mistakes no matter which decision rule you use, but the probability of a mistake is minimized by the rule above.
Logistic Regression is a Linear Classifier

In fact, the decision/classification rule of logistic regression is equivalent to:
\hat{y} = 1 iff w^\top x > 0

So it is a linear classifier with a linear decision boundary.

Example: whether a student is admitted based on the results of two exams.
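A minimal sketch of the decision rule, assuming hypothetical exam-score features and weights; it checks that thresholding w^\top x at 0 gives the same decisions as thresholding p(y = 1 | x, w) at 0.5.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict(w, X):
        # Classify as 1 iff p(y=1|x,w) > 0.5, which is equivalent to w^T x > 0.
        return (X @ w > 0).astype(int)

    # Hypothetical weights and two exam-score feature vectors (bias term x_0 = 1).
    w = np.array([-100.0, 0.8, 0.9])
    X = np.array([[1.0, 45.0, 50.0],
                  [1.0, 70.0, 85.0]])
    print(predict(w, X))           # prints [0 1]
    print(sigmoid(X @ w) > 0.5)    # same decisions via the probabilities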
Logistic Regression: Example

The solid black dots are the data: those at the bottom are the SAT scores of applicants rejected by a university, and those at the top are the SAT scores of applicants who were accepted. The red circles are the predicted probabilities that the applicants would be accepted.
Logistic Regression: 2D Example

The decision boundary p(y = 1 | x, w) = 0.5 is a straight line in the feature space.
Parameter Learning

We would like to find the MLE of w, i.e., the value of w that minimizes the cross-entropy loss:
J(w) = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i | x_i, w)

Consider a general training example (x, y). Because y is binary, we have
\log P(y | x, w) = y \log \hat{y} + (1 - y) \log(1 - \hat{y}), where \hat{y} = P(y = 1 | x, w)
 = y \log \sigma(w^\top x) + (1 - y) \log(1 - \sigma(w^\top x)).

Hence,
J(w) = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log \sigma(w^\top x_i) + (1 - y_i) \log(1 - \sigma(w^\top x_i))]

Unlike linear regression, we can no longer write down the MLE in closed form. Instead, we need to use iterative optimization algorithms to compute it:
Gradient descent
Newton's method
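A minimal NumPy sketch of this loss, assuming a hypothetical toy dataset; the clipping constant eps is a numerical safeguard added for the sketch, not part of the formula above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy_loss(w, X, y, eps=1e-12):
        # J(w) = -(1/N) sum_i [ y_i log sigma(w^T x_i) + (1 - y_i) log(1 - sigma(w^T x_i)) ]
        p = sigmoid(X @ w)
        p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
        return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    # Hypothetical toy data: 4 examples, bias term x_0 = 1 plus one feature.
    X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.5], [1.0, 2.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    print(cross_entropy_loss(np.array([0.0, 1.0]), X, y))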
Outline
1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification
Gradient Descent

Consider a function y = J(w) of a scalar variable w. The derivative of J(w) is defined as follows:
J'(w) = \frac{dJ(w)}{dw} = \lim_{\epsilon \to 0} \frac{J(w + \epsilon) - J(w)}{\epsilon}

When \epsilon is small, we have
J(w + \epsilon) \approx J(w) + \epsilon J'(w)

This equation tells us how to reduce J(w) by changing w in small steps:
If J'(w) > 0, make \epsilon negative, i.e., decrease w.
If J'(w) < 0, make \epsilon positive, i.e., increase w.
In other words, move in the opposite direction of the derivative (gradient).
Gradient Descent

To implement the idea, we update w as follows:
w \leftarrow w - \alpha J'(w)

The term -J'(w) means that we move in the opposite direction of the derivative, and \alpha determines how much we move in that direction. It is called the step size in optimization and the learning rate in machine learning.
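A minimal sketch of this scalar update rule; the objective J(w) = (w - 3)^2, the initial point, and the constant learning rate are illustrative choices, not from the lecture.

    # Gradient descent on a scalar function, following the update w <- w - alpha * J'(w).
    def J(w):
        return (w - 3.0) ** 2    # illustrative objective with minimum at w = 3

    def J_prime(w):
        return 2.0 * (w - 3.0)   # its derivative

    w = 0.0                      # initial guess
    alpha = 0.1                  # constant learning rate (for illustration)
    for _ in range(100):
        w = w - alpha * J_prime(w)
    print(w)                     # close to 3.0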
Gradient Descent

Consider a function y = J(w), where w = (w_0, w_1, \ldots, w_D)^\top. The gradient of J at w is defined as
\nabla J = \left( \frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_D} \right)^\top

The gradient is the direction along which J increases the fastest. If we want to reduce J as fast as possible, move in the opposite direction of the gradient.
Gradient Descent

The method of steepest descent/gradient descent for minimizing J(w):
1. Initialize w.
2. Repeat until convergence:
   w \leftarrow w - \alpha \nabla J(w)

The learning rate \alpha usually changes from iteration to iteration.
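A minimal NumPy sketch of the vector update; the quadratic-bowl objective, fixed learning rate, and fixed iteration budget (standing in for "repeat until convergence") are illustrative assumptions.

    import numpy as np

    # Gradient descent w <- w - alpha * grad J(w) on an illustrative quadratic bowl
    # J(w) = 0.5 * ||w - w_star||^2, whose gradient is (w - w_star).
    w_star = np.array([1.0, -2.0, 0.5])

    def grad_J(w):
        return w - w_star

    w = np.zeros(3)          # initialize w
    alpha = 0.2              # learning rate (kept constant here for simplicity)
    for _ in range(200):     # fixed budget instead of a convergence test
        w = w - alpha * grad_J(w)
    print(w)                 # approaches w_star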
Choice of Learning Rate

A constant learning rate is difficult to set:
If it is too small, convergence will be very slow.
If it is too large, the method can fail to converge at all.
It is better to vary the learning rate; we will discuss this more later.
Gradient Descent

Gradient descent can get stuck at local minima or saddle points. Nonetheless, it usually works well.
Outline
1. Logistic Regression
2. Gradient Descent
3. Gradient Descent for Logistic Regression
4. Newton's Method
5. Softmax Regression
6. Optimization Approach to Classification
Derivative of \sigma(z)

To apply gradient descent to logistic regression, we need to compute the partial derivative of J(w) w.r.t. each weight w_j. Before doing that, first consider the derivative of the sigmoid function:

\sigma'(z) = \frac{d\sigma(z)}{dz} = \frac{d}{dz} \frac{1}{1 + e^{-z}}
           = -\frac{1}{(1 + e^{-z})^2} \frac{d(1 + e^{-z})}{dz}
           = \frac{e^{-z}}{(1 + e^{-z})^2}
           = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right)
           = \sigma(z)(1 - \sigma(z))
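A quick numerical check of the identity \sigma'(z) = \sigma(z)(1 - \sigma(z)) using central finite differences; the test points and step size are arbitrary choices for the sketch.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Compare sigma'(z) = sigma(z) * (1 - sigma(z)) with a finite-difference estimate.
    z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])   # arbitrary test points
    eps = 1e-6
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    analytic = sigmoid(z) * (1.0 - sigmoid(z))
    print(np.max(np.abs(numeric - analytic)))    # tiny (around 1e-10 or smaller)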
Derivative of \log P(y | x, w)

Let z = w^\top x, with x = [x_0, x_1, \ldots, x_D]^\top and w = [w_0, w_1, \ldots, w_D]^\top. By the chain rule,
\frac{\partial \sigma(z)}{\partial w_j} = \frac{d\sigma(z)}{dz} \frac{\partial z}{\partial w_j} = \sigma(z)(1 - \sigma(z)) x_j

Hence,
\frac{\partial \log P(y | x, w)}{\partial w_j} = \frac{\partial}{\partial w_j} [y \log \sigma(z) + (1 - y) \log(1 - \sigma(z))]
 = y \frac{1}{\sigma(z)} \frac{\partial \sigma(z)}{\partial w_j} - (1 - y) \frac{1}{1 - \sigma(z)} \frac{\partial \sigma(z)}{\partial w_j}
 = [y(1 - \sigma(z)) - (1 - y)\sigma(z)] x_j
 = [y - \sigma(z)] x_j
Derivative of the Cross Entropy Loss

The i-th training example: x_i = [x_{i,0}, x_{i,1}, \ldots, x_{i,D}]^\top, with z_i = w^\top x_i.

\frac{\partial J(w)}{\partial w_j} = -\frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial w_j} \log P(y_i | x_i, w) = -\frac{1}{N} \sum_{i=1}^{N} [y_i - \sigma(z_i)] x_{i,j}
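A minimal NumPy sketch of this derivative in vectorized form: stacking the per-coordinate formula over j gives \nabla J(w) = -(1/N) X^\top (y - \sigma(Xw)). The toy data are hypothetical.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad_J(w, X, y):
        # Stacking dJ/dw_j = -(1/N) sum_i [y_i - sigma(w^T x_i)] x_{i,j} over all j
        # gives the gradient vector -(1/N) X^T (y - sigma(Xw)).
        N = X.shape[0]
        return -(X.T @ (y - sigmoid(X @ w))) / N

    # Hypothetical toy data (rows of X are x_i with bias term x_{i,0} = 1).
    X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.5], [1.0, 2.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    print(grad_J(np.array([0.0, 1.0]), X, y))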
Batch Gradient Descent

The batch gradient descent algorithm for logistic regression:
Repeat until convergence (for each weight w_j):
w_j \leftarrow w_j + \alpha \frac{1}{N} \sum_{i=1}^{N} [y_i - \sigma(w^\top x_i)] x_{i,j}

Interpretation (assume the entries of x_i are positive):
If the predicted value \sigma(w^\top x_i) is smaller than the actual value y_i, there is reason to increase w_j; the increment is proportional to x_{i,j}.
If the predicted value \sigma(w^\top x_i) is larger than the actual value y_i, there is reason to decrease w_j; the decrement is proportional to x_{i,j}.
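Putting the pieces together, here is a minimal batch-gradient-descent sketch for logistic regression; the toy dataset, learning rate, and iteration count are illustrative, and a fixed iteration budget stands in for a convergence test.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(X, y, alpha=0.5, n_iters=5000):
        # Batch gradient descent: w <- w + alpha * (1/N) * X^T (y - sigma(Xw)),
        # which applies the per-weight update above to every w_j at once.
        N, D = X.shape
        w = np.zeros(D)
        for _ in range(n_iters):
            w = w + alpha * (X.T @ (y - sigmoid(X @ w))) / N
        return w

    # Illustrative toy data: one feature plus a bias column x_0 = 1.
    X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
                  [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
    w = fit_logistic_regression(X, y)
    print(w)                                   # decision boundary near x_1 = 0
    print((sigmoid(X @ w) > 0.5).astype(int))  # reproduces the labels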