Lecture 3: Logistic Regression (a PowerPoint presentation by Feng Li, Shandong University)


  1. Lecture 3: Logistic Regression. Feng Li, Shandong University, fli@sdu.edu.cn. September 21, 2020.

  2. Lecture 3: Logistic Regression. Outline: 1 Logistic Regression, 2 Newton's Method, 3 Multiclass Classification.

  3. Logistic Regression
     Classification problem: similar to a regression problem, but we would like to predict only a small number of discrete values (instead of continuous values).
     Binary classification problem: y ∈ {0, 1}, where 0 represents the negative class and 1 denotes the positive class.
     y^(i) ∈ {0, 1} is also called the label of the i-th training example.

  4. Logistic Regression (Contd.)
     Logistic regression uses a logistic function (or sigmoid function) g(z) = 1/(1 + e^(−z)) to continuously approximate the discrete classification.

  5. Logistic Regression (Contd.)
     Properties of the sigmoid function:
     Bound: g(z) ∈ (0, 1)
     Symmetry: 1 − g(z) = g(−z)
     Gradient: g′(z) = g(z)(1 − g(z))
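     A short numerical check of the three properties above (a sketch of ours, not part of the slides; it assumes NumPy is available):

        import numpy as np

        def sigmoid(z):
            """The logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
            return 1.0 / (1.0 + np.exp(-z))

        z = np.linspace(-10, 10, 1001)
        g = sigmoid(z)

        assert np.all((g > 0) & (g < 1))                      # bound: g(z) in (0, 1)
        assert np.allclose(1 - g, sigmoid(-z))                # symmetry: 1 - g(z) = g(-z)
        num_grad = np.gradient(g, z)                          # finite-difference derivative
        assert np.allclose(num_grad, g * (1 - g), atol=1e-3)  # gradient: g'(z) = g(z)(1 - g(z))
        print("all three sigmoid properties hold numerically")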

  6. Logistic Regression (Contd.)
     Logistic regression defines hθ(x) using the sigmoid function: hθ(x) = g(θ^T x) = 1/(1 + e^(−θ^T x)).
     First compute a real-valued "score" θ^T x for input x, and then "squash" it into (0, 1) to turn this score into a probability (of x's label being 1).

  7. Logistic Regression (Contd.)
     Data samples are drawn randomly. X: random variable representing the feature vector; Y: random variable representing the label.
     Given an input feature vector x, we have:
     The conditional probability of Y = 1 given X = x: Pr(Y = 1 | X = x; θ) = hθ(x) = 1/(1 + exp(−θ^T x))
     The conditional probability of Y = 0 given X = x: Pr(Y = 0 | X = x; θ) = 1 − hθ(x) = 1/(1 + exp(θ^T x))

  8. Logistic Regression: A Closer Look ...
     What is the underlying decision rule in logistic regression? At the decision boundary, both classes are equiprobable; thus, we have
     Pr(Y = 1 | X = x; θ) = Pr(Y = 0 | X = x; θ)
     ⇒ 1/(1 + exp(−θ^T x)) = 1/(1 + exp(θ^T x))
     ⇒ exp(θ^T x) = 1
     ⇒ θ^T x = 0
     Therefore, the decision boundary of logistic regression is nothing but a linear hyperplane.
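     A tiny sketch (our illustration, with a randomly chosen θ and random inputs, assuming NumPy) confirming that thresholding the probability at 1/2 is the same as thresholding the score θ^T x at 0, i.e. that the boundary is the hyperplane θ^T x = 0:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        rng = np.random.default_rng(0)
        theta = rng.normal(size=3)         # hypothetical parameter vector
        X = rng.normal(size=(1000, 3))     # hypothetical feature vectors

        scores = X @ theta                 # theta^T x for every sample
        probs = sigmoid(scores)            # Pr(Y = 1 | X = x; theta)

        # Predicting label 1 when the probability is at least 1/2 ...
        pred_by_prob = probs >= 0.5
        # ... is exactly predicting label 1 when x lies on the non-negative
        # side of the hyperplane theta^T x = 0.
        pred_by_score = scores >= 0
        assert np.array_equal(pred_by_prob, pred_by_score)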

  9. Interpreting The Probabilities
     Recall that Pr(Y = 1 | X = x; θ) = 1/(1 + exp(−θ^T x)).
     The "score" θ^T x is also a measure of the distance of x from the hyperplane (the score is positive for positive examples and negative for negative examples).
     High positive score: high probability of label 1.
     High negative score: low probability of label 1 (high probability of label 0).

  10. Logistic Regression Formulation
      Logistic regression model: hθ(x) = g(θ^T x) = 1/(1 + e^(−θ^T x))
      Assume Pr(Y = 1 | X = x; θ) = hθ(x) and Pr(Y = 0 | X = x; θ) = 1 − hθ(x); then we have the following probability mass function
      p(y | x; θ) = Pr(Y = y | X = x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y)
      where y ∈ {0, 1}.

  11. Logistic Regression Formulation (Contd.)
      Y | X = x ∼ Bernoulli(hθ(x))
      If we assume y ∈ {−1, 1} instead of y ∈ {0, 1}, then p(y | x; θ) = 1/(1 + exp(−y θ^T x)).
      Assuming the training examples were generated independently, we define the likelihood of the parameters as
      L(θ) = ∏_{i=1}^m p(y^(i) | x^(i); θ) = ∏_{i=1}^m (hθ(x^(i)))^(y^(i)) (1 − hθ(x^(i)))^(1 − y^(i))
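     A quick numerical check (ours) that the two label conventions give the same per-example probability p(y | x; θ):

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        rng = np.random.default_rng(0)
        theta = rng.normal(size=4)          # hypothetical parameters and features
        x = rng.normal(size=4)
        h = sigmoid(theta @ x)              # h_theta(x)

        for y01, ypm in [(0, -1), (1, +1)]:                      # same example, two encodings
            p_bernoulli = h**y01 * (1 - h)**(1 - y01)            # y in {0, 1} convention
            p_signed = 1.0 / (1.0 + np.exp(-ypm * (theta @ x)))  # y in {-1, 1} convention
            assert np.isclose(p_bernoulli, p_signed)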

  12. Logistic Regression Formulation (Contd.)
      Maximize the log-likelihood
      ℓ(θ) = log L(θ) = Σ_{i=1}^m [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]
      Gradient ascent algorithm: θ_j ← θ_j + α ∂ℓ(θ)/∂θ_j for every j, where
      ∂ℓ(θ)/∂θ_j = Σ_{i=1}^m [ (y^(i) − hθ(x^(i))) / (hθ(x^(i))(1 − hθ(x^(i)))) ] · ∂hθ(x^(i))/∂θ_j = Σ_{i=1}^m (y^(i) − hθ(x^(i))) x_j^(i)

  13. Logistic Regression Formulation (Contd.)
      ∂ℓ(θ)/∂θ_j
      = Σ_{i=1}^m [ y^(i)/hθ(x^(i)) − (1 − y^(i))/(1 − hθ(x^(i))) ] · ∂hθ(x^(i))/∂θ_j
      = Σ_{i=1}^m [ (y^(i) − hθ(x^(i))) / (hθ(x^(i))(1 − hθ(x^(i)))) ] · ∂hθ(x^(i))/∂θ_j
      = Σ_{i=1}^m [ (y^(i) − hθ(x^(i))) / (exp(−θ^T x^(i)) / (1 + exp(−θ^T x^(i)))^2) ] · [ exp(−θ^T x^(i)) x_j^(i) / (1 + exp(−θ^T x^(i)))^2 ]
      = Σ_{i=1}^m (y^(i) − hθ(x^(i))) x_j^(i)
      since hθ(x)(1 − hθ(x)) = exp(−θ^T x)/(1 + exp(−θ^T x))^2 and ∂hθ(x)/∂θ_j = hθ(x)(1 − hθ(x)) x_j.
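     As a concrete illustration (ours, with synthetic data and arbitrary step size and iteration count), batch gradient ascent with the update θ_j ← θ_j + α Σ_i (y^(i) − hθ(x^(i))) x_j^(i) can be sketched as follows:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        rng = np.random.default_rng(0)
        m, n = 200, 2
        X = rng.normal(size=(m, n))                    # synthetic feature vectors
        true_theta = np.array([2.0, -3.0])             # hypothetical ground truth
        y = (rng.uniform(size=m) < sigmoid(X @ true_theta)).astype(float)  # labels in {0, 1}

        theta = np.zeros(n)
        alpha = 0.01                                   # learning rate (illustrative choice)
        for _ in range(5000):
            h = sigmoid(X @ theta)                     # h_theta(x^(i)) for all i at once
            grad = X.T @ (y - h)                       # gradient of the log-likelihood
            theta += alpha * grad                      # ascent step

        print("estimated theta:", theta)               # should land near true_theta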

  14. Newton's Method
      Given a differentiable real-valued function f: R → R, how can we find x such that f(x) = 0?
      [Figure: the curve y = f(x) crossing zero at the root x*]

  15. Newton's Method (Contd.)
      Draw the tangent line L to the curve y = f(x) at the point (x_1, f(x_1)).
      The x-intercept of L is x_2 = x_1 − f(x_1)/f′(x_1).
      [Figure: the tangent line L at (x_1, f(x_1)), its x-intercept x_2, and the root x*]

  16. Newton's Method (Contd.)
      Repeat the process to get a sequence of approximations x_1, x_2, x_3, ...
      [Figure: successive tangent lines at (x_1, f(x_1)) and (x_2, f(x_2)) approaching the root x*]

  17. Newton's Method (Contd.)
      In general, while the convergence criterion is not satisfied, update
      x ← x − f(x)/f′(x)
      [Figure: the same construction, showing the iterates x_1, x_2, x_3 approaching x*]
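     A minimal sketch of this scalar iteration (our example; f(x) = x² − 2 is chosen arbitrarily, so the root is √2):

        def newton(f, fprime, x, tol=1e-10, max_iter=50):
            """Iterate x <- x - f(x)/f'(x) until the step size drops below tol."""
            for _ in range(max_iter):
                step = f(x) / fprime(x)
                x = x - step
                if abs(step) < tol:       # convergence criterion on the step size
                    break
            return x

        root = newton(lambda x: x**2 - 2, lambda x: 2.0 * x, x=1.0)
        print(root)                       # ~1.4142135623..., i.e. sqrt(2)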

  18. Newton's Method (Contd.)
      Some properties:
      Highly dependent on the initial guess.
      Quadratic convergence once it is sufficiently close to x*.
      If f′(x*) = 0, it only has linear convergence.
      It is not guaranteed to converge at all, depending on the function or the initial guess.
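     A tiny demonstration (a textbook-style example of ours, not from the slides) of the last two points: starting Newton's method on f(x) = x³ − 2x + 2 at x = 0 makes the iterates cycle between 0 and 1 and never reach the real root near −1.77, while a better initial guess converges quickly:

        def newton_steps(f, fprime, x, num_steps):
            """Return the first num_steps Newton iterates starting from x."""
            xs = [x]
            for _ in range(num_steps):
                x = x - f(x) / fprime(x)
                xs.append(x)
            return xs

        f = lambda x: x**3 - 2 * x + 2
        fp = lambda x: 3 * x**2 - 2
        print(newton_steps(f, fp, x=0.0, num_steps=6))   # oscillates: 0, 1, 0, 1, ...
        print(newton_steps(f, fp, x=-2.0, num_steps=6))  # converges to about -1.7693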

  19. Newton's Method (Contd.)
      To maximize f(x), we have to find the stationary point of f(x), i.e., the point where f′(x) = 0. According to Newton's method, we have the following update:
      x ← x − f′(x)/f′′(x)
      Newton-Raphson method: for ℓ: R^n → R, we generalize Newton's method to the multidimensional setting
      θ ← θ − H^(−1) ∇θ ℓ(θ)
      where H is the Hessian matrix with entries H_{i,j} = ∂²ℓ(θ)/∂θ_i ∂θ_j.
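     A minimal sketch (ours; the synthetic data follow the same convention as the gradient-ascent example above) of the Newton-Raphson update θ ← θ − H^(−1) ∇θ ℓ(θ) for the logistic-regression log-likelihood, whose gradient is X^T (y − h) and whose Hessian is −X^T diag(hθ(x^(i))(1 − hθ(x^(i)))) X:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        rng = np.random.default_rng(0)
        m, n = 200, 2
        X = rng.normal(size=(m, n))
        y = (rng.uniform(size=m) < sigmoid(X @ np.array([2.0, -3.0]))).astype(float)

        theta = np.zeros(n)
        for _ in range(10):                            # Newton usually needs only a few iterations
            h = sigmoid(X @ theta)
            grad = X.T @ (y - h)                       # gradient of the log-likelihood
            H = -(X.T * (h * (1 - h))) @ X             # Hessian of the log-likelihood
            theta = theta - np.linalg.solve(H, grad)   # theta <- theta - H^{-1} grad

        print("estimated theta:", theta)

     Each step solves an n × n linear system (equivalently, works with the inverse Hessian), which is exactly the extra per-iteration cost discussed on the next slide.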

  20. Newton's Method (Contd.)
      Higher convergence speed than (batch) gradient descent: fewer iterations are needed to approach the optimum.
      However, each iteration is more expensive than that of gradient descent: it requires finding and inverting an n × n Hessian.
      More details about Newton's method can be found at https://en.wikipedia.org/wiki/Newton%27s_method

  21. Multiclass Classification
      Multiclass (or multinomial) classification is the problem of classifying instances into one of more than two classes.
      The existing multiclass classification techniques can be categorized into:
      Transformation to binary
      Extension from binary
      Hierarchical classification

  22. Transformation to Binary
      The one-vs.-rest (one-vs.-all, OvA or OvR, one-against-all, OAA) strategy is to train a single classifier per class, with the samples of that class as positive samples and all other samples as negative ones.
      Inputs: a learning algorithm L, and training data {(x^(i), y^(i))}_{i=1,...,m} where y^(i) ∈ {1, ..., K} is the label of the sample x^(i).
      Output: a list of classifiers f_k for k ∈ {1, ..., K}.
      Procedure: for each k ∈ {1, ..., K}, construct a new label z^(i) for every x^(i) such that z^(i) = 1 if y^(i) = k and z^(i) = 0 otherwise, and then apply L to {(x^(i), z^(i))}_{i=1,...,m} to obtain f_k. A higher f_k(x) implies a higher probability that x is in class k.
      Making decisions: y* = argmax_k f_k(x)
      Example: using SVM to train each binary classifier.
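     A compact one-vs.-rest sketch (ours; here the learning algorithm L is a plain gradient-ascent logistic-regression trainer, and the three-class toy data are made up):

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def train_binary(X, z, alpha=0.1, iters=2000):
            """Logistic regression by gradient ascent on binary labels z in {0, 1}."""
            theta = np.zeros(X.shape[1])
            for _ in range(iters):
                theta += alpha * X.T @ (z - sigmoid(X @ theta)) / len(z)
            return theta

        def train_ovr(X, y, K):
            """One classifier f_k per class: class k is positive, all the rest negative."""
            return [train_binary(X, (y == k).astype(float)) for k in range(K)]

        def predict_ovr(thetas, X):
            """y* = argmax_k f_k(x), using the score theta_k^T x as f_k(x)."""
            scores = np.stack([X @ theta for theta in thetas], axis=1)
            return np.argmax(scores, axis=1)

        # Toy usage: three well-separated Gaussian blobs, plus an intercept feature.
        rng = np.random.default_rng(0)
        K = 3
        centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
        X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])
        X = np.hstack([X, np.ones((len(X), 1))])       # intercept column
        y = np.repeat(np.arange(K), 50)
        thetas = train_ovr(X, y, K)
        print("training accuracy:", np.mean(predict_ovr(thetas, X) == y))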

  23. Transformation to Binary
      The one-vs.-one (OvO) reduction is to train K(K − 1)/2 binary classifiers.
      For the (s, t)-th classifier: positive samples are all the points in class s; negative samples are all the points in class t.
      f_{s,t}(x) is the decision value for this classifier, such that a larger f_{s,t}(x) implies that label s has a higher probability than label t.
      Prediction: f(x) = argmax_s ( Σ_t f_{s,t}(x) )
      Example: using SVM to train each binary classifier.
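     And a matching one-vs.-one sketch (ours again; f_{s,t} is a binary logistic classifier trained only on classes s and t, with f_{t,s}(x) = −f_{s,t}(x), and the prediction rule argmax_s Σ_t f_{s,t}(x) from above):

        import numpy as np
        from itertools import combinations

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def train_binary(X, z, alpha=0.1, iters=2000):
            """Logistic regression by gradient ascent on binary labels z in {0, 1}."""
            theta = np.zeros(X.shape[1])
            for _ in range(iters):
                theta += alpha * X.T @ (z - sigmoid(X @ theta)) / len(z)
            return theta

        def train_ovo(X, y, K):
            """theta[(s, t)] separates class s (positive) from class t (negative)."""
            models = {}
            for s, t in combinations(range(K), 2):      # K(K-1)/2 pairwise classifiers
                mask = (y == s) | (y == t)
                models[(s, t)] = train_binary(X[mask], (y[mask] == s).astype(float))
            return models

        def predict_ovo(models, X, K):
            totals = np.zeros((len(X), K))
            for (s, t), theta in models.items():
                f_st = X @ theta                        # decision value: > 0 favours s over t
                totals[:, s] += f_st
                totals[:, t] -= f_st                    # f_{t,s}(x) = -f_{s,t}(x)
            return np.argmax(totals, axis=1)

     It can be exercised on the same toy blobs as the one-vs.-rest sketch, e.g. predict_ovo(train_ovo(X, y, K), X, K).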
