  1. Logistic Regression CSCI 447/547 MACHINE LEARNING

  2. Outline
   - Math Behind Logistic Regression
   - Visualizing Logistic Regression
   - Loss Function
   - Minimizing Log Likelihood Function
   - Batch/Full Logistic Regression
   - Gradient Descent for OLS
   - Gradient Descent for Logistic Regression
   - Comparing OLS and Logistic Regression
   - Multi-Class Logistic Regression Using Softmax

  3. OLS Recap
   - Linear regression: predicts continuous and potentially unbounded labels based on given features
   - $\hat{Y} = XB$, where $B$ are the coefficients and $X$ is the data matrix
   - The issue: unbounded output means we cannot use it for discrete classification

  4. Logistic Regression
   - Classification: predicts discrete labels based on given features
   - The setup: $y = 0$ or $1$ with probabilities $1-p$ and $p$
   - Predict a probability instead of a value: estimate $P(y = 1 \mid X)$

  5. Definitions
   - Link function
     - Relates the mean of the distribution to the output of the linear model
     - Converts unbounded predictions to bounded ones, and continuous output to a discrete interpretation
     - Typically the link function involves an exponential
   - Logistic function
     - Input is $(-\infty, \infty)$ and output is $(0, 1)$: $f: (-\infty, \infty) \to (0, 1)$
     - Typically the sigmoid function: $f(0) = 0.5$, $f(-\infty) = 0$, $f(\infty) = 1$
     - This value is a probability
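
A minimal NumPy sketch of the sigmoid link described above (the function name and test values are illustrative, not from the slides); the printed values show $f(0) = 0.5$ and the limiting behavior at $\pm\infty$:

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) link: maps (-inf, inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))     # 0.5
print(sigmoid(-50.0))   # ~0, approaching f(-inf) = 0
print(sigmoid(50.0))    # ~1, approaching f(inf) = 1
```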

  6. Note on Choice of Link Function
   - Properties:
     - Bounded: output in $[0, 1]$
     - Domain: $(-\infty, \infty)$
     - Differentiable everywhere (used in optimization)
     - Increasing function
   - We do not lose the property of coefficient effect (positive and negative) in moving from linear to logistic

  7. Visualizing Logistic Regression

  8. Logistic Regression Loss
   - Our Goal
   - The Full Log-Likelihood
   - A Note on Minimization

  9. Our Goal
   - Find $P(y = 1 \mid X)$: the probability to be estimated
   - Logistic function: $\Phi(x) = \frac{1}{1 + e^{-x}}$
   - The OLS model $Y = XB$ becomes $Y = \Phi(XB)$, where $\Phi$ is applied componentwise: $y_i = \Phi(x_i \cdot B)$
   - MLE (Maximum Likelihood Estimation): $L = \prod_{i=1}^{N} P(Y = y_i \mid x_i)$
   - The observations are independent, so the joint probability is the product over the individual observations

  10. Our Goal
   - $L = \prod_{j=1}^{N} P(Y = y_j \mid x_j) = \prod_{j=1}^{N} p_j^{y_j} (1 - p_j)^{1 - y_j}$
   - Log-likelihood:
     $\log L = \sum_{j=1}^{N} \left[ y_j \log p_j + (1 - y_j) \log(1 - p_j) \right] = \sum_{j=1}^{N} \left[ y_j \log \Phi(x_j \cdot B) + (1 - y_j) \log\left(1 - \Phi(x_j \cdot B)\right) \right]$
   - Since we want to minimize, take the negative log-likelihood: we will minimize $-\log L$
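
A short NumPy sketch of this negative log-likelihood (the small eps clip is an added numerical guard, not part of the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_log_likelihood(B, X, y, eps=1e-12):
    """-log L for logistic regression with labels y in {0, 1}."""
    p = sigmoid(X @ B)                 # P(y = 1 | x_j) for each observation
    p = np.clip(p, eps, 1 - eps)       # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```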

  11. The Full Log-Likelihood
   - This is what we minimize in logistic regression
   - So far, similar to OLS
   - But it has no explicit (closed-form) solution, so we need to minimize it numerically

  12. Logistic Regression Loss

  13. Gradient Descent

  14. Batch/Full Gradient Descent
   - The gradient
   - Algorithm:
     1. Choose $x$ randomly
     2. Compute the gradient of $f$ at $x$: $\nabla f(x)$
     3. Step in the direction of the negative gradient: $x \leftarrow x - \eta \, \nabla f(x)$
   - $\eta$ = step size (too large and it can overshoot; too small and it takes too long)
   - Repeat steps 2 and 3 until convergence: the change between iterations stops decreasing, or a set number of iterations is reached
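
A minimal sketch of this loop in NumPy (the function names, tolerance, and toy quadratic example are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iter=10000):
    """Batch gradient descent: x <- x - eta * grad_f(x), repeated until convergence."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_f(x)
        x = x - step
        if np.linalg.norm(step) < tol:   # stop once the updates become negligible
            break
    return x

# Toy example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3)
print(gradient_descent(lambda x: 2 * (x - 3), x0=[0.0]))   # ~[3.]
```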

  15. Gradient Descent for OLS
   - Numerical minimization of OLS
   - Mean log-likelihood: $L = -\frac{1}{N} \| y - XB \|_2^2$
   - The algorithm for OLS
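
One way the OLS gradient-descent loop could look in NumPy; this is a sketch rather than the course's reference code, and the factor $2/N$ comes from differentiating the mean squared loss:

```python
import numpy as np

def ols_gradient_descent(X, y, eta=0.01, n_iter=5000):
    """Minimize the mean squared loss (1/N) * ||y - XB||^2 by batch gradient descent."""
    N, d = X.shape
    B = np.zeros(d)
    for _ in range(n_iter):
        residual = y - X @ B                      # residual error
        B = B + (2 * eta / N) * (X.T @ residual)  # step against the gradient
    return B
```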

  16. Gradient Descent for Logistic Regression
   - Numerical minimization of the logistic loss:
     $L(B) = -\sum_i \left[ y_i \log \Phi(x_i \cdot B) + (1 - y_i) \log\left(1 - \Phi(x_i \cdot B)\right) \right]$
   - Two components
   - Recall $\Phi(x) = \frac{1}{1 + e^{-x}}$
   - Because this is symmetric about $1/2$: $\Phi(x) + \Phi(-x) = 1$

  17. Gradient Descent for Logistic Regression
   - First term:
     $\frac{\partial}{\partial B_k} \left[ \log \Phi(x_j \cdot B) \right] = \frac{1}{\Phi(x_j \cdot B)} \, \frac{\partial}{\partial B_k} \Phi(x_j \cdot B) = \left(1 + e^{-x_j \cdot B}\right) \frac{\partial}{\partial B_k} \left[ \frac{1}{1 + e^{-x_j \cdot B}} \right]$
     $\quad = \left(1 + e^{-x_j \cdot B}\right) \frac{1}{\left(1 + e^{-x_j \cdot B}\right)^2} \, e^{-x_j \cdot B} \, x_{jk} = \frac{1}{1 + e^{x_j \cdot B}} \, x_{jk} = \Phi(-x_j \cdot B) \, x_{jk} = \left(1 - \Phi(x_j \cdot B)\right) x_{jk}$
   - In vector form: $\frac{\partial}{\partial B} \left[ \log \Phi(x_j \cdot B) \right] = x_j^{T} \left(1 - \Phi(x_j \cdot B)\right)$

  18. Gradient Descent for Logistic Regression
   - Second term:
     $\frac{\partial}{\partial B} \left[ \log\left(1 - \Phi(x_j \cdot B)\right) \right] = \frac{\partial}{\partial B} \log \Phi(-x_j \cdot B) = -x_j^{T} \left(1 - \Phi(-x_j \cdot B)\right) = -x_j^{T} \, \Phi(x_j \cdot B)$

  19. Gradient Descent for Logistic Regression
   - All terms:
     $\frac{\partial L(B)}{\partial B} = -\sum_j \left[ y_j x_j^{T} \left(1 - \Phi(x_j \cdot B)\right) + (1 - y_j)\left(-x_j^{T}\right) \Phi(x_j \cdot B) \right]$
     $\quad = -\sum_j \left[ y_j x_j^{T} - y_j x_j^{T} \Phi(x_j \cdot B) + y_j x_j^{T} \Phi(x_j \cdot B) - x_j^{T} \Phi(x_j \cdot B) \right]$
     $\quad = -\sum_j x_j^{T} \left( y_j - \Phi(x_j \cdot B) \right) = -X^{T} \left( y - \Phi(XB) \right)$
   - where $\Phi(XB) = \left( \Phi(x_1 \cdot B), \ldots, \Phi(x_N \cdot B) \right)^{T}$
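
A quick numerical check of the derived gradient $-X^{T}(y - \Phi(XB))$ against a finite-difference approximation (the random data, seed, and step size are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(B, X, y):
    p = sigmoid(X @ B)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(B, X, y):
    return -X.T @ (y - sigmoid(X @ B))   # the result derived on this slide

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
B = rng.normal(size=3)

# Central finite-difference approximation of the gradient
h = 1e-6
numeric = np.array([(loss(B + h * e, X, y) - loss(B - h * e, X, y)) / (2 * h)
                    for e in np.eye(3)])
print(np.allclose(grad(B, X, y), numeric, atol=1e-4))   # True
```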

  20. Gradient Descent for Logistic Regression
   - $\frac{\partial L}{\partial B} = -X^{T} \left( y - \Phi(XB) \right)$
   - where $\Phi(XB)$ is the vector of individual logistic functions
   - Normalize this by the number of observations: divide by $N$ to get the mean loss

  21. Gradient Descent for Logistic Regression
   - Algorithm:
     - Pick a learning rate $\eta$
     - Initialize $B$ randomly
     - Iterate: $B \leftarrow B + \frac{\eta}{N} X^{T} \left[ y - \Phi(XB) \right]$, i.e. $B \leftarrow B - \eta \frac{\partial L}{\partial B}$ with the mean loss
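
Putting the pieces together, a sketch of the full loop in NumPy (the initialization scale, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_gradient_descent(X, y, eta=0.1, n_iter=5000):
    """Batch gradient descent on the mean negative log-likelihood."""
    N, d = X.shape
    B = np.random.default_rng(0).normal(scale=0.01, size=d)   # random initialization
    for _ in range(n_iter):
        error = y - sigmoid(X @ B)            # y - Phi(XB)
        B = B + (eta / N) * (X.T @ error)     # B <- B + (eta/N) X^T [y - Phi(XB)]
    return B
```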

  22. Comparing OLS and Logistic Regression
   - The two update steps
   - OLS: $B \leftarrow B + \frac{2\eta}{N} X^{T} \left( y - XB \right)$ (a constant times the residual error)
   - Logistic: $B \leftarrow B + \frac{\eta}{N} X^{T} \left( y - \Phi(XB) \right)$ (a constant times the error)

  23. Comparing OLS and Logistic Regression
   - Regularization
   - OLS loss function under regularization: $L = \| y - XB \|_2^2 + \frac{\lambda}{2} \| B \|_2^2$
   - Logistic: $L = \| y - \Phi(XB) \|_2^2 + \frac{\lambda}{2} \| B \|_2^2$
   - L2 norm (Ridge regression): uniform regularization
   - L1 norm (LASSO regression): dimensionality reduction
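
Regularization only adds a term to the gradient: the L2 penalty $\frac{\lambda}{2}\|B\|_2^2$ contributes $\lambda B$. Below is a sketch of one ridge-penalized update step; it uses the mean negative log-likelihood from the earlier slides as the data term (a deliberate substitution for the squared-error form written above), and lam and eta are illustrative values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ridge_logistic_step(B, X, y, eta=0.1, lam=1.0):
    """One gradient step with an L2 (ridge) penalty (lam/2) * ||B||^2 added to the loss."""
    N = X.shape[0]
    data_grad = -(X.T @ (y - sigmoid(X @ B))) / N   # gradient of the mean negative log-likelihood
    penalty_grad = lam * B                          # gradient of (lam/2) * ||B||^2
    return B - eta * (data_grad + penalty_grad)
```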

  24. Multi-Class Logistic Regression Using Softmax
   - Softmax: $P(y = c_k \mid x_j) = \frac{e^{B_k \cdot x_j}}{\sum_{l=1}^{m} e^{B_l \cdot x_j}}$, for $y \in \{1, \ldots, m\}$
   - $\sum_{k=1}^{m} P(y = c_k \mid x_j) = 1$, so it is a probability
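
A minimal NumPy sketch of the softmax over the class scores $B_k \cdot x_j$ (subtracting the maximum is an added numerical-stability detail, not part of the slide):

```python
import numpy as np

def softmax(scores):
    """Map a vector of class scores to probabilities that sum to 1."""
    z = scores - np.max(scores)     # stability shift; does not change the result
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())           # the probabilities sum to 1
```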

  25. Multi-Class Logistic Regression Using Softmax
   - Comparison with logistic regression in the case of two classes
   - 2 classes: $P(y = 1 \mid x_j) = \frac{1}{1 + e^{-x_j B}}$
   - Softmax with 2 classes: $P(y = 1 \mid x_j) = \frac{e^{x_j B_1}}{e^{x_j B_1} + e^{x_j B_2}}$; multiplying numerator and denominator by $e^{-x_j B_1}$ gives $\frac{1}{1 + e^{-x_j (B_1 - B_2)}}$
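
A small numeric check of this two-class equivalence (the random vectors and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
B1, B2 = rng.normal(size=4), rng.normal(size=4)

softmax_p1 = np.exp(x @ B1) / (np.exp(x @ B1) + np.exp(x @ B2))
sigmoid_p1 = 1.0 / (1.0 + np.exp(-x @ (B1 - B2)))   # sigmoid in the coefficient difference
print(np.isclose(softmax_p1, sigmoid_p1))            # True
```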

  26. Softmax Optimization
   - Classification probabilities: $\Pi_{j,k} = \frac{e^{x_j B_k}}{\sum_l e^{x_j B_l}}$, with $p_k = \left( \Pi_{1,k}, \ldots, \Pi_{N,k} \right)^{T}$
   - $\Pi_{j,k}$ is the probability that observation $j$ belongs to class $k$
   - The gradients: $\frac{\partial L}{\partial B_k} = -X^{T} \left( y_k - p_k \right)$
   - Logistic: $-X^{T} \left( y - \Phi(XB) \right)$; the probability vector replaces the sigmoid term
   - Each label $y_j$ is a one-hot column vector, with all entries 0 except the entry for observation $j$'s class; $y_k$ then collects the class-$k$ indicators across observations

  27. Softmax Gradient Descent
   - Algorithm largely unchanged:
     1. Initialize the learning rate
     2. Randomly choose $B_1, \ldots, B_m$
     3. Iterate $B_k \leftarrow B_k + \frac{\eta}{N} X^{T} \left( y_k - p_k \right)$ for each class $k$
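
A sketch of this loop in NumPy, stacking the class coefficient vectors as the columns of a $d \times m$ matrix and the one-hot labels as the rows of an $N \times m$ matrix (these layout choices, the seed, and the hyperparameters are assumptions for the sketch):

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax of the score matrix Z = X @ B."""
    Z = Z - Z.max(axis=1, keepdims=True)       # numerical-stability shift
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def softmax_gradient_descent(X, Y, eta=0.1, n_iter=2000):
    """Batch gradient descent for multi-class logistic regression.
    X is N x d features; Y is N x m one-hot labels; B is d x m, one column per class."""
    N, d = X.shape
    m = Y.shape[1]
    B = np.random.default_rng(0).normal(scale=0.01, size=(d, m))
    for _ in range(n_iter):
        P = softmax_rows(X @ B)                # Pi_{j,k}: class probabilities per observation
        B = B + (eta / N) * (X.T @ (Y - P))    # B_k <- B_k + (eta/N) X^T (y_k - p_k), all k at once
    return B
```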

  28. Last Notes
   - Learning rate $\eta$:
     - Small values take a long time to converge
     - Large values, and convergence may not happen
   - It is important to monitor the loss function at each iteration
   - Also make sure you normalize the features so the values don't get too large
   - The gradient descent algorithms are very similar across all of these models
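
One common way to normalize features before running any of the gradient-descent loops above; standardization here is a sketch of one reasonable choice, not the course's prescribed method:

```python
import numpy as np

def standardize(X):
    """Column-wise standardization so feature values stay on a comparable scale."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0    # leave constant columns unscaled instead of dividing by zero
    return (X - mu) / sigma
```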

  29. Summary
   - Math Behind Logistic Regression
   - Visualizing Logistic Regression
   - Loss Function
   - Minimizing Log Likelihood Function
   - Batch/Full Logistic Regression
   - Gradient Descent for OLS
   - Gradient Descent for Logistic Regression
   - Comparing OLS and Logistic Regression
   - Multi-Class Logistic Regression Using Softmax
