

  1. Linear and Logistic Regression Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

  2. Goals for the lecture • understand the concepts: • linear regression • closed-form solution for linear regression • lasso • RMSE, MAE, and R-square • logistic regression for linear classification • gradient descent for logistic regression • multiclass logistic regression

  3. Linear regression • Given training data $\{(x^{(j)}, y^{(j)}) : 1 \le j \le n\}$ i.i.d. from distribution $D$ • Find $f_w(x) = w^\top x$ in the hypothesis class $\mathcal{H}$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{j=1}^{n} (w^\top x^{(j)} - y^{(j)})^2$ • $\ell_2$ loss; also called mean squared error

  4. Linear regression: optimization • Given training data $\{(x^{(j)}, y^{(j)}) : 1 \le j \le n\}$ i.i.d. from distribution $D$ • Find $f_w(x) = w^\top x$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{j=1}^{n} (w^\top x^{(j)} - y^{(j)})^2$ • Let $X$ be the matrix whose $j$-th row is $(x^{(j)})^\top$, and let $y$ be the vector $(y^{(1)}, \dots, y^{(n)})^\top$; then $\hat{L}(f_w) = \frac{1}{n}\sum_{j=1}^{n} (w^\top x^{(j)} - y^{(j)})^2 = \frac{1}{n}\|Xw - y\|_2^2$
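A minimal numpy sketch (not from the slides) of the equivalence above: the per-example sum and the vectorized norm form of the loss give the same number. X, y, w are made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))          # j-th row is (x^(j))^T
y = rng.normal(size=n)               # labels y^(1), ..., y^(n)
w = rng.normal(size=d)

loss_sum = np.mean([(w @ X[j] - y[j]) ** 2 for j in range(n)])   # per-example form
loss_vec = np.linalg.norm(X @ w - y) ** 2 / n                    # (1/n) ||Xw - y||^2
assert np.isclose(loss_sum, loss_vec)
```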

  5. Linear regression: optimization • Set the gradient to 0 to get the minimizer: $\nabla_w \hat{L}(f_w) = \nabla_w \frac{1}{n}\|Xw - y\|_2^2 = 0$ • $\nabla_w [(Xw - y)^\top (Xw - y)] = 0$ • $\nabla_w [w^\top X^\top X w - 2 w^\top X^\top y + y^\top y] = 0$ • $2 X^\top X w - 2 X^\top y = 0$ • $w = (X^\top X)^{-1} X^\top y$

  6. Linear regression: optimization • Algebraic view of the minimizer • If $X$ is invertible, just solve $Xw = y$ and get $w = X^{-1} y$ • But typically $X$ is a tall matrix, so instead solve the normal equation $X^\top X w = X^\top y$ • Normal equation solution: $w = (X^\top X)^{-1} X^\top y$
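A short sketch of the closed-form solution, assuming $X^\top X$ is invertible; the data are synthetic placeholders, and np.linalg.lstsq is shown as the numerically safer way to solve the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # tall matrix: more rows than columns
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.01 * rng.normal(size=100)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)        # w = (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)     # numerically safer least-squares solve
print(np.allclose(w_normal, w_lstsq))               # should print True
```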

  7. Linear regression with bias • Given training data $\{(x^{(j)}, y^{(j)}) : 1 \le j \le n\}$ i.i.d. from distribution $D$ • Find $f_{w,b}(x) = w^\top x + b$ to minimize the loss, where $b$ is the bias term • Reduce to the case without bias: let $w' = [w; b]$ and $x' = [x; 1]$ • Then $f_{w,b}(x) = w^\top x + b = (w')^\top x'$
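A sketch of this reduction with illustrative data only: appending a constant-1 feature makes the bias the last coordinate of the augmented weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0            # data generated with bias b = 4

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])    # x' = [x; 1]
w_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)   # solve for w' = [w; b]
w, b = w_aug[:-1], w_aug[-1]
print(w, b)                                         # b should come out close to 4
```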

  8. Linear regression with lasso penalty • Given training data $\{(x^{(j)}, y^{(j)}) : 1 \le j \le n\}$ i.i.d. from distribution $D$ • Find $f_w(x) = w^\top x$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{j=1}^{n} (w^\top x^{(j)} - y^{(j)})^2 + \lambda \|w\|_1$ • lasso penalty: the $\ell_1$ norm of the parameter, which encourages sparsity
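The slides do not give a solver for the lasso objective; the sketch below uses proximal gradient descent (ISTA) as one possible approach, with an ad hoc step size and iteration count. The soft-thresholding step is what produces sparse solutions.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    n, d = X.shape
    step = n / (2 * np.linalg.norm(X, 2) ** 2)      # 1 / Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = (2 / n) * X.T @ (X @ w - y)          # gradient of the squared-loss term
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20); w_true[:3] = [3.0, -2.0, 1.5]               # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=100)
print(lasso_ista(X, y, lam=0.1))                    # most coordinates should be (near) zero
```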

  9. Evaluation metrics • Root mean squared error (RMSE) • Mean absolute error (MAE): average $\ell_1$ error • R-square (R-squared) • Historically all were computed on training data, and possibly adjusted afterwards, but really should cross-validate

  10. R-square • Formulation 1: $R^2 = 1 - \frac{\sum_j (y^{(j)} - \hat{y}^{(j)})^2}{\sum_j (y^{(j)} - \bar{y})^2}$ • Formulation 2: the square of the Pearson correlation coefficient $r$ between the label and the prediction. Recall that for paired samples $x, y$: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$
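A numpy sketch of the three metrics (not from the slides), with R-square in both formulations; for least-squares linear regression with an intercept, the two formulations coincide on the fitted predictions.

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def r2_formulation1(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def r2_formulation2(y, y_hat):
    r = np.corrcoef(y, y_hat)[0, 1]     # Pearson correlation between label and prediction
    return r ** 2
```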

  11. Linear classification • Figure: a linear decision boundary $w^\top x = 0$ with normal vector $w$; points with $w^\top x > 0$ are Class 1 and points with $w^\top x < 0$ are Class 0

  12. Linear classification: natural attempt • Given training data $\{(x^{(j)}, y^{(j)}) : 1 \le j \le n\}$ i.i.d. from distribution $D$ • Hypothesis (linear model $\mathcal{H}$): $f_w(x) = w^\top x$, with $y = 1$ if $w^\top x > 0$ and $y = 0$ if $w^\top x < 0$ • Prediction: $y = \text{step}(f_w(x)) = \text{step}(w^\top x)$

  13. Linear classification: natural attempt • Given training data $\{(x^{(j)}, y^{(j)}) : 1 \le j \le n\}$ i.i.d. from distribution $D$ • Find $f_w(x) = w^\top x$ to minimize $\hat{L}(f_w) = \frac{1}{n}\sum_{j=1}^{n} \mathbb{1}[\text{step}(w^\top x^{(j)}) \ne y^{(j)}]$ • Drawback: the 0-1 loss is difficult to optimize (NP-hard in the worst case)
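A tiny sketch of the empirical 0-1 loss for the step-function classifier above; it is simple to evaluate, which is separate from the optimization difficulty the slide notes.

```python
import numpy as np

def zero_one_loss(w, X, y):
    y_hat = (X @ w > 0).astype(int)     # step(w^T x): predict 1 if positive, else 0
    return np.mean(y_hat != y)          # fraction of misclassified training examples
```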

  14. Linear classification: simple approach • Given training data $\{(x^{(j)}, y^{(j)}) : 1 \le j \le n\}$ i.i.d. from distribution $D$ • Find $f_w(x) = w^\top x$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{j=1}^{n} (w^\top x^{(j)} - y^{(j)})^2$ • Reduce to linear regression; ignore the fact that $y \in \{0, 1\}$

  15. Linear classification: simple approach • Drawback: not robust to "outliers" • Figure borrowed from Pattern Recognition and Machine Learning, Bishop

  16. Compare the two • Figure: $y = w^\top x$ vs. $y = \text{step}(w^\top x)$, plotted as functions of $w^\top x$

  17. Between the two • Prediction bounded in $[0, 1]$ • Smooth • Sigmoid: $\sigma(a) = \frac{1}{1 + \exp(-a)}$ • Figure borrowed from Pattern Recognition and Machine Learning, Bishop

  18. Linear classification: sigmoid prediction • Squash the output of the linear function: $\text{Sigmoid}(w^\top x) = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)}$ • Find $w$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{j=1}^{n} (\sigma(w^\top x^{(j)}) - y^{(j)})^2$

  19. Linear classification: logistic regression • Squash the output of the linear function: $\text{Sigmoid}(w^\top x) = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)}$ • A better approach: interpret the output as a probability, $P_w(y = 1 \mid x) = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)}$ and $P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^\top x)$

  20. Linear classification: logistic regression • Find $f_w(x) = w^\top x$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{j=1}^{n} (w^\top x^{(j)} - y^{(j)})^2$ (the earlier regression approach) • Better: find $w$ that minimizes $\hat{L}(w) = -\frac{1}{n}\sum_{j=1}^{n} \log P_w(y^{(j)} \mid x^{(j)}) = -\frac{1}{n}\sum_{j:\, y^{(j)}=1} \log \sigma(w^\top x^{(j)}) - \frac{1}{n}\sum_{j:\, y^{(j)}=0} \log[1 - \sigma(w^\top x^{(j)})]$ • Logistic regression: MLE with sigmoid
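A sketch of the negative log-likelihood above, written as a single cross-entropy expression over labels in {0, 1}; the function and variable names are placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_nll(w, X, y):              # labels y^(j) in {0, 1}
    p = sigmoid(X @ w)                  # P_w(y = 1 | x^(j)) for every example
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```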

  21. Linear classification: logistic regression • Given training data $\{(x^{(j)}, y^{(j)}) : 1 \le j \le n\}$ i.i.d. from distribution $D$ • Find $w$ that minimizes $\hat{L}(w) = -\frac{1}{n}\sum_{j:\, y^{(j)}=1} \log \sigma(w^\top x^{(j)}) - \frac{1}{n}\sum_{j:\, y^{(j)}=0} \log[1 - \sigma(w^\top x^{(j)})]$ • No closed-form solution; need to use gradient descent
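A plain gradient-descent sketch for the logistic loss above. The gradient formula $\frac{1}{n} X^\top(\sigma(Xw) - y)$ is the standard one but is not derived on these slides, and the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # gradient of the average NLL
        w -= lr * grad                               # gradient-descent update
    return w
```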

  22. Properties of the sigmoid function • Bounded: $\sigma(a) = \frac{1}{1 + \exp(-a)} \in (0, 1)$ • Symmetric: $1 - \sigma(a) = \frac{\exp(-a)}{1 + \exp(-a)} = \frac{1}{\exp(a) + 1} = \sigma(-a)$ • Gradient: $\sigma'(a) = \frac{\exp(-a)}{(1 + \exp(-a))^2} = \sigma(a)(1 - \sigma(a))$
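A quick numerical check of the three properties (illustrative only): boundedness, the symmetry $1 - \sigma(a) = \sigma(-a)$, and the derivative identity via a central finite difference.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5, 5, 101)
assert np.all((sigmoid(a) > 0) & (sigmoid(a) < 1))             # bounded in (0, 1)
assert np.allclose(1 - sigmoid(a), sigmoid(-a))                # symmetry
eps = 1e-6
numeric_grad = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(numeric_grad, sigmoid(a) * (1 - sigmoid(a)), atol=1e-6)
```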

  23. Review: binary logistic regression • Sigmoid: $\sigma(w^\top x + b) = \frac{1}{1 + \exp(-(w^\top x + b))}$ • Interpret as a conditional probability: $p_w(y = 1 \mid x) = \sigma(w^\top x + b)$ and $p_w(y = 0 \mid x) = 1 - p_w(y = 1 \mid x) = 1 - \sigma(w^\top x + b)$ • How to extend to multiclass?

  24. Review: binary logistic regression • Suppose we model the class-conditional densities $p(x \mid y = i)$ and the class probabilities $p(y = i)$ • Conditional probability by Bayes' rule: $p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 2)\, p(y = 2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$, where we define $a := \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)}$

  25. Review: binary logistic regression • Suppose we model the class-conditional densities $p(x \mid y = i)$ and the class probabilities $p(y = i)$ • $p(y = 1 \mid x) = \sigma(a) = \sigma(w^\top x + b)$ is equivalent to setting the log odds to be linear: $a = \ln \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)} = w^\top x + b$ • Why linear log odds?

  26. Review: binary logistic regression • Suppose the class-conditional densities $p(x \mid y = i)$ are normal: $p(x \mid y = i) = N(x \mid \mu_i, I) = \frac{1}{(2\pi)^{d/2}} \exp\{-\frac{1}{2}\|x - \mu_i\|^2\}$ • The log odds: $a = \ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 2)\, p(y = 2)} = w^\top x + b$, where $w = \mu_1 - \mu_2$ and $b = -\frac{1}{2}\mu_1^\top \mu_1 + \frac{1}{2}\mu_2^\top \mu_2 + \ln \frac{p(y = 1)}{p(y = 2)}$
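A numerical check of this slide's identity with made-up numbers: with unit-covariance Gaussian class-conditionals, the Bayes posterior $p(y = 1 \mid x)$ matches $\sigma(w^\top x + b)$ for the $w$ and $b$ given above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gaussian_density(x, mu):            # N(x | mu, I); the normalizer cancels in the ratio
    return np.exp(-0.5 * np.sum((x - mu) ** 2))

rng = np.random.default_rng(0)
mu1, mu2 = rng.normal(size=3), rng.normal(size=3)
p1, p2 = 0.3, 0.7                       # class priors p(y = 1), p(y = 2)
x = rng.normal(size=3)

posterior = gaussian_density(x, mu1) * p1 / (
    gaussian_density(x, mu1) * p1 + gaussian_density(x, mu2) * p2)
w = mu1 - mu2
b = -0.5 * mu1 @ mu1 + 0.5 * mu2 @ mu2 + np.log(p1 / p2)
assert np.isclose(posterior, sigmoid(w @ x + b))
```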

  27. Multiclass logistic regression • Suppose we model the class-conditional densities $p(x \mid y = i)$ and the class probabilities $p(y = i)$ • Conditional probability by Bayes' rule: $p(y = i \mid x) = \frac{p(x \mid y = i)\, p(y = i)}{\sum_k p(x \mid y = k)\, p(y = k)} = \frac{\exp(a_i)}{\sum_k \exp(a_k)}$, where we define $a_i := \ln [p(x \mid y = i)\, p(y = i)]$
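A softmax sketch for the posterior above; the $a_k$ values are placeholders, and subtracting the maximum before exponentiating is a standard numerical-stability trick not mentioned on the slide.

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)                   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, 0.5, -1.0])          # a_k = ln [p(x|y=k) p(y=k)]
print(softmax(a))                       # p(y = k | x); the entries sum to 1
```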

  28. Multiclass logistic regression • Suppose the class-conditional densities $p(x \mid y = i)$ are normal: $p(x \mid y = i) = N(x \mid \mu_i, I) = \frac{1}{(2\pi)^{d/2}} \exp\{-\frac{1}{2}\|x - \mu_i\|^2\}$ • Then $a_i := \ln [p(x \mid y = i)\, p(y = i)] = -\frac{1}{2} x^\top x + w_i^\top x + b_i$, where $w_i = \mu_i$ and $b_i = -\frac{1}{2}\mu_i^\top \mu_i + \ln p(y = i) + \ln \frac{1}{(2\pi)^{d/2}}$
