
Applied Machine Learning: Logistic Regression - PowerPoint PPT Presentation



  1. Applied Machine Learning: Logistic Regression. Siamak Ravanbakhsh, COMP 551 (Winter 2020).

  2. Learning objectives: what linear classifiers are; the logistic regression model; its loss function; the maximum-likelihood view; multi-class classification.

  3. Motivation: we have already seen KNN for classification; today we see more classifiers (linear classifiers). Logistic regression is the most commonly reported data science method used at work (source: 2017 Kaggle survey).

  4. Classification problem: a dataset of inputs $x^{(n)} \in \mathbb{R}^D$ and discrete targets $y^{(n)} \in \{0, \ldots, C\}$; in binary classification $y^{(n)} \in \{0, 1\}$. Linear classification: the decision boundaries are linear, $w^\top x + b = 0$. How do we find these boundaries? Different approaches give different linear classifiers.

  5. Using linear regression (first idea): fit a linear model to each class $c$,
     $w_c^* = \arg\min_{w_c} \frac{1}{2}\sum_{n=1}^{N} (w_c^\top x^{(n)} - \mathbb{I}(y^{(n)} = c))^2$.
     The class label for a new instance is then $\hat{y}^{(n)} = \arg\max_c w_c^\top x^{(n)}$, and the decision boundary between any two classes is $w_c^\top x = w_{c'}^\top x$ (recall that $x = [1, x]^\top$, so $w^\top x$ absorbs the bias).
     [figure: $w_1^\top x$, $w_2^\top x$, $w_3^\top x$ plotted against $x_1$] Example: where are the decision boundaries? The instances are linearly separable, so we should be able to find these boundaries. Where is the problem?
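
A minimal numpy sketch of this one-vs-rest least-squares idea (the data here is randomly generated purely for illustration, and all variable names are my own, not from the slides):

      import numpy as np

      # hypothetical toy data: N instances, D features (first column is the constant 1 for the bias), C classes
      rng = np.random.default_rng(0)
      N, D, C = 90, 3, 3
      X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D - 1))])   # N x D
      y = rng.integers(0, C, size=N)                                  # N labels in {0, ..., C-1}

      # fit one linear model per class against the 0/1 indicator I(y == c)
      Y = np.eye(C)[y]                               # N x C one-hot indicator targets
      W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # D x C, all C least-squares fits at once

      # the class label for a new instance is the class with the largest linear score w_c^T x
      y_hat = np.argmax(X @ W, axis=1)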

  6. Using linear regression (first idea), binary classification $y \in \{0, 1\}$: we are fitting two linear models $a^\top x$ and $b^\top x$, and the decision boundary is where $a^\top x - b^\top x = 0$, i.e. $(a - b)^\top x = 0$. Writing $w = a - b$, one weight vector is enough: predict $y = 1$ when $w^\top x > 0$ and $y = 0$ when $w^\top x < 0$.

  7. Using linear regression (first idea), binary classification $y \in \{0, 1\}$: a correctly classified instance with $w^\top x^{(n)} = 100 > 0$ incurs an L2 loss of $(100 - 1)^2 = 99^2$, while an incorrectly classified instance with $w^\top x^{(n')} = -2 < 0$ incurs an L2 loss of only $(-2 - 1)^2 = 9$. A correct prediction can have a higher loss than an incorrect one! Solution: squash all the positive scores close together and all the negative ones close together.

  8. Logistic function. Idea: apply a squashing function to the score, $w^\top x \to \sigma(w^\top x)$. Desirable properties of $\sigma: \mathbb{R} \to \mathbb{R}$: all scores with $w^\top x > 0$ are squashed close together, and all scores with $w^\top x < 0$ are squashed close together. The logistic function $\sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$ has these properties. The decision boundary is unchanged: $w^\top x = 0 \Leftrightarrow \sigma(w^\top x) = \frac{1}{2}$, so it is still a linear decision boundary.
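
A small sketch of this squashing behaviour (the function name and the example scores are illustrative choices, not from the slides):

      import numpy as np

      def logistic(z):
          # maps any real-valued score into (0, 1), with sigma(0) = 1/2
          return 1.0 / (1.0 + np.exp(-z))

      scores = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])
      print(logistic(scores))   # negative scores squashed toward 0, positive toward 1
      # thresholding sigma(w^T x) at 1/2 gives the same linear decision boundary w^T x = 0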

  9. Logistic regression: model. $f_w(x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$, where $z = w^\top x$ is the logit and $\sigma$ is the logistic (squashing, activation) function. Note the linear decision boundary.

  10. Logistic regression: the loss (first idea): use the misclassification error, $L_{0/1}(\hat{y}, y) = \mathbb{I}(y \neq \mathrm{sign}(\hat{y} - \tfrac{1}{2}))$ with $\hat{y} = \sigma(w^\top x)$. It is not a continuous function (in $w$), so it is hard to optimize.

  11. Logistic regression: the loss (second idea): use the L2 loss, $L_2(\hat{y}, y) = \frac{1}{2}(y - \hat{y})^2$ with $\hat{y} = \sigma(w^\top x)$. Thanks to the squashing, the previous problem is resolved: the loss is continuous. Still a problem: it is hard to optimize (non-convex in $w$).

  12. Logistic regression: the loss (third idea): use the cross-entropy loss, $L_{CE}(\hat{y}, y) = -y \log(\hat{y}) - (1 - y)\log(1 - \hat{y})$ with $\hat{y} = \sigma(w^\top x)$. It is convex in $w$ and has a probabilistic interpretation (soon!).
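
A short numpy sketch of the cross-entropy loss (the example predictions are illustrative values of my own):

      import numpy as np

      def cross_entropy(y_hat, y):
          # L_CE(y_hat, y) = -y log(y_hat) - (1 - y) log(1 - y_hat), elementwise
          return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

      # with target y = 1: a confident correct prediction costs little, a confident mistake costs a lot
      print(cross_entropy(np.array([0.9, 0.6, 0.1]), np.array([1.0, 1.0, 1.0])))
      # approximately [0.105, 0.511, 2.303]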

  13. Cost function: we need to optimize the cost with respect to the parameters. First, simplify:
      $J(w) = \sum_{n=1}^{N} -y^{(n)} \log(\sigma(w^\top x^{(n)})) - (1 - y^{(n)}) \log(1 - \sigma(w^\top x^{(n)}))$.
      Substituting the logistic function, $\log\left(\frac{1}{1 + e^{-w^\top x}}\right) = -\log(1 + e^{-w^\top x})$ and $\log\left(1 - \frac{1}{1 + e^{-w^\top x}}\right) = \log\left(\frac{1}{1 + e^{w^\top x}}\right) = -\log(1 + e^{w^\top x})$.
      Simplified cost: $J(w) = \sum_{n=1}^{N} y^{(n)} \log(1 + e^{-w^\top x^{(n)}}) + (1 - y^{(n)}) \log(1 + e^{w^\top x^{(n)}})$.

  14. Cost function: implementing the simplified cost
      $J(w) = \sum_{n=1}^{N} y^{(n)} \log(1 + e^{-w^\top x^{(n)}}) + (1 - y^{(n)}) \log(1 + e^{w^\top x^{(n)}})$:

      import numpy as np

      def cost(w,   # D
               X,   # N x D
               y):  # N
          z = np.dot(X, w)   # N linear scores w^T x^(n)
          # mean over the N instances (the formula above is a sum; the 1/N factor only
          # rescales the cost and does not change the minimizer)
          J = np.mean(y * np.log1p(np.exp(-z)) + (1 - y) * np.log1p(np.exp(z)))
          return J

      Why not np.log(1 + np.exp(-z))? For small $\epsilon$, $\log(1 + \epsilon) = \epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3} - \ldots \approx \epsilon$, but computing $1 + \epsilon$ in floating point first loses $\epsilon$ entirely; np.log1p avoids this:

      In [3]: np.log(1 + 1e-100)
      Out[3]: 0.0
      In [4]: np.log1p(1e-100)
      Out[4]: 1e-100
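
The slide's version can still overflow inside np.exp(z) when $|w^\top x|$ is very large; a common alternative (my own addition, not from the slides) is np.logaddexp, which evaluates $\log(e^a + e^b)$ without forming the exponentials explicitly:

      import numpy as np

      def cost_stable(w, X, y):
          z = np.dot(X, w)   # N linear scores
          # log(1 + e^{-z}) = logaddexp(0, -z) and log(1 + e^{z}) = logaddexp(0, z)
          return np.mean(y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z))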

  15. Example: binary classification on the Iris flowers dataset (a classic dataset originally used by Fisher): $N_c = 50$ samples with $D = 4$ features for each of $C = 3$ species of Iris flower. Our setting: 2 classes (blue vs. the others) and one feature (petal width), plus the bias.
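
A possible setup for this example, assuming scikit-learn is available; which species plays the role of the slides' "blue" class is my guess, so treat the label choice as illustrative:

      import numpy as np
      from sklearn.datasets import load_iris

      iris = load_iris()
      petal_width = iris.data[:, 3]                                    # the single feature used here
      y = (iris.target == 0).astype(float)                             # one species vs. the other two
      X = np.column_stack([np.ones_like(petal_width), petal_width])    # N x 2: bias + petal width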

  16. Example: binary classification. We have two weights, associated with the bias and petal width. [figure: $J(w)$ as a function of these two weights, with the starting point $w = [0, 0]$ and the optimum $w^*$ marked; the fitted curve $\sigma(w_0^* + w_1^* x)$ plotted against $x$ (petal width).]

  17. Gradient: how do we find the optimal weights? (In contrast to linear regression, there is no closed-form solution.) Cost:
      $J(w) = \sum_{n=1}^{N} y^{(n)} \log(1 + e^{-w^\top x^{(n)}}) + (1 - y^{(n)}) \log(1 + e^{w^\top x^{(n)}})$.
      Taking the partial derivative,
      $\frac{\partial}{\partial w_d} J(w) = \sum_n -y^{(n)} \frac{e^{-w^\top x^{(n)}}}{1 + e^{-w^\top x^{(n)}}} x_d^{(n)} + (1 - y^{(n)}) \frac{e^{w^\top x^{(n)}}}{1 + e^{w^\top x^{(n)}}} x_d^{(n)} = \sum_n -y^{(n)} x_d^{(n)} (1 - \hat{y}^{(n)}) + (1 - y^{(n)}) \hat{y}^{(n)} x_d^{(n)} = \sum_n (\hat{y}^{(n)} - y^{(n)}) x_d^{(n)}$,
      so the gradient is $\nabla J(w) = \sum_n (\hat{y}^{(n)} - y^{(n)}) x^{(n)}$ with $\hat{y}^{(n)} = \sigma(w^\top x^{(n)})$.
      Compare to the gradient for linear regression, $\nabla J(w) = \sum_n (\hat{y}^{(n)} - y^{(n)}) x^{(n)}$ with $\hat{y}^{(n)} = w^\top x^{(n)}$: the same form, with a different $\hat{y}$.
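
With no closed form, the weights are found iteratively; here is a minimal full-batch gradient-descent sketch using the gradient above (the learning rate and iteration count are arbitrary illustrative values, not from the slides):

      import numpy as np

      def gradient(w, X, y):
          # nabla J(w) = sum_n (y_hat - y) x, with y_hat = sigma(w^T x)
          y_hat = 1.0 / (1.0 + np.exp(-np.dot(X, w)))
          return np.dot(X.T, y_hat - y)

      def fit_logistic(X, y, lr=0.01, n_iters=10000):
          w = np.zeros(X.shape[1])
          for _ in range(n_iters):
              w = w - lr * gradient(w, X, y)
          return w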

  18. Probabilistic view of logistic regression: $\hat{y} = p_w(y = 1 \mid x) = \frac{1}{1 + e^{-w^\top x}} = \sigma(w^\top x)$. The logit function is the inverse of the logistic: $\log \frac{\hat{y}}{1 - \hat{y}} = w^\top x$, i.e., the log-ratio of the class probabilities is linear. Likelihood: the probability of the data as a function of the model parameters,
      $L(w) = p_w(y^{(n)} \mid x^{(n)}) = \mathrm{Bernoulli}(y^{(n)}; \sigma(w^\top x^{(n)})) = (\hat{y}^{(n)})^{y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$,
      where $\hat{y}^{(n)}$ is the probability of $y^{(n)} = 1$. The likelihood is a function of $w$, not a probability distribution over $w$. Likelihood of the dataset:
      $L(w) = \prod_{n=1}^{N} p_w(y^{(n)} \mid x^{(n)}) = \prod_{n=1}^{N} (\hat{y}^{(n)})^{y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$.

  19. Maximum likelihood & logistic regression. Likelihood: $L(w) = \prod_{n=1}^{N} p_w(y^{(n)} \mid x^{(n)}) = \prod_{n=1}^{N} (\hat{y}^{(n)})^{y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$. Maximum likelihood: use the model that maximizes the likelihood of the observations, $w^* = \arg\max_w L(w)$. Because the likelihood is a product of many probabilities, it becomes numerically unmanageable for large $N$, so we work with the log-likelihood instead (same maximizer). Log-likelihood:
      $\max_w \sum_{n=1}^{N} \log p_w(y^{(n)} \mid x^{(n)}) = \max_w \sum_{n=1}^{N} y^{(n)} \log(\hat{y}^{(n)}) + (1 - y^{(n)}) \log(1 - \hat{y}^{(n)}) = \min_w J(w)$,
      the cross-entropy cost function! So using the cross-entropy loss in logistic regression is maximizing the conditional likelihood.
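
A quick numerical check of this equivalence on random data (everything below is illustrative and of my own construction): the negative log-likelihood coincides with the summed cross-entropy cost, so maximizing one minimizes the other.

      import numpy as np

      def neg_log_likelihood(w, X, y):
          # -sum_n [ y log(y_hat) + (1 - y) log(1 - y_hat) ], with y_hat = sigma(w^T x)
          y_hat = 1.0 / (1.0 + np.exp(-np.dot(X, w)))
          return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

      rng = np.random.default_rng(0)
      X = np.column_stack([np.ones(20), rng.normal(size=20)])
      y = rng.integers(0, 2, size=20).astype(float)
      w = rng.normal(size=2)
      z = X @ w
      summed_cost = np.sum(y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z))
      print(np.allclose(neg_log_likelihood(w, X, y), summed_cost))   # True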

  20. Maximum likelihood & linear regression: the squared error loss also has a maximum-likelihood interpretation. Conditional probability:
      $p_w(y \mid x) = \mathcal{N}(y \mid w^\top x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y - w^\top x)^2}{2\sigma^2}}$,
      with mean $\mu = w^\top x$, variance $\sigma^2$, and standard deviation $\sigma$ (not to be confused with the logistic function). Image: http://blog.nguyenvq.com/blog/2009/05/12/linear-regression-plot-with-normal-curves-for-error-sideways/

  21. Maximum likelihood & linear regression (continued). Conditional probability: $p_w(y \mid x) = \mathcal{N}(y \mid w^\top x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y - w^\top x)^2}{2\sigma^2}}$. Likelihood: $L(w) = \prod_{n=1}^{N} p_w(y^{(n)} \mid x^{(n)})$. Log-likelihood: $\ell(w) = -\frac{1}{2\sigma^2} \sum_n (y^{(n)} - w^\top x^{(n)})^2 + \text{constants}$. Optimal parameters: $w^* = \arg\max_w \ell(w) = \arg\min_w \frac{1}{2} \sum_n (y^{(n)} - w^\top x^{(n)})^2$: linear least squares! Image: http://blog.nguyenvq.com/blog/2009/05/12/linear-regression-plot-with-normal-curves-for-error-sideways/
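
An illustrative check on synthetic data (my own example, not from the slides): maximizing the Gaussian log-likelihood in $w$ is the same as minimizing the squared error, so the least-squares solution is also the maximum-likelihood estimate.

      import numpy as np

      rng = np.random.default_rng(0)
      X = np.column_stack([np.ones(50), rng.normal(size=50)])   # bias + one feature
      w_true = np.array([1.0, -2.0])
      y = X @ w_true + 0.3 * rng.normal(size=50)                # Gaussian noise around w^T x

      w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)           # arg min_w sum_n (y - w^T x)^2
      print(w_lstsq)   # close to w_true; this is also the Gaussian maximum-likelihood estimate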
