
Applied Machine Learning: Logistic and Softmax Regression (Siamak Ravanbakhsh, COMP 551, Fall 2020)



  1. Applied Machine Learning: Logistic and Softmax Regression. Siamak Ravanbakhsh, COMP 551 (Fall 2020)

  2. Learning objectives: what are linear classifiers; the logistic regression model; its loss function; the maximum likelihood view; multi-class classification.

  3. Motivation: logistic regression is the most commonly reported data science method used at work (source: 2017 Kaggle survey).

  4. Classification problem: a dataset of inputs $x^{(n)} \in \mathbb{R}^D$ and discrete targets $y^{(n)} \in \{0, \ldots, C\}$; for binary classification, $y^{(n)} \in \{0, 1\}$. Linear classification uses a linear decision boundary $w^\top x + b = 0$. How do we find these boundaries? Different approaches give different linear classifiers.

  5. Using linear regression (first idea): consider binary classification and fit a linear model to predict the label $y \in \{-1, 1\}$. Set the decision boundary at $w^\top x = 0$; given a new instance, assign the label accordingly: $y = 1$ if $w^\top x > 0$ and $y = -1$ if $w^\top x < 0$.
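
A minimal sketch of this first idea (toy data and variable names are my own, not from the slides): fit the +/-1 labels by least squares, then classify by the sign of $w^\top x$.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.c_[np.ones(100), rng.uniform(-5, 5, size=100)]  # bias column + one feature
    y = np.where(X[:, 1] < 2, 1, -1)                        # labels in {-1, +1}

    w, *_ = np.linalg.lstsq(X, y, rcond=None)               # linear regression fit
    y_pred = np.where(X @ w > 0, 1, -1)                     # decision rule: sign(w^T x)
    print((y_pred == y).mean())                             # fraction correctly classified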

  6. Using linear regression (first idea, continued): a correctly classified instance with $w^\top x^{(n)} = 100 > 0$ incurs an L2 loss of $(100 - 1)^2 = 99^2$, while an incorrectly classified instance with $w^\top x^{(n')} = -2 < 0$ incurs an L2 loss of only $(-2 - 1)^2 = 9$. A correct prediction can have a higher loss than an incorrect one! Solution: squash all positive instances together and all negative ones together.
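
To make the mismatch concrete, re-computing the slide's two numbers (for a true label $y = 1$):

    # Squared error of a confidently correct vs. a mildly wrong prediction:
    print((100 - 1) ** 2)   # 9801: correctly classified (w^T x = 100 > 0), huge loss
    print((-2 - 1) ** 2)    # 9:    misclassified (w^T x = -2 < 0), small loss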

  7. Logistic function. Idea: apply a squashing function to $w^\top x$, i.e. use $\sigma(w^\top x)$. Desirable properties of a function $\sigma: \mathbb{R} \to \mathbb{R}$: all $w^\top x > 0$ are squashed close together, and all $w^\top x < 0$ are squashed together. The logistic function $\sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$ has these properties. The decision boundary is $w^\top x = 0 \Leftrightarrow \sigma(w^\top x) = \frac{1}{2}$, so it is still a linear decision boundary.

  8. Logistic regression: model. $f_w(x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$, where $z = w^\top x$ is the logit and $\sigma$ is the logistic function (also called a squashing or activation function). [Figure: classifiers for different weights.]
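
A minimal sketch of the model in code (function names are mine; the 0.5 threshold corresponds to the decision boundary $w^\top x = 0$):

    import numpy as np

    def sigma(z):                        # logistic / squashing / activation function
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(w, X):             # w: D, X: N x D
        return sigma(X @ w)              # y_hat = sigma(w^T x), in (0, 1)

    def predict_label(w, X):
        return (predict_proba(w, X) >= 0.5).astype(int)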

  9. Logistic regression: model (example). Recall that we include a bias parameter by writing $x = [1, x_1]$. The input feature is generated uniformly in $[-5, 5]$; for all values less than 2 we have $y = 1$, and $y = 0$ otherwise. A good fit to this data is the one shown (green): $f_w(x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$ with $w \approx [9.1, -4.5]$, that is, $\hat{y} = \sigma(-4.5 x_1 + 9.1)$. What is our model's decision boundary?
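
A possible answer to the slide's question, working from the weights given above (my arithmetic): the boundary is where the logit is zero,

    \sigma(-4.5\,x_1 + 9.1) = \tfrac{1}{2}
      \;\Leftrightarrow\; -4.5\,x_1 + 9.1 = 0
      \;\Leftrightarrow\; x_1 = \tfrac{9.1}{4.5} \approx 2.02,

which is close to the threshold of 2 used to generate the data.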

  10. Logistic regression: the loss. To find a good model we need to define the cost (loss) function; the best model is the one with the lowest cost, and the cost is the sum of loss values for individual points. First idea: use the misclassification error $L_{0/1}(\hat{y}, y) = \mathbb{I}(y \neq \mathrm{sign}(\hat{y} - \tfrac{1}{2}))$ with $\hat{y} = \sigma(w^\top x)$. This is not a continuous function (in $w$) and is hard to optimize.

  11. Logistic regression: the loss. Second idea: use the L2 loss $L_2(\hat{y}, y) = \frac{1}{2}(y - \hat{y})^2$ with $\hat{y} = \sigma(w^\top x)$. Thanks to squashing, the previous problem is resolved and the loss is continuous. Still a problem: it is hard to optimize (non-convex in $w$).

  12. Logistic regression: the loss. Third idea: use the cross-entropy loss $L_{CE}(\hat{y}, y) = -y \log(\hat{y}) - (1 - y)\log(1 - \hat{y})$ with $\hat{y} = \sigma(w^\top x)$. It is convex in $w$ and has a probabilistic interpretation (soon!). Examples: $L_{CE}(y = 1, \hat{y} = .9) = -\log(.9)$ is smaller than $L_{CE}(y = 1, \hat{y} = .5) = -\log(.5)$, while $L_{CE}(y = 0, \hat{y} = .9) = -\log(.1)$ is larger than $L_{CE}(y = 0, \hat{y} = .5) = -\log(.5)$.
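
A small sketch that reproduces the example numbers (the helper name is mine):

    import numpy as np

    def cross_entropy(y_hat, y):
        return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

    print(cross_entropy(0.9, 1))   # ~0.105: confident and correct, small loss
    print(cross_entropy(0.5, 1))   # ~0.693
    print(cross_entropy(0.9, 0))   # ~2.303: confident and wrong, large loss
    print(cross_entropy(0.5, 0))   # ~0.693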

  13. Cost function (optional). We need to optimize the cost with respect to the parameters. First, simplify: $J(w) = \sum_{n=1}^{N} -y^{(n)} \log(\sigma(w^\top x^{(n)})) - (1 - y^{(n)})\log(1 - \sigma(w^\top x^{(n)}))$. Substituting the logistic function, $\log\left(\frac{1}{1 + e^{-w^\top x}}\right) = -\log(1 + e^{-w^\top x})$ and $\log\left(1 - \frac{1}{1 + e^{-w^\top x}}\right) = \log\left(\frac{1}{1 + e^{w^\top x}}\right) = -\log(1 + e^{w^\top x})$, which gives the simplified cost $J(w) = \sum_{n=1}^{N} y^{(n)} \log(1 + e^{-w^\top x^{(n)}}) + (1 - y^{(n)})\log(1 + e^{w^\top x^{(n)}})$.

  14. Cost function: implementing the simplified cost (optional). $J(w) = \sum_{n=1}^{N} y^{(n)} \log(1 + e^{-w^\top x^{(n)}}) + (1 - y^{(n)})\log(1 + e^{w^\top x^{(n)}})$.

    import numpy as np

    def cost(w,  # D
             x,  # N x D
             y   # N
             ):
        z = np.dot(x, w)  # N
        J = np.mean(y * np.log1p(np.exp(-z)) + (1 - y) * np.log1p(np.exp(z)))
        return J

(The code averages rather than sums; this scales the cost by 1/N and does not change the minimizer.) Why not np.log(1 + np.exp(-z))? Because $\log(1 + \epsilon)$ for small $\epsilon$ suffers from floating-point inaccuracies: np.log(1 + 1e-100) evaluates to 0.0, whereas np.log1p(1e-100) returns 1e-100, consistent with the expansion $\log(1 + \epsilon) = \epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3} - \ldots$
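
A hypothetical usage of the cost function above on small random data, assuming the import and the cost function from the snippet just shown (with all-zero weights every logit is 0, so the cost is $\log 2 \approx 0.693$):

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))        # N = 5, D = 3
    y = rng.integers(0, 2, size=5)     # binary labels
    w = np.zeros(3)
    print(cost(w, X, y))               # 0.6931...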

  15. Example: binary classification on the Iris flowers dataset (a classic dataset originally used by Fisher): $N_c = 50$ samples with $D = 4$ features for each of $C = 3$ species of Iris flower. Our setting: 2 classes (blue vs. others) and 1 feature (petal width), plus a bias.
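
A sketch of this setting, assuming scikit-learn's copy of the Iris data (the slide's "blue" class is one species plotted against the other two; here I arbitrarily take class 0):

    import numpy as np
    from sklearn.datasets import load_iris

    iris = load_iris()
    x_pw = iris.data[:, 3]                   # petal width (cm), the 4th feature
    X = np.c_[np.ones(len(x_pw)), x_pw]      # bias column + petal width
    y = (iris.target == 0).astype(int)       # one species vs. the rest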

  16. Example: binary classification. We have two weights, associated with the bias and petal width. [Figure: $J(w)$ as a function of these weights, showing the path from $w = [0, 0]$ to the optimum $w^*$, with axes $w_0$ (bias) and $w_1$; a second panel shows the fitted curve $\sigma(w_0^* + w_1^* x)$ against $x$ (petal width).]

  17. Gradient. How did we find the optimal weights? (In contrast to linear regression, there is no closed-form solution.) Cost: $J(w) = \sum_{n=1}^{N} y^{(n)} \log(1 + e^{-w^\top x^{(n)}}) + (1 - y^{(n)})\log(1 + e^{w^\top x^{(n)}})$. Taking the partial derivative: $\frac{\partial}{\partial w_d} J(w) = \sum_n -y^{(n)} \frac{e^{-w^\top x^{(n)}}}{1 + e^{-w^\top x^{(n)}}} x_d^{(n)} + (1 - y^{(n)}) \frac{e^{w^\top x^{(n)}}}{1 + e^{w^\top x^{(n)}}} x_d^{(n)} = \sum_n -x_d^{(n)} y^{(n)} (1 - \hat{y}^{(n)}) + (1 - y^{(n)}) \hat{y}^{(n)} x_d^{(n)} = \sum_n (\hat{y}^{(n)} - y^{(n)}) x_d^{(n)}$. So the gradient is $\nabla J(w) = \sum_n (\hat{y}^{(n)} - y^{(n)}) x^{(n)}$ with $\hat{y}^{(n)} = \sigma(w^\top x^{(n)})$. Compare to the gradient for linear regression, $\nabla J(w) = \sum_n (\hat{y}^{(n)} - y^{(n)}) x^{(n)}$ with $\hat{y}^{(n)} = w^\top x^{(n)}$.
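
A minimal gradient-descent sketch for this gradient (the learning rate and step count are arbitrary choices of mine, not from the slides, and may need tuning):

    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient(w, X, y):                  # grad J(w) = sum_n (y_hat^(n) - y^(n)) x^(n)
        return X.T @ (sigma(X @ w) - y)     # X: N x D, y: N  ->  D

    def fit(X, y, lr=0.001, n_steps=10000):
        w = np.zeros(X.shape[1])
        for _ in range(n_steps):
            w -= lr * gradient(w, X, y)     # plain full-batch gradient descent
        return w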

  18. Probabilistic view. Interpret the prediction as a class probability: $\hat{y} = p_w(y = 1 \mid x) = \sigma(w^\top x)$. The log-ratio of class probabilities is linear: $\log\frac{\hat{y}}{1 - \hat{y}} = \log\frac{\sigma(w^\top x)}{1 - \sigma(w^\top x)} = \log\frac{1}{e^{-w^\top x}} = w^\top x$, so the logit function is the inverse of the logistic. We therefore have a Bernoulli likelihood, $p(y^{(n)} \mid x^{(n)}; w) = \mathrm{Bernoulli}(y^{(n)}; \sigma(w^\top x^{(n)})) = (\hat{y}^{(n)})^{y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$, and the conditional likelihood of the labels given the inputs is $L(w) = \prod_{n=1}^{N} p(y^{(n)} \mid x^{(n)}; w) = \prod_{n=1}^{N} (\hat{y}^{(n)})^{y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$.
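
A quick numerical check of the "logit is the inverse of logistic" claim (a throwaway sketch, names mine):

    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logit(p):                              # log-odds
        return np.log(p / (1.0 - p))

    z = np.linspace(-5, 5, 11)
    print(np.allclose(logit(sigma(z)), z))     # True: the logit undoes the logistic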

  19. Maximum likelihood and logistic regression. Likelihood: $L(w) = \prod_{n=1}^{N} p(y^{(n)} \mid x^{(n)}; w) = \prod_{n=1}^{N} (\hat{y}^{(n)})^{y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$. Find the $w$ that maximizes the log-likelihood: $w^* = \arg\max_w \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}; w) = \arg\max_w \sum_{n=1}^{N} y^{(n)} \log(\hat{y}^{(n)}) + (1 - y^{(n)})\log(1 - \hat{y}^{(n)}) = \arg\min_w J(w)$, the cross-entropy cost function! So using the cross-entropy loss in logistic regression is maximizing the conditional likelihood. We saw a similar interpretation for linear regression (the L2 loss maximizes the conditional Gaussian likelihood).

  20. Multiclass classification. Using this probabilistic view we extend logistic regression to the multiclass setting. Binary classification has a Bernoulli likelihood, $\mathrm{Bernoulli}(y \mid \hat{y}) = \hat{y}^{y}(1 - \hat{y})^{1 - y}$, subject to $\hat{y} \in [0, 1]$; using the logistic function $\hat{y} = \sigma(z) = \sigma(w^\top x)$ ensures this. With $C$ classes we use a categorical likelihood, $\mathrm{Categorical}(y \mid \hat{y}) = \prod_{c=1}^{C} \hat{y}_c^{\,\mathbb{I}(y = c)}$, subject to $\sum_c \hat{y}_c = 1$, which is achieved using the softmax function.

  21. Softmax: a generalization of the logistic to more than 2 classes. The logistic $\sigma: \mathbb{R} \to (0, 1)$ produces a single probability; the probability of the second class is $1 - \sigma(z)$. The softmax maps $\mathbb{R}^C \to \Delta^C$ (recall the probability simplex: $p \in \Delta^C \Rightarrow \sum_{c=1}^{C} p_c = 1$): $\hat{y}_c = \mathrm{softmax}(z)_c = \frac{e^{z_c}}{\sum_{c'=1}^{C} e^{z_{c'}}}$, so $\sum_c \hat{y}_c = 1$. Example: $\mathrm{softmax}([1, 1, 2, 0]) = \left[\frac{e}{2e + e^2 + 1}, \frac{e}{2e + e^2 + 1}, \frac{e^2}{2e + e^2 + 1}, \frac{1}{2e + e^2 + 1}\right]$, and $\mathrm{softmax}([10, 100, -1]) \approx [0, 1, 0]$: if the input values are large, softmax becomes similar to argmax. So, like the logistic, this is also a squashing function.
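
A small softmax sketch that reproduces the two examples (subtracting the max before exponentiating is a standard overflow guard of mine, not something the slide does; it does not change the result):

    import numpy as np

    def softmax(z):
        z = np.asarray(z, dtype=float)
        e = np.exp(z - z.max())                # shift for numerical stability
        return e / e.sum()

    print(softmax([1, 1, 2, 0]))               # approx [0.197, 0.197, 0.534, 0.072], sums to 1
    print(softmax([10, 100, -1]))              # approx [0, 1, 0]: close to argmax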

  22. Multiclass classification. With $C$ classes we have the categorical likelihood $\mathrm{Categorical}(y \mid \hat{y}) = \prod_{c=1}^{C} \hat{y}_c^{\,\mathbb{I}(y = c)}$, using softmax to enforce the sum-to-one constraint: $\hat{y}_c = \mathrm{softmax}([w_1^\top x, \ldots, w_C^\top x])_c = \frac{e^{w_c^\top x}}{\sum_{c'} e^{w_{c'}^\top x}}$. So we have one parameter vector for each class. To simplify the equations we write $z_c = w_c^\top x$, giving $\hat{y}_c = \mathrm{softmax}([z_1, \ldots, z_C])_c = \frac{e^{z_c}}{\sum_{c'} e^{z_{c'}}}$.
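
In code, with one weight vector per class stacked as the columns of a D x C matrix W (my own layout choice), the class probabilities for a batch of inputs are, as a sketch:

    import numpy as np

    def predict_proba_multiclass(W, X):          # W: D x C, X: N x D
        Z = X @ W                                # N x C logits, z_c = w_c^T x
        Z = Z - Z.max(axis=1, keepdims=True)     # stability shift, leaves softmax unchanged
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)  # each row sums to 1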

  23. Likelihood for multiclass classification. With $C$ classes, the categorical likelihood is $\mathrm{Categorical}(y \mid \hat{y}) = \prod_{c=1}^{C} \hat{y}_c^{\,\mathbb{I}(y = c)}$, using softmax to enforce the sum-to-one constraint, where $\hat{y}_c = \mathrm{softmax}([z_1, \ldots, z_C])_c = \frac{e^{z_c}}{\sum_{c'} e^{z_{c'}}}$ and $z_c = w_c^\top x$. Substituting softmax into the categorical likelihood gives $L(\{w_c\}) = \prod_{n=1}^{N} \prod_{c=1}^{C} \mathrm{softmax}([z_1^{(n)}, \ldots, z_C^{(n)}])_c^{\,\mathbb{I}(y^{(n)} = c)} = \prod_{n=1}^{N} \prod_{c=1}^{C} \left(\frac{e^{z_c^{(n)}}}{\sum_{c'} e^{z_{c'}^{(n)}}}\right)^{\mathbb{I}(y^{(n)} = c)}$.
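
The negative log of this likelihood is the softmax cross-entropy cost. A sketch, assuming one-hot labels Y of shape N x C and the same D x C weight layout as above (the log-softmax rearrangement is mine, for numerical stability):

    import numpy as np

    def nll(W, X, Y):                          # W: D x C, X: N x D, Y: N x C one-hot
        Z = X @ W                              # logits z_c^(n) = w_c^T x^(n)
        Z = Z - Z.max(axis=1, keepdims=True)   # stability shift
        logP = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))   # log softmax
        return -np.sum(Y * logP)               # -sum_n sum_c I(y^(n) = c) log y_hat_c^(n)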
