

  1. SI425 : NLP Set 6 Logistic Regression Fall 2020 : Chambers

  2. Last time • Naive Bayes Classifier: Given X, what is the most probable Y? $Y_{new} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$

  3. Problems with Naive Bayes $Y \leftarrow \arg\max_{y_k} P(Y = y_k)\, P(X \mid Y = y_k)$ • It assumes all n-grams are independent of each other. Wrong! • Example : Shakespeare has unique unigrams like: doth, till, morrow, oft, shall, methinks • Each unigram votes for Shakespeare, making the prediction over-confident. • Analogy: ask your 10 friends for an opinion and they all vote the same way, which seems confident, but their opinions already informed each other through prior conversations.
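
A minimal Python sketch of the Naive Bayes decision rule from the previous slide, assuming priors and (already smoothed) per-class n-gram probabilities have been estimated elsewhere; the function and dictionary names are illustrative, not from the slides:

    import math

    def naive_bayes_predict(ngrams, class_priors, ngram_probs):
        """Pick the class maximizing P(Y=y_k) * prod_i P(X_i | Y=y_k), computed in log space.

        class_priors: {label: P(Y=label)}
        ngram_probs:  {label: {ngram: P(ngram | label)}}, already smoothed
        """
        best_label, best_score = None, float("-inf")
        for label, prior in class_priors.items():
            score = math.log(prior)
            for g in ngrams:
                # every n-gram "votes" independently, which is exactly the over-confidence
                # problem described on slide 3
                score += math.log(ngram_probs[label].get(g, 1e-10))
            if score > best_score:
                best_label, best_score = label, score
        return best_label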

  4. Alternative to Naive Bayes? • We want a model that doesn't assume independence between the inputs. • Ideally, give weight to an n-gram that helps improve accuracy, but give it less weight if other n-grams overlap with that same correct prediction. • Solution : Logistic Regression, also known as Maximum Entropy (MaxEnt), multinomial logistic regression, a log-linear model, or a single-layer neural network.

  5. Let's talk about features • All inputs to Logistic Regression are features . • So far we've counted n-grams, so think of each n-gram as a feature . • Define a feature function over the text x: $f_i(x)$ • Each unique n-gram has a feature index i • The function's value is the n-gram's count.

  6. Feature Example • X1 = "the lady doth protest too much methinks" - Shakespeare • X2 = "it was the best of times it was the worst of times" - Dickens • Suppose $f_7$ is the unigram 'the' and $f_{238}$ is the bigram 'the best'. Then $f_7(x_1) = 1$, $f_{238}(x_1) = 0$, $f_7(x_2) = 2$, $f_{238}(x_2) = 1$.
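
As a concrete sketch of the feature function $f_i(x)$, the snippet below counts unigrams and bigrams in a text; it uses the n-gram tuple itself as the feature index, which stands in for the numeric indices ($f_7$, $f_{238}$) on the slide:

    from collections import Counter

    def ngram_features(text, n_values=(1, 2)):
        """Map a text to a dictionary of n-gram counts."""
        tokens = text.lower().split()
        feats = Counter()
        for n in n_values:
            for j in range(len(tokens) - n + 1):
                feats[tuple(tokens[j:j + n])] += 1
        return feats

    x2 = ngram_features("it was the best of times it was the worst of times")
    # x2[('the',)] == 2 and x2[('the', 'best')] == 1, matching f_7(x2) = 2 and f_238(x2) = 1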

  7. Weights • Once you have features, you just need weights • We want a score for each class label, $score(x, c) = \sum_i w_{i,c}\, f_i(x)$ :
                      Shakespeare   Dickens
       f_1(x) = 1        1.31        -0.23
       f_2(x) = 2        0.49         0.72
       f_3(x) = 1       -0.82         0.10
       score(x, c)       1.47         1.31

  8. Weights • $score(x, \text{Shakespeare}) = 1.47$ and $score(x, \text{Dickens}) = 1.31$, where $score(x, c) = \sum_i w_{i,c} f_i(x)$ • But we want probabilities, right? First attempt: $P(c \mid x) = \frac{\sum_i w_{i,c} f_i(x)}{Z}$ with $Z = \sum_{c'} \sum_i w_{i,c'} f_i(x)$ • And for easier math later, and a nice [0,1] range, wrap the score in exp(): $P(c \mid x) = \frac{\exp(\sum_i w_{i,c} f_i(x))}{Z}$ with $Z = \sum_{c'} \exp(\sum_i w_{i,c'} f_i(x))$
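
A sketch of the scoring and normalization steps above, with features and per-class weights stored as dictionaries; the helper names are mine, not the course's:

    import math

    def score(feats, weights):
        """score(x, c) = sum_i w_{i,c} * f_i(x) for one class's weight vector."""
        return sum(weights.get(f, 0.0) * count for f, count in feats.items())

    def class_probabilities(feats, class_weights):
        """Softmax: P(c|x) = exp(score(x, c)) / Z, where Z sums exp(score) over all classes."""
        scores = {c: score(feats, w) for c, w in class_weights.items()}
        z = sum(math.exp(s) for s in scores.values())
        return {c: math.exp(s) / z for c, s in scores.items()}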

  9. Logistic Regression • Logistic Regression is just a vector of weights multiplied by your n-gram vector of counts: $P(c \mid x) = \frac{1}{Z} \exp\left(\sum_i w_{i,c} f_i(x)\right)$ • (and normalize by Z to get probabilities)

  10. Logistic Regression • "it was the best of times it was the worst of times" - Dickens. Counts and per-class weights for each unigram feature:
                       it    was   the   best   of    he    she  times  pizza   ok   worst
       f(x)             2     1     2     1     2     0     0     2      0      0     1
       Dickens w      -0.1   0.05  0.0   0.42  0.12  0.3   0.2   1.1   -1.5   -0.2   0.3
       Shakespeare w   0.03  0.21 -0.03 -0.32  0.01  0.23  0.41 -0.2   -2.1    0.0   0.18
       Where do these weights come from?
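
Plugging the slide's illustrative counts and weights into $score(x, c) = \sum_i w_{i,c} f_i(x)$, as a quick check (a sketch only; the unigram strings are used as dictionary keys):

    f = {'it': 2, 'was': 1, 'the': 2, 'best': 1, 'of': 2, 'he': 0, 'she': 0,
         'times': 2, 'pizza': 0, 'ok': 0, 'worst': 1}
    w = {'dickens':     {'it': -0.1, 'was': 0.05, 'the': 0.0, 'best': 0.42, 'of': 0.12,
                         'he': 0.3, 'she': 0.2, 'times': 1.1, 'pizza': -1.5, 'ok': -0.2, 'worst': 0.3},
         'shakespeare': {'it': 0.03, 'was': 0.21, 'the': -0.03, 'best': -0.32, 'of': 0.01,
                         'he': 0.23, 'she': 0.41, 'times': -0.2, 'pizza': -2.1, 'ok': 0.0, 'worst': 0.18}}
    scores = {c: sum(wc[g] * f[g] for g in f) for c, wc in w.items()}
    # scores come out to roughly 3.01 for Dickens and -0.31 for Shakespeare, so Dickens wins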

  11. Learning in Logistic Regression • We need to learn the weights • Goal : choose weights that give the "best results", i.e. the weights that give the "least error" • Loss function : measures how wrong our predictions are: $Loss(y) = -\sum_{k=1}^{K} 1\{y = k\} \log p(y = k \mid x)$ • Example! $Loss(dickens) = -\log p(dickens \mid x)$, which is 0.0 when $p(y \mid x) = 1.0$

  12. Learning in Logistic Regression • Goal : choose weights that give the "least error": $Loss(y) = -\sum_{k=1}^{K} 1\{y = k\} \log p(y = k \mid x)$ • Choose weights that give probabilities close to 1.0 to each of the correct labels. But how???
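
The loss on a single example, as a small sketch: the indicator sum in the slide's formula reduces to the negative log probability of the gold label (this reuses the hypothetical class_probabilities helper from the earlier snippet):

    import math

    def cross_entropy_loss(probs, gold_label):
        """Loss(y) = -log p(y = gold | x); it is 0.0 exactly when the model puts probability 1.0 on the gold label."""
        return -math.log(probs[gold_label])

    # e.g. cross_entropy_loss(class_probabilities(feats, class_weights), 'dickens')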

  13. Learning in Logistic Regression • Gradient descent : how to update the weights 1. Find the slope of each $w_i$ (take its partial derivative, of course!) 2. Move a small step against the slope, downhill on the loss. 3. Update all weights. 4. Recalculate the loss function. 5. Repeat.

  14. Learning in Logistic Regression • Gradient descent : how to update the weights Another description with lots of hand waving: 1. Initialize the weights randomly 2. Compute probabilities for all data 3. Jiggle the weights up and down based on mistakes 4. Repeat

  15. Learning in Logistic Regression • Weight updates: $\frac{\partial L}{\partial w_k} = (p(y = k \mid x) - 1\{y = k\})\, x_k$, where $x_k$ is the feature value and $1\{y = k\}$ is 1 or 0 • The update: $w_k \leftarrow w_k - \alpha \frac{\partial L}{\partial w_k}$ • It's easier than it looks. Compare your probability to the correct answer, and update the weight based on how far off your probability was.
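
A sketch of one stochastic gradient step implementing the update rule above, again using the dictionary-of-weights representation from the earlier snippets (the names are illustrative):

    def sgd_step(feats, gold_label, class_weights, probs, lr=0.1):
        """Apply w_{i,c} -= lr * (p(c|x) - 1{gold = c}) * f_i(x) for every class and feature.

        probs is the model's current P(c|x) for this example; class_weights is updated in place.
        """
        for c, weights in class_weights.items():
            error = probs[c] - (1.0 if c == gold_label else 0.0)
            for feat, value in feats.items():
                weights[feat] = weights.get(feat, 0.0) - lr * error * value

    # A training epoch loops over (text, label) pairs, recomputes
    # probs = class_probabilities(feats, class_weights) for each example,
    # calls sgd_step, and repeats until the loss stops improving.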

  16. Summary: Logistic Regression • Optimizes P( Y | X ) directly • You define the features (usually n-gram counts) • It learns a vector of weights for each Y value • Gradient descent: update weights based on error • Multiply the feature vector by the weight vector • Output is P(Y=y | X) after normalizing • Choose the most probable Y
