NLP Programming Tutorial 6 – Advanced Discriminative Learning

Graham Neubig
Nara Institute of Science and Technology (NAIST)
Review: Classifiers and the Perceptron
Prediction Problems

Given x, predict y
Example we will use:
● Given an introductory sentence from Wikipedia
● Predict whether the article is about a person

Given: “Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods.” → Predict: Yes!
Given: “Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture.” → Predict: No!

● This is binary classification
Mathematical Formulation

y = sign(w ⋅ ϕ(x)) = sign(∑_{i=1}^{I} w_i ⋅ ϕ_i(x))

● x: the input
● ϕ(x): vector of feature functions {ϕ_1(x), ϕ_2(x), …, ϕ_I(x)}
● w: the weight vector {w_1, w_2, …, w_I}
● y: the prediction, +1 if “yes”, -1 if “no”
● (sign(v) is +1 if v ≥ 0, -1 otherwise)
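A minimal Python sketch of this prediction rule, assuming the feature vector ϕ(x) is stored as a sparse dict from feature names to values (the function name predict_one matches the pseudocode used later in this tutorial):

    def predict_one(w, phi):
        # score = w . phi(x); predict +1 if it is non-negative, -1 otherwise
        score = sum(w.get(name, 0) * value for name, value in phi.items())
        return 1 if score >= 0 else -1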
Online Learning

create map w
for I iterations
    for each labeled pair x, y in the data
        phi = create_features(x)
        y' = predict_one(w, phi)
        if y' != y
            update_weights(w, phi, y)

● In other words:
    ● Try to classify each training example
    ● Every time we make a mistake, update the weights
● Many different online learning algorithms exist
    ● The simplest is the perceptron
Perceptron Weight Update

w ← w + y ϕ(x)

● In other words:
    ● If y = 1, increase the weights for features in ϕ(x)
        – Features for positive examples get a higher weight
    ● If y = -1, decrease the weights for features in ϕ(x)
        – Features for negative examples get a lower weight
→ Every time we update, our predictions get better!

update_weights(w, phi, y)
    for name, value in phi:
        w[name] += value * y
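Putting the loop and the update together, here is one possible runnable version in Python; the unigram create_features extractor and the training-data format are illustrative assumptions, not fixed by the slides, and predict_one is the sketch from above:

    from collections import defaultdict

    def create_features(x):
        # assumed unigram features: count of each word in the sentence
        phi = defaultdict(int)
        for word in x.split():
            phi["UNI:" + word] += 1
        return phi

    def update_weights(w, phi, y):
        # w <- w + y * phi(x)
        for name, value in phi.items():
            w[name] += value * y

    def train_perceptron(data, iterations):
        # data: list of (x, y) pairs with y in {+1, -1}
        w = defaultdict(float)
        for _ in range(iterations):
            for x, y in data:
                phi = create_features(x)
                if predict_one(w, phi) != y:   # update only on mistakes
                    update_weights(w, phi, y)
        return w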
Stochastic Gradient Descent and Logistic Regression
Perceptron and Probabilities

● Sometimes we want the probability P(y|x)
    ● Estimating confidence in predictions
    ● Combining with other systems
● However, the perceptron only gives us a prediction:

y = sign(w ⋅ ϕ(x))

In other words, P(y|x) is a step function of w ⋅ ϕ(x):
P(y = 1|x) = 1 if w ⋅ ϕ(x) ≥ 0
P(y = 1|x) = 0 if w ⋅ ϕ(x) < 0

[Figure: p(y|x) as a step function of w ⋅ ϕ(x)]
The Logistic Function

● The logistic function is a “softened” version of the function used in the perceptron:

P(y = 1|x) = e^(w ⋅ ϕ(x)) / (1 + e^(w ⋅ ϕ(x)))

[Figure: the perceptron's step function vs. the smooth logistic function, p(y|x) plotted against w ⋅ ϕ(x)]

● Can account for uncertainty
● Differentiable
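A quick numeric check of the logistic function in Python (the input values are just illustrative):

    from math import exp

    def logistic(wx):
        # P(y=1|x) = e^(w.phi(x)) / (1 + e^(w.phi(x)))
        return exp(wx) / (1 + exp(wx))

    print(logistic(0.0))    # 0.5    -> completely uncertain
    print(logistic(4.0))    # ~0.982 -> fairly confident "yes"
    print(logistic(-4.0))   # ~0.018 -> fairly confident "no"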
Logistic Regression

● Train based on conditional likelihood
● Find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:

ŵ = argmax_w ∏_i P(y_i | x_i; w)

● How do we solve this?
Stochastic Gradient Descent

● Online training algorithm for probabilistic models (including logistic regression)

create map w
for I iterations
    for each labeled pair x, y in the data
        w += α * dP(y|x)/dw

● In other words:
    ● For every training example, calculate the gradient (the direction that will increase the probability of y)
    ● Move in that direction, multiplied by learning rate α
Gradient of the Logistic Function

● Take the derivative of the probability:

d/dw P(y = 1|x) = d/dw [ e^(w ⋅ ϕ(x)) / (1 + e^(w ⋅ ϕ(x))) ]
                = ϕ(x) e^(w ⋅ ϕ(x)) / (1 + e^(w ⋅ ϕ(x)))²

d/dw P(y = -1|x) = d/dw [ 1 − e^(w ⋅ ϕ(x)) / (1 + e^(w ⋅ ϕ(x))) ]
                 = −ϕ(x) e^(w ⋅ ϕ(x)) / (1 + e^(w ⋅ ϕ(x)))²

[Figure: dp(y|x)/dw ⋅ ϕ(x) plotted against w ⋅ ϕ(x); the gradient is largest (0.25) at w ⋅ ϕ(x) = 0 and approaches 0 far from it]
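Both cases can be computed with one expression, since they differ only in sign, which matches the sign of y. A small sketch of this in Python; the dict-based sparse gradient is an assumption about data structures, in line with the pseudocode elsewhere in the tutorial:

    from math import exp

    def gradient(w, phi, y):
        # dP(y|x)/dw = y * e^(w.phi) / (1 + e^(w.phi))^2 * phi(x), for y in {+1, -1}
        wx = sum(w.get(name, 0) * value for name, value in phi.items())
        coeff = y * exp(wx) / (1 + exp(wx)) ** 2
        return {name: coeff * value for name, value in phi.items()}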
Example: Initial Update

● Set α = 1, initialize w = 0

y = -1
x = A site , located in Maizuru , Kyoto

w ⋅ ϕ(x) = 0
d/dw P(y = -1|x) = −ϕ(x) e^0 / (1 + e^0)² = −0.25 ϕ(x)
w ← w + (−0.25) ϕ(x)

w_unigram “A” = -0.25       w_unigram “site” = -0.25
w_unigram “,” = -0.5        w_unigram “located” = -0.25
w_unigram “in” = -0.25      w_unigram “Maizuru” = -0.25
w_unigram “Kyoto” = -0.25
Example: Second Update

y = 1
x = Shoken , monk born in Kyoto
(current weights: w_unigram “,” = -0.5, w_unigram “in” = -0.25, w_unigram “Kyoto” = -0.25)

w ⋅ ϕ(x) = -1
d/dw P(y = 1|x) = ϕ(x) e^(-1) / (1 + e^(-1))² = 0.196 ϕ(x)
w ← w + 0.196 ϕ(x)

w_unigram “A” = -0.25       w_unigram “site” = -0.25
w_unigram “,” = -0.304      w_unigram “located” = -0.25
w_unigram “in” = -0.054     w_unigram “Maizuru” = -0.25
w_unigram “Kyoto” = -0.054  w_unigram “Shoken” = 0.196
w_unigram “monk” = 0.196    w_unigram “born” = 0.196
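The two updates above can be reproduced directly in Python. This sketch assumes the same unigram features and uses the combined gradient formula from the previous slide; feature names such as "UNI:Kyoto" are an illustrative convention:

    from collections import defaultdict
    from math import exp

    def create_features(x):
        phi = defaultdict(int)
        for word in x.split():
            phi["UNI:" + word] += 1
        return phi

    def sgd_update(w, phi, y, alpha):
        # move w in the direction of dP(y|x)/dw, scaled by the learning rate alpha
        wx = sum(w[name] * value for name, value in phi.items())
        coeff = y * exp(wx) / (1 + exp(wx)) ** 2
        for name, value in phi.items():
            w[name] += alpha * coeff * value

    w = defaultdict(float)
    sgd_update(w, create_features("A site , located in Maizuru , Kyoto"), -1, 1.0)
    sgd_update(w, create_features("Shoken , monk born in Kyoto"), +1, 1.0)
    print(w["UNI:,"], w["UNI:Kyoto"])   # roughly -0.304 and -0.054, matching the slide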
SGD Learning Rate?

● How do we set the learning rate α?
● Usually decay it over time:

α = 1 / (C + t)

where C is a parameter and t is the number of samples seen so far.

● Or, use held-out data, and reduce the learning rate when the held-out likelihood stops improving
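As a concrete illustration of the decay schedule (the values of C and t below are arbitrary examples):

    def learning_rate(C, t):
        # alpha = 1 / (C + t): shrinks as the number of samples t seen so far grows
        return 1.0 / (C + t)

    print(learning_rate(5, 0))    # 0.2
    print(learning_rate(5, 95))   # 0.01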
Classification Margins
Choosing between Equally Accurate Classifiers

● Which classifier is better? Dotted or Dashed?

[Figure: two linear classifiers, one dotted and one dashed, both separating the O and X points perfectly]

● Answer: Probably the dashed line.
● Why?: It has a larger margin.
What is a Margin?

● The distance between the classification plane and the nearest example:

[Figure: a separating plane with the margin drawn to the closest O and X points]
Support Vector Machines

● Most famous margin-based classifier
    ● Hard Margin: Explicitly maximize the margin
    ● Soft Margin: Allow for some mistakes
● Usually use batch learning
    ● Batch learning: slightly higher accuracy, more stable
    ● Online learning: simpler, less memory, faster convergence
● Learn more about SVMs:
  http://disi.unitn.it/moschitti/material/Interspeech2010-Tutorial.Moschitti.pdf
● Batch learning libraries: LIBSVM, LIBLINEAR, SVMlight
Online Learning with a Margin

● Penalize not only mistakes, but also correct answers under a margin

create map w
for I iterations
    for each labeled pair x, y in the data
        phi = create_features(x)
        val = w * phi * y
        if val <= margin
            update_weights(w, phi, y)

● (A correct classifier will always make w * phi * y > 0)
● If margin = 0, this is the perceptron algorithm
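A possible runnable version of this margin-based training loop in Python; the training-data format (precomputed sparse feature dicts) and the margin value are assumptions for the sketch:

    from collections import defaultdict

    def train_with_margin(data, iterations, margin):
        # data: list of (phi, y) pairs, phi a sparse feature dict, y in {+1, -1}
        w = defaultdict(float)
        for _ in range(iterations):
            for phi, y in data:
                val = sum(w[name] * value for name, value in phi.items()) * y
                if val <= margin:   # penalize mistakes and low-confidence correct answers
                    for name, value in phi.items():
                        w[name] += value * y
        return w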
Regularization
Cannot Distinguish Between Large and Small Classifiers

● For these examples:
    -1  he saw a bird in the park
    +1  he saw a robbery in the park
● Which classifier is better?

Classifier 1:               Classifier 2:
    he       +3                 bird     -1
    saw      -5                 robbery  +1
    a        +0.5
    bird     -1
    robbery  +1
    in       +5
    the      -3
    park     -2

● Probably classifier 2! It doesn't use irrelevant information.
Regularization

● A penalty on adding extra weights
● L2 regularization:
    ● Big penalty on large weights, small penalty on small weights
    ● High accuracy
● L1 regularization:
    ● Uniform increase whether large or small
    ● Will cause many weights to become zero → small model

[Figure: the L1 and L2 penalties plotted as a function of the weight value]
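For reference, the two penalties in code form, applied to a sparse weight dict (in practice the penalty is added to the training objective, which this sketch does not show):

    def l1_penalty(w):
        # sum of absolute values: every non-zero weight pays the same marginal cost
        return sum(abs(value) for value in w.values())

    def l2_penalty(w):
        # sum of squares: large weights are penalized much more heavily than small ones
        return sum(value ** 2 for value in w.values())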
L1 Regularization in Online Learning

● After each update, reduce every weight by a constant c

update_weights(w, phi, y, c)
    for name, value in w:
        if abs(value) < c:              # if abs. value < c, set weight to zero
            w[name] = 0
        else:
            w[name] -= sign(value) * c  # if value > 0, decrease by c; if value < 0, increase by c
    for name, value in phi:
        w[name] += value * y
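A runnable Python version of this L1-regularized update; plain dicts are assumed, and the sign helper is defined here because Python has no built-in sign function:

    def sign(value):
        return 1 if value >= 0 else -1

    def update_weights_l1(w, phi, y, c):
        # first shrink every existing weight toward zero by c, clipping at zero ...
        for name in list(w):
            if abs(w[name]) < c:
                w[name] = 0
            else:
                w[name] -= sign(w[name]) * c
        # ... then apply the usual perceptron-style update for this example
        for name, value in phi.items():
            w[name] = w.get(name, 0) + value * y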