  1. Natural Language Processing Info 159/259 
 Lecture 10: Sequence Labeling 1 (Sept 26, 2017) David Bamman, UC Berkeley

  2. POS tagging. Labeling the tag that's correct for the context. Example sentences: "Fruit flies like a banana" and "Time flies like an arrow", each shown with candidate Penn Treebank tags for every word (NNP, IN, FW, JJ, SYM, VBZ, VB, LS, NN, VBP, DT). (Just tags in evidence within the Penn Treebank; more are possible!)

  3. Named entity recognition. Example: "tim cook is the ceo of apple", with "tim cook" tagged PERS and "apple" tagged ORG. 3- or 4-class tag set: person, location, organization, (misc). 7-class tag set: person, location, organization, time, money, percent, date.

  4. Supersense tagging. Example: "The station wagons arrived at noon, a long shining line that coursed through the west campus.", with supersense labels over the content words: "station wagons" = artifact, "arrived" = motion, "noon" = time, "line" = group, "coursed" = motion, "west campus" = location. Noun supersenses (Ciaramita and Altun 2003).

  5. Book segmentation

  6. Sequence labeling. x = \{x_1, \ldots, x_n\}, y = \{y_1, \ldots, y_n\}. For a set of inputs x with n sequential time steps, there is one corresponding label y_i for each x_i.

  7. Majority class. Pick the label each word is seen most often with in the training data. Example: "fruit flies like a banana", with each word's training-data tag counts shown beneath it (e.g., "a" is DT 25820 times, and "like" is most often IN, 533 times).
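A minimal Python sketch of this baseline, assuming the training data is available as (word, tag) pairs (the function and variable names here are illustrative, not from the lecture):

    from collections import Counter, defaultdict

    def train_majority_class(tagged_words):
        """tagged_words: iterable of (word, tag) pairs from the training data."""
        counts = defaultdict(Counter)
        for word, tag in tagged_words:
            counts[word][tag] += 1
        # For each word, keep the tag it is seen most often with.
        return {word: tag_counts.most_common(1)[0][0]
                for word, tag_counts in counts.items()}

    def tag_majority_class(words, word_to_tag, default_tag="NN"):
        # Unseen words fall back to a default tag (NN here is an arbitrary choice).
        return [word_to_tag.get(w, default_tag) for w in words]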

  8. Naive Bayes. Treat each prediction as independent of the others:

      P(y \mid x) = \frac{P(y)\, P(x \mid y)}{\sum_{y' \in Y} P(y')\, P(x \mid y')}

      P(\text{VBZ} \mid \text{flies}) = \frac{P(\text{VBZ})\, P(\text{flies} \mid \text{VBZ})}{\sum_{y' \in Y} P(y')\, P(\text{flies} \mid y')}

  Reminder: how do we learn P(y) and P(x | y) from training data?
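As a sketch of how this posterior could be computed from count-based (MLE) estimates, assuming hypothetical count containers tag_counts[y] = c(y) and word_tag_counts[y][word] = c(word, y) built from the training data:

    def naive_bayes_posterior(word, tag_counts, word_tag_counts):
        """P(y | word), proportional to P(y) * P(word | y), from MLE counts."""
        total = sum(tag_counts.values())
        scores = {}
        for tag, n_tag in tag_counts.items():
            prior = n_tag / total                                    # P(y)
            likelihood = word_tag_counts[tag].get(word, 0) / n_tag   # P(word | y)
            scores[tag] = prior * likelihood
        z = sum(scores.values())
        return {tag: s / z for tag, s in scores.items()} if z > 0 else scores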

  9. Logistic regression. Treat each prediction as independent of the others, but condition on a much more expressive set of features:

      P(y \mid x; \beta) = \frac{\exp(x^\top \beta_y)}{\sum_{y' \in Y} \exp(x^\top \beta_{y'})}

      P(\text{VBZ} \mid \text{flies}) = \frac{\exp(x^\top \beta_{\text{VBZ}})}{\sum_{y' \in Y} \exp(x^\top \beta_{y'})}

  10. Discriminative features. Features are scoped over the entire observed input. Example (tagging "flies" in "Fruit flies like a banana"): the feature x_i = flies has value 1; x_i = car has value 0; x_{i-1} = fruit has value 1; x_{i+1} = like has value 1.
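A small sketch of this kind of per-position feature extraction (the feature-name strings are illustrative):

    def position_features(words, i):
        """Binary features for predicting the tag of words[i],
        scoped over the entire observed input."""
        feats = {"x_i=" + words[i]: 1}
        if i > 0:
            feats["x_i-1=" + words[i - 1]] = 1
        if i < len(words) - 1:
            feats["x_i+1=" + words[i + 1]] = 1
        return feats

    # position_features("Fruit flies like a banana".split(), 1)
    # -> {'x_i=flies': 1, 'x_i-1=Fruit': 1, 'x_i+1=like': 1}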

  11. Sequences. Models that make independent predictions for elements in a sequence can reason over expressive representations of the input x (including correlations among inputs at different time steps x_i and x_j). But they don't capture another important source of information: correlations in the labels y.

  12. Sequences. Example: "Time flies like an arrow", with the candidate tags IN, JJ, VB, NN, VBZ, and VBP shown over the ambiguous words.

  13. Sequences. Most common tag bigrams in Penn Treebank training:
      DT NN   41909
      NNP NNP 37696
      NN IN   35458
      IN DT   35006
      JJ NN   29699
      DT JJ   19166
      NN NN   17484
      NN ,    16352
      IN NNP  15940
      NN .    15548
      JJ NNS  15297
      NNS IN  15146
      TO VB   13797
      NNP ,   13683
      IN NN   11565

  14. Sequences. x = time flies like an arrow; y = NN VBZ IN DT NN. We want P(y = NN VBZ IN DT NN | x = time flies like an arrow).

  15. Generative vs. discriminative models. Generative models specify a joint distribution over the labels and the data; with this you could generate new data: P(x, y) = P(y) P(x | y). Discriminative models specify the conditional distribution of the label y given the data x; these models focus on how to discriminate between the classes: P(y | x).

  16. Generative.

      P(y \mid x) = \frac{P(x \mid y)\, P(y)}{\sum_{y' \in Y} P(x \mid y')\, P(y')}

      P(y \mid x) \propto P(x \mid y)\, P(y)

      \max_y P(x \mid y)\, P(y)

  How do we parameterize these probabilities when x and y are sequences?

  17. Hidden Markov Model. P(y) = P(y_1, ..., y_n) is the prior probability of the label sequence.

      P(y_1, \ldots, y_n) \approx \prod_{i=1}^{n+1} P(y_i \mid y_{i-1})

  We'll make a first-order Markov assumption and calculate the joint probability as the product of the individual factors, each conditioned only on the previous tag.
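As a worked instance of this factorization (using the example sentence and tags from slide 14, and reading the (n+1)-th factor as the transition into the END symbol that appears in the trellis on slide 31, with y_0 = START):

    P(\text{NN VBZ IN DT NN}) \approx P(\text{NN} \mid \text{START}) \cdot P(\text{VBZ} \mid \text{NN}) \cdot P(\text{IN} \mid \text{VBZ})
                                      \cdot P(\text{DT} \mid \text{IN}) \cdot P(\text{NN} \mid \text{DT}) \cdot P(\text{END} \mid \text{NN})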

  18. Hidden Markov Model.

      P(y_1, \ldots, y_n) = P(y_1) \times P(y_2 \mid y_1) \times P(y_3 \mid y_1, y_2) \times \cdots \times P(y_n \mid y_1, \ldots, y_{n-1})

  Remember: a Markov assumption is an approximation to this exact decomposition (the chain rule of probability).

  19. Hidden Markov Model.

      P(x \mid y) = P(x_1, \ldots, x_n \mid y_1, \ldots, y_n) \approx \prod_{i=1}^{n} P(x_i \mid y_i)

  Here again we'll make a strong assumption: the probability of the word we see at a given time step depends only on its label.

  20. Counts of words tagged VBZ, split by the preceding tag, i.e. P(x_i | y_i, y_{i-1}):
      NNP VBZ: is 1121, has 854, says 420, does 77, plans 50, expects 47, 's 40, wants 31, owns 30, makes 29, hopes 24, remains 24, claims 19, seems 19, estimates 17
      NN VBZ: is 2893, has 1004, does 128, says 109, remains 56, 's 51, includes 44, continues 43, makes 40, seems 34, comes 33, reflects 31, calls 30, expects 29, goes 27

  21. HMM.

      P(x_1, \ldots, x_n, y_1, \ldots, y_n) \approx \prod_{i=1}^{n+1} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)
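A minimal Python sketch of this joint probability, assuming the transition and emission distributions are stored as nested dictionaries trans[prev][tag] and emit[tag][word] with explicit START/END symbols (these names are illustrative):

    def hmm_joint_prob(words, tags, trans, emit, start="START", end="END"):
        """P(x, y) ~= prod_i P(y_i | y_{i-1}) * prod_i P(x_i | y_i)."""
        prob = 1.0
        prev = start
        for word, tag in zip(words, tags):
            prob *= trans[prev].get(tag, 0.0) * emit[tag].get(word, 0.0)
            prev = tag
        prob *= trans[prev].get(end, 0.0)   # the (n+1)-th transition, into END
        return prob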

  22. HMM graphical model: a chain of hidden states y_1 ... y_7, each emitting an observation x_1 ... x_7, with an example transition probability P(y_3 | y_2) and an example emission probability P(x_3 | y_3) labeled.

  23. HMM example: "Mr. Collins was not a sensible man", tagged NNP NNP VB RB DT JJ NN, with an example transition probability P(VB | NNP) and an example emission probability P(was | VB).

  24. Parameter estimation. The MLE for both is just counting (as in Naive Bayes):

      P(y_t \mid y_{t-1}) = \frac{c(y_{t-1}, y_t)}{c(y_{t-1})}

      P(x_t \mid y_t) = \frac{c(x_t, y_t)}{c(y_t)}
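A sketch of this counting, assuming the training data is a list of tagged sentences (each a list of (word, tag) pairs); the START/END handling and names match the illustrative dictionaries above:

    from collections import Counter, defaultdict

    def estimate_hmm(tagged_sentences, start="START", end="END"):
        """MLE transition and emission probabilities from counts."""
        trans_counts = defaultdict(Counter)   # c(y_{t-1}, y_t)
        emit_counts = defaultdict(Counter)    # c(x_t, y_t)
        for sentence in tagged_sentences:
            prev = start
            for word, tag in sentence:
                trans_counts[prev][tag] += 1
                emit_counts[tag][word] += 1
                prev = tag
            trans_counts[prev][end] += 1      # transition into END
        trans = {p: {t: c / sum(cs.values()) for t, c in cs.items()}
                 for p, cs in trans_counts.items()}
        emit = {t: {w: c / sum(cs.values()) for w, c in cs.items()}
                for t, cs in emit_counts.items()}
        return trans, emit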

  25. Transition probabilities

  26. Emission probabilities

  27. Smoothing. One solution: add a little probability mass to every element.

      Maximum likelihood estimate: P(x_i \mid y) = \frac{n_{i,y}}{n_y}

      Smoothed estimate (same \alpha for all x_i): P(x_i \mid y) = \frac{n_{i,y} + \alpha}{n_y + V\alpha}

      Smoothed estimate (possibly different \alpha for each x_i): P(x_i \mid y) = \frac{n_{i,y} + \alpha_i}{n_y + \sum_{j=1}^{V} \alpha_j}

  where n_{i,y} = count of word i in class y, n_y = number of words in y, and V = size of the vocabulary.
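A small sketch of the add-alpha smoothed emission estimate, using the illustrative emit_counts container from above:

    def smoothed_emission(word, tag, emit_counts, vocab_size, alpha=0.1):
        """Add-alpha estimate of P(word | tag), same alpha for every word type."""
        n_iy = emit_counts[tag].get(word, 0)    # n_{i,y}: count of word i in class y
        n_y = sum(emit_counts[tag].values())    # n_y: number of words in class y
        return (n_iy + alpha) / (n_y + vocab_size * alpha)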

  28. Decoding. Greedy: proceed left to right, committing to the best tag for each time step (given the sequence seen so far). Example: Fruit flies like a banana → NN VB IN DT NN.
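A sketch of greedy decoding under the HMM, again over the illustrative trans/emit dictionaries and a fixed tag set:

    def greedy_decode(words, tagset, trans, emit, start="START"):
        """Left to right, committing to the locally best tag at each step."""
        tags, prev = [], start
        for word in words:
            best = max(tagset,
                       key=lambda t: trans[prev].get(t, 0.0) * emit[t].get(word, 0.0))
            tags.append(best)
            prev = best
        return tags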

  29. Decoding. Example: "The horse raced past the barn fell", with the left-to-right tags DT NN VBD IN DT NN and ??? for the final word "fell".

  30. Decoding. "The horse raced past the barn fell": the greedy path DT NN VBD IN DT NN leaves ??? for "fell", while the correct sequence is DT NN VBN IN DT NN VBD. Information later on in the sentence can influence the best tags earlier on.

  31. All paths. Trellis over the sentence "^ Janet will back the bill $" with states START, NNP, MD, VB, NN, DT, and END at each time step. Ideally, what we want is to calculate the joint probability of each path and pick the one with the highest probability. But for N time steps and K labels, the number of possible paths is K^N.

  32. For a 5-word sentence with 45 Penn Treebank tags: 45^5 = 184,528,125 different paths. For a 20-word sentence: 45^20 ≈ 1.16e33 different paths.

  33. Viterbi algorithm • Basic idea: if an optimal path through a sequence uses label L at time T, then it must have used an optimal path to get to label L at time T • We can discard all non-optimal paths up to label L at time T

  34. The same trellis over "^ Janet will back the bill $" with states START, NNP, MD, VB, NN, DT, and END. At each time step t ending in label K, we find the max probability of any path that led to that state.

  35. Trellis with cells v_1(NNP), v_1(MD), v_1(VB), v_1(NN), v_1(DT) at the first time step and v_T(END) at the end. What's the HMM probability of ending in Janet = NNP? Using P(y_t | y_{t-1}) P(x_t | y_t), it is P(NNP | START) P(Janet | NNP).

  36. v_1(y) is the best path through time step 1 ending in tag y (trivially, the best path for all is just START):

      v_1(y) = \max_{u \in Y} \left[ P(y_t = y \mid y_{t-1} = u)\, P(x_t \mid y_t = y) \right]

  37. The trellis now has columns v_1(·) and v_2(·) for each tag. What's the max HMM probability of ending in will = MD? First, what's the HMM probability of a single path ending in will = MD?

  38. For a single path ending in will = MD:

      P(y_1 \mid \text{START})\, P(x_1 \mid y_1) \times P(y_2 = \text{MD} \mid y_1)\, P(x_2 \mid y_2 = \text{MD})

  39. Best path through time step 2 ending in tag MD; the candidate paths are:

      P(DT | START) × P(Janet | DT) × P(y_t = MD | y_{t-1} = DT) × P(will | y_t = MD)
      P(NNP | START) × P(Janet | NNP) × P(y_t = MD | y_{t-1} = NNP) × P(will | y_t = MD)
      P(VB | START) × P(Janet | VB) × P(y_t = MD | y_{t-1} = VB) × P(will | y_t = MD)
      P(NN | START) × P(Janet | NN) × P(y_t = MD | y_{t-1} = NN) × P(will | y_t = MD)
      P(MD | START) × P(Janet | MD) × P(y_t = MD | y_{t-1} = MD) × P(will | y_t = MD)

  40. Best path through time step 2 ending in tag MD. Let's say the best path ending in will = MD includes Janet = NNP. By definition, every other path has lower probability.
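Pulling the walkthrough together, here is a compact sketch of the standard Viterbi recurrence over the same illustrative trans/emit dictionaries (a sketch of the general algorithm, not code from the lecture):

    def viterbi(words, tagset, trans, emit, start="START", end="END"):
        """Most probable tag sequence under the HMM, recovered via backpointers."""
        v = [{}]      # v[t][tag] = max prob of any path ending in tag at time t
        back = [{}]   # backpointers for recovering the best path
        for tag in tagset:
            v[0][tag] = trans[start].get(tag, 0.0) * emit[tag].get(words[0], 0.0)
            back[0][tag] = start
        for t in range(1, len(words)):
            v.append({})
            back.append({})
            for tag in tagset:
                best_prev = max(tagset,
                                key=lambda u: v[t - 1][u] * trans[u].get(tag, 0.0))
                v[t][tag] = (v[t - 1][best_prev] * trans[best_prev].get(tag, 0.0)
                             * emit[tag].get(words[t], 0.0))
                back[t][tag] = best_prev
        # Final transition into END, then follow the backpointers.
        last = max(tagset, key=lambda u: v[-1][u] * trans[u].get(end, 0.0))
        path = [last]
        for t in range(len(words) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return list(reversed(path))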
