

  1. Natural Language Processing (CSEP 517): Sequence Models. Noah Smith © 2017, University of Washington. nasmith@cs.washington.edu. April 17, 2017.

  2. To-Do List
     - Online quiz: due Sunday
     - Read: Collins (2011), which has somewhat different notation; Jurafsky and Martin (2016a,b,c)
     - A2 due April 23 (Sunday)

  3. Linguistic Analysis: Overview
     Every linguistic analyzer consists of:
     1. Theoretical motivation from linguistics and/or the text domain
     2. An algorithm that maps V† to some output space Y
     3. An implementation of the algorithm
     - Once upon a time: rule systems and crafted rules
     - Most common now: supervised learning from annotated data
     - Frontier: less supervision (semi-, un-, reinforcement, distant, ...)

  4. Sequence Labeling
     After text classification (V† → L), the next simplest type of output is a sequence labeling:
         ⟨x_1, x_2, ..., x_ℓ⟩ ↦ ⟨y_1, y_2, ..., y_ℓ⟩
         x ↦ y
     Every word gets a label in L. Example problems:
     - part-of-speech tagging (Church, 1988)
     - spelling correction (Kernighan et al., 1990)
     - word alignment (Vogel et al., 1996)
     - named-entity recognition (Bikel et al., 1999)
     - compression (Conroy and O'Leary, 2001)
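To make the input-output convention concrete, here is one plausible labeled pair for part-of-speech tagging, using the example sentence that appears later on slide 13 (the particular tag assignments are illustrative guesses, not an answer key from the course):

```python
# A sequence-labeling instance: every token x_i is paired with a label y_i from L.
x = ["I", "suspect", "the", "present", "forecast", "is", "pessimistic", "."]
y = ["noun", "verb", "det.", "adj.", "noun", "verb", "adj.", "punc."]
assert len(x) == len(y)  # tokens and labels are aligned one-to-one
```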

  5. The Simplest Sequence Labeler: "Local" Classifier
     Define features of a labeled word in context: φ(x, i, y). Train a classifier, e.g.,
         ŷ_i = argmax_{y ∈ L} s(x, i, y)
             = argmax_{y ∈ L} w · φ(x, i, y)   (if the classifier is linear)
     Decide the label for each word independently.

  6. The Simplest Sequence Labeler: "Local" Classifier
     Define features of a labeled word in context: φ(x, i, y). Train a classifier, e.g.,
         ŷ_i = argmax_{y ∈ L} s(x, i, y)
             = argmax_{y ∈ L} w · φ(x, i, y)   (if the classifier is linear)
     Decide the label for each word independently. Sometimes this works!

  7. The Simplest Sequence Labeler: "Local" Classifier
     Define features of a labeled word in context: φ(x, i, y). Train a classifier, e.g.,
         ŷ_i = argmax_{y ∈ L} s(x, i, y)
             = argmax_{y ∈ L} w · φ(x, i, y)   (if the classifier is linear)
     Decide the label for each word independently. Sometimes this works! We can do better when there are predictable relationships between Y_i and Y_{i+1}.
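A minimal sketch of such a local classifier, assuming a hand-written feature function φ(x, i, y) and a weight dictionary w trained elsewhere; the specific features and label set below are illustrative choices, not the ones used in the course:

```python
from collections import defaultdict

LABELS = ["noun", "verb", "det.", "adj.", "adv.", "num.", "punc."]

def phi(x, i, y):
    """Features of labeling position i of sentence x with label y (illustrative)."""
    feats = defaultdict(float)
    feats[f"word={x[i]},label={y}"] = 1.0
    prev_word = x[i - 1] if i > 0 else "<s>"
    feats[f"prev_word={prev_word},label={y}"] = 1.0
    return feats

def score(w, x, i, y):
    """Linear score: w . phi(x, i, y)."""
    return sum(w.get(name, 0.0) * value for name, value in phi(x, i, y).items())

def local_tag(w, x):
    """Decide y_i = argmax_{y in L} w . phi(x, i, y), independently for each position."""
    return [max(LABELS, key=lambda y: score(w, x, i, y)) for i in range(len(x))]
```

Because each argmax looks only at its own position, nothing stops the predicted sequence from containing label pairs that never occur together; that is the gap the HMM addresses next.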

  8. Generative Sequence Labeling: Hidden Markov Models
         p(x, y) = ∏_{i=1}^{ℓ+1} p(x_i | y_i) · p(y_i | y_{i-1})
     For each state/label y ∈ L:
     - p(X_i | Y_i = y) is the "emission" distribution for y
     - p(Y_i | Y_{i-1} = y) is called the "transition" distribution for y
     Assume Y_0 is always a start state and Y_{ℓ+1} is always a stop state; x_{ℓ+1} is always the stop symbol.
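A small sketch of this joint probability, assuming the emission and transition distributions are stored as nested dictionaries that include rows for the start and stop states (e.g., transition["<start>"]["noun"] = p(noun | start)); the "<start>"/"<stop>" names are placeholders I am introducing, not notation from the slides:

```python
def hmm_joint_prob(x, y, emission, transition,
                   start="<start>", stop_state="<stop>", stop_symbol="<stop>"):
    """p(x, y) = prod_{i=1..l+1} p(x_i | y_i) * p(y_i | y_{i-1}),
    with y_0 = start state, y_{l+1} = stop state, x_{l+1} = stop symbol."""
    xs = list(x) + [stop_symbol]           # x_1, ..., x_l, x_{l+1}
    ys = [start] + list(y) + [stop_state]  # y_0, y_1, ..., y_l, y_{l+1}
    prob = 1.0
    for i in range(1, len(ys)):
        # emission[stop_state][stop_symbol] is assumed to be 1.0, as on slide 9.
        prob *= emission[ys[i]].get(xs[i - 1], 0.0) * transition[ys[i - 1]].get(ys[i], 0.0)
    return prob
```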

  9. Graphical Representation of Hidden Markov Models
     [Figure: a chain of label nodes y_0, y_1, ..., y_5, with each y_i emitting the corresponding word x_1, ..., x_5.]
     Note: handling of the beginning and end of the sequence is a bit different than before. The last x is known since p(stop symbol | stop state) = 1.

  10. Structured vs. Not
      Each of these has an advantage over the other:
      - The HMM lets the different labels "interact."
      - The local classifier makes all of x available for every decision.

  11. Prediction with HMMs
      The classical HMM tells us to choose:
          argmax_{y ∈ L^{ℓ+1}} ∏_{i=1}^{ℓ+1} p(x_i | y_i) · p(y_i | y_{i-1})
      How to optimize over |L|^ℓ choices without explicit enumeration?

  12. Prediction with HMMs
      The classical HMM tells us to choose:
          argmax_{y ∈ L^{ℓ+1}} ∏_{i=1}^{ℓ+1} p(x_i | y_i) · p(y_i | y_{i-1})
      How to optimize over |L|^ℓ choices without explicit enumeration?
      Key: exploit the conditional independence assumptions:
          Y_i ⊥ Y_{1:i-2} | Y_{i-1}
          Y_i ⊥ Y_{i+2:ℓ} | Y_{i+1}

  13. Part-of-Speech Tagging Example
      "I suspect the present forecast is pessimistic ."
      [Grid on the slide: one column per word, one row per tag in {noun, adj., adv., verb, num., det., punc.}, with dots marking which tags are plausible for each word.]
      With this very simple tag set, 7^8 ≈ 5.7 million labelings. (Even restricting to the possibilities above, 288 labelings.)
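A quick arithmetic check of the first count on the slide (the 288 figure depends on exactly which tags the grid allows for each word, so only the 7^8 number is reproduced here):

```python
num_tags, num_words = 7, 8
print(num_tags ** num_words)  # 5764801, i.e. roughly 5.7 million possible labelings
```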

  14. Two Obvious Solutions
      Brute force: Enumerate all solutions, score them, pick the best.
      Greedy: Pick each ŷ_i according to:
          ŷ_i = argmax_{y ∈ L} p(y | ŷ_{i-1}) · p(x_i | y)
      What's wrong with these?

  15. Two Obvious Solutions
      Brute force: Enumerate all solutions, score them, pick the best.
      Greedy: Pick each ŷ_i according to:
          ŷ_i = argmax_{y ∈ L} p(y | ŷ_{i-1}) · p(x_i | y)
      What's wrong with these? Consider:
          "the old dog the footsteps of the young" (credit: Julia Hirschberg)
          "the horse raced past the barn fell"
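A sketch of the greedy decoder, reusing the dictionary-style probability tables assumed in the earlier sketch; it commits to each label left to right, which is exactly what garden-path sentences like the two above can break:

```python
def greedy_tag(x, emission, transition, labels, start="<start>"):
    """Pick y_i = argmax_{y in L} p(y | y_{i-1}) * p(x_i | y), left to right."""
    y_hat, prev = [], start
    for word in x:
        best = max(labels,
                   key=lambda y: transition[prev].get(y, 0.0) * emission[y].get(word, 0.0))
        y_hat.append(best)
        prev = best
    return y_hat
```

Brute force, by contrast, is exact but enumerates all |L|^ℓ sequences, which slide 13 already showed runs into the millions even for a short sentence.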

  16. Conditional Independence
      We can get an exact solution in polynomial time!
          Y_i ⊥ Y_{1:i-2} | Y_{i-1}
          Y_i ⊥ Y_{i+2:ℓ} | Y_{i+1}
      Given the labels adjacent to Y_i, the others do not matter. Let's start at the last position, ℓ ...

  17. High-Level View of Viterbi
      - The decision about Y_ℓ is a function of y_{ℓ-1}, x_ℓ, and nothing else!
            p(Y_ℓ = y | x, y_{1:(ℓ-1)}) = p(Y_ℓ = y | X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop)
                                        = p(Y_ℓ = y, X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop) / p(X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop)
                                        ∝ p(stop | y) · p(x_ℓ | y) · p(y | y_{ℓ-1})

  18. High-Level View of Viterbi
      - The decision about Y_ℓ is a function of y_{ℓ-1}, x_ℓ, and nothing else!
            p(Y_ℓ = y | x, y_{1:(ℓ-1)}) = p(Y_ℓ = y | X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop)
                                        = p(Y_ℓ = y, X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop) / p(X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop)
                                        ∝ p(stop | y) · p(x_ℓ | y) · p(y | y_{ℓ-1})
      - If, for each value of y_{ℓ-1}, we knew the best y_{1:(ℓ-1)}, then picking y_ℓ would be easy.

  19. High-Level View of Viterbi
      - The decision about Y_ℓ is a function of y_{ℓ-1}, x_ℓ, and nothing else!
            p(Y_ℓ = y | x, y_{1:(ℓ-1)}) = p(Y_ℓ = y | X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop)
                                        = p(Y_ℓ = y, X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop) / p(X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop)
                                        ∝ p(stop | y) · p(x_ℓ | y) · p(y | y_{ℓ-1})
      - If, for each value of y_{ℓ-1}, we knew the best y_{1:(ℓ-1)}, then picking y_ℓ would be easy.
      - Idea: for each position i, calculate the score of the best label prefix y_{1:i} ending in each possible value for Y_i.

  20. High-Level View of Viterbi
      - The decision about Y_ℓ is a function of y_{ℓ-1}, x_ℓ, and nothing else!
            p(Y_ℓ = y | x, y_{1:(ℓ-1)}) = p(Y_ℓ = y | X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop)
                                        = p(Y_ℓ = y, X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop) / p(X_ℓ = x_ℓ, Y_{ℓ-1} = y_{ℓ-1}, Y_{ℓ+1} = stop)
                                        ∝ p(stop | y) · p(x_ℓ | y) · p(y | y_{ℓ-1})
      - If, for each value of y_{ℓ-1}, we knew the best y_{1:(ℓ-1)}, then picking y_ℓ would be easy.
      - Idea: for each position i, calculate the score of the best label prefix y_{1:i} ending in each possible value for Y_i.
      - With a little bookkeeping, we can then trace backwards and recover the best label sequence.

  21. Chart Data Structure
      [Grid on the slide: one column per word x_1, x_2, ..., x_ℓ and one row per label y, y', ..., y_last; each cell holds the score of the best label prefix ending in that label at that position.]

  22. Recurrence
      First, think about the score of the best sequence. Let s_i(y) be the score of the best label sequence for x_{1:i} that ends in y. It is defined recursively:
          s_ℓ(y) = p(stop | y) · p(x_ℓ | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-1}(y')

  23. Recurrence
      First, think about the score of the best sequence. Let s_i(y) be the score of the best label sequence for x_{1:i} that ends in y. It is defined recursively:
          s_ℓ(y) = p(stop | y) · p(x_ℓ | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-1}(y')
          s_{ℓ-1}(y) = p(x_{ℓ-1} | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-2}(y')

  24. Recurrence
      First, think about the score of the best sequence. Let s_i(y) be the score of the best label sequence for x_{1:i} that ends in y. It is defined recursively:
          s_ℓ(y) = p(stop | y) · p(x_ℓ | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-1}(y')
          s_{ℓ-1}(y) = p(x_{ℓ-1} | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-2}(y')
          s_{ℓ-2}(y) = p(x_{ℓ-2} | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-3}(y')

  25. Recurrence
      First, think about the score of the best sequence. Let s_i(y) be the score of the best label sequence for x_{1:i} that ends in y. It is defined recursively:
          s_ℓ(y) = p(stop | y) · p(x_ℓ | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-1}(y')
          s_{ℓ-1}(y) = p(x_{ℓ-1} | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-2}(y')
          s_{ℓ-2}(y) = p(x_{ℓ-2} | y) · max_{y' ∈ L} p(y | y') · s_{ℓ-3}(y')
          ...
          s_i(y) = p(x_i | y) · max_{y' ∈ L} p(y | y') · s_{i-1}(y')
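Putting the recurrence together with the bookkeeping described on slide 20, here is a sketch of a full Viterbi decoder, again assuming dictionary-based probability tables and the start/stop conventions from the earlier sketches; this is a reconstruction of the algorithm the slides describe, not code distributed with the course:

```python
def viterbi(x, emission, transition, labels, start="<start>", stop_state="<stop>"):
    """Return the label sequence maximizing the HMM's joint probability with x."""
    ell = len(x)
    # s[i][y]: the slides' s_{i+1}(y), the score of the best labeling of x[0..i] ending in y.
    # bp[i][y]: the previous label on that best labeling (the backpointer).
    s = [dict() for _ in range(ell)]
    bp = [dict() for _ in range(ell)]
    for y in labels:
        s[0][y] = emission[y].get(x[0], 0.0) * transition[start].get(y, 0.0)
    for i in range(1, ell):
        for y in labels:
            best_prev = max(labels, key=lambda yp: transition[yp].get(y, 0.0) * s[i - 1][yp])
            s[i][y] = (emission[y].get(x[i], 0.0)
                       * transition[best_prev].get(y, 0.0) * s[i - 1][best_prev])
            bp[i][y] = best_prev
    # Fold in the transition to the stop state, then follow backpointers right to left.
    y_hat = [max(labels, key=lambda y: s[ell - 1][y] * transition[y].get(stop_state, 0.0))]
    for i in range(ell - 1, 0, -1):
        y_hat.append(bp[i][y_hat[-1]])
    return list(reversed(y_hat))
```

The chart s is exactly the grid from slide 21: one cell per (position, label) pair, giving O(ℓ · |L|^2) running time instead of |L|^ℓ. In practice the products of probabilities underflow quickly, so a real implementation would sum log-probabilities instead; the multiplicative form here mirrors the slides.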
