Grammars, graphs and automata
Mark Johnson, Brown University
ALTA summer school, December 2003
Slides available from http://www.cog.brown.edu/mj


  1. Languages and Grammars If V is a set of symbols (the vocabulary, i.e., words, letters, phonemes, etc.):
• V⋆ is the set of all strings (or finite sequences) of members of V (including the empty sequence ε)
• V⁺ is the set of all finite non-empty strings of members of V
A language is a subset of V⋆ (i.e., a set of strings).
A probabilistic language is a probability distribution P over V⋆, i.e.,
• ∀w ∈ V⋆: 0 ≤ P(w) ≤ 1
• Σ_{w∈V⋆} P(w) = 1, i.e., P is normalized
A (probabilistic) grammar is a finite specification of a (probabilistic) language.

  2. Trees depict constituency Some grammars G define a language by defining a set of trees Ψ_G; the strings G generates are the terminal yields of these trees. (Figure: parse tree for "I saw the man with the telescope", with nonterminals S, NP, VP, PP, preterminals Pro, V, D, N, P, and the terminals forming the terminal yield.) Trees represent how words combine to form phrases and ultimately sentences.

  3. Probabilistic grammars Some probabilistic grammars G define a probability distribution P_G(ψ) over the set of trees Ψ_G, and hence over strings w ∈ V⋆:
P_G(w) = Σ_{ψ∈Ψ_G(w)} P_G(ψ)
where Ψ_G(w) is the set of trees with yield w generated by G.
Standard (non-stochastic) grammars distinguish grammatical from ungrammatical strings (only the grammatical strings receive parses). Probabilistic grammars can assign non-zero probability to every string, and rely on the probability distribution to distinguish likely from unlikely strings.

  4. Context free grammars A context-free grammar G = (V, S, s, R) consists of:
• V, a finite set of terminals (V_0 = {Sam, Sasha, thinks, snores})
• S, a finite set of non-terminals disjoint from V (S_0 = {S, NP, VP, V})
• R, a finite set of productions of the form A → X_1 … X_n, where A ∈ S and each X_i ∈ S ∪ V
• s ∈ S, the start symbol (s_0 = S)
G generates a tree ψ iff:
• the label of ψ's root node is s, and
• for every local tree with parent A and children X_1 … X_n in ψ, A → X_1 … X_n ∈ R.
G generates a string w ∈ V⋆ iff w is the terminal yield of a tree generated by G.
(Figure: a tree for "Sam thinks Sasha snores" generated by the productions S → NP VP, NP → Sam, NP → Sasha, VP → V, VP → V S, V → thinks, V → snores.)
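A minimal sketch (not from the slides) of the example grammar above as a Python dictionary, together with a naive generator that expands nonterminals left to right. The uniform random choice among productions is my own addition, just to make the sketch runnable; it is not part of the CFG definition.

```python
import random

# Example grammar from this slide; nonterminals map to lists of right-hand sides.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["Sam"], ["Sasha"]],
    "VP": [["V"], ["V", "S"]],
    "V":  [["thinks"], ["snores"]],
}

def generate(symbol="S"):
    """Expand a symbol into a terminal string (terminals are symbols with no rules)."""
    if symbol not in RULES:                     # terminal: emit it
        return [symbol]
    expansion = random.choice(RULES[symbol])    # pick one production (uniformly, for illustration)
    return [word for child in expansion for word in generate(child)]

for _ in range(3):
    print(" ".join(generate()))   # strings such as "Sam snores" or "Sasha thinks Sam snores"
```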

  5. CFGs as “plugging” systems (Figure: the productions S → NP VP, VP → V NP, NP → Sam, NP → George, V → hates, V → likes drawn as components with “plugs” and “sockets”; plugging them together yields the tree for “Sam hates George”.)
• Goal: no unconnected “sockets” or “plugs”
• The productions specify the available types of components
• In a probabilistic CFG each type of component has a “price”

  6. Structural Ambiguity R_1 = {VP → V NP, VP → VP PP, NP → D N, N → N PP, …} (Figure: two parse trees for “I saw the man with the telescope”, one attaching the PP to the VP and one attaching it to the NP.)
• CFGs can capture structural ambiguity in language.
• Ambiguity generally grows exponentially in the length of the string.
  – The number of ways of parenthesizing a string of length n is Catalan(n).
• Broad-coverage statistical grammars are astronomically ambiguous.
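A quick sketch (my own, not from the slides) of how fast Catalan(n), the number of binary bracketings, grows with string length:

```python
from math import comb

def catalan(n: int) -> int:
    """Number of binary bracketings of a string of length n."""
    return comb(2 * n, n) // (n + 1)

for n in (5, 10, 20):
    print(n, catalan(n))
# 5 42
# 10 16796
# 20 6564120420   -- already over six billion analyses
```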

  7. Derivations A CFG G = (V, S, s, R) induces a rewriting relation ⇒_G, where γAδ ⇒_G γβδ iff A → β ∈ R and γ, δ ∈ (S ∪ V)⋆. A derivation of a string w ∈ V⋆ is a finite sequence of rewritings s ⇒_G … ⇒_G w. ⇒⋆_G is the reflexive and transitive closure of ⇒_G. The language generated by G is {w : s ⇒⋆_G w, w ∈ V⋆}.
G_0 = (V_0, S_0, S, R_0), V_0 = {Sam, Sasha, likes, hates}, S_0 = {S, NP, VP, V}, R_0 = {S → NP VP, VP → V NP, NP → Sam, NP → Sasha, V → likes, V → hates}
Example derivation: S ⇒ NP VP ⇒ NP V NP ⇒ Sam V NP ⇒ Sam V Sasha ⇒ Sam likes Sasha
• Steps in a terminating derivation are always cuts in a parse tree
• Left-most and right-most derivations are unique

  8. Enumerating trees and parsing strategies A parsing strategy specifies the order in which the nodes of each local tree (a parent and its children Child_1 … Child_n) are enumerated:
• Top-down (pre-order): Parent, Child_1, …, Child_n
• Left-corner (in-order): Child_1, Parent, Child_2, …, Child_n
• Bottom-up (post-order): Child_1, …, Child_n, Parent

  9.–15. Top-down parses are left-most derivations (steps 1–7) Productions: S → NP VP, NP → D N, D → no, N → politician, VP → V, V → lies. (Figure sequence: the parse tree for “no politician lies” built top-down, one step per slide, alongside the corresponding leftmost derivation.)
Leftmost derivation: S ⇒ NP VP ⇒ D N VP ⇒ no N VP ⇒ no politician VP ⇒ no politician V ⇒ no politician lies

  16.–22. Bottom-up parses are reversed rightmost derivations (steps 1–7) Productions: S → NP VP, NP → D N, D → no, N → politician, VP → V, V → lies. (Figure sequence: the parse tree for “no politician lies” built bottom-up, one step per slide.)
Rightmost derivation: S ⇒ NP VP ⇒ NP V ⇒ NP lies ⇒ D N lies ⇒ D politician lies ⇒ no politician lies; a bottom-up parse performs these rewrites in reverse order.

  23. Probabilistic Context Free Grammars A Probabilistic Context Free Grammar (PCFG) G consists of:
• a CFG (V, S, S, R) with no useless productions, and
• production probabilities p(A → β) = P(β | A) for each A → β ∈ R, the conditional probability of an A expanding to β.
A production A → β is useless iff it is not used in any terminating derivation, i.e., there are no derivations of the form S ⇒⋆ γAδ ⇒ γβδ ⇒⋆ w for any γ, δ ∈ (S ∪ V)⋆ and w ∈ V⋆.
If r_1 … r_n is the sequence of productions used to generate a tree ψ, then
P_G(ψ) = p(r_1) … p(r_n) = Π_{r∈R} p(r)^{f_r(ψ)}
where f_r(ψ) is the number of times r is used in deriving ψ.
Σ_ψ P_G(ψ) = 1 if p satisfies suitable constraints.

  24. Example PCFG
1.0 S → NP VP    1.0 VP → V
0.75 NP → George    0.25 NP → Al
0.6 V → barks    0.4 V → snores
(Figure: the tree [S [NP George] [VP [V barks]]] has probability P = 0.45; the tree [S [NP Al] [VP [V snores]]] has probability P = 0.1.)
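A small sketch (my own) that computes P_G(ψ) as the product of the probabilities of the rules used in a tree, checked against the two example trees above. Trees are encoded as nested tuples, a representation I chose for the sketch.

```python
# Rule probabilities from the example PCFG above, keyed by (lhs, rhs).
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V",)): 1.0,
    ("NP", ("George",)): 0.75,
    ("NP", ("Al",)): 0.25,
    ("V", ("barks",)): 0.6,
    ("V", ("snores",)): 0.4,
}

def tree_prob(tree):
    """P_G(tree) = product of the probabilities of its local trees."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROBS[(label, rhs)]
    for c in children:
        if not isinstance(c, str):          # recurse into nonterminal children
            p *= tree_prob(c)
    return p

print(tree_prob(("S", ("NP", "George"), ("VP", ("V", "barks")))))   # ≈ 0.45
print(tree_prob(("S", ("NP", "Al"), ("VP", ("V", "snores")))))      # ≈ 0.1
```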

  25. Topics
• Graphical models and Bayes networks
• Markov chains and hidden Markov models
• (Probabilistic) context-free grammars
• (Probabilistic) finite-state machines
• Computation with PCFGs
• Estimation of PCFGs
• Lexicalized and bi-lexicalized PCFGs
• Non-local dependencies and log-linear models
• Stochastic unification-based grammars

  26. Finite-state automata – informal description Finite-state automata are devices that generate arbitrarily long strings one symbol at a time. At each step the automaton is in one of a finite number of states. Processing proceeds as follows:
1. Initialize the machine’s state s to the start state and w = ε (the empty string)
2. Loop:
(a) Based on the current state s, decide whether to stop and return w
(b) Based on the current state s, append a certain symbol x to w and update the state to s′
Mealy automata choose x based on s and s′; Moore automata (homogeneous HMMs) choose x based on s′ alone.
Note: I’m simplifying here; Mealy and Moore machines are properly transducers.
In probabilistic automata, these actions are directed by probability distributions.

  27. Mealy finite-state automata Mealy automata emit terminals from arcs. A (Mealy) automaton M = (V, S, s_0, F, M) consists of:
• V, a set of terminals
• S, a finite set of states
• s_0 ∈ S, the start state
• F ⊆ S, the set of final states, and
• M ⊆ S × V × S, the state transition relation.
(Example automaton 3: V_3 = {a, b}, S_3 = {0, 1}, start state 0, F_3 = {1}, M_3 = {(0, a, 0), (0, a, 1), (1, b, 0)})
An accepting derivation of a string v_1 … v_n ∈ V⋆ is a sequence of states s_0 … s_n ∈ S⋆ where:
• s_0 is the start state,
• s_n ∈ F, and
• for each i = 1 … n, (s_{i−1}, v_i, s_i) ∈ M.
(Figure: the two-state automaton with arcs 0 –a→ 0, 0 –a→ 1, 1 –b→ 0; state 1 is final.) 00101 is an accepting derivation of aaba.

  28. Probabilistic Mealy automata A probabilistic Mealy automaton M = (V, S, s_0, p_f, p_m) consists of:
• terminals V, states S and start state s_0 ∈ S as before,
• p_f(s), the probability of halting at state s ∈ S, and
• p_m(v, s′ | s), the probability of moving from s ∈ S to s′ ∈ S and emitting v ∈ V,
where p_f(s) + Σ_{v∈V, s′∈S} p_m(v, s′ | s) = 1 for all s ∈ S (halt or move on).
The probability of a derivation with states s_0 … s_n and outputs v_1 … v_n is:
P_M(s_0 … s_n; v_1 … v_n) = (Π_{i=1}^n p_m(v_i, s_i | s_{i−1})) p_f(s_n)
Example (for the Mealy automaton above): p_f(0) = 0, p_f(1) = 0.1, p_m(a, 0 | 0) = 0.2, p_m(a, 1 | 0) = 0.8, p_m(b, 0 | 1) = 0.9
P_M(00101, aaba) = 0.2 × 0.8 × 0.9 × 0.8 × 0.1
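A minimal sketch (my own) of the derivation probability for a probabilistic Mealy automaton, using the example parameters on this slide:

```python
# Halt probabilities p_f and move/emit probabilities p_m from the slide's example.
p_f = {0: 0.0, 1: 0.1}
p_m = {(0, "a", 0): 0.2, (0, "a", 1): 0.8, (1, "b", 0): 0.9}

def mealy_derivation_prob(states, outputs):
    """P(states; outputs) = prod_i p_m(v_i, s_i | s_{i-1}) * p_f(s_n)."""
    assert len(states) == len(outputs) + 1
    p = p_f[states[-1]]
    for (s_prev, s_next), v in zip(zip(states, states[1:]), outputs):
        p *= p_m.get((s_prev, v, s_next), 0.0)
    return p

print(mealy_derivation_prob([0, 0, 1, 0, 1], "aaba"))   # 0.2*0.8*0.9*0.8*0.1 ≈ 0.01152
```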

  29. Bayes net representation of Mealy PFSA In a Mealy automaton, the output is determined by the current and the next state. (Figure: a Bayes net with state nodes … S_{i−1} → S_i → S_{i+1} … and output nodes V_i, V_{i+1}, …, each output V_i depending on S_{i−1} and S_i; shown for the state sequence 00101 and string aaba, next to the Mealy FSA.)

  30. The trellis for a Mealy PFSA (Figure: the Bayes net for aaba and the corresponding trellis, which has a copy of each state {0, 1} at every string position; the state sequence 00101 for the string aaba is one path through the trellis.)

  31. Probabilistic Mealy FSA as PCFGs Given a Mealy PFSA M = (V, S, s_0, p_f, p_m), let G_M have the same terminals, states and start state as M, and the productions:
• s → ε with probability p_f(s) for all s ∈ S
• s → v s′ with probability p_m(v, s′ | s) for all s, s′ ∈ S and v ∈ V
Example: p(0 → a 0) = 0.2, p(0 → a 1) = 0.8, p(1 → ε) = 0.1, p(1 → b 0) = 0.9
(Figure: the PCFG parse of aaba — a uniformly right-branching tree — next to the Mealy FSA.) The FSA graph depicts the machine (i.e., all strings it generates), while the CFG tree depicts the analysis of a single string.

  32. Moore finite state automata Moore machines emit terminals from states. A Moore finite state automaton M = (V, S, s_0, F, M, L) is composed of:
• V, S, s_0 and F: terminals, states, start state and final states as before
• M ⊆ S × S, the state transition relation
• L ⊆ S × V, the state labelling relation
(Example automaton 4: V_4 = {a, b}, S_4 = {0, 1}, start state 0, F_4 = {1}, M_4 = {(0, 0), (0, 1), (1, 0)}, L_4 = {(0, a), (0, b), (1, b)})
A derivation of v_1 … v_n ∈ V⋆ is a sequence of states s_0 … s_n ∈ S⋆ where:
• s_0 is the start state and s_n ∈ F,
• (s_{i−1}, s_i) ∈ M for i = 1 … n, and
• (s_i, v_i) ∈ L for i = 1 … n.
(Figure: the two-state automaton; state 0 is labelled {a, b}, state 1 is labelled {b}.) 0101 is an accepting derivation of bab.

  33. Probabilistic Moore automata A probabilistic Moore automaton M = (V, S, s_0, p_f, p_m, p_ℓ) consists of:
• terminals V, states S and start state s_0 ∈ S as before,
• p_f(s), the probability of halting at state s ∈ S,
• p_m(s′ | s), the probability of moving from s ∈ S to s′ ∈ S, and
• p_ℓ(v | s), the probability of emitting v ∈ V from state s ∈ S,
where p_f(s) + Σ_{s′∈S} p_m(s′ | s) = 1 and Σ_{v∈V} p_ℓ(v | s) = 1 for all s ∈ S.
The probability of a derivation with states s_0 … s_n and output v_1 … v_n is:
P_M(s_0 … s_n; v_1 … v_n) = (Π_{i=1}^n p_m(s_i | s_{i−1}) p_ℓ(v_i | s_i)) p_f(s_n)
Example (for the Moore automaton above): p_f(0) = 0, p_f(1) = 0.1, p_ℓ(a | 0) = 0.4, p_ℓ(b | 0) = 0.6, p_ℓ(b | 1) = 1, p_m(0 | 0) = 0.2, p_m(1 | 0) = 0.8, p_m(0 | 1) = 0.9
P_M(0101, bab) = (0.8 × 1) × (0.9 × 0.4) × (0.8 × 1) × 0.1
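The corresponding sketch (my own) for a probabilistic Moore automaton, where the emission depends only on the state just entered, again using the slide's example parameters:

```python
p_f = {0: 0.0, 1: 0.1}
p_m = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.9}        # p_m[(s, s')] = P(s' | s)
p_l = {(0, "a"): 0.4, (0, "b"): 0.6, (1, "b"): 1.0}  # p_l[(s, v)]  = P(v | s)

def moore_derivation_prob(states, outputs):
    """P(states; outputs) = prod_i p_m(s_i | s_{i-1}) * p_l(v_i | s_i) * p_f(s_n)."""
    p = p_f[states[-1]]
    for i, v in enumerate(outputs, start=1):
        p *= p_m.get((states[i - 1], states[i]), 0.0) * p_l.get((states[i], v), 0.0)
    return p

print(moore_derivation_prob([0, 1, 0, 1], "bab"))   # (0.8*1)*(0.9*0.4)*(0.8*1)*0.1 ≈ 0.0023
```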

  34. Bayes net representation of Moore PFSA In a Moore automaton the output is determined by the current state alone, just as in an HMM (in fact, Moore automata are HMMs). (Figure: a Bayes net with state nodes … S_{i−1} → S_i → S_{i+1} … and output nodes V_{i−1}, V_i, V_{i+1}, each depending only on its own state; shown for the state sequence 0101 and string bab, next to the Moore FSA.)

  35. Trellis representation of Moore PFSA (Figure: the Bayes net for bab and the corresponding trellis over states {0, 1}; the state sequence 0101 for the string bab is one path through the trellis.)

  36. Probabilistic Moore FSA as PCFGs Given a Moore PFSA M = (V, S, s_0, p_f, p_m, p_ℓ), let G_M have the same terminals and start state as M, two nonterminals s and s̃ for each state s ∈ S, and productions:
• s → s̃′ s′ with probability p_m(s′ | s)
• s → ε with probability p_f(s)
• s̃ → v with probability p_ℓ(v | s)
Example: p(0 → 0̃ 0) = 0.2, p(0 → 1̃ 1) = 0.8, p(1 → ε) = 0.1, p(1 → 0̃ 0) = 0.9, p(0̃ → a) = 0.4, p(0̃ → b) = 0.6, p(1̃ → b) = 1
(Figure: the PCFG parse of bab next to the Moore FSA.)

  37. Bi-tag POS tagging An HMM or Moore PFSA whose states are POS tags. (Figure: the tag sequence Start NNP VB NNS $ for “Howard likes mangoes $”, and the corresponding right-branching PCFG parse using nonterminals NNP, NNP′, VB, VB′, NNS, NNS′.)

  38. Mealy vs Moore automata
• Mealy automata emit terminals from arcs
  – a probabilistic Mealy automaton has |V||S|² + |S| parameters
• Moore automata emit terminals from states
  – a probabilistic Moore automaton has (|V| + 1)|S| parameters
In a POS-tagging application, |S| ≈ 50 and |V| ≈ 2 × 10⁴, so:
• a Mealy automaton has ≈ 5 × 10⁷ parameters
• a Moore automaton has ≈ 10⁶ parameters
A Moore automaton therefore seems more reasonable for POS-tagging. The number of parameters grows rapidly as the number of states grows ⇒ smoothing is a practical necessity.

  39. Tri-tag POS tagging (Figure: the bi-tag analysis of “Howard likes mangoes $” and the corresponding tri-tag PCFG parse, whose nonterminals are tag pairs such as Start Start, Start NNP, NNP VB, VB NNS.) Given a set of POS tags T, the tri-tag PCFG has productions
(t_0 t_1) → t′_2 (t_1 t_2) and t′ → v
for all t_0, t_1, t_2 ∈ T and v ∈ V.
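A tiny sketch (my own) that just enumerates the tri-tag PCFG's production schema for an illustrative tag set and vocabulary; rule probabilities would still have to be estimated separately.

```python
T = ["Start", "NNP", "VB", "NNS", "$"]      # illustrative tag set
V = ["Howard", "likes", "mangoes", "$"]     # illustrative vocabulary

productions = []
for t0 in T:
    for t1 in T:
        for t2 in T:
            # nonterminal "t0 t1" rewrites as preterminal t2' followed by "t1 t2"
            productions.append((f"{t0} {t1}", [f"{t2}'", f"{t1} {t2}"]))
for t in T:
    for v in V:
        productions.append((f"{t}'", [v]))   # preterminal t' rewrites as a word

print(len(productions))                      # |T|^3 + |T||V| = 125 + 20 = 145
```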

  40. Advantages of using grammars PCFGs provide a more flexible structural framework than HMMs and FSA. Sesotho is a Bantu language with rich agglutinative morphology, so a two-level HMM seems appropriate:
• the upper level generates a sequence of words, and
• the lower level generates the sequence of morphemes within each word.
(Figure: a two-level analysis of “di jo o tla pheha” — ‘(s)he will cook food’ — with word-level categories NOUN and VERB and morpheme-level categories PRE, NS, SM, TNS, VS.)

  41. Finite state languages and linear grammars
• The classes of languages generated by Mealy and Moore FSA are the same; these languages are called the finite state languages.
• The finite state languages are also generated by left-linear and by right-linear CFGs.
  – A CFG is right linear iff every production is of the form A → β or A → β B for B ∈ S and β ∈ V⋆ (nonterminals only appear at the end of productions).
  – A CFG is left linear iff every production is of the form A → β or A → B β for B ∈ S and β ∈ V⋆ (nonterminals only appear at the beginning of productions).
• The language {w w^R : w ∈ {a, b}⋆}, where w^R is the reverse of w, is not a finite state language, but it is generated by a CFG ⇒ some context-free languages are not finite state languages.

  42. Things you should know about FSA
• FSA are good ways of representing dictionaries and morphology
• Finite state transducers can encode phonological rules
• The finite state languages are closed under intersection, union and complement
• FSA can be determinized and minimized
• There are practical algorithms for computing these operations on large automata
• All of this extends to probabilistic finite-state automata
• Much of this extends to PCFGs and tree automata

  43. Topics
• Graphical models and Bayes networks
• Markov chains and hidden Markov models
• (Probabilistic) context-free grammars
• (Probabilistic) finite-state machines
• Computation with PCFGs
• Estimation of PCFGs
• Lexicalized and bi-lexicalized PCFGs
• Non-local dependencies and log-linear models
• Stochastic unification-based grammars

  44. Binarization Almost all efficient CFG parsing algorithms require that productions have at most two children. Binarization can be done as a preprocessing step, or implicitly during parsing. (Figure: the local tree A → B_1 B_2 B_3 B_4 binarized left-factored, head-factored (assuming the head H = B_2), and right-factored, introducing new nonterminals for the intermediate nodes.)
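A small sketch (my own) of right-factored binarization of a single rule. The naming of the new intermediate nonterminals by their remaining children is one common convention; actual schemes vary.

```python
def binarize_right(lhs, rhs):
    """Turn lhs -> rhs (a list of symbols) into binary (or shorter) rules."""
    rules = []
    while len(rhs) > 2:
        rest = "_".join(rhs[1:])            # new nonterminal, e.g. "B2_B3_B4"
        rules.append((lhs, [rhs[0], rest]))
        lhs, rhs = rest, rhs[1:]
    rules.append((lhs, list(rhs)))
    return rules

print(binarize_right("A", ["B1", "B2", "B3", "B4"]))
# [('A', ['B1', 'B2_B3_B4']), ('B2_B3_B4', ['B2', 'B3_B4']), ('B3_B4', ['B3', 'B4'])]
```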

  45. ♦♦ More on binarization
• Binarization usually produces large numbers of new nonterminals
• These all appear in a certain position (e.g., at the end of productions)
• Design your parser loops and indexing so this is maximally efficient
• Top-down and left-corner parsing benefit from specially designed binarizations that delay choice points as long as possible
(Figure: the unbinarized rule A → B_1 B_2 B_3 B_4, its right-factored binarization, and a top-down right-factored variant whose intermediate nonterminals, such as A−B_1 and A−B_1 B_2, record the children seen so far.)

  46. ♦♦ Markov grammars
• Sometimes it can be desirable to smooth or generalize rules beyond what was actually observed in the treebank
• Markov grammars systematically “forget” part of the context
(Figure: the unbinarized rule VP → AP V NP PP PP, its head-factored binarization (assuming H = B_2, here the V), and a Markov-grammar binarization whose intermediate nonterminals record only part of the context, e.g. V…PP.)

  47. String positions String positions are a systematic way of representing substrings of a string. A string position of a string w = x_1 … x_n is an integer 0 ≤ i ≤ n. A substring of w is represented by a pair (i, j) of string positions, where 0 ≤ i ≤ j ≤ n; w_{i,j} represents the substring w_{i+1} … w_j.
Example (positions 0–3 in “Howard likes mangoes”): w_{0,1} = Howard, w_{1,3} = likes mangoes, w_{1,1} = ε
• Nothing depends on string positions being numbers, so
• this all generalizes to speech recognizer lattices, which are graphs whose vertices correspond to word boundaries.
(Figure: a word lattice over competing hypotheses such as “house”, “arose”, “the”, “how”, “us”, “a”, “rose”.)

  48. Dynamic programming computation Assume G = (V, S, s, R, p) is in Chomsky Normal Form, i.e., all productions are of the form A → B C or A → x, where A, B, C ∈ S, x ∈ V.
Goal: compute P(w) = Σ_{ψ∈Ψ_G(w)} P(ψ) = P(s ⇒⋆ w)
Data structure: a table of P(A ⇒⋆ w_{i,j}) for A ∈ S and 0 ≤ i < j ≤ n
Base case: P(A ⇒⋆ w_{i−1,i}) = p(A → w_{i−1,i}) for i = 1, …, n
Recursion: P(A ⇒⋆ w_{i,k}) = Σ_{A → B C ∈ R(A)} Σ_{j=i+1}^{k−1} p(A → B C) P(B ⇒⋆ w_{i,j}) P(C ⇒⋆ w_{j,k})
Return: P(s ⇒⋆ w_{0,n})

  49. Dynamic programming recursion P_G(A ⇒⋆ w_{i,k}) = Σ_{A → B C ∈ R(A)} Σ_{j=i+1}^{k−1} p(A → B C) P_G(B ⇒⋆ w_{i,j}) P_G(C ⇒⋆ w_{j,k})
(Figure: an A spanning w_{i,k} split into a B spanning w_{i,j} and a C spanning w_{j,k}, inside a parse rooted in S.) P_G(A ⇒⋆ w_{i,k}) is called an “inside probability”.

  50. Example PCFG parse
1.0 S → NP VP    0.1 NP → NP NP    0.2 NP → brothers    0.3 NP → box
0.4 NP → lies    1.0 V → box    0.8 VP → V NP    0.2 VP → lies
Inside chart for “brothers box lies” (cell (i, j) holds P(A ⇒⋆ w_{i,j})):
(0,1): NP 0.2    (0,2): NP 0.006    (0,3): S 0.0652
(1,2): NP 0.3, V 1.0    (1,3): VP 0.32
(2,3): NP 0.4, VP 0.2
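A runnable sketch (my own, not from the slides) of the inside recursion from slide 48, applied to the grammar above; it should reproduce the chart value P(brothers box lies) = 0.0652.

```python
from collections import defaultdict

# Binary rules p(A -> B C) and lexical rules p(A -> x) from the example grammar.
binary = {("S", "NP", "VP"): 1.0, ("NP", "NP", "NP"): 0.1, ("VP", "V", "NP"): 0.8}
lexical = {("NP", "brothers"): 0.2, ("NP", "box"): 0.3, ("NP", "lies"): 0.4,
           ("V", "box"): 1.0, ("VP", "lies"): 0.2}

def inside(words, start="S"):
    n = len(words)
    chart = defaultdict(float)               # chart[(A, i, k)] = P(A =>* w_{i,k})
    for i, w in enumerate(words):            # base case: width-1 cells
        for (A, x), p in lexical.items():
            if x == w:
                chart[(A, i, i + 1)] += p
    for width in range(2, n + 1):            # recursion: wider spans from smaller ones
        for i in range(n - width + 1):
            k = i + width
            for (A, B, C), p in binary.items():
                for j in range(i + 1, k):
                    chart[(A, i, k)] += p * chart[(B, i, j)] * chart[(C, j, k)]
    return chart[(start, 0, n)], chart

prob, chart = inside("brothers box lies".split())
print(round(prob, 4))                        # 0.0652, matching the chart on the slide
```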

  51. CFG parsing takes n³ |R| time P_G(A ⇒⋆ w_{i,k}) = Σ_{A → B C ∈ R(A)} Σ_{j=i+1}^{k−1} p(A → B C) P_G(B ⇒⋆ w_{i,j}) P_G(C ⇒⋆ w_{j,k})
The algorithm iterates over all rules R and all triples of string positions 0 ≤ i < j < k ≤ n, and there are O(n³) such triples. (Figure: as before, an A spanning w_{i,k} split into B over w_{i,j} and C over w_{j,k}.)

  52. PFSA parsing takes n |R| time Because FSA trees are uniformly right branching,
• all non-trivial constituents end at the right edge of the sentence
⇒ the inside algorithm takes n |R| time:
P_G(A ⇒⋆ w_{i,n}) = Σ_{A → B C ∈ R(A)} p(A → B C) P_G(B ⇒⋆ w_{i,i+1}) P_G(C ⇒⋆ w_{i+1,n})
• The standard FSM algorithms are just CFG algorithms restricted to right-branching structures.
(Figure: the right-branching parse of aaba built from the productions 0 → a 0, 0 → a 1, 1 → b 0.)

  53. ♦♦ Unary productions and unary closure Dealing with “one level” unary productions A → B is easy, but how do we deal with “loopy” unary productions A ⇒⁺ B ⇒⁺ A?
The unary closure matrix is C_{ij} = P(A_i ⇒⋆ A_j) for all A_i, A_j ∈ S. Define U_{ij} = p(A_i → A_j) for all A_i, A_j ∈ S.
If x is a (column) vector of inside weights, Ux is the vector of inside weights of parses with one unary branch above x. The unary closure is the sum of the inside weights with any number of unary branches:
x + Ux + U²x + … = (1 + U + U² + …)x = (1 − U)⁻¹ x
The unary closure matrix C = (1 − U)⁻¹ can be pre-computed, so unary closure is just a matrix multiplication. Because “new” nonterminals introduced by binarization never occur in unary chains, unary closure is cheap.
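A minimal numerical sketch (my own) of unary closure as a matrix inverse; the unary rule probabilities here are made up purely for illustration.

```python
import numpy as np

# U[i, j] = p(A_i -> A_j) for a hypothetical fragment with unary rules
# A -> B (0.3), B -> A (0.2), B -> C (0.4); nonterminal order A, B, C.
U = np.array([[0.0, 0.3, 0.0],
              [0.2, 0.0, 0.4],
              [0.0, 0.0, 0.0]])

C = np.linalg.inv(np.eye(3) - U)      # C = (I - U)^{-1} = I + U + U^2 + ...
x = np.array([0.0, 0.0, 0.5])         # inside weights before unary closure
print(C @ x)                          # inside weights including all unary chains
```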

  54. Finding the most likely parse of a string Given a string w ∈ V⋆, find the most likely tree ψ̂ = argmax_{ψ∈Ψ_G(w)} P_G(ψ). (The most likely parse is also known as the Viterbi parse.)
Claim: if we substitute “max” for “+” in the algorithm for P_G(w), it returns P_G(ψ̂):
P_G(ψ̂_{A,i,k}) = max_{A → B C ∈ R(A)} max_{j=i+1,…,k−1} p(A → B C) P_G(ψ̂_{B,i,j}) P_G(ψ̂_{C,j,k})
To return ψ̂, add “back-pointers” to keep track of the best parse ψ̂_{A,i,j} for each A ⇒⋆ w_{i,j}.
Implementation note: there is no need to actually build the trees ψ̂_{A,i,k}; rather, the back-pointers in each table entry point to the table entries for the best parse’s children.
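A sketch (my own) of the Viterbi variant of the inside code given after slide 50: the sum is replaced by a max and back-pointers are recorded. It reuses the `binary` and `lexical` tables defined in that earlier sketch.

```python
def viterbi(words, start="S"):
    n = len(words)
    best, back = {}, {}
    for i, w in enumerate(words):
        for (A, x), p in lexical.items():
            if x == w and p > best.get((A, i, i + 1), 0.0):
                best[(A, i, i + 1)] = p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for (A, B, C), p in binary.items():
                for j in range(i + 1, k):
                    q = p * best.get((B, i, j), 0.0) * best.get((C, j, k), 0.0)
                    if q > best.get((A, i, k), 0.0):
                        best[(A, i, k)] = q
                        back[(A, i, k)] = (j, B, C)   # back-pointer to the children's cells
    return best.get((start, 0, n), 0.0), back

print(viterbi("brothers box lies".split())[0])   # ≈ 0.064: the best parse uses VP -> V NP
```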

  55. ♦♦ Semi-ring of rule weights Our algorithms don’t actually require that the values associated with productions are probabilities. They only require that productions take values in some semi-ring with operations “⊕” and “⊗” satisfying the usual associative and distributive laws:
⊕ = +, ⊗ = × : sum of probabilities or weights
⊕ = max, ⊗ = × : Viterbi parse
⊕ = max, ⊗ = + : Viterbi parse with log probabilities
⊕ = ∨, ⊗ = ∧ : categorical CFG parsing

  56. Topics
• Graphical models and Bayes networks
• Markov chains and hidden Markov models
• (Probabilistic) context-free grammars
• (Probabilistic) finite-state machines
• Computation with PCFGs
• Estimation of PCFGs
• Lexicalized and bi-lexicalized PCFGs
• Non-local dependencies and log-linear models
• Stochastic unification-based grammars

  57. Maximum likelihood estimation An estimator p̂ for the parameters p ∈ P of a model P_p(X) is a function from data D to an estimate p̂(D) ∈ P. The likelihood L_D(p) and log likelihood ℓ_D(p) of data D = (x_1, …, x_n) with respect to model parameters p are:
L_D(p) = P_p(x_1) … P_p(x_n)
ℓ_D(p) = Σ_{i=1}^n log P_p(x_i)
The maximum likelihood estimate (MLE) p̂_MLE of p from D is:
p̂_MLE = argmax_p L_D(p) = argmax_p ℓ_D(p)

  58. ♦♦ Optimization and Lagrange multipliers At the unconstrained optimum of f(x), ∂f(x)/∂x = 0. But maximum likelihood estimation often requires optimizing f(x) subject to constraints g_k(x) = 0 for k = 1, …, m. Introduce Lagrange multipliers λ = (λ_1, …, λ_m), and define:
F(x, λ) = f(x) − λ · g(x) = f(x) − Σ_{k=1}^m λ_k g_k(x)
Then at the constrained optimum, all of the following hold:
0 = ∂F(x, λ)/∂x = ∂f(x)/∂x − Σ_{k=1}^m λ_k ∂g_k(x)/∂x
0 = ∂F(x, λ)/∂λ = g(x)

  59. Biased coin example The model has parameters p = (p_h, p_t) that satisfy the constraint p_h + p_t = 1. The log likelihood of data D = (x_1, …, x_n), x_i ∈ {h, t}, is:
ℓ_D(p) = log(p_{x_1} … p_{x_n}) = n_h log p_h + n_t log p_t
where n_h is the number of h in D and n_t is the number of t in D.
F(p, λ) = n_h log p_h + n_t log p_t − λ(p_h + p_t − 1)
0 = ∂F/∂p_h = n_h/p_h − λ
0 = ∂F/∂p_t = n_t/p_t − λ
From the constraint p_h + p_t = 1 and the last two equations:
λ = n_h + n_t,  p_h = n_h/λ = n_h/(n_h + n_t),  p_t = n_t/λ = n_t/(n_h + n_t)
So the MLE is the relative frequency.

  60. ♦♦ PCFG MLE from visible data Data: a treebank of parse trees D = (ψ_1, …, ψ_n).
ℓ_D(p) = Σ_{i=1}^n log P_G(ψ_i) = Σ_{A→α∈R} n_{A→α}(D) log p(A → α)
Introduce |S| Lagrange multipliers λ_B, B ∈ S, for the constraints Σ_{B→β∈R(B)} p(B → β) = 1. Then:
∂/∂p(A → α) [ ℓ(p) − Σ_{B∈S} λ_B (Σ_{B→β∈R(B)} p(B → β) − 1) ] = n_{A→α}(D)/p(A → α) − λ_A
Setting this to 0 gives p(A → α) = n_{A→α}(D) / Σ_{A→α′∈R(A)} n_{A→α′}(D)
So the MLE for PCFGs is the relative frequency estimator.

  61. Example: Estimating PCFGs from visible data Treebank: the trees [S [NP rice] [VP grows]], [S [NP rice] [VP grows]], and [S [NP corn] [VP grows]].
Rule           Count   Rel. freq.
S → NP VP      3       1
NP → rice      2       2/3
NP → corn      1       1/3
VP → grows     3       1
The estimated PCFG assigns P([S [NP rice] [VP grows]]) = 2/3 and P([S [NP corn] [VP grows]]) = 1/3.
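A sketch (my own) of the relative-frequency estimator on this toy treebank, using the same nested-tuple tree encoding as the earlier sketches:

```python
from collections import Counter

treebank = [
    ("S", ("NP", "rice"), ("VP", "grows")),
    ("S", ("NP", "rice"), ("VP", "grows")),
    ("S", ("NP", "corn"), ("VP", "grows")),
]

def rules(tree):
    """Yield the (lhs, rhs) rules used in a tree."""
    label, *children = tree
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

counts = Counter(r for t in treebank for r in rules(t))
lhs_totals = Counter()
for (lhs, rhs), n in counts.items():
    lhs_totals[lhs] += n
for (lhs, rhs), n in sorted(counts.items()):
    print(lhs, "->", " ".join(rhs), n / lhs_totals[lhs])
# S -> NP VP 1.0, NP -> corn 1/3, NP -> rice 2/3, VP -> grows 1.0
```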

  62. Properties of MLE
• Consistency: as the sample size grows, the estimates of the parameters converge on the true parameters
• Asymptotic optimality: for large samples, there is no other consistent estimator whose estimates have lower variance
• The MLEs for statistical grammars work well in practice.
  – The Penn Treebank has ≈ 1.2 million words of Wall Street Journal text annotated with syntactic trees
  – The PCFG estimated from the Penn Treebank has ≈ 15,000 rules

  63. ♦♦ PCFG estimation from hidden data Data: a corpus of sentences D′ = (w_1, …, w_n).
ℓ_{D′}(p) = Σ_{i=1}^n log P_G(w_i),  where P_G(w) = Σ_{ψ∈Ψ_G(w)} P_G(ψ)
∂ℓ_{D′}(p)/∂p(A → α) = Σ_{i=1}^n E_G[n_{A→α} | w_i] / p(A → α)
where the expected number of times A → α is used in the parses of w is:
E_G[n_{A→α} | w] = Σ_{ψ∈Ψ_G(w)} n_{A→α}(ψ) P_G(ψ | w)
Setting ∂ℓ_{D′}/∂p(A → α) equal to the Lagrange multiplier λ_A and imposing the constraint Σ_{B→β∈R(B)} p(B → β) = 1 yields:
p(A → α) = Σ_{i=1}^n E_G[n_{A→α} | w_i] / Σ_{A→α′∈R(A)} Σ_{i=1}^n E_G[n_{A→α′} | w_i]
This is an iteration of the expectation maximization algorithm!

  64. Expectation maximization EM is a general technique for approximating the MLE when estimating parameters p from the visible data x is difficult, but estimating p from augmented data z = (x, y) is easier (y is the hidden data).
The EM algorithm, given visible data x:
1. Guess an initial value p_0 of the parameters.
2. Repeat for i = 0, 1, … until convergence:
Expectation step: for all y_1, …, y_n ∈ Y, generate pseudo-data (x, y_1), …, (x, y_n), where (x, y_j) has frequency P_{p_i}(y_j | x)
Maximization step: set p_{i+1} to the MLE from the pseudo-data
The EM algorithm finds the MLE p̂(x) = argmax_p L_x(p) of the visible data x. Sometimes it is not necessary to explicitly generate the pseudo-data (x, y); often the maximization step can be performed directly from sufficient statistics (for PCFGs, the expected production frequencies).

  65. Dynamic programming for E_G[n_{A→B C} | w]
E_G[n_{A→B C} | w] = Σ_{0≤i<j<k≤n} E_G[A_{i,k} → B_{i,j} C_{j,k} | w]
The expected fraction of parses of w in which A_{i,k} rewrites as B_{i,j} C_{j,k} is:
E_G[A_{i,k} → B_{i,j} C_{j,k} | w] = P(S ⇒⋆ w_{0,i} A w_{k,n}) p(A → B C) P(B ⇒⋆ w_{i,j}) P(C ⇒⋆ w_{j,k}) / P_G(w)
(Figure: an A spanning w_{i,k} split into B over w_{i,j} and C over w_{j,k}, with outside context w_{0,i} and w_{k,n}.)

  66. Calculating P_G(S ⇒⋆ w_{0,i} A w_{k,n}) These are known as “outside probabilities” (though if G contains unary productions, they can be greater than 1). The recursion runs from larger to smaller substrings of w.
Base case: P(S ⇒⋆ w_{0,0} S w_{n,n}) = 1
Recursion:
P(S ⇒⋆ w_{0,j} C w_{k,n}) = Σ_{i=0}^{j−1} Σ_{A,B∈S: A→B C∈R} P(S ⇒⋆ w_{0,i} A w_{k,n}) p(A → B C) P(B ⇒⋆ w_{i,j})
  + Σ_{l=k+1}^{n} Σ_{A,D∈S: A→C D∈R} P(S ⇒⋆ w_{0,j} A w_{l,n}) p(A → C D) P(D ⇒⋆ w_{k,l})

  67. Recursion in P_G(S ⇒⋆ w_{0,i} A w_{k,n})
P(S ⇒⋆ w_{0,j} C w_{k,n}) = Σ_{i=0}^{j−1} Σ_{A,B∈S: A→B C∈R} P(S ⇒⋆ w_{0,i} A w_{k,n}) p(A → B C) P(B ⇒⋆ w_{i,j})
  + Σ_{l=k+1}^{n} Σ_{A,D∈S: A→C D∈R} P(S ⇒⋆ w_{0,j} A w_{l,n}) p(A → C D) P(D ⇒⋆ w_{k,l})
(Figure: the two cases of the recursion — C as the right child of an A with left sibling B over w_{i,j}, and C as the left child of an A with right sibling D over w_{k,l}.)

  68. The EM algorithm for PCFGs Infer hidden structure by maximizing the likelihood of the visible data:
1. Guess initial rule probabilities.
2. Repeat until convergence:
(a) parse a sample of sentences,
(b) weight each parse by its conditional probability,
(c) count the rules used in each weighted parse, and estimate rule probabilities from these counts as before.
EM optimizes the marginal likelihood of the strings D = (w_1, …, w_n). Each iteration is guaranteed not to decrease the likelihood of D, but EM can get trapped in local maxima. The Inside-Outside algorithm can produce the expected counts without enumerating all parses of D. When used with PFSA, the Inside-Outside algorithm is called the Forward-Backward algorithm (Inside = Backward, Outside = Forward).
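A compact sketch (my own) of one EM iteration for a tiny CNF PCFG, following steps (a)-(c) literally: it enumerates every parse of each sentence, weights it by its conditional probability, and re-estimates rule probabilities as relative frequencies of the expected counts. Real implementations use Inside-Outside instead of enumeration; this sketch also assumes every sentence has at least one parse.

```python
from collections import Counter, defaultdict

def parses(words, sym, probs, i, k):
    """Yield (rule Counter, probability) for each way sym derives words[i:k]."""
    if k - i == 1 and probs.get((sym, (words[i],))):        # lexical rule
        yield Counter({(sym, (words[i],)): 1}), probs[(sym, (words[i],))]
    for (lhs, rhs), p in probs.items():                     # binary rules
        if lhs != sym or len(rhs) != 2:
            continue
        B, C = rhs
        for j in range(i + 1, k):
            for cb, pb in parses(words, B, probs, i, j):
                for cc, pc in parses(words, C, probs, j, k):
                    yield Counter({(lhs, rhs): 1}) + cb + cc, p * pb * pc

def em_step(corpus, probs, start="S"):
    expected = defaultdict(float)
    for words in corpus:
        analyses = list(parses(words, start, probs, 0, len(words)))
        z = sum(p for _, p in analyses)                     # P_G(words)
        for counts, p in analyses:
            for rule, n in counts.items():
                expected[rule] += n * p / z                 # E_G[n_rule | words]
    totals = defaultdict(float)
    for (lhs, _), c in expected.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in expected.items()}

probs = {("S", ("NP", "VP")): 1.0, ("NP", ("rice",)): 0.5, ("NP", ("corn",)): 0.5,
         ("VP", ("grows",)): 1.0}
print(em_step([["rice", "grows"], ["rice", "grows"], ["corn", "grows"]], probs))
# With unambiguous sentences one step just gives the relative frequencies:
# NP -> rice 2/3, NP -> corn 1/3, S -> NP VP 1.0, VP -> grows 1.0
```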

  69. Example: The EM algorithm with a toy PCFG
Initial rule probabilities: VP → V 0.2, VP → V NP 0.2, VP → NP V 0.2, VP → V NP NP 0.2, VP → NP NP V 0.2, Det → the 0.1, N → the 0.1, V → the 0.1, …
“English” input: the dog bites; the dog bites a man; a man gives the dog a bone; …
“Pseudo-Japanese” input: the dog bites; the dog a man bites; a man the dog a bone gives; …

  70. Probability of “English” (Figure: average sentence probability, on a log scale from 10⁻⁶ to 1, plotted against EM iterations 0–5.)

  71. Rule probabilities from “English” (Figure: probabilities of the rules VP → V NP, VP → NP V, VP → V NP NP, VP → NP NP V, Det → the, N → the, V → the over EM iterations 0–5.)

  72. Probability of “Japanese” (Figure: average sentence probability, on a log scale from 10⁻⁶ to 1, plotted against EM iterations 0–5.)

  73. Rule probabilities from “Japanese” (Figure: probabilities of the same rules over EM iterations 0–5.)

  74. Learning in the statistical paradigm
• The likelihood is a differentiable function of the rule probabilities ⇒ learning can involve small, incremental updates
• Learning new structure (rules) is hard, but …
• parameter estimation can approximate rule learning:
  – start with a “superset” grammar
  – estimate rule probabilities
  – discard low-probability rules

  75. Applying EM to real data
• The ATIS treebank consists of 1,300 hand-constructed parse trees
• ignore the words (in this experiment)
• about 1,000 PCFG rules are needed to build these trees
(Figure: an ATIS parse tree for “Show me all the nonstop flights from Dallas to Denver early in the morning”.)

  76. Experiments with EM
1. Extract productions from the trees and estimate their probabilities from the trees to produce a PCFG.
2. Initialize EM with the treebank grammar and MLE probabilities.
3. Apply EM (to the strings alone) to re-estimate the production probabilities.
4. At each iteration:
• measure the likelihood of the training data and the quality of the parses produced by each grammar;
• test on the training data (so poor performance is not due to overlearning).

  77. Likelihood of training strings (Figure: −log P_G(w⃗) plotted against EM iterations 0–20; y-axis from 14,000 to 16,000.)

  78. Quality of ML parses (Figure: parse accuracy — precision and recall, y-axis from 0.7 to 1 — plotted against EM iterations 0–20.)

  79. Why does EM do so poorly?
• EM assigns trees to strings so as to maximize the marginal probability of the strings, but the treebank trees weren’t designed with that in mind
• We have an “intended interpretation” of categories like NP, VP, etc., which EM has no way of knowing
• Our grammar models are defective; real languages aren’t context-free
• How can information about P(w) provide information about P(ψ | w)?
• … but no one really knows.

  80. Topics
• Graphical models and Bayes networks
• (Hidden) Markov models
• (Probabilistic) context-free grammars
• (Probabilistic) finite-state machines
• Computation with PCFGs
• Estimation of PCFGs
• Lexicalized and bi-lexicalized PCFGs
• Non-local dependencies and log-linear models
• Stochastic unification-based grammars

  81. Subcategorization Grammars that merely relate categories miss a lot of important linguistic relationships. R_3 = {VP → V, VP → V NP, V → sleeps, V → likes, …}
(Figure: two trees, one for “Al ___” and one for “Al ___ mangoes”; sleeps is fine in the first frame but starred in the second, and likes is fine in the second but starred in the first.)
Verbs and other heads of phrases subcategorize for the number and kind of complement phrases they can appear with.
