Probabilistic Context-Free Grammars (PCFGs)
Berlin Chen 2003
References:
1. Speech and Language Processing, chapter 12
2. Foundations of Statistical Natural Language Processing, chapters 11, 12
Parsing for Disambiguation
• At least three ways to use probabilities in a parser
  – Probabilities for choosing between parses
    • Choose, from among the many parses of an input sentence, the one that is most likely
  – Probabilities for speedier parsing (parsing as search)
    • Use probabilities to order or prune the search space of a parser so that the best parse is found more quickly
  – Probabilities for determining the sentence
    • Use the parser as a language model over a word lattice in order to determine the sequence of words that has the highest probability
Parsing for Disambiguation
• The integration of sophisticated structural and probabilistic models of syntax is at the very cutting edge of the field
  – For non-probabilistic syntax analysis
    • The context-free grammar (CFG) is the standard
  – For probabilistic syntax analysis
    • No single model has become a standard
    • A number of probabilistic augmentations to context-free grammars have been proposed
      – Probabilistic CFG with the CYK algorithm
      – Probabilistic lexicalized CFG
      – Dependency grammars
      – ...
Definition of the PCFG (Booth, 1969)
• A PCFG G has five parameters
  1. A set of non-terminal symbols (or "variables") N — the syntactic and lexical categories
  2. A set of terminal symbols Σ (the words), disjoint from N
  3. A set of productions P, each of the form A → β, where A is a non-terminal symbol and β is a string of symbols from the infinite set of strings (Σ ∪ N)*
  4. A designated start symbol S (or N^1)
  5. A function D that augments each rule A → β in P with a conditional probability, written P(A → β) or P(A → β | A), such that
     ∀A: Σ_β P(A → β) = 1
• That is, a PCFG is a quintuple G = (N, Σ, P, S, D)
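As a concrete illustration (not part of the original slides), here is a minimal Python sketch of how the quintuple G = (N, Σ, P, S, D) might be represented, together with a check of the per-non-terminal normalization constraint. The class and method names (`PCFG`, `check_normalized`) are our own choices.

```python
# A minimal sketch of a PCFG G = (N, Sigma, P, S, D), assuming rule
# probabilities are stored as a mapping {lhs: {rhs_tuple: prob}}.
class PCFG:
    def __init__(self, start, rules):
        self.start = start                      # designated start symbol S
        self.rules = rules                      # D: P(A -> beta) for every production in P
        self.nonterminals = set(rules)          # N: every symbol that can be expanded
        self.terminals = {sym for rhss in rules.values()
                          for rhs in rhss for sym in rhs
                          if sym not in rules}  # Sigma: symbols that are never expanded

    def check_normalized(self, tol=1e-9):
        # For every non-terminal A, the probabilities of its expansions must sum to 1.
        return all(abs(sum(rhss.values()) - 1.0) < tol
                   for rhss in self.rules.values())
```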
An Example Grammar
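The grammar table on this slide did not survive extraction. The rules below are the standard example grammar from Foundations of Statistical Natural Language Processing (chapter 11); the probabilities are an assumption recovered from the numbers used in the parse-tree and inside-probability examples on the later slides (e.g. VP → V NP 0.7, VP → VP PP 0.3, NP → astronomers 0.1).

```python
# Assumed example grammar (Manning & Schutze), consistent with the later slides.
example_rules = {
    "S":  {("NP", "VP"): 1.0},
    "PP": {("P", "NP"): 1.0},
    "VP": {("V", "NP"): 0.7, ("VP", "PP"): 0.3},
    "NP": {("NP", "PP"): 0.4, ("astronomers",): 0.1, ("ears",): 0.18,
           ("saw",): 0.04, ("stars",): 0.18, ("telescopes",): 0.1},
    "P":  {("with",): 1.0},
    "V":  {("saw",): 1.0},
}
grammar = PCFG("S", example_rules)
assert grammar.check_normalized()
```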
Parse Trees
• Input: astronomers saw stars with ears
  – An instance of PP-attachment ambiguity
• The probability of a particular parse is defined as the product of the probabilities of all the rules used to expand each node in the parse tree
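As a worked illustration (our own addition, assuming the example grammar sketched above), the probabilities of the two PP-attachment parses are just products of the rule probabilities used in each tree:

```python
# Parse t1: the PP "with ears" attaches to the NP "stars".
t1_rules = [("S", ("NP", "VP")), ("NP", ("astronomers",)), ("VP", ("V", "NP")),
            ("V", ("saw",)), ("NP", ("NP", "PP")), ("NP", ("stars",)),
            ("PP", ("P", "NP")), ("P", ("with",)), ("NP", ("ears",))]
# Parse t2: the PP attaches to the VP instead.
t2_rules = [("S", ("NP", "VP")), ("NP", ("astronomers",)), ("VP", ("VP", "PP")),
            ("VP", ("V", "NP")), ("V", ("saw",)), ("NP", ("stars",)),
            ("PP", ("P", "NP")), ("P", ("with",)), ("NP", ("ears",))]

def parse_prob(rule_list):
    # Multiply the probability of every rule used to expand a node in the tree.
    prob = 1.0
    for lhs, rhs in rule_list:
        prob *= grammar.rules[lhs][rhs]
    return prob

print(parse_prob(t1_rules))  # 0.0009072 -- the NP-attachment parse
print(parse_prob(t2_rules))  # 0.0006804 -- the VP-attachment parse
```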
Parse Trees
• Input: dogs in houses and cats
  – An instance of coordination ambiguity
• Which parse is correct?
• However, the PCFG will assign identical probabilities to the two parses, since both use exactly the same rules
Basic Assumptions
• Notation: N^j_kl denotes the non-terminal N^j dominating the words w_k ... w_l of the input string w_1 ... w_n
• Place invariance
  – The probability of a subtree does not depend on where in the string the words it dominates are
    ∀k: P(N^j_k(k+c) → ζ) is the same, for any span of c+1 word positions in the input string
• Context free
  – The probability of a subtree does not depend on words not dominated by the subtree
    P(N^j_kl → ζ | anything outside k through l) = P(N^j_kl → ζ)
• Ancestor free
  – The probability of a subtree does not depend on nodes in the derivation outside the subtree
    P(N^j_kl → ζ | any ancestor nodes outside N^j_kl) = P(N^j_kl → ζ)
Basic Assumptions
• Example: by the chain rule together with the context-free, ancestor-free, and place-invariant assumptions, the probability of a parse tree reduces to a product of rule probabilities
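The worked derivation on this slide did not survive extraction. A hedged sketch of the kind of step the assumptions license, using the NP-attachment parse of the astronomers sentence, might look as follows (our reconstruction, not the slide's exact figure):

```latex
% Chain rule: expand the tree top-down, conditioning each rule on everything above it.
% The context-free and ancestor-free assumptions drop that conditioning context, and
% place invariance lets P(N^j_{pq} -> zeta) be written simply as P(N^j -> zeta).
\begin{align*}
P(t) &= P(S_{15} \to NP_{11}\,VP_{25})\,
        P(NP_{11} \to \textit{astronomers} \mid S_{15} \to NP_{11}\,VP_{25})\cdots \\
     &= P(S \to NP\ VP)\,P(NP \to \textit{astronomers})\,P(VP \to V\ NP)\,P(V \to \textit{saw}) \\
     &\quad \times P(NP \to NP\ PP)\,P(NP \to \textit{stars})\,P(PP \to P\ NP)\,
        P(P \to \textit{with})\,P(NP \to \textit{ears})
\end{align*}
```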
Some Features of PCFGs
• PCFGs give some idea (probabilities) of the plausibility of different parses
  – But the probability estimates are based purely on structural factors, not lexical factors
• PCFGs are good for grammar induction
  – A PCFG can be learned from data, e.g. from bracketed (labeled) corpora
• PCFGs are robust
  – They handle grammatical mistakes, disfluencies, and errors not by ruling anything out of the grammar, but by giving implausible sentences a lower probability
Chomsky Normal Form
• Chomsky Normal Form (CNF) grammars have only unary and binary rules of the form
  – N^j → N^r N^s for syntactic categories
  – N^j → w^k for lexical categories
• The parameters of a PCFG in CNF (with n non-terminals and V terminals)
  – P(N^j → N^r N^s | G): an n³ matrix of parameters
  – P(N^j → w^k | G): an nV matrix of parameters
  – n³ + nV parameters in total, subject to
    Σ_{r,s} P(N^j → N^r N^s) + Σ_k P(N^j → w^k) = 1
• Any CFG can be represented by a weakly equivalent CFG in CNF
  – "weakly equivalent": generating the same language
  – But not necessarily assigning the same phrase structure to each sentence
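As a small sketch (our addition, reusing the assumed `PCFG` representation from the earlier sketch), one way to verify that a grammar is already in CNF is to check that every right-hand side is either two non-terminals or a single terminal:

```python
def is_cnf(grammar):
    # Every rule must be either N^j -> N^r N^s (two non-terminals)
    # or N^j -> w^k (a single terminal).
    for lhs, rhss in grammar.rules.items():
        for rhs in rhss:
            binary = len(rhs) == 2 and all(s in grammar.nonterminals for s in rhs)
            lexical = len(rhs) == 1 and rhs[0] not in grammar.nonterminals
            if not (binary or lexical):
                return False
    return True

assert is_cnf(grammar)  # the assumed example grammar happens to be in CNF already
```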
CYK Algorithm (Ney, 1991; Collins, 1999)
• CYK (Cocke-Younger-Kasami) algorithm
  – A bottom-up parser using a dynamic programming table
  – Assumes the PCFG is in Chomsky Normal Form (CNF)
• Definitions
  – w_1 ... w_n: an input string composed of n words
  – w_ij: the string of words from word i to word j
  – π[i, j, a]: a table entry holding the maximum probability of a constituent with non-terminal index a spanning words w_i ... w_j
CYK Algorithm
• Fill in the table entries by induction
  – Base case
    • Consider input strings of length one (i.e., each individual word w_i); the entry is P(A → w_i)
    • Since the grammar is in CNF, A ⇒* w_i iff there is a rule A → w_i (A must be a lexical category)
  – Recursive case
    • For strings of words of length > 1, A ⇒* w_ij iff there is at least one rule A → B C where B derives the first k−i+1 words (w_i ... w_k) and C derives the last j−k words (w_(k+1) ... w_j); here A must be a syntactic category
    • Compute the probability by multiplying together the probabilities of these two pieces (already computed in the recursion) and of the rule A → B C, and choose the maximum among all possible rules and split points k
CYK Algorithm
• Finding the most likely parse for a sentence: the table entries are initialized to zero and filled span by span over the m-word input string; with n non-terminals the procedure runs in O(m³n³) time, with back-pointers kept for bookkeeping (a sketch of the procedure follows below)
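A minimal sketch of the table-filling procedure (our own Python rendering, not the slide's pseudocode), assuming the `PCFG` representation and example grammar from the earlier sketches; `cyk_parse` returns the probability of the most likely parse and the back-pointers needed to reconstruct it:

```python
from collections import defaultdict

def cyk_parse(grammar, words):
    m = len(words)
    pi = defaultdict(float)   # pi[(i, j, A)] = max prob of A spanning words i..j (1-based)
    back = {}                 # back-pointers (split point and children) for bookkeeping

    # Base case: spans of length one, filled from the lexical rules A -> w_i.
    for i, w in enumerate(words, start=1):
        for lhs, rhss in grammar.rules.items():
            if (w,) in rhss:
                pi[(i, i, lhs)] = rhss[(w,)]

    # Recursive case: longer spans, trying every binary rule A -> B C and split point k.
    for span in range(2, m + 1):
        for i in range(1, m - span + 2):
            j = i + span - 1
            for lhs, rhss in grammar.rules.items():
                for rhs, prob in rhss.items():
                    if len(rhs) != 2:
                        continue
                    B, C = rhs
                    for k in range(i, j):
                        cand = prob * pi[(i, k, B)] * pi[(k + 1, j, C)]
                        if cand > pi[(i, j, lhs)]:
                            pi[(i, j, lhs)] = cand
                            back[(i, j, lhs)] = (k, B, C)

    return pi[(1, m, grammar.start)], back

best_prob, back = cyk_parse(grammar, "astronomers saw stars with ears".split())
print(best_prob)  # 0.0009072 -- the NP-attachment parse wins
```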
Three Basic Problems for PCFGs
• What is the probability of a sentence w_1m according to a grammar G: P(w_1m | G)?
• What is the most likely parse for a sentence: argmax_t P(t | w_1m, G)?
• How can we choose rule probabilities for the grammar G that maximize the probability of a sentence (training the PCFG): argmax_G P(w_1m | G)?
The Inside-Outside Algorithm (Baker, 1979; Young, 1990)
• A generalization of the forward-backward algorithm for HMMs
• A dynamic programming technique used to efficiently compute PCFG probabilities
  – Inside and outside probabilities in a PCFG
The Inside-Outside Algorithm
• Definitions
  – Inside probability: β_j(p, q) = P(w_pq | N^j_pq, G)
    • The total probability of generating the words w_p ... w_q given that one starts off with the non-terminal N^j
  – Outside probability: α_j(p, q) = P(w_1(p−1), N^j_pq, w_(q+1)m | G)
    • The total probability of beginning with the start symbol N^1 and generating the non-terminal N^j_pq and all the words outside w_p ... w_q
Problem 1: The Probability of a Sentence
• A PCFG in Chomsky Normal Form is used here
• The total probability of a sentence, expressed by the inside algorithm
  P(w_1m | G) = P(N^1 ⇒* w_1m | G) = P(w_1m | N^1_1m, G) = β_1(1, m)
• The base case (word span = 1)
  β_j(k, k) = P(w_k | N^j_kk, G) = P(N^j → w_k | G)
• The probabilities β_j(p, q) for word spans > 1 are found by induction (bottom-up recursion)
Problem 1: The Probability of a Sentence
• Find the probabilities β_j(p, q) by induction — a bottom-up calculation, for all j and 1 ≤ p < q ≤ m

  β_j(p, q) = P(N^j_pq ⇒* w_pq | G) = P(w_pq | N^j_pq, G)
    = Σ_{r,s} Σ_{d=p}^{q−1} P(w_pd, N^r_pd, w_(d+1)q, N^s_(d+1)q | N^j_pq, G)
    = Σ_{r,s} Σ_{d=p}^{q−1} P(N^r_pd, N^s_(d+1)q | N^j_pq, G) × P(w_pd | N^j_pq, N^r_pd, N^s_(d+1)q, G)
                            × P(w_(d+1)q | N^j_pq, N^r_pd, N^s_(d+1)q, w_pd, G)        (chain rule)
    = Σ_{r,s} Σ_{d=p}^{q−1} P(N^r_pd, N^s_(d+1)q | N^j_pq, G) × P(w_pd | N^r_pd, G) × P(w_(d+1)q | N^s_(d+1)q, G)
                                                    (context-free & ancestor-free assumptions)
    = Σ_{r,s} Σ_{d=p}^{q−1} P(N^j → N^r N^s) × β_r(p, d) × β_s(d+1, q)   (place-invariance; the binary rule)
Problem 1: The Probability of a Sentence
• Example (for the input astronomers saw stars with ears; a computational sketch follows below)
  β_VP(2, 5) = P(VP → V NP) × β_V(2, 2) × β_NP(3, 5) + P(VP → VP PP) × β_VP(2, 3) × β_PP(4, 5)
             = 0.7 × 1.0 × 0.01296 + 0.3 × 0.126 × 0.18 = 0.015876
  β_S(1, 5)  = P(S → NP VP) × β_NP(1, 1) × β_VP(2, 5)
             = 1.0 × 0.1 × 0.015876 = 0.0015876
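A minimal sketch of the inside-probability computation (our addition, reusing the assumed example grammar and `PCFG` representation from the earlier sketches); the sums over binary rules and split points d mirror the recursion above:

```python
from collections import defaultdict

def inside_probabilities(grammar, words):
    m = len(words)
    beta = defaultdict(float)   # beta[(p, q, A)] = P(w_p..w_q | A_pq, G), 1-based spans

    # Base case: single-word spans, beta_j(k, k) = P(N^j -> w_k).
    for k, w in enumerate(words, start=1):
        for lhs, rhss in grammar.rules.items():
            if (w,) in rhss:
                beta[(k, k, lhs)] = rhss[(w,)]

    # Induction: sum over all binary rules N^j -> N^r N^s and all split points d.
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for lhs, rhss in grammar.rules.items():
                for rhs, prob in rhss.items():
                    if len(rhs) != 2:
                        continue
                    r, s = rhs
                    for d in range(p, q):
                        beta[(p, q, lhs)] += prob * beta[(p, d, r)] * beta[(d + 1, q, s)]
    return beta

beta = inside_probabilities(grammar, "astronomers saw stars with ears".split())
print(beta[(2, 5, "VP")])  # 0.015876
print(beta[(1, 5, "S")])   # 0.0015876 = P(w_15 | G)
```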