Lecture 17: Statistical Parsing with PCFG
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
CS6501-NLP
Reading list
• Look at Mike Collins' notes on PCFGs and lexicalized PCFGs: http://www.cs.columbia.edu/~mcollins/

Phrase structure (constituency) trees
• Can be modeled by context-free grammars
CKY algorithm
for J := 1 to n
    add to [J-1,J] all categories for the Jth word
for width := 2 to n
    for start := 0 to n-width              // this is I
        define end := start + width        // this is J
        for mid := start+1 to end-1        // find all I-to-J phrases
            for every rule X → Y Z in the grammar
                if Y in [start,mid] and Z in [mid,end]
                    then add X to [start,end]
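The recognizer loop above can be sketched in Python. This is a minimal illustration, not the course's reference code: the toy grammar, the lexicon, and the names BINARY_RULES, LEXICON, and cky_recognize are assumptions made for the example, and the grammar is assumed to be in Chomsky normal form.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (X, Y, Z) means X -> Y Z.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("PP", "P", "NP"),
]
# Hypothetical lexicon: word -> set of categories that can rewrite to it.
LEXICON = {
    "time": {"NP"},
    "flies": {"VP"},
    "like": {"P"},
    "an": {"Det"},
    "arrow": {"N"},
}


def cky_recognize(words, binary_rules, lexicon, root="S"):
    """Return True if `root` can span the whole sentence."""
    n = len(words)
    chart = defaultdict(set)  # chart[(i, j)] = categories covering words i..j
    for j in range(1, n + 1):
        chart[(j - 1, j)] |= lexicon.get(words[j - 1], set())
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for x, y, z in binary_rules:
                    if y in chart[(start, mid)] and z in chart[(mid, end)]:
                        chart[(start, end)].add(x)
    return root in chart[(0, n)]


sentence = "time flies like an arrow".split()
print(cky_recognize(sentence, BINARY_RULES, LEXICON))
```

The three nested loops over width, start, and mid give the usual cubic dependence on sentence length, multiplied by the number of binary rules.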
Weighted CKY: Viterbi algorithm
Note: assume the weights are log probabilities of rules.
initialize all entries of chart to -∞
for i := 1 to n
    for each rule R of the form X → word[i]
        chart[X,i-1,i] max= weight(R)
for width := 2 to n
    for start := 0 to n-width
        define end := start + width
        for mid := start+1 to end-1
            for each rule R of the form X → Y Z
                chart[X,start,end] max= weight(R) + chart[Y,start,mid] + chart[Z,mid,end]
return chart[ROOT,0,n]
Slides are modified from Jason Eisner's NLP course
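A small Python sketch of this max-log-probability recurrence follows. The rule probabilities in BINARY and LEXICAL are invented for illustration, and viterbi_cky is a hypothetical name. The sketch returns only the best score, not the tree, which would additionally need backpointers.

```python
import math
from collections import defaultdict

# Hypothetical PCFG: probabilities of rules with the same left side sum to 1.
BINARY = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "Det", "N"): 0.5,
    ("VP", "VP", "PP"): 0.5,
    ("PP", "P", "NP"): 1.0,
}
LEXICAL = {
    ("NP", "time"): 0.5,
    ("VP", "flies"): 0.5,
    ("P", "like"): 1.0,
    ("Det", "an"): 1.0,
    ("N", "arrow"): 1.0,
}


def viterbi_cky(words, binary, lexical, root="S"):
    """Return the log probability of the best parse (-inf if there is none)."""
    n = len(words)
    chart = defaultdict(lambda: -math.inf)  # missing entries behave like -inf
    for i in range(1, n + 1):
        for (x, word), p in lexical.items():
            if word == words[i - 1]:
                key = (x, i - 1, i)
                chart[key] = max(chart[key], math.log(p))
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for (x, y, z), p in binary.items():
                    score = (math.log(p)
                             + chart[(y, start, mid)]
                             + chart[(z, mid, end)])
                    key = (x, start, end)
                    chart[key] = max(chart[key], score)
    return chart[(root, 0, n)]


best = viterbi_cky("time flies like an arrow".split(), BINARY, LEXICAL)
print(best, math.exp(best))
```

Working in log space turns products of small probabilities into sums, which avoids numerical underflow on long sentences.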
Likelihood of a parse tree: why?
Probabilistic trees
• Just like language models or HMMs for POS tagging
• We make independence assumptions!
Example tree:
(S (NP time)
   (VP (VP flies)
       (PP (P like)
           (NP (Det an) (N arrow)))))
Chain rule: one word at a time
p(time flies like an arrow)
  = p(time)
  * p(flies | time)
  * p(like | time flies)
  * p(an | time flies like)
  * p(arrow | time flies like an)
Chain rule + independence assumptions (to get a trigram model)
Each word is conditioned only on the two preceding words; the earlier history is dropped:
p(time flies like an arrow)
  ≈ p(time)
  * p(flies | time)
  * p(like | time flies)
  * p(an | flies like)
  * p(arrow | like an)
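To make the difference between the exact chain rule and the trigram approximation concrete, the following Python sketch lists the (word, history) pairs each factorization conditions on. The function name chain_rule_factors and its context parameter are hypothetical, and no probabilities are estimated here.

```python
def chain_rule_factors(words, context=None):
    """List (word, history) pairs, one per factor p(word | history).

    context=None keeps the full history (exact chain rule);
    context=2 keeps only the two previous words (trigram assumption).
    """
    factors = []
    for i, word in enumerate(words):
        first = 0 if context is None else max(0, i - context)
        factors.append((word, tuple(words[first:i])))
    return factors


sentence = "time flies like an arrow".split()
for word, history in chain_rule_factors(sentence, context=2):
    print(f"p({word} | {' '.join(history)})")
```

With context=None the last factor conditions on all four previous words; with context=2 it conditions only on "like an".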
Chain rule, written differently
p(time flies like an arrow)
  = p(time)
  * p(time flies | time)
  * p(time flies like | time flies)
  * p(time flies like an | time flies like)
  * p(time flies like an arrow | time flies like an)
Proof: p(x,y | x) = p(x | x) * p(y | x, x) = 1 * p(y | x)
Chain rule + independence assumptions
In the same notation, the trigram assumption drops all but the two most recent words from each conditioning context:
p(time flies like an arrow)
  ≈ p(time)
  * p(time flies | time)
  * p(time flies like | time flies)
  * p(time flies like an | flies like)
  * p(time flies like an arrow | like an)
Chain rule: one node at a time
Each factor expands one more node of the tree, conditioned on the partial tree built so far:
p(tree | S)
  = p( [S NP VP] | S )
  * p( [S [NP time] VP] | [S NP VP] )
  * p( [S [NP time] [VP VP PP]] | [S [NP time] VP] )
  * p( [S [NP time] [VP [VP flies] PP]] | [S [NP time] [VP VP PP]] )
  * …
where tree = [S [NP time] [VP [VP flies] [PP [P like] [NP [Det an] [N arrow]]]]]
Chain rule + independence assumptions
Each expansion is assumed to depend only on the node being expanded, not on the rest of the partial tree:
p(tree | S)
  ≈ p( [S NP VP] | S )
  * p( [NP time] | NP )
  * p( [VP VP PP] | VP )
  * p( [VP flies] | VP )
  * …
Simplified notation
p(tree | S)
  = p(S → NP VP | S)
  * p(NP → time | NP)
  * p(VP → VP PP | VP)
  * p(VP → flies | VP)
  * …
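Under these independence assumptions the probability of a tree is simply the product of the probabilities of the rules it uses. A minimal Python sketch follows, with trees as nested tuples; the rule probabilities in RULE_PROB and the function name tree_prob are made up for illustration.

```python
# Hypothetical rule probabilities p(X -> rhs | X), keyed by (X, rhs).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("time",)): 0.5,
    ("NP", ("Det", "N")): 0.5,
    ("VP", ("VP", "PP")): 0.5,
    ("VP", ("flies",)): 0.5,
    ("PP", ("P", "NP")): 1.0,
    ("P", ("like",)): 1.0,
    ("Det", ("an",)): 1.0,
    ("N", ("arrow",)): 1.0,
}

# A tree is (label, child, ...); a preterminal is (label, word).
TREE = ("S",
        ("NP", "time"),
        ("VP",
         ("VP", "flies"),
         ("PP", ("P", "like"), ("NP", ("Det", "an"), ("N", "arrow")))))


def tree_prob(tree, rule_prob):
    """Multiply the probabilities of all rules used in the tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return rule_prob[(label, (children[0],))]  # lexical rule X -> word
    rhs = tuple(child[0] for child in children)
    p = rule_prob[(label, rhs)]
    for child in children:
        p *= tree_prob(child, rule_prob)
    return p


print(tree_prob(TREE, RULE_PROB))
```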
Three basic problems for HMMs
• Likelihood of the input: how likely is the sentence "I love cat"?
    • Forward algorithm
• Decoding (tagging) the input: what are the POS tags of "I love cat"?
    • Viterbi algorithm
• Estimation (learning): how to learn the model?
    • Find the best model parameters
    • Case 1: supervised (tags are annotated): maximum likelihood estimation (MLE)
    • Case 2: unsupervised (only unannotated text): forward-backward algorithm
Three basic problems for phrase structure trees
• Likelihood of the input: how likely is the sentence "I love cat"?
    • Inside algorithm
• Decoding (parsing) the input: what is the parse tree of "I love cat"?
    • CKY algorithm
• Estimation (learning): how to learn the model?
    • Find the best model parameters
    • Case 1: supervised (parse trees are annotated): maximum likelihood estimation (MLE)
    • Case 2: unsupervised (only unannotated text): inside-outside algorithm
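For the supervised case, maximum likelihood estimation for a PCFG reduces to relative-frequency counting: p(X → rhs | X) = count(X → rhs) / count(X). The sketch below assumes a tiny hand-made treebank of nested tuples; TREEBANK, extract_rules, and mle_rule_probs are hypothetical names, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical two-tree treebank; a tree is (label, child, ...),
# and a preterminal is (label, word).
TREEBANK = [
    ("S",
     ("NP", "time"),
     ("VP",
      ("VP", "flies"),
      ("PP", ("P", "like"), ("NP", ("Det", "an"), ("N", "arrow"))))),
    ("S", ("NP", "time"), ("VP", "flies")),
]


def extract_rules(tree):
    """Yield every (lhs, rhs) rule used in the tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],))
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from extract_rules(child)


def mle_rule_probs(treebank):
    """Estimate p(lhs -> rhs | lhs) by relative frequency."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for tree in treebank:
        for lhs, rhs in extract_rules(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}


for (lhs, rhs), p in sorted(mle_rule_probs(TREEBANK).items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.3f}")
```

By construction, the estimated probabilities of all rules sharing a left-hand side sum to 1.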
Probabilistic CKY: Inside algorithm
initialize all entries of chart to 0
for i := 1 to n
    for each rule R of the form X → word[i]
        chart[X,i-1,i] += prob(R)
for width := 2 to n
    for start := 0 to n-width
        define end := start + width
        for mid := start+1 to end-1
            for each rule R of the form X → Y Z
                chart[X,start,end] += prob(R) * chart[Y,start,mid] * chart[Z,mid,end]
return chart[ROOT,0,n]
600.465 - Intro to NLP - J. Eisner
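The inside recurrence can be sketched in Python on a sentence with two parses, so that the sum over parses is visible. The grammar, its probabilities, and the name inside_prob are assumptions for this example only.

```python
from collections import defaultdict

# Hypothetical PCFG with a PP-attachment ambiguity.
BINARY = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "NP", "PP"): 0.2,
    ("VP", "V", "NP"): 0.7,
    ("VP", "VP", "PP"): 0.3,
    ("PP", "P", "NP"): 1.0,
}
LEXICAL = {
    ("NP", "she"): 0.4,
    ("NP", "stars"): 0.2,
    ("NP", "telescopes"): 0.2,
    ("V", "saw"): 1.0,
    ("P", "with"): 1.0,
}


def inside_prob(words, binary, lexical, root="S"):
    """Return the total probability of all parses of `words` rooted in `root`."""
    n = len(words)
    chart = defaultdict(float)  # missing entries behave like 0
    for i in range(1, n + 1):
        for (x, word), p in lexical.items():
            if word == words[i - 1]:
                chart[(x, i - 1, i)] += p
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for (x, y, z), p in binary.items():
                    chart[(x, start, end)] += (
                        p * chart[(y, start, mid)] * chart[(z, mid, end)]
                    )
    return chart[(root, 0, n)]


print(inside_prob("she saw stars with telescopes".split(), BINARY, LEXICAL))
```

With these invented numbers, the parse that attaches the PP to the verb phrase has probability 0.00336 and the one that attaches it to the noun phrase has probability 0.00224, so the sentence probability is their sum, 0.0056.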
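How to build a width-6 phrase
Grammar:
    S → NP VP
    NP → Det N
    NP → NP PP
    VP → V NP
    VP → VP PP
    PP → P NP
A phrase "?" spanning [1,7] can be built from two adjacent smaller phrases in five ways:
    [1,7] = [1,2] + [2,7]
       or   [1,3] + [3,7]
       or   [1,4] + [4,7]
       or   [1,5] + [5,7]
       or   [1,6] + [6,7]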
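The decomposition of a span into two adjacent sub-spans corresponds to the `mid` loop of CKY. A short Python sketch (split_points is a hypothetical helper name) that enumerates the splits:

```python
def split_points(start, end):
    """Return every way to cut span [start, end] into two adjacent sub-spans."""
    return [((start, mid), (mid, end)) for mid in range(start + 1, end)]


for left, right in split_points(1, 7):
    print(left, "+", right)
```

A span of width w has w - 1 split points, so the width-6 span [1,7] has five.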
CKY: Recognition algorithm
Pay attention to the initial value and the update operators; they are the only parts that differ between the CKY variants.
initialize all entries of chart to false
for i := 1 to n
    for each rule R of the form X → word[i]
        chart[X,i-1,i] |= in_grammar(R)
for width := 2 to n
    for start := 0 to n-width
        define end := start + width
        for mid := start+1 to end-1
            for each rule R of the form X → Y Z
                chart[X,start,end] |= in_grammar(R) & chart[Y,start,mid] & chart[Z,mid,end]
return chart[ROOT,0,n]
Weighted CKY: Viterbi algorithm (min-cost)
Pay attention to the initial value and the update operators.
initialize all entries of chart to ∞
for i := 1 to n
    for each rule R of the form X → word[i]
        chart[X,i-1,i] min= weight(R)
for width := 2 to n
    for start := 0 to n-width
        define end := start + width
        for mid := start+1 to end-1
            for each rule R of the form X → Y Z
                chart[X,start,end] min= weight(R) + chart[Y,start,mid] + chart[Z,mid,end]
return chart[ROOT,0,n]
Weighted CKY: Viterbi algorithm (max-prob)
Pay attention to the initial value and the update operators.
initialize all entries of chart to 0
for i := 1 to n
    for each rule R of the form X → word[i]
        chart[X,i-1,i] max= weight(R)
for width := 2 to n
    for start := 0 to n-width
        define end := start + width
        for mid := start+1 to end-1
            for each rule R of the form X → Y Z
                chart[X,start,end] max= weight(R) * chart[Y,start,mid] * chart[Z,mid,end]
return chart[ROOT,0,n]
Weighted CKY: Viterbi algorithm (max-logprob)
Pay attention to the initial value and the update operators.
initialize all entries of chart to -∞
for i := 1 to n
    for each rule R of the form X → word[i]
        chart[X,i-1,i] max= weight(R)
for width := 2 to n
    for start := 0 to n-width
        define end := start + width
        for mid := start+1 to end-1
            for each rule R of the form X → Y Z
                chart[X,start,end] max= weight(R) + chart[Y,start,mid] + chart[Z,mid,end]
return chart[ROOT,0,n]
Semiring-weighted CKY: General algorithm!
⊗ is like "and"/∀: combines all of several pieces into an X.
⊕ is like "or"/∃: considers the alternative ways to build the X.
initialize all entries of chart to the semiring zero (the identity element of ⊕)
for i := 1 to n
    for each rule R of the form X → word[i]
        chart[X,i-1,i] ⊕= semiring_weight(R)
for width := 2 to n
    for start := 0 to n-width
        define end := start + width
        for mid := start+1 to end-1
            for each rule R of the form X → Y Z
                chart[X,start,end] ⊕= semiring_weight(R) ⊗ chart[Y,start,mid] ⊗ chart[Z,mid,end]
return chart[ROOT,0,n]
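One way to see that recognition, Viterbi, and the inside algorithm are the same procedure is to pass ⊕, ⊗, and the zero element in as parameters. The Python sketch below does this; the Semiring tuple, its from_prob field, the toy grammar, and semiring_cky are all illustrative assumptions rather than anything defined in the lecture.

```python
import operator
from collections import defaultdict, namedtuple

# plus = ⊕, times = ⊗, zero = identity of ⊕,
# from_prob = maps a rule probability to a semiring weight.
Semiring = namedtuple("Semiring", "plus times zero from_prob")

BOOLEAN = Semiring(operator.or_, operator.and_, False, lambda p: p > 0)
INSIDE = Semiring(operator.add, operator.mul, 0.0, lambda p: p)
VITERBI = Semiring(max, operator.mul, 0.0, lambda p: p)

# Hypothetical PCFG with a PP-attachment ambiguity.
BINARY = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "NP", "PP"): 0.2,
    ("VP", "V", "NP"): 0.7,
    ("VP", "VP", "PP"): 0.3,
    ("PP", "P", "NP"): 1.0,
}
LEXICAL = {
    ("NP", "she"): 0.4,
    ("NP", "stars"): 0.2,
    ("NP", "telescopes"): 0.2,
    ("V", "saw"): 1.0,
    ("P", "with"): 1.0,
}


def semiring_cky(words, binary, lexical, sr, root="S"):
    """Run CKY with the operations of semiring `sr`."""
    n = len(words)
    chart = defaultdict(lambda: sr.zero)
    for i in range(1, n + 1):
        for (x, word), p in lexical.items():
            if word == words[i - 1]:
                key = (x, i - 1, i)
                chart[key] = sr.plus(chart[key], sr.from_prob(p))
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for (x, y, z), p in binary.items():
                    value = sr.times(
                        sr.times(sr.from_prob(p), chart[(y, start, mid)]),
                        chart[(z, mid, end)],
                    )
                    key = (x, start, end)
                    chart[key] = sr.plus(chart[key], value)
    return chart[(root, 0, n)]


sentence = "she saw stars with telescopes".split()
for name, sr in [("boolean", BOOLEAN), ("inside", INSIDE), ("viterbi", VITERBI)]:
    print(name, semiring_cky(sentence, BINARY, LEXICAL, sr))
```

Only the semiring changes between the three runs; the loops are identical. A max-log-probability variant would use max for ⊕, addition for ⊗, -∞ as zero, and the logarithm for from_prob.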