NLP Programming Tutorial 11 – The Structured Perceptron

Graham Neubig
Nara Institute of Science and Technology (NAIST)
Prediction Problems

Given x, predict y:
● Binary prediction (2 choices) – a book review → is it positive?
  "Oh, man I love this book!" → yes
  "This book is so boring..." → no
● Multi-class prediction (several choices) – a tweet → its language
  "On the way to the park!" → English
  "公園に行くなう!" → Japanese
● Structured prediction (millions of choices) – a sentence → its syntactic parse
  "I read a book" → (S (NP (N I)) (VP (VBD read) (NP (DET a) (NN book))))
  Most NLP problems are structured prediction!
So Far, We Have Learned

● Classifiers (perceptron, SVM, neural net): lots of features, binary prediction
● Generative models (HMM for POS tagging, CFG for parsing): conditional probabilities, structured prediction

Structured Perceptron

● Structured perceptron → classification with lots of features over structured models!
Uses of Structured Perceptron (or Variants)

● POS tagging with HMMs: Collins, "Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms," ACL 2002
● Parsing: Huang, "Forest Reranking: Discriminative Parsing with Non-Local Features," ACL 2008
● Machine translation: Liang et al., "An End-to-End Discriminative Approach to Machine Translation," ACL 2006 (and a plug: Neubig et al., "Inducing a Discriminative Parser for Machine Translation Reordering," EMNLP 2012)
● Discriminative language models: Roark et al., "Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm," ACL 2004
Example: Part of Speech (POS) Tagging

● Given a sentence X, predict its part of speech sequence Y:

  Natural/JJ language/NN processing/NN (/-LRB- NLP/NN )/-RRB- is/VBZ a/DT field/NN of/IN computer/NN science/NN

● A type of structured prediction
Hidden Markov Models (HMMs) for POS Tagging

● POS→POS transition probabilities (like a bigram model!):
  P(Y) \approx \prod_{i=1}^{I+1} P_T(y_i | y_{i-1})
● POS→Word emission probabilities:
  P(X|Y) \approx \prod_{i=1}^{I} P_E(x_i | y_i)

Example:
  <s> JJ NN NN -LRB- NN -RRB- ... </s>
      natural language processing ( nlp ) ...
  Transitions: P_T(JJ|<s>) * P_T(NN|JJ) * P_T(NN|NN) * ...
  Emissions:   P_E(natural|JJ) * P_E(language|NN) * P_E(processing|NN) * ...
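Not part of the original slides, but as a rough sketch, the joint probability above could be computed from transition and emission probability dictionaries roughly like this (the string key format "prev_tag tag" / "tag word" is an assumption for illustration):

    import math

    def hmm_log_prob(words, tags, P_T, P_E):
        """Log joint probability log P(X,Y) of a word sequence and its tag sequence.
        P_T maps "prev_tag tag" to a transition probability,
        P_E maps "tag word" to an emission probability.
        (Smoothing / unknown-word handling is omitted in this sketch.)"""
        logp = 0.0
        prev = "<s>"
        for word, tag in zip(words, tags):
            logp += math.log(P_T[prev + " " + tag])   # POS -> POS transition
            logp += math.log(P_E[tag + " " + word])   # POS -> word emission
            prev = tag
        logp += math.log(P_T[prev + " </s>"])         # transition to sentence end
        return logp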
Why are Features Good?

● Can easily try many different ideas:
  ● Are capitalized words usually nouns?
  ● Are words that end in -ed usually verbs? What about -ing?
Restructuring HMM With Features

Normal HMM:
  P(X,Y) = \prod_{i=1}^{I} P_E(x_i | y_i) \prod_{i=1}^{I+1} P_T(y_i | y_{i-1})

Log likelihood:
  \log P(X,Y) = \sum_{i=1}^{I} \log P_E(x_i | y_i) + \sum_{i=1}^{I+1} \log P_T(y_i | y_{i-1})

Score:
  S(X,Y) = \sum_{i=1}^{I} w_{E,y_i,x_i} + \sum_{i=1}^{I+1} w_{T,y_{i-1},y_i}

When w_{E,y_i,x_i} = \log P_E(x_i | y_i) and w_{T,y_{i-1},y_i} = \log P_T(y_i | y_{i-1}), we have \log P(X,Y) = S(X,Y).
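A minimal sketch of how this score might be computed in code, assuming a weight dictionary w keyed by strings such as "T <s> PRP" and "E PRP I" (the key format is an assumption, not from the slides):

    def hmm_score(words, tags, w):
        """Score S(X,Y): the sum of transition and emission feature weights.
        Missing features default to weight 0 via dict.get."""
        score = 0.0
        prev = "<s>"
        for word, tag in zip(words, tags):
            score += w.get("T " + prev + " " + tag, 0.0)   # w_{T, y_{i-1}, y_i}
            score += w.get("E " + tag + " " + word, 0.0)   # w_{E, y_i, x_i}
            prev = tag
        score += w.get("T " + prev + " </s>", 0.0)         # transition to </s>
        return score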
Example

X = "I visited Nara"

φ(X, Y_1) for Y_1 = PRP VBD NNP:
  φ_{T,<s>,PRP} = 1, φ_{T,PRP,VBD} = 1, φ_{T,VBD,NNP} = 1, φ_{T,NNP,</s>} = 1,
  φ_{E,PRP,"I"} = 1, φ_{E,VBD,"visited"} = 1, φ_{E,NNP,"Nara"} = 1,
  φ_{CAPS,PRP} = 1, φ_{CAPS,NNP} = 1, φ_{SUF,VBD,"...ed"} = 1

φ(X, Y_2) for Y_2 = NNP VBD NNP:
  φ_{T,<s>,NNP} = 1, φ_{T,NNP,VBD} = 1, φ_{T,VBD,NNP} = 1, φ_{T,NNP,</s>} = 1,
  φ_{E,NNP,"I"} = 1, φ_{E,VBD,"visited"} = 1, φ_{E,NNP,"Nara"} = 1,
  φ_{CAPS,NNP} = 2, φ_{SUF,VBD,"...ed"} = 1
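One possible way to build such a feature vector in code is sketched below; the feature name strings and the sparse defaultdict representation are assumptions made for illustration:

    from collections import defaultdict

    def create_features(words, tags):
        """Map a sentence X and tag sequence Y to a sparse feature-count vector phi(X,Y)."""
        phi = defaultdict(int)
        prev = "<s>"
        for word, tag in zip(words, tags):
            phi["T " + prev + " " + tag] += 1         # transition feature
            phi["E " + tag + " " + word] += 1         # emission feature
            if word[0].isupper():
                phi["CAPS " + tag] += 1               # capitalization feature
            if word.endswith("ed"):
                phi["SUF " + tag + " ...ed"] += 1     # -ed suffix feature
            prev = tag
        phi["T " + prev + " </s>"] += 1               # transition to sentence end
        return phi

    # e.g. create_features("I visited Nara".split(), ["PRP", "VBD", "NNP"])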
Finding the Best Solution

● We must find the POS sequence that satisfies:
  \hat{Y} = \arg\max_Y \sum_i w_i \phi_i(X, Y)
Remember: HMM Viterbi Algorithm

● Forward step: calculate the best path to each node
  ● Find the path to each node with the lowest negative log probability
● Backward step: reproduce the path
  ● This is easy, almost the same as word segmentation
Forward Step: Part 1

● First, calculate the transition from <s> and the emission of the first word for every POS:

  best_score["1 NN"]  = -log P_T(NN|<s>)  + -log P_E(I | NN)
  best_score["1 JJ"]  = -log P_T(JJ|<s>)  + -log P_E(I | JJ)
  best_score["1 VB"]  = -log P_T(VB|<s>)  + -log P_E(I | VB)
  best_score["1 PRP"] = -log P_T(PRP|<s>) + -log P_E(I | PRP)
  best_score["1 NNP"] = -log P_T(NNP|<s>) + -log P_E(I | NNP)
  ...
Forward Step: Middle Parts

● For middle words, calculate the minimum score over all possible previous POS tags:

  best_score["2 NN"] = min(
      best_score["1 NN"]  + -log P_T(NN|NN)  + -log P_E(visited | NN),
      best_score["1 JJ"]  + -log P_T(NN|JJ)  + -log P_E(visited | NN),
      best_score["1 VB"]  + -log P_T(NN|VB)  + -log P_E(visited | NN),
      best_score["1 PRP"] + -log P_T(NN|PRP) + -log P_E(visited | NN),
      best_score["1 NNP"] + -log P_T(NN|NNP) + -log P_E(visited | NN),
      ... )

  best_score["2 JJ"] = min(
      best_score["1 NN"]  + -log P_T(JJ|NN)  + -log P_E(visited | JJ),
      best_score["1 JJ"]  + -log P_T(JJ|JJ)  + -log P_E(visited | JJ),
      best_score["1 VB"]  + -log P_T(JJ|VB)  + -log P_E(visited | JJ),
      ... )
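A compact sketch of this forward step in code might look like the following; the best_score / best_edge dictionaries mirror the slides, while the 1e-10 floor for unseen probabilities and the key formats are assumptions (the transition to </s> and the backward step are omitted for brevity):

    import math

    def viterbi_forward(words, tags, P_T, P_E):
        """Forward step: fill best_score / best_edge over the (position, tag) lattice."""
        best_score = {"0 <s>": 0.0}
        best_edge = {"0 <s>": None}
        prev_tags = ["<s>"]
        for i, word in enumerate(words, start=1):
            for tag in tags:
                for prev in prev_tags:
                    prev_key = "%d %s" % (i - 1, prev)
                    if prev_key not in best_score:
                        continue
                    score = (best_score[prev_key]
                             - math.log(P_T.get(prev + " " + tag, 1e-10))   # transition
                             - math.log(P_E.get(tag + " " + word, 1e-10)))  # emission
                    key = "%d %s" % (i, tag)
                    if key not in best_score or score < best_score[key]:
                        best_score[key] = score          # keep the minimum score
                        best_edge[key] = prev_key        # remember the best previous node
            prev_tags = tags
        return best_score, best_edge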
HMM Viterbi with Features

● Same as with probabilities, but using feature weights:

  best_score["1 NN"]  = w_{T,<s>,NN}  + w_{E,NN,I}
  best_score["1 JJ"]  = w_{T,<s>,JJ}  + w_{E,JJ,I}
  best_score["1 VB"]  = w_{T,<s>,VB}  + w_{E,VB,I}
  best_score["1 PRP"] = w_{T,<s>,PRP} + w_{E,PRP,I}
  best_score["1 NNP"] = w_{T,<s>,NNP} + w_{E,NNP,I}
  ...
HMM Viterbi with Features

● Can add additional features:

  best_score["1 NN"]  = w_{T,<s>,NN}  + w_{E,NN,I}  + w_{CAPS,NN}
  best_score["1 JJ"]  = w_{T,<s>,JJ}  + w_{E,JJ,I}  + w_{CAPS,JJ}
  best_score["1 VB"]  = w_{T,<s>,VB}  + w_{E,VB,I}  + w_{CAPS,VB}
  best_score["1 PRP"] = w_{T,<s>,PRP} + w_{E,PRP,I} + w_{CAPS,PRP}
  best_score["1 NNP"] = w_{T,<s>,NNP} + w_{E,NNP,I} + w_{CAPS,NNP}
  ...
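In code, the only change from the probability-based forward step sketched above is the local edge score (and replacing min with max, since higher scores are now better). A hedged sketch, assuming the same string-keyed weight dictionary w as in the earlier examples:

    def local_score(prev_tag, tag, word, w):
        """Local score of one edge in the Viterbi lattice, using feature weights."""
        score = w.get("T " + prev_tag + " " + tag, 0.0)    # transition weight
        score += w.get("E " + tag + " " + word, 0.0)       # emission weight
        if word[0].isupper():
            score += w.get("CAPS " + tag, 0.0)             # capitalization feature
        if word.endswith("ed"):
            score += w.get("SUF " + tag + " ...ed", 0.0)   # -ed suffix feature
        return score

The backward step is unchanged: follow best_edge back from the end-of-sentence node to recover the highest-scoring tag sequence.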
Learning in the Structured Perceptron

● Remember the perceptron algorithm: if there is a mistake, update the weights
  w ← w + y φ(x)
  to increase the score of positive examples and decrease the score of negative examples
● What is positive/negative in the structured perceptron?
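As a quick reminder, the binary perceptron update could be written roughly like this (sparse dictionary representation assumed, y in {+1, -1}):

    def update_weights_binary(w, phi, y):
        """Binary perceptron update: w <- w + y * phi(x)."""
        for name, value in phi.items():
            w[name] = w.get(name, 0.0) + y * value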
Learning in the Structured Perceptron

● Positive example, the correct feature vector:
  φ( I/PRP visited/VBD Nara/NNP )
● Negative example, the incorrect feature vector:
  φ( I/NNP visited/VBD Nara/NNP )
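Putting the pieces together, one training step of the structured perceptron could be sketched as follows; create_features is the feature function sketched earlier, and viterbi_decode is a hypothetical function that returns the highest-scoring tag sequence under the current weights:

    def structured_perceptron_step(w, words, gold_tags, tags, viterbi_decode):
        """One training step: if the prediction is wrong, add the features of the
        correct tag sequence and subtract the features of the predicted one."""
        predicted = viterbi_decode(words, tags, w)   # argmax_Y of sum_i w_i * phi_i(X, Y)
        if predicted != gold_tags:                   # only update on a mistake
            for name, value in create_features(words, gold_tags).items():
                w[name] = w.get(name, 0.0) + value   # positive example: correct answer
            for name, value in create_features(words, predicted).items():
                w[name] = w.get(name, 0.0) - value   # negative example: current prediction
        return w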