
Structured Prediction with Local Dependencies - Xuezhe Ma (Max)

  1. CS11-747 Neural Networks for NLP: Structured Prediction with Local Dependencies. Xuezhe Ma (Max). Course site: https://phontron.com/class/nn4nlp2017/

  2. An Example Structured Prediction Problem: Sequence Labeling

  3. Sequence Labeling • One tag for one word • e.g. part-of-speech tagging: I/PRP hate/VBP this/DT movie/NN • e.g. named entity recognition: The/O movie/O featured/O Keanu/B-PER Reeves/I-PER

  4. Sequence Labeling as Independent Classification (Figure: each word of "I hate this movie", padded with <s>, is fed to its own classifier, which independently predicts the tags PRP VBP DT NN.)
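
A minimal PyTorch sketch of this idea (vocabulary, tag, and hidden sizes are hypothetical, and the BiLSTM encoder is only a stand-in for whatever featurizer is used; this is not the lecture's code): a shared encoder produces one feature vector per word, and the classifier predicts every tag independently of the other tags given the input.

      import torch
      import torch.nn as nn

      # Hypothetical sizes, not from the lecture.
      VOCAB, TAGS, EMB, HID = 10_000, 45, 100, 200

      class IndependentTagger(nn.Module):
          """Predicts each tag from the input alone: P(y_j | X), no label interactions."""
          def __init__(self):
              super().__init__()
              self.embed = nn.Embedding(VOCAB, EMB)
              self.encoder = nn.LSTM(EMB, HID, batch_first=True, bidirectional=True)
              self.classifier = nn.Linear(2 * HID, TAGS)

          def forward(self, word_ids):                   # (batch, seq_len)
              h, _ = self.encoder(self.embed(word_ids))  # (batch, seq_len, 2*HID)
              return self.classifier(h)                  # per-position tag logits

      tagger = IndependentTagger()
      logits = tagger(torch.randint(0, VOCAB, (1, 4)))   # e.g. "I hate this movie"
      pred = logits.argmax(-1)                           # each position decided independently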

  5. Locally Normalized Models (Figure: the classifier for each word of "I hate this movie", padded with <s>, also receives the previously predicted tags, producing PRP VBP DT NN left to right.)

  6. Summary • Independent classification models • Strong independence assumption: $P(Y \mid X) = \prod_{j=1}^{M} P(y_j \mid X)$ • No guarantee of valid (consistent) structured outputs, e.g. the BIO tagging scheme in NER • Locally normalized models (e.g. history-based RNN, seq2seq) • Require a fixed prior ordering of decisions: $P(Y \mid X) = \prod_{j=1}^{M} P(y_j \mid X, y_{<j})$ • Approximate decoding: greedy search, beam search • Label bias
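
A small numpy sketch of the greedy search mentioned above for a locally normalized model $P(y_j \mid X, y_{<j})$ (the scoring function is a hypothetical stand-in, not a trained model); beam search would keep the k best histories instead of committing to one.

      import numpy as np

      def greedy_decode(next_log_probs, length, n_tags):
          """Greedy search for a locally normalized model P(y_j | X, y_{<j}).

          next_log_probs(history, j) is a stand-in for a real model: it returns a
          length-n_tags array of log-probabilities for y_j given the tags chosen so far.
          """
          history = []
          for j in range(length):
              history.append(int(np.argmax(next_log_probs(history, j))))  # commit greedily
          return history

      # Hypothetical toy scorer: prefers tag (j + last tag) mod n_tags at position j.
      def toy_scorer(history, j, n_tags=5):
          prev = history[-1] if history else 0
          scores = np.full(n_tags, -10.0)
          scores[(j + prev) % n_tags] = 0.0
          return scores

      print(greedy_decode(toy_scorer, length=4, n_tags=5))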

  7. Globally normalized models? • No overly strong independence assumptions (only local dependencies) • Optimal decoding

  8. Globally normalized models? • No overly strong independence assumptions (only local dependencies) • Optimal decoding • Answer: Conditional Random Fields (CRFs)

  9. Globally Normalized Models • Each output sequence has a score, which is not normalized over a particular decision: $P(Y \mid X) = \frac{\exp(S(Y, X))}{\sum_{Y'} \exp(S(Y', X))} = \frac{\psi(Y, X)}{\sum_{Y'} \psi(Y', X)}$, where $\psi(Y, X) = \exp(S(Y, X))$ are potential functions.

  10. Conditional Random Fields • General form of a globally normalized model: $P(Y \mid X) = \frac{\psi(Y, X)}{\sum_{Y'} \psi(Y', X)}$ • First-order linear-chain CRF: $P(Y \mid X) = \frac{\prod_{j=1}^{M} \psi_j(y_{j-1}, y_j, X)}{\sum_{Y'} \prod_{j=1}^{M} \psi_j(y'_{j-1}, y'_j, X)}$ (Figure: the two graphical models over labels $y_1, \dots, y_M$ and input $x$; the linear-chain model adds edges only between adjacent labels.)
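
A numpy sketch of the two formulas above on a toy problem with made-up potentials (nothing here comes from the lecture): the score of a tag sequence is the product of local potentials $\psi_j(y_{j-1}, y_j, X)$, and global normalization divides by the sum over every possible output sequence, enumerated here by brute force. The forward algorithm later avoids this enumeration.

      import itertools
      import numpy as np

      rng = np.random.default_rng(0)
      n_tags, seq_len = 3, 4

      # log psi_j(y_{j-1}, y_j, X) for each position j, as a (seq_len, n_tags, n_tags) array.
      # Position 0 uses a fixed start state 0 as its "previous" tag -- toy values, not real features.
      log_psi = rng.normal(size=(seq_len, n_tags, n_tags))

      def log_score(tags):
          """log of prod_j psi_j(y_{j-1}, y_j, X) for one tag sequence."""
          prev, total = 0, 0.0
          for j, y in enumerate(tags):
              total += log_psi[j, prev, y]
              prev = y
          return total

      # Brute-force global normalization: enumerate all n_tags**seq_len sequences.
      all_seqs = list(itertools.product(range(n_tags), repeat=seq_len))
      scores = np.array([log_score(s) for s in all_seqs])
      log_Z = np.logaddexp.reduce(scores)
      probs = np.exp(scores - log_Z)
      print(probs.sum())  # ~1.0: a valid distribution over whole sequences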

  11. Potential Functions • With hand-crafted feature functions $T$ (transition) and $S$ (state): $\psi_j(y_{j-1}, y_j, X) = \exp\big(W^{\top} T(y_{j-1}, y_j, X, j) + U^{\top} S(y_j, X, j) + b_{y_{j-1}, y_j}\big)$ • Using neural features $F(X, j)$ from a DNN: $\psi_j(y_{j-1}, y_j, X) = \exp\big(W_{y_{j-1}, y_j}^{\top} F(X, j) + U_{y_j}^{\top} F(X, j) + b_{y_{j-1}, y_j}\big)$ • Number of parameters: $O(|Y|^2 d_F)$ • Simpler version: $\psi_j(y_{j-1}, y_j, X) = \exp\big(W_{y_{j-1}, y_j} + U_{y_j}^{\top} F(X, j) + b_{y_{j-1}, y_j}\big)$ • Number of parameters: $O(|Y|^2 + |Y| d_F)$
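
A sketch of the "simpler version" above, assuming the neural features $F(X, j)$ are a $d$-dimensional vector (all shapes and values here are hypothetical): a scalar transition score per tag pair plus an emission score from the features, which is where the $O(|Y|^2 + |Y| d_F)$ parameter count comes from.

      import numpy as np

      rng = np.random.default_rng(0)
      n_tags, d, seq_len = 5, 8, 6

      W_trans = rng.normal(size=(n_tags, n_tags))   # W_{y_{j-1}, y_j}: |Y|^2 scalar transitions
      U_emit  = rng.normal(size=(n_tags, d))        # U_{y_j}: |Y| x d emission weights
      b       = rng.normal(size=(n_tags, n_tags))   # b_{y_{j-1}, y_j} (could be folded into W_trans)
      F       = rng.normal(size=(seq_len, d))       # stand-in for neural features F(X, j)

      def log_potential(j, y_prev, y):
          """log psi_j(y_{j-1}, y_j, X) = W_{y_prev, y} + U_y . F(X, j) + b_{y_prev, y}."""
          return W_trans[y_prev, y] + U_emit[y] @ F[j] + b[y_prev, y]

      n_params = W_trans.size + b.size + U_emit.size
      print(n_params)  # O(|Y|^2 + |Y| d_F), versus O(|Y|^2 d_F) for the full version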

  12. BiLSTM-CRF for Sequence Labeling (Figure: a bidirectional LSTM reads "<s> I hate this movie <s>" and a CRF layer over its outputs produces the tags PRP VBP DT NN.)

  13. Training & Decoding of CRFs: the Viterbi Algorithm

  14. CRF Training & Decoding • $P(Y \mid X) = \frac{\prod_{j=1}^{M} \psi_j(y_{j-1}, y_j, X)}{\sum_{Y'} \prod_{j=1}^{M} \psi_j(y'_{j-1}, y'_j, X)} = \frac{\prod_{j=1}^{M} \psi_j(y_{j-1}, y_j, X)}{Z(X)}$ • Training: computing the partition function $Z(X) = \sum_{Y} \prod_{j=1}^{M} \psi_j(y_{j-1}, y_j, X)$ • Decoding: $y^{*} = \operatorname{argmax}_{Y} P(Y \mid X)$ • Naively, both require going through the output space of $Y$, which grows exponentially with the length of the input sequence.

  15. Interactions • $Z(X) = \sum_{Y} \prod_{j=1}^{M} \psi_j(y_{j-1}, y_j, X)$ • Each label depends on the input and the nearby labels • But given the adjacent labels, the other labels do not matter • If we knew the score of every sequence $y_1, \dots, y_{n-1}$, we could easily compute the score of any sequence $y_1, \dots, y_{n-1}, y_n$ • So we really only need to know the total score of all the sequences ending in each $y_{n-1}$ • Think of that as some "precalculation" that happens before we think about $y_n$

  16. Viterbi Algorithm • Let $\pi_t(y \mid X)$ be the partition over sequences of length $t$ that end with label $y$: $\pi_t(y \mid X) = \sum_{y_1, \dots, y_{t-1}} \Big[\prod_{j=1}^{t-1} \psi_j(y_{j-1}, y_j, X)\Big]\, \psi_t(y_{t-1}, y_t = y, X) = \sum_{y_{t-1}} \psi_t(y_{t-1}, y_t = y, X) \sum_{y_1, \dots, y_{t-2}} \Big[\prod_{j=1}^{t-2} \psi_j(y_{j-1}, y_j, X)\Big]\, \psi_{t-1}(y_{t-2}, y_{t-1}, X) = \sum_{y_{t-1}} \psi_t(y_{t-1}, y_t = y, X)\, \pi_{t-1}(y_{t-1} \mid X)$ • Computing the partition function: $Z(X) = \sum_{y} \pi_M(y \mid X)$
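
A numpy sketch of the recursion above in log space, using toy potentials and a fixed start state (both assumptions made for illustration): $\pi_t(y)$ is built left to right, and a brute-force computation is included only to check that the two agree.

      import itertools
      import numpy as np

      rng = np.random.default_rng(0)
      n_tags, seq_len = 3, 5
      log_psi = rng.normal(size=(seq_len, n_tags, n_tags))  # toy log psi_j(y_{j-1}, y_j, X)

      def forward_log_Z(log_psi):
          """log Z(X) via the forward recursion: pi_t(y) = sum_{y'} psi_t(y', y) pi_{t-1}(y')."""
          log_pi = log_psi[0, 0]                    # length-1 prefixes (fixed start state 0)
          for t in range(1, len(log_psi)):
              # log_pi[y'] + log psi_t(y', y), combined over y' with log-sum-exp
              log_pi = np.logaddexp.reduce(log_pi[:, None] + log_psi[t], axis=0)
          return np.logaddexp.reduce(log_pi)        # Z(X) = sum_y pi_M(y)

      def brute_force_log_Z(log_psi):
          scores = []
          for seq in itertools.product(range(n_tags), repeat=len(log_psi)):
              prev, s = 0, 0.0
              for t, y in enumerate(seq):
                  s += log_psi[t, prev, y]
                  prev = y
              scores.append(s)
          return np.logaddexp.reduce(np.array(scores))

      print(forward_log_Z(log_psi), brute_force_log_Z(log_psi))  # should match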

  17. Step: Initial Part • First, calculate the transition from <S> and the emission of the first word (here "natural") for every POS tag:
      score["1 NN"]  = T(NN|<S>)  + S(natural | NN)
      score["1 JJ"]  = T(JJ|<S>)  + S(natural | JJ)
      score["1 VB"]  = T(VB|<S>)  + S(natural | VB)
      score["1 LRB"] = T(LRB|<S>) + S(natural | LRB)
      score["1 RRB"] = T(RRB|<S>) + S(natural | RRB)
      ...

  18. Step: Middle Parts • For middle words (here "language"), combine the scores for all possible previous POS tags:
      score["2 NN"] = log_sum_exp(
          score["1 NN"]  + T(NN|NN)  + S(language | NN),
          score["1 JJ"]  + T(NN|JJ)  + S(language | NN),
          score["1 VB"]  + T(NN|VB)  + S(language | NN),
          score["1 LRB"] + T(NN|LRB) + S(language | NN),
          score["1 RRB"] + T(NN|RRB) + S(language | NN),
          ...)
      score["2 JJ"] = log_sum_exp(
          score["1 NN"]  + T(JJ|NN)  + S(language | JJ),
          score["1 JJ"]  + T(JJ|JJ)  + S(language | JJ),
          score["1 VB"]  + T(JJ|VB)  + S(language | JJ),
          ...)
      ...
      where log_sum_exp(x, y) = log(exp(x) + exp(y))

  19. Forward Step: Final Part • Finish up the sentence with the sentence-final symbol </S>:
      score["I+1 </S>"] = log_sum_exp(
          score["I NN"]  + T(</S>|NN),
          score["I JJ"]  + T(</S>|JJ),
          score["I VB"]  + T(</S>|VB),
          score["I LRB"] + T(</S>|LRB),
          score["I RRB"] + T(</S>|RRB),
          ...)
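
The three steps above translate almost line for line into Python. In the sketch below the transition scores T(next|prev) and emission scores S(word|tag) are random stand-ins for a real model's scores (hypothetical values, used only to show the control flow); the result is log Z(X) for the whole lattice.

      import math
      import random
      from itertools import product

      def log_sum_exp(xs):
          m = max(xs)
          return m + math.log(sum(math.exp(x - m) for x in xs))

      # Toy log scores standing in for a real model's T(next|prev) and S(word|tag).
      random.seed(0)
      TAGS = ["NN", "JJ", "VB", "LRB", "RRB"]
      words = ["natural", "language", "processing"]
      T = {(nxt, prev): random.uniform(-2, 0)
           for nxt, prev in product(TAGS + ["</S>"], TAGS + ["<S>"])}
      S = {(w, tag): random.uniform(-2, 0) for w, tag in product(words, TAGS)}

      score = {}
      # Initial part: transition from <S> plus emission of the first word.
      for tag in TAGS:
          score[(1, tag)] = T[(tag, "<S>")] + S[(words[0], tag)]
      # Middle parts: combine all possible previous tags with log_sum_exp.
      for i in range(2, len(words) + 1):
          for tag in TAGS:
              score[(i, tag)] = log_sum_exp(
                  [score[(i - 1, prev)] + T[(tag, prev)] + S[(words[i - 1], tag)]
                   for prev in TAGS])
      # Final part: transition into the sentence-final symbol </S>.
      log_Z = log_sum_exp(
          [score[(len(words), prev)] + T[("</S>", prev)] for prev in TAGS])
      print(log_Z)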

  20. Viterbi Algorithm • Decoding is performed with a similar dynamic programming algorithm, replacing the sums with maxima • Calculating the gradient of $\ell_{ML}(X, Y; \theta) = -\log P(Y \mid X; \theta)$: $\frac{\partial \ell_{ML}(X, Y; \theta)}{\partial \theta} = \mathbb{E}_{Y' \sim P(Y' \mid X; \theta)}[F(Y', X)] - F(Y, X)$, where $F$ is the feature function • Forward-backward algorithm (Sutton and McCallum, 2010) • Both $P(Y \mid X; \theta)$ and $F(Y, X)$ can be decomposed • Need to compute the marginal distribution: $P(y_{j-1} = y', y_j = y \mid X; \theta) = \frac{\alpha_{j-1}(y' \mid X)\, \psi_j(y', y, X)\, \beta_j(y \mid X)}{Z(X)}$ • Not necessary if using a DNN framework (auto-grad)
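
For decoding, the same recursion is run with max in place of log-sum-exp, plus backpointers. A numpy sketch with the same toy potential layout as before (made-up values, fixed start state assumed):

      import numpy as np

      rng = np.random.default_rng(0)
      n_tags, seq_len = 3, 5
      log_psi = rng.normal(size=(seq_len, n_tags, n_tags))  # toy log psi_j(y_{j-1}, y_j, X)

      def viterbi(log_psi):
          """argmax_Y sum_j log psi_j(y_{j-1}, y_j, X), with a fixed start state 0."""
          seq_len, n_tags, _ = log_psi.shape
          best = log_psi[0, 0]                       # best score of length-1 prefixes
          back = np.zeros((seq_len, n_tags), dtype=int)
          for t in range(1, seq_len):
              cand = best[:, None] + log_psi[t]      # cand[y_prev, y]
              back[t] = cand.argmax(axis=0)          # best previous tag for each y
              best = cand.max(axis=0)
          # Trace back from the best final tag.
          tags = [int(best.argmax())]
          for t in range(seq_len - 1, 0, -1):
              tags.append(int(back[t, tags[-1]]))
          return tags[::-1], float(best.max())

      print(viterbi(log_psi))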

  21. Case Study: BiLSTM-CNN-CRF for Sequence Labeling

  22. Case Study: BiLSTM-CNN-CRF for Sequence Labeling (Ma et al., 2016) • Goal: build a truly end-to-end neural model for sequence labeling tasks, requiring no feature engineering or data pre-processing. • Two levels of representations • Character-level representation: CNN • Word-level representation: bi-directional LSTM

  23. CNN for Character-level Representation • We use a CNN to extract morphological information, such as the prefix or suffix of a word
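
A PyTorch sketch of a character-level CNN in this spirit (sizes other than the 30-dimensional character embeddings mentioned in the training details are assumptions): characters are embedded, convolved with a small window, and max-pooled into one vector per word, which is what lets the model pick up prefixes and suffixes.

      import torch
      import torch.nn as nn

      # Hypothetical sizes except CHAR_EMB = 30, which matches the slide on initialization.
      N_CHARS, CHAR_EMB, N_FILTERS, WINDOW = 100, 30, 30, 3

      class CharCNN(nn.Module):
          """One fixed-size vector per word from its characters."""
          def __init__(self):
              super().__init__()
              self.embed = nn.Embedding(N_CHARS, CHAR_EMB, padding_idx=0)
              self.conv = nn.Conv1d(CHAR_EMB, N_FILTERS, kernel_size=WINDOW, padding=1)

          def forward(self, char_ids):                    # (n_words, max_word_len)
              e = self.embed(char_ids).transpose(1, 2)    # (n_words, CHAR_EMB, max_word_len)
              h = torch.relu(self.conv(e))                # (n_words, N_FILTERS, max_word_len)
              return h.max(dim=2).values                  # max-pool over character positions

      words_as_chars = torch.randint(1, N_CHARS, (4, 7))  # 4 words, up to 7 chars each
      print(CharCNN()(words_as_chars).shape)              # torch.Size([4, 30])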

  24. Bi-LSTM-CNN-CRF • We use a Bi-LSTM to model word-level information. • A CRF layer on top of the Bi-LSTM models the correlation between adjacent labels.
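
A compact PyTorch sketch of how the pieces fit together (all sizes hypothetical; only the scoring layers are shown, and the CRF loss and decoding would reuse the forward and Viterbi routines sketched earlier rather than any particular library):

      import torch
      import torch.nn as nn

      VOCAB, N_CHARS, N_TAGS = 10_000, 100, 17
      W_EMB, C_EMB, C_FILT, HID = 100, 30, 30, 200

      class BiLSTM_CNN_CRF_Scorer(nn.Module):
          """Emission scores per position plus a tag-transition matrix; a CRF layer
          (forward algorithm for the loss, Viterbi for decoding) sits on top."""
          def __init__(self):
              super().__init__()
              self.word_embed = nn.Embedding(VOCAB, W_EMB)
              self.char_embed = nn.Embedding(N_CHARS, C_EMB, padding_idx=0)
              self.char_conv = nn.Conv1d(C_EMB, C_FILT, kernel_size=3, padding=1)
              self.lstm = nn.LSTM(W_EMB + C_FILT, HID, batch_first=True, bidirectional=True)
              self.emit = nn.Linear(2 * HID, N_TAGS)
              self.trans = nn.Parameter(torch.zeros(N_TAGS, N_TAGS))  # W_{y_{j-1}, y_j}

          def forward(self, word_ids, char_ids):
              # char_ids: (seq_len, max_word_len) for a single sentence
              c = self.char_embed(char_ids).transpose(1, 2)
              c = torch.relu(self.char_conv(c)).max(dim=2).values     # (seq_len, C_FILT)
              w = self.word_embed(word_ids)                           # (seq_len, W_EMB)
              h, _ = self.lstm(torch.cat([w, c], dim=-1).unsqueeze(0))
              return self.emit(h.squeeze(0)), self.trans              # emissions, transitions

      model = BiLSTM_CNN_CRF_Scorer()
      emissions, transitions = model(torch.randint(0, VOCAB, (5,)),
                                     torch.randint(1, N_CHARS, (5, 8)))
      print(emissions.shape, transitions.shape)  # (5, 17), (17, 17)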

  25. Training Details • Optimization algorithm: SGD with momentum (0.9); the learning rate decays with rate 0.05 after each epoch • Dropout training: dropout applied to regularize the model, with a fixed dropout rate of 0.5 • Parameter initialization: • Parameters: Glorot and Bengio (2010) • Word embeddings: Stanford's GloVe 100-dimensional embeddings • Character embeddings: uniformly sampled from $[-\sqrt{3/\text{dim}}, +\sqrt{3/\text{dim}}]$, where dim = 30
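
A PyTorch sketch of these choices (the model is a placeholder, the base learning rate is an assumption not stated on the slide, and "decays with rate 0.05 after each epoch" is read here as one common schedule, lr = base_lr / (1 + 0.05 * epoch)):

      import math
      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      model = nn.Sequential(nn.Linear(100, 200), nn.ReLU(), nn.Dropout(0.5), nn.Linear(200, 17))

      # Glorot/Xavier initialization for weight matrices; uniform [-sqrt(3/dim), +sqrt(3/dim)]
      # for character embeddings with dim = 30, as on the slide.
      for p in model.parameters():
          if p.dim() >= 2:
              nn.init.xavier_uniform_(p)
      char_embed = nn.Embedding(100, 30)
      bound = math.sqrt(3.0 / 30)
      nn.init.uniform_(char_embed.weight, -bound, bound)

      base_lr = 0.01  # hypothetical value, not taken from the slide
      opt = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)

      for epoch in range(5):
          # ... training loop over batches would go here ...
          for group in opt.param_groups:                       # decay after each epoch
              group["lr"] = base_lr / (1.0 + 0.05 * (epoch + 1))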

  26. Experiments
      Model          POS Dev Acc.  POS Test Acc.  NER Dev P / R / F1     NER Test P / R / F1
      BRNN           96.56         96.76          92.04 / 89.13 / 90.56  87.05 / 83.88 / 85.44
      BLSTM          96.88         96.93          92.31 / 90.85 / 91.57  87.77 / 86.23 / 87.00
      BLSTM-CNN      97.34         97.33          92.52 / 93.64 / 93.07  88.53 / 90.21 / 89.36
      BLSTM-CNN-CRF  97.46         97.55          94.85 / 94.63 / 94.74  91.35 / 91.06 / 91.21

  27. Considering Rewards during Training

  28. Reward Functions in Structured Prediction • POS tagging: token-level accuracy • NER: F1 score • Dependency parsing: labeled attachment score • Machine translation: corpus-level BLEU Do different reward functions impact our decisions?

  29. • Data 1: $(X, Y) \sim P$ • Task 1: predict $Y$ given $X$, i.e. $h_1(X)$ • Reward 1: $R_1(h_1(X), Y)$ • Data 2: $(X, Y) \sim P$ • Task 2: predict $Y$ given $X$, i.e. $h_2(X)$ • Reward 2: $R_2(h_2(X), Y)$ • Is $h_1(X) = h_2(X)$?

  30. (Figure) Predictor with possible payoffs $0 and $1M; the reward is the amount of money we get.

  31. (Figure) Predictor with payoffs $0 and $1M.

  32. (Figure) Predictor with possible payoffs $0, $1M, and $1B; the reward is the amount of money we get.

  33. (Figure) Predictor with payoffs $0, $1M, and $1B.

  34. Considering Rewards during Training • Max-Margin (Taskar et al., 2004) • Similar to the cost-augmented hinge loss (last class) • Does not rely on a probabilistic model (only a decoding algorithm is required) • Minimum Risk Training (Shen et al., 2016) • Reward-augmented Maximum Likelihood (Norouzi et al., 2016)
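
A numpy sketch of the max-margin idea on the toy CRF potentials used earlier (all values hypothetical; the Hamming cost is chosen only for illustration): decode with the cost added to the score, then penalize the violated margin. Note that only a decoding routine is needed, not normalized probabilities.

      import numpy as np

      rng = np.random.default_rng(0)
      n_tags, seq_len = 3, 5
      log_psi = rng.normal(size=(seq_len, n_tags, n_tags))   # toy log psi_j(y_{j-1}, y_j, X)
      gold = rng.integers(0, n_tags, size=seq_len)

      def seq_score(tags):
          prev, s = 0, 0.0
          for j, y in enumerate(tags):
              s += log_psi[j, prev, y]
              prev = y
          return s

      def cost_augmented_viterbi(gold):
          """Viterbi where each position's score is boosted by the Hamming cost vs. gold."""
          aug = log_psi.copy()
          for j in range(seq_len):
              aug[j, :, :] += (np.arange(n_tags) != gold[j]).astype(float)  # +1 if wrong tag
          best = aug[0, 0]
          back = np.zeros((seq_len, n_tags), dtype=int)
          for t in range(1, seq_len):
              cand = best[:, None] + aug[t]
              back[t] = cand.argmax(axis=0)
              best = cand.max(axis=0)
          tags = [int(best.argmax())]
          for t in range(seq_len - 1, 0, -1):
              tags.append(int(back[t, tags[-1]]))
          return np.array(tags[::-1])

      y_hat = cost_augmented_viterbi(gold)
      cost = float((y_hat != gold).sum())
      hinge = max(0.0, seq_score(y_hat) + cost - seq_score(gold))  # structured hinge loss
      print(y_hat, hinge)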
