  1. Sparse and Constrained Attention for Neural Machine Translation
  Chaitanya Malaviya¹, Pedro Ferreira², André F.T. Martins²,³
  ¹Carnegie Mellon University, ²Instituto Superior Técnico, ³Unbabel

  2. Adequacy in Neural Machine Translation

  Repetitions
  Source: und wir benutzen dieses wort mit solcher verachtung .
  Reference: and we say that word with such contempt .
  Translation: and we use this word with such contempt contempt .

  Dropped words
  Source: Ein 28-jähriger Koch, der kürzlich nach Pittsburgh gezogen war, wurde diese Woche im Treppenhaus eines örtlichen Einkaufszentrums tot aufgefunden .
  Reference: A 28-year-old chef who recently moved to Pittsburgh was found dead in the staircase of a local shopping mall this week .
  Translation: A 28-year-old chef who recently moved to Pittsburgh was found dead in the staircase this week .

  3. Previous Work
  • Conditioning on coverage vectors to track attention history (Mi et al., 2016; Tu et al., 2016).
  • Gating architectures and adaptive attention to control the amount of source context (Tu et al., 2017; Li & Zhu, 2017).
  • Reconstruction loss (Tu et al., 2017).
  • Coverage penalty during decoding (Wu et al., 2016).

  4. Main Contributions
  [Figure: fertility example over the source sentence "J'ai mangé le sandwich"]
  1. Fertility-based neural machine translation model (bounds on source attention weights)
  2. Novel attention transform function: constrained sparsemax (enforces these bounds)
  3. Evaluation metrics: REP-Score and DROP-Score

  5. NMT + Attention Architecture

  6. [Architecture figure: encoder states h1–h4 over the source "J'ai mangé le sandwich" are scored against the decoder states g1–g4; attn_score and attn_transform produce attention weights and context vectors c1–c4, which drive the output "I ate the sandwich"; f1–f4 are fertility values.]
  attn_score (see the sketch after this list):
  • dot product (Luong, 2015)
  • bilinear function
  • MLP (Bahdanau, 2014)
  attn_transform:
  • traditional softmax
  • constrained softmax (Martins & Kreutzer, 2017)
  • sparsemax (Martins & Astudillo, 2016)
  • constrained sparsemax (this work)
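  To make the two hooks concrete, here is a minimal NumPy sketch of the three score functions named above; the shapes, variable names, and toy dimensions are illustrative assumptions, not the OpenNMT-Py implementation:

    import numpy as np

    def dot_score(H, g):
        """Dot-product score (Luong, 2015): z_j = h_j . g"""
        return H @ g                                # (n_src,)

    def bilinear_score(H, g, W):
        """Bilinear ("general") score: z_j = h_j^T W g"""
        return H @ W @ g                            # (n_src,)

    def mlp_score(H, g, W1, W2, v):
        """MLP score (Bahdanau, 2014): z_j = v . tanh(W1 h_j + W2 g)"""
        return np.tanh(H @ W1.T + g @ W2.T) @ v     # (n_src,)

    # toy shapes: 5 source states of size 4, hidden size 3 for the MLP
    rng = np.random.default_rng(0)
    H, g = rng.normal(size=(5, 4)), rng.normal(size=4)
    W = rng.normal(size=(4, 4))
    W1, W2, v = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=3)
    print(dot_score(H, g), bilinear_score(H, g, W), mlp_score(H, g, W1, W2, v))

  Each function returns one score per source position; attn_transform then maps the score vector to a probability distribution over source words.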

  7. Attention Transform Functions
  • Sparsemax: the Euclidean projection of the score vector z onto the probability simplex; it yields sparse probability distributions.
  • Constrained softmax: returns the distribution closest to softmax(z) whose attention probabilities are bounded by the upper bounds u.
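  In symbols, with Δ the probability simplex over source positions and z the vector of attention scores (my transcription of the definitions from Martins & Astudillo, 2016 and Martins & Kreutzer, 2017, which the slide states only in words):

    \operatorname{sparsemax}(\mathbf{z}) = \operatorname*{argmin}_{\mathbf{p}\in\Delta} \lVert \mathbf{p}-\mathbf{z} \rVert_2^2

    \operatorname{csoftmax}(\mathbf{z};\mathbf{u}) = \operatorname*{argmin}_{\mathbf{p}\in\Delta,\;\mathbf{p}\le\mathbf{u}} \mathrm{KL}\!\left(\mathbf{p} \,\middle\|\, \operatorname{softmax}(\mathbf{z})\right)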

  8. Attention Transform Functions: Sparse and Constrained?

  9. Constrained Sparsemax
  • Provides sparse and bounded probability distributions.
  • This transformation has two levels of sparsity: over time steps and over attended words at each step.
  • Efficient linear- and sublinear-time algorithms for forward and backward propagation.
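  As a hedged illustration of the forward pass (not the paper's exact linear-time algorithm): constrained sparsemax is the Euclidean projection of z onto {p ∈ Δ : p ≤ u}, whose solution has the closed form p_i = clip(z_i − τ, 0, u_i) for a scalar threshold τ, which a simple bisection can locate:

    import numpy as np

    def csparsemax(z, u, n_iter=60):
        """Constrained sparsemax (sketch): Euclidean projection of z onto
        {p : 0 <= p <= u, sum(p) = 1}. The solution has the form
        p_i = clip(z_i - tau, 0, u_i); bisection finds the threshold tau.
        Illustration only, not the paper's linear-time algorithm."""
        z, u = np.asarray(z, float), np.asarray(u, float)
        assert u.sum() >= 1.0, "bounds must leave room for a distribution"
        mass = lambda tau: np.clip(z - tau, 0.0, u).sum()  # non-increasing in tau
        lo, hi = z.min() - 1.0, z.max()
        while mass(lo) < 1.0:        # widen the bracket until it is feasible
            lo -= 1.0
        for _ in range(n_iter):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if mass(mid) >= 1.0 else (lo, mid)
        return np.clip(z - lo, 0.0, u)

    print(csparsemax([1.2, 0.3, 0.1, -0.5], u=[0.6, 1.0, 1.0, 1.0]))
    # -> approximately [0.6, 0.3, 0.1, 0.0]: sparse (last entry zero)
    #    and bounded (first entry capped at its budget 0.6)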

  10. Visualization: Attention Transform Functions
  [Figure: attention weights at decoding steps t=0, t=1, t=2 for each transform]
  • csparsemax provides sparse and constrained probabilities.

  11. Fertility-based NMT Model

  12. Fertility-based NMT
  • Allocate a fertility to each source word as an attention budget that is exhausted over the course of decoding.
  • Fertility predictor: train a biLSTM model supervised by fertilities from fast_align (IBM Model 2).
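  A minimal sketch of how such supervision targets could be read off fast_align's output, assuming its standard "srcidx-tgtidx" alignment format; the paper's actual preprocessing (e.g. clipping or lower-bounding the counts) may differ:

    def fertilities_from_alignment(alignment_line, src_len):
        """Fertility of source word j = number of target words aligned
        to it in a fast_align line such as "0-0 1-1 1-2 3-3"
        (source-target index pairs)."""
        fert = [0] * src_len
        for pair in alignment_line.split():
            src_idx, _ = pair.split("-")
            fert[int(src_idx)] += 1
        return fert

    print(fertilities_from_alignment("0-0 1-1 1-2 3-3", src_len=4))
    # -> [1, 2, 0, 1]: word 1 gets fertility 2, word 2 is unaligned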

  13. Fertility-based NMT
  • Fertilities are incorporated as upper bounds on the total attention each source word can receive (see below).
  • Exhaustion strategy to encourage more attention to words with larger credit remaining.
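  One plausible way to write the budget mechanism (my notation, inferred from the bound description above): with f the predicted fertilities and α_t the attention weights at decoder step t,

    \beta_t = \sum_{t'=1}^{t} \alpha_{t'}, \qquad \mathbf{u}_t = \mathbf{f} - \beta_{t-1}, \qquad \alpha_t = \operatorname{csparsemax}(\mathbf{z}_t;\, \mathbf{u}_t)

  so each source word j can receive at most f_j total attention over the whole decoding run.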

  14. Experiments

  15. Experiments
  • Experiments performed on 3 language pairs: De-En (IWSLT 2014), Ro-En (Europarl), Ja-En (KFTT).
  • Joint BPE with 32K merge operations.
  • Default hyperparameter settings in OpenNMT-Py.
  • Baselines: softmax, +CovPenalty (Wu et al., 2016), and +CovVector (Tu et al., 2016).

  16. Evaluation Metrics: REP-Score & DROP-Score
  REP-Score (sketched below):
  • Penalizes n-gram repetitions in predicted translations.
  • Normalized by the number of words in the reference corpus.
  DROP-Score:
  • Find word alignments from source to reference and from source to prediction.
  • Percentage of source words aligned to some word in the reference but to no word in the predicted translation.
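  To make the REP-Score idea concrete, here is an illustrative counting scheme (not necessarily the paper's exact one): immediately repeated unigrams and bigrams in the predictions, normalized by the reference corpus length:

    def rep_score(predictions, references, max_n=2):
        """Illustrative REP-Score: count n-grams (n <= max_n) that occur
        twice in a row in the predicted sentences, normalized by the
        number of words in the reference corpus. The paper's exact
        counting may differ; this only mirrors the idea of penalizing
        repetitions."""
        reps = 0
        for sent in predictions:
            toks = sent.split()
            for n in range(1, max_n + 1):
                reps += sum(toks[i:i + n] == toks[i + n:i + 2 * n]
                            for i in range(len(toks) - 2 * n + 1))
        ref_words = sum(len(ref.split()) for ref in references)
        return 100.0 * reps / ref_words

    # the repetition example from slide 2 yields one repeated unigram
    print(rep_score(["and we use this word with such contempt contempt ."],
                    ["and we say that word with such contempt ."]))  # ~11.1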

  17. Results
  [Bar chart: BLEU scores on De-En, Ja-En, and Ro-En for softmax, softmax+CovPenalty, softmax+CovVector, and csparsemax. The four systems score close to one another on each pair: roughly 29.5–30.1 BLEU across De-En and Ro-En, and 20.4–21.5 BLEU on Ja-En.]

  18. REP Scores and DROP Scores (lower is better!)
  [Bar charts: REP-Scores and DROP-Scores on De-En, Ja-En, and Ro-En for the same four systems. Ja-En shows the most repetition and dropping (REP roughly 11.1–14.1, DROP roughly 21.6–23.3), while De-En and Ro-En are far lower (REP roughly 2.0–3.5, DROP roughly 5.2–5.9).]

  19. [Attention maps: softmax vs. csparsemax on an example sentence]
  • csparsemax yields a sparse set of alignments and avoids repetitions.

  20. Examples of Translations

  21. More in the paper…

  22. Thank You!
  Code: www.github.com/Unbabel/sparse_constrained_attention
  Questions?
