
Structured Attention Networks
Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush
HarvardNLP

1 Deep Neural Networks for Text Processing and Generation
2 Attention Networks
3 Structured Attention Networks
  Computational Challenges


  1. Structured Attention Networks for Neural Machine Translation

  2. Structured Attention Networks for Neural Machine Translation

  3. Structured Attention Networks for Neural Machine Translation

  4. Structured Attention Networks for Neural Machine Translation

  5. Motivation: Structured Output Prediction
     Modeling the structured output (i.e. a graphical model on top of a neural net) has improved performance (LeCun et al., 1998; Lafferty et al., 2001; Collobert et al., 2011).
     Given a sequence $x = x_1, \dots, x_T$ and factored potentials $\theta_{i,i+1}(z_i, z_{i+1}; x)$:
     $$p(z_1, \dots, z_T \mid x; \theta) = \mathrm{softmax}\Big(\sum_{i=1}^{T-1} \theta_{i,i+1}(z_i, z_{i+1}; x)\Big) = \frac{1}{Z}\exp\Big(\sum_{i=1}^{T-1} \theta_{i,i+1}(z_i, z_{i+1}; x)\Big)$$
     $$Z = \sum_{z' \in \mathcal{C}} \exp\Big(\sum_{i=1}^{T-1} \theta_{i,i+1}(z'_i, z'_{i+1}; x)\Big)$$
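
For concreteness, here is a minimal NumPy sketch (not from the slides) that realizes this definition by brute-force enumeration over label sequences; theta is an assumed (T-1, C, C) array of pairwise potentials:

    import itertools
    import numpy as np

    def crf_sequence_distribution(theta):
        """Brute-force p(z_1..z_T | x) for a linear-chain CRF.

        theta: (T-1, C, C) array with pairwise potentials
               theta[i, a, b] = theta_{i,i+1}(z_i = a, z_{i+1} = b; x).
        Returns a dict mapping each label sequence to its probability.
        """
        T, C = theta.shape[0] + 1, theta.shape[1]
        scores = {z: sum(theta[i, z[i], z[i + 1]] for i in range(T - 1))
                  for z in itertools.product(range(C), repeat=T)}
        Z = sum(np.exp(s) for s in scores.values())       # partition function
        return {z: np.exp(s) / Z for z, s in scores.items()}

    # Tiny example: T = 4 positions, C = 3 labels, random potentials.
    rng = np.random.default_rng(0)
    p = crf_sequence_distribution(rng.normal(size=(3, 3, 3)))
    print(sum(p.values()))   # ~1.0: a proper distribution (softmax over label sequences)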

  6. Example: Part-of-Speech Tagging

  7. Example: Part-of-Speech Tagging

  8. Example: Part-of-Speech Tagging

  9. Example: Part-of-Speech Tagging

  10. Example: Part-of-Speech Tagging

  11. Neural CRF for Sequence Tagging (Collobert et al., 2011)

  12. Neural CRF for Sequence Tagging (Collobert et al., 2011)
      Unary potentials $\theta_i(c) = w_c^\top x_i$ come from a neural network

  13. Inference in Linear-Chain CRF
      Pairwise potentials are simple parameters $b$, so altogether
      $$\theta_{i,i+1}(c, d) = \theta_i(c) + \theta_{i+1}(d) + b_{c,d}$$
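
A minimal NumPy sketch of this construction, assuming unary scores of shape (T, C) from the network and a (C, C) transition table b (both names are illustrative):

    import numpy as np

    def pairwise_potentials(unary, b):
        """theta_{i,i+1}(c, d) = theta_i(c) + theta_{i+1}(d) + b_{c,d}.

        unary: (T, C) unary scores theta_i(c), e.g. w_c^T x_i from the network.
        b:     (C, C) transition parameters b_{c,d}.
        Returns a (T-1, C, C) tensor of pairwise potentials.
        """
        return unary[:-1, :, None] + unary[1:, None, :] + b[None, :, :]

    rng = np.random.default_rng(0)
    theta = pairwise_potentials(rng.normal(size=(5, 4)), rng.normal(size=(4, 4)))
    print(theta.shape)   # (4, 4, 4): one C x C potential table per adjacent pair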

  14. 1 Deep Neural Networks for Text Processing and Generation
      2 Attention Networks
      3 Structured Attention Networks
          Computational Challenges
          Structured Attention In Practice
      4 Conclusion and Future Work

  15. Structured Attention Networks: Notation
      $x_1, \dots, x_T$: memory bank
      $q$: query
      $z = z_1, \dots, z_T$: memory selection over structures
      $p(z \mid x, q; \theta)$: attention distribution over structures
      $f(x, z)$: annotation function (neural representation)
      $c = \mathbb{E}_{z \sim p(z \mid x, q)}[f(x, z)]$: context vector
      Need to calculate
      $$c = \sum_{i=1}^{T} p(z_i = 1 \mid x, q)\, x_i$$
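
A minimal sketch of the expectation above, with assumed shapes; note that structured-attention marginals come from the CRF and, unlike a softmax, need not sum to one:

    import numpy as np

    def context_vector(marginals, memory):
        """c = sum_i p(z_i = 1 | x, q) * x_i.

        marginals: (T,) attention marginals p(z_i = 1 | x, q).
        memory:    (T, d) memory bank x_1, ..., x_T.
        """
        return marginals @ memory

    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 8))            # T = 6 memory vectors of dimension 8
    p = rng.random(6)                      # stand-in marginals; in the model they come from the CRF
    print(context_vector(p, x).shape)      # (8,)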

  16. Challenge: End-to-End Training
      Requirements:
      1 Compute the attention distribution (marginals) $p(z_i \mid x, q; \theta)$ ⇒ forward-backward algorithm
      2 Gradients w.r.t. the attention distribution parameters $\theta$ ⇒ backpropagation through the forward-backward algorithm

  17. Challenge: End-to-End Training
      Requirements:
      1 Compute the attention distribution (marginals) $p(z_i \mid x, q; \theta)$ ⇒ forward-backward algorithm
      2 Gradients w.r.t. the attention distribution parameters $\theta$ ⇒ backpropagation through the forward-backward algorithm

  18. Challenge: End-to-End Training
      Requirements:
      1 Compute the attention distribution (marginals) $p(z_i \mid x, q; \theta)$ ⇒ forward-backward algorithm
      2 Gradients w.r.t. the attention distribution parameters $\theta$ ⇒ backpropagation through the forward-backward algorithm

  19. Review: Forward-Backward Algorithm
      $\theta$: input potentials (e.g. from a NN); $\alpha, \beta$: dynamic programming tables
      procedure ForwardBackward($\theta$)
        Forward: for $i = 1, \dots, n$; $z_i$ do
          $\alpha[i, z_i] \leftarrow \sum_{z_{i-1}} \alpha[i-1, z_{i-1}] \times \exp(\theta_{i-1,i}(z_{i-1}, z_i))$
        Backward: for $i = n, \dots, 1$; $z_i$ do
          $\beta[i, z_i] \leftarrow \sum_{z_{i+1}} \beta[i+1, z_{i+1}] \times \exp(\theta_{i,i+1}(z_i, z_{i+1}))$
        Marginals: for $i = 1, \dots, n$; $c \in \mathcal{C}$ do
          $p(z_i = c \mid x) \leftarrow \alpha[i, c] \times \beta[i, c] \, / \, Z$
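
A minimal NumPy sketch of this recursion, written in probability space with assumed tensor shapes (this is an illustration, not the GPU implementation from the talk):

    import numpy as np

    def forward_backward(theta):
        """Forward-backward for a linear-chain CRF with pairwise potentials.

        theta: (n-1, C, C) array, theta[i, a, b] = theta_{i,i+1}(z_i = a, z_{i+1} = b).
        Returns (n, C) marginals p(z_i = c | x).
        """
        n, C = theta.shape[0] + 1, theta.shape[1]
        alpha = np.zeros((n, C)); alpha[0] = 1.0
        beta = np.zeros((n, C));  beta[-1] = 1.0
        for i in range(1, n):                         # forward pass
            alpha[i] = alpha[i - 1] @ np.exp(theta[i - 1])
        for i in range(n - 2, -1, -1):                # backward pass
            beta[i] = np.exp(theta[i]) @ beta[i + 1]
        Z = alpha[-1].sum()                           # partition function
        return alpha * beta / Z

    # Sanity check: at every position the marginals sum to 1.
    rng = np.random.default_rng(0)
    print(forward_backward(rng.normal(size=(4, 3, 3))).sum(axis=1))   # ~[1. 1. 1. 1. 1.]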

  20. Structured Attention Networks for Neural Machine Translation

  21. Forward-Backward Algorithm in Practice (Log-Space Semiring Trick)
      $x \oplus y = \log(\exp(x) + \exp(y))$, $x \otimes y = x + y$
      procedure ForwardBackward($\theta$)
        Forward: for $i = 1, \dots, n$; $z_i$ do
          $\alpha[i, z_i] \leftarrow \bigoplus_{z_{i-1}} \alpha[i-1, z_{i-1}] \otimes \theta_{i-1,i}(z_{i-1}, z_i)$
        Backward: for $i = n, \dots, 1$; $z_i$ do
          $\beta[i, z_i] \leftarrow \bigoplus_{z_{i+1}} \beta[i+1, z_{i+1}] \otimes \theta_{i,i+1}(z_i, z_{i+1})$
        Marginals: for $i = 1, \dots, n$; $c \in \mathcal{C}$ do
          $p(z_i = c \mid x) \leftarrow \exp(\alpha[i, c] \otimes \beta[i, c] \otimes -\log Z)$
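
The same recursion with ⊕ = logsumexp and ⊗ = +, as a hedged NumPy sketch with assumed shapes (scipy's logsumexp stands in for the ⊕ reduction):

    import numpy as np
    from scipy.special import logsumexp

    def forward_backward_log(theta):
        """Log-space forward-backward: oplus = logsumexp, otimes = +.

        theta: (n-1, C, C) log-potentials. Returns (n, C) marginals p(z_i = c | x).
        """
        n, C = theta.shape[0] + 1, theta.shape[1]
        alpha = np.full((n, C), -np.inf); alpha[0] = 0.0
        beta = np.full((n, C), -np.inf);  beta[-1] = 0.0
        for i in range(1, n):                         # forward: oplus over z_{i-1}
            alpha[i] = logsumexp(alpha[i - 1][:, None] + theta[i - 1], axis=0)
        for i in range(n - 2, -1, -1):                # backward: oplus over z_{i+1}
            beta[i] = logsumexp(theta[i] + beta[i + 1][None, :], axis=1)
        log_Z = logsumexp(alpha[-1])
        return np.exp(alpha + beta - log_Z)

    # Large-magnitude potentials stay numerically stable in log space.
    rng = np.random.default_rng(0)
    print(forward_backward_log(30 * rng.normal(size=(4, 3, 3))).sum(axis=1))   # ~[1. 1. 1. 1. 1.]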

  22. Backpropagating through Forward-Backward
      $\nabla^L_p$: gradient of an arbitrary loss $L$ with respect to the marginals $p$
      procedure BackpropForwardBackward($\theta, p, \nabla^L_p$)
        Backprop Backward: for $i = n, \dots, 1$; $z_i$ do
          $\hat{\beta}[i, z_i] \leftarrow \nabla^L_{\alpha[i, z_i]} \oplus \bigoplus_{z_{i+1}} \theta_{i,i+1}(z_i, z_{i+1}) \otimes \hat{\beta}[i+1, z_{i+1}]$
        Backprop Forward: for $i = 1, \dots, n$; $z_i$ do
          $\hat{\alpha}[i, z_i] \leftarrow \nabla^L_{\beta[i, z_i]} \oplus \bigoplus_{z_{i-1}} \theta_{i-1,i}(z_{i-1}, z_i) \otimes \hat{\alpha}[i-1, z_{i-1}]$
        Potential Gradients: for $i = 1, \dots, n$; $z_i, z_{i+1}$ do
          $\nabla^L_{\theta_{i,i+1}(z_i, z_{i+1})} \leftarrow \exp\big(\hat{\alpha}[i, z_i] \otimes \beta[i+1, z_{i+1}] \,\oplus\, \alpha[i, z_i] \otimes \hat{\beta}[i+1, z_{i+1}] \,\oplus\, \alpha[i, z_i] \otimes \beta[i+1, z_{i+1}] \otimes -\log Z\big)$

  23. Interesting Issue: Negative Gradients Through Attention
      The gradient $\nabla^L_p$ can be negative, but we are working in log space!
      Signed log-space semifield trick (Li and Eisner, 2009): use tuples $(l_a, s_a)$ where $l_a = \log|a|$ and $s_a = \mathrm{sign}(a)$.
      Addition $\oplus$, assuming $l_a \ge l_b$ and writing $d = \exp(l_b - l_a)$:

      s_a   s_b   l_{a+b}              s_{a+b}
      +     +     l_a + log(1 + d)     +
      +     −     l_a + log(1 − d)     +
      −     +     l_a + log(1 − d)     −
      −     −     l_a + log(1 + d)     −

      (Similar rules for $\otimes$.)
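
A small Python sketch of the signed ⊕ and ⊗ operations under the convention above; the swap-to-larger-magnitude step and the exact-cancellation case are implementation details assumed here, not shown on the slide:

    import math

    def signed_log_add(la, sa, lb, sb):
        """oplus on (log|a|, sign(a)) pairs, as in Li and Eisner (2009)."""
        if la < lb:                                    # ensure |a| >= |b|
            la, sa, lb, sb = lb, sb, la, sa
        d = math.exp(lb - la)                          # |b| / |a|, in [0, 1]
        if sa == sb:                                   # magnitudes add, keep the sign
            return la + math.log1p(d), sa
        if d == 1.0:                                   # exact cancellation: a + b = 0
            return -math.inf, 1
        return la + math.log1p(-d), sa                 # magnitudes subtract, sign of the larger term

    def signed_log_mul(la, sa, lb, sb):
        """otimes: log-magnitudes add, signs multiply."""
        return la + lb, sa * sb

    # Example: 3 + (-2) = 1, i.e. (log 1, +).
    print(signed_log_add(math.log(3.0), 1, math.log(2.0), -1))   # ~(0.0, 1)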

  24. Structured Attention Networks for Neural Machine Translation

  25. 1 Deep Neural Networks for Text Processing and Generation
      2 Attention Networks
      3 Structured Attention Networks
          Computational Challenges
          Structured Attention In Practice
      4 Conclusion and Future Work

  26. Implementation
      http://github.com/harvardnlp/struct-attn
      General-purpose structured attention unit
      “Plug-and-play” neural network layers
      Dynamic programming is GPU-optimized for speed

  27. NLP Experiments
      Replace existing attention layers for:
      Machine Translation: Segmental Attention (2-state linear-chain CRF)
      Question Answering: Sequential Attention (N-state linear-chain CRF)
      Natural Language Inference: Syntactic Attention (graph-based dependency parser)

  28. Segmental Attention for Neural Machine Translation
      Use a segmentation CRF for attention, i.e. binary vectors $z = z_1, \dots, z_T$, with $p(z_1, \dots, z_T \mid x, q)$ parameterized by a linear-chain CRF.
      Unary potentials (from the encoder RNN):
      $$\theta_i(k) = \begin{cases} x_i W q, & k = 1 \\ 0, & k = 0 \end{cases}$$
      Pairwise potentials (simple parameters): 4 additional transition parameters over the binary states ($b_{0,0}, b_{0,1}, b_{1,0}, b_{1,1}$)
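
As a reference for what these potentials define, here is a brute-force NumPy sketch of the resulting 2-state CRF marginals; the names (W, x, q, b) and the square weight shape are illustrative assumptions, and the real model computes the same marginals with forward-backward rather than enumeration:

    import itertools
    import numpy as np

    def segmental_attention_weights(x, q, W, b):
        """Marginals p(z_i = 1 | x, q) of the 2-state CRF, by brute-force enumeration.

        x: (T, d) encoder states, q: (d,) query, W: (d, d) bilinear weights,
        b: (2, 2) transition parameters. Only feasible for small T.
        """
        T = x.shape[0]
        unary = np.stack([np.zeros(T), x @ W @ q], axis=1)      # theta_i(0) = 0, theta_i(1) = x_i W q
        scores = {z: sum(unary[i, z[i]] for i in range(T))
                     + sum(b[z[i], z[i + 1]] for i in range(T - 1))
                  for z in itertools.product((0, 1), repeat=T)}
        Z = sum(np.exp(s) for s in scores.values())
        return np.array([sum(np.exp(s) / Z for z, s in scores.items() if z[i] == 1)
                         for i in range(T)])

    rng = np.random.default_rng(0)
    p = segmental_attention_weights(rng.normal(size=(6, 4)), rng.normal(size=4),
                                    rng.normal(size=(4, 4)), rng.normal(size=(2, 2)))
    print(p)   # per-position selection probabilities; unlike softmax, they need not sum to 1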

  29. Segmental Attention for Neural Machine Translation
      Data: Japanese → English (from WAT 2015)
      Traditionally, word segmentation is a preprocessing step; here, structured attention is used to learn an implicit segmentation model.
      Experiments:
      Japanese characters → English words
      Japanese words → English words

  30. Segmental Attention for Neural Machine Translation
      BLEU scores on the test set (higher is better):

                      Simple   Sigmoid   Structured
      Char → Word     12.6     13.1      14.6
      Word → Word     14.1     13.8      14.3

      Models:
      Simple (softmax) attention: softmax($\theta_i$)
      Sigmoid attention: sigmoid($\theta_i$)
      Structured attention: ForwardBackward($\theta$)
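
For comparison, a small sketch of the two baseline normalizations applied to the same scores (illustrative values only):

    import numpy as np

    def softmax(theta):
        """Simple attention: a distribution over positions (sums to 1)."""
        e = np.exp(theta - theta.max())
        return e / e.sum()

    def sigmoid(theta):
        """Sigmoid attention: each position selected independently."""
        return 1.0 / (1.0 + np.exp(-theta))

    theta = np.array([1.5, 1.4, -2.0, 0.3])
    print(softmax(theta))    # mass concentrates on single positions
    print(sigmoid(theta))    # independent on/off scores per position
    # Structured attention instead runs ForwardBackward(theta) with transition
    # parameters, so neighbouring selections are coupled (see the sketches above).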

  31. Attention Visualization: Ground Truth

  32. Attention Visualization: Simple Attention

  33. Attention Visualization: Sigmoid Attention

  34. Attention Visualization: Structured Attention

  35. Sequential Attention over Facts for Question Answering
      Simple attention: greedy soft-selection of K supporting facts
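
A hedged sketch of this simple-attention baseline: softmax-attend over fact embeddings once per hop; the query update rule and all names here are illustrative assumptions, not the exact model from the talk:

    import numpy as np

    def greedy_fact_attention(facts, query, hops):
        """Greedy soft-selection of `hops` supporting facts with simple attention.

        facts: (N, d) fact embeddings, query: (d,) question embedding.
        Each hop softmax-attends over the facts and folds the attended summary
        back into the query (a memory-network-style update).
        """
        q = query.copy()
        attn = []
        for _ in range(hops):
            scores = facts @ q
            p = np.exp(scores - scores.max())
            p /= p.sum()
            attn.append(p)
            q = q + p @ facts                  # fold the soft-selected fact into the query
        return np.stack(attn)                  # (hops, N) attention weights per hop

    rng = np.random.default_rng(0)
    print(greedy_fact_attention(rng.normal(size=(5, 8)), rng.normal(size=8), hops=2).shape)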
