  1. Eric Mintun HEP-AI Journal Club May 15th, 2018

  2. Outline
  • Motivating example and definition
    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
  • Generalizations and a little theory
    Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
  • Why attention might be better than RNNs and CNNs
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]

  3. Translation
  [Figure: a sequence-to-sequence RNN translating the English sentence “The agreement on the European Economic Area was signed in August 1992 . <end>” into the French “L’ accord sur la zone économique européenne a été signé en août 1992 . <end>”. The encoder RNN compresses the entire source sentence into a single fixed-size context vector, which the decoder RNN then expands into the target sentence.]

  4. Translation
  • A fixed-size context vector struggles with long sentences and tends to fail later in the sentence.
  • Example: the underlined portion of a long source sentence gets mistranslated as ‘based on his state of health’.

  5. Translation w/ Attention
  [Figure: the same encoder-decoder network, but each decoder step $i$ builds its own context vector from the encoder hidden states $h_1, \ldots, h_{13}$, using weights computed from the previous decoder state $s_{i-1}$.]
  • Attention weights: $\alpha_{ji} = \alpha_{ji}(h_j, s_{i-1})$, with $\sum_j \alpha_{ji} = 1$ and $0 \le \alpha_{ji} \le 1$.
  • Context vector for decoder step $i$: $c_i = \sum_j \alpha_{ji} h_j$.

  6. Translation w/ Attention

  7. Translation w/ Attention

  8. Attention
  • Attention consists of learned key-value pairs $(K_i, V_i)$.
  • An input query $Q$ is compared against each key; a better match lets more of that value through:
    $w_i = \mathrm{compare}(Q, K_i)$, $\sum_i w_i = 1$, $\mathrm{out} = \sum_i w_i V_i$
  • Additive compare: $Q$ and $K$ are fed into a neural net.
  • Multiplicative compare: dot product of $Q$ and $K$.
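  As a concrete illustration, here is a minimal NumPy sketch of this lookup with both compare functions. The toy shapes and the parameters W_q, W_k, v are illustrative stand-ins for learned weights, not anything taken from the slides or the cited papers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()

def multiplicative_attention(Q, K, V):
    """Dot-product compare: w_i ∝ exp(Q · K_i); out = Σ_i w_i V_i."""
    scores = K @ Q                   # one score per key, shape (n,)
    w = softmax(scores)              # weights sum to 1
    return w @ V, w                  # weighted sum of the values

def additive_attention(Q, K, V, W_q, W_k, v):
    """Additive compare: score each (Q, K_i) pair with a small one-hidden-layer net."""
    scores = np.tanh(K @ W_k.T + Q @ W_q.T) @ v    # shape (n,)
    w = softmax(scores)
    return w @ V, w

# Toy usage: 5 key/value pairs of dimension 4, hidden size 8.
rng = np.random.default_rng(0)
n, d, h = 5, 4, 8
Q = rng.normal(size=d)
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
W_q, W_k, v = rng.normal(size=(h, d)), rng.normal(size=(h, d)), rng.normal(size=h)
out_mul, w_mul = multiplicative_attention(Q, K, V)
out_add, w_add = additive_attention(Q, K, V, W_q, W_k, v)
```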

  9. Keys/Values for Example
  [Figure: the translation-with-attention network, with the attention ingredients labeled.]
  • Query: the previous decoder state $s_{i-1}$.
  • Keys: the encoder hidden states $h_j$.
  • Values: the same encoder hidden states $h_j$.
  • Compare: $\alpha_{ji}(h_j, s_{i-1})$, computed with additive attention, giving $c_i = \sum_j \alpha_{ji} h_j$.
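  To make the mapping explicit, here is a short self-contained sketch of one decoder step under these choices; all shapes and the parameters W_q, W_k, v are illustrative, not taken from the papers.

```python
import numpy as np

# Mapping the translation example onto the dictionary-lookup picture.
rng = np.random.default_rng(0)
n, d, hdim = 13, 8, 16
H = rng.normal(size=(n, d))        # keys = values = encoder states h_1..h_n
s_prev = rng.normal(size=d)        # query = previous decoder state s_{i-1}
W_q = rng.normal(size=(hdim, d))
W_k = rng.normal(size=(hdim, d))
v = rng.normal(size=hdim)

scores = np.tanh(H @ W_k.T + s_prev @ W_q.T) @ v   # additive compare of (h_j, s_{i-1})
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                               # α_ji ≥ 0, Σ_j α_ji = 1
c_i = alpha @ H                                    # c_i = Σ_j α_ji h_j for decoder step i
```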

  10. Outline
  • Motivating example and definition
    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
  • Generalizations and a little theory
    Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
  • Why attention might be better than RNNs and CNNs
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]

  11. Structured Attention
  • What if we know the trained attention should have a particular structure? E.g.:
  • Each output of the decoder should attend to a connected subsequence of the encoder input (character-to-word conversion).
  • The output sequence is organized as a tree (sentence parsing, equation input and output).

  12. Structured Attention
  • The attention weights $\alpha_i(k, q)$ define a probability distribution $p(z = i \mid x, q)$ over $z \in \{1, \ldots, n\}$. With annotation function $f(x, z) = x_z$, the context vector is
    $c = \mathbb{E}_{z \sim p(z \mid x, q)}[f(x, z)] = \sum_{i=1}^{n} p(z = i \mid x, q)\, x_i$
  • Generalize this by adding more latent variables and changing the annotation function. Add structure by dividing the latent variables into cliques $C$:
    $c = \mathbb{E}_{z \sim p(z \mid x, q)}[f(x, z)] = \sum_C \mathbb{E}_{z_C \sim p(z_C \mid x, q)}[f_C(x, z_C)]$
    $p(z \mid x, q; \theta) = \mathrm{softmax}\Big(\sum_C \theta_C(z_C)\Big)$
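  A small sketch of the unstructured base case, showing that the expectation over the latent variable is exactly the usual attention-weighted sum; the scores θ_i here are random stand-ins for the compare scores α_i(k, q).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Simple (unstructured) case: a single categorical latent variable z ∈ {1..n},
# annotation function f(x, z) = x_z, and scores θ_i standing in for α_i(k, q).
rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))        # annotations x_1..x_n
theta = rng.normal(size=n)         # hypothetical compare scores
p = softmax(theta)                 # p(z = i | x, q)

# Context vector as an expectation: c = E_{z~p}[f(x, z)] = Σ_i p(z = i) x_i.
c_expectation = sum(p[i] * x[i] for i in range(n))
c_weighted_sum = p @ x             # the familiar attention-weighted sum
assert np.allclose(c_expectation, c_weighted_sum)
```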

  13. Subsequence Attention
  • (a) The original unstructured attention network.
  • (b) One independent binary latent variable $z_i \in \{0, 1\}$ per input, with $f(x, z) = \sum_{i=1}^{n} \mathbb{1}\{z_i = 1\}\, x_i$:
    $c = \mathbb{E}_{z_1, \ldots, z_n}[f(x, z)] = \sum_{i=1}^{n} p(z_i = 1 \mid x, q)\, x_i$, where $p(z_i = 1 \mid x, q) = \mathrm{sigmoid}(\theta_i)$
  • (c) The probability of each $z_i$ depends on its neighbors:
    $p(z_1, \ldots, z_n) = \mathrm{softmax}\Big(\sum_{i=1}^{n-1} \theta_{i,i+1}(z_i, z_{i+1})\Big)$
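  A minimal sketch of variant (b), assuming hypothetical per-position scores θ_i in place of anything learned from (x, q):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Variant (b): n independent binary latent variables z_i ∈ {0, 1}, one per input.
# The context vector is the expectation of f(x, z) = Σ_i 1{z_i = 1} x_i.
rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))        # annotations x_1..x_n
theta = rng.normal(size=n)         # hypothetical per-position scores θ_i
p = sigmoid(theta)                 # p(z_i = 1 | x, q)

c = p @ x                          # c = Σ_i p(z_i = 1 | x, q) x_i
# Unlike softmax attention, the p_i need not sum to 1, so a whole stretch of
# adjacent positions can all receive weight close to 1.
```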

  14. Subsequence Attention
  [Figure: attention maps produced by models (a), (b), and (c), compared against the ground-truth alignment.]

  15. Tree Attention
  • Task: [figure].
  • Latent variables: $z_{ij} = 1$ if symbol $i$ has parent $j$:
    $p(z \mid x, q) = \mathrm{softmax}\Big(\mathbb{1}\{z \text{ is valid}\} \sum_{i \neq j} \mathbb{1}\{z_{ij} = 1\}\, \theta_{ij}\Big)$
  • A context vector per symbol attends to its parent in the tree:
    $c_j = \sum_{i=1}^{n} p(z_{ij} = 1 \mid x, q)\, x_i$
  • There is no input query in this case, since a symbol’s parent doesn’t depend on the decoder’s location.
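  A deliberately simplified sketch of the idea: each symbol gets a soft parent assignment from pairwise scores θ_ij (random stand-ins here). This version ignores the global constraint that the z_ij jointly form a valid tree, which the paper enforces when computing the marginals, so it is only meant to show the shape of the computation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# θ_ij scores the hypothesis "symbol i has parent j". Each symbol's parent is
# treated as an independent categorical choice, ignoring the tree-validity
# constraint (the paper computes the exact constrained marginals instead).
rng = np.random.default_rng(0)
n, d = 5, 4
x = rng.normal(size=(n, d))                # symbol annotations x_1..x_n
theta = rng.normal(size=(n, n))            # hypothetical pairwise scores θ_ij
np.fill_diagonal(theta, -np.inf)           # a symbol cannot be its own parent

p_parent = softmax(theta, axis=1)          # row i: soft parent assignment for symbol i
c = p_parent @ x                           # one context vector per symbol, from its (soft) parent
```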

  16. Tree Attention
  [Figure: results comparing simple attention with structured (tree) attention.]

  17. Outline
  • Motivating example and definition
    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
  • Generalizations and a little theory
    Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
  • Why attention might be better than RNNs and CNNs
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]

  18. Attention Is All You Need
  • Can we replace CNNs and RNNs with attention for sequential tasks?
  • Self-attention: the sequence itself provides the query, keys, and values.
  • Stack attention layers: the output of an attention layer is a sequence, which is fed into the next layer.
  • Attention loses positional information; it must be inserted as an additional input.
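  A minimal sketch of the core self-attention operation. The real Transformer layers also apply learned projections to form queries, keys, and values; those are stripped out here to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Minimal self-attention: the sequence X (n x d) serves as query, key, and value.
    Each output position is an attention-weighted average over all positions."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)      # scaled dot-product compare, shape (n, n)
    A = softmax(scores, axis=-1)       # row i: attention of position i over all j
    return A @ X                       # output sequence, same shape (n, d)

# Because the output is again a length-n sequence, layers can be stacked.
X = np.random.default_rng(0).normal(size=(7, 16))
Y = self_attention(self_attention(X))
```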

  19. Attention Is All You Need
  [Figure: the Transformer encoder/decoder architecture, with the following annotations:]
  • Outputs probabilities for just the next word.
  • Regular attention: keys and values come from the encoder, the query from the decoder.
  • All linear layers are applied per position with weight sharing.
  • Stacked a fixed number of times N.
  • Masked to prevent attending to words that were written later.
  • Self-attention: keys, values, and queries all come from the previous layer.
  • Pointwise add sinusoids of different frequencies to the input features.
  • Input (encoder side): the entire sequence, of size n × d_model.
  • Input (decoder side): the sequence generated so far.
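  The "pointwise add sinusoids" step is the sinusoidal positional encoding described in the paper. Here is a small NumPy sketch of those features; the function name and toy sizes are mine, and it assumes an even d_model.

```python
import numpy as np

def sinusoidal_positions(n, d_model):
    """Sinusoidal positional features as defined in the paper:
    PE[pos, 2i] = sin(pos / 10000**(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n)[:, None]                       # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even feature indices
    angles = pos / np.power(10000.0, i / d_model)     # (n, d_model/2)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Usage: add positional information pointwise to the (n x d_model) input features.
n, d_model = 10, 16
X = np.random.default_rng(0).normal(size=(n, d_model))
X_with_positions = X + sinusoidal_positions(n, d_model)
```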

  20. Multi-Head Attention
  [Figure: the multi-head attention block, with the following annotations:]
  • Learn linear projections of the queries, keys, and values into h separate vectors of size d_model/h.
  • Run h separate multiplicative attention steps.
  • Scale (divide) each dot product by (d_model/h)^{1/2}.
  • After the concatenation, the dimension is d_model again.
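  A self-contained NumPy sketch of this block for the self-attention case; the projection matrices are random stand-ins for learned parameters and the shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Multi-head scaled dot-product self-attention.
    X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model); h heads of size d_model/h."""
    n, d_model = X.shape
    d_k = d_model // h
    # Project, then split the last dimension into h heads.
    Q = (X @ W_q).reshape(n, h, d_k).transpose(1, 0, 2)    # (h, n, d_k)
    K = (X @ W_k).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (h, n, n), scaled dot products
    A = softmax(scores, axis=-1)
    heads = A @ V                                          # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concat heads: back to (n, d_model)
    return concat @ W_o

# Toy usage with random (untrained) projection matrices.
rng = np.random.default_rng(0)
n, d_model, h = 7, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, h)         # shape (7, 16)
```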

  21. Self Attention
  • Why? Self-attention improves long-range correlations and parallelization, and sometimes computational complexity.
  [Table from the paper comparing layer types. Notation: n = sequence length, d = representation length, k = kernel size, r = restriction (neighborhood) size.]
  • RNNs and CNNs need a d × d matrix of weights per step; attention only needs a length-d dot product.
  • The convolutional path-length entry assumes dilated convolutions; otherwise it is O(n/k).
  • With self-attention, the whole sequence attends to every position.

  22. Attention is All You Need

  23. Other Cool Things
  • Image captioning: like translation, but replace the encoder with a CNN. You can see where the network is ‘looking’.
    Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. “Show, attend and tell: neural image caption generation with visual attention.” In International Conference on Machine Learning, 2015. arXiv:1502.03044 [cs.LG]
  • Hard attention: sample from the probability distribution instead of taking the expectation value. This is no longer differentiable, so train it as an RL algorithm where choosing the attention target is an action.
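  A tiny sketch of the soft vs. hard distinction; the distribution and annotations are random placeholders, and the training comment describes the general reinforcement-learning approach rather than any specific paper's recipe.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))            # annotations (e.g. CNN features for image regions)
p = softmax(rng.normal(size=n))        # attention distribution p(z = i | x, q)

# Soft attention: the context is the expectation, and gradients flow through p.
c_soft = p @ x

# Hard attention: sample one position and use its annotation directly.
# The sampling step is not differentiable, so the choice of attention target
# is treated as an action and trained with reinforcement-learning-style gradients.
z = rng.choice(n, p=p)
c_hard = x[z]
```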

  24. Summary
  • Attention is an architecture-level construct for sequence analysis.
  • It is essentially a learned, differentiable dictionary look-up.
  • More generally, it is an input-dependent, learned probability distribution over latent variables that annotate the output values.
  • It gives better long-range correlations and parallelization than RNNs, and is often less computationally complex.
  • It produces human-interpretable intermediate data.
