Eric Mintun, HEP-AI Journal Club, May 15th, 2018
Outline
• Motivating example and definition
  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
• Generalizations and a little theory
  Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. "Structured attention networks." In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
• Why attention might be better than RNNs and CNNs
  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. "Attention is all you need." In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]
Translation
[Figure: encoder–decoder RNN translation without attention. An encoder RNN reads the English sentence "The agreement on the European Economic Area was signed in August 1992 . <end>" and compresses it into a single fixed-size context vector c, which a decoder RNN then unrolls into the French output "L' accord sur la zone économique européenne a été signé en août 1992 . <end>".]
Translation
• A fixed-size context vector struggles with long sentences: translation quality degrades toward the end of the sentence.
[Figure: example translation of a long sentence without attention; the underlined portion of the output degrades into 'based on his state of health'.]
Translation w/ Attention
[Figure: the same encoder–decoder, now with attention. At each decoder step i, the encoder states h_1, ..., h_13 are combined into a step-specific context vector, which is fed to the decoder together with the previous decoder state s_{i-1}.]
Each weight compares an encoder state with the previous decoder state, $\alpha_{ji} = \alpha(h_j, s_{i-1})$, and the context vector is
$$c_i = \sum_j \alpha_{ji} h_j, \qquad \sum_j \alpha_{ji} = 1, \qquad 0 \le \alpha_{ji} \le 1.$$
Translation w/ Attention
[Figure slides: attention-based translation results; the visualizations are not recoverable from the extracted text.]
Attention
• Attention consists of learned key-value pairs.
• An input query is compared against each key; a better match lets more of the corresponding value through:
$$\mathrm{out} = \sum_i w_i V_i, \qquad w_i = \mathrm{compare}(Q, K_i), \qquad \sum_i w_i = 1.$$
• Additive compare: Q and K_i are fed into a small neural net.
• Multiplicative compare: dot product of Q and K_i.
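A minimal numpy sketch of this compare-and-mix picture, using the multiplicative compare with a softmax normalization; all shapes and names are illustrative rather than taken from any particular paper's implementation:

```python
# Attention as a soft dictionary lookup (illustrative sketch, numpy only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    """query: (d,), keys: (n, d), values: (n, d_v) -> context: (d_v,)."""
    scores = keys @ query            # multiplicative compare: dot products
    weights = softmax(scores)        # w_i >= 0 and sum_i w_i = 1
    return weights @ values          # out = sum_i w_i * V_i

# Example: 5 key/value pairs of dimension 4
keys = np.random.randn(5, 4)
values = np.random.randn(5, 4)
query = np.random.randn(4)
context = attend(query, keys, values)
```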
Keys/Values for Example
• Query: the previous decoder state $s_{i-1}$
• Keys: the encoder hidden states $h_j$
• Values: the same encoder hidden states $h_j$
• Compare: $\alpha_{ji} = \alpha(h_j, s_{i-1})$, computed by a small neural net (additive attention)
• Context: $c_i = \sum_j \alpha_{ji} h_j$, fed to the decoder alongside $s_{i-1}$
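A hedged sketch of the additive compare in this example: a one-hidden-layer net scores each encoder state $h_j$ against the previous decoder state $s_{i-1}$. The weight matrices W_s, W_h and the vector v are assumed, illustrative parameters:

```python
# Additive (Bahdanau-style) attention sketch: score each encoder state against
# the decoder state with a tiny neural net, then form the context vector.
import numpy as np

def additive_attention(s_prev, H, W_s, W_h, v):
    """s_prev: (d_s,) decoder state, H: (n, d_h) encoder states -> c_i: (d_h,)."""
    e = np.tanh(s_prev @ W_s + H @ W_h) @ v   # e_j = v^T tanh(W_s s_{i-1} + W_h h_j)
    e = np.exp(e - e.max())
    alpha = e / e.sum()                       # alpha_{ji}: sums to 1 over j
    return alpha @ H, alpha                   # c_i = sum_j alpha_{ji} h_j

d_s, d_h, d_a, n = 6, 4, 8, 13
W_s = np.random.randn(d_s, d_a)
W_h = np.random.randn(d_h, d_a)
v = np.random.randn(d_a)
c_i, alpha = additive_attention(np.random.randn(d_s), np.random.randn(n, d_h), W_s, W_h, v)
```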
Outline
• Motivating example and definition (Bahdanau, Cho, and Bengio, 2015)
• Generalizations and a little theory (Kim, Denton, Hoang, and Rush, 2017)
• Why attention might be better than RNNs and CNNs (Vaswani et al., 2017)
Structured Attention
• What if we know the trained attention should have a particular structure? E.g.:
• Each output of the decoder should attend to a connected subsequence of the encoder input (character-to-word conversion).
• The output sequence is organized as a tree (sentence parsing, equation input and output).
Structured Attention
• The attention weights $\alpha_i$ define a probability distribution, $p(z = i \mid x, q) = \alpha_i(k, q)$ with $z \in \{1, \ldots, n\}$. With annotation function $f(x, z) = x_z$, the context vector can be written as an expectation:
$$c = \mathbb{E}_{z \sim p(z \mid x, q)}[f(x, z)] = \sum_{i=1}^{n} p(z = i \mid x, q)\, x_i$$
• Generalize this by adding more latent variables and changing the annotation function. Add structure by dividing the latent variables into cliques $C$:
$$c = \mathbb{E}_{z \sim p(z \mid x, q)}[f(x, z)] = \sum_C \mathbb{E}_{z_C \sim p(z_C \mid x, q)}[f_C(x, z_C)], \qquad p(z \mid x, q; \theta) = \mathrm{softmax}\!\left(\sum_C \theta_C(z_C)\right)$$
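As a sanity check, a small sketch showing that a single categorical latent variable with $f(x, z) = x_z$ reproduces ordinary soft attention; the score vector theta is assumed to come from some compare of the keys against the query:

```python
# Expectation view of attention: E_{z~p(z|x,q)}[f(x,z)] with categorical z.
import numpy as np

def expected_annotation(x, theta):
    """x: (n, d) annotations, theta: (n,) unnormalized scores -> context: (d,)."""
    p = np.exp(theta - theta.max())
    p = p / p.sum()                    # p(z = i | x, q) = softmax(theta)_i
    return p @ x                       # c = sum_i p(z = i | x, q) x_i

x = np.random.randn(7, 4)
theta = np.random.randn(7)             # illustrative: would come from key/query compare
c = expected_annotation(x, theta)
```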
Subsequence Attention
• (a) The original, unstructured attention network.
• (b) One independent binary latent variable $z_i \in \{0, 1\}$ per input position, with $f(x, z) = \sum_i \mathbb{1}\{z_i = 1\}\, x_i$ (see the sketch below):
$$c = \mathbb{E}_{z_1, \ldots, z_n}[f(x, z)] = \sum_{i=1}^{n} p(z_i = 1 \mid x, q)\, x_i, \qquad p(z_i = 1 \mid x, q) = \mathrm{sigmoid}(\theta_i)$$
• (c) The probability of each $z_i$ depends on its neighbors:
$$p(z_1, \ldots, z_n) = \mathrm{softmax}\!\left(\sum_{i=1}^{n-1} \theta_{i, i+1}(z_i, z_{i+1})\right)$$
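A minimal sketch of case (b), the independent binary variables; case (c) is only noted in a comment because its marginals require a forward–backward pass. The scores theta are illustrative:

```python
# Case (b): one independent Bernoulli z_i per input position, so the context is
# a sigmoid-gated (not softmax-normalized) sum of annotations.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def binary_attention(x, theta):
    """x: (n, d), theta: (n,) -> c = sum_i p(z_i = 1 | x, q) x_i."""
    p = sigmoid(theta)                 # each p(z_i = 1 | x, q) lies in (0, 1)
    return p @ x

x = np.random.randn(7, 4)
theta = np.random.randn(7)
c = binary_attention(x, theta)
# Case (c) couples neighbouring z_i through pairwise potentials theta_{i,i+1};
# the marginals p(z_i = 1 | x, q) are then obtained with forward-backward.
```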
Subsequence Attention
[Figure: attention maps for (a) unstructured, (b) independent binary, and (c) neighbor-coupled attention, compared against the ground truth alignment.]
Tree Attention
• Task: [figure of the example task omitted]
• Latent variables $z_{ij} = 1$ if symbol $i$ has parent $j$:
$$p(z \mid x, q) = \mathrm{softmax}\!\left(\mathbb{1}\{z \text{ is a valid tree}\} \sum_{i \ne j} \mathbb{1}\{z_{ij} = 1\}\, \theta_{ij}\right)$$
• A context vector per symbol attends to that symbol's parent in the tree (sketch below):
$$c_j = \sum_{i=1}^{n} p(z_{ij} = 1 \mid x, q)\, x_i$$
• There is no input query in this case, since a symbol's parent does not depend on the decoder's location.
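A sketch of the per-symbol context, assuming the parent marginals $p(z_{ij} = 1 \mid x)$ have already been computed; in practice this requires summing over valid trees, which is omitted here, and the marginal matrix P below is just a placeholder:

```python
# Given marginals P[i, j] = p(z_ij = 1 | x), each symbol's context attends to
# its likely parents. Computing P itself (sum over valid trees) is not shown.
import numpy as np

def tree_contexts(x, P):
    """x: (n, d) symbol annotations, P: (n, n) parent marginals -> C: (n, d)."""
    # c_j = sum_i p(z_ij = 1 | x) x_i  for every symbol j
    return P.T @ x

n, d = 6, 4
x = np.random.randn(n, d)
P = np.abs(np.random.randn(n, n))
P /= P.sum(axis=0, keepdims=True)      # placeholder marginals, not real tree marginals
C = tree_contexts(x, P)
```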
Tree Attention
[Figure: resulting attention, simple vs. structured.]
Outline
• Motivating example and definition (Bahdanau, Cho, and Bengio, 2015)
• Generalizations and a little theory (Kim, Denton, Hoang, and Rush, 2017)
• Why attention might be better than RNNs and CNNs (Vaswani et al., 2017)
Attention Is All You Need
• Can we replace CNNs and RNNs with attention for sequential tasks?
• Self-attention: the sequence itself supplies the query, keys, and values (sketch below).
• Stack attention layers: the output of an attention layer is again a sequence, which is fed into the next layer.
• Attention loses positional information; it must be inserted as an additional input.
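A minimal sketch of one self-attention layer, where the same sequence supplies queries, keys, and values through illustrative projection matrices; the output is again a sequence of the same length, so layers can be stacked:

```python
# Self-attention sketch: one layer mapping a sequence (n, d) to a sequence (n, d).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) input sequence -> (n, d) output sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every position attends to every other
    return softmax(scores, axis=-1) @ V       # output is a sequence: stackable

n, d = 10, 16
X = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Y = self_attention(X, Wq, Wk, Wv)             # feed Y into the next attention layer
```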
Attention Is All You Need
[Figure: the Transformer encoder–decoder architecture, annotated:]
• Outputs probabilities for just the next word.
• Regular attention: keys and values from the encoder, query from the decoder.
• All linear layers are applied per position, with weight sharing.
• The blocks are stacked a fixed number N of times.
• Decoder self-attention is masked to prevent attending to words that were written later.
• Self-attention: keys, values, and queries all come from the previous layer.
• Sinusoids of different frequencies are added pointwise to the input features (sketch below).
• Encoder input: the entire sequence, of size n × d_model.
• Decoder input: the sequence generated so far.
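A sketch of the sinusoidal positional signal mentioned above, following the sin/cos form from the paper; the shapes are illustrative:

```python
# Sinusoidal positional encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
# PE(pos, 2i+1) = cos(...), added pointwise to the (n, d_model) inputs.
import numpy as np

def positional_encoding(n, d_model):
    pos = np.arange(n)[:, None]                          # (n, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)     # (n, d_model/2)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions: cosine
    return pe

X = np.random.randn(10, 512)
X = X + positional_encoding(10, 512)                      # inject position information
```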
Multi-Head Attention
• Learn linear projections into h separate vectors of size d_model/h.
• Run h separate multiplicative attention steps, scaling each dot product by $1/\sqrt{d_{\text{model}}/h}$.
• After concatenating the heads, the dimension is d_model again.
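A hedged numpy sketch of multi-head attention as described above; the projection matrices are illustrative, and the dot products are scaled by $1/\sqrt{d_{\text{model}}/h}$:

```python
# Multi-head attention sketch: h heads of size d_k = d_model/h, then concat + project.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, h, Wq, Wk, Wv, Wo):
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                       # (n, d_model) each
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)   # -> (h, n, d_k)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, n, n), scaled
    heads = softmax(scores, axis=-1) @ Vh                  # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # back to (n, d_model)
    return concat @ Wo                                     # final linear projection

n, d_model, h = 10, 512, 8
X = np.random.randn(n, d_model)
Wq, Wk, Wv, Wo = (np.random.randn(d_model, d_model) for _ in range(4))
Y = multi_head_attention(X, h, Wq, Wk, Wv, Wo)
```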
Self Attention
• Why? Self-attention improves long-range correlations and parallelization, and sometimes reduces complexity (n: sequence length, d: representation dimension, k: kernel size, r: restriction size):

| Layer type | Complexity per layer | Sequential operations | Maximum path length |
| Self-attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k n) |
| Self-attention (restricted) | O(r · n · d) | O(1) | O(n/r) |

• RNNs and CNNs apply a d × d weight matrix per position; attention only needs a length-d dot product.
• The whole sequence attends to every position, so the maximum path length is O(1).
• The convolutional path length is O(log_k n) using dilated convolutions, otherwise O(n/k).
Attention Is All You Need
[Results slide: not recoverable from the extracted text.]
Other Cool Things
• Image captioning: like translation, but replace the encoder with a CNN. You can see where the network is 'looking'.
  Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio. "Show, attend and tell: neural image caption generation with visual attention." In International Conference on Machine Learning, 2015. arXiv:1502.03044 [cs.LG]
• Hard attention: sample from the probability distribution instead of taking the expectation value. This is no longer differentiable, so train it as a reinforcement learning problem in which choosing the attention target is an action (sketch below).
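A tiny sketch of the soft vs. hard distinction: soft attention takes the expectation over positions, while hard attention samples a single position (and is therefore trained with REINFORCE-style methods in practice); all names here are illustrative:

```python
# Soft vs. hard attention given an attention distribution p over n positions.
import numpy as np

def soft_context(x, p):
    return p @ x                                  # expectation over positions

def hard_context(x, p, rng=np.random.default_rng()):
    i = rng.choice(len(p), p=p)                   # sample a single position
    return x[i]                                   # non-differentiable w.r.t. p

x = np.random.randn(5, 4)
p = np.full(5, 0.2)                               # attention distribution over positions
c_soft, c_hard = soft_context(x, p), hard_context(x, p)
```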
Summary
• Attention is an architecture-level construct for sequence analysis.
• It is essentially a learned, differentiable dictionary lookup.
• More generally, it is an input-dependent, learned probability distribution over latent variables that annotate the output values.
• It offers better long-range correlations and parallelization than RNNs, and is often less complex.
• It produces human-interpretable intermediate data.