1. Alternative Architectures
Philipp Koehn
15 October 2020

2. Alternative Architectures
• We introduced one translation model
  – attentional seq2seq model
  – core organizing feature: recurrent neural networks
• Other core neural architectures
  – convolutional neural networks
  – attention
• But first: look at various components of neural architectures

3. components

4. Components of Neural Networks
• Neural networks originally inspired by the brain
  – a neuron receives signals from other neurons
  – if sufficiently activated, it sends signals
  – feed-forward layers are roughly based on this
• Computation graph
  – any function possible, as long as it is partially differentiable
  – not limited by appeals to biological validity
• Deep learning may be a better name

5. Feed-Forward Layer
• Classic neural network component
• Given an input vector x: multiplication with a matrix M, adding a bias vector b
    Mx + b
• Adding a non-linear activation function
    y = activation(Mx + b)
• Notation
    y = FF_activation(x) = a(Mx + b)
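As a concrete illustration, here is a minimal NumPy sketch of such a layer (the function name, toy dimensions, and the choice of tanh as activation are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def feed_forward(x, M, b, activation=np.tanh):
    """Feed-forward layer: y = activation(Mx + b)."""
    return activation(M @ x + b)

# Toy dimensions: map a 4-dimensional input to a 3-dimensional output.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
M = rng.normal(size=(3, 4))   # weight matrix
b = np.zeros(3)               # bias vector
y = feed_forward(x, M, b)
print(y.shape)                # (3,)
```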

6. Feed-Forward Layer
• Historic neural network designs: several feed-forward layers
  – input layer
  – hidden layers
  – output layer
• Powerful tools for a wide range of machine learning problems
• Matrix multiplication with bias also called an affine transform
  – appeals to its geometrical properties
  – straight lines in the input remain straight lines in the output

7. Factored Decomposition
• One challenge: very large input and output vectors
• Number of parameters in matrix M: |x| × |y|
⇒ Need to reduce size of matrix
• Solution: first reduce to a smaller representation
  [figure: direct mapping y = Mx vs. factored mapping from x through a small vector v to y]

8. Factored Decomposition: Math
  [figure: direct mapping y = Mx vs. factored mapping v = Ax, y = Bv]
• Intuition
  – given a high-dimensional vector x
  – first map it into a lower-dimensional vector v (matrix A)
  – then map to the output vector y (matrix B)
    v = Ax
    y = Bv = BAx
• Example
  – |x| = 20,000, |y| = 50,000 → M has 20,000 × 50,000 = 1,000,000,000 parameters
  – |v| = 100 → A has 20,000 × 100 = 2,000,000, B has 100 × 50,000 = 5,000,000
  – reduction from 1,000,000,000 to 7,000,000 parameters
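A small sketch of the parameter arithmetic and the two-step mapping, assuming NumPy; the slide's sizes are only used for the parameter counts, while the actual computation uses toy dimensions:

```python
import numpy as np

# Parameter counts from the slide's example.
dim_x, dim_v, dim_y = 20_000, 100, 50_000
print(dim_x * dim_y)                    # direct M: 1,000,000,000
print(dim_x * dim_v + dim_v * dim_y)    # factored A + B: 7,000,000

# The factored computation itself, on toy sizes:
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 50))    # maps x (50-dim) to bottleneck v (8-dim)
B = rng.normal(size=(20, 8))    # maps v to output y (20-dim)
x = rng.normal(size=50)
v = A @ x                       # v = Ax
y = B @ v                       # y = Bv = BAx, without ever forming the full matrix M
```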

9. Factored Decomposition: Interpretation
• Vector v is a bottleneck feature
• Forced to capture salient features
• One example: word embeddings

10. basic mathematical operations

11. Concatenation
• Often multiple input vectors to a processing step
• For instance, a recurrent neural network receives
  – the input word
  – the previous state
• Combined in a feed-forward layer
    y = activation(M_1 x_1 + M_2 x_2 + b)
• Another view
    x = concat(x_1, x_2)
    y = activation(Mx + b)
• Splitting hairs here, but concatenation is useful generally
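The equivalence of the two views can be checked directly. In this NumPy sketch (toy dimensions assumed), M is the horizontal concatenation of M_1 and M_2:

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=4), rng.normal(size=3)
M1, M2 = rng.normal(size=(5, 4)), rng.normal(size=(5, 3))
b = rng.normal(size=5)

# Two separate matrix multiplications ...
y_sum = np.tanh(M1 @ x1 + M2 @ x2 + b)

# ... equal one multiplication with the concatenated input,
# provided M concatenates M1 and M2 along the columns.
x = np.concatenate([x1, x2])
M = np.concatenate([M1, M2], axis=1)
y_concat = np.tanh(M @ x + b)

print(np.allclose(y_sum, y_concat))  # True
```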

12. Addition
• Adding vectors: very simplistic, but often done
• Example: compute a sentence embedding s from word embeddings w_1, ..., w_n
    s = Σ_{i=1}^{n} w_i
• Reduces a varying-length sentence representation into a fixed-size vector
• Maybe weight the words, e.g., by attention
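A minimal sketch of summing (and optionally weighting) word embeddings into a sentence vector, with made-up dimensions and weights:

```python
import numpy as np

rng = np.random.default_rng(0)
# Five word embeddings of dimension 8, one per row.
W = rng.normal(size=(5, 8))

# Unweighted sum: fixed-size sentence vector regardless of sentence length.
s = W.sum(axis=0)

# Weighted variant, e.g. with attention weights that sum to 1.
weights = np.array([0.1, 0.4, 0.2, 0.2, 0.1])
s_weighted = weights @ W
print(s.shape, s_weighted.shape)  # (8,) (8,)
```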

13. Multiplication
• Another elementary mathematical operation
• Three ways to multiply vectors
  – element-wise multiplication
      v ⊙ u = (v_1 × u_1, v_2 × u_2)
  – dot product
      v · u = v^T u = v_1 × u_1 + v_2 × u_2
    used for a simple version of the attention mechanism
  – third possibility: the outer product v u^T, not commonly done
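The three products in NumPy, on two-dimensional toy vectors:

```python
import numpy as np

v = np.array([1.0, 2.0])
u = np.array([3.0, 4.0])

elementwise = v * u        # v ⊙ u = [3., 8.]
dot = v @ u                # v·u = v^T u = 1*3 + 2*4 = 11.0
outer = np.outer(v, u)     # v u^T, a 2x2 matrix; rarely used in this context

print(elementwise, dot)
print(outer)
```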

14. Maximum
• Goal: reduce the dimensionality of a representation
• Example: detect if a face is in an image
  – any region of the image may have a positive match
  – represent the different regions with elements of a vector
  – maximum value: did any region contain a face?
• Max pooling
  – given: n-dimensional vector
  – goal: reduce to an n/k-dimensional vector
  – method: break up the vector into blocks of k elements, map each block to a single value
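A possible max-pooling helper in NumPy, assuming for simplicity that the vector length is divisible by k:

```python
import numpy as np

def max_pool(x, k):
    """Reduce an n-dimensional vector to n/k values by taking the
    maximum within each block of k consecutive elements."""
    assert len(x) % k == 0
    return x.reshape(-1, k).max(axis=1)

x = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.4, 0.0, 0.5])
print(max_pool(x, k=2))  # [0.9 0.3 0.8 0.5]
```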

15. Max Out
• Maxout
  – first branch out into multiple feed-forward layers
      W_1 x + b_1
      W_2 x + b_2
  – element-wise maximum
      maxout(x) = max(W_1 x + b_1, W_2 x + b_2)
• ReLU activation is a maxout layer: maximum of a feed-forward layer and 0
    ReLU(x) = max(Wx + b, 0)
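A sketch of maxout, and of ReLU viewed as a maxout with one branch fixed at zero (weights and dimensions are illustrative):

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    """Element-wise maximum of two feed-forward branches."""
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

def relu_layer(x, W, b):
    """ReLU as a maxout with the second branch fixed to zero."""
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
b1, b2 = np.zeros(3), np.zeros(3)
print(maxout(x, W1, b1, W2, b2))
print(relu_layer(x, W1, b1))
```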

16. processing sequences

17. Recurrent Neural Networks
• Already described recurrent neural networks at length
  – propagate a state s
  – over time steps t
  – receiving an input x_t at each turn
      s_t = f(s_{t-1}, x_t)
    (the state may be computed with a feed-forward layer)
• More successful
  – gated recurrent units (GRU)
  – long short-term memory cells (LSTM)
• Good fit for sequences, like words in a sentence
  – humans also receive text word by word
  – most recent words most relevant → closer to current state
• But computationally problematic: very long computation chains
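A bare-bones sketch of the recurrence s_t = f(s_{t-1}, x_t) with a feed-forward-style update (tanh and the small dimensions are assumptions; real systems would use GRU or LSTM cells):

```python
import numpy as np

def rnn_step(s_prev, x_t, W_s, W_x, b):
    """One step s_t = f(s_{t-1}, x_t), with f a feed-forward layer."""
    return np.tanh(W_s @ s_prev + W_x @ x_t + b)

rng = np.random.default_rng(0)
state_dim, input_dim, n_steps = 6, 4, 5
W_s = rng.normal(size=(state_dim, state_dim)) * 0.1
W_x = rng.normal(size=(state_dim, input_dim)) * 0.1
b = np.zeros(state_dim)

s = np.zeros(state_dim)                  # initial state
for t in range(n_steps):                 # one input word per time step
    x_t = rng.normal(size=input_dim)
    s = rnn_step(s, x_t, W_s, W_x, b)    # long chains of such steps are the bottleneck
print(s.shape)                           # (6,)
```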

18. Alternative Sequence Processing
• Convolutional neural networks
• Attention

19. convolutional neural networks

20. Convolutional Neural Networks (CNN)
• Popular in image processing
• Regions of an image are reduced into increasingly smaller representations
  – a matrix spanning part of the image is reduced to a single value
  – overlapping regions

21. CNNs for Language
  [figure: word embeddings combined by successive layers of feed-forward units into a single vector]
• Map words into a fixed-size sentence representation

22. Hierarchical Structure and Language
• Syntactic and semantic theories of language
  – language is recursive
  – central: the verb
  – dependents: subject, objects, adjuncts
  – their dependents: adjectives, determiners
  – also nested: relative clauses
• How to compute sentence embeddings is an active research topic

23. Convolutional Neural Networks
• Key step
  – take a high-dimensional input representation
  – map it to a lower-dimensional representation
• Several repetitions of this step
• Examples
  – map a 50 × 50 pixel area into a scalar value
  – combine 3 or more neighboring words into a single vector
• Machine translation
  – encode the input sentence into a single vector
  – decode this vector into a sentence in the output language
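A rough sketch of the word-combining step: a 1D convolution that maps each window of 3 neighboring word embeddings to one higher-level vector. The helper name, filter shape, and tanh activation are assumptions for illustration, not the slides' definition:

```python
import numpy as np

def conv1d_words(E, F, b, activation=np.tanh):
    """Combine each window of k neighboring word vectors into one output vector.
    E: (n_words, d_in) word embeddings, F: (d_out, k * d_in) filter, b: (d_out,)."""
    n, d_in = E.shape
    k = F.shape[1] // d_in
    windows = [E[i:i + k].reshape(-1) for i in range(n - k + 1)]
    return np.stack([activation(F @ w + b) for w in windows])

rng = np.random.default_rng(0)
E = rng.normal(size=(7, 8))               # 7 words, 8-dim embeddings
F = rng.normal(size=(16, 3 * 8)) * 0.1    # combine 3 neighboring words
b = np.zeros(16)
H = conv1d_words(E, F, b)                 # (5, 16): shorter, higher-level sequence
print(H.shape)
# Repeating convolution (and pooling) shrinks this towards a single sentence vector.
```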

24. attention

25. Attention
• Machine translation is a structured prediction task
  – output is not a single label
  – output structure needs to be built, word by word
• Relevant information for each word prediction varies
• Human translators pay attention to different parts of the input sentence when translating
⇒ Attention mechanism

26. Computing Attention
• Attention mechanism in the neural translation model (Bahdanau et al., 2015)
  – previous hidden state s_{i-1}
  – input word embedding h_j
  – trainable parameters b, W_a, U_a, v_a
      a(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j + b)
• Other ways to compute attention
  – Dot product: a(s_{i-1}, h_j) = s_{i-1}^T h_j
  – Scaled dot product: a(s_{i-1}, h_j) = (1 / √|h_j|) s_{i-1}^T h_j
  – General: a(s_{i-1}, h_j) = s_{i-1}^T W_a h_j
  – Local: a(s_{i-1}) = W_a s_{i-1}
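A sketch of these scoring functions in NumPy, followed by the usual softmax normalization into attention weights. Variable names and dimensions are assumptions; this is not the exact implementation from the cited papers:

```python
import numpy as np

def additive_score(s_prev, h_j, W_a, U_a, v_a, b):
    """Bahdanau et al. (2015): a(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j + b)."""
    return v_a @ np.tanh(W_a @ s_prev + U_a @ h_j + b)

def dot_score(s_prev, h_j):
    """Dot product: a(s_{i-1}, h_j) = s_{i-1}^T h_j."""
    return s_prev @ h_j

def scaled_dot_score(s_prev, h_j):
    """Scaled dot product: divide by the square root of the vector dimension."""
    return (s_prev @ h_j) / np.sqrt(len(h_j))

def general_score(s_prev, h_j, W_a):
    """General: a(s_{i-1}, h_j) = s_{i-1}^T W_a h_j."""
    return s_prev @ W_a @ h_j

rng = np.random.default_rng(0)
d = 8
s_prev = rng.normal(size=d)
H = rng.normal(size=(5, d))               # input representations h_1..h_5
W_a, U_a = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v_a, b = rng.normal(size=d), np.zeros(d)

scores = np.array([additive_score(s_prev, h, W_a, U_a, v_a, b) for h in H])
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over input positions
context = alpha @ H                             # attention-weighted input representation
print(alpha.round(3), context.shape)
print(dot_score(s_prev, H[0]), scaled_dot_score(s_prev, H[0]), general_score(s_prev, H[0], W_a))
```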

27. Attention of Luong et al. (2015)
• Luong et al. (2015) demonstrate good results with the dot product
    a(s_{i-1}, h_j) = s_{i-1}^T h_j
• No trainable parameters
• Additional changes
• Currently more popular
