Self-Attention for Generative Models


Ashish Vaswani and Anna Huang. Joint work with Noam Shazeer, Niki Parmar, Lukasz Kaiser, Illia Polosukhin, Llion Jones, Justin Gilmer, David Bieber, Jonathan Frankle, Jakob Uszkoreit, and others.


  1-6. Continuations to a given initial motif: the given motif, followed by continuations from an RNN-LSTM, a Transformer, and a Music Transformer.

  7. Self-Similarity in Music

  8. Sample from Music Transformer

  9-10. Attention: a weighted average, illustrated on a performance-event sequence (TimeShift100, TimeShift100, TimeShift30, NoteOn60, TimeShift20, NoteOn62, TimeShift90, NoteOff62, NoteOff60, TimeShift90); see the first sketch after this slide list.

  11. Convolution: a different linear transformation for each relative position, applied to the same event sequence; see the convolution sketch after this slide list.

  12. Relative attention (Shaw et al., 2018): multi-head attention + convolution? Again illustrated on the same event sequence; a relative-attention sketch follows the slide list.

  13. Closer look at attention (Q E_r^T).

  14. Closer look at relative attention: Q E_r^T is modulated by the relative positions; for a length-3 sequence the relative distances j - i form the matrix [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]].

  15. Machine translation results (Shaw et al., 2018):

      Model             Position representation   BLEU (En-De)   BLEU (En-Fr)
      Transformer Big   Absolute                   27.9           41.3
      Transformer Big   Relative                   29.2           41.5

  16. Previous work (Shaw et al., 2018): O(L^2 D) memory, 8.5 GB per layer at L = 2048, D = 512. A relative embedding is gathered for every pair of positions (relative distances 0; -1, 0; -2, -1, 0; ...) and the resulting tensor is multiplied by Q.

  17. Our formulation: O(LD) memory, 4.2 MB per layer at L = 2048, D = 512. Move from absolute-by-relative to absolute-by-absolute indexing by skewing Q E_r^T: pad, reshape, slice (sketched after this slide list).

  18. Goal of the skewing procedure: re-index Q E_r^T from (absolute position, relative distance) to (absolute position, absolute position).

  19. Skewing to reduce relative memory from O(L^2 D) to O(LD) (per layer, L = 2048, D = 512). Previous work: gather the relative embeddings E_r per pair of positions and multiply by Q, costing O(L^2 D), 8.5 GB. Ours: multiply Q by E_r^T directly and compute S_rel = skew(Q E_r^T) via pad, reshape, slice, costing O(LD), 4.2 MB.

  20-21. A Jazz sample from Music Transformer.

  22. Convolutions and Translational Equivariance (figure).

  23. Relative Positions and Translational Equivariance (figure).

  24. Relative Attention and Graphs
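For slides 9-10, a minimal NumPy sketch of attention as a weighted average over a sequence of embedded events. The sizes, variable names, and random inputs are illustrative assumptions, not code from the talk.

    # Attention as a weighted average (cf. slides 9-10). Illustrative sketch only.
    import numpy as np

    rng = np.random.default_rng(0)
    L, D = 10, 64                       # 10 events (TimeShift100, NoteOn60, ...), 64-dim embeddings
    X = rng.normal(size=(L, D))         # embedded event sequence (assumed, for illustration)
    Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(D)       # compatibility of every query with every key, shape (L, L)

    # causal mask: position i may only attend to positions j <= i
    logits[np.triu(np.ones((L, L), dtype=bool), k=1)] = -np.inf

    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # each row is a probability distribution
    output = weights @ V                # each output is a weighted average of the value vectors

Each output position is a convex combination of the value vectors, which is the "weighted average" the slide refers to.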
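For slide 11, a sketch of the point the slide makes: a 1-D convolution applies a different linear transformation for each relative position in its window. Kernel width, dimensions, and zero padding are my assumptions.

    # Convolution as position-dependent linear transformations (cf. slide 11).
    import numpy as np

    rng = np.random.default_rng(0)
    L, D, k = 10, 64, 3                 # sequence length, model dim, kernel width
    X = rng.normal(size=(L, D))
    W = rng.normal(size=(k, D, D))      # one D x D matrix per relative offset: -1, 0, +1

    Y = np.zeros((L, D))
    for r, offset in enumerate(range(-(k // 2), k // 2 + 1)):
        for i in range(L):
            j = i + offset
            if 0 <= j < L:              # zero padding at the sequence boundaries
                Y[i] += X[j] @ W[r]     # the transform depends only on the relative offset

Because the same W[r] is reused at every absolute position, the operation is translation equivariant; slides 22-23 make the analogous point for relative positions in attention.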
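For slides 12-16, a sketch of relative attention in the spirit of Shaw et al. (2018): the attention logits are modulated by an embedding of the relative distance j - i, and the straightforward implementation gathers an L x L x D tensor, which is the O(L^2 D) memory cost quoted on slide 16. Variable names and the clipping choice are my assumptions.

    # Relative attention with an explicit (L, L, D) gather (cf. slides 12-16).
    import numpy as np

    rng = np.random.default_rng(0)
    L, D = 4, 8
    Q = rng.normal(size=(L, D))            # queries for one attention head
    E_rel = rng.normal(size=(L, D))        # one embedding per relative distance -(L-1) .. 0

    # relative distance matrix, as on slide 14: rel[i, j] = j - i
    idx = np.arange(L)
    rel = idx[None, :] - idx[:, None]      # [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]] for L = 3

    # previous work: gather an (L, L, D) tensor of distance embeddings, then contract with Q
    clipped = np.clip(rel, -(L - 1), 0)    # future positions are masked out later anyway
    R = E_rel[clipped + (L - 1)]           # (L, L, D)
    S_rel = np.einsum('id,ijd->ij', Q, R)  # relative logits, added to Q K^T before the softmax

At L = 2048 and D = 512, the gathered tensor R is what the deck prices at 8.5 GB per layer.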
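For slides 17-19, a sketch of the skewing procedure: multiply Q directly by E_rel^T, then pad, reshape, and slice so that entry (i, j) holds the logit for relative distance j - i. The deck gives only the pad/reshape/slice recipe; the function and variable names below are mine.

    # Skewing Q E_rel^T into S_rel (cf. slides 17-19). Extra memory is O(L * D).
    import numpy as np

    def skew(qe):
        """qe[i, r]: logit of query i with relative distance r - (L - 1); returns S_rel[i, j]."""
        L = qe.shape[0]
        padded = np.pad(qe, ((0, 0), (1, 0)))   # prepend a dummy column -> (L, L + 1)
        reshaped = padded.reshape(L + 1, L)     # re-wrapping shifts each row one step further
        return reshaped[1:, :]                  # drop the first row -> (L, L), indexed by (i, j)

    rng = np.random.default_rng(0)
    L, D = 4, 8
    Q = rng.normal(size=(L, D))
    E_rel = rng.normal(size=(L, D))             # distances -(L-1) .. 0, as in the sketch above

    S_rel = skew(Q @ E_rel.T)                   # (L, L) relative logits, no (L, L, D) gather

    # sanity check against the gather formulation (lower triangle only; the upper
    # triangle corresponds to future positions and is masked out by causal attention)
    idx = np.arange(L)
    rel = np.clip(idx[None, :] - idx[:, None], -(L - 1), 0)
    S_ref = np.einsum('id,ijd->ij', Q, E_rel[rel + (L - 1)])
    assert np.allclose(np.tril(S_rel), np.tril(S_ref))

The per-layer relative-position storage drops from the (L, L, D) tensor to the (L, D) matrix E_rel, which is the deck's 8.5 GB vs 4.2 MB comparison at L = 2048, D = 512.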
