Continuations to a given initial motif: given motif, RNN-LSTM, Transformer, Music Transformer
Self-Similarity in Music
Sample from Music Transformer
Attention: a weighted average

Example event sequence: TimeShift100 TimeShift100 TimeShift30 NoteOn60 TimeShift20 NoteOn62 TimeShift90 NoteOff62 NoteOff60 TimeShift90
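The "weighted average" view of attention can be sketched as scaled dot-product self-attention over embedded event tokens (a minimal NumPy sketch; the function and variable names are illustrative, not from the talk):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each output row is a weighted
    average of the rows of V, with weights given by a softmax over
    the query-key similarities."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (L, L) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # each row of weights sums to 1
    return w @ V                                  # weighted average of the values

# Toy usage: 4 events (e.g. embeddings of NoteOn/TimeShift tokens), dimension 8
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out = attention(X, X, X)  # self-attention: queries, keys, values from one sequence
```

Because the softmax weights are non-negative and sum to one, every output row is a convex combination of the value rows.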
Convolution: different linear transformations by relative position (over the same event sequence)
Relative attention (Shaw et al., 2018): multihead attention + convolution?
Closer look at attention: QEᵣᵀ
Closer look at relative attention: QEᵣᵀ, modulated by relative positions (distance j − i for query i, key j):

        j=0  j=1  j=2
  i=0     0    1    2
  i=1    -1    0    1
  i=2    -2   -1    0
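The modulation by relative position can be sketched in the original formulation of Shaw et al. (2018): gather one learned embedding per relative distance into an L×L×D tensor, then contract it with Q (a NumPy sketch; names are illustrative):

```python
import numpy as np

L, D = 3, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((L, D))          # queries
E = rng.standard_normal((2 * L - 1, D))  # one embedding per distance -(L-1)..(L-1)

# Relative distance j - i for each (query i, key j) pair -- the matrix on the slide
rel = np.arange(L)[None, :] - np.arange(L)[:, None]
# rel == [[ 0,  1,  2],
#         [-1,  0,  1],
#         [-2, -1,  0]]

# Shaw et al. (2018): materialize an (L, L, D) tensor of relative embeddings,
# then contract with Q -- that intermediate tensor is the O(L^2 D) memory cost
R = E[rel + (L - 1)]                     # (L, L, D)
S_rel = np.einsum('id,ijd->ij', Q, R)    # (L, L) relative logits, added to QK^T
```

The (L, L, D) intermediate `R` is exactly what the skewing procedure later avoids.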
Machine Translation (Shaw et al., 2018)

  Model            Position representation   BLEU En-De   BLEU En-Fr
  Transformer Big  Absolute                  27.9         41.3
  Transformer Big  Relative                  29.2         41.5
Previous work (Shaw et al., 2018): O(L²D) — 8.5 GB per layer (L=2048, D=512)
Gather relative embeddings indexed by relative distance, then multiply by Q:
   0
  -1  0
  -2 -1  0
Our formulation: O(LD) — 4.2 MB per layer (L=2048, D=512)
From absolute-by-absolute to absolute-by-relative indexing via a skew: pad, reshape, slice
Goal of skewing procedure: reindex the matrix from absolute-by-absolute positions (i, j) to absolute-by-relative positions
Skewing to reduce relative memory from O(L²D) to O(LD) (per layer, L=2048, D=512)
  Previous work: gather relative embeddings Eᵣ by relative distance, then multiply by Q — O(L²D), 8.5 GB
  Our work: directly multiply Q by Eᵀ, then skew(QEᵀ) = S_rel via pad, reshape, slice — O(LD), 4.2 MB
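The pad → reshape → slice steps can be sketched as follows (a NumPy sketch of the skewing procedure, under the convention that column r of QEᵀ corresponds to relative distance r − (L − 1); upper-triangle entries come out as garbage and are removed by the causal attention mask):

```python
import numpy as np

def skew(qe):
    """Convert qe = Q @ E.T, indexed (absolute position i, relative distance),
    into S_rel, indexed (absolute i, absolute j), without materializing an
    (L, L, D) tensor: O(LD) extra memory instead of O(L^2 D)."""
    L = qe.shape[0]
    padded = np.pad(qe, [(0, 0), (1, 0)])  # prepend a dummy column -> (L, L+1)
    reshaped = padded.reshape(L + 1, L)    # reinterpret the flat buffer
    return reshaped[1:, :]                 # drop the first row -> (L, L) = S_rel

# After skewing, S_rel[i, j] == qe[i, (L - 1) + j - i] for j <= i;
# entries with j > i are invalid and masked out by the causal mask.
```

For example, with L=3 and qe = [[1,2,3],[4,5,6],[7,8,9]], the valid lower triangle of skew(qe) is 3 / 5 6 / 7 8 9: each row i is shifted so that the distance-0 entry qe[i, 2] lands on the diagonal.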
A Jazz sample from Music Transformer
Convolutions and Translational Equivariance
Relative positions: Translational Equivariance
Relative Attention And Graphs