

  1. GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
     RNN and Musical Applications
     Juhan Nam

  2. Motivation
     ● When the output is sequential, e.g., pitch estimation or note transcription, a CNN predicts each of the successive outputs independently
     ● Can we predict the output considering the surrounding context (features in previous or next predictions) for better performance?
     [Figure: a CNN sliding over input frames; each prediction is independent]

  3. Motivation
     ● Method #1: increase the input context size
       ○ May improve the performance but needs a relatively deeper network
       ○ Limited to capturing successive features only in close neighbors

  4. Motivation
     ● Method #2: connect the feature maps temporally
       ○ We use information not only from the input but also from the hidden-layer states in the previous or next time steps
       ○ We regard the hidden-unit activations as dynamic states that are successively updated
       ○ By doing so, the model is expected to capture a wider input context
       ○ The update can be forward or backward in time

  5. Recurrent Neural Networks (RNN)
     ● A family of neural networks that have connections between the previous states and the current states of the hidden layers
       ○ The hidden units are "state vectors" with regard to the input index (i.e., time)
     [Figure: a stack of hidden layers $h_1, h_2, h_3$ over the input $x$, each layer with a recurrent connection to itself]

  6. Recurrent Neural Networks (RNN)
     ● This simple structure is often called the "vanilla RNN"
       ○ $\tanh(x)$ is a common choice of the activation function $f(x)$
     $h_1(t) = f(W_1\, x(t) + U_1\, h_1(t-1) + b_1)$
     $h_2(t) = f(W_2\, h_1(t) + U_2\, h_2(t-1) + b_2)$
     $h_3(t) = f(W_3\, h_2(t) + U_3\, h_3(t-1) + b_3)$
     $y(t) = g(W_4\, h_3(t) + b_4), \qquad t = 0, 1, 2, \ldots$
       ○ The recurrent connections $U_i$ determine how much information from the previous state is used to update the current state
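
     As an illustration (not from the slides), a minimal NumPy sketch of this update for a single hidden layer; the function name rnn_step and all sizes are hypothetical, and the weight names follow the $W$, $U$, $b$ notation reconstructed above.

     ```python
     import numpy as np

     def rnn_step(x_t, h_prev, W, U, b):
         # One vanilla-RNN update: h(t) = f(W x(t) + U h(t-1) + b), with f = tanh.
         # W mixes the current input, U is the recurrent connection that decides
         # how much of the previous state carries over, b is a bias.
         return np.tanh(W @ x_t + U @ h_prev + b)

     # Toy usage with illustrative sizes: 3-dim input, 4-dim hidden state.
     rng = np.random.default_rng(0)
     W = 0.1 * rng.normal(size=(4, 3))
     U = 0.1 * rng.normal(size=(4, 4))
     b = np.zeros(4)
     h = rnn_step(rng.normal(size=3), np.zeros(4), W, U, b)
     ```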

  7. Training RNN: Forward Pass
     ● The hidden layers keep updating the states over the time steps
       ○ Regard the network progressively extended over time as a single large neural network in which the weights ($W_i$, $U_i$) are shared at each time step
     [Figure: unrolled RNN mapping the inputs $x(1), \ldots, x(T)$ to the outputs $y(1), \ldots, y(T)$ with shared weights]
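
     A minimal sketch of the unrolled forward pass under the same notation (not from the slides): one set of weights (W, U, b for the hidden layer; V, c for the output) is reused at every step. All sizes are illustrative.

     ```python
     import numpy as np

     def rnn_forward(xs, W, U, b, V, c):
         # Unrolled forward pass: the same weights are reused at every time
         # step, so the unrolled network is one large network with shared
         # parameters.
         h = np.zeros(U.shape[0])
         ys = []
         for x_t in xs:                        # t = 1 ... T
             h = np.tanh(W @ x_t + U @ h + b)  # hidden-state update
             ys.append(V @ h + c)              # per-step output y(t)
         return np.stack(ys)

     # Toy run: T=5 input frames of dimension 3, hidden size 4, output size 2.
     rng = np.random.default_rng(0)
     W, U, b = 0.1 * rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(4, 4)), np.zeros(4)
     V, c = 0.1 * rng.normal(size=(2, 4)), np.zeros(2)
     outputs = rnn_forward(rng.normal(size=(5, 3)), W, U, b, V, c)  # shape (5, 2)
     ```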

  8. Training RNN: Backward Pass
     ● Backpropagation through time (BPTT)
       ○ Gradients flow both in the top-down pass and through time
     [Figure: unrolled RNN with per-step losses $L(1), L(2), \ldots, L(T)$; gradients flow backward through the shared weights]
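
     A toy PyTorch sketch of BPTT (not from the slides): unrolling the recurrence builds one large computation graph, and a single backward() call sends gradients through every time step into the shared weights. The per-step loss here is a dummy placeholder.

     ```python
     import torch

     T, n_in, n_hid = 20, 3, 8
     W = (0.1 * torch.randn(n_hid, n_in)).requires_grad_()
     U = (0.1 * torch.randn(n_hid, n_hid)).requires_grad_()
     b = torch.zeros(n_hid, requires_grad=True)

     xs = torch.randn(T, n_in)            # stand-in input sequence
     h = torch.zeros(n_hid)
     losses = []
     for t in range(T):                   # build the unrolled graph
         h = torch.tanh(W @ xs[t] + U @ h + b)
         losses.append(h.pow(2).mean())   # dummy per-step loss L(t)

     total = torch.stack(losses).sum()
     total.backward()                     # BPTT: gradients flow back through
                                          # every time step into the shared U
     print(U.grad.norm())
     ```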

  9. The Problem of Vanilla RNN
     ● As the number of time steps increases during training, the gradients in BPTT can become unstable
       ○ Exploding or vanishing gradients
     ● Exploding gradients can be controlled by gradient clipping, but vanishing gradients require a different architecture
     ● In practice, vanilla RNNs are used only when the input is a short sequence
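
     A short sketch of gradient clipping using PyTorch's torch.nn.utils.clip_grad_norm_ (a real API); the model and loss here are placeholders.

     ```python
     import torch

     model = torch.nn.RNN(input_size=3, hidden_size=8)   # a vanilla (tanh) RNN
     opt = torch.optim.SGD(model.parameters(), lr=0.1)

     x = torch.randn(20, 1, 3)            # (seq_len, batch, input_size)
     out, _ = model(x)
     loss = out.pow(2).mean()             # dummy loss

     opt.zero_grad()
     loss.backward()
     # Rescale gradients so their global norm is at most 1.0: this controls
     # exploding gradients, but does nothing against vanishing gradients.
     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
     opt.step()
     ```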

  10. Vanilla RNN
     ● Another view
     Source: Colah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  11. Long Short-Term Memory (LSTM)
     ● Four neural network layers in one module
       ○ Two recurrent flows
     Source: Colah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  12. Long Short-Term Memory (LSTM)
     ● Cell state ("the key to LSTM")
       ○ Information can flow through without a change: similar to the skip connection!
       ○ The sigmoid gates are a relaxation of the binary gate (0 or 1)
     [Figure: forget gate, input gate, and new candidate information acting on the cell state]
     Source: Colah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
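
     For reference, the cell-state update described above can be written out; this is the standard LSTM formulation used in Colah's blog (the weight names $W_f, W_i, W_C$ and the concatenation $[h_{t-1}, x_t]$ follow that source, not the slides):

     $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$  (forget gate)
     $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$  (input gate)
     $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$  (new candidate information)
     $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$  (cell-state update)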

  13. Long Short-Term Memory (LSTM)
     ● Generate the next state from the cell
     [Figure: output gate]
     Source: Colah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
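
     Completing the picture, the output-gate equations in the same standard notation:

     $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$  (output gate)
     $h_t = o_t \odot \tanh(C_t)$  (next state, generated from the cell)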

  14. Long Short-Term Memory (LSTM)
     ● Long-term dependency can be learned
     ● Much more powerful than the vanilla RNNs
       ○ We can use long sequence data as input
       ○ The structure with two recurrent flows is similar to ResNet
       ○ Uninterrupted gradient flow is possible through the cell over time steps
     [Figure: the 34-layer residual network (ResNet) architecture, for comparison with the LSTM's uninterrupted gradient flow]

  15. Sequence Setups using RNN
     ● There are several different input and output setups in RNN
       ○ Many-to-many
       ○ Many-to-one
       ○ One-to-many
       ○ Many-to-one/one-to-many (Seq2Seq)

  16. Many-to-Many RNN
     ● Both the input and the output are sequences
       ○ Assume that the input and output data are strongly aligned in the training data
         ■ When the alignment is weak, an attention layer is added or Seq2Seq is used
       ○ A bi-directional RNN is more commonly used unless it is for a real-time system
       ○ Use cases
         ■ Video classification: image frames to label frames
         ■ Part-of-speech tagging: sentence to tags
         ■ Automatic music transcription: audio to notes/pitches/beats/chords
         ■ Sound event detection: audio to events
     [Figure: uni-directional RNN (uses past information only) vs. bi-directional RNN (uses both past and future information)]
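
     A minimal PyTorch sketch of the many-to-many setup with a bi-directional RNN; all sizes, including the 25 frame-level classes, are illustrative.

     ```python
     import torch

     # Frame-level tagging (many-to-many), e.g., per-frame note/beat/chord
     # labels. bidirectional=True uses both past and future context, so it
     # suits offline analysis rather than real-time systems.
     rnn = torch.nn.LSTM(input_size=64, hidden_size=128,
                         bidirectional=True, batch_first=True)
     head = torch.nn.Linear(2 * 128, 25)      # 2x for the two directions

     x = torch.randn(8, 100, 64)              # (batch, frames, features)
     h, _ = rnn(x)                            # (8, 100, 256): one state per frame
     frame_logits = head(h)                   # (8, 100, 25): one label per frame
     ```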

  17. Convolutional Recurrent Neural Network (CRNN)
     ● When the input is high-dimensional (image or audio), a CNN and an RNN are combined
       ○ The CNN provides the embedding vector, which is used as the input of the RNN
     [Figure: video/audio → CNN → image/audio embedding → RNN → frame-level labels (notes/pitches/beats/chords/events)]
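
     A sketch of one possible CRNN under assumed sizes (128 mel bins, 25 frame-level classes, made up for illustration); the pooling reduces only the frequency axis so the sequence stays frame-aligned for the RNN.

     ```python
     import torch
     import torch.nn as nn

     class CRNN(nn.Module):
         # A minimal CRNN: a small CNN turns each spectrogram column into an
         # embedding vector, and an RNN tags the embedding sequence.
         def __init__(self, n_mels=128, n_classes=25):
             super().__init__()
             self.cnn = nn.Sequential(
                 nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                 nn.MaxPool2d((2, 1)),            # pool frequency, keep time
                 nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                 nn.MaxPool2d((2, 1)),
             )
             self.rnn = nn.LSTM(32 * (n_mels // 4), 64,
                                bidirectional=True, batch_first=True)
             self.head = nn.Linear(128, n_classes)

         def forward(self, spec):                 # spec: (batch, 1, n_mels, frames)
             z = self.cnn(spec)                   # (batch, 32, n_mels//4, frames)
             z = z.flatten(1, 2).transpose(1, 2)  # (batch, frames, embedding)
             h, _ = self.rnn(z)
             return self.head(h)                  # frame-level logits

     logits = CRNN()(torch.randn(2, 1, 128, 100))  # -> (2, 100, 25)
     ```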

  18. Language Model (LM)
     ● Predict the next word given a sequence of words
       ○ Compute the probability distribution of the next word $y^{(t)}$ given a sequence of words $y^{(t-1)}, \ldots, y^{(1)}$ → $P(y^{(t)} \mid y^{(t-1)}, \ldots, y^{(1)})$
         ■ $y^{(t)}$ can be any word from the vocabulary $V = \{w_1, \ldots, w_{|V|}\}$
       ○ The likelihood can be computed for a sentence:
         $P(y^{(1)}, y^{(2)}, \ldots, y^{(T)}) = \prod_{t=1}^{T} P(y^{(t)} \mid y^{(t-1)}, \ldots, y^{(1)})$
       ○ Trained in the many-to-many RNN setting
         ■ Transformers are more dominantly used these days
       ○ LM has many applications
         ■ Text generation: predict the most likely words
         ■ Speech recognition (acoustic model + LM)
     ● By replacing words with musical notes or MIDI events, we can build a "musical language model", which can be used for music generation and automatic music transcription
     [Figure: a language model using an RNN over word embeddings, predicting "am", "so", "full" from the inputs "I", "am", "so"]
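
     A minimal RNN language model sketch in PyTorch; the vocabulary size and dimensions are invented. Replacing word tokens with note or MIDI-event tokens yields the musical language model mentioned above.

     ```python
     import torch
     import torch.nn as nn

     class RNNLM(nn.Module):
         # Embed tokens, run an RNN, and score the next token at every step.
         def __init__(self, vocab_size=1000, emb=32, hid=64):
             super().__init__()
             self.emb = nn.Embedding(vocab_size, emb)
             self.rnn = nn.LSTM(emb, hid, batch_first=True)
             self.out = nn.Linear(hid, vocab_size)

         def forward(self, tokens):               # tokens: (batch, T)
             h, _ = self.rnn(self.emb(tokens))
             return self.out(h)                   # logits for the *next* token

     model = RNNLM()
     tokens = torch.randint(0, 1000, (4, 12))     # toy token sequences
     logits = model(tokens)                       # (4, 12, 1000)
     probs = logits.softmax(dim=-1)               # P(y(t+1) | y(1..t)) per prefix
     ```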

  19. Many-to-One RNN
     ● The input is a sequence and the output is a categorical label
       ○ The input sequence can have a variable length!
       ○ Use cases
         ■ Text classification: sentence to labels (positive/negative)
         ■ Music genre/mood/audio scene classification and tagging: audio to labels
         ■ Video scene classification: image frames to labels
     [Figure: text classification over word embeddings ("You", "are", "awesome" → "positive") and audio/video scene classification over audio/image embeddings (→ "park")]
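
     A many-to-one sketch: the final hidden state summarizes the whole sequence, and a linear head produces one label. All sizes, and the 10-class head, are illustrative.

     ```python
     import torch

     rnn = torch.nn.GRU(input_size=64, hidden_size=128, batch_first=True)
     head = torch.nn.Linear(128, 10)          # e.g., 10 genre/mood tags

     for T in (50, 87, 120):                  # variable-length inputs are fine
         x = torch.randn(1, T, 64)
         _, h_last = rnn(x)                   # (1, 1, 128): final hidden state
         logits = head(h_last.squeeze(0))     # one label per whole sequence
     ```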

  20. One-to-Many RNN
     ● The input is a single shot of data and the output is sequence data
       ○ This is regarded as a conditional generation model
       ○ Use cases
         ■ Image captioning: generate the text description of an image
         ■ Music playlist generation: playlist title (text) to a sequence of song embedding vectors
     [Figure: image captioning (image embedding, <start> → "The", "trees", "are", "yellow") and music playlist generation (text embedding of "Bossa Nova Jazz Best", <start> → track1, track2, track3)]
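
     A one-to-many sketch: a single conditioning vector (standing in for an image or playlist-title embedding) initializes the state, and the RNN then emits a sequence step by step, feeding each prediction back in. All names and sizes are hypothetical.

     ```python
     import torch
     import torch.nn as nn

     cond = torch.randn(1, 256)                 # one-shot conditioning input
     h = torch.tanh(nn.Linear(256, 128)(cond))  # map condition -> initial state

     cell = nn.GRUCell(32, 128)
     emit = nn.Linear(128, 32)                  # next-step output embedding
     step_in = torch.zeros(1, 32)               # stand-in for a <start> token
     outputs = []
     for _ in range(4):                         # generate a short sequence
         h = cell(step_in, h)
         step_in = emit(h)                      # feed the prediction back in
         outputs.append(step_in)
     ```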

  21. Many-to-One/One-to-Many (Seq2Seq)
     ● Both the input and the output are sequences
       ○ Assume that the input and output data are not aligned
       ○ Regarded as an encoder-decoder RNN framework
         ■ The encoder produces a compressed latent vector of the input sequence; the decoder becomes a conditional text generation model
       ○ Use cases
         ■ Machine translation: sentence to sentence (neural machine translation)
         ■ Speech recognition / note-level singing voice transcription: audio to text/notes
     ● Seq2Seq is a conditional language model!
     [Figure: encoder-decoder RNNs for machine translation ("I", "love", "you", <EOS> → "난", "너를", "사랑해", i.e., "I love you" in Korean) and for speech recognition / singing voice transcription]
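
     A bare-bones encoder-decoder sketch: the encoder's final state is the compressed latent vector, which conditions the decoder (teacher-forced here with target tokens, as in training). Vocabulary sizes and dimensions are invented.

     ```python
     import torch
     import torch.nn as nn

     emb_src = nn.Embedding(500, 32)           # source-token embeddings
     emb_tgt = nn.Embedding(600, 32)           # target-token embeddings
     encoder = nn.GRU(32, 128, batch_first=True)
     decoder = nn.GRU(32, 128, batch_first=True)
     out = nn.Linear(128, 600)

     src = torch.randint(0, 500, (1, 7))       # e.g., "I love you <EOS>"
     _, latent = encoder(emb_src(src))         # compressed latent vector

     tgt = torch.randint(0, 600, (1, 5))       # shifted target tokens (training)
     dec_h, _ = decoder(emb_tgt(tgt), latent)  # decoder conditioned on latent
     logits = out(dec_h)                       # (1, 5, 600): next-token scores
     ```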

  22. MIR Tasks Using RNN
     ● Many-to-many RNN
       ○ Vocal melody extraction
       ○ Polyphonic piano transcription
       ○ Beat tracking
       ○ Chord recognition
     ● Many-to-one RNN
       ○ Music auto-tagging
