

  1. GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
     RNN and Musical Applications
     Juhan Nam

  2. Motivation
     ● When the output is sequential, e.g., pitch estimation or note transcription, a CNN predicts each of the successive outputs independently
     ● Can we predict the output considering the surrounding context (features in previous or next predictions) for better performance?
     [Figure: a CNN sliding over input frames; each prediction is independent]

  3. Motivation
     ● Method #1: increase the input context size
       ○ May improve the performance but needs a relatively deeper network
       ○ Limited to capturing successive features only in close neighbors

  4. Motivation
     ● Method #2: connect the feature maps temporally
       ○ We use information not only from the input but also from the hidden-layer states in the previous or next time steps
       ○ We regard the hidden-unit activations as dynamic states that are successively updated
       ○ By doing so, the model is expected to capture a wider input context
       ○ The update can be forward or backward in time

  5. Recurrent Neural Networks (RNN)
     ● A family of neural networks that have connections between the previous states and the current states of the hidden layers
       ○ The hidden units are "state vectors" with regard to the input index (i.e., time)
     [Figure: a stack of hidden layers $h_1, h_2, h_3$ over the input $x$, each layer with a recurrent connection to itself]

  6. Recurrent Neural Networks (RNN)
     ● This simple structure is often called the "vanilla RNN"
       ○ $\tanh(x)$ is a common choice of the activation function $f(x)$
     $h_1(t) = f(W_1\, x(t) + U_1\, h_1(t-1) + b_1)$
     $h_2(t) = f(W_2\, h_1(t) + U_2\, h_2(t-1) + b_2)$
     $h_3(t) = f(W_3\, h_2(t) + U_3\, h_3(t-1) + b_3)$
     $y(t) = g(W_4\, h_3(t) + b_4), \qquad t = 0, 1, 2, \ldots$
       ○ The recurrent connections $U_i$ determine how much information from the previous state is used to update the current state
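
     As an illustration (not from the slides), a minimal NumPy sketch of this update for a single hidden layer; the function name rnn_step and all sizes are hypothetical, and the weight names follow the $W$, $U$, $b$ notation reconstructed above.

     ```python
     import numpy as np

     def rnn_step(x_t, h_prev, W, U, b):
         # One vanilla-RNN update: h(t) = f(W x(t) + U h(t-1) + b), with f = tanh.
         # W mixes the current input, U is the recurrent connection that decides
         # how much of the previous state carries over, b is a bias.
         return np.tanh(W @ x_t + U @ h_prev + b)

     # Toy usage with illustrative sizes: 3-dim input, 4-dim hidden state.
     rng = np.random.default_rng(0)
     W = 0.1 * rng.normal(size=(4, 3))
     U = 0.1 * rng.normal(size=(4, 4))
     b = np.zeros(4)
     h = rnn_step(rng.normal(size=3), np.zeros(4), W, U, b)
     ```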

  7. Training RNN: Forward Pass
     ● The hidden layers keep updating the states over the time steps
       ○ Regard the network progressively extended over time as a single large neural network in which the weights ($W_i$, $U_i$) are shared at each time step
     [Figure: unrolled RNN mapping the inputs $x(1), \ldots, x(T)$ to the outputs $y(1), \ldots, y(T)$ with shared weights]
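
     A minimal sketch of the unrolled forward pass under the same notation (not from the slides): one set of weights (W, U, b for the hidden layer; V, c for the output) is reused at every step. All sizes are illustrative.

     ```python
     import numpy as np

     def rnn_forward(xs, W, U, b, V, c):
         # Unrolled forward pass: the same weights are reused at every time
         # step, so the unrolled network is one large network with shared
         # parameters.
         h = np.zeros(U.shape[0])
         ys = []
         for x_t in xs:                        # t = 1 ... T
             h = np.tanh(W @ x_t + U @ h + b)  # hidden-state update
             ys.append(V @ h + c)              # per-step output y(t)
         return np.stack(ys)

     # Toy run: T=5 input frames of dimension 3, hidden size 4, output size 2.
     rng = np.random.default_rng(0)
     W, U, b = 0.1 * rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(4, 4)), np.zeros(4)
     V, c = 0.1 * rng.normal(size=(2, 4)), np.zeros(2)
     outputs = rnn_forward(rng.normal(size=(5, 3)), W, U, b, V, c)  # shape (5, 2)
     ```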

  8. Training RNN: Backward Pass
     ● Backpropagation through time (BPTT)
       ○ Gradients flow both in the top-down pass and through time
     [Figure: unrolled RNN with per-step losses $L(1), L(2), \ldots, L(T)$; gradients flow backward through the shared weights]
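
     A toy PyTorch sketch of BPTT (not from the slides): unrolling the recurrence builds one large computation graph, and a single backward() call sends gradients through every time step into the shared weights. The per-step loss here is a dummy placeholder.

     ```python
     import torch

     T, n_in, n_hid = 20, 3, 8
     W = (0.1 * torch.randn(n_hid, n_in)).requires_grad_()
     U = (0.1 * torch.randn(n_hid, n_hid)).requires_grad_()
     b = torch.zeros(n_hid, requires_grad=True)

     xs = torch.randn(T, n_in)            # stand-in input sequence
     h = torch.zeros(n_hid)
     losses = []
     for t in range(T):                   # build the unrolled graph
         h = torch.tanh(W @ xs[t] + U @ h + b)
         losses.append(h.pow(2).mean())   # dummy per-step loss L(t)

     total = torch.stack(losses).sum()
     total.backward()                     # BPTT: gradients flow back through
                                          # every time step into the shared U
     print(U.grad.norm())
     ```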

  9. The Problem of Vanilla RNN
     ● As the number of time steps increases during training, the gradients in BPTT can become unstable
       ○ Exploding or vanishing gradients
     ● Exploding gradients can be controlled by gradient clipping, but vanishing gradients require a different architecture
     ● In practice, vanilla RNNs are used only when the input is a short sequence
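
     A short sketch of gradient clipping using PyTorch's torch.nn.utils.clip_grad_norm_ (a real API); the model and loss here are placeholders.

     ```python
     import torch

     model = torch.nn.RNN(input_size=3, hidden_size=8)   # a vanilla (tanh) RNN
     opt = torch.optim.SGD(model.parameters(), lr=0.1)

     x = torch.randn(20, 1, 3)            # (seq_len, batch, input_size)
     out, _ = model(x)
     loss = out.pow(2).mean()             # dummy loss

     opt.zero_grad()
     loss.backward()
     # Rescale gradients so their global norm is at most 1.0: this controls
     # exploding gradients, but does nothing against vanishing gradients.
     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
     opt.step()
     ```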

  10. Vanilla RNN
     ● Another view
     Source: Colah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  11. Long Short-Term Memory (LSTM)
     ● Four neural network layers in one module
       ○ Two recurrent flows
     Source: Colah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  12. Long Short-Term Memory (LSTM)
     ● Cell state ("the key to LSTM")
       ○ Information can flow through without a change: similar to the skip connection!
       ○ The sigmoid gates are a relaxation of the binary gate (0 or 1)
     [Figure: forget gate, input gate, and new candidate information acting on the cell state]
     Source: Colah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
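
     For reference, the cell-state update described above can be written out; this is the standard LSTM formulation used in Colah's blog (the weight names $W_f, W_i, W_C$ and the concatenation $[h_{t-1}, x_t]$ follow that source, not the slides):

     $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$  (forget gate)
     $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$  (input gate)
     $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$  (new candidate information)
     $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$  (cell-state update)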

  13. Long Short-Term Memory (LSTM)
     ● Generate the next state from the cell
     [Figure: output gate]
     Source: Colah's blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
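
     Completing the picture, the output-gate equations in the same standard notation:

     $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$  (output gate)
     $h_t = o_t \odot \tanh(C_t)$  (next state, generated from the cell)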

  14. Long Short-Term Memory (LSTM)
     ● Long-term dependency can be learned
     ● Much more powerful than the vanilla RNNs
       ○ We can use long sequence data as input
       ○ The structure with two recurrent flows is similar to ResNet
       ○ Uninterrupted gradient flow is possible through the cell over time steps
     [Figure: the 34-layer residual network (ResNet) architecture, for comparison with the LSTM's uninterrupted gradient flow]

  15. Sequence Setups using RNN
     ● There are several different input and output setups in RNN
       ○ Many-to-many
       ○ Many-to-one
       ○ One-to-many
       ○ Many-to-one/one-to-many (Seq2Seq)

  16. Many-to-Many RNN
     ● Both the input and the output are sequences
       ○ Assume that the input and output data are strongly aligned in the training data
         ■ When the alignment is weak, an attention layer is added or Seq2Seq is used
       ○ A bi-directional RNN is more commonly used unless it is for a real-time system
       ○ Use cases
         ■ Video classification: image frames to label frames
         ■ Part-of-speech tagging: sentence to tags
         ■ Automatic music transcription: audio to notes/pitches/beats/chords
         ■ Sound event detection: audio to events
     [Figure: uni-directional RNN (uses past information only) vs. bi-directional RNN (uses both past and future information)]
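
     A minimal PyTorch sketch of the many-to-many setup with a bi-directional RNN; all sizes, including the 25 frame-level classes, are illustrative.

     ```python
     import torch

     # Frame-level tagging (many-to-many), e.g., per-frame note/beat/chord
     # labels. bidirectional=True uses both past and future context, so it
     # suits offline analysis rather than real-time systems.
     rnn = torch.nn.LSTM(input_size=64, hidden_size=128,
                         bidirectional=True, batch_first=True)
     head = torch.nn.Linear(2 * 128, 25)      # 2x for the two directions

     x = torch.randn(8, 100, 64)              # (batch, frames, features)
     h, _ = rnn(x)                            # (8, 100, 256): one state per frame
     frame_logits = head(h)                   # (8, 100, 25): one label per frame
     ```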

  17. Convolutional Recurrent Neural Network (CRNN)
     ● When the input is high-dimensional (image or audio), a CNN and an RNN are combined
       ○ The CNN provides the embedding vector, which is used as the input of the RNN
     [Figure: video/audio → CNN → image/audio embedding → RNN → frame-level labels (notes/pitches/beats/chords/events)]
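
     A sketch of one possible CRNN under assumed sizes (128 mel bins, 25 frame-level classes, made up for illustration); the pooling reduces only the frequency axis so the sequence stays frame-aligned for the RNN.

     ```python
     import torch
     import torch.nn as nn

     class CRNN(nn.Module):
         # A minimal CRNN: a small CNN turns each spectrogram column into an
         # embedding vector, and an RNN tags the embedding sequence.
         def __init__(self, n_mels=128, n_classes=25):
             super().__init__()
             self.cnn = nn.Sequential(
                 nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                 nn.MaxPool2d((2, 1)),            # pool frequency, keep time
                 nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                 nn.MaxPool2d((2, 1)),
             )
             self.rnn = nn.LSTM(32 * (n_mels // 4), 64,
                                bidirectional=True, batch_first=True)
             self.head = nn.Linear(128, n_classes)

         def forward(self, spec):                 # spec: (batch, 1, n_mels, frames)
             z = self.cnn(spec)                   # (batch, 32, n_mels//4, frames)
             z = z.flatten(1, 2).transpose(1, 2)  # (batch, frames, embedding)
             h, _ = self.rnn(z)
             return self.head(h)                  # frame-level logits

     logits = CRNN()(torch.randn(2, 1, 128, 100))  # -> (2, 100, 25)
     ```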

  18. Language Model (LM)
     ● Predict the next word given a sequence of words
       ○ Compute the probability distribution of the next word $y^{(t)}$ given a sequence of words $y^{(t-1)}, \ldots, y^{(1)}$ → $P(y^{(t)} \mid y^{(t-1)}, \ldots, y^{(1)})$
         ■ $y^{(t)}$ can be any word from the vocabulary $V = \{w_1, \ldots, w_{|V|}\}$
       ○ The likelihood can be computed for a sentence:
         $P(y^{(1)}, y^{(2)}, \ldots, y^{(T)}) = \prod_{t=1}^{T} P(y^{(t)} \mid y^{(t-1)}, \ldots, y^{(1)})$
       ○ Trained in the many-to-many RNN setting
         ■ Transformers are more dominantly used these days
       ○ LM has many applications
         ■ Text generation: predict the most likely words
         ■ Speech recognition (acoustic model + LM)
     ● By replacing words with musical notes or MIDI events, we can build a "musical language model", which can be used for music generation and automatic music transcription
     [Figure: a language model using an RNN over word embeddings, predicting "am", "so", "full" from the inputs "I", "am", "so"]
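
     A minimal RNN language model sketch in PyTorch; the vocabulary size and dimensions are invented. Replacing word tokens with note or MIDI-event tokens yields the musical language model mentioned above.

     ```python
     import torch
     import torch.nn as nn

     class RNNLM(nn.Module):
         # Embed tokens, run an RNN, and score the next token at every step.
         def __init__(self, vocab_size=1000, emb=32, hid=64):
             super().__init__()
             self.emb = nn.Embedding(vocab_size, emb)
             self.rnn = nn.LSTM(emb, hid, batch_first=True)
             self.out = nn.Linear(hid, vocab_size)

         def forward(self, tokens):               # tokens: (batch, T)
             h, _ = self.rnn(self.emb(tokens))
             return self.out(h)                   # logits for the *next* token

     model = RNNLM()
     tokens = torch.randint(0, 1000, (4, 12))     # toy token sequences
     logits = model(tokens)                       # (4, 12, 1000)
     probs = logits.softmax(dim=-1)               # P(y(t+1) | y(1..t)) per prefix
     ```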

  19. Many-to-One RNN
     ● The input is a sequence and the output is a categorical label
       ○ The input sequence can have a variable length!
       ○ Use cases
         ■ Text classification: sentence to labels (positive/negative)
         ■ Music genre/mood/audio scene classification and tagging: audio to labels
         ■ Video scene classification: image frames to labels
     [Figure: text classification over word embeddings ("You", "are", "awesome" → "positive") and audio/video scene classification over audio/image embeddings (→ "park")]
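
     A many-to-one sketch: the final hidden state summarizes the whole sequence, and a linear head produces one label. All sizes, and the 10-class head, are illustrative.

     ```python
     import torch

     rnn = torch.nn.GRU(input_size=64, hidden_size=128, batch_first=True)
     head = torch.nn.Linear(128, 10)          # e.g., 10 genre/mood tags

     for T in (50, 87, 120):                  # variable-length inputs are fine
         x = torch.randn(1, T, 64)
         _, h_last = rnn(x)                   # (1, 1, 128): final hidden state
         logits = head(h_last.squeeze(0))     # one label per whole sequence
     ```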

  20. One-to-Many RNN
     ● The input is a single shot of data and the output is sequence data
       ○ This is regarded as a conditional generation model
       ○ Use cases
         ■ Image captioning: generate the text description of an image
         ■ Music playlist generation: playlist title (text) to a sequence of song embedding vectors
     [Figure: image captioning (image embedding, <start> → "The", "trees", "are", "yellow") and music playlist generation (text embedding of "Bossa Nova Jazz Best", <start> → track1, track2, track3)]
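
     A one-to-many sketch: a single conditioning vector (standing in for an image or playlist-title embedding) initializes the state, and the RNN then emits a sequence step by step, feeding each prediction back in. All names and sizes are hypothetical.

     ```python
     import torch
     import torch.nn as nn

     cond = torch.randn(1, 256)                 # one-shot conditioning input
     h = torch.tanh(nn.Linear(256, 128)(cond))  # map condition -> initial state

     cell = nn.GRUCell(32, 128)
     emit = nn.Linear(128, 32)                  # next-step output embedding
     step_in = torch.zeros(1, 32)               # stand-in for a <start> token
     outputs = []
     for _ in range(4):                         # generate a short sequence
         h = cell(step_in, h)
         step_in = emit(h)                      # feed the prediction back in
         outputs.append(step_in)
     ```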

  21. Many-to-One/One-to-Many (Seq2Seq)
     ● Both the input and the output are sequences
       ○ Assume that the input and output data are not aligned
       ○ Regarded as an encoder-decoder RNN framework
         ■ The encoder produces a compressed latent vector of the input sequence; the decoder becomes a conditional text generation model
       ○ Use cases
         ■ Machine translation: sentence to sentence (neural machine translation)
         ■ Speech recognition / note-level singing voice transcription: audio to text/notes
     ● Seq2Seq is a conditional language model!
     [Figure: encoder-decoder RNNs for machine translation ("I", "love", "you", <EOS> → "난", "너를", "사랑해", i.e., "I love you" in Korean) and for speech recognition / singing voice transcription]
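
     A bare-bones encoder-decoder sketch: the encoder's final state is the compressed latent vector, which conditions the decoder (teacher-forced here with target tokens, as in training). Vocabulary sizes and dimensions are invented.

     ```python
     import torch
     import torch.nn as nn

     emb_src = nn.Embedding(500, 32)           # source-token embeddings
     emb_tgt = nn.Embedding(600, 32)           # target-token embeddings
     encoder = nn.GRU(32, 128, batch_first=True)
     decoder = nn.GRU(32, 128, batch_first=True)
     out = nn.Linear(128, 600)

     src = torch.randint(0, 500, (1, 7))       # e.g., "I love you <EOS>"
     _, latent = encoder(emb_src(src))         # compressed latent vector

     tgt = torch.randint(0, 600, (1, 5))       # shifted target tokens (training)
     dec_h, _ = decoder(emb_tgt(tgt), latent)  # decoder conditioned on latent
     logits = out(dec_h)                       # (1, 5, 600): next-token scores
     ```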

  22. MIR Tasks Using RNN
     ● Many-to-many RNN
       ○ Vocal melody extraction
       ○ Polyphonic piano transcription
       ○ Beat tracking
       ○ Chord recognition
     ● Many-to-one RNN
       ○ Music auto-tagging
