

  1. GCT634/AI613: Musical Applications of Machine Learning (Fall 2020) Deep Learning: Intro Juhan Nam

  2. Review of Traditional Machine Learning
  ● The traditional machine learning pipeline: Frame-level Features → Unsupervised Learning → Temporal Summary → Classifier

  3. Review of Traditional Machine Learning
  ● The traditional machine learning pipeline, with example modules:
    ○ Frame-level Features: MFCC = DFT → Abs (magnitude) → Mel Filterbank → Log compression → DCT
    ○ Unsupervised Learning: K-means (non-linear transform)
    ○ Temporal Summary: Temporal Pooling
    ○ Classifier: Logistic Regression (linear classifier)
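
A minimal sketch of how such a pipeline could be assembled with librosa and scikit-learn; the audio file name, MFCC/codebook sizes, and the two-clip toy training set are hypothetical placeholders, not from the lecture:

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Frame-level features: MFCC = DFT -> abs (magnitude) -> mel filterbank -> log -> DCT
y, sr = librosa.load("clip.wav")                      # hypothetical audio clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)    # shape: (20, num_frames)

# Unsupervised learning: K-means codebook over the frame-level features
kmeans = KMeans(n_clusters=16, random_state=0).fit(mfcc.T)
assignments = kmeans.predict(mfcc.T)                  # nearest codeword per frame

# Temporal summary: pool frame-wise codeword assignments into one clip-level vector
hist = np.bincount(assignments, minlength=16)
song_feature = (hist / hist.sum()).reshape(1, -1)

# Classifier: logistic regression over clip-level features
X_train = np.vstack([song_feature, song_feature + 0.1])   # hypothetical 2-clip training set
y_train = np.array([0, 1])                                 # hypothetical labels
clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict(song_feature))
```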

  4. Review of Traditional Machine Learning
  ● Each module can be replaced with a chain of linear transforms and non-linear functions
    ○ The pipeline becomes: Linear Transform → Non-linear function → Linear Transform → Non-linear function → ⋯ → Linear Transform → Temporal Pooling → Linear Classifier

  5. Review of Traditional Machine Learning
  ● The entire set of modules can be replaced with one long chain of linear transforms and non-linear functions (i.e. a deep neural network)
    ○ In the traditional machine learning pipeline, each module is optimized locally

  6. Deep Learning
  ● All of the blocks (or layers) are optimized in an end-to-end manner
    ○ The parameters (or weights) in all layers are learned to minimize the loss function of the classifier
    ○ The loss is back-propagated through all layers (from right to left) as a gradient with respect to each parameter ("error back-propagation")
  ● Therefore, we "learn features" instead of designing or engineering them
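
As a minimal PyTorch sketch of error back-propagation (the layer sizes and random data are made up for illustration), a single backward pass fills in a loss gradient for every parameter in every layer:

```python
import torch
import torch.nn as nn

# A short chain of linear transforms and non-linear functions
model = nn.Sequential(nn.Linear(40, 32), nn.ReLU(), nn.Linear(32, 10))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 40)                # dummy batch of 8 feature vectors
target = torch.randint(0, 10, (8,))   # dummy class labels

loss = loss_fn(model(x), target)
loss.backward()                       # back-propagate the loss from right to left

# Every layer's parameters now hold a gradient of the loss w.r.t. themselves
for name, p in model.named_parameters():
    print(name, p.grad.shape)
```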

  7. Deep Learning: Building Models
  ● There are many choices of basic building blocks (or layers)
  ● Connectivity patterns (parametric)
    ○ Fully-connected (i.e. linear transform)
    ○ Convolutional (note that the STFT is a convolutional operation)
    ○ Skip / Residual
    ○ Recurrent
  ● Nonlinearity functions (non-parametric)
    ○ Sigmoid
    ○ Tanh
    ○ Rectified Linear Units (ReLU) and variations
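
As an illustrative sketch (the channel counts, kernel size, and pooling choice are arbitrary, and a recurrent layer such as nn.GRU could be added in the same way), the blocks listed above map onto standard PyTorch modules:

```python
import torch
import torch.nn as nn

class TinyBlockDemo(nn.Module):
    """Combines the building blocks above for a (batch, 1, time) input."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 16, kernel_size=9, padding=4)  # convolutional
        self.relu = nn.ReLU()                                   # non-linearity
        self.proj = nn.Conv1d(16, 16, kernel_size=1)
        self.fc = nn.Linear(16, 10)                             # fully-connected

    def forward(self, x):
        h = self.relu(self.conv(x))
        h = h + self.proj(h)          # skip / residual connection
        h = h.mean(dim=-1)            # simple temporal pooling over time
        return self.fc(h)

out = TinyBlockDemo()(torch.randn(4, 1, 1000))
print(out.shape)                      # torch.Size([4, 10])
```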

  8. Deep Learning: Building Models
  ● We "design" a deep neural network architecture depending on the nature of the data and the task
    ○ Modular synth as a "musical analogy"
  (Figures: excerpts from the Arturia Modular V manual showing an oscillator bank with 1 "driver" and 3 "slave" oscillators, the "slave" oscillators reusable as LFOs at low frequencies, portamento/glide controls, and envelope/modulation patching)

  9. Deep Learning: Training Models
  ● Loss functions
    ○ Cross entropy (logistic loss)
    ○ Hinge loss
    ○ Maximum likelihood
    ○ L2 (root mean square) and L1
    ○ Adversarial
    ○ Variational
  ● Optimizers
    ○ SGD
    ○ Momentum
    ○ RMSProp
    ○ Adagrad
    ○ Adam
  ● Hyperparameters (initialization, regularization, model search)
    ○ Weight initialization
    ○ L1 and L2 (weight decay)
    ○ Dropout
    ○ Learning rate
    ○ Layer size
    ○ Batch size
    ○ Data augmentation
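
A minimal training-step sketch showing where these choices plug in (the model, learning rate, weight decay, dropout rate, batch size, and random data are placeholders to be tuned, not recommendations from the lecture):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(),
                      nn.Dropout(p=0.5),             # regularization: dropout
                      nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()                      # loss: cross entropy (logistic loss)
optimizer = torch.optim.Adam(model.parameters(),     # optimizer: Adam (could be SGD, RMSProp, ...)
                             lr=1e-3,                # hyperparameter: learning rate
                             weight_decay=1e-4)      # hyperparameter: L2 weight decay

for step in range(100):
    x = torch.randn(32, 40)                          # hyperparameter: batch size (32); random data here
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```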

  10. Multi-Layer Perceptron (MLP)
  ● Neural networks that consist of fully-connected layers and non-linear functions
    ○ Also called Feedforward Neural Network or Deep Feedforward Network
    ○ A long history: perceptron (Rosenblatt, 1962), back-propagation (Rumelhart, 1986), deep belief networks (Hinton and Salakhutdinov, 2006)
  ● Forward pass from the input layer x through hidden layers h^(1), h^(2), h^(3) to the output layer z:
    a^(1) = W^(1) x + c^(1),    h^(1) = g(a^(1))
    a^(2) = W^(2) h^(1) + c^(2),    h^(2) = g(a^(2))
    a^(3) = W^(3) h^(2) + c^(3),    h^(3) = g(a^(3))
    z = W^(4) h^(3) + c^(4)
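
A NumPy sketch of the forward pass above; the layer sizes and the choice of g (tanh here) are arbitrary:

```python
import numpy as np

def g(a):                        # element-wise non-linear function
    return np.tanh(a)

sizes = [40, 64, 64, 64, 10]     # x, h1, h2, h3, z
rng = np.random.default_rng(0)
W = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
c = [np.zeros(m) for m in sizes[1:]]

x = rng.standard_normal(40)      # input vector
h = x
for l in range(3):               # hidden layers: a = W h + c, h = g(a)
    h = g(W[l] @ h + c[l])
z = W[3] @ h + c[3]              # output layer: linear
print(z.shape)                   # (10,)
```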

  11. Deep Feedforward Network
  ● It is argued that the first breakthrough of deep learning came from the deep feedforward network (2011)
    ○ The state-of-the-art acoustic model in speech recognition was the GMM-HMM
    ○ Replace the GMM module with a deep feedforward network (up to 5 layers)
    ○ Initialize the weight matrices using an unsupervised learning algorithm
      ■ Deep belief network: greedy layer-wise pre-training using restricted Boltzmann machines
  ● Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, George Dahl, Dong Yu, Li Deng, Alex Acero, 2012

  12. Non-linear Functions
  ● There are several choices of non-linear functions (or activation functions)
    ○ ReLU is the default choice in modern deep learning: fast and effective
    ○ There are also other choices such as the Exponential Linear Unit (ELU) and Maxout
    ○ Note that this is an element-wise operation in the neural network
  ● Sigmoid: σ(x) = 1 / (1 + e^(-x))
  ● Tanh: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
  ● ReLU: max(0, x)
  ● Leaky ReLU: max(0.1x, x)
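
The four functions written out as element-wise NumPy code (a plain sketch, not the lecture's notation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)             # (e^x - e^-x) / (e^x + e^-x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x):
    return np.maximum(0.1 * x, x)

x = np.linspace(-10, 10, 5)       # [-10, -5, 0, 5, 10]
print(relu(x))                    # [ 0.  0.  0.  5. 10.]
```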

  13. Why the Nonlinear Function in the Hidden Layer?
  ● They capture high-order interactions between the input elements
    ○ This enables finding non-linear boundaries between different classes
    ○ Taylor series of a nonlinear function g(a): g(a) = b0 + b1 a + b2 a^2 + b3 a^3 + ⋯
    ○ The non-zero coefficients of the high-order terms introduce interactions between all input elements
  ● Example: for a = w1 x1 + w2 x2 + c,
    a^2 = w1^2 x1^2 + 2 w1 w2 x1 x2 + w2^2 x2^2 + 2 w1 x1 c + 2 w2 x2 c + c^2
  (Figure: a linear decision boundary a = 0 vs. a curved boundary b0 + b1 a + b2 a^2 = 0 in the input space)
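
The squared term can be expanded symbolically to make the cross terms explicit; a small sympy check (the symbol names follow the slide):

```python
import sympy as sp

w1, w2, x1, x2, c = sp.symbols("w1 w2 x1 x2 c")
a = w1*x1 + w2*x2 + c         # pre-activation of a single unit
expanded = sp.expand(a**2)    # quadratic term of the Taylor series
print(expanded)               # includes the interaction term 2*w1*w2*x1*x2
```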

  14. Why the Nonlinear Function in the Hidden Layer?
  ● What if the nonlinear functions are absent?
    ○ A product of linear transforms is just another linear transform
    ○ Geometrically, a linear transformation only does scaling, shearing and rotation
  (source: http://www.ams.org/publicoutreach/feature-column/fcarc-svd)
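
A quick NumPy check of the first point (the shapes are arbitrary): stacking two linear layers without a nonlinearity collapses into a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)               # input vector
W1 = rng.standard_normal((16, 8))        # first linear layer
W2 = rng.standard_normal((4, 16))        # second linear layer

z_stacked = W2 @ (W1 @ x)                # two layers, no nonlinearity in between
z_single = (W2 @ W1) @ x                 # one equivalent linear layer
print(np.allclose(z_stacked, z_single))  # True
```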
