GCT634: Musical Applications of Machine Learning Deep Learning: Part 2 Graduate School of Culture Technology, KAIST Juhan Nam
Outline
• Convolutional Neural Networks (CNN)
- Introduction
- Mechanics
- CNN for music classification
• Training Neural Network Models
- Preprocessing data
- Building a model
- Training
Convolutional Neural Network (CNN)
[Figure: LeNet-5 architecture (LeCun, 1998) — input 32x32; C1: feature maps 6@28x28; S2: feature maps 6@14x14; C3: feature maps 16@10x10; S4: feature maps 16@5x5; C5: layer 120; F6: layer 84; output 10 (convolutions, subsampling, full connections, Gaussian connections)]
• A neural network that contains convolutional layers and subsampling (or pooling) layers
- Local filters (weights) are convolved with the input or hidden layers and return feature maps
- The feature maps are sub-sampled (or pooled) to reduce their dimensionality
History
• Highly related to human visual recognition
- Receptive fields, simple/complex cells (Hubel and Wiesel, 1962)
- Neocognitron (Fukushima, 1980): an early computational model
- LeNet (LeCun, 1998): the first CNN model, applied to handwritten zip code recognition (Manassi, 2013)
History
• The breakthrough in image classification (2012)
- A CNN trained with 2 GPUs on 1.2M images for about a week
- ReLU (fast and non-saturating), dropout (regularization)
- ImageNet challenge: top-5 error of 15.3% (more than 10% lower than the runner-up)
- Opened the era of deep learning (Krizhevsky et al., 2012)
ImageNet Challenge
• 2010-11: hand-crafted features + classifiers
• 2012-2017: ConvNets
- 2012: AlexNet
- 2013: ZFNet
- 2014: VGGNet, InceptionNet
- 2015: ResNet
- 2016: Ensemble networks
- 2017: Squeeze-and-Excitation Net
(NIPS 2017 Tutorial – Deep Learning: Practice and Trends)
Hierarchical Representation Learning
• Learned features are similar to those in the human visual system
[Figure: trainable feature hierarchy — low-level features → mid-level features → high-level features → trainable classifier]
(Zeiler and Fergus, 2013; borrowed from LeCun's slides)
Convolutional Neural Networks
• ConvNets exploit these two properties
- Locality: objects tend to have local spatial support
- Translation invariance: object appearance is independent of location
- Example: a bird occupies a local area and looks the same in different parts of an image
(NIPS 2017 Tutorial – Deep Learning: Practice and Trends)
Convolutional Neural Networks
• ConvNets exploit these two properties
- Locality: objects tend to have local spatial support
- Translation invariance: object appearance is independent of location
• Counter-examples: face images (especially passport photos)
(from the MS-Celeb-1M dataset)
Convolutional Neural Networks • Locality and translation invariance appear in audio and text, too
Incorporating Assumptions: Locality
• Make the fully-connected layer locally-connected
• Each neuron is connected to a local area (receptive field)
• Different neurons are connected to different locations (feature map)
(NIPS 2017 Tutorial – Deep Learning: Practice and Trends)
Incorporating Assumptions: Translation Invariance
• Weight sharing: units connected to different locations have the same weights (filters)
• Convolutional layer: a locally-connected layer with weight sharing
• The weights are shared across locations, so the output is translation-equivariant: the same filter responds wherever the object appears (e.g. face recognition)
(NIPS 2017 Tutorial – Deep Learning: Practice and Trends)
Convolution Mechanics
• Image and feature map
- 3D tensor: width, height, and depth (channel)
- Input channels (3): R, G, B
- The filter must have the same depth as the input (2D convolution)
- The output channel dimension corresponds to the number of filters
- Note: they become 4D tensors when a batch or mini-batch is used
Convolution Mechanics
• Stride
- Sliding with hopping (equivalent to the hop size in STFT)
• Padding
- Zero-padding at the borders to adjust the feature map size or to handle striding (equivalent to zero-padding in STFT)
- Output size: (N - F)/S + 1 (N: input size, F: filter size, S: stride); zero-pad if (N - F)/S is not an integer
• Convolution animation
- https://github.com/vdumoulin/conv_arithmetic
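The output-size formula above can be checked with a small helper. As a sketch, it also includes padding P as a parameter (the padded generalization floor((N - F + 2P)/S) + 1 is an assumption beyond the slide's P = 0 formula):

```python
def conv_output_size(n, f, s=1, p=0):
    """Feature-map size for input N, filter F, stride S, padding P:
    floor((N - F + 2P) / S) + 1, i.e. the slide's (N - F)/S + 1 when P = 0."""
    return (n - f + 2 * p) // s + 1

conv_output_size(32, 5, s=1, p=2)    # 32: F=5, P=2 preserves the input size
conv_output_size(7, 3, s=2)          # 3: stride 2 roughly halves the size
conv_output_size(1366, 3, s=1, p=1)  # 1366: the common F=3, S=1, P=1 setting
```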
Convolution Mechanics
• Common settings: K = powers of 2 (e.g. 32, 64, 128, 512)
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0
(Stanford CS231n slides)
Sub-Sampling (or Pooling)
• Summarizes the feature map into a smaller feature map
• The core mechanism that makes the features translation-invariant
• Types
- Max pooling: the most popular choice
- Average pooling
- Standard-deviation pooling
• Example: 2x2 max pooling with stride 2
  1 5 2 4
  2 3 9 1        5 9
  5 3 3 4   →    8 8
  8 4 7 8
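The 2x2 max-pooling example above can be reproduced in a few lines of NumPy. This is a minimal sketch; `max_pool_2x2` is a hypothetical helper that assumes the height and width are even:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keep the max of each 2x2 block."""
    h, w = x.shape
    # Split the map into 2x2 blocks, then take the max within each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 5, 2, 4],
                 [2, 3, 9, 1],
                 [5, 3, 3, 4],
                 [8, 4, 7, 8]])
max_pool_2x2(fmap)  # → [[5, 9], [8, 8]]
```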
ConvNet Demo: Image Classification
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
Designing CNN for Music Classification
• Input data
- Spectrogram
- Log-spectrogram: mel or constant-Q
- Raw waveforms
• CNN structure
- 1D CNN
- 2D CNN
- Sample-CNN
1D CNN
• Assumes locality and translation invariance only on the time axis
- The filters cover the whole frequency range (1D feature maps)
- The first fully-connected layer takes globally pooled features
[Figure: spectrogram (frequency × time) → convolution and pooling layers over time → fully-connected layers → output]
1D CNN
• Assumes locality and translation invariance only on the time axis
- The filters cover the whole frequency range (1D feature maps)
- The first fully-connected layer takes globally pooled features
• Another view: the frequency axis can be treated as the channel dimension of a 1D convolution over time
1D CNN: Example • Dieleman (2014) - http://benanne.github.io/2014/08/05/spotify-cnns.html
1D CNN
• Advantages
- The 1D feature maps significantly reduce the number of parameters (compared to 2D feature maps)
- Fast to train
- Works well on small datasets
• Disadvantages
- Not invariant to pitch shifting
- A key transposition changes the feature activations and thus the results
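A 1D CNN of this kind can be sketched in PyTorch. The mel bins act as input channels of `Conv1d`, so convolution and pooling happen along time only; the layer sizes and the input length here are illustrative assumptions, not the exact model from the slides:

```python
import torch
import torch.nn as nn

class Music1DCNN(nn.Module):
    """Hypothetical 1D CNN for multi-label music tagging."""
    def __init__(self, n_mels=128, n_classes=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=4),  # filters span all mel bins
            nn.ReLU(),
            nn.MaxPool1d(4),                        # pooling over time only
            nn.Conv1d(256, 256, kernel_size=4),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                # global pooling over time
        )
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, x):                           # x: (batch, n_mels, time)
        h = self.features(x).squeeze(-1)
        return torch.sigmoid(self.classifier(h))    # per-tag probabilities

model = Music1DCNN().eval()
with torch.no_grad():
    tags = model(torch.randn(2, 128, 646))  # 646 mel frames (hypothetical clip length)
```

Because of the global pooling, the model accepts clips of varying length, which is one practical benefit of the 1D design.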
2D CNN
• Assumes translation invariance on both time and frequency
- The filters cover a time-frequency patch (typically 3x3)
- A log-frequency spectrogram (mel or constant-Q) is required as input, so that a pitch shift becomes a translation along the frequency axis
2D CNN: Example
• Choi et al. (2016): VGGNet-style FCN-4
- Mel-spectrogram input: 96 × 1366 × 1
- Conv 3×3×128, MP (2, 4) → output: 48 × 341 × 128
- Conv 3×3×384, MP (4, 5) → output: 24 × 85 × 384
- Conv 3×3×768, MP (3, 8) → output: 12 × 21 × 768
- Conv 3×3×2048, MP (4, 8) → output: 1 × 1 × 2048
- Output: 50 × 1 (sigmoid)
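A VGG-style fully-convolutional tagger in this spirit can be sketched in PyTorch. The channel plan follows the slide (128/384/768/2048), but the pooling sizes below are illustrative assumptions, and a final adaptive pooling guarantees a 1 × 1 map regardless of the exact input size; this is not the exact FCN-4 configuration:

```python
import torch
import torch.nn as nn

# Hedged sketch of a VGG-style 2D CNN for tagging a 96 x 1366 mel-spectrogram.
model = nn.Sequential(
    nn.Conv2d(1, 128, 3, padding=1),   nn.ReLU(), nn.MaxPool2d((2, 4)),
    nn.Conv2d(128, 384, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),
    nn.Conv2d(384, 768, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),
    nn.Conv2d(768, 2048, 3, padding=1), nn.ReLU(),
    nn.AdaptiveMaxPool2d(1),           # collapse time-frequency to 1 x 1
    nn.Flatten(),
    nn.Linear(2048, 50), nn.Sigmoid(), # 50 tags, multi-label output
).eval()

with torch.no_grad():
    tags = model(torch.randn(1, 1, 96, 1366))  # (batch, channel, mel, time)
```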
2D CNN
• Advantages
- Relatively invariant to pitch shifting
- Learns more general features in the bottom layers
- Can exploit advanced techniques from image classification
• Disadvantages
- The 2D feature maps significantly increase the number of parameters (compared to 1D feature maps)
- Requires a large-scale dataset and accordingly more computational resources (e.g. GPU and memory)
Sample-CNN
• End-to-end model that takes raw waveforms directly
- The receptive field can vary from frame-level (e.g. 256 samples) to sample-level (e.g. 2 or 3 samples)
- The CNN must be sufficiently deep to learn the variations within a frame
[Figure: a large-filter, large-stride Conv1D front end, followed by stacked 1D convolutional blocks (Conv1D → BatchNorm → ReLU → MaxPool, with dropout and scaling via FC layers and global average/max pooling), multi-level global max pooling, fully-connected layers, and a sigmoid output]
Sample-CNN: Example
• Lee et al. (2017): the 3^9 model takes 59049 samples (2678 ms; 19683 frames) as input
- Short filters work better than long ones
• Sample-level raw waveform model (the first conv is the sample-level strided convolution layer):
layer        stride  output        # of params
conv 3-128   3       19683 × 128   512
conv 3-128   1       19683 × 128   49280
maxpool 3    3       6561 × 128
conv 3-128   1       6561 × 128    49280
maxpool 3    3       2187 × 128
conv 3-256   1       2187 × 256    98560
maxpool 3    3       729 × 256
conv 3-256   1       729 × 256     196864
maxpool 3    3       243 × 256
conv 3-256   1       243 × 256     196864
maxpool 3    3       81 × 256
conv 3-256   1       81 × 256      196864
maxpool 3    3       27 × 256
conv 3-256   1       27 × 256      196864
maxpool 3    3       9 × 256
conv 3-256   1       9 × 256       196864
maxpool 3    3       3 × 256
conv 3-512   1       3 × 512       393728
maxpool 3    3       1 × 512
conv 1-512   1       1 × 512       262656
dropout 0.5  −       1 × 512       −
sigmoid      −       50            25650
Total params: ~1.9 × 10^6
(Lee et al., 2017, Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland)
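The table above can be sketched in PyTorch: a strided sample-level convolution turns 59049 raw samples into 19683 frames, and nine conv(3)-BN-ReLU-maxpool(3) blocks divide the length by 3^9 down to 1. The structure follows the table, but details such as batch normalization placement are assumptions:

```python
import torch
import torch.nn as nn

def sample_cnn(n_classes=50):
    """Hedged sketch of the 3^9 sample-level CNN of Lee et al. (2017)."""
    channels = [128, 128, 256, 256, 256, 256, 256, 256, 512]  # per the table
    layers = [nn.Conv1d(1, 128, kernel_size=3, stride=3),  # 59049 -> 19683 frames
              nn.BatchNorm1d(128), nn.ReLU()]
    in_ch = 128
    for out_ch in channels:
        layers += [nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.BatchNorm1d(out_ch), nn.ReLU(),
                   nn.MaxPool1d(3)]                        # length / 3 per block
        in_ch = out_ch
    layers += [nn.Conv1d(512, 512, kernel_size=1),         # conv 1-512
               nn.Dropout(0.5), nn.Flatten(),
               nn.Linear(512, n_classes), nn.Sigmoid()]    # 50 tags
    return nn.Sequential(*layers)

model = sample_cnn().eval()
with torch.no_grad():
    out = model(torch.randn(1, 1, 59049))  # (batch, 1 channel, 59049 samples)
```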
Sample-CNN
• Advantages
- No need to tune STFT and log-scale parameters (the (sub-)optimal parameters differ across datasets and tasks)
- No need to store preprocessed spectrograms
• Disadvantages
- More parameters and memory
- Slow to train
Training Neural Network Models
• Preprocessing data
- Data augmentation
- Normalization
• Building a model
- CNN structure: 1D or 2D, filter size/number, pooling size, …
- Loss function
- Batch normalization
- Dropout, weight decay
- Weight initialization
• Training
- Loss optimization and monitoring
- Early stopping
- Hyper-parameter optimization
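The training steps above can be sketched as a minimal PyTorch loop: normalize the data, minimize a loss with an optimizer, monitor the validation loss, and stop early when it stops improving. The model, data, and hyper-parameters here are toy stand-ins, not a recipe from the slides:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 20)                      # toy features
y = (X.sum(dim=1, keepdim=True) > 0).float()  # toy binary labels
X = (X - X.mean(0)) / (X.std(0) + 1e-8)       # normalization (zero mean, unit var)
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

best, patience, wait = float("inf"), 5, 0
for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X_tr), y_tr)          # training loss
    loss.backward()
    opt.step()
    with torch.no_grad():
        val = loss_fn(model(X_va), y_va).item()  # monitor validation loss
    if val < best - 1e-4:
        best, wait = val, 0                    # improvement: reset patience
    else:
        wait += 1
        if wait >= patience:                   # early stopping
            break
```

In practice the same skeleton applies to the CNN models above, with mini-batches from a data loader and tag-wise binary cross-entropy.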