GCT634: Musical Applications of Machine Learning Deep Learning: Part 2 Graduate School of Culture Technology, KAIST Juhan Nam
Outline
• Convolutional Neural Networks (CNN)
- Introduction
- Mechanics
- CNN for music classification
• Training Neural Network Models
- Preprocessing data
- Building a model
- Training
Convolutional Neural Network (CNN)
[Figure: LeNet-5 architecture (LeCun, 1998) — input 32x32; C1: feature maps 6@28x28; S2: feature maps 6@14x14; C3: feature maps 16@10x10; S4: feature maps 16@5x5; C5: layer 120; F6: layer 84; output 10 (convolutions, subsampling, full connections, Gaussian connections)]
• A neural network that contains convolutional layers and subsampling (or pooling) layers
- Local filters (weights) are convolved with the input or hidden layers and return feature maps
- The feature maps are sub-sampled (or pooled) to reduce their dimensionality
History
• Highly related to human visual recognition
- Receptive fields, simple/complex cells (Hubel and Wiesel, 1962)
- Neocognitron (Fukushima, 1980): an early computational model
- LeNet (LeCun, 1998): the first CNN model, applied to handwritten zip code recognition (Manassi, 2013)
History
• The breakthrough in image classification (2012)
- A CNN trained with 2 GPUs on 1.2M images for about a week
- ReLU (fast and non-saturating), dropout (regularization)
- ImageNet challenge: top-5 error of 15.3% (more than 10% lower than the runner-up)
- Opened the era of deep learning (Krizhevsky et al., 2012)
ImageNet Challenge
• 2010-11: hand-crafted features + classifiers
• 2012-2017: ConvNets
- 2012: AlexNet
- 2013: ZFNet
- 2014: VGGNet, InceptionNet
- 2015: ResNet
- 2016: Ensemble networks
- 2017: Squeeze-and-Excitation Net
(NIPS 2017 Tutorial – Deep Learning: Practice and Trends)
Hierarchical Representation Learning
• Learned features are similar to those in the human visual system
[Figure: trainable feature hierarchy — low-level features → mid-level features → high-level features → trainable classifier]
(Zeiler and Fergus, 2013; borrowed from LeCun's slides)
Convolutional Neural Networks
• ConvNets exploit these two properties
- Locality: objects tend to have local spatial support
- Translation invariance: object appearance is independent of location
- Example: a bird occupies a local area and looks the same in different parts of an image
(NIPS 2017 Tutorial – Deep Learning: Practice and Trends)
Convolutional Neural Networks
• ConvNets exploit these two properties
- Locality: objects tend to have local spatial support
- Translation invariance: object appearance is independent of location
• Counter-examples: face images (especially passport photos)
(from the MS-Celeb-1M dataset)
Convolutional Neural Networks • Locality and translation invariance appear in audio and text, too
Incorporating Assumptions: Locality
• Make the fully-connected layer locally-connected
• Each neuron is connected to a local area (receptive field)
• Different neurons are connected to different locations (feature map)
(NIPS 2017 Tutorial – Deep Learning: Practice and Trends)
Incorporating Assumptions: Translation Invariance
• Weight sharing: units connected to different locations have the same weights (filters)
• Convolutional layer: a locally-connected layer with weight sharing
• The weights are shared across locations, so the output is translation-equivariant: the same filter responds wherever the object appears (e.g. face recognition)
(NIPS 2017 Tutorial – Deep Learning: Practice and Trends)
Convolution Mechanics
• Image and feature map
- 3D tensor: width, height, and depth (channel)
- Input channels (3): R, G, B
- The filter must have the same depth as the input (2D convolution)
- The output channel dimension corresponds to the number of filters
- Note: they become 4D tensors when a batch or mini-batch is used
Convolution Mechanics
• Stride
- Sliding with hopping (equivalent to the hop size in STFT)
• Padding
- Zero-padding at the borders to adjust the feature map size or to handle striding (equivalent to zero-padding in STFT)
- Output size: (N - F)/S + 1 (N: input size, F: filter size, S: stride); zero-pad if (N - F)/S is not an integer
• Convolution animation
- https://github.com/vdumoulin/conv_arithmetic
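The output-size formula above can be checked with a small helper. As a sketch, it also includes padding P as a parameter (the padded generalization floor((N - F + 2P)/S) + 1 is an assumption beyond the slide's P = 0 formula):

```python
def conv_output_size(n, f, s=1, p=0):
    """Feature-map size for input N, filter F, stride S, padding P:
    floor((N - F + 2P) / S) + 1, i.e. the slide's (N - F)/S + 1 when P = 0."""
    return (n - f + 2 * p) // s + 1

conv_output_size(32, 5, s=1, p=2)    # 32: F=5, P=2 preserves the input size
conv_output_size(7, 3, s=2)          # 3: stride 2 roughly halves the size
conv_output_size(1366, 3, s=1, p=1)  # 1366: the common F=3, S=1, P=1 setting
```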
Convolution Mechanics
• Common settings: K = powers of 2 (e.g. 32, 64, 128, 512)
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0
(Stanford CS231n slides)
Sub-Sampling (or Pooling)
• Summarizes the feature map into a smaller feature map
• The core mechanism that makes the features translation-invariant
• Types
- Max pooling: the most popular choice
- Average pooling
- Standard-deviation pooling
• Example: 2x2 max pooling with stride 2
  1 5 2 4
  2 3 9 1        5 9
  5 3 3 4   →    8 8
  8 4 7 8
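The 2x2 max-pooling example above can be reproduced in a few lines of NumPy. This is a minimal sketch; `max_pool_2x2` is a hypothetical helper that assumes the height and width are even:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keep the max of each 2x2 block."""
    h, w = x.shape
    # Split the map into 2x2 blocks, then take the max within each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 5, 2, 4],
                 [2, 3, 9, 1],
                 [5, 3, 3, 4],
                 [8, 4, 7, 8]])
max_pool_2x2(fmap)  # → [[5, 9], [8, 8]]
```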
ConvNet Demo: Image Classification
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
Designing CNN for Music Classification
• Input data
- Spectrogram
- Log-spectrogram: mel or constant-Q
- Raw waveforms
• CNN structure
- 1D CNN
- 2D CNN
- Sample-CNN
1D CNN
• Assumes locality and translation invariance only on the time axis
- The filters cover the whole frequency range (1D feature maps)
- The first fully-connected layer takes globally pooled features
[Figure: spectrogram (frequency × time) → convolution and pooling layers over time → fully-connected layers → output]
1D CNN
• Assumes locality and translation invariance only on the time axis
- The filters cover the whole frequency range (1D feature maps)
- The first fully-connected layer takes globally pooled features
• Another view: the frequency axis can be treated as the channel dimension of a 1D convolution over time
1D CNN: Example • Dieleman (2014) - http://benanne.github.io/2014/08/05/spotify-cnns.html
1D CNN
• Advantages
- The 1D feature maps significantly reduce the number of parameters (compared to 2D feature maps)
- Fast to train
- Works well on small datasets
• Disadvantages
- Not invariant to pitch shifting
- A key transposition changes the feature activations and thus the results
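A 1D CNN of this kind can be sketched in PyTorch. The mel bins act as input channels of `Conv1d`, so convolution and pooling happen along time only; the layer sizes and the input length here are illustrative assumptions, not the exact model from the slides:

```python
import torch
import torch.nn as nn

class Music1DCNN(nn.Module):
    """Hypothetical 1D CNN for multi-label music tagging."""
    def __init__(self, n_mels=128, n_classes=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=4),  # filters span all mel bins
            nn.ReLU(),
            nn.MaxPool1d(4),                        # pooling over time only
            nn.Conv1d(256, 256, kernel_size=4),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                # global pooling over time
        )
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, x):                           # x: (batch, n_mels, time)
        h = self.features(x).squeeze(-1)
        return torch.sigmoid(self.classifier(h))    # per-tag probabilities

model = Music1DCNN().eval()
with torch.no_grad():
    tags = model(torch.randn(2, 128, 646))  # 646 mel frames (hypothetical clip length)
```

Because of the global pooling, the model accepts clips of varying length, which is one practical benefit of the 1D design.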
2D CNN
• Assumes translation invariance on both time and frequency
- The filters cover a time-frequency patch (typically 3x3)
- A log-frequency spectrogram (mel or constant-Q) is required as input, so that a pitch shift becomes a translation along the frequency axis
2D CNN: Example
• Choi et al. (2016): VGGNet-style FCN-4
- Mel-spectrogram input: 96 × 1366 × 1
- Conv 3×3×128, MP (2, 4) → output: 48 × 341 × 128
- Conv 3×3×384, MP (4, 5) → output: 24 × 85 × 384
- Conv 3×3×768, MP (3, 8) → output: 12 × 21 × 768
- Conv 3×3×2048, MP (4, 8) → output: 1 × 1 × 2048
- Output: 50 × 1 (sigmoid)
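A VGG-style fully-convolutional tagger in this spirit can be sketched in PyTorch. The channel plan follows the slide (128/384/768/2048), but the pooling sizes below are illustrative assumptions, and a final adaptive pooling guarantees a 1 × 1 map regardless of the exact input size; this is not the exact FCN-4 configuration:

```python
import torch
import torch.nn as nn

# Hedged sketch of a VGG-style 2D CNN for tagging a 96 x 1366 mel-spectrogram.
model = nn.Sequential(
    nn.Conv2d(1, 128, 3, padding=1),   nn.ReLU(), nn.MaxPool2d((2, 4)),
    nn.Conv2d(128, 384, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),
    nn.Conv2d(384, 768, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),
    nn.Conv2d(768, 2048, 3, padding=1), nn.ReLU(),
    nn.AdaptiveMaxPool2d(1),           # collapse time-frequency to 1 x 1
    nn.Flatten(),
    nn.Linear(2048, 50), nn.Sigmoid(), # 50 tags, multi-label output
).eval()

with torch.no_grad():
    tags = model(torch.randn(1, 1, 96, 1366))  # (batch, channel, mel, time)
```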
2D CNN
• Advantages
- Relatively invariant to pitch shifting
- Learns more general features in the bottom layers
- Can exploit advanced techniques from image classification
• Disadvantages
- The 2D feature maps significantly increase the number of parameters (compared to 1D feature maps)
- Requires a large-scale dataset and accordingly more computational resources (e.g. GPU and memory)
Sample-CNN
• End-to-end model that takes raw waveforms directly
- The receptive field can vary from frame-level (e.g. 256 samples) to sample-level (e.g. 2 or 3 samples)
- The CNN must be sufficiently deep to learn the variations within a frame
[Figure: a large-filter, large-stride Conv1D front end, followed by stacked 1D convolutional blocks (Conv1D → BatchNorm → ReLU → MaxPool, with dropout and scaling via FC layers and global average/max pooling), multi-level global max pooling, fully-connected layers, and a sigmoid output]
Sample-CNN: Example
• Lee et al. (2017): the 3^9 model takes 59049 samples (2678 ms; 19683 frames) as input
- Short filters work better than long ones
• Sample-level raw waveform model (the first conv is the sample-level strided convolution layer):
layer        stride  output        # of params
conv 3-128   3       19683 × 128   512
conv 3-128   1       19683 × 128   49280
maxpool 3    3       6561 × 128
conv 3-128   1       6561 × 128    49280
maxpool 3    3       2187 × 128
conv 3-256   1       2187 × 256    98560
maxpool 3    3       729 × 256
conv 3-256   1       729 × 256     196864
maxpool 3    3       243 × 256
conv 3-256   1       243 × 256     196864
maxpool 3    3       81 × 256
conv 3-256   1       81 × 256      196864
maxpool 3    3       27 × 256
conv 3-256   1       27 × 256      196864
maxpool 3    3       9 × 256
conv 3-256   1       9 × 256       196864
maxpool 3    3       3 × 256
conv 3-512   1       3 × 512       393728
maxpool 3    3       1 × 512
conv 1-512   1       1 × 512       262656
dropout 0.5  −       1 × 512       −
sigmoid      −       50            25650
Total params: ~1.9 × 10^6
(Lee et al., 2017, Proceedings of the 14th Sound and Music Computing Conference, July 5-8, Espoo, Finland)
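The table above can be sketched in PyTorch: a strided sample-level convolution turns 59049 raw samples into 19683 frames, and nine conv(3)-BN-ReLU-maxpool(3) blocks divide the length by 3^9 down to 1. The structure follows the table, but details such as batch normalization placement are assumptions:

```python
import torch
import torch.nn as nn

def sample_cnn(n_classes=50):
    """Hedged sketch of the 3^9 sample-level CNN of Lee et al. (2017)."""
    channels = [128, 128, 256, 256, 256, 256, 256, 256, 512]  # per the table
    layers = [nn.Conv1d(1, 128, kernel_size=3, stride=3),  # 59049 -> 19683 frames
              nn.BatchNorm1d(128), nn.ReLU()]
    in_ch = 128
    for out_ch in channels:
        layers += [nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.BatchNorm1d(out_ch), nn.ReLU(),
                   nn.MaxPool1d(3)]                        # length / 3 per block
        in_ch = out_ch
    layers += [nn.Conv1d(512, 512, kernel_size=1),         # conv 1-512
               nn.Dropout(0.5), nn.Flatten(),
               nn.Linear(512, n_classes), nn.Sigmoid()]    # 50 tags
    return nn.Sequential(*layers)

model = sample_cnn().eval()
with torch.no_grad():
    out = model(torch.randn(1, 1, 59049))  # (batch, 1 channel, 59049 samples)
```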
Sample-CNN
• Advantages
- No need to tune STFT and log-scale parameters (the (sub-)optimal parameters differ across datasets and tasks)
- No need to store preprocessed spectrograms
• Disadvantages
- More parameters and memory
- Slow to train
Training Neural Network Models
• Preprocessing data
- Data augmentation
- Normalization
• Building a model
- CNN structure: 1D or 2D, filter size/number, pooling size, …
- Loss function
- Batch normalization
- Dropout, weight decay
- Weight initialization
• Training
- Loss optimization and monitoring
- Early stopping
- Hyper-parameter optimization
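The training steps above can be sketched as a minimal PyTorch loop: normalize the data, minimize a loss with an optimizer, monitor the validation loss, and stop early when it stops improving. The model, data, and hyper-parameters here are toy stand-ins, not a recipe from the slides:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 20)                      # toy features
y = (X.sum(dim=1, keepdim=True) > 0).float()  # toy binary labels
X = (X - X.mean(0)) / (X.std(0) + 1e-8)       # normalization (zero mean, unit var)
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

best, patience, wait = float("inf"), 5, 0
for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X_tr), y_tr)          # training loss
    loss.backward()
    opt.step()
    with torch.no_grad():
        val = loss_fn(model(X_va), y_va).item()  # monitor validation loss
    if val < best - 1e-4:
        best, wait = val, 0                    # improvement: reset patience
    else:
        wait += 1
        if wait >= patience:                   # early stopping
            break
```

In practice the same skeleton applies to the CNN models above, with mini-batches from a data loader and tag-wise binary cross-entropy.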