GCT634/AI613: Musical Applications of Machine Learning (Fall 2020) CNN and Musical Applications Juhan Nam
Motivation
● Sensory data (image or audio) are high-dimensional
○ Image: 256 x 256 pixels (a commonly used size after crop and resize)
■ The average image resolution on ImageNet is 469 x 387 pixels
○ Audio: 128 mel bins x 128 frames (a commonly used 3-second mel-spectrogram)
■ 44,100 or 22,050 samples/sec
● A fully-connected layer requires a large number of weights
○ If the hidden layer size is 256 for 256 x 256 images, the number of parameters is 256 x 256 x 3 (RGB) x 256 (hidden layer size) ≈ 50M!
● Can we reduce the number of parameters?
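The parameter count above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `fc_weight_count` is made up for illustration, biases are ignored, and applying the same hidden size of 256 to the mel-spectrogram input is an assumption (the slide only states it for images).

```python
def fc_weight_count(input_shape, hidden_size):
    """Number of weights (biases excluded) in one fully-connected layer."""
    n_inputs = 1
    for dim in input_shape:
        n_inputs *= dim
    return n_inputs * hidden_size


# 256 x 256 RGB image, hidden layer size 256 (the slide's example).
image_weights = fc_weight_count((256, 256, 3), 256)

# 128 mel bins x 128 frames; hidden size 256 is assumed here for comparison.
mel_weights = fc_weight_count((128, 128), 256)

print(f"image: {image_weights:,} weights")            # 50,331,648
print(f"mel-spectrogram: {mel_weights:,} weights")    # 4,194,304
```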
Locality and Translation Invariance
● Locality: the objects of interest tend to have local spatial support
○ Important parts of the object structure are locally correlated
● Translation invariance: object appearance is independent of location
Incorporating Locality
● Change the fully-connected layer to a locally-connected layer
○ Each hidden unit is connected to a local area (receptive field)
○ Different hidden units are connected to different locations (feature map)
Source: NIPS 2017 Tutorial, Deep Learning: Practice and Trends
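To get a feel for how much locality alone saves, the sketch below counts the weights of a locally-connected layer. The 3 x 3 receptive field, the stride of 1 without padding, and the function name are assumptions chosen for illustration; the slide does not specify them.

```python
def locally_connected_weight_count(height, width, channels, field):
    """Weights in a locally-connected layer with one hidden unit per valid
    field x field window (stride 1, no padding, no weight sharing)."""
    out_h = height - field + 1
    out_w = width - field + 1
    weights_per_unit = field * field * channels
    return out_h * out_w * weights_per_unit


# Hypothetical setting: 256 x 256 RGB input, 3 x 3 receptive field.
lc_weights = locally_connected_weight_count(256, 256, 3, 3)

# For comparison: one fully-connected hidden unit sees every input value.
fc_weights_per_unit = 256 * 256 * 3

print(f"locally-connected feature map: {lc_weights:,} weights in total")
print(f"fully-connected: {fc_weights_per_unit:,} weights per hidden unit")
print("locally-connected: 27 weights per hidden unit")
```

Each locally-connected unit needs only 3 x 3 x 3 = 27 weights instead of 196,608, but every location still has its own set of weights.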
Incorporating Translation Invariance
● Make the hidden units connected to different locations share the same weights (weight sharing)
○ Convolutional layer: a locally-connected layer with weight sharing
○ The weights are invariant to location, and the output is equivariant: shifting the input shifts the output by the same amount
Source: NIPS 2017 Tutorial, Deep Learning: Practice and Trends
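The effect of weight sharing can be illustrated in one dimension. The following is a minimal sketch with made-up kernel and signal values: the same three weights are applied at every position, so moving the input pattern two steps to the right moves the response two steps to the right as well.

```python
def conv1d_valid(signal, kernel):
    """Slide one shared kernel over the signal (cross-correlation, no padding)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1, 0, -1]                     # hypothetical shared weights
signal = [0, 0, 1, 2, 1, 0, 0, 0, 0]    # a small "object" near the start
shifted = [0, 0] + signal[:-2]          # the same object moved 2 steps right

out = conv1d_valid(signal, kernel)
out_shifted = conv1d_valid(shifted, kernel)

print(out)          # [-1, -2, 0, 2, 1, 0, 0]
print(out_shifted)  # [0, 0, -1, -2, 0, 2, 1]
```

The whole layer here has three weights regardless of the signal length, which is where the parameter saving comes from.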
Convolutional Neural Network (CNN)
● Consists of convolution layers and subsampling (or pooling) layers
○ Local filters (or weights) are convolved with the input or hidden layers and return feature maps
○ The feature maps are sub-sampled (or pooled) to reduce the dimensionality
[Figure: LeNet-5 (LeCun 98). INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: feature maps 6@14x14 (subsampling) → C3: feature maps 16@10x10 (convolutions) → S4: feature maps 16@5x5 (subsampling) → C5: layer 120 (full connection) → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections)]
Convolutional Neural Network (CNN)
● The breakthrough in image classification (2012)
○ CNN with more convolution and max-pooling layers
○ ReLU (fast and non-saturating), dropout (regularization)
○ Trained "directly" on 1.2M images with 2 GPUs over one week
○ ImageNet challenge: top-5 error of 15.3% (more than 10 percentage points lower than the second-place entry)
ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
ImageNet Challenge
● CNN models have become deeper and deeper
[Figure: ImageNet Challenge chart, annotated "Deep Learning Breakthrough" and "Surpass human recognition"]
Convolution Mechanics
● Image input and feature map
○ 3D tensor: width (W), height (H) and depth (channel)
○ Channel (C): R, G, B
○ The input data become a 4D tensor when a batch or mini-batch is used → N (examples) x C x W x H
[Figure: 2D convolution. A filter slides over the input (width x height x channel) and produces the hidden units of a feature map. The filter must have the same depth as the input; the output channel (or depth) corresponds to the number of filters.]
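The two depth rules from this slide (filter depth equals input depth, output depth equals the number of filters) can be seen in a naive implementation. This is a sketch only: it uses plain nested lists, no padding, stride 1, and hypothetical sizes (a 3-channel 5 x 5 input and four 3 x 3 filters, all filled with ones). Like most deep learning libraries, it computes cross-correlation, that is, the filter is not flipped.

```python
def conv2d(image, filters):
    """Naive multi-channel 2D convolution (no padding, stride 1).

    image:   nested list indexed [channel][row][col]
    filters: nested list indexed [filter][channel][row][col]
    returns: nested list indexed [filter][row][col]
    """
    in_channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    output = []
    for filt in filters:
        # The filter depth must match the input depth.
        assert len(filt) == in_channels
        k_h, k_w = len(filt[0]), len(filt[0][0])
        feature_map = []
        for r in range(height - k_h + 1):
            row = []
            for c in range(width - k_w + 1):
                # Sum over all channels and all positions inside the window.
                row.append(sum(
                    filt[ch][i][j] * image[ch][r + i][c + j]
                    for ch in range(in_channels)
                    for i in range(k_h)
                    for j in range(k_w)
                ))
            feature_map.append(row)
        output.append(feature_map)
    return output


# Hypothetical sizes: 3-channel 5 x 5 input, four 3 x 3 filters, all ones.
image = [[[1] * 5 for _ in range(5)] for _ in range(3)]
filters = [[[[1] * 3 for _ in range(3)] for _ in range(3)] for _ in range(4)]

feature_maps = conv2d(image, filters)
shape = (len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))
print(shape)                  # (4, 3, 3): one 3 x 3 feature map per filter
print(feature_maps[0][0][0])  # 27 = 3 channels x 3 x 3 window of ones
```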
Convolution Mechanics
● Stride: sliding with hopping (equivalent to the hop size in STFT)
● Padding: adjust the feature map size by zero-padding the border of the input
[Figure: four examples with a 3 x 3 filter: no padding, no striding; no padding, stride size = 2; pad size = 1, stride size = 2; pad size = 1, no striding]
Source: https://github.com/vdumoulin/conv_arithmetic
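Stride and padding together determine the feature map size along each axis through the standard relation floor((input + 2 x padding - filter) / stride) + 1. The sketch below applies it to the four padding/stride combinations with a 3 x 3 filter; the input width of 5 is an assumption, since the slide does not state the input size used in its figures.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output length along one axis of a convolution."""
    return (input_size + 2 * padding - filter_size) // stride + 1


# Hypothetical 5-wide input, 3-wide filter.
sizes = {
    "no padding, no striding": conv_output_size(5, 3),
    "pad size 1, no striding": conv_output_size(5, 3, padding=1),
    "no padding, stride 2": conv_output_size(5, 3, stride=2),
    "pad size 1, stride 2": conv_output_size(5, 3, stride=2, padding=1),
}

for name, size in sizes.items():
    print(f"{name}: {size} x {size}")
```

With pad size 1 and no striding, the 3 x 3 filter leaves the feature map the same size as the input (5 x 5 here), which is the usual reason for choosing that padding.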
Sub-Sampling (or Pooling)
● Down-size the feature map by summarizing the local features
● Types
○ Max-pooling: the most popular choice
○ Average pooling, standard deviation pooling, L^p (power-average) pooling
Example: 2 x 2 max pooling with stride 2

Input (4 x 4):
1 5 2 4
2 3 9 1
5 3 3 4
7 8 2 2

Output (2 x 2):
5 9
8 4
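The 2 x 2 max-pooling example on this slide can be reproduced with a short function. This is a minimal sketch on nested lists; the function name and its default arguments are illustrative, and the input values are the ones read off the slide's figure.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size window, hopping by `stride`."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(0, cols - size + 1, stride)
        ]
        for r in range(0, rows - size + 1, stride)
    ]


feature_map = [
    [1, 5, 2, 4],
    [2, 3, 9, 1],
    [5, 3, 3, 4],
    [7, 8, 2, 2],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[5, 9], [8, 4]]
```

Replacing `max` with a mean over the window would give average pooling.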
CNN Architecture for Image Classification
● ResNet (deep and high performance)
○ Add skip connections between conv blocks: better gradient flow
● GoogLeNet (efficiency)
○ 1x1 filter: reduce the depth (significantly reduces parameters)
○ Inception module: multiple parallel convolution layers
● VGGNet (flexibility)
○ Small filter size (3x3)
[Figures: one block of VGGNet; a 1x1 filter (64) reducing the depth from 256 to 64; the 34-layer residual network (7x7 conv, 64, /2; pool, /2; stages of 3x3 conv with 64, 128, 256 and 512 filters; avg pool; fc 1000)]
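The claim that a 1x1 filter significantly reduces parameters can be made concrete with a weight count. The sketch below compares a direct 3x3 convolution at depth 256 with a 1x1 → 3x3 → 1x1 arrangement that reduces the depth to 64 first. That three-layer arrangement is an assumption for illustration; the slide only shows a 1x1 filter (64) reducing depth 256 to 64. Biases are ignored.

```python
def conv_weight_count(in_channels, out_channels, kernel_size):
    """Weights (biases excluded) of a square-kernel convolution layer."""
    return kernel_size * kernel_size * in_channels * out_channels


# Direct 3x3 convolution that keeps the depth at 256.
direct = conv_weight_count(256, 256, 3)

# Hypothetical bottleneck: reduce to 64 with 1x1, convolve 3x3, expand with 1x1.
bottleneck = (
    conv_weight_count(256, 64, 1)
    + conv_weight_count(64, 64, 3)
    + conv_weight_count(64, 256, 1)
)

print(f"direct 3x3:  {direct:,} weights")       # 589,824
print(f"bottleneck:  {bottleneck:,} weights")   # 69,632
print(f"reduction:   {direct / bottleneck:.1f}x")
```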
Classification-based MIR Tasks Using CNN
● Semantic-Level (long segment)
○ Music genre/mood classification and auto-tagging
○ Music recommendation
● Event-Level (note, beat or phrase)
○ Onset detection
○ Musical instrument recognition
○ Singing voice detection
(The output is usually predicted at the frame level)
● Frame-Level (single audio frame)
○ Pitch estimation
○ Multiple F0 estimation
[Figure labels: "soft rock", "piano", "singing voice", pitch contour (quantized)]