GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
Deep Learning: Training
Juhan Nam
Training Deep Neural Networks
[Figure: a deep network with Layer 1 through Layer L; the forward pass propagates the hidden unit activations toward the loss $L(y, \hat{y})$, and the backward pass propagates the gradients $\partial L / \partial x^{(l)}$ from the output back down to the lower layers.]
● Issue: vanishing gradient or exploding gradient
  ○ The gradient in a lower layer is computed as a cascaded multiplication of local gradients from the upper layers
  ○ Some elements of this product can decay or explode exponentially
  ○ Learning is diluted or unstable
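As an added illustration (not from the slides), here is a minimal NumPy sketch of the cascaded multiplication: it propagates a gradient backward through a randomly initialized sigmoid network and reports its norm at the first layer. The layer width, weight scale, and random inputs are assumptions chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_norm_at_layer1(num_layers, width=64, scale=1.0):
    """Backpropagate through a random sigmoid MLP (no training) and
    return the gradient norm that arrives at the first layer."""
    x = rng.standard_normal(width)
    pre_acts, Ws = [], []
    # Forward pass: store pre-activations for the local gradients.
    for _ in range(num_layers):
        W = scale * rng.standard_normal((width, width)) / np.sqrt(width)
        z = W @ x
        pre_acts.append(z)
        Ws.append(W)
        x = sigmoid(z)
    # Backward pass: cascaded multiplication of local gradients.
    g = np.ones(width)
    for W, z in zip(reversed(Ws), reversed(pre_acts)):
        g = (g * sigmoid(z) * (1.0 - sigmoid(z))) @ W  # chain rule through sigmoid and W
    return np.linalg.norm(g)

for L in (2, 10, 30):
    print(L, grad_norm_at_layer1(L))  # the norm shrinks rapidly as the depth grows
```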
Training Deep Neural Networks
[Figure: the same forward (hidden unit activation) / backward (gradient flow) diagram as on the previous slide.]
● Remedy: keep the distribution of the hidden unit activations in a controlled range
  ○ Normalize the input: done once as a preprocessing step across the entire training set
  ○ Set the variance of the randomly initialized weights: done once at model setup
  ○ Normalize the hidden units: run-time processing during training → batch normalization
Input Normalization
● Standardization: zero mean and unit variance
  ○ Subtract the mean, then divide by the standard deviation
● PCA whitening: zero mean and decorrelated unit variance
  ○ Subtract the mean, rotate onto the principal components, then divide by the per-component standard deviation
  ○ Add a small number to the standard deviation in the division to avoid dividing by zero (a sketch of the whitening steps follows below)
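A minimal NumPy sketch of PCA whitening, added for illustration; the data, variable names, and epsilon value are assumptions, not from the slides.

```python
import numpy as np

X = np.random.randn(1000, 20)                # hypothetical data: (examples, features)
eps = 1e-8                                   # small number added in the division

X_centered = X - X.mean(axis=0)              # zero mean
cov = X_centered.T @ X_centered / X.shape[0] # covariance of the centered data
eigvals, eigvecs = np.linalg.eigh(cov)       # eigendecomposition of the covariance
X_rot = X_centered @ eigvecs                 # rotation onto the principal components
X_white = X_rot / np.sqrt(eigvals + eps)     # divide by the per-component standard deviation
```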
Input Normalization
● In practice
  ○ Zero mean and unit variance are commonly used in music classification tasks when the input is a log-compressed spectrogram (in image classification, however, unit variance is not very common)
  ○ PCA whitening is not very common
● Common pitfall
  ○ The mean and standard deviation must be computed only from the training data (not the entire dataset)
  ○ The mean and standard deviation from the training set should then be used consistently for the validation and test sets (see the sketch below)
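A minimal sketch of the correct workflow, added for illustration; the array names, shapes, and random data stand in for real log-compressed spectrograms. Statistics are computed on the training set only and then reused for validation and test.

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-8):
    """Return a function that standardizes data with the training-set statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return lambda X: (X - mean) / (std + eps)

# Hypothetical log-compressed spectrograms flattened to (examples, features).
X_train = np.random.randn(800, 128)
X_valid = np.random.randn(100, 128)
X_test  = np.random.randn(100, 128)

standardize = fit_standardizer(X_train)      # statistics come from the training set only
X_train_n = standardize(X_train)
X_valid_n = standardize(X_valid)             # the same mean/std are reused here
X_test_n  = standardize(X_test)
```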
Weight Initialization
● Set the variance of the random initial weights so that the variance of the input is equal to the variance of the output at each layer: this speeds up training (a code sketch follows below)
● Glorot (or Xavier) initialization (2010)
  ○ The variance is set to $\sigma^2 = \frac{2}{\text{fan}_{in} + \text{fan}_{out}}$, where $\text{fan}_{in}$ is the input size and $\text{fan}_{out}$ is the output size of the layer
  ○ Concerned with both the forward and backward passes
  ○ Used when the activation function is tanh or sigmoid
● He initialization (2015)
  ○ The variance is set to $\sigma^2 = \frac{2}{\text{fan}_{in}}$
  ○ Concerned with the forward pass only
  ○ Used when the activation function is ReLU or its variants
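A minimal sketch of the two schemes, added for illustration; the Gaussian variant is assumed here and the layer sizes are made up. In practice, deep learning libraries ship these as built-in Xavier/Glorot and Kaiming/He initializers.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(fan_in, fan_out):
    # Variance = 2 / (fan_in + fan_out), intended for tanh/sigmoid layers.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # Variance = 2 / fan_in, intended for ReLU layers (forward pass only).
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W1 = glorot_normal(128, 64)   # e.g., a tanh hidden layer
W2 = he_normal(64, 32)        # e.g., a ReLU hidden layer
```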
Batch Normalization
● Normalize the output of each layer for a mini-batch input as a run-time processing step during training (Ioffe and Szegedy, 2015)
  ○ First, normalize the filter output to have zero mean and unit variance over the mini-batch
  ○ Then, rescale and shift the normalized output with trainable parameters $(\gamma, \delta)$
    ■ This lets the output exploit the non-linearity: inputs with zero mean and unit variance fall mostly in the linear range of the activation (e.g., sigmoid or tanh)
[Figure: sigmoid and tanh curves plotted over the input range −10 to 10, showing the nearly linear region around 0.]
Batch Normalization
● Implemented as an additional layer
  ○ Located between the FC (or Conv) layer and the activation function layer
  ○ A simple element-wise scaling and shifting operation on the input (the mean and variance of the mini-batch are treated as constant vectors)
● Batch normalization in the test phase
  ○ Only a single example may be given at test time, so mini-batch statistics are not available
  ○ Instead, use a moving average of the mini-batch mean and variance collected during training: $\nu^{(\text{run})} \leftarrow \beta \cdot \nu^{(\text{run})} + (1 - \beta) \cdot \nu^{(\text{batch})}$
  ○ In summary, four types of parameters are kept in the batch norm layer: mean (moving average), variance (moving average), rescaling $\gamma$ (trainable), and shift $\delta$ (trainable); a minimal sketch follows below
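A minimal NumPy sketch of the forward pass of such a layer, added for illustration; the class name, defaults, and the omission of the backward pass are assumptions, not the paper's reference implementation.

```python
import numpy as np

class BatchNorm1D:
    """Sketch of a batch-norm layer: batch statistics in training, moving averages at test time."""
    def __init__(self, num_features, beta=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)       # trainable rescaling
        self.delta = np.zeros(num_features)      # trainable shift
        self.run_mean = np.zeros(num_features)   # moving average of mini-batch means
        self.run_var = np.ones(num_features)     # moving average of mini-batch variances
        self.beta, self.eps = beta, eps          # moving-average coefficient, small constant

    def forward(self, x, training=True):
        if training:
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # Update the moving averages that will be used at test time.
            self.run_mean = self.beta * self.run_mean + (1 - self.beta) * mean
            self.run_var = self.beta * self.run_var + (1 - self.beta) * var
        else:
            mean, var = self.run_mean, self.run_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)   # zero mean, unit variance
        return self.gamma * x_hat + self.delta         # rescale and shift

# Usage: a mini-batch during training, then a single example at test time.
bn = BatchNorm1D(num_features=4)
out_train = bn.forward(np.random.randn(32, 4), training=True)
out_test = bn.forward(np.random.randn(1, 4), training=False)
```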
Batch Normalization
● Advantages
  ○ Improves the gradient flow through the network: a higher learning rate can be used and, as a result, training becomes much faster
  ○ Reduces the dependence on weight initialization
[Figure 2 from Ioffe and Szegedy (2015): single-crop validation accuracy on ImageNet classification vs. the number of training steps, comparing the Inception baseline with its batch-normalized variants (BN-Baseline, BN-x5, BN-x30, BN-x5-Sigmoid) and the steps needed to match Inception.]
Optimization
● Stochastic gradient descent (SGD) is the basic optimizer in deep learning
  $\theta_{t+1} = \theta_t - \alpha \, g(\theta_t)$, where $\alpha$ is the learning rate and $g(\theta_t)$ is the gradient of the loss
● However, it has limitations
  ○ Convergence of the loss is very slow when the loss function has a high condition number
  ○ If the gradient becomes zero ($\alpha \, g(\theta_t) = 0$), the update gets stuck: local minima or saddle points
● Condition number: the ratio of the largest to the smallest singular value of the Hessian matrix
[Figure: one-dimensional loss curves illustrating a local minimum and a saddle point, where the gradient is zero. Source: Stanford CS231n slides]
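A minimal sketch of the update rule on an ill-conditioned toy loss, added for illustration; the quadratic loss, curvatures, learning rate, and step count are assumptions chosen to expose the slow convergence.

```python
import numpy as np

# Loss = 0.5 * (a * x^2 + b * y^2) with a >> b has condition number a / b,
# so the update oscillates along the steep direction and crawls along the flat one.
a, b = 100.0, 1.0                        # curvatures: condition number = 100
alpha = 0.015                            # learning rate, limited by the steep direction

def grad(theta):
    return np.array([a * theta[0], b * theta[1]])   # gradient of the quadratic loss

theta = np.array([1.0, 1.0])
for t in range(100):
    theta = theta - alpha * grad(theta)             # theta_{t+1} = theta_t - alpha * g(theta_t)

print(theta)   # the steep coordinate has converged; the flat one has barely moved
```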