Stochastic Gradient Descent (SGD)
Batch: $[1..N]$. Gradient over the full batch: $\nabla L(w) = \frac{1}{N}\sum_{i=1}^{N} \nabla \ell_i(w)$
Minibatch: $B$ elements, indices $b(1), b(2), \ldots, b(B)$ sampled from $[1, N]$. Noisy ('stochastic') gradient: $\nabla \tilde{L}(w) = \frac{1}{B}\sum_{j=1}^{B} \nabla \ell_{b(j)}(w)$
Epoch: $N$ samples, i.e. $N/B$ minibatches.
Code example: Gradient Descent vs Stochastic Gradient Descent
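The course's notebook is not reproduced here; below is a minimal numpy sketch of the comparison on an assumed 1-D least-squares problem (the toy data and all names are illustrative, not the course's code).

```python
# Minimal sketch: full-batch gradient descent vs. SGD on 1-D least squares.
import numpy as np

rng = np.random.default_rng(0)
N, B = 1000, 32                               # dataset size, minibatch size
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.5, size=N)   # true slope is 3

def grad(w, idx):
    """Gradient of the mean squared error over the samples in idx."""
    return np.mean(2 * (w * x[idx] - y[idx]) * x[idx])

eta = 0.1
w_gd = w_sgd = 0.0
for epoch in range(10):
    # Gradient descent: one update per epoch, using all N samples.
    w_gd -= eta * grad(w_gd, np.arange(N))
    # SGD: N/B updates per epoch, each on a sampled minibatch b(1..B).
    for _ in range(N // B):
        batch = rng.integers(0, N, size=B)
        w_sgd -= eta * grad(w_sgd, batch)
    print(f"epoch {epoch}: GD w={w_gd:.3f}  SGD w={w_sgd:.3f}")
```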
Regularization in SGD: Weight Decay
Regularized loss: $L(w) + \frac{\lambda}{2}\|w\|^2$. Back-prop on the minibatch gives the noisy gradient $\nabla \tilde{L}(w) = \frac{1}{B}\sum_{j=1}^{B} \nabla \ell_{b(j)}(w)$, with $b(1), \ldots, b(B)$ sampled from $[1, N]$; the update
$w \leftarrow (1 - \eta\lambda)\, w - \eta\, \nabla \tilde{L}(w)$
shrinks ('decays') the weights toward zero at every step. Epoch: $N$ samples, $N/B$ minibatches.
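A one-step sketch of the update above, assuming the minibatch gradient `g` has already been computed by back-prop (function name and defaults are illustrative):

```python
# Sketch of one SGD step with weight decay (l2 regularization).
import numpy as np

def sgd_weight_decay_step(w, g, eta=0.01, lam=1e-4):
    # Regularized loss L(w) + (lam/2)*||w||^2 adds +lam*w to the gradient.
    # Rearranged, the weights "decay" by a factor (1 - eta*lam) each step.
    return (1 - eta * lam) * w - eta * g
```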
Learning rate
Gradient Descent
(S)GD with an adaptable step size, e.g. a schedule that decays the learning rate over the course of training.
(S)GD with momentum
Main idea: retain the long-term trend of the updates, drop the oscillations: $v \leftarrow \mu v - \eta \nabla L(w)$, then $w \leftarrow w + v$.
[Figure: optimization paths of (S)GD vs. (S)GD + momentum.]
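A minimal sketch of the momentum update above; the hyper-parameter names $\eta$ (learning rate) and $\mu$ (momentum coefficient) and their values are assumptions:

```python
# Sketch of SGD with momentum: the velocity v accumulates the long-term
# trend of the gradients while oscillating components cancel out.
def momentum_step(w, v, g, eta=0.01, mu=0.9):
    v = mu * v - eta * g      # decaying accumulation of past updates
    return w + v, v
```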
Code example: Multi-layer perceptron classification
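Again, the actual notebook is not included here; the following is a self-contained numpy sketch of a one-hidden-layer classifier trained with minibatch SGD on an assumed toy two-class dataset:

```python
# Minimal sketch: MLP (2 -> 16 ReLU -> 1 sigmoid) with manual back-prop.
import numpy as np

rng = np.random.default_rng(0)
N = 400
X = rng.normal(size=(N, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # XOR-like labels

W1 = rng.normal(scale=0.5, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)

eta, B = 0.5, 32
for step in range(2000):
    idx = rng.integers(0, N, size=B)
    xb, yb = X[idx], y[idx]
    # Forward pass.
    h = np.maximum(0.0, xb @ W1 + b1)           # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # sigmoid output
    # Backward pass, cross-entropy loss (d loss / d logit = p - y).
    dlogit = (p - yb) / B
    dW2 = h.T @ dlogit; db2 = dlogit.sum(0)
    dh = dlogit @ W2.T * (h > 0)                # ReLU gate
    dW1 = xb.T @ dh;    db1 = dh.sum(0)
    for param, gradp in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= eta * gradp                    # in-place SGD update

h = np.maximum(0.0, X @ W1 + b1)
pred = 1.0 / (1.0 + np.exp(-(h @ W2 + b2))) > 0.5
print(f"training accuracy: {(pred == (y > 0.5)).mean():.2f}")
```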
Step-size Selection & Optimizers: a research problem
• Nesterov's Accelerated Gradient (NAG)
• R-prop
• AdaGrad
• RMSProp
• AdaDelta
• Adam (a sketch of this one follows below)
• …
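As one concrete instance from the list, here is a sketch of the Adam update (Kingma & Ba, 2015); variable names follow the paper's notation and the defaults are the commonly cited ones:

```python
# Sketch of one Adam update step; t is the step counter, starting at 1.
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g * g      # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```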
Neural Network Training: Old & New Tricks
Old (1980s): Stochastic Gradient Descent, Momentum, 'weight decay'
New (last 5-6 years): Dropout, ReLUs, Batch Normalization
Linearization: may need higher dimensions. http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Reminder: Overfitting, in images
[Figure: overfitting illustrated for classification and regression, with the 'just right' fit for comparison.]
Previously: $\ell_2$ Regularization
$E(w) = \underbrace{\textstyle\sum_i \ell(x_i, y_i; w)}_{\text{per-sample loss}} + \lambda \underbrace{\textstyle\sum_l \|W_l\|^2}_{\text{per-layer regularization}}$
Dropout
Each sample is processed by a 'decimated' neural net: units are dropped at random. The decimated nets are distinct classifiers, but they should all do the same job.
Dropout block: 'feature noising'
Test time: deterministic approximation. Replace the random masks by their expectation, i.e. scale each activation by the keep probability $p$.
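A sketch of both regimes, assuming a keep probability `p` (this follows the original dropout formulation; many libraries instead use 'inverted' dropout, which rescales at training time):

```python
# Sketch of dropout as 'feature noising' on an activation array h.
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    if train:
        mask = rng.random(h.shape) < p   # each unit kept with probability p
        return h * mask                  # a 'decimated' net per sample
    return h * p                         # test time: expected activation
```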
Dropout Performance
Neural Network Training: Old & New Tricks
Old (1980s): Stochastic Gradient Descent, Momentum, 'weight decay'
New (last 5-6 years): Dropout, ReLUs, Batch Normalization
'Neuron': Cascade of Linear and Nonlinear Function
Sigmoidal ('logistic'): $\sigma(x) = 1/(1 + e^{-x})$
Rectified Linear Unit (ReLU): $\max(0, x)$
Reminder: a network in backward mode
[Figure: gradients flowing from the outputs back through the layers.]
Gradient signal scaling from above: $< 1$ (in fact at most $0.25$, since $\sigma'(x) = \sigma(x)(1 - \sigma(x)) \leq 1/4$).
Vanishing Gradients Problem
Gradient signal scaling from above: $< 1$ (at most $0.25$ per sigmoid layer). Do this 10 times and the updates in the first layers become minimal: $0.25^{10} \approx 10^{-6}$. The top layer knows what to do, but the lower layers "don't get it". Sigmoidal unit: the signal is not getting through!
Vanishing Gradients Problem: ReLU Solves It
Gradient signal scaling from above: $\{0, 1\}$. Where a unit is active, the derivative is exactly 1, so the gradient passes through unattenuated.
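A small numerical illustration of the two scaling regimes (the 10-layer compounding is a simplification of real back-prop, which also involves the weight matrices):

```python
# Best-case per-layer gradient scaling, compounded over 10 layers.
import numpy as np

x = np.linspace(-5, 5, 1001)
sig = 1.0 / (1.0 + np.exp(-x))
sig_max = (sig * (1.0 - sig)).max()   # = 0.25, attained at x = 0

print(f"sigmoid, best case per layer: {sig_max:.2f}")
print(f"after 10 sigmoid layers: {sig_max**10:.1e}")          # ~1e-6: vanishing
print(f"after 10 ReLU layers (active path): {1.0**10:.1f}")   # derivative is 1
```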
Neural Network Training: Old & New Tricks
Old (1980s): Stochastic Gradient Descent, Momentum, 'weight decay'
New (last 5-6 years): Dropout, ReLUs, Batch Normalization
External Covariate Shift: your input changes
[Figure: the same scene photographed at 10 am, 2 pm, and 7 pm.]
"Whitening": Set Mean = 0, Variance = 1
Photometric transformation: $I \rightarrow aI + b$
• Make each patch have zero mean: $\hat{x} = x - \mu$
• Then make it have unit variance: $\hat{x} \leftarrow \hat{x} / \sigma$
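A sketch of the two bullets above applied to an image patch (the small `eps` guard is an assumption, added to avoid division by zero on constant patches):

```python
# Per-patch whitening: undo the photometric transformation I -> a*I + b.
import numpy as np

def whiten(patch, eps=1e-8):
    patch = patch - patch.mean()           # zero mean
    return patch / (patch.std() + eps)     # unit variance
```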
Internal Covariate Shift: neural network activations during training are a moving target.
Batch Normalization
Whiten as you go: normalize each activation using the mean and variance of the current minibatch, then apply a learned scale $\gamma$ and shift $\beta$:
$\hat{h} = \frac{h - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{h} + \beta$
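A sketch of the training-time forward pass only; at test time, standard implementations substitute running averages of the minibatch statistics (an assumption not spelled out on the slide):

```python
# Batch-normalization forward pass over a minibatch h of shape (B, features).
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    mu = h.mean(axis=0)                      # per-feature minibatch mean
    var = h.var(axis=0)                      # per-feature minibatch variance
    h_hat = (h - mu) / np.sqrt(var + eps)    # whitened activations
    return gamma * h_hat + beta              # learned scale and shift
```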
Batch Normalization: used in all current systems
Convolutional Neural Networks
Fully-connected Layer
Example: 200x200 image, 40K hidden units: $40{,}000 \times 40{,}000 \approx 2$ billion parameters!
Spatial correlation is local, so this is a waste of resources, and we do not have enough training samples anyway.
Locally-connected Layer
Example: 200x200 image, 40K hidden units, filter size 10x10: 4M parameters.
Note: this parameterization is good when the input image is registered (e.g., face recognition).
Convolutional Layer
Share the same parameters across different locations (assuming the input is stationary): convolutions with learned kernels.
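The parameter counts from the last three slides, checked by arithmetic, together with a naive sketch of a shared-kernel convolution (loop-based for clarity, not efficiency; as is conventional in deep learning, this is really a cross-correlation):

```python
# Parameter counts for the three layer types, plus a minimal 2-D convolution.
import numpy as np

n_in, n_hidden, k = 200 * 200, 40_000, 10
print(f"fully connected:   {n_in * n_hidden:,} weights")   # ~1.6 billion
print(f"locally connected: {n_hidden * k * k:,} weights")  # 4 million
print(f"convolutional:     {k * k:,} weights (one shared kernel)")

def conv2d(image, kernel):
    """Valid 2-D convolution: the same kernel is applied at every location."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out
```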