Convolutional Neural Networks Basics
Praveen Krishnan
Overview
• Paradigm Shift
• Simple Network
• Convolutional Network Layers
• Case Study 1: AlexNet
• Training
• Generalization
• Visualizations
• Transfer Learning
• Case Study 2: JAZ Net
• Practical Aspects: gradient checks, data, GPU, coding/libraries
Paradigm Shift
Classical pipeline: input image (e.g., a sparrow) → feature extraction (SIFT, HoG, ...) → coding → pooling → classifier.
Feature learning: input image → learned features (CNN, RBM, ...) → classifier.
The learned layers L1, L2, L3, L4 form a hierarchical decomposition of the image.
A simple network
x_0 → f_1 → x_1 → f_2 → ... → x_{n-2} → f_{n-1} → x_{n-1} → f_n → x_n
Here each output x_j depends on the previous input x_{j-1} through a function f_j with parameters w_j, i.e. x_j = f_j(x_{j-1}; w_j).
Feed forward neural network
The input x_0 = (x_{0,0}, ..., x_{0,d}) is mapped through the weight matrices W_1, ..., W_n to the output x_n = (x_{n,1}, ..., x_{n,c}).
A loss is computed between the network output z and the one-hot ground truth y = [0, 0, ..., 1, ..., 0].
Weight updates are performed using back propagation of gradients.
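A minimal NumPy sketch of such a feed-forward pass with a loss at the end, assuming fully connected layers and a sigmoid non-linearity; the layer sizes and helper names are illustrative, not taken from the slides.

```python
import numpy as np

def forward(x, weights, biases):
    """Propagate x through fully connected layers with sigmoid activations."""
    for W, b in zip(weights, biases):
        x = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # x_j = f_j(x_{j-1}; w_j)
    return x

# Toy network: d = 4 inputs, one hidden layer of 8 units, c = 3 outputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 4)), rng.standard_normal((3, 8))]
biases  = [np.zeros(8), np.zeros(3)]

x0 = rng.standard_normal(4)          # input x_0
y  = np.array([0.0, 1.0, 0.0])       # one-hot ground truth
z  = forward(x0, weights, biases)    # network output x_n
loss = 0.5 * np.sum((z - y) ** 2)    # a simple squared loss between output and target
```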
Convolutional Network
Fully connected layer (200x200x3 input):
• #Hidden Units: 120,000
• #Params: 12 billion
• Need huge training data to prevent over-fitting!
Locally connected layer (3x3x3 receptive fields):
• #Hidden Units: 120,000
• #Params: 1.08 Million
• Useful when the image is highly registered.
Convolutional Network
Convolutional layer (3x3x3 receptive field, shared across locations):
• #Hidden Units: 120,000
• #Params: 27
• #Feature maps: 1
• Exploits the stationarity property.
Convolutional Network
Receptive field: each unit in the convolutional layer sees a 3x3 window of the 200x200 input, and the layer produces several feature maps.
• Use of multiple feature maps.
• Sharing of parameters.
• Exploits stationarity of statistics.
• Preserves locality of pixel dependencies.
Convolutional Network
Image size: W1 x H1 x D1 (e.g., 200x200x3)
Receptive field size: F x F
#Feature maps: K
Q. Find out W2, H2 and D2 of the output volume?
Convolutional Network
Image size: W1 x H1 x D1 (e.g., 200x200x3), receptive field size F x F, K feature maps, stride S:
W2 = (W1 - F)/S + 1
H2 = (H1 - F)/S + 1
D2 = K
It is also better to do zero padding to preserve the input size spatially.
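A small sketch of the output-size formula above; the padding term uses the usual extension W2 = (W1 - F + 2P)/S + 1, which is an assumption since the slide mentions zero padding only qualitatively.

```python
def conv_output_size(W1, H1, D1, F, K, S=1, P=0):
    """Output volume of a conv layer: receptive field FxF, K feature maps,
    stride S, zero padding P on each side."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    return W2, H2, D2

# 200x200x3 image, 3x3 receptive field, K = 16 feature maps, stride 1:
print(conv_output_size(200, 200, 3, F=3, K=16))        # (198, 198, 16)
# With zero padding P = 1 the spatial size is preserved:
print(conv_output_size(200, 200, 3, F=3, K=16, P=1))   # (200, 200, 16)
```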
Convolutional Layer
Each output feature map is computed by convolving all input feature maps with learned filters, summing the results and applying a non-linearity:
y_i^n = f( Σ_{j=1..F} w_{ij}^n * x_j^{n-1} )
Here f is a non-linear activation function, F = no. of input feature maps, n = layer index, and "*" represents convolution/correlation.
Q. Is there a difference between correlation and convolution in a learned network?
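To the question above: a sketch contrasting the two operations. Cross-correlation slides the filter as-is, convolution flips it first; since the filter weights are learned, the network can absorb the flip, so the distinction does not matter in practice. Function names are illustrative.

```python
import numpy as np

def correlate2d(x, w):
    """Valid cross-correlation of image x with filter w (no flipping)."""
    H, W = x.shape
    f, _ = w.shape
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + f, j:j + f] * w)
    return out

def convolve2d(x, w):
    """Valid convolution = cross-correlation with the filter flipped in both axes."""
    return correlate2d(x, w[::-1, ::-1])

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.array([[0., 1.], [2., 3.]])
# The two results differ only in which orientation of w is used;
# a learned w could simply be the flipped version, so either operation works.
print(correlate2d(x, w))
print(convolve2d(x, w))
```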
Activation Functions
Sigmoid, tanh, ReLU, Leaky ReLU, maxout.
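The standard definitions of these activations, written out as a small sketch; the slide only names them, so the leaky-ReLU slope of 0.01 and the two-piece maxout are assumed common choices.

```python
import numpy as np

def sigmoid(x):            # squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):               # squashes to (-1, 1)
    return np.tanh(x)

def relu(x):               # max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01): # small slope a for x < 0 instead of zero
    return np.where(x > 0, x, a * x)

def maxout(x, W1, b1, W2, b2):  # max over two learned linear pieces
    return np.maximum(W1 @ x + b1, W2 @ x + b2)
```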
A Typical Supervised CNN Architecture
A typical deep convolutional network stacks:
CONV → NORM → POOL → CONV → NORM → POOL → FC → SOFTMAX
Other layers: pooling, normalization, fully connected, etc.
Pooling Layer
Pool size: 2x2, stride: 2, type: max.
Example (max pooling):
input:  2 8 9 4     output:  8 9
        3 6 5 7              7 5
        5 7 3 1
        6 4 2 5
Aggregation over space or feature type.
Gives invariance to image transformations and increases compactness of the representation.
Pooling types: Max, Average, L2, etc.
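A minimal sketch of 2x2 max pooling with stride 2, reproducing the worked example above; the function name is illustrative.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Non-overlapping max pooling over a 2-D feature map."""
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = patch.max()
    return out

x = np.array([[2, 8, 9, 4],
              [3, 6, 5, 7],
              [5, 7, 3, 1],
              [6, 4, 2, 5]], dtype=float)
print(max_pool2d(x))   # [[8. 9.]
                       #  [7. 5.]]
```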
Normalization
Local contrast normalization (Jarrett et al., ICCV'09): reduces illumination artifacts by performing local subtractive and divisive normalization.
Local response normalization (Krizhevsky et al., NIPS'12): a form of lateral inhibition across channels.
Batch normalization (more later).
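A sketch of local response normalization across channels; the constants (k = 2, n = 5, alpha = 1e-4, beta = 0.75) are the values reported in the AlexNet paper and are assumed here, not taken from the slide.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Lateral inhibition across channels: a has shape (C, H, W)."""
    C = a.shape[0]
    b = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

# Example: normalize a random 8-channel feature map.
a = np.random.default_rng(0).standard_normal((8, 4, 4))
print(local_response_norm(a).shape)   # (8, 4, 4)
```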
Fully connected
A multi-layer perceptron playing the role of a classifier.
Generally used in the final layers to classify the object, represented in terms of discriminative parts and higher semantic entities.
Case Study: AlexNet
Winner of ImageNet LSVRC-2012.
Trained on 1.2M images using SGD with regularization.
Deep architecture (60M parameters).
Optimized GPU implementation (cuda-convnet).
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012. Cited by 11915.
Case Study: AlexNet
CONV 11x11, 96 maps + LRN + MAX POOL 2x2
CONV 5x5, 256 maps + LRN + MAX POOL 2x2
CONV 3x3, 384 maps
CONV 3x3, 384 maps
CONV 3x3, 256 maps + MAX POOL 2x2
FC-4096
FC-4096
SOFTMAX-1000
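A rough sanity check of the "60M parameters" figure, as a sketch; the filter counts and the 6x6x256 input to the first FC layer follow the published AlexNet configuration (not all shown on the slide), biases and the two-GPU grouping of conv2/4/5 are ignored, so the total comes out slightly above 60M.

```python
# Approximate AlexNet parameter count (weights only, no grouping, no biases).
layers = [
    ("conv1", 11 * 11 * 3   * 96),
    ("conv2", 5  * 5  * 96  * 256),
    ("conv3", 3  * 3  * 256 * 384),
    ("conv4", 3  * 3  * 384 * 384),
    ("conv5", 3  * 3  * 384 * 256),
    ("fc6",   6  * 6  * 256 * 4096),   # 6x6x256 conv output flattened
    ("fc7",   4096 * 4096),
    ("fc8",   4096 * 1000),
]
total = sum(n for _, n in layers)
for name, n in layers:
    print(f"{name}: {n:,}")
print(f"total: {total:,}")   # roughly 62M, consistent with ~60M on the slide
```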
Training
Learning: minimizing the loss function (incl. regularization) w.r.t. the parameters of the network (the filter weights).
Mini-batch stochastic gradient descent:
• Sample a batch of data.
• Forward propagation.
• Backward propagation.
• Parameter update.
Training
Back propagation: consider a layer y = f(x; w) with parameters w, and let z be the scalar loss computed from the loss function h. The derivative of the loss w.r.t. the parameters is given by the chain rule:
∂z/∂w = (∂z/∂y) · (∂y/∂w),   ∂z/∂x = (∂z/∂y) · (∂y/∂x)
This is a recursive equation applicable to each layer: ∂z/∂x is passed down as the ∂z/∂y of the layer below.
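A small sketch of this recursion for a single linear layer with a squared loss, plus a numerical gradient check (the "gradient checks" item from the practical-aspects list); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))          # layer parameters w
x = rng.standard_normal(5)               # layer input x^{n-1}
t = rng.standard_normal(3)               # target

def loss(W):
    y = W @ x                            # y = f(x; w)
    return 0.5 * np.sum((y - t) ** 2)    # z = h(y)

# Analytic gradients from the chain rule: dz/dy = (y - t)
y = W @ x
dz_dy = y - t
dz_dW = np.outer(dz_dy, x)               # dz/dw = dz/dy * dy/dw
dz_dx = W.T @ dz_dy                      # dz/dx = dz/dy * dy/dx, passed to the layer below

# Numerical gradient check on one weight.
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
num = (loss(Wp) - loss(W)) / eps
print(dz_dW[0, 0], num)                  # the two values should agree closely
```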
Training
Parameter update:
• Stochastic gradient descent: θ ← θ - η ∇_θ L. Here η is the learning rate and θ is the set of all parameters.
• Stochastic gradient descent with momentum.
More in coming slides...
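A sketch of the two update rules; the momentum coefficient mu = 0.9 is an assumed common choice, and the velocity form matches the momentum slide that follows.

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """Vanilla SGD: theta <- theta - eta * grad."""
    return theta - lr * grad

def momentum_step(theta, grad, v, lr, mu=0.9):
    """Momentum: the velocity accumulates gradients, the position follows the velocity."""
    v = mu * v - lr * grad      # velocity update
    theta = theta + v           # position update
    return theta, v

theta = np.zeros(3)
v = np.zeros(3)
grad = np.array([0.5, -1.0, 0.2])
theta, v = momentum_step(theta, grad, v, lr=0.1)
print(theta, v)
```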
Training
Loss functions measure the compatibility between the prediction and the ground truth.
One vs. rest classification: Soft-max classifier (cross-entropy loss)
L = -log( e^{x_y} / Σ_j e^{x_j} ),  where y is the ground-truth class.
Derivative w.r.t. x_i:
∂L/∂x_i = p_i - 1[i = y],  where p_i = e^{x_i} / Σ_j e^{x_j}.
Proof?
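In place of a written proof, a numerical check of the derivative above; the function name and the toy scores are illustrative.

```python
import numpy as np

def softmax_cross_entropy(x, y):
    """x: class scores, y: index of the correct class."""
    p = np.exp(x - x.max())              # shift for numerical stability
    p /= p.sum()
    return -np.log(p[y]), p

x = np.array([2.0, -1.0, 0.5])
y = 1
loss, p = softmax_cross_entropy(x, y)
grad = p.copy(); grad[y] -= 1.0          # dL/dx_i = p_i - 1[i = y]

# Numerical check of the derivative w.r.t. x_0.
eps = 1e-6
xp = x.copy(); xp[0] += eps
print(grad[0], (softmax_cross_entropy(xp, y)[0] - loss) / eps)
```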
Training
Loss functions: one vs. rest classification with the hinge loss.
Hinge loss is a convex function but not differentiable; a sub-gradient w.r.t. x_i nevertheless exists and is used for the update.
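A sketch using the multiclass margin form of the hinge loss over class scores, one common "one vs. rest" formulation; the margin of 1 is an assumption, as the slide does not give the exact expression.

```python
import numpy as np

def hinge_loss(x, y, margin=1.0):
    """Multiclass hinge loss on scores x with correct class y, and its sub-gradient."""
    margins = np.maximum(0.0, x - x[y] + margin)
    margins[y] = 0.0
    loss = margins.sum()
    grad = (margins > 0).astype(float)      # sub-gradient: 1 where the margin is violated
    grad[y] = -grad.sum()                   # correct class collects the negative count
    return loss, grad

x = np.array([2.0, 1.5, -0.3])
print(hinge_loss(x, y=0))    # small loss: only class 1 violates the margin
```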
Training
Loss functions: regression with the Euclidean loss / squared loss
L = (1/2) Σ_i (x_i - y_i)^2
Derivative w.r.t. x_i:
∂L/∂x_i = x_i - y_i
Training
Visualization of the loss function: typically viewed as a highly non-convex function of the parameters θ, but more recently it is believed to have smoother surfaces, though with many saddle regions!
Key ingredients of the optimization: initialization, step direction, step size/learning rate, momentum.
Training
Momentum: gives better convergence rates.
Physical perspective: the gradient affects the velocity of the update, yielding higher velocity along directions of consistent gradient.
Momentum update: the velocity accumulates the gradients, and the position (the parameters) follows the velocity.
Training
Learning rate (η): controls the kinetic energy of the updates. It is important to know when to decay η.
Common methods (annealing), sketched below:
• Step decay
• Exponential/log-space decay
• Manual
Adaptive learning methods:
• Adagrad
• RMSprop
Figure courtesy: Fei-Fei et al., cs231n
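A sketch of the two annealing schedules named above; the decay factors and step length are illustrative assumptions.

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Halve the learning rate every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def exp_decay(lr0, epoch, k=0.05):
    """Exponential decay: lr = lr0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)

for epoch in (0, 10, 20, 30):
    print(epoch, step_decay(0.1, epoch), exp_decay(0.1, epoch))
```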
Training
Initialization: never initialize the weights to all zeros or to the same value. (Why? All units would then compute identical outputs and receive identical gradients, so they would never differentiate.)
Popular techniques (see the sketch below):
• Random values sampled from N(0, 1).
• Xavier (Glorot et al., JMLR'10): the scale of initialization depends on the number of input (fan-in) and output (fan-out) neurons; initial weights are sampled from N(0, var(w)).
• Pre-training using RBMs (Hinton et al., Science 2006).
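A sketch of Xavier/Glorot initialization; var(w) = 2 / (fan_in + fan_out) is the variance proposed in the Glorot paper and is assumed here as the concrete form of N(0, var(w)).

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Glorot-style init: variance scaled by the number of input and output neurons."""
    var_w = 2.0 / (fan_in + fan_out)
    return rng.normal(0.0, np.sqrt(var_w), size=(fan_out, fan_in))

W = xavier_init(fan_in=4096, fan_out=1000)
print(W.std())   # close to sqrt(2 / (4096 + 1000)) ~ 0.0198
```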
Training: Generalization
Underfitting: use deeper networks.
Overfitting: how to prevent it?
• Stopping at the right time.
• Weight penalties: L1, L2, max norm.
• Dropout.
• Model ensembles, e.g. the same model with different initializations.
(Figure: top-5 error vs. epoch for the training, val-1 and val-2 accuracies, illustrating overfitting.)
Generalization: Dropout
Stochastic regularization; the idea is applicable to many other networks.
During training, hidden units are temporarily dropped at random, each being retained with a fixed probability p (say 0.5).
At test time all units are preserved but their activations are scaled by p.
Dropout together with the max-norm constraint is found to be useful. A sketch follows below.
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting." JMLR 2014.
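A sketch of this train/test behaviour in the non-inverted variant described above, where p is the probability of keeping a unit; names are illustrative.

```python
import numpy as np

def dropout_train(x, p=0.5, rng=np.random.default_rng(0)):
    """Training: keep each unit with probability p, zero it otherwise."""
    mask = rng.random(x.shape) < p
    return x * mask

def dropout_test(x, p=0.5):
    """Testing: keep every unit but scale activations by p."""
    return x * p

h = np.ones(10)
print(dropout_train(h))   # roughly half the units zeroed
print(dropout_test(h))    # all units scaled to 0.5
```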
Generalization
Features learned by a one-hidden-layer autoencoder on the MNIST dataset, without dropout vs. with dropout: dropout yields noticeably sparser features.
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting." JMLR 2014.
Generalization: Batch Normalization
Covariate shift: a change in the distribution of a function's domain. Randomized mini-batches reduce its effect.
Internal covariate shift: changes in the current layer's parameters change the distribution of the input to successive layers. This slows down training and requires careful initialization.
Image credit: https://gab41.lab41.org/batch-normalization-what-the-hey-d480039a9e3b
Generalization: Batch Normalization
Fixes the distribution of each layer's input as training progresses, leading to faster convergence.
Ioffe and Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," arXiv 2015.
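A sketch of the batch-norm transform for a single feature: normalize over the mini-batch, then scale and shift with learned parameters; gamma, beta and eps follow the paper's formulation and are assumed here.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """x: mini-batch of activations for one feature, shape (N,)."""
    mu = x.mean()
    var = x.var()
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance over the batch
    return gamma * x_hat + beta             # learned scale and shift

x = np.array([2.0, 4.0, 6.0, 8.0])
print(batch_norm(x))   # normalized batch, mean ~0 and variance ~1
```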
Some results on ImageNet
Top-5 classification accuracy of AlexNet, Clarifai and GoogLeNet.
Source: Krizhevsky et al., NIPS'12