  1. Convolutional Neural Networks Basics Praveen Krishnan

  2. Overview  Paradigm Shift  Simple Network  Convolutional Network  Layers  Case Study 1: AlexNet  Training  Generalization  Visualizations  Transfer Learning  Case Study 2: JAZ Net  Practical Aspects  Gradient checks  Data  GPU Coding/Libraries

  3. Paradigm Shift  Traditional models: input image (e.g. a sparrow) → Feature Extraction (SIFT, HoG, ...) → Coding → Pooling → Classifier  Deep models: input image (e.g. a sparrow) → Feature Learning (CNN, RBM, ...) → Classifier  Layers L1, L2, L3, L4: hierarchical decomposition of the input

  4. A simple network  x_0 → f_1(w_1) → x_1 → f_2(w_2) → ... → f_{n-1}(w_{n-1}) → x_{n-1} → f_n(w_n) → x_n  Here each output x_j depends on the previous input x_{j-1} through a function f_j with parameters w_j, i.e. x_j = f_j(x_{j-1}; w_j).

  5. Feed-forward neural network  Inputs x_{0,0}, ..., x_{0,d} are mapped through weight matrices W_1, ..., W_n to outputs x_{n,1}, ..., x_{n,c}.  [Diagram: zooming in on a single layer]

  6. Feed-forward neural network  The network output z = (x_{n,1}, ..., x_{n,c}) is compared against the one-hot ground truth y = [0, 0, ..., 1, ..., 0] through a LOSS function.

  7. Feed-forward neural network  Weight updates for W_1, ..., W_n are computed using back-propagation of gradients from the loss.

  8. Convolutional Network  Input image: 200x200x3  Fully connected layer: #hidden units: 120,000; #params: 12 billion; needs huge training data to prevent over-fitting!  Locally connected layer (3x3x3 receptive fields): #hidden units: 120,000; #params: 1.08 million; useful when the image is highly registered.

  9. Convolutional Network  Convolutional layer with a single 3x3x3 kernel  #hidden units: 120,000  #params: 27 (shared across all spatial locations)  #feature maps: 1  Exploits the stationarity property of images.

  10. Convolutional Network  Convolutional layer with 3x3 receptive fields over a 200x200 input, producing several feature maps  Use of multiple feature maps.  Sharing parameters.  Exploits stationarity of statistics.  Preserves locality of pixel dependencies.

  11. Convolutional Network  Input: 200x200x3  Image size: W1xH1xD1  Receptive field size: FxF  #Feature maps: K  Q. Find out W2, H2 and D2?

  12. Convolutional Network  Image size: W1xH1xD1  Receptive field size: FxF  #Feature maps: K  Stride: S  W2 = (W1 - F)/S + 1  H2 = (H1 - F)/S + 1  D2 = K  It is also better to do zero padding to preserve the input size spatially.
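
A minimal Python sketch (not from the slides) of the output-size formula above; the function name conv_output_size and the 2*P zero-padding term are additions for illustration, the rest follows W2 = (W1 - F)/S + 1.

    def conv_output_size(W1, H1, D1, F, K, S=1, P=0):
        # Spatial output size for FxF receptive fields with stride S and
        # zero padding P on each side; the depth equals the number of maps K.
        W2 = (W1 - F + 2 * P) // S + 1
        H2 = (H1 - F + 2 * P) // S + 1
        return W2, H2, K

    # 200x200x3 input from the slides, 3x3 receptive field, one feature map:
    print(conv_output_size(200, 200, 3, F=3, K=1))        # (198, 198, 1)
    print(conv_output_size(200, 200, 3, F=3, K=1, P=1))   # (200, 200, 1): padding preserves size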

  13. Convolutional Layer  y_i^n = f( sum_{j=1}^{F} w_{i,j}^n * x_j^{n-1} )  Here f is a non-linear activation function, F = no. of input feature maps, n = layer index, and * represents convolution/correlation.  Q. Is there a difference between correlation and convolution in a learned network?
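
On the correlation-vs-convolution question: convolution is correlation with a flipped kernel, so a network that learns its filters can learn either one. A small SciPy check of that identity (illustrative, not from the slides):

    import numpy as np
    from scipy.signal import convolve2d, correlate2d

    x = np.random.rand(5, 5)   # a toy single-channel feature map
    w = np.random.rand(3, 3)   # a toy 3x3 filter

    # Correlating with w equals convolving with the doubly flipped kernel,
    # so for learned filters the two operations are interchangeable.
    print(np.allclose(correlate2d(x, w, mode='valid'),
                      convolve2d(x, np.flip(w), mode='valid')))   # True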

  14. Activation Functions  Sigmoid  tanh  ReLU  Leaky ReLU  Maxout
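
A minimal NumPy sketch (not from the slides) of these activations; the leak slope 0.01 and the two-piece maxout are illustrative choices.

    import numpy as np

    def sigmoid(x):             return 1.0 / (1.0 + np.exp(-x))
    def tanh(x):                return np.tanh(x)
    def relu(x):                return np.maximum(0.0, x)
    def leaky_relu(x, a=0.01):  return np.where(x > 0, x, a * x)
    def maxout(z1, z2):         return np.maximum(z1, z2)   # max over two linear pre-activations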

  15. A Typical Supervised CNN Architecture  A typical deep convolutional network: CONV → NORM → POOL → CONV → NORM → POOL → FC → SOFTMAX  Other layers:  Pooling  Normalization  Fully connected  etc.

  16. Pooling Layer  Pool size: 2x2, Stride: 2, Type: Max  Example input (4x4): [[2 8 9 4], [3 6 5 7], [3 1 6 4], [2 5 7 3]] → max-pooled output (2x2): [[8 9], [5 7]]  Aggregation over space or feature type.  Invariance to image transformations; increases compactness of the representation.  Pooling types: Max, Average, L2, etc.
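
A NumPy sketch of the 2x2, stride-2 max pooling example above (not from the slides; the reshape trick assumes even height and width):

    import numpy as np

    def max_pool_2x2(x):
        H, W = x.shape
        # group pixels into 2x2 blocks and take the maximum of each block
        return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    x = np.array([[2, 8, 9, 4],
                  [3, 6, 5, 7],
                  [3, 1, 6, 4],
                  [2, 5, 7, 3]])
    print(max_pool_2x2(x))   # [[8 9]
                             #  [5 7]]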

  17. Normalization  Local contrast normalization (Jarrett et al., ICCV'09)  reduces illumination artifacts  performs local subtractive and divisive normalization  Local response normalization (Krizhevsky et al., NIPS'12)  a form of lateral inhibition across channels  Batch normalization (more later)

  18. Fully connected  Multi-layer perceptron  Plays the role of a classifier  Generally used in the final layers to classify the object, represented in terms of discriminative parts and higher-level semantic entities.

  19. Case Study: AlexNet  Winner of ImageNet LSVRC-2012.  Trained over 1.2M images using SGD with regularization.  Deep architecture (60M parameters).  Optimized GPU implementation (cuda-convnet).  Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012. Cited by 11915.

  20. Case Study: AlexNet  CONV 11x11x96 → LRN → MAX POOL 2x2 → CONV 5x5x256 → LRN → MAX POOL 2x2 → CONV 3x3x384 → MAX POOL 2x2 → CONV 3x3x384 → MAX POOL 2x2 → CONV 3x3x256 → MAX POOL 2x2 → FC-4096 → FC-4096 → SOFTMAX-1000  Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.

  21. Training  Learning: minimizing the loss function (incl. regularization) w.r.t. the parameters of the network (filter weights).  Mini-batch stochastic gradient descent:  Sample a batch of data  Forward propagation  Backward propagation  Parameter update

  22. Training  Back propagation  Consider a layer f with parameters w. Here z is a scalar, the loss computed from the loss function h. The derivative of the loss w.r.t. the parameters is given by a recursive equation which is applicable to each layer.
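
The recursive equations referred to above, reconstructed in standard form (the slide's own notation is not preserved in the transcript): with x_n = f(x_{n-1}; w) and scalar loss z = h(x_n),

    \frac{\partial z}{\partial w} = \frac{\partial z}{\partial x_n}\,\frac{\partial f(x_{n-1}; w)}{\partial w},
    \qquad
    \frac{\partial z}{\partial x_{n-1}} = \frac{\partial z}{\partial x_n}\,\frac{\partial f(x_{n-1}; w)}{\partial x_{n-1}}

The second expression hands \partial z / \partial x_{n-1} down to the layer below, which is what makes the rule applicable recursively at every layer.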

  23. Training  Parameter update  Stochastic gradient descent: here η is the learning rate and θ is the set of all parameters.  Stochastic gradient descent with momentum.  More in coming slides…
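
The update rules on this slide appear only as images in the transcript; in the usual notation (a reconstruction),

    \theta \leftarrow \theta - \eta\,\nabla_\theta L(\theta)    % vanilla SGD
    v \leftarrow \mu v - \eta\,\nabla_\theta L(\theta), \qquad \theta \leftarrow \theta + v    % SGD with momentum coefficient \mu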

  24. Training  Loss functions: measure the compatibility between the prediction and the ground truth.  One vs. rest classification  Soft-max classifier (cross-entropy loss)  Derivative w.r.t. x_i  Proof?
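
The soft-max (cross-entropy) loss and its derivative, reconstructed since the slide shows them as images: for scores x and correct class y,

    L = -\log\!\left(\frac{e^{x_y}}{\sum_j e^{x_j}}\right),
    \qquad
    \frac{\partial L}{\partial x_i} = \frac{e^{x_i}}{\sum_j e^{x_j}} - \mathbb{1}[i = y]

i.e. the soft-max probability minus the one-hot ground truth, which is the derivative the slide asks the reader to prove.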

  25. Training  Loss functions  One vs. rest classification  Hinge loss: a convex function, not differentiable, but a sub-gradient exists.  Sub-gradient w.r.t. x_i
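
One common form of the one-vs-rest hinge loss and its sub-gradient (a reconstruction; the slide's formulas are images): with per-class targets t_i \in \{+1, -1\},

    L = \sum_i \max(0,\; 1 - t_i x_i),
    \qquad
    \frac{\partial L}{\partial x_i} =
    \begin{cases} -t_i & \text{if } t_i x_i < 1 \\ 0 & \text{otherwise} \end{cases}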

  26. Training  Loss functions  Regression  Euclidean loss / squared loss  Derivative w.r.t. x_i
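
The Euclidean/squared loss and its derivative (reconstructed, with the conventional 1/2 factor): for prediction x and target y,

    L = \frac{1}{2}\sum_i (x_i - y_i)^2,
    \qquad
    \frac{\partial L}{\partial x_i} = x_i - y_i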

  27. Training  Visualization of the loss function  Typically viewed as a highly non-convex function, but more recently it is believed to have smoother surfaces, though with many saddle regions!  Key ingredients of the update: step direction, step size/learning rate, momentum.  [Figure: loss vs. θ, with the initialization point marked]

  28. Training  Momentum  Better convergence rates.  Physical perspective: affects the velocity of the update.  Higher velocity in the consistent direction of the gradient.  Momentum update: velocity, then position.
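
A small Python sketch of the velocity/position update (illustrative; the quadratic toy objective and the hyper-parameters are not from the slides):

    def sgd_momentum_step(theta, v, grad, eta=0.01, mu=0.9):
        v = mu * v - eta * grad(theta)   # velocity update
        theta = theta + v                # position update
        return theta, v

    # toy objective L(theta) = 0.5 * theta**2, whose gradient is theta
    theta, v = 5.0, 0.0
    for _ in range(100):
        theta, v = sgd_momentum_step(theta, v, grad=lambda t: t)
    print(theta)   # has decayed toward the minimum at 0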

  29. Training  Learning rate (η)  Controls the kinetic energy of the updates.  Important to know when to decay η.  Common methods (annealing):  Step decay  Exponential/log-space decay  Manual  Adaptive learning methods:  Adagrad  RMSprop  Figure courtesy: Fei-Fei et al., cs231n
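
Two common annealing schedules as a Python sketch (the drop factor, period and decay constant are illustrative choices, not from the slides):

    import math

    def step_decay(eta0, epoch, drop=0.5, every=10):
        # halve the learning rate every `every` epochs
        return eta0 * drop ** (epoch // every)

    def exp_decay(eta0, epoch, k=0.05):
        # exponential decay: eta = eta0 * exp(-k * epoch)
        return eta0 * math.exp(-k * epoch)

    print(step_decay(0.1, epoch=25))   # 0.025
    print(exp_decay(0.1, epoch=25))    # ~0.0287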

  30. Training  Initialization  Never initialize the weights to all zeros or to the same value. (Why?)  Popular techniques:  Random values sampled from N(0,1)  Xavier initialization (Glorot et al., JMLR'10)  The scale of initialization depends on the number of input (fan-in) and output (fan-out) neurons.  Initial weights are sampled from N(0, var(w)).  Pre-training  Using RBMs (Hinton et al., Science 2006)
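
A sketch of Xavier initialization as described above; the variance 2/(fan_in + fan_out) is the Glorot et al. formula, and the normal draw follows the slide's N(0, var(w)):

    import numpy as np

    def xavier_init(fan_in, fan_out):
        std = np.sqrt(2.0 / (fan_in + fan_out))
        return np.random.randn(fan_out, fan_in) * std   # W ~ N(0, 2/(fan_in + fan_out))

    W = xavier_init(fan_in=1024, fan_out=512)
    print(W.std())   # roughly sqrt(2/1536) ≈ 0.036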

  31. Training  Generalization  Underfitting: use deeper networks.  Overfitting: how to prevent it?  Stopping at the right time.  Weight penalties: L1, L2, max norm.  Dropout.  Model ensembles, e.g. the same model with different initializations.  [Figure: top-5 error; training, val-1 and val-2 accuracy vs. epoch, illustrating overfitting]

  32. Generalization  Dropout  Stochastic regularization.  Idea applicable to many other networks.  Hidden units are dropped out randomly with a fixed probability p (say 0.5), temporarily, while training.  At test time all units are preserved but scaled with p.  Dropout together with a max-norm constraint is found to be useful.  [Figure: network before and after dropout]  Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting." JMLR 2014.
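
A NumPy sketch of dropout as described above, following the Srivastava et al. convention in which p is the probability of retaining a unit (for p = 0.5 this coincides with the slide's wording):

    import numpy as np

    def dropout_train(h, p=0.5):
        mask = np.random.rand(*h.shape) < p   # keep each unit with probability p
        return h * mask                       # dropped units output zero

    def dropout_test(h, p=0.5):
        return h * p                          # keep all units, scale outputs by p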

  33. Generalization  Without dropout vs. with dropout: features learned with a one-hidden-layer auto-encoder on the MNIST dataset; dropout induces sparsity.  Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting." JMLR 2014.

  34. Generalization  Batch Normalization  Covariate shift: defined as a change in the distribution of a function's domain.  Mini-batches (randomized) reduce the effect of covariate shift.  Internal covariate shift: the current layer's parameters change the distribution of the input to successive layers.  This slows down training and requires careful initialization.  Image credit: https://gab41.lab41.org/batch-normalization-what-the-hey-d480039a9e3b

  35. Generalization  Batch Normalization  Fixes the distribution of layer inputs as training progresses.  Faster convergence.  Ioffe and Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," arXiv 2015.
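
The batch-normalization transform itself, reconstructed from Ioffe and Szegedy (it is not written out in the transcript): for a mini-batch with mean \mu_B and variance \sigma_B^2,

    \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},
    \qquad
    y_i = \gamma\,\hat{x}_i + \beta

with learnable scale \gamma and shift \beta.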

  36. Some results on ImageNet  [Chart: top-5 classification accuracy of AlexNet, Clarifai and GoogLeNet]  Source: Krizhevsky et al., NIPS'12
