Case Study: AlexNet [Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

Compared to LeCun 1998:
1. DATA: more data (10^6 vs. 10^3 images)
2. COMPUTE: GPU (~20x speedup)
3. ALGORITHM: deeper (8 weight layers), fancy regularization (dropout), fancy non-linearity (ReLU)
4. INFRASTRUCTURE: CUDA

Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
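A quick sanity check on the activation shapes above: conv and pool output sizes follow out = (W - F + 2P) / S + 1. The helper below is an added illustration (not part of the original slide) that walks the spatial dimension through AlexNet:

def out_size(w, f, s, p):
    """Spatial output size for input width w, filter size f, stride s, pad p."""
    return (w - f + 2 * p) // s + 1

w = 227                    # INPUT: 227x227x3
w = out_size(w, 11, 4, 0)  # CONV1 -> 55
w = out_size(w, 3, 2, 0)   # POOL1 -> 27
w = out_size(w, 5, 1, 2)   # CONV2 -> 27
w = out_size(w, 3, 2, 0)   # POOL2 -> 13
w = out_size(w, 3, 1, 1)   # CONV3/4/5 all keep 13
w = out_size(w, 3, 2, 0)   # POOL3 -> 6, so 6*6*256 values feed FC6
print(w)  # 6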
Case Study: ZFNet [Zeiler and Fergus, 2013]

AlexNet but:
- CONV1: change from (11x11 stride 4) to (7x7 stride 2)
- CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512

ImageNet top 5 error: 15.4% -> 14.8%
Case Study: VGGNet [Simonyan and Zisserman, 2014]

Only 3x3 CONV (stride 1, pad 1) and 2x2 MAX POOL (stride 2).
Best model: 11.2% top 5 error in ILSVRC 2013 -> 7.3% top 5 error.
VGGNet layer sizes (not counting biases):

INPUT:     [224x224x3]   memory: 224*224*3=150K    params: 0
CONV3-64:  [224x224x64]  memory: 224*224*64=3.2M   params: (3*3*3)*64 = 1,728
CONV3-64:  [224x224x64]  memory: 224*224*64=3.2M   params: (3*3*64)*64 = 36,864
POOL2:     [112x112x64]  memory: 112*112*64=800K   params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M  params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M  params: (3*3*128)*128 = 147,456
POOL2:     [56x56x128]   memory: 56*56*128=400K    params: 0
CONV3-256: [56x56x256]   memory: 56*56*256=800K    params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]   memory: 56*56*256=800K    params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]   memory: 56*56*256=800K    params: (3*3*256)*256 = 589,824
POOL2:     [28x28x256]   memory: 28*28*256=200K    params: 0
CONV3-512: [28x28x512]   memory: 28*28*512=400K    params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]   memory: 28*28*512=400K    params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]   memory: 28*28*512=400K    params: (3*3*512)*512 = 2,359,296
POOL2:     [14x14x512]   memory: 14*14*512=100K    params: 0
CONV3-512: [14x14x512]   memory: 14*14*512=100K    params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]   memory: 14*14*512=100K    params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]   memory: 14*14*512=100K    params: (3*3*512)*512 = 2,359,296
POOL2:     [7x7x512]     memory: 7*7*512=25K       params: 0
FC:        [1x1x4096]    memory: 4096              params: 7*7*512*4096 = 102,760,448
FC:        [1x1x4096]    memory: 4096              params: 4096*4096 = 16,777,216
FC:        [1x1x1000]    memory: 1000              params: 4096*1000 = 4,096,000

TOTAL memory: ~15M values * 4 bytes ~= 58MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Note:
- Most memory is in the early CONV layers
- Most params are in the late FC layers
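The totals can be re-derived from the table's columns. The short script below is an added illustration (not from the slide); memory counts stored activation values and params count weights only, matching the table's conventions:

layers = [
    # (kind, output_hw, output_depth, input_depth)
    ("conv", 224, 64, 3), ("conv", 224, 64, 64),
    ("pool", 112, 64, None),
    ("conv", 112, 128, 64), ("conv", 112, 128, 128),
    ("pool", 56, 128, None),
    ("conv", 56, 256, 128), ("conv", 56, 256, 256), ("conv", 56, 256, 256),
    ("pool", 28, 256, None),
    ("conv", 28, 512, 256), ("conv", 28, 512, 512), ("conv", 28, 512, 512),
    ("pool", 14, 512, None),
    ("conv", 14, 512, 512), ("conv", 14, 512, 512), ("conv", 14, 512, 512),
    ("pool", 7, 512, None),
]
mem = 224 * 224 * 3                         # input activations
params = 0
for kind, hw, depth, in_depth in layers:
    mem += hw * hw * depth                  # activations stored at this layer
    if kind == "conv":
        params += 3 * 3 * in_depth * depth  # all VGG convs are 3x3
params += 7*7*512*4096 + 4096*4096 + 4096*1000  # FC6, FC7, FC8 weights
mem += 4096 + 4096 + 1000                       # FC activations
print(f"activations: {mem/1e6:.1f}M values, ~{mem*4/2**20:.0f} MB fp32 (forward only)")
print(f"params: {params/1e6:.0f}M")
# -> activations: 15.2M values, ~58 MB fp32 (forward only); params: 138M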
Case Study: GoogLeNet [Szegedy et al., 2014]

The Inception module.
ILSVRC 2014 winner (6.7% top 5 error).
Case Study: GoogLeNet

Fun features:
- Only 5 million params! (removes FC layers completely)

Compared to AlexNet:
- 12x fewer params
- 2x more compute
- 6.67% top 5 error (vs. 16.4%)
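The trick behind the small parameter count: 1x1 "bottleneck" convolutions shrink the depth before the expensive 3x3 and 5x5 branches. Below is a minimal sketch of an Inception-style module, not GoogLeNet's exact code; the branch widths are illustrative choices, and PyTorch is used purely as an example vehicle:

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)          # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1),   # 1x1 bottleneck...
                                nn.ReLU(inplace=True),
                                nn.Conv2d(c3_red, c3, 3, padding=1))  # ...then 3x3
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1),
                                nn.ReLU(inplace=True),
                                nn.Conv2d(c5_red, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        # Run the four branches in parallel, concatenate along depth.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
y = InceptionModule(192, 64, 96, 128, 16, 32, 32)(x)
print(y.shape)  # torch.Size([1, 256, 28, 28])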
Case Study: ResNet [He et al., 2015]

ILSVRC 2015 winner (3.6% top 5 error).
(Slide from Kaiming He's presentation: https://www.youtube.com/watch?v=1PGLj-uKT1w)
(slide from Kaiming He’s recent presentation)
Case Study: ResNet [He et al., 2015]

The input is 224x224x3, but after the first conv and pool layers the spatial dimension is only 56x56!
Identity Mappings in Deep Residual Networks, He et al. 2016
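The core idea in code: a residual block computes y = x + F(x), so the identity path gives gradients a direct route through the network. Below is a minimal sketch of a pre-activation block in the spirit of He et al. 2016 (assumptions: a fixed channel count, PyTorch as the illustration language; not the authors' reference code):

import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # "Pre-activation": BN and ReLU come before each conv,
        # leaving the skip path as a pure identity.
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)  # y = x + F(x): identity skip connection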
Deep Networks with Stochastic Depth, Huang et al., 2016

"We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function."

Think of layers more like vector fields, nudging the input x toward the label y.
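A minimal sketch of the stochastic-depth idea (assumptions: an arbitrary residual `body` module and a survival probability p; PyTorch used for illustration). During training, each mini-batch either runs the block or bypasses it with the identity; at test time the residual is scaled by p, matching its expected contribution:

import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, body, p_survive=0.8):
        super().__init__()
        self.body, self.p = body, p_survive

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p:
                return x + self.body(x)   # block survives this mini-batch
            return x                      # block dropped: pure identity
        return x + self.p * self.body(x)  # test time: expected-value scaling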
Wide Residual Networks, Zagoruyko and Komodakis, 2016
- wide networks with only 16 layers can significantly outperform 1000-layer deep networks
- the main power of residual networks is in the residual blocks, not in extreme depth
- wide residual networks are several times faster to train

Swapout: Learning an ensemble of deep architectures, Singh et al., 2016
- a 32-layer wider model performs similarly to a 1001-layer ResNet model

FractalNet: Ultra-Deep Neural Networks without Residuals, Larsson et al., 2016
Still an active area of research...
- Densely Connected Convolutional Networks, Huang et al.
- ResNet in ResNet, Targ et al.
- Deeply-Fused Nets, Wang et al.
- Weighted Residuals for Very Deep Networks, Shen et al.
- Residual Networks of Residual Networks: Multilevel Residual Networks, Zhang et al.
- ...

In large part likely due to open source code being available.
ASIDE: arxiv-sanity.com plug
Addressing other tasks...

image (224x224x3) -> CNN (a block of compute with a few million parameters) -> features (7x7x512) -> predicted thing (compared against the desired thing)

Only the last part changes from task to task; the CNN feature extractor stays the same.
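A sketch of this pattern (every module name and size here is an illustrative stand-in, not a real pretrained network): one shared trunk computes the 7x7x512 features, and only a small per-task head changes:

import torch
import torch.nn as nn

trunk = nn.Sequential(                               # stand-in for a pretrained ConvNet
    nn.Conv2d(3, 512, 7, stride=32, padding=3),      # toy: 224x224x3 -> 7x7x512
    nn.ReLU(inplace=True),
)

heads = {  # task-specific heads bolted onto the same features
    "classify": nn.Sequential(nn.Flatten(), nn.Linear(7 * 7 * 512, 1000)),
    "localize": nn.Sequential(nn.Flatten(), nn.Linear(7 * 7 * 512, 4)),
}

x = torch.randn(1, 3, 224, 224)
feats = trunk(x)                                     # shared "block of compute"
print(heads["classify"](feats).shape)                # torch.Size([1, 1000])
print(heads["localize"](feats).shape)                # torch.Size([1, 4])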
Image Classification
image (224x224x3) -> CNN features (7x7x512) -> fully connected layer -> a vector of class probabilities, e.g. 1000 numbers giving probabilities for different classes.
Image Captioning
image (224x224x3) -> CNN features (7x7x512) -> RNN -> a sequence of 10,000-dimensional vectors giving probabilities of different words in the caption.
Localization
image (224x224x3) -> CNN features (7x7x512) -> fully connected layer -> class probabilities (as before) plus 4 numbers: X coord, Y coord, Width, Height.
Reinforcement Learning [Mnih et al. 2015]
image (160x210x3) -> CNN features -> fully connected layer -> e.g. a vector of 8 numbers giving the probability of taking each of the 8 possible ATARI actions.
Segmentation
image (224x224x3) -> CNN features (7x7x512) -> deconv layers -> image class "map" (224x224x20): an array of class probabilities at each pixel.
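A sketch of the upsampling path (the layer count and widths are assumptions for illustration, not a specific paper's architecture): stacked "deconv" (transposed-conv) layers take the 7x7x512 features back up to a 224x224x20 class map:

import torch
import torch.nn as nn

upsample = nn.Sequential(
    # Each ConvTranspose2d(k=4, s=2, p=1) doubles the spatial size:
    # 7 -> 14 -> 28 -> 56 -> 112 -> 224, shrinking depth as we go.
    nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(32, 20, 4, stride=2, padding=1),  # 20 class scores per pixel
)

feats = torch.randn(1, 512, 7, 7)
print(upsample(feats).shape)  # torch.Size([1, 20, 224, 224])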
Autoencoders
original image (224x224x3) -> CNN features (7x7x512) -> deconv layers -> reconstructed image (224x224x3).
Variational Autoencoders [Kingma et al.], [Rezende et al.], [Salimans et al.]
original image (224x224x3) -> CNN features (7x7x512) -> reparameterization layer -> deconv layers -> reconstructed image (224x224x3).
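The reparameterization layer is what makes the sampling step differentiable: the encoder outputs a mean and log-variance, and z = mu + sigma * eps keeps gradients flowing through mu and sigma. A minimal sketch (the latent size of 32 is an arbitrary choice):

import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)  # sigma from log-variance
    eps = torch.randn_like(std)    # fresh noise; no gradient flows through it
    return mu + std * eps          # z ~ N(mu, sigma^2), differentiable in mu, sigma

mu, logvar = torch.zeros(1, 32), torch.zeros(1, 32)
z = reparameterize(mu, logvar)     # one sample from the latent code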
Detection
image (224x224x3) -> CNN features (7x7x512) -> 1x1 CONV -> 7x7x(5*B+C)

For each of the 7x7 locations:
- [x, y, width, height, confidence] * B
- class

E.g. YOLO: You Only Look Once (Demo: http://pjreddie.com/darknet/yolo/)
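The detection head really is just one 1x1 convolution over the feature grid. A sketch (B=2 boxes and C=20 classes are assumptions, matching YOLO's usual defaults):

import torch
import torch.nn as nn

B, C = 2, 20
head = nn.Conv2d(512, 5 * B + C, kernel_size=1)  # 7x7x512 -> 7x7x(5B+C)

feats = torch.randn(1, 512, 7, 7)
print(head(feats).shape)  # torch.Size([1, 30, 7, 7]): 30 = 5*2 + 20 per cell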
Dense Image Captioning
image (224x224x3) -> CNN features (7x7x512) -> 1x1 CONV -> 7x7x(5*B+[C,..])

For each of the 7x7 locations:
- x, y, width, height, confidence
- a sequence of words

DenseCap: Fully Convolutional Localization Networks for Dense Captioning, Johnson et al. 2016
Practical considerations when applying ConvNets
What hardware do I use?

Buy your own machine:
- NVIDIA DIGITS DevBox (TITAN X GPUs)
- NVIDIA DGX-1 (P100 GPUs)

Build your own machine:
- https://graphific.github.io/posts/building-a-deep-learning-dream-machine/

GPUs in the cloud:
- Amazon AWS (GRID K520 :( )
- Microsoft Azure (soon): 4x K80 GPUs
- Cirrascale ("rent-a-box")
What framework do I use?
Caffe, Torch, Theano, Lasagne, Keras, TensorFlow, Mxnet, chainer, Nervana's Neon, Microsoft's CNTK, Deeplearning4j, ...
Q: How do I know what architecture to use?

A: don't be a hero.
1. Take whatever works best on ILSVRC (latest ResNet)
2. Download a pretrained model
3. Potentially add/delete some parts of it
4. Finetune it on your application (see the sketch below)
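A minimal sketch of that recipe, assuming torchvision's model zoo as the source of pretrained weights (any framework's zoo works the same way):

import torch.nn as nn
import torchvision.models as models

model = models.resnet50(weights="IMAGENET1K_V1")   # 1+2: latest-ish ResNet, pretrained
for p in model.parameters():
    p.requires_grad = False                        # freeze the trunk
model.fc = nn.Linear(model.fc.in_features, 10)     # 3: swap the head for your 10 classes
# 4: finetune — train model.fc first, optionally unfreeze later layers afterwards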
Q: How do I know what hyperparameters to use?

A: don't be a hero.
- Use whatever is reported to work best on ILSVRC.
- Play with the regularization strength (dropout rates).
ConvNets in practice: Distributed training
- VGG: ~2-3 weeks of training with 4 GPUs
- ResNet-101: 2-3 weeks with 4 GPUs (~$1K each)