Very Deep ConvNets for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
Visual Geometry Group, University of Oxford
ILSVRC Workshop, 12 September 2014
Summary of VGG Submission
• Localisation task
  • 1st place, 25.3% error
• Classification task
  • 2nd place, 7.3% error
• Key component: very deep ConvNets
  • up to 19 weight layers
Effect of Depth
• How does ConvNet depth affect performance?
• Comparison of ConvNets
  • same generic design – fair evaluation
  • increasing depth: from 11 to 19 weight layers
Network Design
Layer stack (image → class scores):
image → conv-64, conv-64, maxpool → conv-128, conv-128, maxpool → conv-256, conv-256, maxpool → conv-512, conv-512, maxpool → conv-512, conv-512, maxpool → FC-4096 → FC-4096 → FC-1000 → softmax

Key design choices:
• 3x3 conv. kernels – very small
• conv. stride 1 – no loss of information

Other details:
• Rectification (ReLU) non-linearity
• 5 max-pool layers (x2 reduction)
• no normalisation
• 3 fully-connected (FC) layers
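As a concrete illustration, here is a minimal PyTorch sketch of the layer stack drawn on this slide. This is an assumption for illustration only: the original nets were implemented in a heavily-modified Caffe C++ toolbox (see the Implementation slide), and this particular stack corresponds to the 13-weight-layer configuration shown above.

```python
# Minimal sketch of the slide's layer stack (13 weight layers), in PyTorch.
import torch
import torch.nn as nn

def conv3x3(in_ch, out_ch):
    # 3x3 kernels, stride 1, padding 1 so the spatial size is preserved
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                         nn.ReLU(inplace=True))

features = nn.Sequential(
    conv3x3(3, 64),    conv3x3(64, 64),   nn.MaxPool2d(2),   # 224 -> 112
    conv3x3(64, 128),  conv3x3(128, 128), nn.MaxPool2d(2),   # 112 -> 56
    conv3x3(128, 256), conv3x3(256, 256), nn.MaxPool2d(2),   # 56 -> 28
    conv3x3(256, 512), conv3x3(512, 512), nn.MaxPool2d(2),   # 28 -> 14
    conv3x3(512, 512), conv3x3(512, 512), nn.MaxPool2d(2),   # 14 -> 7
)

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096),        nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),        # class scores; softmax applied on top
)

x = torch.randn(1, 3, 224, 224)
scores = classifier(features(x))  # shape (1, 1000)
```

The deeper 16- and 19-layer configurations keep the same generic design and simply add further 3x3 conv. layers inside the later blocks.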
Discussion
Why 3x3 layers?
• Stacked conv. layers have a large receptive field
  • two 3x3 layers – 5x5 receptive field
  • three 3x3 layers – 7x7 receptive field
• More non-linearity
• Fewer parameters to learn
  • ~140M per net
[Figure: a 5x5 receptive field covered by a 1st and a 2nd stacked 3x3 conv. layer]
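The receptive-field and parameter arithmetic can be checked with a few lines of Python (a worked example, not from the slides; C denotes the number of channels, assumed equal for input and output):

```python
# For stride-1 convolutions, n stacked k x k layers have an effective
# receptive field of 1 + n*(k-1): two 3x3 layers -> 5x5, three -> 7x7.
def receptive_field(n_layers, k=3):
    return 1 + n_layers * (k - 1)

print(receptive_field(2), receptive_field(3))   # 5 7

# Parameter counts (ignoring biases) for C input and C output channels:
C = 512
three_3x3 = 3 * (3 * 3 * C * C)                 # three stacked 3x3 layers
one_7x7 = 7 * 7 * C * C                         # a single 7x7 layer, same receptive field
print(three_3x3 / one_7x7)                      # ~0.55, i.e. ~45% fewer parameters
```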
Training
• Solver
  • multinomial logistic regression
  • mini-batch gradient descent with momentum
  • dropout and weight decay regularisation
  • fast convergence (74 training epochs)
• Initialisation
  • large number of ReLU layers – prone to stalling
  • shallowest net (11 layers) uses random Gaussian initialisation
  • deeper nets
    • first 4 conv. layers and the FC layers initialised with the 11-layer net
    • other layers – random Gaussian
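A minimal sketch of the solver settings and the warm-start initialisation in PyTorch. The hyper-parameter values follow the arXiv pre-print, and `init_from_shallow` is a hypothetical helper that only illustrates how the deeper nets reuse layers of the trained 11-layer net.

```python
import torch
import torch.nn as nn

def init_from_shallow(deep_convs, deep_fcs, shallow_convs, shallow_fcs):
    """Hypothetical helper: copy the first 4 conv layers and the 3 FC layers
    from a trained 11-layer net; remaining layers keep their random Gaussian init."""
    for dst, src in list(zip(deep_convs, shallow_convs))[:4]:
        dst.load_state_dict(src.state_dict())
    for dst, src in zip(deep_fcs, shallow_fcs):
        dst.load_state_dict(src.state_dict())

model = nn.Linear(10, 1000)                      # stand-in for the full ConvNet
criterion = nn.CrossEntropyLoss()                # multinomial logistic regression loss
optimizer = torch.optim.SGD(model.parameters(),  # mini-batch gradient descent...
                            lr=1e-2, momentum=0.9,   # ...with momentum
                            weight_decay=5e-4)       # weight decay regularisation
# dropout (p=0.5) is applied inside the first two FC layers of the net itself
```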
Training (2)
• Multi-scale training
  • randomly-cropped, fixed-size 224x224 ConvNet input
  • different training image sizes:
    • 256xN
    • 384xN
    • [256;512]xN – random image size (scale jittering)
• Standard jittering
  • random horizontal flips
  • random RGB shift
[Figure: 224x224 crops sampled from images rescaled to 256xN (N≥256) and 384xN (N≥384)]
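A sketch of this training-time jittering written with torchvision for illustration (the original pipeline lived in the modified Caffe toolbox); the exact RGB-shift implementation is not shown and is only indicated by a comment.

```python
import random
from PIL import Image
import torchvision.transforms.functional as TF

def train_transform(img: Image.Image):
    # scale jittering: sample the smallest image side S uniformly from [256, 512]
    S = random.randint(256, 512)
    img = TF.resize(img, S)                      # isotropic rescale, smallest side = S
    top = random.randint(0, img.height - 224)    # random 224x224 crop
    left = random.randint(0, img.width - 224)
    img = TF.crop(img, top, left, 224, 224)
    if random.random() < 0.5:                    # standard jittering: random horizontal flip
        img = TF.hflip(img)
    x = TF.to_tensor(img)
    # a random RGB colour shift (AlexNet-style augmentation) would also be applied here
    return x
```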
Testing
• Dense application over the whole image
  • FC layers converted to conv. layers
  • sum-pooling of class score maps
  • more efficient than applying the net to multiple crops
• Jittering
  • multiple image sizes: 256xN, 384xN, etc.
  • horizontal flips
  • class scores averaged
[Figure: image → conv. layers → class score map → pooling → class scores]
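A sketch of the FC-to-conv conversion that makes dense application possible, assuming the PyTorch-style layers used in the earlier sketches (the original conversion was done inside the modified Caffe toolbox):

```python
import torch
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, in_channels: int, spatial: int) -> nn.Conv2d:
    # An FC layer applied to a C x k x k feature map is equivalent to a k x k convolution.
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=spatial)
    conv.weight.data = fc.weight.data.view(fc.out_features, in_channels, spatial, spatial)
    conv.bias.data = fc.bias.data
    return conv

# e.g. FC-4096 applied to the 512x7x7 feature map becomes a 7x7 convolution:
fc1 = nn.Linear(512 * 7 * 7, 4096)
conv_fc1 = fc_to_conv(fc1, in_channels=512, spatial=7)

# on a larger-than-224 input the converted net emits a class score map,
# which is pooled over spatial positions into a single score vector
score_map = torch.randn(1, 1000, 3, 3)
class_scores = score_map.mean(dim=(2, 3))   # spatial averaging (sum-pooling up to a constant)
```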
Implementation
• Heavily-modified Caffe C++ toolbox
• Multiple GPU support
  • 4 x NVIDIA Titan, off-the-shelf workstation
  • data parallelism for training and testing
  • ~3.75 times speed-up, 2–3 weeks for training
[Figure: an image batch split across the four GPUs]
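A minimal sketch of data parallelism on a single multi-GPU workstation, using PyTorch's nn.DataParallel purely as a stand-in for the modified Caffe implementation: the image batch is split across the GPUs and the per-GPU gradients are combined on the host.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())  # stand-in for the full net
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # replicate the model on every GPU; each replica processes a slice of the batch
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count()))).cuda()

batch = torch.randn(256, 3, 224, 224)            # image batch, split across the GPUs
if torch.cuda.is_available():
    batch = batch.cuda()
out = model(batch)
```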
Comparison – Fixed Training Size
[Bar chart: top-5 classification error (val. set) for 13-, 16-, and 19-layer nets, grouped by training image smallest side (256, 384, [256;512]); lower is better]
• 16 or 19 layers trained on 384xN images are the best
Comparison – Random Training Size
[Bar chart: top-5 classification error (val. set) for 13-, 16-, and 19-layer nets, grouped by training image smallest side (256, 384, [256;512]); lower is better]
• Training scale jittering is better than fixed scales
• Before submission: single net, FC-layers tuning
Comparison – Random Training Size
[Bar chart: as above, with updated [256;512] results]
• Training scale jittering is better than fixed scales
• After submission: three nets, all-layers tuning
Final Results
[Bar chart: top-5 classification error (test set) of the leading ILSVRC-2014 entries, multiple nets vs. single net; lower is better]
• 2nd place with 7.3% error
  • combination of 7 models: 6 fixed-scale, 1 multi-scale
• Single model: 8.4% error
Final Results (Post-Competition)
[Bar chart: as above, including the post-competition VGG results]
• 2nd place with 7.0% error
  • combination of two multi-scale models (16- and 19-layer)
• Single model: 7.3% error
Localisation
Our localisation method:
• Builds on very deep classification ConvNets
• Similar to OverFeat
1. Localisation ConvNet predicts a set of bounding boxes
2. Bounding boxes are merged
3. Resulting boxes are scored by a classification ConvNet
Localisation (2)
• Last layer predicts a bbox for each class
  • Bbox parameterisation: (x, y, w, h)
  • 1000 classes x 4-D / class = 4000-D
• Training
  • Euclidean loss
  • initialised with a classification net
  • fine-tuning of all layers
[Figure: a 224x224 crop with its predicted bbox]
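A sketch of the per-class bounding-box regression head and its Euclidean loss, again assuming a PyTorch-style backbone; the tensor shapes and the normalisation of the box coordinates are illustrative assumptions.

```python
import torch
import torch.nn as nn

bbox_head = nn.Linear(4096, 1000 * 4)            # replaces the FC-1000 classification layer

feat = torch.randn(8, 4096)                      # penultimate FC features for a batch of crops
pred = bbox_head(feat).view(8, 1000, 4)          # per-class (x, y, w, h) predictions

labels = torch.randint(0, 1000, (8,))            # ground-truth class of each training image
gt_box = torch.randn(8, 4)                       # ground-truth (x, y, w, h), suitably normalised
pred_for_gt_class = pred[torch.arange(8), labels]
loss = nn.MSELoss()(pred_for_gt_class, gt_box)   # Euclidean loss on the true class's box
```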
Final Results
[Bar chart: top-5 localisation error (test set) of the leading ILSVRC-2014 entries; lower is better]
• 1st place with 25.3% error
  • combination of 2 localisation models
Summary
• Excellent results using classical ConvNets
  • small receptive fields
  • but very deep → lots of non-linearity
• Depth matters!
• Details in the arXiv pre-print: arxiv.org/pdf/1409.1556/
[Bar chart: VGG Team ILSVRC Progress – top-5 error in 2012, 2013, 2014]
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.