Return of the Devil in the Details: Delving Deep into Convolutional Nets
Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman (University of Oxford)
The Devil is still in the Details: 2011 to 2014
Comparing Apples to Apples: State of the Art back in 2011
Back in 2011, state-of-the-art image classification pipelines were commonly based on the bag-of-visual-words approach, with highly tuned feature encoders:
• LLC: Locality-constrained Linear Coding
• SV: Super-Vector encoding
• IFV: Improved Fisher Vector
Many such feature encodings were being proposed, but it was difficult to tell which worked best.
Comparing Apples to Apples: State of the Art back in 2011
In our previous work (BMVC 2011) we conducted an extensive evaluation of these encodings, comparing them all on common ground: a fixed input dataset, a fixed learning stage and a fixed evaluation protocol, with only the feature extractor (IFV, LLC or SV) varying.
* We will call the features from these encodings "shallow", to distinguish them from the CNN-based features which follow.
What's Changed? State of the Art in 2014
• Introduction of CNN-based deep visual features to the community, all using pre-trained networks (Krizhevsky et al. 2012, Donahue et al. 2013, Oquab et al. 2014, Sermanet et al. 2014)
• These have been shown to perform excellently on standard classification and detection benchmarks
• It is unclear how the recently introduced methods compare to each other, and to shallow methods such as IFV
Comparing Apples to Apples: State of the Art in 2014
• This work is again about comparing the latest methods on common ground
• We compare both different pre-trained network architectures and different learning heuristics: a fixed input dataset and a fixed evaluation protocol, with the representation (CNN architecture 1, CNN architecture 2, ..., IFV) varying
Performance Evolution over VOC2007 (2008 to 2014)
• Our best CNN method achieves state-of-the-art performance over several datasets
• How do we get there? Through comparison on an equal footing, we determine what's important and what's not

Method      Dim.      Aug.   mAP
BOW         32K       –      54.48
IFV-BL      327K      –      61.69
IFV         84K       –      64.36
IFV         84K       f s    68.02
DeCAF       4K        t      73.41
CNN-F       4K        t      77.15
CNN-M 2K    2K        f s    80.13
CNN-S       4K (TN)   f s    82.42

(Methods from DeCAF onwards are CNN-based.)
Outline
Introduction and evaluation setup, then five studies:
1. Different pre-trained networks
2. Data augmentation (for both CNN and IFV)
3. Dataset fine-tuning
4. Reducing CNN final-layer output dimensionality
5. Colour and CNN / IFV
Evaluation Setup
Pre-trained net on 1,000 ImageNet classes. Training-set images are passed through the CNN feature extractor (4096-D feature vector out) to train an SVM classifier; the classifier output on the test set is evaluated using mAP, accuracy, etc.
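To make the setup concrete, here is a minimal sketch of the pipeline in Python; the `extract_cnn_features` helper is a hypothetical stand-in for a real pre-trained network, and scikit-learn's LinearSVC stands in for the per-class SVMs:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

def extract_cnn_features(images):
    # Stand-in for the real extractor: actual code would run the images
    # through a pre-trained CNN and return L2-normalised fc7 activations.
    feats = rng.standard_normal((len(images), 4096)).astype(np.float32)
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

# Dummy data standing in for one binary class of a VOC-style benchmark.
train_images, test_images = list(range(100)), list(range(50))
y_train = rng.integers(0, 2, size=100)
y_test = rng.integers(0, 2, size=50)

X_train = extract_cnn_features(train_images)
X_test = extract_cnn_features(test_images)

svm = LinearSVC(C=1.0)                    # one linear SVM per class
svm.fit(X_train, y_train)
scores = svm.decision_function(X_test)
print("AP:", average_precision_score(y_test, scores))
```

In practice one SVM is trained per class, average precision is computed per class, and the per-class APs are averaged into mAP.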
Pre-trained Networks
• CNN-F: similar to Krizhevsky et al., NIPS 2012, 'ImageNet classification with deep convolutional neural networks'
  conv1: 64x11x11, stride 4 | conv2: 256x5x5 | conv3: 256x3x3, stride 1 | conv4: 256x3x3 | conv5: 256x3x3, stride 1 | fc6: 4096, drop-out | fc7: 4096, drop-out
• CNN-M: similar to Zeiler and Fergus, CoRR 2013, 'Visualising and understanding convolutional networks'
  conv1: 96x7x7, stride 2 | conv2: 256x5x5, stride 2 | conv3: 512x3x3, stride 1 | conv4: 512x3x3 | conv5: 512x3x3 | fc6: 4096, drop-out | fc7: 4096, drop-out
• CNN-S: similar to the OverFeat 'accurate' network, ICLR 2014, 'OverFeat: integrated recognition, localisation and detection using ConvNets'
  conv1: 96x7x7, stride 2 | conv2: 256x5x5, stride 1 | conv3: 512x3x3, stride 1 | conv4: 512x3x3 | conv5: 512x3x3 | fc6: 4096, drop-out | fc7: 4096, drop-out
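As a rough illustration of the CNN-M column above, here is a sketch in PyTorch. The filter counts and strides follow the slide, but the pooling, padding and local response normalisation of the actual network are simplified assumptions (see the paper for the exact configuration):

```python
import torch
import torch.nn as nn

# CNN-M-like stack: layer widths and strides from the slide; pooling
# placement and the adaptive pool before fc6 are simplifications.
cnn_m = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.AdaptiveMaxPool2d((6, 6)),                 # fixes the fc6 input size
    nn.Flatten(),
    nn.Linear(512 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),  # fc6
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),         # fc7 (the feature)
    nn.Linear(4096, 1000),                                  # ILSVRC classes
)
print(cnn_m(torch.zeros(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```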
Pre-trained Networks: mAP (VOC07)
DeCAF 73.41 | CNN-F 77.38 | CNN-M 79.89 | CNN-S 79.74
Study 2: Data Augmentation (for both CNN and IFV)
Data Augmentation
What do we mean by data augmentation? It can enter at two points:
• Network pre-training (with jittering), giving the pre-trained network
• CNN feature extraction: (a) extract multiple crops, (b) pool their features (average, max)
Data Augmentation schemes
a. No augmentation: a single 224x224 image (1 image)
b. Flip augmentation: 224x224 image plus its horizontal flip (2 images)
c. Crop+Flip augmentation: 224x224 crops plus flips (10 images)
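A minimal sketch of the Crop+Flip scheme and the subsequent feature pooling, assuming NumPy arrays in (H, W, C) layout; the crop geometry (centre plus four corners) is our reading of the 10-image scheme:

```python
import numpy as np

def ten_crop(image, size=224):
    """Four corner crops plus the centre crop of `image`, and their
    horizontal flips: the 'Crop+Flip' scheme (10 sub-images)."""
    h, w = image.shape[:2]
    tops = [0, 0, h - size, h - size, (h - size) // 2]
    lefts = [0, w - size, 0, w - size, (w - size) // 2]
    crops = [image[t:t + size, l:l + size] for t, l in zip(tops, lefts)]
    crops += [c[:, ::-1] for c in crops]          # horizontal flips
    return crops

def pooled_feature(image, extract, pool="avg"):
    # Extract one feature per sub-image, then pool (average or max).
    feats = np.stack([extract(c) for c in ten_crop(image)])
    return feats.mean(0) if pool == "avg" else feats.max(0)

# Dummy usage with a stand-in extractor (per-channel mean):
img = np.zeros((256, 256, 3), dtype=np.float32)
f = pooled_feature(img, extract=lambda c: c.mean(axis=(0, 1)))
print(f.shape)  # (3,)
```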
Data Augmentation: mAP (VOC07)

                                                   IFV     CNN-M
None                                               64.36   76.97
Flip                                               64.35   76.99
Crop+Flip (train pooling: sum, test pooling: sum)  67.17   79.89
Crop+Flip (train pooling: none, test pooling: sum) 66.68   79.44
Study 3: Dataset Fine-tuning
Fine-tuning
Network pre-training on general-purpose images from ILSVRC-2012 yields a pre-trained network with general-purpose features; network fine-tuning on images from the target dataset yields a fine-tuned network with dataset-specific features.
For VOC 2007, the following loss functions were evaluated for the final fully connected layer:
• TN-CLS, classification loss: max{0, 1 − y wᵀφ(I)}
• TN-RNK, ranking loss: max{0, 1 − wᵀ(φ(I_pos) − φ(I_neg))}
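For illustration, the two hinge losses could be written as follows in PyTorch (a sketch, not the authors' training code; `scores` stands for the wᵀφ(I) outputs, with labels in {-1, +1}):

```python
import torch

def cls_hinge_loss(scores, labels):
    # TN-CLS: mean of max{0, 1 - y * w^T phi(I)}, labels y in {-1, +1}
    return torch.clamp(1 - labels * scores, min=0).mean()

def rank_hinge_loss(pos_scores, neg_scores):
    # TN-RNK: mean of max{0, 1 - (w^T phi(I_pos) - w^T phi(I_neg))}
    return torch.clamp(1 - (pos_scores - neg_scores), min=0).mean()

scores = torch.tensor([0.8, -0.3, 1.5])   # w^T phi(I) for three images
labels = torch.tensor([1.0, -1.0, 1.0])
print(cls_hinge_loss(scores, labels))                              # tensor(0.3000)
print(rank_hinge_loss(torch.tensor([1.2]), torch.tensor([0.5])))   # tensor(0.3000)
```

Note the ranking loss never needs class labels directly: it only requires that positive images score higher than negative ones by a margin, which matches the retrieval-style mAP evaluation.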
Fine-tuning: mAP (VOC07), CNN-S
No TN 79.7 → TN-RNK 82.4
Study 4: Reducing CNN Final-Layer Output Dimensionality
Low-Dimensional CNN Features
• Baseline networks all have a 4096-D last hidden layer
• We further trained three modifications of CNN-M with lower-dimensional fc7 layers: 2048, 1024 and 128
  (conv1: 96x7x7, st. 2 | conv2: 256x5x5, st. 2, pad 1 | conv3: 512x3x3, st. 1, pad 1 | conv4: 512x3x3 | conv5: 512x3x3 | fc6: 4096, drop-out | fc7: 2048 / 1024 / 128)
* Note: as only the original ILSVRC-2012 data was used for re-training, this differs from fine-tuning and is simply a way of reducing the final output dimension
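A sketch of what reducing the fc7 dimensionality amounts to, using a hypothetical PyTorch head (the fc6/fc7/fc8 names follow the slide; the conv stack is omitted for brevity):

```python
import torch
import torch.nn as nn

class CnnMHead(nn.Module):
    """Hypothetical fully connected head of a CNN-M-style network."""
    def __init__(self, fc7_dim=4096, num_classes=1000):
        super().__init__()
        self.fc6 = nn.Linear(512 * 6 * 6, 4096)
        self.fc7 = nn.Linear(4096, fc7_dim)    # the feature layer we shrink
        self.fc8 = nn.Linear(fc7_dim, num_classes)

    def forward(self, x):
        x = torch.relu(self.fc6(x))
        feat = torch.relu(self.fc7(x))         # descriptor later fed to SVMs
        return self.fc8(feat), feat

# CNN-M-128: identical conv layers, 128-D fc7, re-trained on ILSVRC-2012.
head = CnnMHead(fc7_dim=128)
logits, feat = head(torch.zeros(1, 512 * 6 * 6))
print(feat.shape)  # torch.Size([1, 128])
```

Because the whole network (including fc8) is re-trained against the 1,000 ILSVRC classes, the 128-D layer learns a compact version of the same representation rather than a target-dataset-specific one.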
Low-Dimensional CNN Features: mAP (VOC07), CNN-M
fc7 = 4096: 79.89 | 2048: 80.10 | 1024: 79.91 | 128: 78.60
Study 5: Colour and CNN / IFV
Impact of Colour: mAP (VOC07)

           Greyscale  Colour  Greyscale+aug  Colour+aug
IFV-512    65.36      66.37   67.93          68.02
CNN-M      73.59      76.97   77.00          79.89
Comparison to the State of the Art

Method            ILSVRC-2012 (top-5 error)  VOC2007 (mAP)  VOC2012 (mAP)
CNN-M 2048        13.5                        80.1           82.4
CNN-S             13.1                        79.7           82.9
CNN-S TUNE-RNK    13.1                        82.4           83.2
Zeiler & Fergus   16.1                        –              79.0
Oquab et al.      18.0                        77.7           78.7 (82.8*)
Oquab et al.      –                           –              86.3*
Wei et al.        –                           81.5 (85.2*)   81.7 (90.3*)

* Uses extended training data and/or fusion with other methods
Take-Home Messages
• CNN-based methods substantially outperform shallow methods
• Tricks transfer from deep features to shallow features (e.g. data augmentation for IFV)
• With CNN-based methods we can obtain very low-dimensional (~128-D) yet performant features
• If you get the details right, it's possible to reach state-of-the-art performance with very simple methods
There's more…
• Presented here was just a subset of the full results from the paper
• Check out the paper for full results on: VOC 2007, VOC 2012, Caltech-101, Caltech-256 and ILSVRC-2012
One more thing…
• CNN models and feature computation code can now be downloaded from the project website: http://www.robots.ox.ac.uk/~vgg/software/deep_eval/
• As before, source code to reproduce all experiments will be made available
Questions?