Very Deep ConvNets for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
Visual Geometry Group, University of Oxford
ILSVRC Workshop, 12 September 2014
Summary of VGG Submission
• Localisation task
  • 1st place, 25.3% error
• Classification task
  • 2nd place, 7.3% error
• Key component: very deep ConvNets
  • up to 19 weight layers
Effect of Depth
• How does ConvNet depth affect performance?
• Comparison of ConvNets
  • same generic design – fair evaluation
  • increasing depth: from 11 to 19 weight layers
Network Design
Layer stack (image → class scores):
image → conv-64, conv-64, maxpool → conv-128, conv-128, maxpool → conv-256, conv-256, maxpool → conv-512, conv-512, maxpool → conv-512, conv-512, maxpool → FC-4096 → FC-4096 → FC-1000 → softmax

Key design choices:
• 3x3 conv. kernels – very small
• conv. stride 1 – no loss of information

Other details:
• Rectification (ReLU) non-linearity
• 5 max-pool layers (x2 reduction)
• no normalisation
• 3 fully-connected (FC) layers
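As a concrete illustration, here is a minimal PyTorch sketch of the layer stack drawn on this slide. This is an assumption for illustration only: the original nets were implemented in a heavily-modified Caffe C++ toolbox (see the Implementation slide), and this particular stack corresponds to the 13-weight-layer configuration shown above.

```python
# Minimal sketch of the slide's layer stack (13 weight layers), in PyTorch.
import torch
import torch.nn as nn

def conv3x3(in_ch, out_ch):
    # 3x3 kernels, stride 1, padding 1 so the spatial size is preserved
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                         nn.ReLU(inplace=True))

features = nn.Sequential(
    conv3x3(3, 64),    conv3x3(64, 64),   nn.MaxPool2d(2),   # 224 -> 112
    conv3x3(64, 128),  conv3x3(128, 128), nn.MaxPool2d(2),   # 112 -> 56
    conv3x3(128, 256), conv3x3(256, 256), nn.MaxPool2d(2),   # 56 -> 28
    conv3x3(256, 512), conv3x3(512, 512), nn.MaxPool2d(2),   # 28 -> 14
    conv3x3(512, 512), conv3x3(512, 512), nn.MaxPool2d(2),   # 14 -> 7
)

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096),        nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),        # class scores; softmax applied on top
)

x = torch.randn(1, 3, 224, 224)
scores = classifier(features(x))  # shape (1, 1000)
```

The deeper 16- and 19-layer configurations keep the same generic design and simply add further 3x3 conv. layers inside the later blocks.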
Discussion
Why 3x3 layers?
• Stacked conv. layers have a large receptive field
  • two 3x3 layers – 5x5 receptive field
  • three 3x3 layers – 7x7 receptive field
• More non-linearity
• Fewer parameters to learn
  • ~140M per net
[Figure: a 5x5 receptive field covered by a 1st and a 2nd stacked 3x3 conv. layer]
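The receptive-field and parameter arithmetic can be checked with a few lines of Python (a worked example, not from the slides; C denotes the number of channels, assumed equal for input and output):

```python
# For stride-1 convolutions, n stacked k x k layers have an effective
# receptive field of 1 + n*(k-1): two 3x3 layers -> 5x5, three -> 7x7.
def receptive_field(n_layers, k=3):
    return 1 + n_layers * (k - 1)

print(receptive_field(2), receptive_field(3))   # 5 7

# Parameter counts (ignoring biases) for C input and C output channels:
C = 512
three_3x3 = 3 * (3 * 3 * C * C)                 # three stacked 3x3 layers
one_7x7 = 7 * 7 * C * C                         # a single 7x7 layer, same receptive field
print(three_3x3 / one_7x7)                      # ~0.55, i.e. ~45% fewer parameters
```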
Training
• Solver
  • multinomial logistic regression
  • mini-batch gradient descent with momentum
  • dropout and weight decay regularisation
  • fast convergence (74 training epochs)
• Initialisation
  • large number of ReLU layers – prone to stalling
  • shallowest net (11 layers) uses random Gaussian initialisation
  • deeper nets
    • first 4 conv. layers and the FC layers initialised with the 11-layer net
    • other layers – random Gaussian
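A minimal sketch of the solver settings and the warm-start initialisation in PyTorch. The hyper-parameter values follow the arXiv pre-print, and `init_from_shallow` is a hypothetical helper that only illustrates how the deeper nets reuse layers of the trained 11-layer net.

```python
import torch
import torch.nn as nn

def init_from_shallow(deep_convs, deep_fcs, shallow_convs, shallow_fcs):
    """Hypothetical helper: copy the first 4 conv layers and the 3 FC layers
    from a trained 11-layer net; remaining layers keep their random Gaussian init."""
    for dst, src in list(zip(deep_convs, shallow_convs))[:4]:
        dst.load_state_dict(src.state_dict())
    for dst, src in zip(deep_fcs, shallow_fcs):
        dst.load_state_dict(src.state_dict())

model = nn.Linear(10, 1000)                      # stand-in for the full ConvNet
criterion = nn.CrossEntropyLoss()                # multinomial logistic regression loss
optimizer = torch.optim.SGD(model.parameters(),  # mini-batch gradient descent...
                            lr=1e-2, momentum=0.9,   # ...with momentum
                            weight_decay=5e-4)       # weight decay regularisation
# dropout (p=0.5) is applied inside the first two FC layers of the net itself
```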
Training (2)
• Multi-scale training
  • randomly-cropped, fixed-size 224x224 ConvNet input
  • different training image sizes:
    • 256xN
    • 384xN
    • [256;512]xN – random image size (scale jittering)
• Standard jittering
  • random horizontal flips
  • random RGB shift
[Figure: 224x224 crops sampled from images rescaled to 256xN (N≥256) and 384xN (N≥384)]
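A sketch of this training-time jittering written with torchvision for illustration (the original pipeline lived in the modified Caffe toolbox); the exact RGB-shift implementation is not shown and is only indicated by a comment.

```python
import random
from PIL import Image
import torchvision.transforms.functional as TF

def train_transform(img: Image.Image):
    # scale jittering: sample the smallest image side S uniformly from [256, 512]
    S = random.randint(256, 512)
    img = TF.resize(img, S)                      # isotropic rescale, smallest side = S
    top = random.randint(0, img.height - 224)    # random 224x224 crop
    left = random.randint(0, img.width - 224)
    img = TF.crop(img, top, left, 224, 224)
    if random.random() < 0.5:                    # standard jittering: random horizontal flip
        img = TF.hflip(img)
    x = TF.to_tensor(img)
    # a random RGB colour shift (AlexNet-style augmentation) would also be applied here
    return x
```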
Testing
• Dense application over the whole image
  • FC layers converted to conv. layers
  • sum-pooling of class score maps
  • more efficient than applying the net to multiple crops
• Jittering
  • multiple image sizes: 256xN, 384xN, etc.
  • horizontal flips
  • class scores averaged
[Figure: image → conv. layers → class score map → pooling → class scores]
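A sketch of the FC-to-conv conversion that makes dense application possible, assuming the PyTorch-style layers used in the earlier sketches (the original conversion was done inside the modified Caffe toolbox):

```python
import torch
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, in_channels: int, spatial: int) -> nn.Conv2d:
    # An FC layer applied to a C x k x k feature map is equivalent to a k x k convolution.
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=spatial)
    conv.weight.data = fc.weight.data.view(fc.out_features, in_channels, spatial, spatial)
    conv.bias.data = fc.bias.data
    return conv

# e.g. FC-4096 applied to the 512x7x7 feature map becomes a 7x7 convolution:
fc1 = nn.Linear(512 * 7 * 7, 4096)
conv_fc1 = fc_to_conv(fc1, in_channels=512, spatial=7)

# on a larger-than-224 input the converted net emits a class score map,
# which is pooled over spatial positions into a single score vector
score_map = torch.randn(1, 1000, 3, 3)
class_scores = score_map.mean(dim=(2, 3))   # spatial averaging (sum-pooling up to a constant)
```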
Implementation
• Heavily-modified Caffe C++ toolbox
• Multiple GPU support
  • 4 x NVIDIA Titan, off-the-shelf workstation
  • data parallelism for training and testing
  • ~3.75 times speed-up, 2–3 weeks for training
[Figure: an image batch split across the four GPUs]
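A minimal sketch of data parallelism on a single multi-GPU workstation, using PyTorch's nn.DataParallel purely as a stand-in for the modified Caffe implementation: the image batch is split across the GPUs and the per-GPU gradients are combined on the host.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())  # stand-in for the full net
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # replicate the model on every GPU; each replica processes a slice of the batch
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count()))).cuda()

batch = torch.randn(256, 3, 224, 224)            # image batch, split across the GPUs
if torch.cuda.is_available():
    batch = batch.cuda()
out = model(batch)
```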
Comparison – Fixed Training Size
[Bar chart: top-5 classification error (val. set) for 13-, 16-, and 19-layer nets, grouped by training image smallest side (256, 384, [256;512]); lower is better]
• 16 or 19 layers trained on 384xN images are the best
Comparison – Random Training Size
[Bar chart: top-5 classification error (val. set) for 13-, 16-, and 19-layer nets, grouped by training image smallest side (256, 384, [256;512]); lower is better]
• Training scale jittering is better than fixed scales
• Before submission: single net, FC-layers tuning
Comparison – Random Training Size
[Bar chart: as above, with updated [256;512] results]
• Training scale jittering is better than fixed scales
• After submission: three nets, all-layers tuning
Final Results
[Bar chart: top-5 classification error (test set) of the leading ILSVRC-2014 entries, multiple nets vs. single net; lower is better]
• 2nd place with 7.3% error
  • combination of 7 models: 6 fixed-scale, 1 multi-scale
• Single model: 8.4% error
Final Results (Post-Competition)
[Bar chart: as above, including the post-competition VGG results]
• 2nd place with 7.0% error
  • combination of two multi-scale models (16- and 19-layer)
• Single model: 7.3% error
Localisation
Our localisation method:
• Builds on very deep classification ConvNets
• Similar to OverFeat
1. Localisation ConvNet predicts a set of bounding boxes
2. Bounding boxes are merged
3. Resulting boxes are scored by a classification ConvNet
Localisation (2)
• Last layer predicts a bbox for each class
  • Bbox parameterisation: (x, y, w, h)
  • 1000 classes x 4-D / class = 4000-D
• Training
  • Euclidean loss
  • initialised with a classification net
  • fine-tuning of all layers
[Figure: a 224x224 crop with its predicted bbox]
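A sketch of the per-class bounding-box regression head and its Euclidean loss, again assuming a PyTorch-style backbone; the tensor shapes and the normalisation of the box coordinates are illustrative assumptions.

```python
import torch
import torch.nn as nn

bbox_head = nn.Linear(4096, 1000 * 4)            # replaces the FC-1000 classification layer

feat = torch.randn(8, 4096)                      # penultimate FC features for a batch of crops
pred = bbox_head(feat).view(8, 1000, 4)          # per-class (x, y, w, h) predictions

labels = torch.randint(0, 1000, (8,))            # ground-truth class of each training image
gt_box = torch.randn(8, 4)                       # ground-truth (x, y, w, h), suitably normalised
pred_for_gt_class = pred[torch.arange(8), labels]
loss = nn.MSELoss()(pred_for_gt_class, gt_box)   # Euclidean loss on the true class's box
```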
Final Results
[Bar chart: top-5 localisation error (test set) of the leading ILSVRC-2014 entries; lower is better]
• 1st place with 25.3% error
  • combination of 2 localisation models
Summary
• Excellent results using classical ConvNets
  • small receptive fields
  • but very deep → lots of non-linearity
• Depth matters!
• Details in the arXiv pre-print: arxiv.org/pdf/1409.1556/
[Bar chart: VGG Team ILSVRC Progress – top-5 error in 2012, 2013, 2014]
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.