Deep Residual Learning for Image Recognition


  1. Deep Residual Learning for Image Recognition. K. He, X. Zhang, S. Ren and J. Sun, Microsoft Research. Winner of ILSVRC 2015 and MS COCO 2015. Article overview by Ilya Kuzovkin, Computational Neuroscience Seminar, University of Tartu, 2016.

  2. THE IDEA

  3. ImageNet classification: 1000 classes.

  4. 2012: 8 layers, 15.31% error.

  5. 2012: 8 layers, 15.31% error. 2013: 9 layers with 2x the parameters, 11.74% error.

  6. 2012: 8 layers, 15.31% error. 2013: 9 layers with 2x the parameters, 11.74% error. 2014: 19 layers, 7.41% error.

  7. 2012: 8 layers, 15.31% error. 2013: 9 layers with 2x the parameters, 11.74% error. 2014: 19 layers, 7.41% error. 2015: ?

  8. Is learning better networks as easy as stacking more layers?

  9. Is learning better networks as easy as stacking more layers? Vanishing / exploding gradients?

  10. Is learning better networks as easy as stacking more layers? Vanishing / exploding gradients? These are largely addressed by normalized initialization & intermediate normalization.

  11. Is learning better networks as easy as stacking more layers? Vanishing / exploding gradients? These are largely addressed by normalized initialization & intermediate normalization. What remains is the degradation problem.

  12. Degradation problem: “with the network depth increasing, accuracy gets saturated”.

  13. Degradation problem: “with the network depth increasing, accuracy gets saturated”. It is not caused by overfitting:

  14. Take a shallow network (a stack of Conv layers), train it and test it: it reaches some accuracy X%.

  15. Construct a deeper network by appending identity layers to the trained shallow network, and test it.

  16. The constructed deeper network shows the same accuracy X%: the added identity layers change nothing.

  17. Now take a deeper network of the same depth in which the extra layers are Conv layers instead of identities, train it, and test it.

  18. The trained deeper network performs worse!

  19. “Our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).”

  20. “Solvers might have difficulties in approximating identity mappings by multiple nonlinear layers.”

  21. Add explicit identity connections, and “solvers may simply drive the weights of the multiple nonlinear layers toward zero”.
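(Not from the slides.) A minimal Python/PyTorch sketch of the “constructed solution” argument: appending identity layers to an already-trained shallow network cannot change its outputs, so the deeper constructed network is, by construction, no worse than the shallow one.

```python
import torch
import torch.nn as nn

# A small "trained" (here: randomly initialized) shallow network
shallow = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1), nn.ReLU(),
)

# The constructed deeper network: the same layers plus identity mappings
deeper = nn.Sequential(shallow, nn.Identity(), nn.Identity())

x = torch.randn(1, 3, 32, 32)
assert torch.equal(shallow(x), deeper(x))  # identical outputs, hence identical accuracy
```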

  22. H(x) is the true function we want to learn.

  23. Let’s pretend we want to learn the residual F(x) = H(x) - x instead.

  24. The original function is then recovered as H(x) = F(x) + x.
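A minimal sketch of the residual block these slides describe, assuming a PyTorch-style implementation (the layer sizes are illustrative, not the paper’s exact configuration): the stacked layers learn the residual F(x), and the shortcut adds x back so the block outputs F(x) + x.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic two-layer residual block: output = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        # The residual branch F(x): two 3x3 convolutions with batch normalization
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Identity shortcut: add the input back before the final ReLU
        return F.relu(out + x)
```

For example, `ResidualBlock(64)(torch.randn(1, 64, 56, 56))` preserves both the channel and spatial dimensions, so the addition with the input is well defined.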

  25. Network can decide how deep it needs to be…

  26. Network can decide how deep it needs to be… “The identity connections introduce neither extra parameter nor computation complexity”
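To illustrate the quoted claim, a small standalone check (my own, hypothetical): the identity shortcut is a parameter-free addition, so all learnable weights live in the convolutional branch.

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# The residual branch F(x): two 3x3 convolutions on 64 channels
branch = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
)

# The identity shortcut is just an addition, modeled here by nn.Identity()
shortcut = nn.Identity()

print(count_params(branch))    # 2 * 64 * 64 * 3 * 3 = 73728
print(count_params(shortcut))  # 0 -- the shortcut adds no parameters
```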

  27. 2012: 8 layers, 15.31% error. 2013: 9 layers with 2x the parameters, 11.74% error. 2014: 19 layers, 7.41% error. 2015: ?

  28. 2012: 8 layers, 15.31% error. 2013: 9 layers with 2x the parameters, 11.74% error. 2014: 19 layers, 7.41% error. 2015: 152 layers, 3.57% error.

  29. EXPERIMENTS AND DETAILS

  30. • Lots of 3x3 convolutional layers • VGG complexity is 19.6 billion FLOPs; the 34-layer ResNet is 3.6 billion FLOPs

  31. • Batch normalization • SGD with batch size 256 • (up to) 600,000 iterations • LR 0.1 (divided by 10 when error plateaus) • Momentum 0.9 • No dropout • Weight decay 0.0001

  32. • 1.28 million training images • 50,000 validation images • 100,000 test images
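A hedged sketch of how the training setup listed above could be written in PyTorch; the stand-in `model` and the choice of ReduceLROnPlateau for “divided by 10 when error plateaus” are my assumptions, not the authors’ released code.

```python
import torch
import torch.nn as nn

# Stand-in model; in the paper this would be a 34- or 152-layer ResNet
model = nn.Conv2d(3, 64, kernel_size=3, padding=1)

# SGD with the hyperparameters from the slide
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# One way to express "LR divided by 10 when error plateaus"
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

# After each validation pass one would call: scheduler.step(validation_error)
```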

  33. • The 34-layer ResNet has lower training error. This indicates that the degradation problem is well addressed and we manage to obtain accuracy gains from increased depth.

  34. • The 34-layer ResNet reduces the top-1 error by 3.5%.

  35. • The 18-layer ResNet converges faster, and thus ResNet eases the optimization by providing faster convergence at the early stage.

  36. GOING DEEPER

  37. Due to time complexity, the usual building block is replaced by the bottleneck block. The 50-, 101- and 152-layer ResNets are built from these blocks.
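A sketch of the bottleneck block in the same assumed PyTorch style: a 1x1 convolution reduces the channel count, a 3x3 convolution works on the reduced representation, and a 1x1 convolution expands it back before the shortcut addition. `BottleneckBlock(256, 64)` mirrors the 256 → 64 → 64 → 256 pattern used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    """Bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand, plus identity shortcut."""
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        self.conv = nn.Conv2d(bottleneck_channels, bottleneck_channels,
                              kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        self.expand = nn.Conv2d(bottleneck_channels, channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.expand(out))
        # Identity shortcut, as in the basic block
        return F.relu(out + x)
```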

  38. ANALYSIS ON CIFAR-10

  39. ImageNet Classification 2015: 1st place, 3.57% error. ImageNet Object Detection 2015: 1st place, won 194 of 200 categories. ImageNet Object Localization 2015: 1st place, 9.02% error. COCO Detection 2015: 1st place, 37.3%. COCO Segmentation 2015: 1st place, 28.2%. http://research.microsoft.com/en-us/um/people/kahe/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf
