

  1. Advanced Section #3: CNNs and Object Detection AC 209B: Data Science Javier Zazo Pavlos Protopapas

  2. Lecture Outline
  ◮ Convnets review
  ◮ Classic Networks
  ◮ Residual networks
  ◮ Other combination blocks
  ◮ Object recognition systems
  ◮ Face recognition systems

  3. Convnets review

  4. Motivation for convnets
  ◮ Fewer parameters (weights) than a fully connected (FC) network.
  ◮ Invariant to object translation.
  ◮ Can tolerate some distortion in the images.
  ◮ Capable of generalizing and learning features.
  ◮ Require grid-structured input.
  Source: http://cs231n.github.io/

  5. CNN layers
  ◮ Convolutional layer: formed by filters, feature maps, and activation functions.
    – Convolution can be full, same or valid.
    – Output size: n_output = ⌊(n_input − f + 2p)/s⌋ + 1.
  ◮ Pooling layers: reduce the spatial dimensions and help limit overfitting.
  ◮ Fully connected layers: mix spatial and channel features together.
  Source: http://cs231n.github.io/
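A minimal helper to sanity-check the output-size formula (the function name and example values are our own, written in Python):

```python
import math

def conv_output_size(n_input: int, f: int, s: int = 1, p: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer:
    n_output = floor((n_input - f + 2p) / s) + 1."""
    return math.floor((n_input - f + 2 * p) / s) + 1

# Values from the example on the next slide:
print(conv_output_size(32, f=5, s=1, p=0))  # 28  (5x5 conv, stride 1, no padding)
print(conv_output_size(28, f=2, s=2, p=0))  # 14  (2x2 max-pool, stride 2)
```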

  6. Introductory convolutional network example
  [Figure: input 32x32x1 → convolutional layer (10 channels, f=5, s=1, p=0) → 28x28x10 → max-pooling (f=2, s=2, p=0) → 14x14x10 → fully connected layer (200 neurons) → sigmoid or softmax output.]
  ◮ Training parameters:
    – 250 weights in the conv. filters + 10 bias terms.
    – 0 weights in the max-pool.
    – 14 × 14 × 10 = 1,960 output elements after max-pool.
    – 1,960 × 200 = 392,000 weights + 200 bias terms in the FC layer.
    – Total: 392,460 parameters to be trained.
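A sketch of this network in PyTorch (the framework choice is ours, not the slides'; the sigmoid/softmax output head is omitted because the slide's parameter count stops at the 200-neuron FC layer):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=5, stride=1, padding=0),  # 32x32x1 -> 28x28x10 (250 weights + 10 biases)
    nn.MaxPool2d(kernel_size=2, stride=2),                  # 28x28x10 -> 14x14x10 (no parameters)
    nn.Flatten(),                                           # 14 * 14 * 10 = 1,960 features
    nn.Linear(14 * 14 * 10, 200),                           # 392,000 weights + 200 biases
)

print(sum(p.numel() for p in model.parameters()))  # 392460
```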

  7. Classic Networks

  8. LeNet-5
  ◮ Formulation is a bit outdated considering current practices.
  ◮ Uses convolutional layers followed by pooling layers and finishes with fully connected layers.
  ◮ Starts with high-dimensional features and reduces their size while increasing the number of channels.
  ◮ Around 60k parameters.
  [Figure: 32x32x1 → conv. layer (f=5, s=1) → 28x28x6 → avg-pool (f=2, s=2) → 14x14x6 → conv. layer (f=5, s=1) → 10x10x16 → avg-pool (f=2, s=2) → 5x5x16 → FC 120 → FC 84 → output.]
  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
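A LeNet-5-style sketch in PyTorch, following the layer sizes above (the ReLU activations and the 10-class output head are modern simplifications of ours, not the original 1998 formulation):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1),   # 32x32x1 -> 28x28x6
    nn.ReLU(),
    nn.AvgPool2d(kernel_size=2, stride=2),      # 28x28x6 -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5, stride=1),  # 14x14x6 -> 10x10x16
    nn.ReLU(),
    nn.AvgPool2d(kernel_size=2, stride=2),      # 10x10x16 -> 5x5x16
    nn.Flatten(),
    nn.Linear(5 * 5 * 16, 120),
    nn.ReLU(),
    nn.Linear(120, 84),
    nn.ReLU(),
    nn.Linear(84, 10),                          # 10-class output (e.g. digits)
)

print(lenet5(torch.randn(1, 1, 32, 32)).shape)       # torch.Size([1, 10])
print(sum(p.numel() for p in lenet5.parameters()))   # roughly 61k parameters
```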

  9. AlexNet
  ◮ 1.2 million high-resolution (227x227x3) images in the ImageNet 2010 contest;
  ◮ 1000 different classes; NN with 60 million parameters to optimize (∼255 MB);
  ◮ Uses ReLU activation functions; GPUs for training; 12 layers.
  [Figure: 227x227x3 → conv. layer (f=11, s=4) → 55x55x96 → max-pool (f=3, s=2) → 27x27x96 → conv. layer (f=5, same) → 27x27x256 → max-pool (f=3, s=2) → 13x13x256 → conv. layer (f=3, s=1) → 13x13x384 → conv. layer (f=3, s=1) → 13x13x384 → conv. layer (f=3, s=1) → 13x13x256 → max-pool (f=3, s=2) → 6x6x256 = 9216 → FC 4096 → FC 4096 → Softmax 1000.]
  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
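A quick size check against torchvision's reference AlexNet (an aside of ours, not part of the slides; torchvision's variant differs slightly from the two-GPU 2012 version):

```python
import torchvision

alexnet = torchvision.models.alexnet()  # randomly initialised, no pretrained weights
n_params = sum(p.numel() for p in alexnet.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~61M, in line with the slide's 60 million
```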

  10. VGG-16 and VGG-19
  ◮ ImageNet Challenge 2014; 16 or 19 layers; 138 million parameters (522 MB).
  ◮ Convolutional layers use ‘same’ padding and stride s = 1.
  ◮ Max-pooling layers use a filter size f = 2 and stride s = 2.
  [Figure (VGG-16): CONV = 3x3 filter, s=1, same; MAX-POOL = 2x2 filter, s=2. 224x224x3 → [CONV 64]×2 → 224x224x64 → POOL → 112x112x64 → [CONV 128]×2 → 112x112x128 → POOL → 56x56x128 → [CONV 256]×3 → 56x56x256 → POOL → 28x28x256 → [CONV 512]×3 → 28x28x512 → POOL → 14x14x512 → [CONV 512]×3 → 14x14x512 → POOL → 7x7x512 → FC 4096 → FC 4096 → Softmax 1000.]
  Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014.
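The repeated [CONV]×n + POOL pattern is easy to express as a reusable block; a sketch in PyTorch (the helper name and ReLU placement are our own):

```python
import torch
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int, n_convs: int) -> nn.Sequential:
    """One VGG stage: n_convs 3x3 'same' convolutions with stride 1,
    followed by a 2x2 max-pool with stride 2."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-16 feature extractor: 2-2-3-3-3 convolutions over five stages.
features = nn.Sequential(
    vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
    vgg_block(256, 512, 3), vgg_block(512, 512, 3),
)
print(features(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])
```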

  11. Residual networks

  12. Residual block
  ◮ Residual nets appeared in 2016 to train very deep NNs (100 or more layers).
  ◮ Their architecture uses ‘residual blocks’.
  ◮ Plain network structure: a[l] → linear → z[l+1] → ReLU → a[l+1] → linear → z[l+2] → ReLU → a[l+2].
  ◮ Residual network block: the same two layers, plus an identity shortcut that adds a[l] to z[l+2] before the final ReLU.
  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
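A minimal PyTorch sketch of the block (we use fully connected layers here for clarity; real ResNets use convolutional layers):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear1 = nn.Linear(dim, dim)
        self.linear2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()

    def forward(self, a):            # a = a[l]
        z1 = self.linear1(a)         # z[l+1]
        a1 = self.relu(z1)           # a[l+1]
        z2 = self.linear2(a1)        # z[l+2]
        return self.relu(z2 + a)     # a[l+2] = g(z[l+2] + a[l]): the identity shortcut
```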

  13. Equations of the residual block
  ◮ Plain network:
    a[l] = g(z[l])
    z[l+1] = W[l+1] a[l] + b[l+1]
    a[l+1] = g(z[l+1])
    z[l+2] = W[l+2] a[l+1] + b[l+2]
    a[l+2] = g(z[l+2])
  ◮ Residual block:
    a[l] = g(z[l])
    z[l+1] = W[l+1] a[l] + b[l+1]
    a[l+1] = g(z[l+1])
    z[l+2] = W[l+2] a[l+1] + b[l+2]
    a[l+2] = g(z[l+2] + a[l])
  ◮ With this extra connection, gradients can travel backwards more easily.
  ◮ The residual block can very easily learn the identity function by setting W[l+2] = 0 and b[l+2] = 0.
  ◮ In that case, a[l+2] = g(a[l]) = a[l] for ReLU units (since a[l] ≥ 0).
    – It becomes a flexible block that can expand the capacity of the network, or simply turn into an identity function that does not affect training.
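A tiny numeric check of the identity argument (our own illustration, with arbitrary small dimensions):

```python
import torch

torch.manual_seed(0)
W1, b1 = torch.randn(4, 4), torch.randn(4)   # W[l+1], b[l+1]: arbitrary values
W2, b2 = torch.zeros(4, 4), torch.zeros(4)   # W[l+2] = 0, b[l+2] = 0

a_l = torch.relu(torch.randn(4))             # a[l] >= 0, as a ReLU output would be
z1 = W1 @ a_l + b1                           # z[l+1]
a1 = torch.relu(z1)                          # a[l+1]
z2 = W2 @ a1 + b2                            # z[l+2] = 0
a_l2 = torch.relu(z2 + a_l)                  # a[l+2] = g(z[l+2] + a[l])
print(torch.allclose(a_l2, a_l))             # True: the block acts as the identity
```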

  14. Residual network
  ◮ A residual network stacks residual blocks sequentially.
  ◮ The idea is to allow the network to become deeper without increasing the training complexity.
  [Figure: two plots of training error vs. number of layers, one for a plain network (curves labelled “theory” and “practice”) and one for a ResNet.]

  15. Residual network
  ◮ Residual networks implement blocks with convolutional layers that use the ‘same’ padding option (even when max-pooling).
    – This allows the block to learn the identity function.
  ◮ The designer may want to reduce the size of the features and use ‘valid’ padding.
    – In that case, the shortcut path can implement a new set of convolutional layers that reduces the size appropriately, as sketched below.
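A sketch of such a size-reducing block in PyTorch (a strided 1x1 convolution on the shortcut is the common choice in He et al. 2016; the class name and the exact hyperparameters here are our own):

```python
import torch
import torch.nn as nn

class DownsampleResidualBlock(nn.Module):
    """Residual block whose shortcut also changes the feature size."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + self.shortcut(x))  # shortcut resized to match 'out'

x = torch.randn(1, 64, 56, 56)
print(DownsampleResidualBlock(64, 128)(x).shape)  # torch.Size([1, 128, 28, 28])
```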

  16. Residual network: 34-layer example
  [Figure (from He et al. 2016): side-by-side comparison of VGG-19, a 34-layer plain network and the 34-layer residual network. The ResNet starts with a 7x7, 64-filter convolution with stride 2 and a pooling layer, then stacks 3x3 convolutional residual blocks with 64, 128, 256 and 512 filters (downsampling with stride 2 between stages, output sizes 56, 28, 14 and 7), and finishes with average pooling and a 1000-way fully connected layer.]
  Source: He2016
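For reference, this 34-layer architecture has a reference implementation in torchvision (our aside, not part of the slides):

```python
import torchvision

resnet34 = torchvision.models.resnet34()  # randomly initialised
print(sum(p.numel() for p in resnet34.parameters()))  # ~21.8 million parameters
```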

  17. Classification error values on ImageNet
  ◮ AlexNet (2012) achieved a top-5 error of 15.3% (second place was 26.2%).
  ◮ ZFNet (2013) achieved a top-5 error of 14.8% (visualization of features).

  method                        top-1 err.   top-5 err.
  VGG [40] (ILSVRC’14)          -            8.43†
  GoogLeNet [43] (ILSVRC’14)    -            7.89
  VGG [40] (v5)                 24.4         7.1
  PReLU-net [12]                21.59        5.71
  BN-inception [16]             21.99        5.81
  ResNet-34 B                   21.84        5.71
  ResNet-34 C                   21.53        5.60
  ResNet-50                     20.74        5.25
  ResNet-101                    19.87        4.60
  ResNet-152                    19.38        4.49

  18. Dense Networks
  ◮ Goal: allow maximum information (and gradient) flow → connect every layer directly with every other layer.
  ◮ DenseNets exploit the potential of the network through feature reuse → no need to learn redundant feature maps.
  ◮ DenseNet layers are very narrow (e.g. 12 filters), and they just add a small set of new feature maps.

  19. Dense Networks II
  ◮ DenseNets do not sum the output feature maps of a layer with the incoming feature maps but concatenate them:
    a[l] = g([a[0], a[1], ..., a[l−1]])
  ◮ The dimensions of the feature maps remain constant within a block, but the number of filters changes between layers → growth rate k:
    k[l] = k[0] + k · (l − 1)
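A sketch of one dense layer in PyTorch (the names, and the omission of batch norm and the 1x1 bottleneck used in real DenseNets, are our simplifications):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Produce k new feature maps and concatenate them with the incoming ones."""
    def __init__(self, in_ch: int, growth_rate: int = 12):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, growth_rate, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        new_maps = self.conv(self.relu(x))      # k new feature maps
        return torch.cat([x, new_maps], dim=1)  # concatenate rather than sum

# Within a block the channel count grows by k = 12 per layer while the spatial size stays fixed.
x = torch.randn(1, 24, 32, 32)                  # 24 input channels (an assumed example)
block = nn.Sequential(DenseLayer(24), DenseLayer(36), DenseLayer(48))
print(block(x).shape)                           # torch.Size([1, 60, 32, 32])
```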

  20. Dense Networks III: Full architecture

  21. Other combination blocks
