
Convolutional Neural Networks - QSB 2018: Learning and Artificial Intelligence (PowerPoint presentation)



  1. Convolutional Neural Networks QSB 2018: Learning and Artificial intelligence – Tutorial session 3 Giulio Matteucci

  2. Neural network architectures for computer vision tasks. Images are high dimensional! An input x ∈ R^n, where n = nx × ny × nc, is large when images are represented with raw pixel intensity values (nx, ny = spatial dimensions, nc = number of channels). The number of parameters (weights) grows quadratically with resolution, so fully connected networks do not scale well to real-world computer vision problems! Can we exploit our prior knowledge about the visual world to design a better architecture for vision?
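To make the scaling argument concrete, here is a minimal Python sketch; the function name `fc_params` and the choice of 1000 hidden units are illustrative, not from the slides:

```python
# Parameter count of one fully connected hidden layer on raw pixels.
def fc_params(nx, ny, nc, n_hidden):
    """Weights plus biases of a dense layer mapping an nx*ny*nc image to n_hidden units."""
    n_input = nx * ny * nc
    return n_input * n_hidden + n_hidden

# Doubling the linear resolution roughly quadruples the parameter count:
small = fc_params(200, 200, 3, 1000)   # 200x200 RGB image
large = fc_params(400, 400, 3, 1000)   # 400x400 RGB image
print(small, large, large / small)     # ratio is ~4
```

Even a single modest hidden layer already costs over 10^8 weights at 200×200×3, which is the scaling problem the convolutional architecture is designed to avoid.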

  3. Start from two considerations about natural visual input:
  1 visual features are local, because natural images are made of sparse, local independent components;
  2 visual features can show up everywhere, because natural image statistics are (approximately) stationary across visual space: visual scenes are made of (often) repeated elements, and visual objects undergo identity-preserving transformations (e.g. translation). Hyvärinen et al., "Natural Image Statistics", 2009

  4. Neurons as filters: neurons search for the pattern stored in their weights, and the dot product measures similarity to the input. In the neuron analogy, inputs x_j arrive on "dendrites" through "synapses" with weights y_j, the "soma" computes g(Σ_j x_j y_j + b), and the output leaves along the "axon". When the input vector is similar enough to the weight vector, the response is high: the preferred feature is detected.
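A minimal sketch of this neuron-as-filter idea, using a sigmoid for the nonlinearity g (the weight values here are arbitrary illustrations):

```python
import math

def neuron(x, w, b):
    """Dot product of input x with stored weights w, plus bias, through a sigmoid."""
    s = sum(xj * wj for xj, wj in zip(x, w)) + b
    return 1.0 / (1.0 + math.exp(-s))

w = [1.0, -1.0, 0.5]                         # the neuron's preferred pattern
aligned = neuron([1.0, -1.0, 0.5], w, 0.0)   # input matches the weights
opposed = neuron([-1.0, 1.0, -0.5], w, 0.0)  # input anti-correlated with them
print(aligned, opposed)
```

An input aligned with the weight vector drives a response near 1, while an anti-correlated input stays near 0: the unit acts as a detector for its stored pattern.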

  5. 1 Learn small, localized filters. To do so, keep the spatial structure (i.e. do not flatten the input). Fully connected units learn global filters for local features: nx × ny parameters per hidden unit (e.g. nx = ny = 200 gives 40000 parameters per unit). Costly and inefficient! Locally connected units, each seeing only an h × w patch, learn local filters for local features: h × w parameters per hidden unit (e.g. h = w = 4 gives 16 parameters per unit). Cheap and efficient!

  6. 2 Reuse localized filters unaltered across different parts of the image: the convolution operation

  z(j,k) = Σ_{l=-L..L} Σ_{m=-M..M} y(j-l, k-m) · x(l,m)
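The double sum above can be sketched directly in Python; this is a plain "valid" 2D convolution over lists of lists (the helper name `conv2d_valid` is mine, and indices run over the output grid rather than the centered -L..L range, which is equivalent up to a shift):

```python
def conv2d_valid(image, kernel):
    """2D 'valid' convolution: slide the flipped kernel over the image.

    image, kernel: lists of lists (rows). Output side length is n - f + 1.
    """
    fh, fw = len(kernel), len(kernel[0])
    oh = len(image) - fh + 1
    ow = len(image[0]) - fw + 1
    out = [[0] * ow for _ in range(oh)]
    for j in range(oh):
        for k in range(ow):
            # True convolution flips the kernel; cross-correlation would not.
            out[j][k] = sum(
                image[j + l][k + m] * kernel[fh - 1 - l][fw - 1 - m]
                for l in range(fh) for m in range(fw)
            )
    return out

img = [[1, 2, 1], [0, 0, 0], [1, 2, 1]]
print(conv2d_valid(img, [[1, 1], [1, 1]]))  # 2x2 box filter: [[3, 3], [3, 3]]
```

The same kernel weights are applied at every image position: this is exactly the parameter re-use that the slide describes.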

  7. Applying convolution, the output naturally shrinks: with filter size f and input size nin, nout = (nin - f) + 1. We can avoid this by adding 0s at the input border (padding p): nout = (nin + 2p - f) + 1. With p = 0 (nin ≠ nout) it is called a "valid" convolution; with p chosen so that nin = nout it is called a "same" convolution.
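The output-size formula is easy to check numerically; a small sketch (the function name is mine):

```python
def conv_output_size(n_in, f, p=0):
    """Slide formula: nout = (nin + 2p - f) + 1."""
    return (n_in + 2 * p - f) + 1

n_in, f = 6, 3
print(conv_output_size(n_in, f, p=0))             # "valid": shrinks to 4
print(conv_output_size(n_in, f, p=(f - 1) // 2))  # "same" (odd f): stays 6
```

For odd filter sizes, p = (f - 1) / 2 makes the output match the input, which is why "same" padding with 3×3 filters uses p = 1.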

  8. When cascading multiple convolution operations it is useful to introduce the receptive field (RF): the region of the input space from which a given neuron receives information. The RF size m_l at layer l is equal to the filter size in the first layer and grows by f_l - 1 at each next layer; for the l-th layer, recursively: m_l = m_{l-1} + (f_l - 1), with m_1 = f_1.
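The recursion can be sketched in a few lines (function name mine; stride 1 assumed, as on this slide):

```python
def receptive_field(filter_sizes):
    """RF size per layer: m_1 = f_1, then m_l = m_{l-1} + (f_l - 1), stride 1."""
    m = []
    for f in filter_sizes:
        m.append(f if not m else m[-1] + f - 1)
    return m

print(receptive_field([3, 3, 3, 3]))  # four 3x3 conv layers: [3, 5, 7, 9]
```

With stride 1 the RF grows only linearly in depth, which motivates the strided convolutions of the next slide.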

  9. Modern CNNs use very small filters (e.g. 3x3), but to develop selectivity for meaningful patterns we need larger RFs, so we may want to make them grow faster: strided convolution changes the "step" s of the filter displacement. Strided convolution also acts as a downsampling, greatly reducing output size. In this way RF size can grow faster: m_l = m_{l-1} + (f_l - 1) · Π_{j=1..l-1} s_j. Considering stride (and padding), the output size becomes: nout = ⌊(nin + 2p - f) / s⌋ + 1, with s = stride, p = padding, f = filter dimension.
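The stride-aware RF recursion above can be sketched as follows (function name mine; `jump` tracks the product of strides of the layers below):

```python
def receptive_field_strided(filter_sizes, strides):
    """m_l = m_{l-1} + (f_l - 1) * prod(s_1..s_{l-1}); all-ones strides
    reduce this to the plain f - 1 growth per layer."""
    m, jump = 0, 1
    for f, s in zip(filter_sizes, strides):
        m = f if m == 0 else m + (f - 1) * jump
        jump *= s
    return m

print(receptive_field_strided([3, 3, 3], [2, 2, 1]))  # strided: RF = 15
print(receptive_field_strided([3, 3, 3], [1, 1, 1]))  # stride-1 baseline: RF = 7
```

Two stride-2 layers more than double the final RF relative to the stride-1 stack, at the price of a downsampled output.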

  10. We are learning multiple filters, each acting on all input channels together; each filter's output forms a "feature map". A convolutional layer stacks the different feature maps along the third dimension (as different channels). Size of the output volume: nxout = ⌊(nxin + 2p - fx) / s⌋ + 1, nyout = ⌊(nyin + 2p - fy) / s⌋ + 1, ncout = nf, with nf = number of filters, s = stride, fx, fy = filter size, p = padding, nxin, nyin = input size. Karpathy 2016
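Putting the three formulas together in one small helper (name and the example layer sizes are mine; the example mimics a LeNet-style 5×5 layer with 6 filters):

```python
def conv_layer_output_shape(nx_in, ny_in, fx, fy, nf, p=0, s=1):
    """Spatial dims follow the slide formula; the channel dim equals the
    number of filters, since each filter produces one feature map."""
    nx_out = (nx_in + 2 * p - fx) // s + 1
    ny_out = (ny_in + 2 * p - fy) // s + 1
    return nx_out, ny_out, nf

print(conv_layer_output_shape(32, 32, 5, 5, 6))  # (28, 28, 6)
```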

  11. Example of convolution with an edge-detecting filter, the Sobel filter: [[1, 2, 1], [0, 0, 0], [-1, -2, -1]] (from setosa.io)
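A tiny numerical check of the Sobel filter on a made-up 3×3 patch containing one horizontal edge (bright rows above dark rows); this computes the cross-correlation at the single valid position, so a vertically flipped kernel would just flip the sign:

```python
# Horizontal-edge response of the Sobel filter on a toy patch.
sobel = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]
img = [[9, 9, 9],
       [9, 9, 9],
       [0, 0, 0]]  # bright-to-dark edge between the last two rows

resp = sum(sobel[i][j] * img[i][j] for i in range(3) for j in range(3))
print(resp)  # strongly positive response at the edge
```

On a uniform patch the response would be 0: the filter fires only where intensity changes vertically.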

  12. Solving FCNs' bad scaling: 1 sparsity of connections: each neuron is connected to a small region of the input only (localized receptive field); 2 parameter sharing: the whole input space is tiled with RFs re-using the same learned convolutional filters to enforce shared parameters (feature maps). This is reminiscent of how visual information is represented across the brain surface: retinotopic maps of localized feature detectors.

  13. Thinking ahead, we may want to hardwire some amount of translation tolerance into our network! 2 Nonlinear blur and downsampling: convolutionally apply a pooling operation, a "max" filter that "replaces" input subregions with their max value: z(j,k) = max(pool(j,k)), with pool(j,k) = { y(j-l, k-m) : l = 1..fx, m = 1..fy }. Usually done with fx = fy = 2 and s = 2, i.e. s = fx = fy, to have non-overlapping subregions. Example: max pooling [[1,1,2,4],[5,6,7,8],[3,2,1,0],[1,2,3,4]] with a 2x2 filter and stride 2 gives [[6,8],[3,4]].
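The non-overlapping max-pooling operation can be sketched directly (function name mine; defaults match the slide's f = s = 2 convention):

```python
def max_pool(fmap, f=2, s=2):
    """Non-overlapping max pooling over f x f subregions with stride s."""
    out = []
    for j in range(0, len(fmap) - f + 1, s):
        out.append([
            max(fmap[j + l][k + m] for l in range(f) for m in range(f))
            for k in range(0, len(fmap[0]) - f + 1, s)
        ])
    return out

fmap = [[1, 1, 2, 4],
        [5, 6, 7, 8],
        [3, 2, 1, 0],
        [1, 2, 3, 4]]
print(max_pool(fmap))  # [[6, 8], [3, 4]]
```

Shifting the input by one pixel often leaves the pooled maxima unchanged, which is the hardwired translation tolerance the slide refers to.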

  14. The pooling operation is applied to convolutional-layer volumes independently for each feature map. Dimensions of the output volume: nxout = ⌊(nxin + 2p - fx) / s⌋ + 1, nyout = ⌊(nyin + 2p - fy) / s⌋ + 1, ncout = nf. But since usually p = 0, s = 2 and fx = fy = 2: nxout = nxin / 2 and nyout = nyin / 2. The old formula still holds for RF size calculation. The representation size is reduced by 75%, so the network is less computationally expensive and less likely to overfit. Karpathy 2016

  15. Max-like pooling computations are thought to underlie the build-up of transformation tolerance observed through the primate shape-processing stream. A classical example: V1 simple and complex cells. Simple cells are position-selective oriented-edge-detector neurons; max pooling over simple cells yields complex cells, position-tolerant oriented-edge-detector neurons.

  16. Through the stack conv1, pool1, conv2, pool2, conv3, pool3, conv4, each layer combines simpler features to build more complex ones, becoming more and more abstract: from local, low-level, transformation-sensitive features to global, high-level, transformation-invariant ones; from non-categorical input-image features to a categorial output representation allowing read-out of task-relevant information. Lee et al. 2009

  17. We can consider stacks of convolutional layers as visual feature extractors. Features learned in solving one supervised task can frequently be useful in different contexts: no need to learn every feature from scratch for new tasks! Transfer learning: re-use the first N layers of a network with pre-trained weights (trained on a different task). How far in depth to push N depends on how distant the task domains involved are. Far domains (e.g. face recognition & satellite image classification): low N, only low-level features in common. Close domains (e.g. face recognition & emotion recognition): high N, common high-level features. This extends the applicability of deep learning to the small-data regime.

  18. Imagine starting with a trained face-recognition system: input image (face), features through conv1, pool1, conv2, pool2, conv3, pool3, conv4, then a softmax layer giving p(identity|face). Now you want a car-model-recognition one. The high-level features will be poorly transferable (too domain specific): strip away the last layers!

  19. You are left with a general-purpose middle-level feature extractor. On top of that, stick some new conv layers and a new softmax output giving p(model|car); with (much less) training you will build new car-specific high-level features and a working classifier.
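The recipe of slides 18-19 can be sketched with toy "layers" as plain functions; everything here (the layer names, the arithmetic stand-ins, the dict layout) is hypothetical, not a real framework API:

```python
# Pretrained face network: keep the early layers, discard the task-specific head.
pretrained = {
    "conv1": lambda x: x + 1,           # stand-ins for learned feature extractors
    "conv2": lambda x: x * 2,
    "softmax_faces": lambda x: x - 10,  # face-specific head: stripped away
}

# Re-used backbone: the first N layers act as a general feature extractor ...
backbone = [pretrained["conv1"], pretrained["conv2"]]
# ... and a fresh head is attached, to be trained on the new (car) task.
new_head = lambda x: x + 100

def forward(x):
    for layer in backbone + [new_head]:
        x = layer(x)
    return x

print(forward(3))  # (3 + 1) * 2 + 100 = 108
```

Only `new_head` (and any new conv layers) would receive gradient updates in the fine-tuning phase; the backbone weights stay frozen or change slowly.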

  20. The hierarchical structure of CNN layers (and features) may be interpreted as reflecting the compositionality of the visual world (objects are made of parts and subparts, etc.). It is reminiscent of the anatomical and functional hierarchy of the visual pathways, the ventral stream: V1, V2, V4, PIT, CIT, AIT, along which response latency, RF size, tuning complexity, transformation tolerance and linear decodability all increase. Huberman et al. 2011

  21. This kind of hierarchical brain processing of visual shape information has been modelled throughout the years (the '80s and '90s), from Fukushima's Neocognitron to Poggio's HMAX model: alternating S layers (S1, S2: shape-selectivity build-up, AND-like operations) and C layers (C1, C2: transformation-tolerance build-up, OR-like operations). Biologically-derived ideas instantiated by these models inspired the birth of modern CNN architectures. (Riesenhuber & Poggio 1999)

  22. The first successful convnet was Yann LeCun's LeNet ('98), for handwritten digit recognition, and the first to apply a stack of conv and pool layers followed by fully connected ones. Conv filter size 5x5 (p = 0, "valid", s = 1); pooling filter size 2x2 (p = 0, s = 2); shallow: 2 conv layers interleaved with pooling; roughly 6 · 10^4 parameters (small). Ng 2017
