Convolutional Neural Networks
QSB 2018: Learning and Artificial Intelligence – Tutorial session 3
Giulio Matteucci
Neural network architectures for computer vision tasks
Images are high dimensional: representing them with pixel intensity values gives an input x ∈ ℝⁿ where n = nx × ny × nc is large. In a fully connected network the number of parameters (weights) grows quadratically with image resolution, so fully connected networks do not scale well to real-world computer vision problems. Can we exploit our prior knowledge about the visual world to design a better architecture for vision?
Start from two considerations about natural visual input:
1. Visual features are local, because natural images are made of sparse, local independent components.
2. Visual features can show up everywhere, because natural image statistics are (approximately) stationary across visual space: visual scenes are made of (often) repeated elements, and visual objects undergo identity-preserving transformations (e.g. translation).
Hyvärinen et al., "Natural Image Statistics", 2009
Neurons as filters: a neuron searches for the pattern stored in its weights, and the dot product between input and weights measures their similarity. Input values xᵢ arrive at the "synapses" along the "dendrites", the "soma" computes σ(Σᵢ xᵢ wᵢ + b), and the result leaves along the "axon" as the output. When the input vector is similar enough to the weight vector the response is high: the preferred feature is detected.
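The neuron-as-filter computation above can be sketched in a few lines of NumPy (function and variable names are illustrative, not from the slides; ReLU stands in for the generic nonlinearity σ):

```python
import numpy as np

def neuron(x, w, b):
    """A neuron as a filter: the dot product w . x measures the similarity
    between the input and the pattern stored in the weights; a nonlinearity
    (here ReLU) turns high similarity into a response."""
    return max(0.0, float(np.dot(w, x) + b))

w = np.array([1.0, -1.0, 1.0])   # the "preferred feature"
# input aligned with the weights -> strong response
print(neuron(np.array([1.0, -1.0, 1.0]), w, b=-1.0))   # 2.0
# dissimilar input -> no response
print(neuron(np.array([-1.0, 1.0, -1.0]), w, b=-1.0))  # 0.0
```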
1. Learn small localized filters. To do so, keep the spatial structure of the input (i.e. do not flatten it).
- Fully connected units learn global filters for local features: nx × ny parameters per hidden unit (e.g. nx = ny = 200 → 40 000 parameters per unit). Costly and inefficient!
- Locally connected units learn local filters for local features: h × w parameters per hidden unit (e.g. h = w = 4 → 16 parameters per unit). Cheap and efficient!
2. Reuse localized filters unaltered across different parts of the image: the convolution operation

O_{i,j} = Σ_{k=−L}^{L} Σ_{l=−M}^{M} I_{i−k, j−l} F_{k,l}
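A minimal "valid" 2D convolution loop, assuming the cross-correlation convention that deep-learning libraries actually use (they skip the kernel flip of the textbook formula above); names are illustrative:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D convolution as used in CNNs: the same small filter is
    reused, unaltered, at every position of the image."""
    f_h, f_w = kernel.shape
    out_h = image.shape[0] - f_h + 1   # n_out = (n_in - f) + 1
    out_w = image.shape[1] - f_w + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # dot product between the filter and the local image patch
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
print(conv2d_valid(img, np.ones((3, 3))).shape)  # (2, 2): the output shrinks
```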
Applying a convolution, the output naturally shrinks: n_out = (n_in − f) + 1, where f is the filter size. We can avoid this by adding 0s at the input border (padding): with padding p, n_out = (n_in + 2p − f) + 1.
- p = 0, n_out < n_in: called "valid" convolution
- p chosen so that n_out = n_in: called "same" convolution
When cascading multiple convolution operations it is useful to introduce the receptive field (RF) of a unit: the region of the input space from which a given neuron receives information.
- in the first layer the RF is equal to the filter size: r₁ = f₁
- it grows by f − 1 at each next layer; for the k-th one, recursively: r_k = r_{k−1} + (f_k − 1)
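The RF recursion can be turned into a small helper (a sketch; the function name is mine, and stride 1 is assumed throughout, as in the slide):

```python
def receptive_field(filter_sizes):
    """Receptive-field size after a stack of stride-1 convolutions,
    using the recursion r_k = r_{k-1} + (f_k - 1), with r_1 = f_1."""
    r = filter_sizes[0]
    for f in filter_sizes[1:]:
        r += f - 1
    return r

print(receptive_field([3, 3]))     # 5: two 3x3 convs see a 5x5 input patch
print(receptive_field([3, 3, 3]))  # 7
```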
Modern CNNs use very small filters (e.g. 3×3), but to develop selectivity for meaningful patterns we need larger RFs, so we may want to make them grow faster. Strided convolution changes the "step" s of the filter displacement, and also acts as a downsampling, greatly reducing output size. In this way RF size can grow faster:

r_k = r_{k−1} + (f_k − 1) ∏_{i=1}^{k−1} s_i

Considering stride (and padding), the output size will be: n_out = ⌊(n_in + 2p − f) / s⌋ + 1, with s = stride, p = padding, f = filter dimension.
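Both formulas above can be sketched as small helpers (names are illustrative; the product over strides is carried along as a running "jump" factor):

```python
import math

def conv_output_size(n_in, f, p=0, s=1):
    """n_out = floor((n_in + 2p - f) / s) + 1"""
    return math.floor((n_in + 2 * p - f) / s) + 1

def receptive_field_strided(filters, strides):
    """r_k = r_{k-1} + (f_k - 1) * prod(s_1 .. s_{k-1}):
    with stride > 1 the RF grows faster with depth."""
    r, jump = filters[0], strides[0]
    for f, s in zip(filters[1:], strides[1:]):
        r += (f - 1) * jump
        jump *= s
    return r

print(conv_output_size(7, 3, p=1, s=2))          # 4
print(receptive_field_strided([3, 3], [2, 1]))   # 7: the stride-2 first layer doubles the growth
```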
Convolutional layer: we are learning multiple filters, each acting on all input channels together. Each filter's output forms a "feature map", and the different feature maps are stacked along the third dimension (as different channels). Size of the output volume:
- nx_out = ⌊(nx_in + 2p − fx) / s⌋ + 1
- ny_out = ⌊(ny_in + 2p − fy) / s⌋ + 1
- nc_out = nf, with nf = number of filters
with s = stride, p = padding, fx, fy = filter size, nx_in, ny_in = input size.
Karpathy 2016
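The output-volume bookkeeping can be sketched in one function (a hypothetical helper, applying the formulas above; integer division implements the floor):

```python
def conv_layer_output_shape(nx_in, ny_in, fx, fy, n_filters, p=0, s=1):
    """Output volume of a convolutional layer: spatial dims follow the
    usual formula, depth equals the number of learned filters."""
    nx_out = (nx_in + 2 * p - fx) // s + 1
    ny_out = (ny_in + 2 * p - fy) // s + 1
    return nx_out, ny_out, n_filters

# e.g. a 32x32 input through six 5x5 filters ("valid", stride 1)
print(conv_layer_output_shape(32, 32, 5, 5, 6))  # (28, 28, 6)
```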
Example of convolution with an edge-detecting filter, the Sobel filter:

 1  2  1
 0  0  0
−1 −2 −1

from setosa.io
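The Sobel example can be reproduced directly (a self-contained sketch; the filtering loop repeats the "valid" convolution defined earlier so this snippet stands alone):

```python
import numpy as np

# the Sobel filter from the slide: responds to horizontal edges
sobel = np.array([[ 1.,  2.,  1.],
                  [ 0.,  0.,  0.],
                  [-1., -2., -1.]])

def correlate2d_valid(image, kernel):
    """Minimal 'valid' filtering loop (no external dependencies)."""
    fh, fw = kernel.shape
    oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * kernel)
    return out

# a horizontal edge: bright above, dark below -> strong response at the edge
img = np.vstack([np.ones((3, 5)), np.zeros((3, 5))])
response = correlate2d_valid(img, sobel)
print(response.max())  # 4.0, peaking at the edge rows
```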
Solving the FCN's bad scaling, learning convolutional filters enforces:
1. Sparsity of connections: each neuron is connected to a small region of the input only (localized receptive field).
2. Parameter sharing: the whole input space is tiled with RFs re-using the same parameters (feature maps).
This is reminiscent of how visual information is represented across the brain surface: retinotopic maps of localized feature detectors.
We may want to hardwire some amount of translation tolerance into our network: convolutionally apply a pooling operation, a nonlinear blur-and-downsampling that "replaces" input subregions with their max value (a "max" filter):

O_{i,j} = max(pool_{i,j}), with pool_{i,j} = {I_{i−k, j−l} : k = 1..fy, l = 1..fx}

Usually done with fx = fy = 2 and s = 2, i.e. s = fx = fy, to have non-overlapping subregions. Example (2×2 max pool, stride 2):

1 1 2 4
5 6 7 8      6 8
3 2 1 0  →   3 4
1 2 3 4
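The worked example above can be checked with a short NumPy sketch (function name is mine; non-overlapping windows assumed, as in the slide):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Non-overlapping max pooling (f = s = 2): each f x f block of the
    feature map is replaced by its maximum value."""
    oh, ow = (x.shape[0] - f) // s + 1, (x.shape[1] - f) // s + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1., 1., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool(x))  # [[6. 8.]
                    #  [3. 4.]]
```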
The pooling operation is applied to convolutional layer volumes independently for each feature map. Dimensions of the output volume:
- nx_out = ⌊(nx_in + 2p − fx) / s⌋ + 1
- ny_out = ⌊(ny_in + 2p − fy) / s⌋ + 1
- nc_out = nf
But since usually p = 0, s = 2 and fx = fy = 2: nx_out = nx_in / 2 and ny_out = ny_in / 2. For RF size calculation the old formula still holds.
- less computationally expensive: the representation size is reduced by 75%
- less likely to overfit
Karpathy 2016
Max-like pooling computations are thought to underlie the build-up of transformation tolerance observed through the primate shape-processing stream. A classical example: V1 simple and complex cells. Simple cells are position-selective oriented edge detectors; max pooling over simple cells yields complex cells, position-tolerant oriented edge detectors.
Lee et al. 2009. Stacking conv and pool stages (conv1, pool1, conv2, pool2, conv3, pool3, conv4) maps an input image to an output representation from which task-relevant information is read out. Features become more and more abstract along the hierarchy, combining simpler features to build more complex ones: from low-level, local, non-categorical, transformation-sensitive, to high-level, global, categorical, transformation-invariant.
We can consider stacks of convolutional layers as visual feature extractors. Features learned in solving one supervised task can frequently be useful in different contexts, so there is no need to learn every feature from scratch for new tasks: transfer learning re-uses the first N layers of a network with pre-trained weights (on a different task). How far in depth to push N depends on how distant the task domains involved are:
- far domains (e.g. face recognition & satellite image classification): only low-level features in common → low N
- close domains (e.g. face recognition & emotion recognition): common high-level features → high N
This extends the applicability of deep learning to the small-data regime.
Imagine starting with a trained face recognition system (input image (face) → conv1, pool1, conv2, pool2, conv3, pool3, conv4 → features → softmax layer outputting p(identity | face)); now you want a car model recognition one. High-level features will be poorly transferable (too domain-specific): strip away the last layers!
You are left with a general-purpose middle-level feature extractor. On top of that, stick some new conv layers and a new softmax output: with (much less) training you will build new car-specific high-level features and a working classifier outputting p(model | car).
The hierarchical structure of CNN layers (and features) may be interpreted as reflecting the compositionality of the visual world (objects are made of parts and subparts, etc.). It is also reminiscent of the anatomical and functional hierarchy of the visual pathways. Along the ventral stream (V1 → V2 → V4 → PIT → CIT → AIT):
- response latency increases
- RF size increases
- tuning complexity increases
- transformation tolerance increases
- linear decodability increases
Huberman et al. 2011
This kind of hierarchical brain processing of visual shape information has been modelled throughout the years ('80s, '90s), from Fukushima's Neocognitron to Poggio's HMAX model, alternating two kinds of stages (S1, C1, S2, C2):
- S layers: shape selectivity build-up (AND-like operations)
- C layers: transformation tolerance build-up (OR-like operations)
Biologically-derived ideas instantiated by these models inspired the birth of modern CNN architectures.
Riesenhuber & Poggio 1999
The first of these was Yann LeCun's LeNet ('98), the first successful convnet (handwritten digit recognition):
- first application of a stack of conv and pool layers followed by fully connected ones
- conv filter size 5×5 (p = 0, i.e. "valid", s = 1)
- pooling filter size 2×2 (p = 0, s = 2)
- shallow: 2 conv layers interleaved with pooling
- ≈ 60 · 10³ parameters (small)
Ng 2017