Convolutional Neural Networks


1. Convolutional Neural Networks. Presented by Tristan Maidment, adapted from Ke Yu's slides.

2. Outline • Neural Network recap • Building blocks of CNNs • Architecture of CNNs • Visualizing and understanding CNNs • More applications

3. Neural Network Recap

4. Multilayer Perceptron (MLP)
Fully-connected (FC) layer • A layer has full connections to all activations in the previous layer. For the three-layer example:
$a^{[1]} = \sigma(W^{[1]} X + b^{[1]})$, with $W^{[1]} \sim (4,3)$, $X \sim (3,n)$, $a^{[1]} \sim (4,n)$
$a^{[2]} = \sigma(W^{[2]} a^{[1]} + b^{[2]})$, with $W^{[2]} \sim (4,4)$, $a^{[1]} \sim (4,n)$, $a^{[2]} \sim (4,n)$
$\hat{y} = g(W^{[3]} a^{[2]} + b^{[3]})$, with $W^{[3]} \sim (1,4)$, $a^{[2]} \sim (4,n)$, $\hat{y} \sim (1,n)$
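To make the shapes concrete, here is a minimal NumPy sketch of this forward pass, assuming a batch of n examples stored as columns and using a sigmoid for both σ and the output nonlinearity g (the slide does not fix g, so that choice is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, params):
    """Forward pass for the 3-layer MLP above; X has shape (3, n)."""
    W1, b1, W2, b2, W3, b3 = params
    a1 = sigmoid(W1 @ X + b1)        # (4, n)
    a2 = sigmoid(W2 @ a1 + b2)       # (4, n)
    y_hat = sigmoid(W3 @ a2 + b3)    # (1, n); g taken to be a sigmoid here
    return y_hat

rng = np.random.default_rng(0)
n = 5                                 # batch of n examples, 3 features each
params = (rng.standard_normal((4, 3)), np.zeros((4, 1)),
          rng.standard_normal((4, 4)), np.zeros((4, 1)),
          rng.standard_normal((1, 4)), np.zeros((1, 1)))
print(mlp_forward(rng.standard_normal((3, n)), params).shape)  # (1, 5)
```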

5. Activation Functions
• $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• $\sigma(x) = \frac{1}{1 + e^{-x}}$
• ReLU: $\max(0, x)$
• Leaky ReLU: $\max(0.1x, x)$
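A small NumPy sketch of these four activations (the 0.1 slope for Leaky ReLU follows the slide; other slopes are also common):

```python
import numpy as np

def tanh(x):                      # (e^x - e^-x) / (e^x + e^-x)
    return np.tanh(x)

def sigmoid(x):                   # 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):                      # max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.1):     # max(0.1x, x) for the slope on the slide
    return np.maximum(slope * x, x)

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x), sep="\n")
```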

6. Backpropagation Algorithm
1. The network is initialized with randomly chosen weights.
2. Run forward propagation to get all intermediates $z^{[l]}, a^{[l]}$.
3. Compute the cost function $J(W, b)$.
4. The network back-propagates the error and calculates the gradients.
5. Adjust the weights of the network: $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$, $b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$.
6. Repeat the above steps until the error is acceptable.
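A minimal sketch of step 5, the gradient-descent update, assuming the gradients dW and db have already been computed in step 4 and using α as the learning rate:

```python
import numpy as np

def gradient_step(W, b, dW, db, alpha=0.01):
    """Step 5: move each parameter against its gradient."""
    W -= alpha * dW
    b -= alpha * db
    return W, b

W, b = np.ones((4, 3)), np.zeros((4, 1))
dW, db = 0.5 * np.ones((4, 3)), 0.1 * np.ones((4, 1))
print(gradient_step(W, b, dW, db)[0][0, 0])   # 1 - 0.01 * 0.5 = 0.995
```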

7. Compute Gradients
Forward pass: $z^{[1]} = W^{[1]} x + b^{[1]}$, $a^{[1]} = \sigma(z^{[1]})$, $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$, $\hat{y} = \sigma(z^{[2]})$
Loss: $\mathcal{L}(\hat{y}, y) = -\,(y \log \hat{y} + (1-y)\log(1-\hat{y}))$
Backward pass:
$d\hat{y} = \frac{d\mathcal{L}}{d\hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$
$dz^{[2]} = \hat{y} - y$, $dW^{[2]} = dz^{[2]}\, a^{[1]\top}$, $db^{[2]} = dz^{[2]}$
$da^{[1]} = W^{[2]\top} dz^{[2]}$, $dz^{[1]} = W^{[2]\top} dz^{[2]} * \sigma'(z^{[1]})$
$dW^{[1]} = dz^{[1]}\, x^{\top}$, $db^{[1]} = dz^{[1]}$
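The same forward and backward pass can be written compactly in NumPy. This is a sketch for a single-hidden-layer network with sigmoid activations and the binary cross-entropy loss above; the gradient formulas mirror the slide, averaged over a batch of m examples (variable names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    m = x.shape[1]                       # number of examples in the batch
    # forward pass
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    y_hat = sigmoid(z2)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # backward pass (gradients averaged over the batch)
    dz2 = y_hat - y                      # dL/dz2 for sigmoid + cross-entropy
    dW2 = dz2 @ a1.T / m
    db2 = dz2.mean(axis=1, keepdims=True)
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)   # sigma'(z1) = a1 * (1 - a1)
    dW1 = dz1 @ x.T / m
    db1 = dz1.mean(axis=1, keepdims=True)
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
x, y = rng.standard_normal((3, 8)), rng.integers(0, 2, (1, 8))
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))
print(forward_backward(x, y, W1, b1, W2, b2)[0])
```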

8. Optimization – Learning Rate and Momentum
• Stochastic gradient descent (mini-batch gradient descent)
• SGD with momentum prevents oscillations:
$v_{dW} = \gamma v_{dW} + (1-\gamma)\,dW$, $W = W - \alpha\, v_{dW}$; $v_{db} = \gamma v_{db} + (1-\gamma)\,db$, $b = b - \alpha\, v_{db}$
• Adaptive learning rate
– RMSProp: $s_{dW} = \gamma s_{dW} + (1-\gamma)\,dW^2$, $W = W - \alpha\,\frac{dW}{\sqrt{s_{dW}}}$
– Adam: $v_{dW} = \beta_1 v_{dW} + (1-\beta_1)\,dW$, $s_{dW} = \beta_2 s_{dW} + (1-\beta_2)\,dW^2$, $v^{\mathrm{corr}}_{dW} = \frac{v_{dW}}{1-\beta_1^t}$, $s^{\mathrm{corr}}_{dW} = \frac{s_{dW}}{1-\beta_2^t}$, $W = W - \alpha\,\frac{v^{\mathrm{corr}}_{dW}}{\sqrt{s^{\mathrm{corr}}_{dW}} + \epsilon}$
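As an illustration, here is a sketch of one Adam step for a single parameter tensor, combining the momentum and RMSProp terms with bias correction as above; the default hyperparameters are common choices, not values from the slide:

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter tensor W with gradient dW at step t."""
    v = beta1 * v + (1 - beta1) * dW          # momentum term
    s = beta2 * s + (1 - beta2) * dW ** 2     # RMSProp-style squared-gradient term
    v_corr = v / (1 - beta1 ** t)             # bias correction
    s_corr = s / (1 - beta2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return W, v, s

W = np.ones((4, 3))
v, s = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 4):                         # a few steps with a constant fake gradient
    W, v, s = adam_step(W, 0.5 * np.ones_like(W), v, s, t)
print(W[0, 0])
```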

9. Regularization
• Parameter regularization:
– Add an L1 (Lasso), L2 (Ridge), or combined (Elastic Net) penalty to the cost function
– Other norms are computationally inefficient
• Dropout
– Forward: multiply the output of a hidden layer by a mask of 0s and 1s randomly drawn from a Bernoulli distribution, removing all links to the dropped-out nodes
– Backward: do gradient descent through the diminished network
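A sketch of dropout as described above: a Bernoulli mask zeroes out activations on the forward pass and blocks their gradients on the backward pass (keep_prob is an illustrative choice):

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, rng=None):
    """Forward: multiply activations by a Bernoulli(keep_prob) mask of 0s and 1s."""
    rng = rng or np.random.default_rng(0)
    mask = (rng.random(a.shape) < keep_prob).astype(a.dtype)
    # Dropped units contribute nothing to the next layer. In practice the output
    # is often also divided by keep_prob ("inverted dropout") so the expected
    # activation is unchanged; the slide omits that detail.
    return a * mask, mask

def dropout_backward(d_out, mask):
    """Backward: gradients flow only through the units that were kept."""
    return d_out * mask

out, mask = dropout_forward(np.ones((4, 5)))
print(out)
```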

10. Convolutional Neural Network Building Blocks

11. Why not just use an MLP for images? • An MLP connects each pixel in an image to each neuron and suffers from the curse of dimensionality, so it does not scale well to higher-resolution images. • For example: for a small 200×200-pixel RGB image, the first FC weight matrix would have 200×200×3×#neurons = 120,000×#neurons parameters for the first layer alone.
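The arithmetic behind that count, with a hypothetical layer width just for illustration:

```python
# Rough parameter count for the first FC layer on a 200x200 RGB image.
inputs = 200 * 200 * 3            # 120,000 input values per image
hidden_units = 1000               # hypothetical layer width, for illustration only
print(inputs)                     # 120000 weights per neuron
print(inputs * hidden_units)      # 120,000,000 weights for this one layer
```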

12. Convolution Operation
General form: $s(t) = \int f(a)\, g(t - a)\, da$
Denoted by: $s(t) = (f * g)(t)$
Network terminology: $f$: input, usually a multidimensional array; $g$: kernel or filter; $s$: output, referred to as the feature map
• In practice, CNNs generally use kernels without flipping (i.e. cross-correlation)
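A sketch of "valid" 2D cross-correlation, i.e. convolution without kernel flipping, written as explicit loops for clarity (real libraries use much faster implementations):

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """'Valid' 2D cross-correlation (convolution without kernel flipping)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])
print(cross_correlate2d(image, kernel).shape)   # (4, 4) feature map
```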

13. Fast Fourier Transforms on GPUs • Convolution theorem: the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms: $\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}$, so $f * g = \mathcal{F}^{-1}\{\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\}$ • The fast Fourier transform (FFT) reduces the complexity of convolution from $O(n^2)$ to $O(n \log n)$ • GPU-accelerated FFT implementations (via NVIDIA CUDA) can run up to 10 times faster than CPU-only alternatives.
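A quick NumPy check of the convolution theorem: zero-pad both signals to the full output length, multiply their spectra, and invert the FFT; the result matches direct convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)     # input signal
w = rng.standard_normal(9)       # kernel

# Convolution theorem: convolve by multiplying spectra, then inverting the FFT.
n = len(x) + len(w) - 1          # length of the full linear convolution
fft_conv = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(w, n), n)

direct = np.convolve(x, w)       # direct convolution for comparison
print(np.allclose(fft_conv, direct))   # True
```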

14. 2D Convolution Operation An example of 2D convolution without kernel flipping. Boxes connected by arrows indicate how the upper-left element of the output is formed by applying the kernel to the corresponding upper-left region of the input. This process is sometimes called template matching: the inner product between a kernel and a patch of the image is maximized exactly when those two vectors match up.

15. Examples of kernel effects
Identity: [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
Edge detection 1: [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
Edge detection 2: [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
Box blur: (1/9) · [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
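These kernels can be written directly as small arrays and applied with cross-correlation; the SciPy call below is one convenient way to do it (the random image is only a stand-in for a real one):

```python
import numpy as np
from scipy.signal import correlate2d   # cross-correlation, as CNNs use

kernels = {
    "identity":        np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], float),
    "edge_detection1": np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float),
    "edge_detection2": np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], float),
    "box_blur":        np.ones((3, 3)) / 9.0,
}

rng = np.random.default_rng(0)
image = rng.random((8, 8))              # stand-in for a real grayscale image
for name, k in kernels.items():
    out = correlate2d(image, k, mode="same", boundary="fill", fillvalue=0)
    print(name, out.shape)
```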

16. Motivation 1: Local Connectivity • In FC layers, every output unit interacts with every input unit. • Because the kernel is usually smaller than the input, CNNs typically have sparse interactions. • Storing fewer parameters both reduces the memory requirements and improves statistical efficiency. • Computing the output requires fewer operations.

17. Motivation 1: Local Connectivity – Growing Receptive Fields • In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input. • This allows the network to efficiently describe complicated interactions by composing simple building blocks that each describe only sparse interactions. • For example, h3 is connected to 3 input variables, while g3 is connected to all 5 input variables through indirect connections.
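For stride-1 convolutions of width k and no pooling, the receptive field grows by k − 1 per layer, which reproduces the 3-input and 5-input examples above:

```python
# Receptive field of one output unit after stacking 1D convolutions of width k
# (stride 1, no pooling): it grows by k - 1 with every additional layer.
def receptive_field(num_layers, k=3):
    return num_layers * (k - 1) + 1

print(receptive_field(1))   # 3  (h3 sees 3 inputs directly)
print(receptive_field(2))   # 5  (g3 sees all 5 inputs indirectly)
```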

18. Motivation 2: Parameter Sharing • In a traditional neural network, each element of the weight matrix is used exactly once when computing the output of a layer. • In a convolutional neural network, each member of the kernel is used at every position of the input (except some of the boundary pixels). • Parameter sharing means that rather than learning a separate set of parameters for every location, we learn only one set. • This further reduces the storage requirements for the model parameters, so convolution is dramatically more efficient than dense matrix multiplication in terms of both memory requirements and statistical efficiency.

19. Motivation 2: Parameter Sharing Input size: 320 by 280. Kernel size: 2 by 1. Output size: 319 by 280. • The image on the right is formed by taking each pixel and subtracting the value of its neighboring pixel; the output image shows the vertically oriented edges. • The input image is 280 pixels tall and 320 pixels wide; the output image is 319 pixels wide. • The CNN stores 2 parameters, while describing the same transformation with a matrix multiplication would need 320×280×319×280 > 8e9 weights.
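The parameter counts from this example, worked out explicitly:

```python
# Parameter count for the edge-detection example above.
in_h, in_w = 280, 320            # input: 280 pixels tall, 320 wide
out_h, out_w = 280, 319          # output after the 2-element kernel [1, -1]
conv_params = 2                  # the shared kernel has just two weights
dense_params = (in_h * in_w) * (out_h * out_w)
print(conv_params, dense_params) # 2 vs 8,003,072,000 (> 8e9)
```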

20. Motivation 3: Equivariance to Translation • Parameter sharing causes the layer to have a property known as equivariance to translation. • With images, convolution creates a 2D feature map. If we move the object in the input, its representation will move the same amount in the output. • Experiments have shown that many CNNs detect simple edges in the first layer. The same edges appear everywhere in the image, so the same kernel can be used to extract features throughout.
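Equivariance is easy to check numerically: shifting the input and then convolving gives the same result as convolving and then shifting the output (circular shifts and a wrap-around boundary keep the comparison exact at the edges):

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.random((16, 16))
kernel = np.array([[1., 0., -1.]])           # simple horizontal edge detector

def conv(x):
    # 'wrap' boundary keeps the demonstration exact for circular shifts
    return correlate2d(x, kernel, mode="same", boundary="wrap")

shift = (0, 3)                                # move the input 3 pixels to the right
shifted_then_conv = conv(np.roll(image, shift, axis=(0, 1)))
conv_then_shifted = np.roll(conv(image), shift, axis=(0, 1))
print(np.allclose(shifted_then_conv, conv_then_shifted))   # True: equivariance
```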

21. Padding
Example: a 6 by 6 input convolved with a 3 by 3 kernel gives a 4 by 4 output.
Downsides of convolution:
• The image shrinks after applying the convolution operation, so in a very deep neural network we end up with a very small output after many steps.
• Pixels on the corners or edges are used much less than pixels in the middle, so a lot of information from the edges of the image is thrown away.
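The shrinkage follows the usual output-size formula, sketched below; with padding p the output width becomes n + 2p − f + 1:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width for an n-wide input, f-wide kernel, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # 4: the 6x6 -> 4x4 shrinkage above
print(conv_output_size(6, 3, p=1))   # 6: padding by one pixel keeps the size
```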

22. Zero Padding
Example: padding the 6 by 6 input with a border of zeros gives an 8 by 8 input, and convolving it with a 3 by 3 kernel gives back a 6 by 6 output.
• Pad the image with additional border(s)
• Set the pixel values on the border to 0
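In NumPy, zero padding is a one-liner with np.pad:

```python
import numpy as np

image = np.ones((6, 6))
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)      # (8, 8): a one-pixel border of zeros around the 6x6 input
print(padded[0])         # the first row is all zeros
```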

23. Zero Padding Graph • Consider a filter of width six at every layer. • Starting from an input of sixteen pixels, without zero padding we are only able to have three convolutional layers (the width shrinks from 16 to 11 to 6 to 1). • Adding five zeros to each layer prevents the representation from shrinking with depth.
