  1. Presentation about Deep Learning --- Zhongwu xie

  2. Contents: 1. Brief introduction to deep learning. 2. Brief introduction to backpropagation. 3. Brief introduction to convolutional neural networks.

  3. Deep learning

  4. I. Introduction to Deep Learning. Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. --- Ian Goodfellow

  5. I. Introduction to Deep Learning. The plot on the left is a Venn diagram showing that deep learning is a kind of representation learning, which is in turn a kind of machine learning. The plot on the right shows that a deep learning model has multiple layers.

  6. I. What is Deep Learning?
  • Data: (x_i, y_i), 1 ≤ i ≤ m
  • Model: ANN
  • Criterion:
    - Cost function: L(y, f(x))
    - Empirical risk minimization: R(θ) = (1/m) Σ_{i=1}^{m} L(y_i, f(x_i; θ))
    - Regularization: ‖w‖₁, ‖w‖₂², early stopping, dropout
    - Objective function: min_θ R(θ) + λ · (regularization function)
  • Algorithm: BP, gradient descent
  Learning is cast as optimization.
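
To make "learning is cast as optimization" concrete, here is a minimal sketch (not from the slides) that trains the simplest possible ANN, a single sigmoid neuron, by gradient descent on a regularized empirical risk; the toy data, variable names, and hyperparameters are illustrative assumptions.

    # Minimal sketch: empirical risk minimization with L2 regularization,
    # optimized by gradient descent on a single sigmoid neuron.
    # All data and hyperparameters below are illustrative placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                 # m = 100 examples, 3 features
    y = (X[:, 0] + X[:, 1] > 0).astype(float)     # toy labels

    w, b = np.zeros(3), 0.0
    lam, alpha = 0.01, 0.1                        # regularization strength, learning rate

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for step in range(1000):
        a = sigmoid(X @ w + b)                                    # model output f(x; w, b)
        risk = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))  # empirical risk R(theta)
        objective = risk + lam * np.sum(w ** 2)                   # R(theta) + lambda * regularizer
        dz = (a - y) / len(y)
        w -= alpha * (X.T @ dz + 2 * lam * w)                     # gradient descent step
        b -= alpha * dz.sum()
        if step % 200 == 0:
            print(step, float(objective))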

  7. II. Why do we need to learn deep learning? --- Efficiency. Famous instances: self-driving cars, AlphaGo.
  • Speech recognition --- phoneme error rate on TIMIT: HMM-GMM systems in the 1990s: about 26%; restricted Boltzmann machines (RBMs) in 2009: 20.7%; LSTM-RNN in 2013: 17.7%.
  • Computer vision --- the top-5 error of the ILSVRC 2017 classification task was 2.251%, while human performance is about 5.1%.
  • Natural language processing --- language models (n-gram), machine translation.
  • Recommender systems --- recommending ads, social network news feeds, movies, jokes, advice from experts, etc.

  8. Backward propagation

  9. I. Introduction to Notation. A single neuron with inputs x_1, x_2, x_3 computes z = w^T x + b and a = ŷ = g(z); layer 0 is the input layer, layer 1 the hidden layer, layer 2 the output layer. w^[l]_{jk} is the weight from the j-th neuron in the (l−1)-th layer to the k-th neuron in the l-th layer; for example, w^[2]_{43}.

  10. I. Introduction to Forward Propagation and Notation. For the first layer (3 inputs, 4 hidden units), each unit computes
  z^[1]_j = Σ_{k=1}^{3} w^[1]_{kj} x_k + b^[1]_j  and  a^[1]_j = σ(z^[1]_j),  j = 1, …, 4,
  which in matrix form is
  z^[1] = W^[1] x + b^[1]  and  a^[1] = σ(z^[1]),  where σ is the sigmoid function,
  with W^[1] the 4×3 matrix of weights and b^[1] the vector of biases; the output of the network is ŷ = a.
  Cost function: L(a, y), with dW^[1] = ∂L(a, y)/∂W^[1] and db^[1] = ∂L(a, y)/∂b^[1].
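
As an illustration, a minimal NumPy sketch of this forward pass for the 3-input, 4-hidden-unit, 1-output network drawn on the slide; the random weights and the names W1, b1, W2, b2 are assumptions for the example, not values from the presentation.

    # Forward propagation for a 3 -> 4 -> 1 sigmoid network,
    # following the z^[l] = W^[l] a^[l-1] + b^[l], a^[l] = sigma(z^[l]) notation.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 1))                          # input column vector

    W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # layer 1: 3 -> 4
    W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # layer 2: 4 -> 1

    z1 = W1 @ x + b1                                     # z^[1] = W^[1] x + b^[1]
    a1 = sigmoid(z1)                                     # a^[1] = sigma(z^[1])
    z2 = W2 @ a1 + b2                                    # z^[2] = W^[2] a^[1] + b^[2]
    y_hat = sigmoid(z2)                                  # y_hat = a^[2]
    print(y_hat.shape)                                   # (1, 1)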

  11. II. Backward Propagation --- the chain rule. If x = f(w), y = f(x), and z = f(y), then ∂z/∂w = (∂z/∂y) · (∂y/∂x) · (∂x/∂w). The functions of a neural network are composed in the same way, so we can use the chain rule to compute the gradients of the neural network: x, w, b → z = w^T x + b → a = σ(z) → L(a, y).
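
A small numerical illustration of the chain rule (an added example, not from the slides): with f taken to be the sigmoid, the product of the local derivatives matches a finite-difference estimate of dz/dw for z = f(f(f(w))).

    # Chain rule check: z = f(y), y = f(x), x = f(w) with f = sigmoid.
    import numpy as np

    def f(t):
        return 1.0 / (1.0 + np.exp(-t))

    def df(t):                                  # derivative of the sigmoid
        return f(t) * (1.0 - f(t))

    w = 0.7
    x, y = f(w), f(f(w))
    chain = df(y) * df(x) * df(w)               # (dz/dy)(dy/dx)(dx/dw)

    eps = 1e-6
    numeric = (f(f(f(w + eps))) - f(f(f(w - eps)))) / (2 * eps)
    print(chain, numeric)                       # the two agree to ~1e-10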

  12. II. Backward Propagation --- the chain rule applied to a two-layer network.
  Loss: L(a, y) = −[y log a + (1 − y) log(1 − a)]
  Forward: z^[1] = W^[1] x + b^[1],  a^[1] = σ(z^[1]),  z^[2] = W^[2] a^[1] + b^[2],  a^[2] = σ(z^[2]),  L(a^[2], y)
  Backward:
  da^[2] = ∂L(a, y)/∂a^[2] = −y/a^[2] + (1 − y)/(1 − a^[2])
  dz^[2] = ∂L(a, y)/∂z^[2] = (∂L/∂a^[2]) × (∂a^[2]/∂z^[2]) = a^[2] − y
  dW^[2] = ∂L(a, y)/∂W^[2] = dz^[2] (a^[1])^T
  db^[2] = ∂L(a, y)/∂b^[2] = dz^[2]
  dz^[1] = ∂L(a, y)/∂z^[1] = (W^[2])^T dz^[2] ∗ σ′(z^[1])
  dW^[1] = ∂L(a, y)/∂W^[1] = dz^[1] x^T
  db^[1] = ∂L(a, y)/∂b^[1] = dz^[1]
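
The same derivation written as a minimal NumPy sketch for one training example; the shapes and random data are illustrative assumptions, but the gradient formulas mirror dz^[2], dW^[2], db^[2], dz^[1], dW^[1], db^[1] above.

    # Forward and backward pass of a 2-layer sigmoid network with cross-entropy loss.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 1))
    y = 1.0

    W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
    W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

    # forward
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    loss = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))

    # backward (chain rule, layer by layer)
    dz2 = a2 - y                         # dL/dz^[2]
    dW2 = dz2 @ a1.T                     # dL/dW^[2]
    db2 = dz2                            # dL/db^[2]
    dz1 = W2.T @ dz2 * a1 * (1 - a1)     # dL/dz^[1] = W^[2]T dz^[2] * sigma'(z^[1])
    dW1 = dz1 @ x.T                      # dL/dW^[1]
    db1 = dz1                            # dL/db^[1]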

  13. II. Summary: The Backpropagation. A small change Δw^l_{jk} in one weight perturbs the cost through every path from that weight to the output:
  ΔC ≈ Σ_{mnp…q} (∂C/∂a^L_m)(∂a^L_m/∂a^{L−1}_n)(∂a^{L−1}_n/∂a^{L−2}_p) ··· (∂a^{l+1}_q/∂a^l_j)(∂a^l_j/∂w^l_{jk}) Δw^l_{jk},
  so that
  ∂C/∂w^l_{jk} = Σ_{mnp…q} (∂C/∂a^L_m)(∂a^L_m/∂a^{L−1}_n) ··· (∂a^{l+1}_q/∂a^l_j)(∂a^l_j/∂w^l_{jk}).
  "The backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost." --- Michael Nielsen

  14. II. Summary: The Backpropagation algorithm.
  1. Input x: set the corresponding activation a^[0] for the input layer.
  2. Feedforward: for each l = 1, 2, …, L compute z^[l] = W^[l] a^[l−1] + b^[l] and a^[l] = σ(z^[l]).
  3. Output error dz^[L]: dz^[L] = a^[L] − y.
  4. Backpropagate the error: for each l = L−1, L−2, …, 1 compute dz^[l] = (W^[l+1])^T dz^[l+1] ∗ σ′(z^[l]).
  5. Output: the gradient of the cost function is given by dW^[l] = ∂L(a, y)/∂W^[l] = dz^[l] (a^[l−1])^T and db^[l] = ∂L(a, y)/∂b^[l] = dz^[l].
  Update w^[l]_{jk} and b^[l]_j:
  w^[l]_{jk} := w^[l]_{jk} − α ∂L(a, y)/∂w^[l]_{jk}
  b^[l]_j := b^[l]_j − α ∂L(a, y)/∂b^[l]_j
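
Steps 1-5 plus the parameter update can be collected into one generic loop over L layers; the sketch below assumes sigmoid activations throughout and stores the per-layer weights and biases in Python lists W and b (names and structure are my own, not from the slides).

    # One step of backpropagation + gradient descent for an L-layer sigmoid network.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(W, b, x, y, alpha=0.1):
        L = len(W)
        a, z = [x], []
        for l in range(L):                               # 2. feedforward
            z.append(W[l] @ a[-1] + b[l])
            a.append(sigmoid(z[-1]))
        dz = a[-1] - y                                   # 3. output error dz^[L]
        for l in reversed(range(L)):                     # 4.-5. backward pass
            dW, db = dz @ a[l].T, dz                     # gradients for this layer
            if l > 0:
                dz = W[l].T @ dz * a[l] * (1 - a[l])     # propagate error to the previous layer
            W[l] -= alpha * dW                           # gradient-descent update
            b[l] -= alpha * db
        return W, b

For example, for the 3-4-1 network of the earlier slides one could call backprop_step with W = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))] and b = [np.zeros((4, 1)), np.zeros((1, 1))] (again, assumed shapes for illustration).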

  15. Convolutional Neural Networks

  16. 1. Types of layers in a convolutional network: • Convolution • Pooling • Fully connected

  17. 2.1 Convolution in a neural network. Example: convolving a 6×6 image whose left three columns are 10 and right three columns are 0 with the 3×3 vertical-edge filter
      1 0 −1
      1 0 −1
      1 0 −1
  gives a 4×4 output whose two middle columns are 30 and whose outer columns are 0: the filter detects the vertical edge in the middle of the image.
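
The edge-detection example above can be checked with a few lines of NumPy (an added sketch; the loop implements the valid cross-correlation that deep learning libraries call "convolution").

    # 6x6 image, left half 10 and right half 0, convolved with a 3x3 vertical-edge filter.
    import numpy as np

    image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
    kernel = np.array([[1, 0, -1]] * 3, dtype=float)

    out = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

    print(out)
    # [[ 0. 30. 30.  0.]
    #  [ 0. 30. 30.  0.]
    #  [ 0. 30. 30.  0.]
    #  [ 0. 30. 30.  0.]]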

  18. 2.2 Multiple filters. Convolving a 6×6×3 input with a 3×3×3 filter gives a 4×4 output; with two 3×3×3 filters the results stack into a 4×4×2 volume (see the sketch below). Why convolutions? --- Parameter sharing. --- Sparsity of connections.
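
A minimal sketch of the multi-filter case (random filter values, purely illustrative): a 6×6×3 input convolved with two 3×3×3 filters gives two 4×4 feature maps, stacked into a 4×4×2 output.

    # Multi-channel convolution with 2 filters.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 6, 3))              # height x width x channels
    filters = rng.normal(size=(2, 3, 3, 3))     # 2 filters, each 3x3x3

    out = np.zeros((4, 4, 2))
    for k in range(2):                          # one output channel per filter
        for i in range(4):
            for j in range(4):
                out[i, j, k] = np.sum(x[i:i+3, j:j+3, :] * filters[k])

    print(out.shape)                            # (4, 4, 2)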

  19. 3. Pooling layers
  • Max pooling. Hyperparameters: f (filter size), s (stride); max or average pooling.
    Example: max pooling with a 2×2 filter and stride 2 maps the 4×4 input
      1 3 2 1
      2 9 1 1
      1 3 2 3
      5 6 1 2
    to the 2×2 output
      9 2
      6 3
    (a short sketch of this computation follows below).
  • Pooling removes redundant information from the convolutional layer:
    --- with less spatial information you gain computational performance,
    --- less spatial information also means fewer parameters, so less chance of overfitting,
    --- you get some translation invariance.
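
A short sketch of the max-pooling example above (f = 2, s = 2), added for concreteness:

    # 2x2 max pooling with stride 2 on the 4x4 example.
    import numpy as np

    x = np.array([[1, 3, 2, 1],
                  [2, 9, 1, 1],
                  [1, 3, 2, 3],
                  [5, 6, 1, 2]], dtype=float)

    f, s = 2, 2
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()

    print(out)    # [[9. 2.]
                  #  [6. 3.]]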

  20. 4. Fully connected layer. The convolutional layers help extract certain features from the image; the fully connected layers are then able to generalize from these features to the output space. [LeCun et al., 1998. Gradient-based learning applied to document recognition.]

  21. 5. Classic networks --- AlexNet
  227×227×3 → 11×11 conv, s = 4 → 55×55×96 → 3×3 MAX-POOL, s = 2 → 27×27×96 → 5×5 conv, same → 27×27×256 → 3×3 MAX-POOL, s = 2 → 13×13×256 → 3×3 conv, same → 13×13×384 → 3×3 conv, same → 13×13×384 → 3×3 conv, same → 13×13×256 → 3×3 MAX-POOL, s = 2 → 6×6×256 → flatten to 9216 → FC 4096 → FC 4096 → Softmax 1000.
  Parameters: the three fully connected layers alone contain 9216×4096 + 4096×4096 + 4096×1000 ≈ 58.6 million weights, the bulk of AlexNet's roughly 60 million parameters.
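
The size of the fully connected head can be checked directly (an added calculation counting weights and biases for the 9216 → 4096 → 4096 → 1000 layers listed above):

    # Parameter count of AlexNet's fully connected layers.
    fc_sizes = [9216, 4096, 4096, 1000]
    params = sum(n_in * n_out + n_out for n_in, n_out in zip(fc_sizes, fc_sizes[1:]))
    print(params)    # 58631144 (about 58.6 million) -- most of AlexNet's ~60M parameters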

  22. Thank you
