Presentation about Deep Learning --- Zhongwu Xie
Contents
1. Brief introduction to Deep Learning
2. Brief introduction to Backpropagation
3. Brief introduction to Convolutional Neural Networks
Deep learning
I. Introduction to Deep Learning
Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. --- Ian Goodfellow
I. Introduction to Deep Learning
In the plot on the left, a Venn diagram shows how deep learning is a kind of representation learning, which is in turn a kind of machine learning. In the plot on the right, the graph shows that deep learning uses multiple layers.
I. What is Deep Learning
• Data: $(x_i, y_i)$, $1 \le i \le N$
• Model: ANN
• Criterion:
  - Cost function: $L(y, f(x))$
  - Empirical risk minimization: $R(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i, \theta)\big)$
  - Regularization: $\|w\|_1$, $\|w\|_2$, early stopping, dropout
  - Objective function: $\min_\theta R(\theta) + \lambda \cdot (\text{regularization function})$
• Algorithm: backpropagation, gradient descent
Learning is cast as optimization.
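Below is a minimal Python sketch of "learning as optimization". It is not from the slides: it assumes a toy linear model $f(x, \theta) = \theta^T x$ with a squared-error cost and an L2 penalty, but the structure (empirical risk plus $\lambda$ times a regularizer, minimized by gradient descent) is the same as for the ANN above.

```python
# A minimal sketch of "learning as optimization": regularized empirical risk
# minimized by plain gradient descent. The linear model is a placeholder
# assumption; the slides use an ANN, but the objective has the same shape.
import numpy as np

def empirical_risk(theta, X, y, lam=0.1):
    """R(theta) + lambda * ||theta||_2^2 with a squared-error cost."""
    preds = X @ theta                      # f(x_i, theta) for all i
    losses = (preds - y) ** 2              # L(y_i, f(x_i, theta))
    return losses.mean() + lam * np.sum(theta ** 2)

def gradient(theta, X, y, lam=0.1):
    preds = X @ theta
    grad_risk = 2 * X.T @ (preds - y) / len(y)   # d(mean loss)/d(theta)
    return grad_risk + 2 * lam * theta           # + d(penalty)/d(theta)

# toy data and a few gradient-descent steps
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
theta = np.zeros(3)
for _ in range(200):
    theta -= 0.05 * gradient(theta, X, y)
print(empirical_risk(theta, X, y))
```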
II. Why do we need to learn Deep Learning? --- Efficiency
Famous instances: self-driving cars, AlphaGo.
• Speech Recognition --- the phoneme error rate on TIMIT: HMM-GMM systems in the 1990s: about 26%; Restricted Boltzmann Machines (RBMs) in 2009: 20.7%; LSTM-RNN in 2013: 17.7%.
• Computer Vision --- the top-5 error of the ILSVRC 2017 classification task is 2.251%, while a human being's is 5.1%.
• Natural Language Processing --- language models (n-gram), machine translation.
• Recommender Systems --- recommend ads, social network news feeds, movies, jokes, or advice from experts, etc.
Backward propagation
I. Introduction to Notation
For a single neuron with inputs $x_1, x_2, x_3$: $z = w^T x + b$, $a = \sigma(z)$.
The example network has an input layer (layer 0), a hidden layer (layer 1), and an output layer (layer 2).
$w^{[l]}_{jk}$ is the weight from the $k$-th neuron in the $(l-1)$-th layer to the $j$-th neuron in the $l$-th layer; for example, $w^{[2]}_{43}$ connects the 3rd neuron in layer 1 to the 4th neuron in layer 2.
I. Introduction to Forward Propagation and Notation
For the hidden layer (layer 1), with inputs $x_1, x_2, x_3$ and four hidden units:
$z^{[1]}_j = \sum_{k=1}^{3} w^{[1]}_{jk} x_k + b^{[1]}_j$, $\quad a^{[1]}_j = \sigma(z^{[1]}_j)$, $\quad j = 1, \dots, 4$.
In matrix form: $z^{[1]} = W^{[1]} x + b^{[1]}$ and $a^{[1]} = \sigma(z^{[1]})$, where $W^{[1]}$ is the $4 \times 3$ matrix of weights $w^{[1]}_{jk}$ and $\sigma$ is the sigmoid function applied elementwise.
Cost function: $L(a, y)$. Gradient notation: $dW^{[1]} = \frac{\partial L(a, y)}{\partial W^{[1]}}$, $db^{[1]} = \frac{\partial L(a, y)}{\partial b^{[1]}}$.
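The following NumPy sketch (mine, with assumed shapes: 3 inputs, 4 hidden units, random weights) simply evaluates the matrix form $z^{[1]} = W^{[1]} x + b^{[1]}$, $a^{[1]} = \sigma(z^{[1]})$ from this slide.

```python
# A minimal sketch of the forward pass for a 3-input, 4-unit hidden layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # W[1]: 4 hidden units, 3 inputs
b1 = np.zeros((4, 1))          # b[1]
x = rng.normal(size=(3, 1))    # one input column vector

z1 = W1 @ x + b1               # z[1] = W[1] x + b[1]
a1 = sigmoid(z1)               # a[1] = sigma(z[1])
print(a1.shape)                # (4, 1)
```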
II. Backward Propagation --- the chain rule
If $x = f(w)$, $y = g(x)$, $z = h(y)$, then $\frac{dz}{dw} = \frac{dz}{dy} \cdot \frac{dy}{dx} \cdot \frac{dx}{dw}$.
The functions of a neural network are composed in the same way ($x \to z = w^T x + b \to a = \sigma(z) \to L(a, y)$), so we can use the chain rule to compute the gradient of the neural network.
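As a quick sanity check of the chain rule, here is a small toy example (my own choice of functions, not from the slides): the analytic derivative $\frac{dz}{dw}$ from the chain rule matches a finite-difference estimate.

```python
# Numeric check of dz/dw = dz/dy * dy/dx * dx/dw for x = 3w, y = x**2, z = sin(y).
import math

def forward(w):
    x = 3 * w          # x = f(w),  dx/dw = 3
    y = x ** 2         # y = g(x),  dy/dx = 2x
    z = math.sin(y)    # z = h(y),  dz/dy = cos(y)
    return x, y, z

w = 0.7
x, y, z = forward(w)
analytic = math.cos(y) * (2 * x) * 3          # chain rule
eps = 1e-6
numeric = (forward(w + eps)[2] - forward(w - eps)[2]) / (2 * eps)
print(analytic, numeric)                       # the two values agree closely
```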
II. Backward Propagation --- the chain rule
Cross-entropy cost: $L(a, y) = -[\,y \log a + (1 - y) \log(1 - a)\,]$.
Forward pass: $z^{[1]} = W^{[1]} x + b^{[1]}$, $a^{[1]} = \sigma(z^{[1]})$, $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$, $a^{[2]} = \sigma(z^{[2]})$, cost $L(a^{[2]}, y)$.
Backward pass (repeated application of the chain rule):
$da^{[2]} = \frac{\partial L(a, y)}{\partial a^{[2]}} = -\frac{y}{a^{[2]}} + \frac{1 - y}{1 - a^{[2]}}$
$dz^{[2]} = \frac{\partial L}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} = a^{[2]} - y$
$dW^{[2]} = \frac{\partial L}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial W^{[2]}} = dz^{[2]} \, (a^{[1]})^T, \qquad db^{[2]} = dz^{[2]}$
$dz^{[1]} = \frac{\partial L}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial z^{[1]}} = (W^{[2]})^T dz^{[2]} * \sigma'(z^{[1]})$
$dW^{[1]} = dz^{[1]} \, x^T, \qquad db^{[1]} = dz^{[1]}$
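The backward-pass formulas above translate directly into NumPy. This is only a sketch with assumed sizes (3 inputs, 4 hidden units, 1 output, a single training example), not a complete training loop.

```python
# Forward and backward pass for the two-layer sigmoid network on this slide.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(3, 1)), np.array([[1.0]])
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# forward
z1 = W1 @ x + b1;   a1 = sigmoid(z1)
z2 = W2 @ a1 + b2;  a2 = sigmoid(z2)
cost = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))

# backward (the slide's formulas)
dz2 = a2 - y                          # dz[2] = a[2] - y
dW2 = dz2 @ a1.T;   db2 = dz2         # dW[2] = dz[2] a[1]^T
dz1 = (W2.T @ dz2) * a1 * (1 - a1)    # dz[1] = W[2]^T dz[2] * sigma'(z[1])
dW1 = dz1 @ x.T;    db1 = dz1         # dW[1] = dz[1] x^T
```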
II. Summary: The Backpropagation
A small change $\Delta w^{[l]}_{jk}$ in one weight changes the cost by approximately $\Delta C \approx \frac{\partial C}{\partial w^{[l]}_{jk}} \Delta w^{[l]}_{jk}$, where
$\frac{\partial C}{\partial w^{[l]}_{jk}} = \sum_{mnp \ldots q} \frac{\partial C}{\partial a^{[L]}_m} \frac{\partial a^{[L]}_m}{\partial a^{[L-1]}_n} \frac{\partial a^{[L-1]}_n}{\partial a^{[L-2]}_p} \cdots \frac{\partial a^{[l+1]}_q}{\partial a^{[l]}_j} \frac{\partial a^{[l]}_j}{\partial w^{[l]}_{jk}}$,
the sum running over all paths from the weight to the output.
"The backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost." --- Michael Nielsen
II. Summary: The Backpropagation Algorithm
1. Input $x$: set the corresponding activation $a^{[0]}$ for the input layer.
2. Feedforward: for each $l = 1, 2, \dots, L$ compute $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ and $a^{[l]} = \sigma(z^{[l]})$.
3. Output error: $dz^{[L]} = a^{[L]} - y$.
4. Backpropagate the error: for each $l = L-1, L-2, \dots, 1$ compute $dz^{[l]} = (W^{[l+1]})^T dz^{[l+1]} * \sigma'(z^{[l]})$.
5. Output: the gradient of the cost function is given by $dW^{[l]} = \frac{\partial L(a, y)}{\partial W^{[l]}} = dz^{[l]} (a^{[l-1]})^T$ and $db^{[l]} = \frac{\partial L(a, y)}{\partial b^{[l]}} = dz^{[l]}$.
Update $W^{[l]}$ and $b^{[l]}$: $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$, $\quad b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$.
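Putting the five steps together, here is a minimal single-example training step for an L-layer sigmoid network (a sketch with assumed layer sizes [3, 4, 4, 1] and learning rate $\alpha = 0.1$; mini-batching and cost tracking are omitted).

```python
# One feedforward / backpropagation / gradient-descent step, layer by layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]
W = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros((sizes[l + 1], 1)) for l in range(len(sizes) - 1)]
x, y, alpha = rng.normal(size=(3, 1)), np.array([[1.0]]), 0.1

# 1-2. input and feedforward: a[0] = x, z[l] = W[l] a[l-1] + b[l], a[l] = sigma(z[l])
a = [x]
for Wl, bl in zip(W, b):
    a.append(sigmoid(Wl @ a[-1] + bl))

# 3. output error
dz = a[-1] - y
# 4-5. backpropagate the error, compute gradients, and update each layer
for l in reversed(range(len(W))):
    dW, db = dz @ a[l].T, dz                   # dW[l], db[l]
    if l > 0:
        dz = (W[l].T @ dz) * a[l] * (1 - a[l])  # dz[l-1]; sigma'(z) = a(1-a)
    W[l] -= alpha * dW
    b[l] -= alpha * db
```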
Convolutional Neural Networks
1. Types of layers in a convolutional network:
• Convolution
• Pooling
• Fully connected
2.1 Convolution in Neural Networks
A 6 × 6 image whose left half is all 10s and right half is all 0s, convolved with a 3 × 3 vertical-edge filter, gives a 4 × 4 output that is large exactly where the vertical edge sits:
Image (6 × 6, every row is 10 10 10 0 0 0) * Filter (every row is 1 0 -1) = Output (4 × 4, every row is 0 30 30 0).
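A few lines of NumPy are enough to reproduce this example. The conv2d_valid helper below is my own illustrative function (stride 1, no padding, cross-correlation as used in CNNs), not a library API.

```python
# Reproduce the vertical-edge example: 6x6 image (left half 10s) * 3x3 filter.
import numpy as np

def conv2d_valid(image, kernel):
    """Plain cross-correlation with no padding and stride 1."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
kernel = np.array([[1, 0, -1]] * 3, dtype=float)
print(conv2d_valid(image, kernel))
# every row is [0. 30. 30. 0.] -- the vertical edge in the middle is detected
```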
2.2 Multiple Filters
A 6 × 6 × 3 input convolved with one 3 × 3 × 3 filter gives a 4 × 4 feature map; convolving with two such filters and stacking the results gives a 4 × 4 × 2 output.
Why convolutions?
--- Parameter sharing
--- Sparsity of connections
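To see the shapes work out, the sketch below (my own conv3d_valid helper, random data) convolves a 6 × 6 × 3 volume with two 3 × 3 × 3 filters and stacks the resulting 4 × 4 maps into a 4 × 4 × 2 output.

```python
# Shape check: 6x6x3 input, two 3x3x3 filters, 4x4x2 output.
import numpy as np

def conv3d_valid(volume, filt):
    """Convolve an HxWxC volume with a khxkwxC filter -> one 2-D feature map."""
    H, W, C = volume.shape
    kh, kw, _ = filt.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(volume[i:i + kh, j:j + kw, :] * filt)
    return out

rng = np.random.default_rng(0)
volume = rng.normal(size=(6, 6, 3))
filters = rng.normal(size=(2, 3, 3, 3))             # two 3x3x3 filters
maps = np.stack([conv3d_valid(volume, f) for f in filters], axis=-1)
print(maps.shape)                                    # (4, 4, 2)
```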
3. Pooling Layers
• Max pooling: with a 2 × 2 filter and stride 2, the 4 × 4 input
1 3 2 1
2 9 1 1
1 3 2 3
5 6 1 2
is reduced to the 2 × 2 output
9 2
6 3
Hyperparameters: f (filter size), s (stride), max or average pooling.
• Pooling removes redundant information from the convolutional layer:
--- By having less spatial information you gain computational performance.
--- Less spatial information also means fewer parameters, so less chance to overfit.
--- You get some translation invariance.
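The max-pooling example can be checked with a short NumPy sketch (the max_pool helper is my own illustration, not a library function).

```python
# 2x2 max pooling with stride 2 on the 4x4 example from this slide.
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling with filter size f and stride s on a 2-D array."""
    H, W = x.shape
    out = np.zeros((1 + (H - f) // s, 1 + (W - f) // s))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.max(x[i * s:i * s + f, j * s:j * s + f])
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)
print(max_pool(x))   # [[9. 2.]
                     #  [6. 3.]]
```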
3. Fully Connected Layer
The convolutional layers extract features from the image; the fully connected layer then generalizes from these features into the output space. [LeCun et al., 1998. Gradient-based learning applied to document recognition.]
4. Classic Networks --- AlexNet
227 × 227 × 3 → CONV 11 × 11, s = 4 → 55 × 55 × 96 → MAX-POOL 3 × 3, s = 2 → 27 × 27 × 96 → CONV 5 × 5, same → 27 × 27 × 256 → MAX-POOL 3 × 3, s = 2 → 13 × 13 × 256 → CONV 3 × 3, same → 13 × 13 × 384 → CONV 3 × 3, same → 13 × 13 × 384 → CONV 3 × 3, same → 13 × 13 × 256 → MAX-POOL 3 × 3, s = 2 → 6 × 6 × 256 → flatten to 9216 → FC 4096 → FC 4096 → Softmax 1000.
Weights in the fully connected layers alone: 9216 × 4096 + 4096 × 4096 + 4096 × 1000 ≈ 58.6 million.
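As a rough check of where AlexNet's parameters live, the script below counts weights only, following the simplified architecture on this slide (biases and the original paper's grouped convolutions are ignored); the fully connected layers clearly dominate.

```python
# Back-of-the-envelope weight count for the simplified AlexNet on this slide.
conv_shapes = [          # (filter_h, filter_w, in_channels, out_channels)
    (11, 11, 3, 96), (5, 5, 96, 256),
    (3, 3, 256, 384), (3, 3, 384, 384), (3, 3, 384, 256),
]
fc_shapes = [(9216, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum(h * w * cin * cout for h, w, cin, cout in conv_shapes)
fc_params = sum(n_in * n_out for n_in, n_out in fc_shapes)
print(f"conv weights: {conv_params:,}")   # 3,745,824
print(f"fc weights:   {fc_params:,}")     # 58,621,952 -- the FC layers dominate
```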
Thank you