CS7015 (Deep Learning): Lecture 4 - Feedforward Neural Networks, Backpropagation




  1. CS7015 (Deep Learning): Lecture 4 - Feedforward Neural Networks, Backpropagation. Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras.

  2. References/Acknowledgments: See the excellent videos by Hugo Larochelle on Backpropagation.

  3. Module 4.1: Feedforward Neural Networks (a.k.a. multilayered network of neurons)

  4. The input to the network is an n-dimensional vector. The network contains L - 1 hidden layers (2, in this case) having n neurons each. Finally, there is one output layer containing k neurons (say, corresponding to k classes). Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation (a_i and h_i are vectors). The input layer can be called the 0-th layer and the output layer can be called the L-th layer. W_i ∈ R^{n×n} and b_i ∈ R^n are the weight and bias between layers i - 1 and i (0 < i < L). W_L ∈ R^{n×k} and b_L ∈ R^k are the weight and bias between the last hidden layer and the output layer (L = 3 in this case). (Figure: a network with inputs x_1, x_2, ..., x_n, pre-activations a_1, a_2, a_3, activations h_1, h_2, weights W_1, W_2, W_3, biases b_1, b_2, b_3, and output h_L = ŷ = f(x).)

  5. The pre-activation at layer i is given by a_i(x) = b_i + W_i h_{i-1}(x). The activation at layer i is given by h_i(x) = g(a_i(x)), where g is called the activation function (for example, logistic, tanh, linear, etc.). The activation at the output layer is given by f(x) = h_L(x) = O(a_L(x)), where O is the output activation function (for example, softmax, linear, etc.). To simplify notation we will refer to a_i(x) as a_i and h_i(x) as h_i.

  6. The pre-activation at layer i is given by a_i = b_i + W_i h_{i-1}. The activation at layer i is given by h_i = g(a_i), where g is called the activation function (for example, logistic, tanh, linear, etc.). The activation at the output layer is given by f(x) = h_L = O(a_L), where O is the output activation function (for example, softmax, linear, etc.).
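
     These equations translate directly into code. Below is a minimal NumPy sketch of the forward pass; the choice of g as the logistic function and O as the identity, the layer sizes, and the convention that each W_i is stored with shape (output dim, input dim) are assumptions for illustration, not fixed by the slide.

         import numpy as np

         def logistic(a):
             # g: elementwise logistic activation (an assumed choice)
             return 1.0 / (1.0 + np.exp(-a))

         def forward(x, weights, biases, g=logistic, O=lambda a: a):
             # weights[i] is W_{i+1}, biases[i] is b_{i+1}; each W is stored as
             # (output_dim, input_dim) so that W @ h maps one layer to the next
             h = x
             for W, b in zip(weights[:-1], biases[:-1]):
                 a = b + W @ h          # pre-activation: a_i = b_i + W_i h_{i-1}
                 h = g(a)               # activation:     h_i = g(a_i)
             a_L = biases[-1] + weights[-1] @ h
             return O(a_L)              # output:         f(x) = h_L = O(a_L)

         # Example: n = 4 inputs, two hidden layers of 4 neurons each, k = 3 outputs (L = 3)
         rng = np.random.default_rng(0)
         Ws = [rng.standard_normal((4, 4)), rng.standard_normal((4, 4)), rng.standard_normal((3, 4))]
         bs = [np.zeros(4), np.zeros(4), np.zeros(3)]
         print(forward(rng.standard_normal(4), Ws, bs))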

  7. Data: {x_i, y_i}, i = 1, ..., N. Model: ŷ_i = f(x_i) = O(W_3 g(W_2 g(W_1 x_i + b_1) + b_2) + b_3). Parameters: θ = [W_1, ..., W_L, b_1, ..., b_L] (L = 3). Algorithm: gradient descent with backpropagation (we will see soon). Objective/Loss/Error function: say, min (1/N) Σ_{i=1}^{N} Σ_{j=1}^{k} (ŷ_ij - y_ij)². In general, min L(θ), where L(θ) is some function of the parameters.
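
     As a concrete illustration of this objective, here is a hedged sketch of the squared error loss; it reuses the forward function from the sketch under slide 6, and the helper name and data layout are assumptions.

         def squared_error_loss(weights, biases, X, Y):
             # L(theta) = (1/N) * sum_i sum_j (yhat_ij - y_ij)^2
             # assumes forward() and numpy (np) from the earlier sketch are in scope
             N = len(X)
             total = 0.0
             for x_i, y_i in zip(X, Y):
                 y_hat_i = forward(x_i, weights, biases)   # yhat_i = f(x_i)
                 total += np.sum((y_hat_i - y_i) ** 2)
             return total / N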

  8. Module 4.2: Learning Parameters of Feedforward Neural Networks (Intuition)

  9. The story so far... We have introduced feedforward neural networks. We are now interested in finding an algorithm for learning the parameters of this model.

  10. Recall our gradient descent algorithm:

      Algorithm: gradient_descent()
          t ← 0;
          max_iterations ← 1000;
          Initialize w_0, b_0;
          while t++ < max_iterations do
              w_{t+1} ← w_t - η ∇w_t;
              b_{t+1} ← b_t - η ∇b_t;
          end
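
      As a reminder of what this pseudocode does, here is a runnable sketch for the simple case carried over from the earlier lectures: a single sigmoid neuron ŷ = σ(wx + b) trained with squared error. The toy data, initialization, and learning rate are assumptions for illustration.

          import numpy as np

          X = np.array([0.5, 2.5])   # toy inputs (assumed)
          Y = np.array([0.2, 0.9])   # toy targets (assumed)

          def f(x, w, b):
              return 1.0 / (1.0 + np.exp(-(w * x + b)))   # sigmoid neuron

          def gradients(w, b):
              # gradients of sum_i (f(x_i) - y_i)^2 with respect to w and b
              y_hat = f(X, w, b)
              dw = np.sum(2 * (y_hat - Y) * y_hat * (1 - y_hat) * X)
              db = np.sum(2 * (y_hat - Y) * y_hat * (1 - y_hat))
              return dw, db

          def gradient_descent(eta=1.0, max_iterations=1000):
              w, b, t = -2.0, -2.0, 0
              while t < max_iterations:      # while t++ < max_iterations do
                  dw, db = gradients(w, b)
                  w = w - eta * dw           # w_{t+1} ← w_t - η ∇w_t
                  b = b - eta * db           # b_{t+1} ← b_t - η ∇b_t
                  t += 1
              return w, b

          print(gradient_descent())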

  11. Recall our gradient descent algorithm. We can write it more concisely as:

      Algorithm: gradient_descent()
          t ← 0;
          max_iterations ← 1000;
          Initialize θ_0 = [w_0, b_0];
          while t++ < max_iterations do
              θ_{t+1} ← θ_t - η ∇θ_t;
          end

      where ∇θ_t = [∂L(θ)/∂w_t, ∂L(θ)/∂b_t]^T. Now, in this feedforward neural network, instead of θ = [w, b] we have θ = [W_1, W_2, ..., W_L, b_1, b_2, ..., b_L]. We can still use the same algorithm for learning the parameters of our model.

  12. Recall our gradient descent algorithm. We can write it more concisely as:

      Algorithm: gradient_descent()
          t ← 0;
          max_iterations ← 1000;
          Initialize θ_0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0];
          while t++ < max_iterations do
              θ_{t+1} ← θ_t - η ∇θ_t;
          end

      where ∇θ_t = [∂L(θ)/∂W_{1,t}, ..., ∂L(θ)/∂W_{L,t}, ∂L(θ)/∂b_{1,t}, ..., ∂L(θ)/∂b_{L,t}]^T. Now, in this feedforward neural network, instead of θ = [w, b] we have θ = [W_1, W_2, ..., W_L, b_1, b_2, ..., b_L]. We can still use the same algorithm for learning the parameters of our model.
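
      The same update, written for the full parameter set θ = [W_1, ..., W_L, b_1, ..., b_L], is sketched below. How the gradients dWs and dbs are actually obtained (backpropagation) is the subject of the rest of the lecture, so here they are assumed to come from a hypothetical compute_gradients helper.

          def gradient_descent_step(weights, biases, dWs, dbs, eta):
              # theta_{t+1} ← theta_t - eta * grad(theta_t), applied blockwise:
              # dWs[i] and dbs[i] are the gradients of L(theta) w.r.t. weights[i] and biases[i]
              new_weights = [W - eta * dW for W, dW in zip(weights, dWs)]
              new_biases = [b - eta * db for b, db in zip(biases, dbs)]
              return new_weights, new_biases

          # Usage (compute_gradients is hypothetical; it stands in for backpropagation):
          # for t in range(max_iterations):
          #     dWs, dbs = compute_gradients(weights, biases, X, Y)
          #     weights, biases = gradient_descent_step(weights, biases, dWs, dbs, eta=0.1)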

  13. Except that now our ∇θ looks much more nasty: it collects the partial derivative of L(θ) with respect to every entry of every weight matrix and every bias vector,

      ∇θ = [∂L(θ)/∂W_{1,11}, ..., ∂L(θ)/∂W_{1,nn}, ∂L(θ)/∂W_{2,11}, ..., ∂L(θ)/∂W_{2,nn}, ..., ∂L(θ)/∂W_{L,11}, ..., ∂L(θ)/∂W_{L,nk}, ∂L(θ)/∂b_{1,1}, ..., ∂L(θ)/∂b_{1,n}, ..., ∂L(θ)/∂b_{L,1}, ..., ∂L(θ)/∂b_{L,k}]

      ∇θ is thus composed of ∇W_1, ∇W_2, ..., ∇W_{L-1} ∈ R^{n×n}, ∇W_L ∈ R^{n×k}, ∇b_1, ∇b_2, ..., ∇b_{L-1} ∈ R^n and ∇b_L ∈ R^k.
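
      To make the shape bookkeeping concrete, the sketch below allocates one gradient array per parameter with these dimensions and stacks them into a single flat ∇θ vector; the sizes n, k, L are placeholders, and W_L is stored as (k, n) in code so that it maps R^n to R^k.

          import numpy as np

          n, k, L = 4, 3, 3   # assumed sizes for illustration

          # one gradient array per parameter, matching the shapes listed on the slide
          grad_Ws = [np.zeros((n, n)) for _ in range(L - 1)] + [np.zeros((k, n))]
          grad_bs = [np.zeros(n) for _ in range(L - 1)] + [np.zeros(k)]

          # grad_theta is just all of these entries collected into one long vector
          grad_theta = np.concatenate([g.ravel() for g in grad_Ws + grad_bs])
          print(grad_theta.shape)   # n*n*(L-1) + n*k + n*(L-1) + k entries, i.e. (55,) here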

  14. We need to answer two questions. How to choose the loss function L(θ)? How to compute ∇θ, which is composed of ∇W_1, ∇W_2, ..., ∇W_{L-1} ∈ R^{n×n}, ∇W_L ∈ R^{n×k}, ∇b_1, ∇b_2, ..., ∇b_{L-1} ∈ R^n and ∇b_L ∈ R^k?

  15. Module 4.3: Output Functions and Loss Functions

  16. We need to answer two questions. How to choose the loss function L(θ)? How to compute ∇θ, which is composed of ∇W_1, ∇W_2, ..., ∇W_{L-1} ∈ R^{n×n}, ∇W_L ∈ R^{n×k}, ∇b_1, ∇b_2, ..., ∇b_{L-1} ∈ R^n and ∇b_L ∈ R^k?

  17. The choice of loss function depends on the problem at hand. We will illustrate this with the help of two examples. Consider our movie example again, but this time we are interested in predicting ratings: the imdb rating, the Critics rating, and the RT rating, so y_i ∈ R^3 (for example, y_i = {7.5, 8.2, 7.7}). The loss function should capture how much ŷ_i deviates from y_i. If y_i ∈ R^n then the squared error loss can capture this deviation:

      L(θ) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{3} (ŷ_ij - y_ij)²

      (Figure: a neural network with L - 1 hidden layers whose input x_i contains features such as isActor Damon, isDirector Nolan, ..., and whose three outputs are the predicted ratings.)

  18. A related question: what should the output function 'O' be if y_i ∈ R? More specifically, can it be the logistic function? No, because it restricts ŷ_i to a value between 0 and 1, but we want ŷ_i ∈ R. So, in such cases it makes sense to have 'O' as a linear function:

      f(x) = h_L = O(a_L) = W_O a_L + b_O

      ŷ_i = f(x_i) is then no longer bounded between 0 and 1.
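
      A tiny numerical illustration of this point; the scalar pre-activation and the values W_O = 2, b_O = 1 are assumptions chosen only to show the contrast.

          import numpy as np

          a_L = np.array([4.2])                     # some pre-activation at the output layer (assumed)
          print(1.0 / (1.0 + np.exp(-a_L)))         # logistic output: always squashed into (0, 1)
          print(2.0 * a_L + 1.0)                    # linear output W_O a_L + b_O: unbounded, can match any real rating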

  19. Intentionally left blank

  20. Now let us consider another problem for which a different loss function would be appropriate. Suppose we want to classify an image into 1 of k classes (say Apple, Mango, Orange, or Banana, with the true label encoded as y = [1 0 0 0]) using a neural network with L - 1 hidden layers. Here again we could use the squared error loss to capture the deviation. But can you think of a better function?

  21. Notice that y is a probability distribution (for example, y = [1 0 0 0] over Apple, Mango, Orange, Banana). Therefore we should also ensure that ŷ is a probability distribution. What choice of the output activation 'O' will ensure this? With a neural network with L - 1 hidden layers,

      a_L = W_L h_{L-1} + b_L
      ŷ_j = O(a_L)_j = e^{a_{L,j}} / Σ_{i=1}^{k} e^{a_{L,i}}

      O(a_L)_j is the j-th element of ŷ and a_{L,j} is the j-th element of the vector a_L. This function is called the softmax function.
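
      A minimal implementation of this softmax output function is sketched below; subtracting the maximum before exponentiating is a standard numerical stability trick added here, not something stated on the slide.

          import numpy as np

          def softmax(a_L):
              # O(a_L)_j = exp(a_L[j]) / sum_i exp(a_L[i]); returns a probability distribution
              z = a_L - np.max(a_L)     # shift for numerical stability (does not change the result)
              e = np.exp(z)
              return e / np.sum(e)

          # Example: the largest pre-activation gets the largest probability, and the outputs sum to 1
          print(softmax(np.array([2.0, 1.0, 0.1, -1.0])))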
