
Neural Networks Learning the Network: Part 3 (11-785, Fall 2020)

Lecture 5

Recap: Training the network. Given a training set of input-output pairs $(X_1, d_1), \ldots, (X_T, d_T)$, minimize the total divergence between the network outputs and the desired outputs, $\mathrm{Loss}(W) = \sum_t \mathrm{Div}\big(Y(X_t; W), d_t\big)$, with respect to the weights $W$. This is a problem of function minimization.
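The minimization is typically carried out by gradient descent; as a standard reminder (the learning rate $\eta$ is my notation, not fixed by this excerpt):
$$w_{i,j}^{(k)} \leftarrow w_{i,j}^{(k)} - \eta\,\frac{\partial \mathrm{Loss}}{\partial w_{i,j}^{(k)}}\quad\text{for every weight in the network,}$$
which is why the rest of the lecture focuses on computing these derivatives.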


  1. Calculus Refresher: Basic rules of calculus. For any differentiable function $y = f(x)$ with derivative $\frac{dy}{dx}$, the following must hold for sufficiently small $\Delta x$: $\Delta y \approx \frac{dy}{dx}\Delta x$. For any differentiable function $y = f(x_1, \ldots, x_M)$ with partial derivatives $\frac{\partial y}{\partial x_1}, \ldots, \frac{\partial y}{\partial x_M}$, the following must hold for sufficiently small $\Delta x_1, \ldots, \Delta x_M$ (both by the definition of the derivative): $\Delta y \approx \frac{\partial y}{\partial x_1}\Delta x_1 + \cdots + \frac{\partial y}{\partial x_M}\Delta x_M$.

  2. Calculus Refresher: Chain rule. For any nested function $y = f(g(x))$: $\frac{dy}{dx} = \frac{dy}{dg}\cdot\frac{dg}{dx}$. Check – we can confirm this with the perturbation rule above: $\Delta y \approx \frac{dy}{dg}\Delta g \approx \frac{dy}{dg}\frac{dg}{dx}\Delta x$.
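A quick numeric check of the chain rule (an illustrative example, not taken from the slides): for $y = \sin(g)$ with $g(x) = x^2$,
$$\frac{dy}{dx} = \frac{dy}{dg}\cdot\frac{dg}{dx} = \cos(x^2)\cdot 2x.$$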

  3. Calculus Refresher: Distributed Chain rule. Let $y = f\big(g_1(x), g_2(x), \ldots, g_M(x)\big)$, where the single variable $x$ reaches $y$ through several intermediate functions. Then $\frac{dy}{dx} = \frac{\partial f}{\partial g_1}\frac{dg_1}{dx} + \frac{\partial f}{\partial g_2}\frac{dg_2}{dx} + \cdots + \frac{\partial f}{\partial g_M}\frac{dg_M}{dx}$.

  4. Calculus Refresher: Distributed Chain rule. Check: perturbing $x$ by $\Delta x$ perturbs each $g_i$ by $\Delta g_i \approx \frac{dg_i}{dx}\Delta x$, so $\Delta y \approx \sum_{i=1}^{M}\frac{\partial f}{\partial g_i}\Delta g_i \approx \Big(\sum_{i=1}^{M}\frac{\partial f}{\partial g_i}\frac{dg_i}{dx}\Big)\Delta x$, confirming the formula.
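A concrete check of the distributed chain rule (an illustrative example, not from the slides): let $y = f(g_1, g_2) = g_1 g_2$ with $g_1(x) = x^2$ and $g_2(x) = \sin x$. Then
$$\frac{dy}{dx} = \frac{\partial f}{\partial g_1}\frac{dg_1}{dx} + \frac{\partial f}{\partial g_2}\frac{dg_2}{dx} = \sin x \cdot 2x + x^2 \cos x,$$
which agrees with the product rule applied to $x^2 \sin x$.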

  5. Distributed Chain Rule: Influence Diagram. [Figure: $x$ feeds into each of $g_1(x), \ldots, g_M(x)$, all of which feed into $y$.] • $x$ affects $y$ through each of $g_1(x), \ldots, g_M(x)$.

  6. Distributed Chain Rule: Influence Diagram. • Small perturbations in $x$ cause small perturbations in each of $g_1(x), \ldots, g_M(x)$, each of which individually additively perturbs $y$.

  7. Returning to our problem • How to compute $\frac{\partial \mathrm{Div}(Y,d)}{\partial w_{i,j}^{(k)}}$ for every weight in the network?

  8. A first closer look at the network • Showing a tiny 2-input network for illustration – Actual network would have many more neurons and inputs

  9. A first closer look at the network. [Figure: each neuron drawn as a weighted sum (+) followed by an activation $g(\cdot)$.] • Showing a tiny 2-input network for illustration – Actual network would have many more neurons and inputs • Explicitly separating the weighted sum of inputs from the activation

  10. A first closer look at the network. [Figure: the same tiny network with every weight $w_{i,j}^{(k)}$ labelled.] • Showing a tiny 2-input network for illustration – Actual network would have many more neurons and inputs • Expanded with all weights shown • Let's label the other variables too…

  11. Computing the derivative for a single input. [Figure: the tiny network with weights $w_{i,j}^{(k)}$, weighted sums $z_i^{(k)}$, activations $y_i^{(k)}$, and the divergence Div computed at the output.]

  12. Computing the derivative for a single input. What is $\frac{\partial \mathrm{Div}(Y,d)}{\partial w_{i,j}^{(k)}}$? [Figure: the same labelled network.]

  13. Computing the gradient • Note: computation of the derivative $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(k)}}$ requires the intermediate and final output values of the network in response to the input

  14. The “forward pass”. [Figure: a layered network $y^{(0)} \to z^{(1)} \to y^{(1)} \to z^{(2)} \to y^{(2)} \to \cdots \to z^{(N-1)} \to y^{(N-1)} \to z^{(N)} \to y^{(N)}$, with activation $f_N$ at the output and a constant “1” bias input at every layer.] We will refer to the process of computing the output from an input as the forward pass. We will illustrate the forward pass in the following slides.

  15. The “forward pass”. Setting $y_j^{(0)} = x_j$ for notational convenience, and assuming $w_{0,j}^{(k)} = b_j^{(k)}$ and $y_0^{(k)} = 1$ – i.e. assuming the bias is a weight and extending the output of every layer by a constant 1, to account for the biases.
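Under this convention every pre-activation is one weighted sum over the augmented previous-layer output; a compact restatement of the same convention in matrix form:
$$z^{(k)} = W^{(k)}\begin{bmatrix} 1 \\ y^{(k-1)} \end{bmatrix}, \qquad y^{(k)} = f_k\big(z^{(k)}\big),$$
where the first column of $W^{(k)}$ holds the biases $b^{(k)}$.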

  16. The “forward pass”. [Figure only: the layered network, with the input $y^{(0)}$ presented at the first layer.]

  17. The “forward pass”. First-layer weighted sums: $z_j^{(1)} = \sum_i w_{i,j}^{(1)}\, y_i^{(0)}$.

  18. First-layer activations: $y_j^{(1)} = f_1\big(z_j^{(1)}\big)$, with $z_j^{(1)} = \sum_i w_{i,j}^{(1)}\, y_i^{(0)}$.

  19. Second-layer weighted sums: $z_j^{(2)} = \sum_i w_{i,j}^{(2)}\, y_i^{(1)}$.

  20. Second-layer activations: $y_j^{(2)} = f_2\big(z_j^{(2)}\big)$.

  21. Third-layer weighted sums: $z_j^{(3)} = \sum_i w_{i,j}^{(3)}\, y_i^{(2)}$.

  22. Third-layer activations: $y_j^{(3)} = f_3\big(z_j^{(3)}\big)$.

  23. Continuing to the output layer: $z_j^{(N)} = \sum_i w_{i,j}^{(N)}\, y_i^{(N-1)}$ and $y_j^{(N)} = f_N\big(z_j^{(N)}\big)$.

  24. Forward Computation. ITERATE FOR k = 1:N, for j = 1:layer-width: compute $z_j^{(k)}$ and $y_j^{(k)}$ as above.

  25. Forward “Pass”
• Input: $D$-dimensional vector $x = [x_1, x_2, \ldots, x_D]$
• Set:
  – $D_0 = D$, the width of the 0th (input) layer
  – $y_j^{(0)} = x_j$ for $j = 1 \ldots D$; $y_0^{(k)} = 1$ for $k = 0 \ldots N-1$
• For layer $k = 1 \ldots N$ ($D_k$ is the size of the kth layer)
  – For $j = 1 \ldots D_k$
    • $z_j^{(k)} = \sum_{i=0}^{D_{k-1}} w_{i,j}^{(k)}\, y_i^{(k-1)}$
    • $y_j^{(k)} = f_k\big(z_j^{(k)}\big)$
• Output:
  – $Y = y^{(N)} = \big[y_j^{(N)},\ j = 1 \ldots D_N\big]$
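A minimal NumPy sketch of this forward pass (an illustrative helper of my own, not the course's reference code). It assumes each weight matrix has shape $D_k \times (D_{k-1}+1)$, with the bias $w_{0,j}^{(k)}$ stored in column 0, matching the bias-as-weight convention above:

```python
import numpy as np

def forward_pass(x, weights, activations):
    """Forward pass; returns all intermediate z's and y's (needed later for backprop).

    weights[k-1]     -- W^(k), shape (D_k, D_{k-1}+1); column 0 holds the biases
    activations[k-1] -- elementwise activation f_k of layer k
    """
    ys = [np.asarray(x, dtype=float)]            # ys[0] = y^(0) = input
    zs = [None]                                  # zs[0] unused; layers run 1..N
    for W, f in zip(weights, activations):
        y_aug = np.concatenate(([1.0], ys[-1]))  # prepend the constant 1 for the bias
        z = W @ y_aug                            # z_j^(k) = sum_i w_{i,j}^(k) y_i^(k-1)
        zs.append(z)
        ys.append(f(z))                          # y_j^(k) = f_k(z_j^(k))
    return zs, ys
```

For example, `zs, ys = forward_pass(x, [W1, W2], [np.tanh, np.tanh])` would run a hypothetical two-layer network and keep every intermediate value.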

  26. Computing derivatives. [Figure: the layered network $y^{(0)} \to z^{(1)} \to y^{(1)} \to \cdots \to z^{(N)} \to y^{(N)}$.] We have computed all these intermediate values in the forward computation. We must remember them – we will need them to compute the derivatives.

  27. Computing derivatives. First, we compute the divergence $\mathrm{Div}(Y, d)$ between the output of the net, $Y = y^{(N)}$, and the desired output $d$.

  28. Computing derivatives. We then compute $\frac{\partial \mathrm{Div}}{\partial y_i^{(N)}}$, the derivative of the divergence w.r.t. the final output of the network, $y^{(N)}$.

  29. Computing derivatives. We then compute $\frac{\partial \mathrm{Div}}{\partial y_i^{(N)}}$, the derivative of the divergence w.r.t. the final output of the network $y^{(N)}$, and then $\frac{\partial \mathrm{Div}}{\partial z_i^{(N)}}$, the derivative of the divergence w.r.t. the pre-activation affine combination $z^{(N)}$, using the chain rule.

  30. Computing derivatives. Continuing on, we will compute $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(N)}}$, the derivative of the divergence with respect to the weights of the connections to the output layer.

  31. Computing derivatives. Continuing on, we will compute $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(N)}}$, the derivative of the divergence with respect to the weights of the connections to the output layer, then continue with the chain rule to compute $\frac{\partial \mathrm{Div}}{\partial y_i^{(N-1)}}$, the derivative of the divergence w.r.t. the output of the (N-1)th layer.

  32. Computing derivatives. We continue our way backwards in the order shown.

  33.–38. We continue our way backwards in the order shown (a sequence of figure frames stepping the computation back through the network one stage at a time).

  39. Backward Gradient Computation • Let's actually see the math.

  40. Computing derivatives. [Figure only: the layered network with $\mathrm{Div}(Y,d)$ computed at the output.]

  41. Computing derivatives. The derivative w.r.t. the actual output of the final layer of the network is simply the derivative w.r.t. the output of the network: $\frac{\partial \mathrm{Div}}{\partial y_i^{(N)}} = \frac{\partial \mathrm{Div}(Y,d)}{\partial y_i}$.

  42. Computing derivatives. [Figure only.]

  43. Computing derivatives. $\frac{\partial \mathrm{Div}}{\partial z_i^{(N)}} = \frac{\partial y_i^{(N)}}{\partial z_i^{(N)}}\,\frac{\partial \mathrm{Div}}{\partial y_i^{(N)}}$ – the second factor is already computed.

  44. Computing derivatives. $\frac{\partial \mathrm{Div}}{\partial z_i^{(N)}} = f_N'\big(z_i^{(N)}\big)\,\frac{\partial \mathrm{Div}}{\partial y_i^{(N)}}$, where $f_N'\big(z_i^{(N)}\big)$ is the derivative of the activation function.

  45. Computing derivatives. $\frac{\partial \mathrm{Div}}{\partial z_i^{(N)}} = f_N'\big(z_i^{(N)}\big)\,\frac{\partial \mathrm{Div}}{\partial y_i^{(N)}}$ – the derivative of the activation function, evaluated at the $z_i^{(N)}$ computed in the forward pass.

  46. Computing derivatives. [Figure only.]

  47. Computing derivatives. [Figure only.]

  48. Computing derivatives. $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(N)}} = \frac{\partial z_j^{(N)}}{\partial w_{i,j}^{(N)}}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(N)}}$.

  49. Computing derivatives. $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(N)}} = \frac{\partial z_j^{(N)}}{\partial w_{i,j}^{(N)}}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(N)}}$ – the second factor was just computed.

  50. Computing derivatives. Because $z_j^{(N)} = \sum_i w_{i,j}^{(N)}\, y_i^{(N-1)}$, we have $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(N)}} = y_i^{(N-1)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(N)}}$.

  51. Computing derivatives. Because $z_j^{(N)} = \sum_i w_{i,j}^{(N)}\, y_i^{(N-1)}$, we have $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(N)}} = y_i^{(N-1)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(N)}}$, where $y_i^{(N-1)}$ was computed in the forward pass.

  52. Computing derivatives. $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(N)}} = y_i^{(N-1)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(N)}}$.

  53. Computing derivatives. For the bias term, $y_0^{(N-1)} = 1$, so $\frac{\partial \mathrm{Div}}{\partial w_{0,j}^{(N)}} = \frac{\partial \mathrm{Div}}{\partial z_j^{(N)}}$.

  54. Computing derivatives. $\frac{\partial \mathrm{Div}}{\partial y_i^{(N-1)}} = \sum_j \frac{\partial z_j^{(N)}}{\partial y_i^{(N-1)}}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(N)}}$.

  55. Computing derivatives. $\frac{\partial \mathrm{Div}}{\partial y_i^{(N-1)}} = \sum_j \frac{\partial z_j^{(N)}}{\partial y_i^{(N-1)}}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(N)}}$ – the second factor is already computed.

  56. Computing derivatives. Because $z_j^{(N)} = \sum_i w_{i,j}^{(N)}\, y_i^{(N-1)}$, we have $\frac{\partial z_j^{(N)}}{\partial y_i^{(N-1)}} = w_{i,j}^{(N)}$, so $\frac{\partial \mathrm{Div}}{\partial y_i^{(N-1)}} = \sum_j w_{i,j}^{(N)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(N)}}$.

  57. Computing derivatives. $\frac{\partial \mathrm{Div}}{\partial y_i^{(N-1)}} = \sum_j w_{i,j}^{(N)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(N)}}$.

  58. Computing derivatives. [Figure frame: the same result, highlighted on the network diagram.]

  59. Computing derivatives. We continue our way backwards in the order shown: $\frac{\partial \mathrm{Div}}{\partial z_i^{(N-1)}} = f_{N-1}'\big(z_i^{(N-1)}\big)\,\frac{\partial \mathrm{Div}}{\partial y_i^{(N-1)}}$.

  60. We continue our way backwards in the order shown. For the bias term, $\frac{\partial \mathrm{Div}}{\partial w_{0,j}^{(N-1)}} = \frac{\partial \mathrm{Div}}{\partial z_j^{(N-1)}}$.

  61. We continue our way backwards in the order shown: $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(N-1)}} = y_i^{(N-2)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(N-1)}}$.

  62. We continue our way backwards in the order shown: $\frac{\partial \mathrm{Div}}{\partial y_i^{(N-2)}} = \sum_j w_{i,j}^{(N-1)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(N-1)}}$.

  63. We continue our way backwards in the order shown, layer by layer, until we reach the first layer: $\frac{\partial \mathrm{Div}}{\partial y_i^{(1)}} = \sum_j w_{i,j}^{(2)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(2)}}$.

  64. We continue our way backwards in the order shown: $\frac{\partial \mathrm{Div}}{\partial z_i^{(1)}} = f_1'\big(z_i^{(1)}\big)\,\frac{\partial \mathrm{Div}}{\partial y_i^{(1)}}$.

  65. We continue our way backwards in the order shown: $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(1)}} = y_i^{(0)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(1)}}$.

  66. Gradients: Backward Computation. [Figure: the chain $z^{(k-1)} \to y^{(k-1)} \to z^{(k)} \to y^{(k)} \to \cdots \to z^{(N)} \to y^{(N)} \to \mathrm{Div}(Y,d)$; the figure assumes, but does not show, the “1” bias nodes.] Initialize: gradient w.r.t. the network output, $\frac{\partial \mathrm{Div}}{\partial y_i^{(N)}} = \frac{\partial \mathrm{Div}(Y,d)}{\partial y_i}$. Then, for each layer $k$: $\frac{\partial \mathrm{Div}}{\partial z_i^{(k)}} = f_k'\big(z_i^{(k)}\big)\frac{\partial \mathrm{Div}}{\partial y_i^{(k)}}$, $\frac{\partial \mathrm{Div}}{\partial y_i^{(k-1)}} = \sum_j w_{i,j}^{(k)}\frac{\partial \mathrm{Div}}{\partial z_j^{(k)}}$, and $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(k)}} = y_i^{(k-1)}\frac{\partial \mathrm{Div}}{\partial z_j^{(k)}}$.

  67. Backward Pass
• Output layer ($N$):
  – For $i = 1 \ldots D_N$
    • $\frac{\partial \mathrm{Div}}{\partial y_i^{(N)}} = \frac{\partial \mathrm{Div}(Y,d)}{\partial y_i}$
    • $\frac{\partial \mathrm{Div}}{\partial z_i^{(N)}} = \frac{\partial \mathrm{Div}}{\partial y_i^{(N)}}\, f_N'\big(z_i^{(N)}\big)$
• For layer $k = N-1$ down to $0$
  – For $i = 1 \ldots D_k$
    • $\frac{\partial \mathrm{Div}}{\partial y_i^{(k)}} = \sum_j w_{i,j}^{(k+1)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(k+1)}}$
    • $\frac{\partial \mathrm{Div}}{\partial z_i^{(k)}} = \frac{\partial \mathrm{Div}}{\partial y_i^{(k)}}\, f_k'\big(z_i^{(k)}\big)$
    • $\frac{\partial \mathrm{Div}}{\partial w_{i,j}^{(k+1)}} = y_i^{(k)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(k+1)}}$ for $j = 1 \ldots D_{k+1}$
  – $\frac{\partial \mathrm{Div}}{\partial w_{0,j}^{(k+1)}} = \frac{\partial \mathrm{Div}}{\partial z_j^{(k+1)}}$ for $j = 1 \ldots D_{k+1}$ (bias terms)

  68. Backward Pass (annotated). Called “Backpropagation” because the derivative of the loss is propagated “backwards” through the network. The computation is very analogous to the forward pass: the sum $\sum_j w_{i,j}^{(k+1)}\,\frac{\partial \mathrm{Div}}{\partial z_j^{(k+1)}}$ is a backward weighted combination of the next layer, and the multiplication by $f_k'\big(z_i^{(k)}\big)$ is the backward equivalent of applying the activation.

  69. Backward Pass. Using the notation $\dot{y} \equiv \frac{\partial \mathrm{Div}(Y,d)}{\partial y}$ etc. (the overdot represents the derivative of Div w.r.t. the variable):
• Output layer ($N$):
  – For $i = 1 \ldots D_N$
    • $\dot{y}_i^{(N)} = \frac{\partial \mathrm{Div}(Y,d)}{\partial y_i}$
    • $\dot{z}_i^{(N)} = \dot{y}_i^{(N)}\, f_N'\big(z_i^{(N)}\big)$
• For layer $k = N-1$ down to $0$:
  – For $i = 1 \ldots D_k$
    • $\dot{y}_i^{(k)} = \sum_j w_{i,j}^{(k+1)}\, \dot{z}_j^{(k+1)}$
    • $\dot{z}_i^{(k)} = \dot{y}_i^{(k)}\, f_k'\big(z_i^{(k)}\big)$
    • $\dot{w}_{i,j}^{(k+1)} = y_i^{(k)}\, \dot{z}_j^{(k+1)}$ for $j = 1 \ldots D_{k+1}$
Called “Backpropagation” because the derivative of the loss is propagated “backwards” through the network. Very analogous to the forward pass: a backward weighted combination of the next layer, followed by the backward equivalent of the activation.
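A minimal NumPy sketch of this backward pass (an illustrative companion to the `forward_pass` sketch above, not the course's reference code; the weight-matrix layout with the bias in column 0 is my assumption):

```python
import numpy as np

def backward_pass(zs, ys, d, weights, activation_derivs, div_grad):
    """Backpropagation using the values (zs, ys) remembered from the forward pass.

    weights[k-1]           -- W^(k), shape (D_k, D_{k-1}+1), bias in column 0
    activation_derivs[k-1] -- elementwise derivative f_k'
    div_grad(Y, d)         -- gradient of Div(Y, d) w.r.t. the network output Y
    Returns one gradient array per weight matrix.
    """
    grads = [None] * len(weights)
    dy = div_grad(ys[-1], d)                         # dDiv/dy^(N)
    for k in range(len(weights), 0, -1):             # layers N, N-1, ..., 1
        dz = dy * activation_derivs[k - 1](zs[k])    # dDiv/dz^(k) = f_k'(z^(k)) * dDiv/dy^(k)
        y_aug = np.concatenate(([1.0], ys[k - 1]))   # augmented y^(k-1), constant 1 for the bias
        grads[k - 1] = np.outer(dz, y_aug)           # dDiv/dW^(k): row j is dz_j * [1, y^(k-1)]
        dy = weights[k - 1][:, 1:].T @ dz            # dDiv/dy^(k-1), skipping the bias column
    return grads
```

The gradients returned here would then be plugged into the gradient-descent update recalled at the start of the lecture.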

  70. For comparison: the forward pass again
• Input: $D$-dimensional vector $x = [x_1, \ldots, x_D]$
• Set:
  – $D_0 = D$, the width of the 0th (input) layer
  – $y_j^{(0)} = x_j$ for $j = 1 \ldots D$; $y_0^{(k)} = 1$ for $k = 0 \ldots N-1$
• For layer $k = 1 \ldots N$ ($D_k$ is the size of the kth layer)
  – For $j = 1 \ldots D_k$
    • $z_j^{(k)} = \sum_{i=0}^{D_{k-1}} w_{i,j}^{(k)}\, y_i^{(k-1)}$
    • $y_j^{(k)} = f_k\big(z_j^{(k)}\big)$
• Output:
  – $Y = y^{(N)}$

  71. Special cases
• Have assumed so far that:
  1. The computation of the output of one neuron does not directly affect computation of other neurons in the same (or previous) layers
  2. Inputs to neurons only combine through weighted addition
  3. Activations are actually differentiable
  – All of these conditions are frequently not applicable
• Will not discuss all of these in class, but they are explained in the slides
  – Will appear in the quiz. Please read the slides

  72. Special Case 1. Vector activations. [Figure: two layers compared – one with scalar activations, one with a vector activation mapping all of $z^{(k)}$ to all of $y^{(k)}$.] • Vector activations: all outputs are functions of all inputs

  73. Special Case 1. Vector activations. Scalar activation: modifying a $z_i^{(k)}$ only changes the corresponding $y_i^{(k)}$. Vector activation: modifying a $z_i^{(k)}$ potentially changes all of $y_1^{(k)}, \ldots, y_{D_k}^{(k)}$.

  74. “Influence” diagram. Scalar activation: each $z_i^{(k)}$ influences only one $y_i^{(k)}$. Vector activation: each $z_i^{(k)}$ influences all of $y_1^{(k)}, \ldots, y_{D_k}^{(k)}$.

  75. The number of outputs • Note: The number of outputs ($y^{(k)}$) need not be the same as the number of inputs ($z^{(k)}$) • May be more or fewer
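The softmax is a familiar vector activation (an illustrative example on my part; the slides do not single out a particular one here): every output depends on every input, so its Jacobian $\partial y_i / \partial z_j$ is a full matrix rather than a diagonal one.

```python
import numpy as np

def softmax(z):
    """Vector activation: every output y_i depends on every input z_j."""
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    """Full Jacobian dy_i/dz_j = y_i * (delta_ij - y_j); not diagonal."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)
```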

  76. Scalar Activation: Derivative rule. [Figure: a layer with scalar activations, $y_i^{(k)} = f_k\big(z_i^{(k)}\big)$.] • In the case of scalar activation functions, the derivative of the error w.r.t. the input to the unit is a simple product of derivatives
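Written out, the rule is one product per unit (the sigmoid below is an illustrative choice of activation, not one fixed by this slide):
$$\frac{\partial \mathrm{Div}}{\partial z_i^{(k)}} = f_k'\big(z_i^{(k)}\big)\,\frac{\partial \mathrm{Div}}{\partial y_i^{(k)}}, \qquad \text{e.g. } f_k = \sigma \;\Rightarrow\; f_k'\big(z_i^{(k)}\big) = y_i^{(k)}\big(1 - y_i^{(k)}\big).$$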
