Calculus Refresher: Basic rules of calculus
• For any differentiable function $y = f(x)$ with derivative $\frac{dy}{dx}$, the following must hold for sufficiently small $\Delta x$: $\Delta y \approx \frac{dy}{dx}\,\Delta x$
• For any differentiable function $y = f(x_1, x_2, \ldots, x_M)$ with partial derivatives $\frac{\partial y}{\partial x_1}, \ldots, \frac{\partial y}{\partial x_M}$, the following must hold for sufficiently small $\Delta x_1, \ldots, \Delta x_M$: $\Delta y \approx \sum_{i} \frac{\partial y}{\partial x_i}\,\Delta x_i$
• Both follow from the definition of the derivative
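This first-order rule is easy to verify numerically. A minimal sketch, assuming $f(x) = x^2$ as an example (the function is not specified on the slide):

```python
# Numeric check of dy ≈ (dy/dx)·dx for an assumed example f(x) = x^2
def f(x):
    return x ** 2

def df(x):
    return 2 * x                # analytic derivative of f

x, dx = 3.0, 1e-4
dy_true = f(x + dx) - f(x)      # actual change in y
dy_approx = df(x) * dx          # first-order prediction
print(dy_true, dy_approx)       # the two agree to within O(dx^2)
```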
Calculus Refresher: Chain rule
• For any nested function $y = f(g(x))$: $\frac{dy}{dx} = \frac{df}{dg}\cdot\frac{dg}{dx}$
• Check – we can confirm that, for sufficiently small $\Delta x$: $\Delta y \approx \frac{df}{dg}\,\Delta g \approx \frac{df}{dg}\,\frac{dg}{dx}\,\Delta x$
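The same kind of numeric check works for the chain rule. A small sketch, assuming $f = \sin$ and $g(x) = x^2$ (illustrative choices, not from the slide):

```python
import math

def g(x):  return x ** 2
def dg(x): return 2 * x
def f(u):  return math.sin(u)
def df(u): return math.cos(u)

x, dx = 1.3, 1e-6
numeric  = (f(g(x + dx)) - f(g(x))) / dx    # finite-difference estimate of dy/dx
analytic = df(g(x)) * dg(x)                 # chain-rule value (df/dg)·(dg/dx)
print(numeric, analytic)                    # should match closely
```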
Calculus Refresher: Distributed Chain rule
• For $y = f\!\left(g_1(x), g_2(x), \ldots, g_M(x)\right)$: $\frac{dy}{dx} = \sum_{i=1}^{M}\frac{\partial f}{\partial g_i}\cdot\frac{dg_i}{dx}$
• Check: let $\Delta g_i = g_i(x + \Delta x) - g_i(x) \approx \frac{dg_i}{dx}\,\Delta x$
Calculus Refresher: Distributed Chain rule
• Check (continued): $\Delta y \approx \sum_{i=1}^{M}\frac{\partial f}{\partial g_i}\,\Delta g_i \approx \sum_{i=1}^{M}\frac{\partial f}{\partial g_i}\,\frac{dg_i}{dx}\,\Delta x$, so $\frac{dy}{dx} = \sum_{i=1}^{M}\frac{\partial f}{\partial g_i}\cdot\frac{dg_i}{dx}$
Distributed Chain Rule: Influence Diagram
[Figure: influence diagram, $x \to g_1(x), \ldots, g_M(x) \to y$]
• $x$ affects $y$ through each of $g_1(x), \ldots, g_M(x)$
Distributed Chain Rule: Influence Diagram
[Figure: influence diagram, $x \to g_1(x), \ldots, g_M(x) \to y$]
• Small perturbations in $x$ cause small perturbations in each of $g_1(x), \ldots, g_M(x)$, each of which individually additively perturbs $y$
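A numeric sketch of this additive-perturbation picture, assuming $y = f(g_1(x), g_2(x))$ with $f(u,v) = uv$, $g_1 = \sin$, and $g_2 = \exp$ (illustrative choices, not from the slide):

```python
import math

def g1(x): return math.sin(x)
def g2(x): return math.exp(x)
def f(u, v): return u * v

x, dx = 0.7, 1e-6
numeric = (f(g1(x + dx), g2(x + dx)) - f(g1(x), g2(x))) / dx
# dy/dx = (∂f/∂g1)·dg1/dx + (∂f/∂g2)·dg2/dx, with ∂f/∂g1 = g2 and ∂f/∂g2 = g1
analytic = g2(x) * math.cos(x) + g1(x) * math.exp(x)
print(numeric, analytic)       # the contributions of the two paths add up
```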
Returning to our problem
• How to compute $\frac{\partial Div(Y,d)}{\partial w_{i,j}^{(k)}}$ for every weight of the network?
A first closer look at the network
• Showing a tiny 2-input network for illustration
  – Actual network would have many more neurons and inputs
A first closer look at the network
[Figure: the tiny 2-input network, each neuron drawn as a summation node "+" followed by an activation $g(\cdot)$]
• Showing a tiny 2-input network for illustration
  – Actual network would have many more neurons and inputs
• Explicitly separating the weighted sum of inputs from the activation
A first closer look at the network
[Figure: the tiny 2-input network with all connection weights $w_{i,j}^{(k)}$ labeled]
• Showing a tiny 2-input network for illustration
  – Actual network would have many more neurons and inputs
• Expanded with all weights shown
• Let's label the other variables too…
Computing the derivative for a single input
[Figure: the fully labeled network, showing the weights $w_{i,j}^{(k)}$, the affine sums $z_i^{(k)}$, the activations $y_i^{(k)}$, and the divergence Div at the output]
Computing the derivative for a single input
[Figure: the fully labeled network, with one weight $w_{i,j}^{(k)}$ highlighted]
• What is $\frac{\partial Div(Y,d)}{\partial w_{i,j}^{(k)}}$?
Computing the gradient
• Note: computation of the derivative $\frac{\partial Div}{\partial w_{i,j}^{(k)}}$ requires intermediate and final output values of the network in response to the input
The “forward pass”
[Figure: layered network $y^{(0)} \to z^{(1)} \to y^{(1)} \to z^{(2)} \to y^{(2)} \to z^{(3)} \to y^{(3)} \to \cdots \to z^{(N-1)} \to y^{(N-1)} \to z^{(N)} \to y^{(N)}$, with output activation $f_N$ and a constant "1" bias node feeding every layer]
• We will refer to the process of computing the output from an input as the forward pass
• We will illustrate the forward pass in the following slides
The “forward pass”
[Figure: the layered network, as above]
• Setting $y^{(0)} = x$ for notational convenience
• Assuming $w_{0,j}^{(k)} = b_j^{(k)}$ and $y_0^{(k)} = 1$ – i.e. treating the bias as a weight and extending the output of every layer by a constant 1, to account for the biases
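This bias-as-weight convention is easy to check in code. A minimal sketch (shapes and names are illustrative assumptions, not the lecture's code):

```python
import numpy as np

# Fold the bias b into the weight matrix as w_{0j} = b_j and prepend a
# constant 1 to the layer input; the affine result is unchanged.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))      # 2 inputs -> 3 outputs
b = rng.standard_normal(3)
y_prev = rng.standard_normal(2)

z_explicit = W @ y_prev + b                    # explicit bias
W_aug = np.hstack([b[:, None], W])             # first column holds the bias
y_aug = np.concatenate([[1.0], y_prev])        # y_0 = 1
print(np.allclose(z_explicit, W_aug @ y_aug))  # True: identical result
```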
The “forward pass”
[Figure: the layered network, as above; the animation steps through the layers one at a time]
The computation proceeds layer by layer:
$z_j^{(1)} = \sum_i w_{i,j}^{(1)}\, y_i^{(0)}, \qquad y_j^{(1)} = f_1\!\left(z_j^{(1)}\right)$
$z_j^{(2)} = \sum_i w_{i,j}^{(2)}\, y_i^{(1)}, \qquad y_j^{(2)} = f_2\!\left(z_j^{(2)}\right)$
$z_j^{(3)} = \sum_i w_{i,j}^{(3)}\, y_i^{(2)}, \qquad y_j^{(3)} = f_3\!\left(z_j^{(3)}\right)$
$\vdots$
$z_j^{(N)} = \sum_i w_{i,j}^{(N)}\, y_i^{(N-1)}, \qquad y_j^{(N)} = f_N\!\left(z_j^{(N)}\right)$
Forward Computation
[Figure: the layered network, as above]
ITERATE for $k = 1:N$, for $j = 1:\text{layer-width}$:
$z_j^{(k)} = \sum_i w_{i,j}^{(k)}\, y_i^{(k-1)}, \qquad y_j^{(k)} = f_k\!\left(z_j^{(k)}\right)$
Forward “Pass”
• Input: $D$-dimensional vector $x = \left[x_j,\ j = 1 \ldots D\right]$
• Set:
  – $D_0 = D$, the width of the 0th (input) layer
  – $y_j^{(0)} = x_j,\ j = 1 \ldots D$;  $y_0^{(k)} = 1$ for $k = 0 \ldots N$
• For layer $k = 1 \ldots N$:
  – For $j = 1 \ldots D_k$ ($D_k$ is the size of the $k$th layer):
    • $z_j^{(k)} = \sum_{i=0}^{D_{k-1}} w_{i,j}^{(k)}\, y_i^{(k-1)}$
    • $y_j^{(k)} = f_k\!\left(z_j^{(k)}\right)$
• Output:
  – $Y = y^{(N)} = \left[y_j^{(N)},\ j = 1 \ldots D_N\right]$
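A minimal Python sketch of this pass, keeping the intermediate $z$ and $y$ values that the backward pass will need. The weight, bias, and activation containers are assumed conventions, not the lecture's reference code:

```python
import numpy as np

def forward_pass(x, W, b, f):
    """Return the network output and the cached z, y values of every layer."""
    y = [np.asarray(x, dtype=float)]               # y[0] = input
    z = [None]                                     # z[0] unused
    for k in range(1, len(W) + 1):
        z.append(W[k - 1] @ y[k - 1] + b[k - 1])   # affine combination z^(k)
        y.append(f[k - 1](z[k]))                   # activation y^(k) = f_k(z^(k))
    return y[-1], z, y                             # keep z, y for the backward pass

# Example: a 2-3-1 network with tanh hidden units and a linear output
W = [np.ones((3, 2)), np.ones((1, 3))]
b = [np.zeros(3), np.zeros(1)]
f = [np.tanh, lambda v: v]
out, z_cache, y_cache = forward_pass(np.array([1.0, -2.0]), W, b, f)
print(out)
```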
Computing derivatives
[Figure: the layered network, as above]
• We have computed all these intermediate values in the forward computation
• We must remember them – we will need them to compute the derivatives
Computing derivatives
[Figure: the layered network, with the divergence $Div(Y,d)$ attached to the output]
• First, we compute the divergence $Div(Y,d)$ between the output of the net, $Y = y^{(N)}$, and the desired output $d$
Computing derivatives
[Figure: the layered network with $Div(Y,d)$, as above]
• We then compute $\frac{\partial Div}{\partial y^{(N)}}$, the derivative of the divergence w.r.t. the final output of the network $y^{(N)}$
Computing derivatives
[Figure: the layered network with $Div(Y,d)$, as above]
• We then compute $\frac{\partial Div}{\partial y^{(N)}}$, the derivative of the divergence w.r.t. the final output of the network $y^{(N)}$
• Next, using the chain rule, we compute $\frac{\partial Div}{\partial z^{(N)}}$, the derivative of the divergence w.r.t. the pre-activation affine combination $z^{(N)}$
Computing derivatives
[Figure: the layered network with $Div(Y,d)$, as above]
• Continuing on, we will compute $\frac{\partial Div}{\partial w_{i,j}^{(N)}}$, the derivative of the divergence with respect to the weights of the connections to the output layer
Computing derivatives
[Figure: the layered network with $Div(Y,d)$, as above]
• Continuing on, we will compute $\frac{\partial Div}{\partial w_{i,j}^{(N)}}$, the derivative of the divergence with respect to the weights of the connections to the output layer
• We then continue with the chain rule to compute $\frac{\partial Div}{\partial y^{(N-1)}}$, the derivative of the divergence w.r.t. the output of the $(N-1)$th layer
Computing derivatives
[Figure: the layered network with $Div(Y,d)$, as above]
• We continue our way backwards in the order shown
Backward Gradient Computation
• Let's actually see the math…
Computing derivatives
[Figure: the layered network with $Div(Y,d)$, as above]
• The derivative w.r.t. the actual output of the final layer of the network is simply the derivative w.r.t. the output of the network:
$\frac{\partial Div}{\partial y_i^{(N)}} = \frac{\partial Div(Y,d)}{\partial y_i}$
Computing derivatives
[Figure: the layered network with $Div(Y,d)$, as above]
• Next, the derivative w.r.t. the pre-activation value of the output layer, via the chain rule:
$\frac{\partial Div}{\partial z_i^{(N)}} = f_N'\!\left(z_i^{(N)}\right)\frac{\partial Div}{\partial y_i^{(N)}}$
  – $\frac{\partial Div}{\partial y_i^{(N)}}$ was already computed
  – $f_N'\!\left(z_i^{(N)}\right)$ is the derivative of the activation function, evaluated at $z_i^{(N)}$, which was computed in the forward pass
Computing derivatives
[Figure: the layered network with $Div(Y,d)$, as above]
• Next, the derivative w.r.t. the weights of the connections to the output layer:
$\frac{\partial Div}{\partial w_{i,j}^{(N)}} = \frac{\partial z_j^{(N)}}{\partial w_{i,j}^{(N)}}\,\frac{\partial Div}{\partial z_j^{(N)}} = y_i^{(N-1)}\,\frac{\partial Div}{\partial z_j^{(N)}}$
  – $\frac{\partial Div}{\partial z_j^{(N)}}$ was just computed
  – Because $z_j^{(N)} = \sum_i w_{i,j}^{(N)}\, y_i^{(N-1)}$, we have $\frac{\partial z_j^{(N)}}{\partial w_{i,j}^{(N)}} = y_i^{(N-1)}$, which was computed in the forward pass
  – For the bias term, $y_0^{(N-1)} = 1$, so $\frac{\partial Div}{\partial w_{0,j}^{(N)}} = \frac{\partial Div}{\partial z_j^{(N)}}$
Computing derivatives
[Figure: the layered network with $Div(Y,d)$, as above]
• Then, by the distributed chain rule, the derivative w.r.t. the output of the $(N-1)$th layer:
$\frac{\partial Div}{\partial y_i^{(N-1)}} = \sum_j \frac{\partial z_j^{(N)}}{\partial y_i^{(N-1)}}\,\frac{\partial Div}{\partial z_j^{(N)}}$
  – $\frac{\partial Div}{\partial z_j^{(N)}}$ was already computed
  – Because $z_j^{(N)} = \sum_i w_{i,j}^{(N)}\, y_i^{(N-1)}$, we have $\frac{\partial z_j^{(N)}}{\partial y_i^{(N-1)}} = w_{i,j}^{(N)}$
• Hence $\frac{\partial Div}{\partial y_i^{(N-1)}} = \sum_j w_{i,j}^{(N)}\,\frac{\partial Div}{\partial z_j^{(N)}}$
Computing derivatives
[Figure: the layered network with $Div(Y,d)$, as above; the backward computation steps through layers $N-1, N-2, \ldots, 1$]
• We continue our way backwards in the order shown; for each layer $k = N-1, N-2, \ldots, 1$ the same pattern repeats:
$\frac{\partial Div}{\partial z_i^{(k)}} = f_k'\!\left(z_i^{(k)}\right)\frac{\partial Div}{\partial y_i^{(k)}}$
$\frac{\partial Div}{\partial w_{i,j}^{(k)}} = y_i^{(k-1)}\,\frac{\partial Div}{\partial z_j^{(k)}}$ (for the bias term, $y_0^{(k-1)} = 1$, so $\frac{\partial Div}{\partial w_{0,j}^{(k)}} = \frac{\partial Div}{\partial z_j^{(k)}}$)
$\frac{\partial Div}{\partial y_i^{(k-1)}} = \sum_j w_{i,j}^{(k)}\,\frac{\partial Div}{\partial z_j^{(k)}}$
Gradients: Backward Computation
[Figure: layered network $z^{(k-1)} \to y^{(k-1)} \to z^{(k)} \to y^{(k)} \to \cdots \to z^{(N-1)} \to y^{(N-1)} \to z^{(N)} \to y^{(N)} \to Div(Y,d)$; the figure assumes, but does not show, the "1" bias nodes]
Initialize: gradient w.r.t. the network output
$\frac{\partial Div}{\partial y_i^{(N)}} = \frac{\partial Div(Y,d)}{\partial y_i}$
For $k = N, N-1, \ldots, 1$:
$\frac{\partial Div}{\partial z_i^{(k)}} = f_k'\!\left(z_i^{(k)}\right)\frac{\partial Div}{\partial y_i^{(k)}}$
$\frac{\partial Div}{\partial w_{i,j}^{(k)}} = y_i^{(k-1)}\,\frac{\partial Div}{\partial z_j^{(k)}}$
$\frac{\partial Div}{\partial y_i^{(k-1)}} = \sum_j w_{i,j}^{(k)}\,\frac{\partial Div}{\partial z_j^{(k)}}$
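The per-component rules above can also be stacked into vectors and matrices. This compact form is not written on the slide; it assumes a layout in which $z^{(k)} = W^{(k)} y^{(k-1)} + b^{(k)}$ and scalar (elementwise) activations:

```latex
\nabla_{z^{(k)}} Div = f_k'\!\left(z^{(k)}\right) \odot \nabla_{y^{(k)}} Div,
\qquad
\nabla_{W^{(k)}} Div = \left(\nabla_{z^{(k)}} Div\right)\left(y^{(k-1)}\right)^{\top},
\qquad
\nabla_{y^{(k-1)}} Div = \left(W^{(k)}\right)^{\top} \nabla_{z^{(k)}} Div
```

Here $\odot$ denotes elementwise multiplication.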
Backward Pass
• Output layer ($N$):
  – For $i = 1 \ldots D_N$:
    • $\frac{\partial Div}{\partial y_i^{(N)}} = \frac{\partial Div(Y,d)}{\partial y_i}$
    • $\frac{\partial Div}{\partial z_i^{(N)}} = \frac{\partial Div}{\partial y_i^{(N)}}\, f_N'\!\left(z_i^{(N)}\right)$
• For layer $k = N-1$ down to $0$:
  – For $i = 1 \ldots D_k$:
    • $\frac{\partial Div}{\partial y_i^{(k)}} = \sum_j w_{i,j}^{(k+1)}\,\frac{\partial Div}{\partial z_j^{(k+1)}}$
    • $\frac{\partial Div}{\partial z_i^{(k)}} = \frac{\partial Div}{\partial y_i^{(k)}}\, f_k'\!\left(z_i^{(k)}\right)$
    • $\frac{\partial Div}{\partial w_{i,j}^{(k+1)}} = y_i^{(k)}\,\frac{\partial Div}{\partial z_j^{(k+1)}}$ for $j = 1 \ldots D_{k+1}$
Backward Pass
• Called “Backpropagation” because the derivative of the loss is propagated “backwards” through the network
• Very analogous to the forward pass:
  – $\frac{\partial Div}{\partial y_i^{(k)}} = \sum_j w_{i,j}^{(k+1)}\,\frac{\partial Div}{\partial z_j^{(k+1)}}$ is a backward weighted combination of the next layer
  – $\frac{\partial Div}{\partial z_i^{(k)}} = \frac{\partial Div}{\partial y_i^{(k)}}\, f_k'\!\left(z_i^{(k)}\right)$ is the backward equivalent of the activation
Backward Pass
Using the notation $\dot{y} \equiv \frac{\partial Div(Y,d)}{\partial y}$ etc. (the overdot represents the derivative of the divergence w.r.t. the variable):
• Output layer ($N$):
  – For $i = 1 \ldots D_N$:
    • $\dot{y}_i^{(N)} = \frac{\partial Div(Y,d)}{\partial y_i}$
    • $\dot{z}_i^{(N)} = \dot{y}_i^{(N)}\, f_N'\!\left(z_i^{(N)}\right)$
• For layer $k = N-1$ down to $0$:
  – For $i = 1 \ldots D_k$:
    • $\dot{y}_i^{(k)} = \sum_j w_{i,j}^{(k+1)}\,\dot{z}_j^{(k+1)}$
    • $\dot{z}_i^{(k)} = \dot{y}_i^{(k)}\, f_k'\!\left(z_i^{(k)}\right)$
    • $\frac{\partial Div}{\partial w_{i,j}^{(k+1)}} = y_i^{(k)}\,\dot{z}_j^{(k+1)}$ for $j = 1 \ldots D_{k+1}$
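A minimal, self-contained Python sketch of this backward pass, paired with a tiny forward pass so the cached $z$ and $y$ values are available. The network sizes, activations, and L2 divergence $Div = \tfrac{1}{2}\lVert y^{(N)} - d\rVert^2$ are assumed examples, not the lecture's reference code:

```python
import numpy as np

def backward_pass(dDiv_dy, z, y, W, df):
    """Return per-layer weight and bias gradients of the divergence."""
    N = len(W)
    dW, db = [None] * N, [None] * N
    y_dot = dDiv_dy                              # dDiv/dy^(N)
    for k in range(N, 0, -1):
        z_dot = y_dot * df[k - 1](z[k])          # dDiv/dz^(k) = dDiv/dy^(k) * f_k'(z^(k))
        dW[k - 1] = np.outer(z_dot, y[k - 1])    # dDiv/dw_ij^(k) = y_i^(k-1) * dDiv/dz_j^(k)
        db[k - 1] = z_dot                        # bias term: y_0^(k-1) = 1
        y_dot = W[k - 1].T @ z_dot               # dDiv/dy^(k-1) = sum_j w_ij^(k) dDiv/dz_j^(k)
    return dW, db

# Tiny 2-3-1 tanh network and L2 divergence Div = 0.5 * ||y^(N) - d||^2
W = [np.ones((3, 2)), np.ones((1, 3))]
b = [np.zeros(3), np.zeros(1)]
f  = [np.tanh, lambda v: v]
df = [lambda v: 1.0 - np.tanh(v) ** 2, lambda v: np.ones_like(v)]
x, d = np.array([1.0, -2.0]), np.array([0.5])

# Forward pass, caching z and y, then the backward pass
y_cache, z_cache = [x], [None]
for k in range(1, len(W) + 1):
    z_cache.append(W[k - 1] @ y_cache[k - 1] + b[k - 1])
    y_cache.append(f[k - 1](z_cache[k]))
dW, db = backward_pass(y_cache[-1] - d, z_cache, y_cache, W, df)
print(dW[0].shape, dW[1].shape)   # (3, 2) and (1, 3), matching the weight matrices
```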
For comparison: the forward pass again
• Input: $D$-dimensional vector $x = \left[x_j,\ j = 1 \ldots D\right]$
• Set:
  – $D_0 = D$, the width of the 0th (input) layer
  – $y_j^{(0)} = x_j,\ j = 1 \ldots D$;  $y_0^{(k)} = 1$ for $k = 0 \ldots N$
• For layer $k = 1 \ldots N$:
  – For $j = 1 \ldots D_k$:
    • $z_j^{(k)} = \sum_{i=0}^{D_{k-1}} w_{i,j}^{(k)}\, y_i^{(k-1)}$
    • $y_j^{(k)} = f_k\!\left(z_j^{(k)}\right)$
• Output:
  – $Y = y^{(N)} = \left[y_j^{(N)},\ j = 1 \ldots D_N\right]$
Special cases
• We have assumed so far that:
  1. The computation of the output of one neuron does not directly affect computation of other neurons in the same (or previous) layers
  2. Inputs to neurons only combine through weighted addition
  3. Activations are actually differentiable
  – All of these conditions are frequently not applicable
• Will not discuss all of these in class, but they are explained in the slides
  – Will appear in the quiz. Please read the slides
Special Case 1. Vector activations
[Figure: two layers mapping $y^{(k-1)} \to z^{(k)} \to y^{(k)}$: scalar activations on the left, a vector activation on the right]
• Vector activations: all outputs are functions of all inputs
Special Case 1. Vector activations
[Figure: scalar-activation layer (left) vs. vector-activation layer (right)]
• Scalar activation: modifying a $z_i^{(k)}$ only changes the corresponding $y_i^{(k)}$
• Vector activation: modifying a $z_i^{(k)}$ potentially changes all of $y_1^{(k)}, \ldots, y_{D_k}^{(k)}$
“Influence” diagram
[Figure: influence diagrams for a scalar-activation layer (left) and a vector-activation layer (right)]
• Scalar activation: each $z_i^{(k)}$ influences only one $y_i^{(k)}$
• Vector activation: each $z_i^{(k)}$ influences all of $y_1^{(k)}, \ldots, y_{D_k}^{(k)}$
The number of outputs
[Figure: scalar-activation layer (left) vs. vector-activation layer (right)]
• Note: the number of outputs ($y^{(k)}$) need not be the same as the number of inputs ($z^{(k)}$)
• May be more or fewer
Scalar Activation: Derivative rule
[Figure: a layer $y^{(k-1)} \to z^{(k)} \to y^{(k)}$ with scalar activations]
• In the case of scalar activation functions, the derivative of the error w.r.t. the input to the unit is a simple product of derivatives:
$\frac{\partial Div}{\partial z_i^{(k)}} = f_k'\!\left(z_i^{(k)}\right)\frac{\partial Div}{\partial y_i^{(k)}}$
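A small sketch of the contrast this rule relies on, using sigmoid as the scalar activation and softmax as a vector activation (both assumed examples, not taken from the slides): for a scalar activation the Jacobian $\partial y/\partial z$ is diagonal, so the rule reduces to an elementwise product, while a vector activation requires a full Jacobian:

```python
import numpy as np

z = np.array([0.5, -1.0, 2.0])
dDiv_dy = np.array([0.1, -0.3, 0.2])               # assumed upstream derivative

# Scalar activation (sigmoid): y_i depends only on z_i -> elementwise product
y_sig = 1.0 / (1.0 + np.exp(-z))
dDiv_dz_scalar = dDiv_dy * y_sig * (1.0 - y_sig)

# Vector activation (softmax): every y_i depends on every z_j -> full Jacobian
y_soft = np.exp(z) / np.exp(z).sum()
J = np.diag(y_soft) - np.outer(y_soft, y_soft)     # J[i, j] = dy_i/dz_j
dDiv_dz_vector = J.T @ dDiv_dy

print(dDiv_dz_scalar, dDiv_dz_vector)
```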