BBM406 Fundamentals of Machine Learning
Lecture 12: Computational Graph, Backpropagation
Aykut Erdem // Hacettepe University // Fall 2019
Illustration: 3Blue1Brown
Last time… Multilayer Perceptron
• Layer representation: $y_i = W_i x_i$, $x_{i+1} = \sigma(y_i)$
• (typically) iterate between a linear mapping $Wx$ and a nonlinear function
• Loss function $l(y, y_i)$ to measure quality of estimate so far
[figure: a stack of layers x1 → x2 → x3 → x4 → y with weights W1 … W4]
slide by Alex Smola
Last time… Forward Pass
• Output of the network can be written as:
$$h_j(x) = f\Big(v_{j0} + \sum_{i=1}^{D} x_i v_{ji}\Big), \qquad o_k(x) = g\Big(w_{k0} + \sum_{j=1}^{J} h_j(x)\, w_{kj}\Big)$$
($j$ indexing hidden units, $k$ indexing the output units, $D$ number of inputs)
• Activation functions $f$, $g$: sigmoid/logistic, tanh, or rectified linear (ReLU)
$$\sigma(z) = \frac{1}{1 + \exp(-z)}, \qquad \tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}, \qquad \mathrm{ReLU}(z) = \max(0, z)$$
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
Last time… Forward Pass in Python
• Example code for a forward pass for a 3-layer network in Python (a sketch is given below)
• Can be implemented efficiently using matrix operations
• Example above: $W_1$ is a matrix of size 4 × 3, $W_2$ is 4 × 4. What about biases and $W_3$?
[http://cs231n.github.io/neural-networks-1/]
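The code on the original slide is an image and isn't in the extracted text; below is a minimal sketch consistent with the cs231n page it cites and the dimensions stated above. The random initializations are for illustration only, and the sketch answers the slide's question: each layer adds a bias after its matrix multiply, and $W_3$ maps the last hidden layer to the output (here 1 × 4).

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))            # activation function (sigmoid)

x = np.random.randn(3, 1)                          # input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))   # first layer: 4x3 weights, 4x1 bias
W2, b2 = np.random.randn(4, 4), np.zeros((4, 1))   # second layer: 4x4 weights, 4x1 bias
W3, b3 = np.random.randn(1, 4), np.zeros((1, 1))   # output layer: 1x4 weights, 1x1 bias

h1 = f(np.dot(W1, x) + b1)                         # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)                        # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3                          # output (1x1)
```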
Backpropagation
Recap: Loss function/Optimization
We defined a (linear) score function: $f(x, W) = Wx$
TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)
[figure: three example images with their per-class scores]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Softmax Classifier (Multinomial Logistic Regression)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
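The equations these build slides step through survive only as images; as a sketch of what a softmax classifier computes, the scores are exponentiated and normalized into class probabilities, $P(Y = k \mid X = x_i) = e^{s_k} / \sum_j e^{s_j}$, and the loss for an example is the negative log probability of its correct class. A minimal NumPy version (the max-shift is a standard numerical-stability trick, and the example scores are hypothetical):

```python
import numpy as np

def softmax_loss(scores, correct_class):
    """Cross-entropy loss for one example, given its unnormalized class scores."""
    shifted = scores - np.max(scores)                   # shift for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))   # softmax: class probabilities
    return -np.log(probs[correct_class])                # L_i = -log P(Y = y_i | X = x_i)

# hypothetical scores for three classes; the correct class is class 0
print(softmax_loss(np.array([3.2, 5.1, -1.7]), correct_class=0))  # ≈ 2.04
```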
Optimization
Gradient Descent
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
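The training loop on this slide is an image; a minimal sketch of vanilla gradient descent, where `evaluate_gradient`, `loss_fun`, `data`, `weights`, and `step_size` are assumed placeholders:

```python
# Vanilla gradient descent (sketch): repeatedly step downhill along the gradient.
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)  # gradient over the full training set
    weights += -step_size * weights_grad                       # parameter update
```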
Mini-batch Gradient Descent
• only use a small portion of the training set to compute the gradient
• there are also more fancy update formulas (momentum, Adagrad, RMSProp, Adam, …)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
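A sketch of the mini-batch variant, plus the momentum update as one example of the fancier formulas; `sample_training_data`, the batch size of 256, and `mu` are assumptions for illustration:

```python
# Mini-batch gradient descent (sketch)
while True:
    data_batch = sample_training_data(data, 256)   # use only a small batch of examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad

# e.g. the momentum update replaces the direct step with a running "velocity":
#   v = mu * v - step_size * weights_grad          # mu ~ 0.9 acts like friction
#   weights += v
```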
The effects of different update formulas
(image credits to Alec Radford)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Computational Graph
[figure: inputs x and W feed a * node producing s (scores); a hinge loss node and a regularization term R feed a + node producing the loss L]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
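As a concrete reading of this graph, here is a sketch using the multiclass SVM hinge loss the node labels suggest; `reg` and the margin of 1 are assumed hyperparameters:

```python
import numpy as np

def graph_loss(W, x, y, reg=0.1):
    s = W.dot(x)                               # * node: class scores
    margins = np.maximum(0.0, s - s[y] + 1.0)  # hinge loss node (margin = 1)
    margins[y] = 0.0                           # the correct class contributes no margin
    data_loss = np.sum(margins)
    R = reg * np.sum(W * W)                    # regularization node R(W)
    return data_loss + R                       # + node: total loss L
```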
Convolutional Network (AlexNet)
[figure: the AlexNet computational graph, from input image and weights to loss]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Neural Turing Machine
[figure: the Neural Turing Machine computational graph, from input tape to loss]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
A simple example
e.g. x = −2, y = 5, z = −4
Want: the gradients of f with respect to x, y, and z
Chain rule: the gradient along a path is the product of the local gradients, e.g. ∂f/∂x = (∂f/∂q)(∂q/∂x) for an intermediate node q
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
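The function itself appears only in the slide images; assuming the standard cs231n example $f(x, y, z) = (x + y)\,z$ with intermediate $q = x + y$, the two passes with these values work out as:

```python
# forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y              # intermediate node: q = 3
f = q * z              # f = -12

# backward pass, in reverse order, applying the chain rule at each node
df_dz = q              # ∂f/∂z = q                  -> 3
df_dq = z              # ∂f/∂q = z                  -> -4
df_dx = df_dq * 1.0    # ∂f/∂x = (∂f/∂q)(∂q/∂x)     -> -4
df_dy = df_dq * 1.0    # ∂f/∂y = (∂f/∂q)(∂q/∂y)     -> -4
```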
[figure: a generic gate f in the graph; activations flow forward through it, and on the backward pass the gradient arriving from above is multiplied by the gate's "local gradient" to produce the gradients passed back to its inputs]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Another example:
[figure: a larger circuit, stepped through gate by gate; each backward step computes (local gradient) × (gradient from above), e.g. (−1) × (−0.20) = 0.20]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
[local gradient] × [its gradient]
At the add gate: [1] × [0.2] = 0.2 and [1] × [0.2] = 0.2 (both inputs! the add gate's local gradient is 1 for each input)
At the multiply gate: x0: [2] × [0.2] = 0.4, w0: [−1] × [0.2] = −0.2 (each input's local gradient is the other input's value)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$, with derivative $\frac{d\sigma}{dx} = (1 - \sigma(x))\,\sigma(x)$
sigmoid gate: the whole group of gates can be collapsed into a single sigmoid gate, whose local gradient here is (0.73) × (1 − 0.73) = 0.2
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
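A sketch of this neuron's forward and backward pass; the extracted text keeps only a few of the numbers (0.73, 0.2, 0.4, −0.2), so the input values below (w0 = 2, x0 = −1, w1 = −3, x1 = −2, w2 = −3) are an assumption chosen to reproduce them:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# forward pass: f(w, x) = sigmoid(w0*x0 + w1*x1 + w2)
w0, x0 = 2.0, -1.0                # assumed example values
w1, x1 = -3.0, -2.0
w2 = -3.0
dot = w0 * x0 + w1 * x1 + w2      # = 1.0
f = sigmoid(dot)                  # = 0.73

# backward pass: the sigmoid gate's local gradient is (1 - f) * f
ddot = (1.0 - f) * f              # (0.73) * (1 - 0.73) = 0.2
dx0, dw0 = w0 * ddot, x0 * ddot   # 0.4 and -0.2 (multiply gate: other input's value)
dx1, dw1 = w1 * ddot, x1 * ddot   # -0.6 and -0.4
dw2 = 1.0 * ddot                  # 0.2 (add gate passes the gradient through)
```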
Patterns in backward flow
• add gate: gradient distributor
• max gate: gradient router
• mul gate: gradient… "switcher"?
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Gradients add at branches
[figure: a value feeding two downstream branches; on the backward pass the gradients arriving from the branches are summed at the +]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
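A small numerical sketch (not from the slides) of all of these patterns at once, on the hypothetical function f(x, y) = max(x, y)·x + (x + y), where x branches into three uses:

```python
# forward pass
x, y = 3.0, 2.0
m = max(x, y)                     # max gate
p = m * x                         # mul gate
s = x + y                         # add gate
f = p + s

# backward pass
df = 1.0
dp, ds = df, df                   # add gate: distributes the gradient to both inputs
dm, dx_mul = x * dp, m * dp       # mul gate: scales by the *other* input's value
dx_max = dm if x > y else 0.0     # max gate: routes the gradient to the larger input
dy_max = dm if y >= x else 0.0
dx_add, dy_add = ds, ds           # add gate: distributes again
dx = dx_max + dx_mul + dx_add     # gradients add at branches: x was used three times
dy = dy_max + dy_add
# check: for x > y, f = x*x + x + y, so df/dx = 2x + 1 = 7 and df/dy = 1
assert dx == 7.0 and dy == 1.0
```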
Implementation: forward/backward API
Graph (or Net) object. (Rough pseudo code)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
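The pseudo code on the slide is an image; below is a rough sketch in the same spirit (and, like the slide says, pseudo code rather than a runnable implementation), where the graph object and its node-iteration method are assumed names:

```python
class ComputationalGraph(object):
    """Rough pseudo code: gates are stored in topological order."""
    def forward(self, inputs):
        # 1. pass the inputs to the input gates ...
        # 2. forward the computational graph:
        for gate in self.graph.nodes_topologically_sorted():
            gate.forward()
        return self.loss  # the final gate in the graph outputs the loss

    def backward(self):
        # visit the gates in reverse order; each applies its piece of the chain rule
        for gate in reversed(self.graph.nodes_topologically_sorted()):
            gate.backward()
        return self.inputs_gradients
```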
Implementation: forward/backward API
[figure: a single multiply gate with inputs x, y and output z] (x, y, z are scalars)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
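A minimal sketch of the gate object this slide describes; the class and variable names are assumptions in the cs231n style:

```python
class MultiplyGate:
    """One node in the graph: z = x * y (x, y, z are scalars)."""
    def forward(self, x, y):
        self.x, self.y = x, y   # cache the inputs; the backward pass needs them
        return x * y

    def backward(self, dz):
        # chain rule: local gradient times the gradient flowing in from above
        dx = self.y * dz        # ∂z/∂x = y
        dy = self.x * dz        # ∂z/∂y = x
        return dx, dy

# e.g. gate = MultiplyGate(); z = gate.forward(3.0, -4.0)   -> -12.0
#      dx, dy = gate.backward(1.0)                          -> (-4.0, 3.0)
```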