Lecture 12: Computational Graph and Backpropagation
Aykut Erdem, March 2016, Hacettepe University
Administrative
• Assignment 2 is due March 20, 2016!
• Midterm exam on Thursday, March 24, 2016
  − You are responsible for the material from the beginning of the course up to the end of this class
  − You may prepare and bring a one-page cheat sheet (A4 paper, both sides) to the exam.
• Assignment 3 will be out soon!
  − It is due April 7, 2016
  − You will implement a 2-layer Neural Network
Last time… Multilayer Perceptron (slide by Alex Smola)
• Layer representation: y_i = W_i x_i, x_{i+1} = σ(y_i)
• (typically) iterate between a linear mapping Wx and a nonlinear function
• A loss function l(y, y_i) measures the quality of the estimate so far
Last time… Forward Pass (slide by Raquel Urtasun, Richard Zemel, Sanja Fidler)
• Output of the network can be written as:
  h_j(x) = f( v_{j0} + \sum_{i=1}^{D} x_i v_{ji} )
  o_k(x) = g( w_{k0} + \sum_{j=1}^{J} h_j(x) w_{kj} )
  (j indexes the hidden units, k indexes the output units, D is the number of inputs)
• Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)
  σ(z) = 1 / (1 + exp(−z)),  tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)),  ReLU(z) = max(0, z)
Last time… Forward Pass in Python (slide by Raquel Urtasun, Richard Zemel, Sanja Fidler)
• Example code for a forward pass of a 3-layer network in Python (shown on the slide; a sketch is given below)
• Can be implemented efficiently using matrix operations
• In the example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?
[http://cs231n.github.io/neural-networks-1/]
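The code itself is not reproduced in this transcript; a minimal sketch in the spirit of the cited CS231n notes, with the sizes from the slide (W1 is 4 × 3, W2 is 4 × 4) and the remaining shapes assumed, is:

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))      # sigmoid activation function

# weights and biases (random for illustration; W3 maps the 4 hidden units to 1 output)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

x   = np.random.randn(3, 1)                 # input vector (3x1)
h1  = f(np.dot(W1, x) + b1)                 # first hidden layer activations (4x1)
h2  = f(np.dot(W2, h1) + b2)                # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3                   # output (1x1)
```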
Today • Backpropagation and Neural Networks • Tips and Tricks 6
Backpropagation and Neural Networks 7
Recap: Loss function / Optimization (slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
We defined a (linear) score function; the figure on the slide shows example class scores for a few training images.
TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function (optimization).
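As a reminder, the linear score function computes one score per class; a minimal sketch (the shapes are illustrative assumptions, CIFAR-10 style with the bias folded into W):

```python
import numpy as np

W = np.random.randn(10, 3073)    # one row of weights per class
x = np.random.randn(3073, 1)     # flattened input image, with a bias dimension appended
scores = W.dot(x)                # 10 class scores: f(x, W) = W x
```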
Softmax Classifier (Multinomial Logistic Regression) (slides by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
A sequence of slides walks through the softmax classifier step by step; the figures are not reproduced in this transcript. A sketch of the computation follows.
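To make the omitted walk-through concrete, here is a minimal sketch of the softmax cross-entropy loss for a single example (the scores and correct-class index here are illustrative):

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])    # unnormalized class scores (interpreted as log probabilities)
y = 0                                   # index of the correct class

shifted = scores - np.max(scores)       # subtract the max for numerical stability
probs = np.exp(shifted) / np.sum(np.exp(shifted))   # softmax: normalized class probabilities
loss = -np.log(probs[y])                # negative log probability of the correct class (~2.04 here)
```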
Optimization 19
Gradient Descent (slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
Mini-batch Gradient Descent (slides by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
• Only use a small portion (a mini-batch) of the training set to compute each gradient step
• There are also fancier update formulas (momentum, Adagrad, RMSProp, Adam, …); a sketch of the basic loop is given below
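A minimal sketch of the vanilla mini-batch update; the data-sampling details and the loss_grad function are placeholders, not code from the lecture:

```python
import numpy as np

def sgd_step(weights, data, labels, loss_grad, batch_size=256, learning_rate=1e-3):
    """One vanilla mini-batch gradient descent update.

    loss_grad(weights, x_batch, y_batch) is assumed to return dL/dweights.
    """
    idx = np.random.choice(len(data), batch_size, replace=False)   # sample a mini-batch
    grad = loss_grad(weights, data[idx], labels[idx])               # gradient on the batch only
    return weights - learning_rate * grad                           # parameter update
```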
The effects of different update formulas (image credits to Alec Radford; slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
Back-propagation 26
Computational Graph (slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
The slide draws the linear classifier as a graph: x and W feed a multiply node producing the scores s, the scores feed a hinge loss node, and a regularization term R(W) is added to give the total loss L.
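As a concrete, illustrative instance of this graph, a forward pass with multiclass hinge (SVM) loss and L2 regularization could look as follows; the shapes and the regularization strength are assumptions, not values from the slide:

```python
import numpy as np

x = np.random.randn(3073)               # input (flattened image, bias trick)
W = np.random.randn(10, 3073) * 0.01    # weights
y = 3                                   # correct class index
reg = 1e-3                              # regularization strength (assumed)

s = W.dot(x)                                  # scores node: s = W x
margins = np.maximum(0, s - s[y] + 1.0)       # hinge loss node (margin of 1)
margins[y] = 0.0
data_loss = np.sum(margins)
R = reg * np.sum(W * W)                       # regularization node R(W)
L = data_loss + R                             # total loss node
```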
Convolutional Network (AlexNet): a computational graph from the input image and weights to the loss (figure omitted). (slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
Neural Turing Machine: a computational graph from the input tape to the loss (figure omitted). (slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
A simple worked example (slides by Fei-Fei Li & Andrej Karpathy & Justin Johnson): for inputs x = -2, y = 5, z = -4, the slides step node by node through the forward pass of a small expression graph and then, using the chain rule, compute the gradient of the output with respect to x, y, and z. The figures for the individual steps are not reproduced here; a sketch of the computation follows.
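The expression itself is not legible in this transcript; assuming the standard example f(x, y, z) = (x + y) z used in the CS231n lecture these slides are based on, the step-by-step computation is:

```python
# assumed example: f(x, y, z) = (x + y) * z, evaluated at x = -2, y = 5, z = -4
x, y, z = -2.0, 5.0, -4.0

# forward pass, one node at a time
q = x + y           # add node:      q = 3
f = q * z           # multiply node: f = -12

# backward pass (chain rule), starting from df/df = 1
dfdz = q            # d(q*z)/dz = q                      -> 3
dfdq = z            # d(q*z)/dq = z                      -> -4
dfdx = dfdq * 1.0   # chain rule: df/dx = df/dq * dq/dx  -> -4
dfdy = dfdq * 1.0   # chain rule: df/dy = df/dq * dq/dy  -> -4
```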
A generic gate f in the graph (slides by Fei-Fei Li & Andrej Karpathy & Justin Johnson): during the forward pass each gate computes its output from its input activations and can also compute its "local gradient", the derivative of its output with respect to each input. During the backward pass, the gradient arriving from above is multiplied by the local gradient (chain rule) and passed on to the gate's inputs.
Another example (slides by Fei-Fei Li & Andrej Karpathy & Justin Johnson): the slides backpropagate gate by gate through a larger circuit, multiplying the incoming gradient by each gate's local gradient. The individual figures are not reproduced here; the legible steps are:
• (-1) * (-0.20) = 0.20
• [local gradient] x [its gradient]: [1] x [0.2] = 0.2 and [1] x [0.2] = 0.2 (both inputs!)
• [local gradient] x [its gradient]: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2
Sigmoid function and sigmoid gate (slides by Fei-Fei Li & Andrej Karpathy & Justin Johnson): the last part of the circuit is exactly the sigmoid function σ(x) = 1 / (1 + e^{-x}), whose derivative has the simple form dσ/dx = (1 − σ(x)) σ(x). Grouping those gates into a single sigmoid gate gives the same gradient in one step: (0.73) * (1 − 0.73) = 0.2.
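The full circuit is not legible in this transcript; assuming it is the 2-input sigmoid neuron f(w, x) = 1 / (1 + e^{-(w0·x0 + w1·x1 + w2)}) from the CS231n lecture these slides follow, with the input values w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3 assumed, the backward pass that produces the numbers above is:

```python
import math

# assumed inputs (from the CS231n version of this example)
w = [2.0, -3.0, -3.0]    # w0, w1, w2 (w2 acts as the bias)
x = [-1.0, -2.0]         # x0, x1

# forward pass
dot = w[0] * x[0] + w[1] * x[1] + w[2]    # = 1.0
f = 1.0 / (1.0 + math.exp(-dot))          # sigmoid output = 0.73

# backward pass using the sigmoid shortcut: dsigma/ddot = (1 - f) * f
ddot = (1.0 - f) * f                      # = 0.73 * 0.27 ≈ 0.2 (gradient on the summed input)
dx = [w[0] * ddot, w[1] * ddot]           # ≈ [0.4, -0.6]  (multiply gate: local grad is the other input)
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]   # ≈ [-0.2, -0.4, 0.2]
```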
Patterns in backward flow (slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
• add gate: gradient distributor (passes the upstream gradient unchanged to all inputs)
• max gate: gradient router (routes the upstream gradient to the input that was the max)
• mul gate: gradient "switcher"? (each input receives the upstream gradient scaled by the other input)
Gradients add at branches: when a variable feeds into several parts of the graph, the gradients flowing back along the different branches are summed. (slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
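In code this shows up as accumulating into a gradient with += rather than overwriting it; a tiny illustrative sketch (not from the lecture):

```python
# x is used twice: f(x) = x * x, so x branches into both inputs of the multiply gate
x = 3.0
f = x * x            # forward pass

df = 1.0             # upstream gradient on f
dx = 0.0
dx += x * df         # gradient flowing back along the first branch
dx += x * df         # gradient flowing back along the second branch
# dx == 6.0 == d(x^2)/dx: the two branch gradients are summed
```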
Implementation: forward/backward API. A Graph (or Net) object orchestrates the gates; rough pseudo code is shown on the slide (a sketch is given below). (slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
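The pseudo code on the slide is not reproduced here; a sketch in the same spirit (names such as nodes_topologically_sorted are illustrative pseudo code, not a real library API) is:

```python
class ComputationalGraph(object):
    """Rough pseudo code for a Graph/Net object (illustrative only)."""

    def forward(self, inputs):
        # pass inputs to the input gates, then run every gate in topological order
        for gate in self.nodes_topologically_sorted():
            gate.forward()
        return self.loss              # the final gate in the graph outputs the loss

    def backward(self):
        # visit gates in reverse order; each applies a little piece of the chain rule
        for gate in reversed(self.nodes_topologically_sorted()):
            gate.backward()
        return self.inputs_gradients
```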
Implementation: forward/backward API, for a single multiply gate z = x * y, where x, y, z are scalars; the gate code on the slide is sketched below. (slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)
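The gate code on the slide is not reproduced in this transcript; a minimal, runnable version of the same idea is:

```python
class MultiplyGate(object):
    """A single multiply gate z = x * y with scalar inputs."""

    def forward(self, x, y):
        self.x = x           # remember the inputs; they are the local gradients
        self.y = y
        return x * y

    def backward(self, dz):
        dx = self.y * dz     # dz/dx * dL/dz  (chain rule)
        dy = self.x * dz     # dz/dy * dL/dz
        return dx, dy

# usage: run forward once, then push the upstream gradient back through
gate = MultiplyGate()
z = gate.forward(-2.0, 3.0)      # z = -6.0
dx, dy = gate.backward(1.0)      # dx = 3.0, dy = -2.0
```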