
Lecture 12: Computational Graph, Backpropagation. Aykut Erdem, March 2016 (PowerPoint presentation)



  1. Lecture 12: − Computational Graph − Backpropagation Aykut Erdem March 2016 Hacettepe University

  2. Administrative
 • Assignment 2 is due March 20, 2016!
 • Midterm exam on Thursday, March 24, 2016
 − You are responsible for the material from the beginning of the course up to the end of this class
 − You may prepare and bring a one-page cheat sheet (A4 paper, both sides) to the exam.
 • Assignment 3 will be out soon!
 − It is due April 7, 2016
 − You will implement a 2-layer Neural Network

  3. Last time… Multilayer Perceptron
 • Layer representation: y_i = W_i x_i, x_{i+1} = σ(y_i)
 • (typically) iterate between a linear mapping Wx and a nonlinear function
 • Loss function l(y, y_i) to measure the quality of the estimate so far
 [figure: stacked layers x1 → W1 → x2 → W2 → x3 → W3 → x4 → W4 → y]
 slide by Alex Smola

  4. Last time… Forward Pass
 • Output of the network can be written as:
 h_j(x) = f(v_{j0} + Σ_{i=1}^{D} x_i v_{ji})
 o_k(x) = g(w_{k0} + Σ_{j=1}^{J} h_j(x) w_{kj})
 (j indexing the hidden units, k indexing the output units, D the number of inputs)
 • Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)
 σ(z) = 1 / (1 + exp(−z)),  tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)),  ReLU(z) = max(0, z)
 slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

  5. Last time… Forward Pass in Python
 • Example code for a forward pass of a 3-layer network in Python (see the sketch below)
 • Can be implemented efficiently using matrix operations
 • Example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?
 slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
 [http://cs231n.github.io/neural-networks-1/]
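The code on the slide was an image; here is a minimal sketch in the spirit of the cs231n note it links to, assuming sigmoid activations and randomly initialized weights and biases (the variable names are illustrative):

```python
import numpy as np

# weight shapes follow the slide: W1 is 4x3, W2 is 4x4
f = lambda z: 1.0 / (1.0 + np.exp(-z))             # sigmoid activation

x = np.random.randn(3, 1)                           # random 3-dimensional input (3x1)
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))    # first layer
W2, b2 = np.random.randn(4, 4), np.zeros((4, 1))    # second layer
W3, b3 = np.random.randn(1, 4), np.zeros((1, 1))    # output layer

h1 = f(np.dot(W1, x) + b1)     # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)    # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3      # output (1x1)
```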

  6. Today • Backpropagation and Neural Networks • Tips and Tricks 6

  7. Backpropagation and Neural Networks 7

  8. Recap: Loss function / Optimization
 TODO:
 1. Define a loss function that quantifies our unhappiness with the scores across the training data.
 2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)
 We defined a (linear) score function:
 [figure: the score function and example class scores for a few training images]
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
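The formula itself was part of the slide image; assuming it is the same linear score function used in the earlier lectures (with the bias folded into W), it reads:

```latex
s = f(x; W) = W x
```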

  9.–18. Softmax Classifier (Multinomial Logistic Regression)
 (worked out step by step on the slide images)
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
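The worked steps live only in the slide images; as a minimal sketch (not the slides' exact example), the softmax classifier's loss for a single example with unnormalized class scores s and correct class index y can be computed as:

```python
import numpy as np

def softmax_loss(s, y):
    """Cross-entropy loss of the softmax classifier for one example.
    s: 1D array of unnormalized class scores, y: index of the correct class."""
    s = s - np.max(s)                        # shift scores for numerical stability
    probs = np.exp(s) / np.sum(np.exp(s))    # softmax probabilities
    return -np.log(probs[y])                 # negative log-likelihood of the correct class

# example: three class scores, correct class is 2
print(softmax_loss(np.array([3.2, 5.1, -1.7]), y=2))
```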

  19. Optimization 19

  20. Gradient Descent
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  21. Mini-batch Gradient Descent
 • only use a small portion of the training set to compute the gradient
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  22. Mini-batch Gradient Descent
 • only use a small portion of the training set to compute the gradient
 • there are also fancier update formulas (momentum, Adagrad, RMSProp, Adam, …); a sketch of the plain version is given below
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
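The update loop shown on the slide was an image; below is a minimal, self-contained sketch of vanilla mini-batch gradient descent on a toy one-parameter least-squares problem (the data, batch size of 256, and step size are made up for illustration):

```python
import numpy as np

# toy data: y = 3*x + noise; we fit a single weight w with mini-batch gradient descent
rng = np.random.default_rng(0)
X = rng.standard_normal(10_000)
Y = 3.0 * X + 0.1 * rng.standard_normal(10_000)

w, step_size = 0.0, 0.1
for _ in range(200):
    idx = rng.integers(0, len(X), size=256)      # sample a mini-batch of 256 examples
    xb, yb = X[idx], Y[idx]
    grad = np.mean(2.0 * (w * xb - yb) * xb)     # gradient of the mean squared error
    w += -step_size * grad                       # parameter update
print(w)   # converges to roughly 3.0
```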

  23. The effects of different update formulas
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson (image credits to Alec Radford)

  24.–25. [figure-only slides]
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  26. Back-propagation 26

  27. Computational Graph
 [figure: x and W feed a multiply node (*) producing the scores s; s feeds the hinge loss; the regularization term R is added (+) to give the total loss L]
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
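Reading the graph bottom-up, and assuming the hinge loss is the multiclass SVM loss from the earlier lectures with regularizer R(W) and weight λ, the graph computes:

```latex
s = W x, \qquad
L = \frac{1}{N} \sum_{i} \sum_{j \neq y_i} \max\left(0,\; s_j - s_{y_i} + 1\right) + \lambda\, R(W)
```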

  28. Convolutional Network (AlexNet)
 [figure: computational graph from the input image and weights through the network to the loss]
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  29. Neural Turing Machine
 [figure: computational graph from the input tape to the loss]
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  30.–42. Worked example: e.g. x = -2, y = 5, z = -4
 Want: the gradients of the output with respect to x, y, and z, computed step by step with the chain rule (see the sketch below)
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
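The expression itself only appears in the slide images; a minimal sketch assuming the usual example f(x, y, z) = (x + y) * z, with the forward pass and the backward (chain-rule) pass written out:

```python
# forward pass (assumed example f(x, y, z) = (x + y) * z)
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# backward pass: apply the chain rule from the output back to the inputs
df_df = 1.0              # gradient of f with respect to itself
df_dz = q * df_df        # d(q*z)/dz = q             -> 3
df_dq = z * df_df        # d(q*z)/dq = z             -> -4
df_dx = 1.0 * df_dq      # d(x+y)/dx = 1, chain rule -> -4
df_dy = 1.0 * df_dq      # d(x+y)/dy = 1, chain rule -> -4
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```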

  43.–48. [figure: a single gate f in the circuit, its activations on the forward pass, its "local gradient", and the gradients flowing backward through it during backprop]
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
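In symbols, for a gate z = f(x, y) inside a larger circuit that computes a loss L, backprop multiplies the gradient arriving from above by the gate's local gradients (the chain rule):

```latex
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial x},
\qquad
\frac{\partial L}{\partial y} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial y}
```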

  49.–57. Another example: (a larger circuit, evaluated forward and then backpropagated gate by gate on the slide images)
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  58. Another example: (-1) * (-0.20) = 0.20
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  59. 59 Another example: slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  60. Another example: [local gradient] x [its gradient]
 [1] x [0.2] = 0.2
 [1] x [0.2] = 0.2 (both inputs!)
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  61. 61 Another example: slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  62. Another example: [local gradient] x [its gradient]
 x0: [2] x [0.2] = 0.4
 w0: [-1] x [0.2] = -0.2
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  63. Sigmoid function, sigmoid gate
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  64. Sigmoid function, sigmoid gate: the local gradient of σ is (1 − σ(x)) σ(x), so here (0.73) * (1 - 0.73) = 0.2
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
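A minimal sketch of the sigmoid gate as a single unit, using the fact that its local gradient is (1 − σ(x)) σ(x):

```python
import numpy as np

def sigmoid_forward(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_backward(out, dout):
    # local gradient (1 - sigma) * sigma, times the gradient flowing in from above
    return (1.0 - out) * out * dout

out = sigmoid_forward(1.0)             # ~0.73, the value on the slide
dx = sigmoid_backward(out, dout=1.0)   # (0.73) * (1 - 0.73) = 0.2
print(out, dx)
```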

  65. Patterns in backward flow • add gate: gradient distributor • max gate: gradient router • mul gate: gradient… “switcher”? slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson 65
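A minimal sketch of these three patterns (function names are illustrative), showing how each gate routes the incoming gradient dout to its two inputs:

```python
def add_backward(x, y, dout):
    # add gate: distributes the same gradient to both inputs
    return dout, dout

def max_backward(x, y, dout):
    # max gate: routes the full gradient only to the larger input
    return (dout, 0.0) if x >= y else (0.0, dout)

def mul_backward(x, y, dout):
    # mul gate: "switches" the inputs, scaling the gradient by the other operand
    return y * dout, x * dout

print(add_backward(3.0, -1.0, 2.0))  # (2.0, 2.0)
print(max_backward(3.0, -1.0, 2.0))  # (2.0, 0.0)
print(mul_backward(3.0, -1.0, 2.0))  # (-2.0, 6.0)
```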

  66. Gradients add at branches (+)
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

  67. Implementation: forward/backward API
 Graph (or Net) object. (Rough pseudo code; a sketch follows below.)
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
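The pseudo code on the slide is an image; a minimal sketch of what such a Graph/Net object might look like, assuming each gate object exposes forward() and backward() and the gates are stored in topological order:

```python
class Net:
    """Rough sketch of a Graph (or Net) object with a forward/backward API.
    Each gate is assumed to expose forward() and backward() and to know
    where to read its inputs and where to write its gradients."""

    def __init__(self, gates):
        self.gates = gates                     # gates in topologically sorted order

    def forward(self):
        for gate in self.gates:                # run the graph left to right
            gate.forward()
        return self.gates[-1].output           # the final gate outputs the loss

    def backward(self):
        for gate in reversed(self.gates):      # run the graph right to left
            gate.backward()                    # chain rule applied at each gate
```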

  68. Implementation: forward/backward API
 [figure: a single multiply gate z = x * y (x, y, z are scalars)]
 slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
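A minimal sketch of the multiply gate under this API (again a sketch, not the slide's exact code), caching its inputs on the forward pass so the backward pass can apply the chain rule:

```python
class MultiplyGate:
    def forward(self, x, y):
        # cache the inputs; they are needed for the local gradients
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # chain rule: [local gradient] x [gradient coming from above]
        dx = self.y * dz   # dz/dx = y
        dy = self.x * dz   # dz/dy = x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)      # z = -12.0
dx, dy = gate.backward(1.0)      # dx = -4.0, dy = 3.0
print(z, dx, dy)
```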
