CS 4803 / 7643: Deep Learning


  1. CS 4803 / 7643: Deep Learning Topics: – Specifying Layers – Forward & Backward autodifferentiation – (Beginning of) Convolutional neural networks Zsolt Kira Georgia Tech

  2. Administrivia • PS0 released – mean of 20.7 – standard deviation of 3.4 – median of 21 – max of 25 – See me if you did not pass • PS1/HW1 out • Start thinking about project topics/teams – More details on project next time (C) Dhruv Batra & Zsolt Kira 2

  3. Recap from last time (C) Dhruv Batra & Zsolt Kira 3

  4. Gradient Descent Pseudocode for i in {0, …, num_epochs}: for x, y in data: … Some design decisions: • How many examples to use to calculate the gradient per iteration? • What should alpha (the learning rate) be? • Should it be constant throughout? • How many epochs to run?
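
A minimal sketch of this training loop in Python, assuming NumPy and a hypothetical loss_and_grad(W, x, y) helper that returns the mini-batch loss and gradient; batch_size, alpha, and num_epochs are exactly the design decisions listed above.

```python
import numpy as np

def sgd(W, data, loss_and_grad, alpha=1e-3, batch_size=32, num_epochs=10):
    """Mini-batch SGD sketch. `data` is a list of (x, y) pairs and
    `loss_and_grad` is a hypothetical helper returning (loss, dL/dW)."""
    for epoch in range(num_epochs):                  # how many epochs to run?
        np.random.shuffle(data)                      # new example order each epoch
        for i in range(0, len(data), batch_size):    # examples per gradient step?
            batch = data[i:i + batch_size]
            x = np.stack([ex[0] for ex in batch])
            y = np.stack([ex[1] for ex in batch])
            loss, grad = loss_and_grad(W, x, y)      # forward + backward pass
            W = W - alpha * grad                     # alpha: the learning rate
    return W
```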

  5. Computational Graph Any DAG of differentiable modules is allowed! (C) Dhruv Batra & Zsolt Kira 5 Slide Credit: Marc'Aurelio Ranzato

  6. Key Computation: Back-Prop (C) Dhruv Batra & Zsolt Kira 6 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

  7. Neural Network Training • Step 1: Compute Loss on mini-batch [F-Pass] (C) Dhruv Batra & Zsolt Kira 7 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

  8. Neural Network Training • Step 1: Compute Loss on mini-batch [F-Pass] • Step 2: Compute gradients wrt parameters [B-Pass] (C) Dhruv Batra & Zsolt Kira 8 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

  9. General Flow Graphs “Deep Learning” book, Bengio


  12. Jacobian of ReLU g(x) = max(0, x), applied elementwise: 4096-d input vector → 4096-d output vector Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  13. Jacobian of ReLU g(x) = max(0, x), applied elementwise: 4096-d input vector → 4096-d output vector Q: what is the size of the Jacobian matrix? 13 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  14. Jacobian of ReLU g(x) = max(0, x), applied elementwise: 4096-d input vector → 4096-d output vector Q: what is the size of the Jacobian matrix? [4096 x 4096!] 14 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  15. Jacobian of ReLU g(x) = max(0, x), applied elementwise: 4096-d input vector → 4096-d output vector Q: what is the size of the Jacobian matrix? [4096 x 4096!] Q2: what does it look like? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
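
To make the answer to Q2 concrete: the Jacobian of an elementwise ReLU is diagonal (1 where the input is positive, 0 elsewhere), so backprop never materializes the 4096 x 4096 matrix. A small NumPy sketch:

```python
import numpy as np

x = np.random.randn(4096)            # 4096-d input vector
y = np.maximum(0, x)                 # g(x) = max(0, x), elementwise

# The full Jacobian is diagonal: dy_i/dx_j = 1 if i == j and x_i > 0, else 0.
J = np.diag((x > 0).astype(x.dtype))     # 4096 x 4096, but almost entirely zeros

# In practice backprop never builds J; it applies the mask elementwise:
upstream = np.random.randn(4096)         # dL/dy flowing in from above
dx_dense = J @ upstream                  # explicit Jacobian-vector product
dx_fast = upstream * (x > 0)             # equivalent elementwise computation
assert np.allclose(dx_dense, dx_fast)
```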

  16. Plan for Today • Specifying Layers • Forward & Backward auto-differentiation • (Beginning of) Convolutional neural networks (C) Dhruv Batra & Zsolt Kira 17

  17. Deep Learning = Differentiable Programming • Computation = Graph – Input = Data + Parameters – Output = Loss – Scheduling = Topological ordering • What do we need to do? – Generic code for representing the graph of modules – Specify modules (both forward and backward function) (C) Dhruv Batra & Zsolt Kira 18

  18. Modularized implementation: forward / backward API Graph (or Net) object (rough pseudocode) 19 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
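
One hedged reconstruction of that rough pseudocode in Python; the method and attribute names (nodes, forward, backward, output, grad) are illustrative, not the actual CS 231n code, and the node list is assumed to already be topologically sorted.

```python
class Net:
    """Sketch of a graph (or net) object with a forward / backward API."""
    def __init__(self, nodes):
        self.nodes = nodes                    # modules, assumed topologically sorted

    def forward(self):
        for node in self.nodes:               # scheduling = topological ordering
            node.forward()
        return self.nodes[-1].output          # the last node produces the loss

    def backward(self):
        for node in reversed(self.nodes):     # visit modules in reverse order
            node.backward()                   # each module computes grads w.r.t. its inputs
        return [node.grad for node in self.nodes]
```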

  19. Modularized implementation: forward / backward API x, y → * → z (x, y, z are scalars) 20 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  20. Modularized implementation: forward / backward API x, y → * → z (x, y, z are scalars) 21 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
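
For the scalar multiply gate above, a minimal sketch of the per-module forward/backward contract: forward caches its inputs, backward multiplies the upstream gradient dz = dL/dz by the local gradients.

```python
class MultiplyGate:
    """z = x * y for scalars; backward applies the chain rule:
    dL/dx = dL/dz * dz/dx = dz * y,  dL/dy = dz * x."""
    def forward(self, x, y):
        self.x, self.y = x, y        # cache inputs needed by backward
        return x * y

    def backward(self, dz):          # dz = upstream gradient dL/dz
        dx = dz * self.y             # local gradient dz/dx = y
        dy = dz * self.x             # local gradient dz/dy = x
        return dx, dy

# Usage: forward, then backward with an upstream gradient of 1.0
gate = MultiplyGate()
z = gate.forward(3.0, -4.0)          # z = -12.0
dx, dy = gate.backward(1.0)          # dx = -4.0, dy = 3.0
```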

  21. Example: Caffe layers Caffe is licensed under BSD 2-Clause 22 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  22. Caffe Sigmoid Layer * top_diff (chain rule) Caffe is licensed under BSD 2-Clause 23 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
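
Caffe's SigmoidLayer is C++, but the slide's point is just the chain rule: backward scales top_diff by the local derivative σ(x)(1 − σ(x)) = y(1 − y). A Python sketch of the same forward/backward pair (not Caffe's actual code):

```python
import numpy as np

class SigmoidLayer:
    def forward(self, bottom):
        # Caffe-style naming: "bottom" is the input blob, "top" the output blob.
        self.top = 1.0 / (1.0 + np.exp(-bottom))
        return self.top

    def backward(self, top_diff):
        # chain rule: bottom_diff = top_diff * d(sigmoid)/dx = top_diff * y * (1 - y)
        return top_diff * self.top * (1.0 - self.top)
```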

  23. Deep Learning = Differentiable Programming • Computation = Graph – Input = Data + Parameters – Output = Loss – Scheduling = Topological ordering • Auto-Diff – A family of algorithms for implementing chain-rule on computation graphs (C) Dhruv Batra & Zsolt Kira 24

  24. Forward mode vs Reverse Mode • Key Computations (C) Dhruv Batra & Zsolt Kira 25

  25. Forward mode AD 26

  26. Reverse mode AD 27

  27. Example: Forward mode AD + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 28

  28. Example: Forward mode AD + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 29

  29. Example: Forward mode AD + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 30

  30. Example: Forward mode AD Q: What happens if there’s another input variable x3? + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 31

  31. Example: Forward mode AD Q: What happens if there’s another input variable x3? A: more sophisticated graph; d “forward props” for d variables + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 32

  32. Example: Forward mode AD Q: What happens if there’s another output variable f2? + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 33

  33. Example: Forward mode AD Q: What happens if there’s another output variable f2? A: more sophisticated graph; single “forward prop” + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 34
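
The exact function on these slides isn't recoverable from the transcript; assuming for concreteness f(x1, x2) = sin(x1) + x1·x2, which matches the +, sin( ), and * nodes shown, forward-mode AD carries a (value, derivative) pair through the graph and needs one pass per input variable:

```python
import math

def f_forward(x1, x2, dx1, dx2):
    """Forward-mode AD sketch for the assumed f(x1, x2) = sin(x1) + x1 * x2.
    Each intermediate carries (value, derivative in the seeded direction)."""
    a, da = math.sin(x1), math.cos(x1) * dx1      # sin node
    b, db = x1 * x2, dx1 * x2 + x1 * dx2          # * node (product rule)
    f, df = a + b, da + db                        # + node
    return f, df

# One forward pass per input variable: seed dx1=1 for df/dx1, dx2=1 for df/dx2.
_, df_dx1 = f_forward(0.5, 2.0, 1.0, 0.0)   # cos(0.5) + 2.0
_, df_dx2 = f_forward(0.5, 2.0, 0.0, 1.0)   # 0.5
```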

  34. Example: Reverse mode AD + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 35

  35. Example: Reverse mode AD + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 36

  36. Gradients add at branches + Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  37. Example: Reverse mode AD Q: What happens if there’s another input variable x3? + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 38

  38. Example: Reverse mode AD Q: What happens if there’s another input variable x3? A: more sophisticated graph; single “backward prop” + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 39

  39. Example: Reverse mode AD Q: What happens if there’s another output variable f2? + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 40

  40. Example: Reverse mode AD Q: What happens if there’s another output variable f2? A: more sophisticated graph; c “backward props” for c vars + sin( ) * x1 x2 (C) Dhruv Batra & Zsolt Kira 41
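
Reverse mode on the same assumed f(x1, x2) = sin(x1) + x1·x2: one forward pass stores intermediates, one backward pass yields all input gradients, and because x1 feeds both the sin( ) and * nodes its gradient contributions add, as on the "gradients add at branches" slide.

```python
import math

def f_reverse(x1, x2):
    """Reverse-mode AD sketch for the assumed f(x1, x2) = sin(x1) + x1 * x2."""
    # forward pass: store intermediates
    a = math.sin(x1)          # sin node
    b = x1 * x2               # * node
    f = a + b                 # + node

    # backward pass, seeded with df/df = 1
    df = 1.0
    da = df * 1.0             # + node routes the gradient to both inputs
    db = df * 1.0
    dx1 = da * math.cos(x1)   # through the sin node
    dx1 += db * x2            # x1 also feeds the * node: gradients ADD at branches
    dx2 = db * x1
    return f, dx1, dx2

f, dx1, dx2 = f_reverse(0.5, 2.0)    # dx1 = cos(0.5) + 2.0, dx2 = 0.5
```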

  41. Forward mode vs Reverse Mode • x → Graph → L • Intuition of Jacobian (C) Dhruv Batra & Zsolt Kira 42

  42. Forward mode vs Reverse Mode • What are the differences? • Which one is faster to compute? – Forward or backward? (C) Dhruv Batra & Zsolt Kira 43

  43. Forward mode vs Reverse Mode • What are the differences? • Which one is faster to compute? – Forward or backward? • Which one is more memory efficient (less storage)? – Forward or backward? [the example graph, + sin( ) * x1 x2, shown twice] (C) Dhruv Batra & Zsolt Kira 44

  44. Practical Note 2: Software Frameworks A few weeks ago! +Keras Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  45. PyTorch
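
PyTorch's autograd is exactly this reverse-mode scheme. A minimal example, reusing the assumed f(x1, x2) = sin(x1) + x1·x2 from the AD slides:

```python
import torch

x1 = torch.tensor(0.5, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)

f = torch.sin(x1) + x1 * x2   # builds the computation graph on the fly
f.backward()                  # reverse-mode AD: one backward pass

print(x1.grad)                # cos(0.5) + 2.0
print(x2.grad)                # 0.5
```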

  46. Plan for Today (Cont.) • Specifying Layers • Forward & Backward auto-differentiation • (Beginning of) Convolutional neural networks – What is a convolution? – FC vs Conv Layers (C) Dhruv Batra & Zsolt Kira 48

  47. Recall: Linear Classifier f(x, W) = Wx + b. Input image: array of 32x32x3 numbers (3072 numbers total), stretched into x (3072x1). W: 10x3072 parameters or weights; b: 10x1. Output f(x, W): 10x1, i.e., 10 numbers giving class scores. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  48. Example with an image with 4 pixels, and 3 classes (cat/dog/ship). Stretch pixels into a column: x = [56, 231, 24, 2]. W = [[0.2, -0.5, 0.1, 2.0], [1.5, 1.3, 2.1, 0.0], [0.0, 0.25, 0.2, -0.3]], b = [1.1, 3.2, -1.2]. Scores Wx + b: -96.8 (cat), 437.9 (dog), 61.95 (ship). 50 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
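
A NumPy check of the numbers above (nothing beyond the slide's values is assumed). Note that with the listed bias of -1.2 the ship score evaluates to 60.75; the slide's 61.95 matches the Wx term for the ship row before the bias is added.

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1,  2.0],    # cat row
              [1.5,  1.3, 2.1,  0.0],    # dog row
              [0.0, 0.25, 0.2, -0.3]])   # ship row
x = np.array([56.0, 231.0, 24.0, 2.0])   # 4 pixels stretched into a column
b = np.array([1.1, 3.2, -1.2])

scores = W @ x + b
# cat = -96.8, dog = 437.9, ship = 60.75 (the slide's 61.95 is the ship row of
# W @ x before the -1.2 bias is added)
```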

  49. Recall: (Fully-Connected) Neural networks. (Before) Linear score function; (Now) 2-layer Neural Network: x (3072-d) → W1 → h (100-d) → W2 → s (10-d) 51 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
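
A shape-level sketch of the two layers (3072 → 100 → 10), assuming max(0, ·), i.e., ReLU, as the elementwise nonlinearity as in the CS 231n example; biases are omitted as on the slide.

```python
import numpy as np

x  = np.random.randn(3072)          # flattened 32x32x3 image
W1 = np.random.randn(100, 3072)     # first layer weights
W2 = np.random.randn(10, 100)       # second layer weights

h = np.maximum(0, W1 @ x)           # hidden layer: h = max(0, W1 x), 100-d
s = W2 @ h                          # class scores: s = W2 h, 10-d
```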

  50. Convolutional Neural Networks (without the brain stuff) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  51. Fully Connected Layer Example: 200x200 image, 40K hidden units → ~2B parameters!!! - Spatial correlation is local - Waste of resources + we don't have enough training samples anyway.. 53 Slide Credit: Marc'Aurelio Ranzato

  52. Locally Connected Layer Example: 200x200 image, 40K hidden units, “filter” size: 10x10 → 4M parameters. Note: This parameterization is good when the input image is registered (e.g., face recognition). 54 Slide Credit: Marc'Aurelio Ranzato

  53. Locally Connected Layer STATIONARITY? Statistics similar at all locations 55 Slide Credit: Marc'Aurelio Ranzato

  54. Convolutional Layer Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels 56 Slide Credit: Marc'Aurelio Ranzato
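
A back-of-the-envelope check of the parameter counts on the last three slides (fully connected vs. locally connected vs. convolutional), assuming a single-channel 200x200 input, 40K hidden units, and 10x10 filters; the convolutional count is per learned filter, and the slide does not specify how many filters are used.

```python
image = 200 * 200          # 40,000 input pixels (single channel assumed)
hidden = 40_000            # 40K hidden units
filt = 10 * 10             # 10x10 "filter"-sized local window

fully_connected = image * hidden        # 1.6e9  (~2B parameters on the slide)
locally_connected = hidden * filt       # 4e6    (each unit sees only a 10x10 patch)
convolutional = filt                    # 100 weights per learned filter, shared
                                        # across all spatial locations
print(fully_connected, locally_connected, convolutional)
```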

  55. What filter to use?

  56. Discrete Convolution • Very similar to correlation, but associative [figures: 1D Convolution, 2D Convolution, Filter]
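
A quick NumPy check of the "very similar to correlation" point: 1D discrete convolution equals correlation with the filter flipped (the input and filter below are made up for illustration).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # 1D input
w = np.array([1.0, 0.0, -1.0])            # 1D filter

conv = np.convolve(x, w, mode='valid')        # convolution flips the filter
corr = np.correlate(x, w, mode='valid')       # correlation does not
assert np.allclose(conv, np.correlate(x, w[::-1], mode='valid'))
```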

  57. A note on sizes: Input N x N, Filter m x m, Output (N-m+1) x (N-m+1). MATLAB to the rescue! • conv2(x, w, ‘valid’)
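
A Python analogue of the MATLAB call, assuming SciPy is available: scipy.signal.convolve2d with mode='valid' produces the (N-m+1) x (N-m+1) output described above.

```python
import numpy as np
from scipy.signal import convolve2d

N, m = 7, 3
x = np.random.randn(N, N)      # N x N input
w = np.random.randn(m, m)      # m x m filter

out = convolve2d(x, w, mode='valid')   # analogue of MATLAB conv2(x, w, 'valid')
assert out.shape == (N - m + 1, N - m + 1)
```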

  58. Convolutions! • Math vs. CS vs. programming viewpoints (C) Dhruv Batra & Zsolt Kira 60
