CS 7643: Deep Learning Topics: – Computational Graphs – Notation + example – Computing Gradients – Forward mode vs Reverse mode AD Dhruv Batra Georgia Tech
Administrativia • HW1 Released – Due: 09/22 • PS1 Solutions – Coming soon (C) Dhruv Batra 2
Project • Goal – Chance to try Deep Learning – Combine with other classes / research / credits / anything • You have our blanket permission • Extra credit for shooting for a publication – Encouraged to apply to your research (computer vision, NLP, robotics,…) – Must be done this semester. • Main categories – Application/Survey • Compare a bunch of existing algorithms on a new application domain of your interest – Formulation/Development • Formulate a new model or algorithm for a new or old problem – Theory • Theoretically analyze an existing algorithm (C) Dhruv Batra 3
Administrativia • Project Teams Google Doc – https://docs.google.com/spreadsheets/d/1AaXY0JE4lAbHvoDaWlc9zsmfKMyuGS39JAn9dpeXhhQ/edit#gid=0 – Project Title – 1-3 sentence project summary TL;DR – Team member names + GT IDs (C) Dhruv Batra 4
Recap of last time (C) Dhruv Batra 5
How do we compute gradients? • Manual Differentiation • Symbolic Differentiation • Numerical Differentiation • Automatic Differentiation – Forward mode AD – Reverse mode AD • aka “backprop” (C) Dhruv Batra 6
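To contrast these options concretely, here is a minimal sketch (assuming Python/NumPy; the function f and step size h are illustrative choices, not from the slides) that checks a manually derived gradient against a central-difference numerical gradient — the standard "gradient check" used to validate analytic or AD gradients:

```python
import numpy as np

def f(x):
    # Illustrative scalar function of a vector input
    return np.sum(x ** 2) + np.sin(x[0])

def grad_f(x):
    # Manually derived (analytic) gradient of f
    g = 2 * x
    g[0] += np.cos(x[0])
    return g

def numerical_grad(f, x, h=1e-5):
    # Central differences: one pair of evaluations per dimension,
    # slow and approximate, but useful for checking other gradients.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.random.randn(5)
print(np.max(np.abs(grad_f(x) - numerical_grad(f, x))))  # should be tiny (~1e-10)
```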
Computational Graph Any DAG of differentiable modules is allowed! (C) Dhruv Batra 7 Slide Credit: Marc'Aurelio Ranzato
Directed Acyclic Graphs (DAGs) • Exactly what the name suggests – Directed edges – No (directed) cycles – Underlying undirected cycles okay (C) Dhruv Batra 8
Directed Acyclic Graphs (DAGs) • Concept – Topological Ordering (C) Dhruv Batra 9
Directed Acyclic Graphs (DAGs) (C) Dhruv Batra 10
Computational Graphs • Notation #1: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ (C) Dhruv Batra 11
Computational Graphs • Notation #2: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ (C) Dhruv Batra 12
Example: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$, drawn as a computational graph with input nodes $x_1, x_2$ and operation nodes $*$, $\sin(\cdot)$, $+$ (C) Dhruv Batra 13
Logistic Regression as a Cascade Given a library of simple functions, compose them into a complicated function: $-\log\left(\dfrac{1}{1 + e^{-w^\top x}}\right)$ (C) Dhruv Batra 14 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
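To make the cascade concrete, here is a minimal sketch (Python/NumPy assumed; the names dot, sigmoid, neg_log and the sample values of w and x are illustrative, not from the slides) that builds the logistic loss above out of three simple functions:

```python
import numpy as np

# Library of simple functions
def dot(w, x):    return np.dot(w, x)                 # u = w^T x
def sigmoid(u):   return 1.0 / (1.0 + np.exp(-u))     # p = 1 / (1 + e^{-u})
def neg_log(p):   return -np.log(p)                   # L = -log p

# Cascade: compose simple functions into the complicated loss
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, 0.5])
loss = neg_log(sigmoid(dot(w, x)))
print(loss)
```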
Forward mode vs Reverse Mode • Key Computations (C) Dhruv Batra 15
Forward mode AD 16
Reverse mode AD 17
Example: Forward mode AD $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$, drawn as a computational graph with input nodes $x_1, x_2$ and operation nodes $*$, $\sin(\cdot)$, $+$ (C) Dhruv Batra 18
Example: Forward mode AD $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ with intermediates $w_1 = \sin(x_1)$, $w_2 = x_1 x_2$, $w_3 = w_1 + w_2$. Tangents propagate forward along the graph: $\dot{w}_1 = \cos(x_1)\,\dot{x}_1$, $\quad \dot{w}_2 = \dot{x}_1 x_2 + x_1 \dot{x}_2$, $\quad \dot{w}_3 = \dot{w}_1 + \dot{w}_2$ (C) Dhruv Batra 19-20
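A minimal sketch of forward-mode AD for this example (Python assumed; the Dual class name is an illustrative choice): each value carries its tangent, and every primitive operation updates both the value and the tangent.

```python
import math

class Dual:
    """A value paired with its tangent (directional derivative)."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __mul__(self, other):
        # Product rule: d(uv) = u' v + u v'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

def sin(u):
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

# f(x1, x2) = x1*x2 + sin(x1); seed x1_dot = 1, x2_dot = 0 to get df/dx1
x1, x2 = Dual(2.0, 1.0), Dual(3.0, 0.0)
f = x1 * x2 + sin(x1)
print(f.val, f.dot)   # f.dot == x2 + cos(x1) = 3 + cos(2)
```

Note that one forward sweep yields the derivative with respect to a single seeded input; getting the full gradient requires one sweep per input.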
Example: Reverse mode AD $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$, drawn as a computational graph with input nodes $x_1, x_2$ and operation nodes $*$, $\sin(\cdot)$, $+$ (C) Dhruv Batra 21
Example: Reverse mode AD $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ with intermediates $w_1 = \sin(x_1)$, $w_2 = x_1 x_2$, $w_3 = w_1 + w_2$. Adjoints propagate backward along the graph: $\bar{w}_3 = 1$, $\quad \bar{w}_1 = \bar{w}_3$, $\quad \bar{w}_2 = \bar{w}_3$, $\quad \bar{x}_1 = \bar{w}_1 \cos(x_1) + \bar{w}_2 x_2$, $\quad \bar{x}_2 = \bar{w}_2 x_1$ (C) Dhruv Batra 22
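A minimal sketch of reverse mode for the same function (Python assumed; variable names are illustrative): a forward pass stores the intermediates, then adjoints are pushed back in reverse topological order, accumulating where a variable fans out (here $x_1$, which feeds both the $\sin$ node and the $*$ node).

```python
import math

x1, x2 = 2.0, 3.0

# Forward pass: evaluate and store intermediates
w1 = math.sin(x1)
w2 = x1 * x2
w3 = w1 + w2                        # f(x1, x2)

# Reverse pass: adjoints w_bar = df/dw, in reverse topological order
w3_bar = 1.0
w1_bar = w3_bar                     # + gate distributes the gradient
w2_bar = w3_bar
x1_bar = w1_bar * math.cos(x1)      # contribution through sin
x1_bar += w2_bar * x2               # contribution through *; gradients add at branches
x2_bar = w2_bar * x1

print(w3, x1_bar, x2_bar)           # x1_bar == x2 + cos(x1), x2_bar == x1
```

One backward sweep produces the full gradient with respect to all inputs, which is why reverse mode (backprop) is the default when a scalar loss depends on many parameters.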
Forward Pass vs Forward mode AD vs Reverse Mode AD $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$. Forward mode: $\dot{w}_1 = \cos(x_1)\,\dot{x}_1$, $\dot{w}_2 = \dot{x}_1 x_2 + x_1 \dot{x}_2$, $\dot{w}_3 = \dot{w}_1 + \dot{w}_2$. Reverse mode: $\bar{w}_3 = 1$, $\bar{w}_1 = \bar{w}_3$, $\bar{w}_2 = \bar{w}_3$, $\bar{x}_1 = \bar{w}_1 \cos(x_1) + \bar{w}_2 x_2$, $\bar{x}_2 = \bar{w}_2 x_1$ (C) Dhruv Batra 23
Forward mode vs Reverse Mode • What are the differences? • Which one is more memory efficient (less storage)? – Forward or backward? (C) Dhruv Batra 24
Forward mode vs Reverse Mode • What are the differences? • Which one is more memory efficient (less storage)? – Forward or backward? • Which one is faster to compute? – Forward or backward? (C) Dhruv Batra 25
Plan for Today • (Finish) Computing Gradients – Forward mode vs Reverse mode AD – Patterns in backprop – Backprop in FC+ReLU NNs • Convolutional Neural Networks (C) Dhruv Batra 26
Patterns in backward flow Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Patterns in backward flow add gate: gradient distributor Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Patterns in backward flow add gate: gradient distributor Q: What is a max gate? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Patterns in backward flow add gate: gradient distributor max gate: gradient router Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Patterns in backward flow add gate: gradient distributor max gate: gradient router Q: What is a mul gate? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Patterns in backward flow add gate: gradient distributor max gate: gradient router mul gate: gradient switcher Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Gradients add at branches Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Duality in Fprop and Bprop: a SUM (+) node in FPROP becomes a COPY in BPROP, and a COPY (branch) in FPROP becomes a SUM (+) in BPROP (C) Dhruv Batra 35
Modularized implementation: forward / backward API Graph (or Net) object (rough pseudo code) 36 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
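The slide's pseudocode is not reproduced in this text; the following is a minimal sketch of what such a Graph/Net object might look like (Python assumed; the names Net and gates are illustrative, not CS 231n's exact code):

```python
class Net:
    def __init__(self, gates):
        # gates: list of node objects, already in topological order
        self.gates = gates

    def forward(self, x):
        for gate in self.gates:              # visit nodes in topological order
            x = gate.forward(x)
        return x                             # final output (e.g. the loss)

    def backward(self):
        grad = 1.0                           # d(loss)/d(loss)
        for gate in reversed(self.gates):    # reverse topological order
            grad = gate.backward(grad)       # chain rule through each gate
        return grad                          # d(loss)/d(input)
```

The point of the API is that each gate only needs to know its local forward computation and its local gradient; the graph object stitches them together.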
Modularized implementation: forward / backward API Example gate: $z = x * y$ (x, y, z are scalars) 37-38 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
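A minimal sketch of such a gate (Python assumed; the class name MultiplyGate is illustrative): forward caches its inputs, since they are exactly the local gradients needed later, and backward multiplies them by the upstream gradient.

```python
class MultiplyGate:
    def forward(self, x, y):
        # Cache inputs: dz/dx = y and dz/dy = x are needed in backward
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Chain rule: upstream gradient dz times local gradients
        dx = dz * self.y     # dL/dx = dL/dz * dz/dx
        dy = dz * self.x     # dL/dy = dL/dz * dz/dy
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)
dx, dy = gate.backward(1.0)   # dx = -4.0, dy = 3.0 (the "gradient switcher" pattern)
```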
Example: Caffe layers Caffe is licensed under BSD 2-Clause 39 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Caffe Sigmoid Layer: the backward pass multiplies the local sigmoid derivative by top_diff (chain rule) Caffe is licensed under BSD 2-Clause 40 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
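Caffe's actual layer is C++; purely as an illustration of the same structure, here is a sketch (Python/NumPy assumed) of what the sigmoid layer's forward and backward compute, with the "bottom" gradient obtained by multiplying the local derivative $\sigma(x)(1 - \sigma(x))$ by top_diff:

```python
import numpy as np

class SigmoidLayer:
    def forward(self, bottom):
        # top = sigmoid(bottom), cached for the backward pass
        self.top = 1.0 / (1.0 + np.exp(-bottom))
        return self.top

    def backward(self, top_diff):
        # Local derivative of sigmoid is top * (1 - top);
        # chain rule: multiply elementwise by the upstream gradient top_diff
        return top_diff * self.top * (1.0 - self.top)

layer = SigmoidLayer()
y = layer.forward(np.array([-1.0, 0.0, 2.0]))
dx = layer.backward(np.ones(3))
```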
Key Computation in DL: Forward-Prop (C) Dhruv Batra 43 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Key Computation in DL: Back-Prop (C) Dhruv Batra 44 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Jacobian of ReLU f(x) = max(0, x), applied elementwise to a 4096-d input vector, producing a 4096-d output vector. Q: what is the size of the Jacobian matrix? [4096 x 4096!] 45-47 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Jacobian of ReLU f(x) = max(0, x) (elementwise, 4096-d in, 4096-d out). Q: what is the size of the Jacobian matrix? [4096 x 4096!] In practice we process an entire minibatch (e.g. 100 examples) at one time, so the Jacobian would technically be a [409,600 x 409,600] matrix :\ 48 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Jacobian of ReLU f(x) = max(0, x) (elementwise, 4096-d in, 4096-d out). Q: what is the size of the Jacobian matrix? [4096 x 4096!] Q2: what does it look like? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
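Because ReLU acts elementwise, its Jacobian is diagonal: entry (i, i) is 1 where x_i > 0 and 0 otherwise, so in practice nobody forms the full matrix. A minimal sketch (Python/NumPy assumed, small dimension for readability):

```python
import numpy as np

x = np.array([1.5, -2.0, 0.3, -0.7])
y = np.maximum(0, x)                 # ReLU forward

# Full Jacobian (diagonal: dy_i/dx_j = 1 if i == j and x_i > 0, else 0)
J = np.diag((x > 0).astype(float))

# In practice, backprop never builds J; it just masks the upstream gradient
upstream = np.array([0.1, 0.2, 0.3, 0.4])
grad_x_full = J @ upstream           # explicit Jacobian-vector product
grad_x_fast = upstream * (x > 0)     # equivalent elementwise masking
assert np.allclose(grad_x_full, grad_x_fast)
```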
Jacobians of FC-Layer (C) Dhruv Batra 50
Jacobians of FC-Layer (C) Dhruv Batra 51
Jacobians of FC-Layer (C) Dhruv Batra 52
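The derivations on these slides are not reproduced in this text; as a sketch of the standard result, for a fully connected layer $y = Wx + b$ the Jacobian $\partial y / \partial x$ is simply $W$, so the backward pass computes $\partial L/\partial x = W^\top (\partial L/\partial y)$ and $\partial L/\partial W = (\partial L/\partial y)\,x^\top$. In code (Python/NumPy assumed, with illustrative shapes):

```python
import numpy as np

D_in, D_out = 4, 3
W = np.random.randn(D_out, D_in)
b = np.random.randn(D_out)
x = np.random.randn(D_in)

y = W @ x + b                    # forward: y = Wx + b

dy = np.random.randn(D_out)      # upstream gradient dL/dy
dx = W.T @ dy                    # dL/dx: Jacobian dy/dx is W, so dx = W^T dy
dW = np.outer(dy, x)             # dL/dW = dy x^T
db = dy                          # dL/db = dy
```

Again, the full Jacobians are never materialized; only Jacobian-vector products are computed.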
Convolutional Neural Networks (without the brain stuff) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Fully Connected Layer Example: 200x200 image 40K hidden units ~2B parameters !!! - Spatial correlation is local - Waste of resources, and we do not have enough training samples anyway... 54 Slide Credit: Marc'Aurelio Ranzato
Locally Connected Layer Example: 200x200 image 40K hidden units Filter size: 10x10 4M parameters Note: This parameterization is good when input image is registered (e.g., face recognition). 55 Slide Credit: Marc'Aurelio Ranzato
Locally Connected Layer STATIONARITY? Statistics are similar at different locations Example: 200x200 image 40K hidden units Filter size: 10x10 4M parameters Note: This parameterization is good when input image is registered (e.g., face recognition). 56 Slide Credit: Marc'Aurelio Ranzato
Convolutional Layer Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels 57 Slide Credit: Marc'Aurelio Ranzato
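To make the parameter counts on the last few slides concrete, here is a small back-of-the-envelope calculation (Python; the numbers follow the running 200x200 image / 40K hidden units / 10x10 filter example, and assuming one filter per hidden unit in the locally connected case):

```python
H = W = 200          # input image size
hidden = 40_000      # number of hidden units
k = 10               # filter size (10x10)

fully_connected   = (H * W) * hidden   # every unit connected to every pixel
locally_connected = (k * k) * hidden   # each unit has its own 10x10 filter
convolutional     = (k * k)            # one 10x10 filter shared across locations

print(fully_connected)    # 1,600,000,000  (~2B on the slide)
print(locally_connected)  # 4,000,000      (4M on the slide)
print(convolutional)      # 100 weights per learned kernel
```

Weight sharing is what collapses millions of parameters down to a handful per kernel, which is the whole point of the convolutional layer.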
Convolutions for mathematicians (C) Dhruv Batra 58
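The figure on this slide is not reproduced in this text; for reference, the standard discrete 2D convolution the title alludes to is the following (standard notation, not taken from the slide), where deep learning libraries typically implement the unflipped cross-correlation variant instead:

```latex
% Discrete 2D convolution of an image x with a kernel k
(x * k)[i, j] \;=\; \sum_{m} \sum_{n} x[m, n]\, k[i - m,\; j - n]

% Cross-correlation (what most "conv" layers actually compute)
(x \star k)[i, j] \;=\; \sum_{m} \sum_{n} x[i + m,\; j + n]\, k[m, n]
```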