CS 7643: Deep Learning Topics: – Computational Graphs – Notation + example – Computing Gradients – Forward mode vs Reverse mode AD Dhruv Batra Georgia Tech
Administrativia • HW1 Released – Due: 09/22 • PS1 Solutions – Coming soon (C) Dhruv Batra 2
Project • Goal – Chance to try Deep Learning – Combine with other classes / research / credits / anything • You have our blanket permission • Extra credit for shooting for a publication – Encouraged to apply to your research (computer vision, NLP, robotics,…) – Must be done this semester. • Main categories – Application/Survey • Compare a bunch of existing algorithms on a new application domain of your interest – Formulation/Development • Formulate a new model or algorithm for a new or old problem – Theory • Theoretically analyze an existing algorithm (C) Dhruv Batra 3
Administrativia • Project Teams Google Doc – https://docs.google.com/spreadsheets/d/1AaXY0JE4lAbHvoDaWlc9zsmfKMyuGS39JAn9dpeXhhQ/edit#gid=0 – Project Title – 1-3 sentence project summary TL;DR – Team member names + GT IDs (C) Dhruv Batra 4
Recap of last time (C) Dhruv Batra 5
How do we compute gradients? • Manual Differentiation • Symbolic Differentiation • Numerical Differentiation • Automatic Differentiation – Forward mode AD – Reverse mode AD • aka “backprop” (C) Dhruv Batra 6
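To contrast these options concretely, here is a minimal sketch (assuming Python/NumPy; the function f and step size h are illustrative choices, not from the slides) that checks a manually derived gradient against a central-difference numerical gradient — the standard "gradient check" used to validate analytic or AD gradients:

```python
import numpy as np

def f(x):
    # Illustrative scalar function of a vector input
    return np.sum(x ** 2) + np.sin(x[0])

def grad_f(x):
    # Manually derived (analytic) gradient of f
    g = 2 * x
    g[0] += np.cos(x[0])
    return g

def numerical_grad(f, x, h=1e-5):
    # Central differences: one pair of evaluations per dimension,
    # slow and approximate, but useful for checking other gradients.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.random.randn(5)
print(np.max(np.abs(grad_f(x) - numerical_grad(f, x))))  # should be tiny (~1e-10)
```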
Computational Graph Any DAG of differentiable modules is allowed! (C) Dhruv Batra 7 Slide Credit: Marc'Aurelio Ranzato
Directed Acyclic Graphs (DAGs) • Exactly what the name suggests – Directed edges – No (directed) cycles – Underlying undirected cycles okay (C) Dhruv Batra 8
Directed Acyclic Graphs (DAGs) • Concept – Topological Ordering (C) Dhruv Batra 9
Directed Acyclic Graphs (DAGs) (C) Dhruv Batra 10
Computational Graphs • Notation #1: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ (C) Dhruv Batra 11
Computational Graphs • Notation #2: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ (C) Dhruv Batra 12
Example: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$, drawn as a computational graph with input nodes $x_1, x_2$ and operation nodes $*$, $\sin(\cdot)$, $+$ (C) Dhruv Batra 13
Logistic Regression as a Cascade Given a library of simple functions, compose them into a complicated function: $-\log\left(\dfrac{1}{1 + e^{-w^\top x}}\right)$ (C) Dhruv Batra 14 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
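To make the cascade concrete, here is a minimal sketch (Python/NumPy assumed; the names dot, sigmoid, neg_log and the sample values of w and x are illustrative, not from the slides) that builds the logistic loss above out of three simple functions:

```python
import numpy as np

# Library of simple functions
def dot(w, x):    return np.dot(w, x)                 # u = w^T x
def sigmoid(u):   return 1.0 / (1.0 + np.exp(-u))     # p = 1 / (1 + e^{-u})
def neg_log(p):   return -np.log(p)                   # L = -log p

# Cascade: compose simple functions into the complicated loss
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, 0.5])
loss = neg_log(sigmoid(dot(w, x)))
print(loss)
```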
Forward mode vs Reverse Mode • Key Computations (C) Dhruv Batra 15
Forward mode AD 16
Reverse mode AD 17
Example: Forward mode AD $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$, drawn as a computational graph with input nodes $x_1, x_2$ and operation nodes $*$, $\sin(\cdot)$, $+$ (C) Dhruv Batra 18
Example: Forward mode AD $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ with intermediates $w_1 = \sin(x_1)$, $w_2 = x_1 x_2$, $w_3 = w_1 + w_2$. Tangents propagate forward along the graph: $\dot{w}_1 = \cos(x_1)\,\dot{x}_1$, $\quad \dot{w}_2 = \dot{x}_1 x_2 + x_1 \dot{x}_2$, $\quad \dot{w}_3 = \dot{w}_1 + \dot{w}_2$ (C) Dhruv Batra 19-20
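A minimal sketch of forward-mode AD for this example (Python assumed; the Dual class name is an illustrative choice): each value carries its tangent, and every primitive operation updates both the value and the tangent.

```python
import math

class Dual:
    """A value paired with its tangent (directional derivative)."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __mul__(self, other):
        # Product rule: d(uv) = u' v + u v'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

def sin(u):
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

# f(x1, x2) = x1*x2 + sin(x1); seed x1_dot = 1, x2_dot = 0 to get df/dx1
x1, x2 = Dual(2.0, 1.0), Dual(3.0, 0.0)
f = x1 * x2 + sin(x1)
print(f.val, f.dot)   # f.dot == x2 + cos(x1) = 3 + cos(2)
```

Note that one forward sweep yields the derivative with respect to a single seeded input; getting the full gradient requires one sweep per input.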
Example: Reverse mode AD $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$, drawn as a computational graph with input nodes $x_1, x_2$ and operation nodes $*$, $\sin(\cdot)$, $+$ (C) Dhruv Batra 21
Example: Reverse mode AD $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ with intermediates $w_1 = \sin(x_1)$, $w_2 = x_1 x_2$, $w_3 = w_1 + w_2$. Adjoints propagate backward along the graph: $\bar{w}_3 = 1$, $\quad \bar{w}_1 = \bar{w}_3$, $\quad \bar{w}_2 = \bar{w}_3$, $\quad \bar{x}_1 = \bar{w}_1 \cos(x_1) + \bar{w}_2 x_2$, $\quad \bar{x}_2 = \bar{w}_2 x_1$ (C) Dhruv Batra 22
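A minimal sketch of reverse mode for the same function (Python assumed; variable names are illustrative): a forward pass stores the intermediates, then adjoints are pushed back in reverse topological order, accumulating where a variable fans out (here $x_1$, which feeds both the $\sin$ node and the $*$ node).

```python
import math

x1, x2 = 2.0, 3.0

# Forward pass: evaluate and store intermediates
w1 = math.sin(x1)
w2 = x1 * x2
w3 = w1 + w2                        # f(x1, x2)

# Reverse pass: adjoints w_bar = df/dw, in reverse topological order
w3_bar = 1.0
w1_bar = w3_bar                     # + gate distributes the gradient
w2_bar = w3_bar
x1_bar = w1_bar * math.cos(x1)      # contribution through sin
x1_bar += w2_bar * x2               # contribution through *; gradients add at branches
x2_bar = w2_bar * x1

print(w3, x1_bar, x2_bar)           # x1_bar == x2 + cos(x1), x2_bar == x1
```

One backward sweep produces the full gradient with respect to all inputs, which is why reverse mode (backprop) is the default when a scalar loss depends on many parameters.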
Forward Pass vs Forward mode AD vs Reverse Mode AD $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$. Forward mode: $\dot{w}_1 = \cos(x_1)\,\dot{x}_1$, $\dot{w}_2 = \dot{x}_1 x_2 + x_1 \dot{x}_2$, $\dot{w}_3 = \dot{w}_1 + \dot{w}_2$. Reverse mode: $\bar{w}_3 = 1$, $\bar{w}_1 = \bar{w}_3$, $\bar{w}_2 = \bar{w}_3$, $\bar{x}_1 = \bar{w}_1 \cos(x_1) + \bar{w}_2 x_2$, $\bar{x}_2 = \bar{w}_2 x_1$ (C) Dhruv Batra 23
Forward mode vs Reverse Mode • What are the differences? • Which one is more memory efficient (less storage)? – Forward or backward? (C) Dhruv Batra 24
Forward mode vs Reverse Mode • What are the differences? • Which one is more memory efficient (less storage)? – Forward or backward? • Which one is faster to compute? – Forward or backward? (C) Dhruv Batra 25
Plan for Today • (Finish) Computing Gradients – Forward mode vs Reverse mode AD – Patterns in backprop – Backprop in FC+ReLU NNs • Convolutional Neural Networks (C) Dhruv Batra 26
Patterns in backward flow Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Patterns in backward flow add gate: gradient distributor Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Patterns in backward flow add gate: gradient distributor Q: What is a max gate? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Patterns in backward flow add gate: gradient distributor max gate: gradient router Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Patterns in backward flow add gate: gradient distributor max gate: gradient router Q: What is a mul gate? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Patterns in backward flow add gate: gradient distributor max gate: gradient router mul gate: gradient switcher Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Gradients add at branches Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Duality in Fprop and Bprop: a SUM (+) node in FPROP becomes a COPY in BPROP, and a COPY (branch) in FPROP becomes a SUM (+) in BPROP (C) Dhruv Batra 35
Modularized implementation: forward / backward API Graph (or Net) object (rough pseudo code) 36 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
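The slide's pseudocode is not reproduced in this text; the following is a minimal sketch of what such a Graph/Net object might look like (Python assumed; the names Net and gates are illustrative, not CS 231n's exact code):

```python
class Net:
    def __init__(self, gates):
        # gates: list of node objects, already in topological order
        self.gates = gates

    def forward(self, x):
        for gate in self.gates:              # visit nodes in topological order
            x = gate.forward(x)
        return x                             # final output (e.g. the loss)

    def backward(self):
        grad = 1.0                           # d(loss)/d(loss)
        for gate in reversed(self.gates):    # reverse topological order
            grad = gate.backward(grad)       # chain rule through each gate
        return grad                          # d(loss)/d(input)
```

The point of the API is that each gate only needs to know its local forward computation and its local gradient; the graph object stitches them together.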
Modularized implementation: forward / backward API Example gate: $z = x * y$ (x, y, z are scalars) 37-38 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
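A minimal sketch of such a gate (Python assumed; the class name MultiplyGate is illustrative): forward caches its inputs, since they are exactly the local gradients needed later, and backward multiplies them by the upstream gradient.

```python
class MultiplyGate:
    def forward(self, x, y):
        # Cache inputs: dz/dx = y and dz/dy = x are needed in backward
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Chain rule: upstream gradient dz times local gradients
        dx = dz * self.y     # dL/dx = dL/dz * dz/dx
        dy = dz * self.x     # dL/dy = dL/dz * dz/dy
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)
dx, dy = gate.backward(1.0)   # dx = -4.0, dy = 3.0 (the "gradient switcher" pattern)
```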
Example: Caffe layers Caffe is licensed under BSD 2-Clause 39 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Caffe Sigmoid Layer: the backward pass multiplies the local sigmoid derivative by top_diff (chain rule) Caffe is licensed under BSD 2-Clause 40 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
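Caffe's actual layer is C++; purely as an illustration of the same structure, here is a sketch (Python/NumPy assumed) of what the sigmoid layer's forward and backward compute, with the "bottom" gradient obtained by multiplying the local derivative $\sigma(x)(1 - \sigma(x))$ by top_diff:

```python
import numpy as np

class SigmoidLayer:
    def forward(self, bottom):
        # top = sigmoid(bottom), cached for the backward pass
        self.top = 1.0 / (1.0 + np.exp(-bottom))
        return self.top

    def backward(self, top_diff):
        # Local derivative of sigmoid is top * (1 - top);
        # chain rule: multiply elementwise by the upstream gradient top_diff
        return top_diff * self.top * (1.0 - self.top)

layer = SigmoidLayer()
y = layer.forward(np.array([-1.0, 0.0, 2.0]))
dx = layer.backward(np.ones(3))
```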
Key Computation in DL: Forward-Prop (C) Dhruv Batra 43 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Key Computation in DL: Back-Prop (C) Dhruv Batra 44 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Jacobian of ReLU f(x) = max(0, x), applied elementwise to a 4096-d input vector, producing a 4096-d output vector. Q: what is the size of the Jacobian matrix? [4096 x 4096!] 45-47 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Jacobian of ReLU f(x) = max(0, x) (elementwise, 4096-d in, 4096-d out). Q: what is the size of the Jacobian matrix? [4096 x 4096!] In practice we process an entire minibatch (e.g. 100 examples) at one time, so the Jacobian would technically be a [409,600 x 409,600] matrix :\ 48 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Jacobian of ReLU f(x) = max(0, x) (elementwise, 4096-d in, 4096-d out). Q: what is the size of the Jacobian matrix? [4096 x 4096!] Q2: what does it look like? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
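Because ReLU acts elementwise, its Jacobian is diagonal: entry (i, i) is 1 where x_i > 0 and 0 otherwise, so in practice nobody forms the full matrix. A minimal sketch (Python/NumPy assumed, small dimension for readability):

```python
import numpy as np

x = np.array([1.5, -2.0, 0.3, -0.7])
y = np.maximum(0, x)                 # ReLU forward

# Full Jacobian (diagonal: dy_i/dx_j = 1 if i == j and x_i > 0, else 0)
J = np.diag((x > 0).astype(float))

# In practice, backprop never builds J; it just masks the upstream gradient
upstream = np.array([0.1, 0.2, 0.3, 0.4])
grad_x_full = J @ upstream           # explicit Jacobian-vector product
grad_x_fast = upstream * (x > 0)     # equivalent elementwise masking
assert np.allclose(grad_x_full, grad_x_fast)
```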
Jacobians of FC-Layer (C) Dhruv Batra 50
Jacobians of FC-Layer (C) Dhruv Batra 51
Jacobians of FC-Layer (C) Dhruv Batra 52
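The derivations on these slides are not reproduced in this text; as a sketch of the standard result, for a fully connected layer $y = Wx + b$ the Jacobian $\partial y / \partial x$ is simply $W$, so the backward pass computes $\partial L/\partial x = W^\top (\partial L/\partial y)$ and $\partial L/\partial W = (\partial L/\partial y)\,x^\top$. In code (Python/NumPy assumed, with illustrative shapes):

```python
import numpy as np

D_in, D_out = 4, 3
W = np.random.randn(D_out, D_in)
b = np.random.randn(D_out)
x = np.random.randn(D_in)

y = W @ x + b                    # forward: y = Wx + b

dy = np.random.randn(D_out)      # upstream gradient dL/dy
dx = W.T @ dy                    # dL/dx: Jacobian dy/dx is W, so dx = W^T dy
dW = np.outer(dy, x)             # dL/dW = dy x^T
db = dy                          # dL/db = dy
```

Again, the full Jacobians are never materialized; only Jacobian-vector products are computed.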
Convolutional Neural Networks (without the brain stuff) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Fully Connected Layer Example: 200x200 image 40K hidden units ~2B parameters !!! - Spatial correlation is local - Waste of resources, and we do not have enough training samples anyway... 54 Slide Credit: Marc'Aurelio Ranzato
Locally Connected Layer Example: 200x200 image 40K hidden units Filter size: 10x10 4M parameters Note: This parameterization is good when input image is registered (e.g., face recognition). 55 Slide Credit: Marc'Aurelio Ranzato
Locally Connected Layer STATIONARITY? Statistics are similar at different locations Example: 200x200 image 40K hidden units Filter size: 10x10 4M parameters Note: This parameterization is good when input image is registered (e.g., face recognition). 56 Slide Credit: Marc'Aurelio Ranzato
Convolutional Layer Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels 57 Slide Credit: Marc'Aurelio Ranzato
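To make the parameter counts on the last few slides concrete, here is a small back-of-the-envelope calculation (Python; the numbers follow the running 200x200 image / 40K hidden units / 10x10 filter example, and assuming one filter per hidden unit in the locally connected case):

```python
H = W = 200          # input image size
hidden = 40_000      # number of hidden units
k = 10               # filter size (10x10)

fully_connected   = (H * W) * hidden   # every unit connected to every pixel
locally_connected = (k * k) * hidden   # each unit has its own 10x10 filter
convolutional     = (k * k)            # one 10x10 filter shared across locations

print(fully_connected)    # 1,600,000,000  (~2B on the slide)
print(locally_connected)  # 4,000,000      (4M on the slide)
print(convolutional)      # 100 weights per learned kernel
```

Weight sharing is what collapses millions of parameters down to a handful per kernel, which is the whole point of the convolutional layer.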
Convolutions for mathematicians (C) Dhruv Batra 58
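The figure on this slide is not reproduced in this text; for reference, the standard discrete 2D convolution the title alludes to is the following (standard notation, not taken from the slide), where deep learning libraries typically implement the unflipped cross-correlation variant instead:

```latex
% Discrete 2D convolution of an image x with a kernel k
(x * k)[i, j] \;=\; \sum_{m} \sum_{n} x[m, n]\, k[i - m,\; j - n]

% Cross-correlation (what most "conv" layers actually compute)
(x \star k)[i, j] \;=\; \sum_{m} \sum_{n} x[i + m,\; j + n]\, k[m, n]
```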