Total Loss

[Figure: a feedforward network f(x; θ) applied to four example inputs; the predicted outputs (0.05, 0.02, 0.96, 0.35) are compared against the actual labels (1, 0, 1, 1)]

J(θ) = (1/N) Σ_i ℓ(f(x^(i); θ), y^(i))
Binary Cross Entropy Loss

[Figure: the same network and example predictions/labels as on the previous slide]

J_cross-entropy(θ) = −(1/N) Σ_i [ y^(i) log f(x^(i); θ) + (1 − y^(i)) log(1 − f(x^(i); θ)) ]

• For classification problems with a softmax output layer.
• Maximize the log-probability of the correct class given an input.
Mean Squared Error Loss

[Figure: the same network and example predictions/labels as above]

J_MSE(θ) = (1/N) Σ_i ( f(x^(i); θ) − y^(i) )²
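To make these two losses concrete, here is a minimal NumPy sketch (an illustration, not part of the slides) that evaluates them on the example predictions and labels shown in the figure; the function names are my own.

```python
import numpy as np

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    # Average negative log-likelihood of the correct class.
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_pred, y_true):
    # Average squared difference between prediction and target.
    return np.mean((y_pred - y_true) ** 2)

# Toy predictions/labels from the slide's example.
y_pred = np.array([0.05, 0.02, 0.96, 0.35])
y_true = np.array([1, 0, 1, 1])
print(binary_cross_entropy(y_pred, y_true))  # large, driven by the confident mistake (0.05 vs 1)
print(mean_squared_error(y_pred, y_true))
```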
Training Neural Networks
Training

arg min_θ  (1/T) Σ_t ℓ(f(x^(t); θ), y^(t))  +  λ Ω(θ)
           (loss function)                     (regularizer)

• Learning is cast as optimization.
  — For classification problems, we would like to minimize classification error.
  — The loss function can sometimes be viewed as a surrogate for what we actually want to optimize (e.g., an upper bound).
Loss is a function of the model's parameters

[Figure: the loss J(θ) plotted as a surface over the model's parameters]
How to minimize loss?

Start at a random point.
How to minimize loss?

Compute the gradient, ∂J(θ)/∂θ, at the current point.
How to minimize loss?

Move in the direction opposite of the gradient to a new point.
How to minimize loss?

Repeat!
This is called Stochastic Gradient Descent (SGD).
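The loop on the last few slides, written out for a toy one-dimensional loss (a sketch of my own; the quadratic loss, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

def loss(theta):
    return (theta - 3.0) ** 2          # toy loss with its minimum at theta = 3

def grad(theta):
    return 2.0 * (theta - 3.0)         # its derivative

theta = np.random.randn()              # start at a random point
alpha = 0.1                            # learning rate
for step in range(100):
    theta = theta - alpha * grad(theta)  # move opposite to the gradient; repeat
print(theta, loss(theta))              # theta ~ 3, loss ~ 0
```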
Stochastic Gradient Descent (SGD)

θ^(t+1) = θ^(t) − α ∂J(θ)/∂θ

• Initialize θ randomly
• For N epochs
  — For each training example (x, y):
    — Compute the loss gradient: ∂J(θ)/∂θ
    — Update θ with the update rule: θ^(t+1) = θ^(t) − α ∂J(θ)/∂θ
Why is it Stochastic Gradient Descent?

θ^(t+1) = θ^(t) − α ∂J(θ)/∂θ

• Initialize θ randomly
• For N epochs
  — For each training example (x, y):
    — Compute the loss gradient: ∂J(θ)/∂θ  ← only an estimate of the true gradient!
    — Update θ with the update rule: θ^(t+1) = θ^(t) − α ∂J(θ)/∂θ
Why is it Stochastic Gradient Descent?

θ^(t+1) = θ^(t) − α ∂J(θ)/∂θ

• Initialize θ randomly
• For N epochs
  — For each training batch {(x_0, y_0), …, (x_B, y_B)}:
    — Compute the loss gradient: ∂J(θ)/∂θ = (1/B) Σ_k ∂J_k(θ)/∂θ  ← a more accurate estimate!
    — Update θ with the update rule: θ^(t+1) = θ^(t) − α ∂J(θ)/∂θ

Advantages:
• More accurate estimation of the gradient
  — Smoother convergence
  — Allows for larger learning rates
• Minibatches lead to fast training!
  — Can parallelize computation and achieve significant speed increases on GPUs
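A small sketch (my own construction, not from the slides) of the point being made here: averaging per-example gradients over a minibatch gives a lower-variance estimate of the full-dataset gradient than a single example does. The linear model and squared loss are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)
w = np.zeros(5)

def example_grad(x_i, y_i, w):
    # Gradient of (x_i . w - y_i)^2 w.r.t. w for a single example.
    return 2.0 * (x_i @ w - y_i) * x_i

def batch_grad(idx, w):
    # Average of per-example gradients over a minibatch.
    return np.mean([example_grad(X[i], y[i], w) for i in idx], axis=0)

single = example_grad(X[0], y[0], w)            # noisy estimate (B = 1)
mini   = batch_grad(rng.choice(1000, 64), w)    # smoother estimate (B = 64)
full   = batch_grad(np.arange(1000), w)         # "true" gradient on this dataset
print(np.linalg.norm(single - full), np.linalg.norm(mini - full))  # minibatch error is much smaller
```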
Stochastic Gradient Descent (SGD)

• Algorithm that performs updates after each example (or batch):
  — initialize θ ≡ {W^(1), b^(1), …, W^(L+1), b^(L+1)}
  — for N iterations (a training epoch = one iteration over all examples)
    — for each training example (or batch) (x^(t), y^(t)):
        Δ = −∇_θ ℓ(f(x^(t); θ), y^(t)) − λ ∇_θ Ω(θ)
        θ ← θ + α Δ
• To apply this algorithm to neural network training, we need:
  — the loss function ℓ(f(x^(t); θ), y^(t))
  — a procedure to compute the parameter gradients
  — the regularizer Ω(θ) (and its gradient ∇_θ Ω(θ))
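The algorithm above, written out as a minimal NumPy training loop. This is a sketch under my own assumptions: a linear model, squared loss, and L2 regularizer Ω(θ) = ‖w‖², with placeholder hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

w = rng.normal(size=3)                           # initialize theta randomly
alpha, lam, n_epochs = 0.01, 1e-3, 20

for epoch in range(n_epochs):                    # for N iterations
    for x_t, y_t in zip(X, y):                   # for each training example
        loss_grad = 2.0 * (x_t @ w - y_t) * x_t  # gradient of the squared loss
        reg_grad = 2.0 * w                       # gradient of Omega(theta) = ||w||^2
        delta = -loss_grad - lam * reg_grad      # Delta = -grad loss - lambda * grad regularizer
        w = w + alpha * delta                    # theta <- theta + alpha * Delta
print(w)  # close to [1, -2, 0.5]
```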
What is a neural network again?

• A family of parametric, non-linear and hierarchical representation-learning functions:
  a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1), θ_{L−1}), θ_L)
  — x: input, θ_l: parameters of layer l, a_l = h_l(x, θ_l): (non-)linear function
• Given a training corpus {X, Y}, find the optimal parameters:
  θ* ← arg min_θ Σ_{(x,y)∈(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
Neural network models

• A neural network model is a series of hierarchically connected functions.
• The hierarchy can be very, very complex.

[Figure: forward connections from the input through modules h_1(x_i; θ), …, h_5(x_i; θ) to the loss (feedforward architecture)]
Neural network models

• A neural network model is a series of hierarchically connected functions.
• The hierarchy can be very, very complex.

[Figure: interweaved connections between modules (Directed Acyclic Graph architecture, DAGNN)]
Neural network models

• A neural network model is a series of hierarchically connected functions.
• The hierarchy can be very, very complex.

[Figure: loopy connections between modules (recurrent architecture; special care needed)]
Neural network models

• A neural network model is a series of hierarchically connected functions.
• The hierarchy can be very, very complex.
• Functions → Modules

[Figure: the feedforward, DAG, and recurrent architectures from the previous slides, with each function h_l(x_i; θ) drawn as a module]
What is a module

• A module is a building block for our network.
• Each module is an object/function a = h(x; θ) that
  — contains trainable parameters (θ),
  — receives as an argument an input x,
  — and returns an output a based on the activation function h(…).
• The activation function should be (at least) first-order differentiable (almost) everywhere.
• For easier/more efficient backpropagation → store the module input:
  — easy to get the module output fast
  — easy to compute derivatives
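A minimal module sketch (an illustration of the idea, not a specific framework's API): a linear module a = Wx + b that holds its trainable parameters, caches its input during the forward pass, and exposes the derivatives needed for backpropagation. Names and shapes are my own choices.

```python
import numpy as np

class Linear:
    """A module a = h(x; theta) = W x + b that caches its input."""
    def __init__(self, n_in, n_out):
        self.W = 0.01 * np.random.randn(n_out, n_in)   # trainable parameters
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                       # store the module input for backprop
        return self.W @ x + self.b       # module output a

    def backward(self, grad_a):
        # Gradients w.r.t. the parameters and w.r.t. the module's input.
        self.dW = np.outer(grad_a, self.x)
        self.db = grad_a
        return self.W.T @ grad_a         # passed on to the previous module
```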
Anything goes, or do special constraints exist?

• A neural network is a composition of modules (building blocks).
• Any architecture works.
• If the architecture is a feedforward cascade, no special care is needed.
• If it is acyclic, there is a right order for computing the forward computations.
• If there are loops, these form recurrent connections (revisited later).
Forward computations

• Simply compute the activation of each module in the network: a_l = h_l(x_l; θ), where x_l = a_{l−1} (equivalently, a_l = x_{l+1}).
• We need to know the precise function behind each module h_l(…).
• Recursive operations: one module's output is another's input.
• Steps:
  — Visit the modules one by one, starting from the data input.
  — Some modules might have several inputs, coming from multiple modules.
  — Compute the module activations in the right order.
  — Make sure all the inputs are computed at the right time.
Backward computations

• Simply compute the gradients of each module for our data: d Loss(Input).
• We need to know the gradient formulation of each module, ∂h_l(x_l; θ_l), w.r.t. its inputs x_l and parameters θ_l.
• We need the forward computations first; their result is the sum of losses for our input data.
• Then take the reverse network (reverse the connections) and traverse it backwards.
• Instead of using the activation functions, we use their gradients.
• The whole process can be described very neatly and concisely with the backpropagation algorithm.
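Putting the last two slides together, a sketch (my own, assuming a plain feedforward cascade of modules like the Linear module above, each with forward/backward methods): visit the modules in order for the forward pass, then traverse the reversed list for the backward pass.

```python
def forward(modules, x):
    # Visit modules one by one starting from the data input;
    # one module's output is the next module's input.
    for m in modules:
        x = m.forward(x)
    return x

def backward(modules, grad_loss):
    # Traverse the reversed network, using each module's gradient
    # instead of its activation function.
    for m in reversed(modules):
        grad_loss = m.backward(grad_loss)
    return grad_loss
```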
Again, what is a neural network?

• a_L(x; θ_1,…,L) = h_L(h_{L−1}(… h_1(x, θ_1), θ_{L−1}), θ_L)
  — x: input, θ_l: parameters of layer l, a_l = h_l(x, θ_l): (non-)linear function
• Given a training corpus {X, Y}, find the optimal parameters:
  θ* ← arg min_θ Σ_{(x,y)∈(X,Y)} ℓ(y, a_L(x; θ_1,…,L))
• To use any gradient-descent-based optimization, θ^(t+1) = θ^(t) − η_t (∂ℒ/∂θ^(t)), we need the gradients ∂ℒ/∂θ_l for l = 1, …, L.
• How do we compute the gradients of such a complicated function, one that encloses other functions, like a_L(…)?
How do we compute gradients?

• Numerical Differentiation
• Symbolic Differentiation
• Automatic Differentiation (AutoDiff)
Numerical Differentiation

(1_i: a vector of all zeros, except for a 1 in the i-th location)

• We can approximate the gradient numerically, using:
  ∂f(x)/∂x_i = lim_{h→0} [f(x + h·1_i) − f(x)] / h  ≈  [f(x + h·1_i) − f(x)] / h  for small h
• Even better, we can use central differencing:
  ∂f(x)/∂x_i = lim_{h→0} [f(x + h·1_i) − f(x − h·1_i)] / (2h)  ≈  [f(x + h·1_i) − f(x − h·1_i)] / (2h)
• However, both of these suffer from rounding errors and are not good enough for learning (they are very good tools for checking the correctness of an implementation though, e.g., use h = 0.000001).

slide adapted from T. Chen, H. Shen, A. Krishnamurthy
Numerical Differentiation

(1_ij: a matrix of all zeros, except for a 1 in the (i, j)-th location; 1_j as before)

• Applied to a loss L(W, b), the same approximation gives, for each weight and bias:
  ∂L(W, b)/∂w_ij ≈ [L(W + h·1_ij, b) − L(W, b)] / h        ∂L(W, b)/∂b_j ≈ [L(W, b + h·1_j) − L(W, b)] / h
• Even better, we can use central differencing:
  ∂L(W, b)/∂w_ij ≈ [L(W + h·1_ij, b) − L(W − h·1_ij, b)] / (2h)        ∂L(W, b)/∂b_j ≈ [L(W, b + h·1_j) − L(W, b − h·1_j)] / (2h)
• However, both of these suffer from rounding errors and are not good enough for learning (they are very good tools for checking the correctness of an implementation though, e.g., use h = 0.000001).

slide adapted from T. Chen, H. Shen, A. Krishnamurthy
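A gradient check along these lines, using central differencing with the suggested step size (a sketch; the example loss L(W) = Σ w² and its analytic gradient 2W are my own illustration):

```python
import numpy as np

def numerical_grad(loss_fn, W, h=1e-6):
    # Central differencing: one pair of perturbed evaluations per entry of W.
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        E = np.zeros_like(W)
        E[idx] = h                                            # h * 1_ij
        grad[idx] = (loss_fn(W + E) - loss_fn(W - E)) / (2 * h)
    return grad

# Example: check the analytic gradient of L(W) = sum(W^2), which is 2W.
W = np.random.randn(3, 4)
analytic = 2 * W
numeric = numerical_grad(lambda W_: np.sum(W_ ** 2), W)
print(np.max(np.abs(analytic - numeric)))   # tiny, so the analytic gradient checks out
```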
Symbolic Differentiation

y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2)

• The input function is represented as a computational graph (a symbolic tree).

[Figure: computational graph with input nodes v0 = x1 and v1 = x2, intermediate nodes v2 (ln), v3 (×), v4 (sin), v5 (+), v6 (−), and output y]

• Implements differentiation rules for composite functions:
  — Sum rule: d(f(x) + g(x))/dx = df(x)/dx + dg(x)/dx
  — Product rule: d(f(x) · g(x))/dx = (df(x)/dx) g(x) + f(x) (dg(x)/dx)
  — Chain rule: df(g(x))/dx = (df/dg)(g(x)) · dg(x)/dx

• Problem: for complex functions, expressions can become exponentially large; it is also difficult to deal with piece-wise functions (creates many symbolic cases).

slide adapted from T. Chen, H. Shen, A. Krishnamurthy
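For instance, a symbolic tool such as SymPy applies exactly these rules to the expression tree of f (shown purely as an illustration of symbolic differentiation; SymPy is not part of the lecture):

```python
import sympy

x1, x2 = sympy.symbols('x1 x2')
f = sympy.ln(x1) + x1 * x2 - sympy.sin(x2)

print(sympy.diff(f, x1))   # x2 + 1/x1
print(sympy.diff(f, x2))   # x1 - cos(x2)
```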
Automatic Differentiation (AutoDiff)

y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2)

• Intuition: interleave symbolic differentiation and simplification.
• Key idea: apply symbolic differentiation at the elementary operation level; evaluate and keep intermediate results.

The success of deep learning owes A LOT to the success of AutoDiff algorithms (and also to advances in parallel architectures, large datasets, …).

slide adapted from T. Chen, H. Shen, A. Krishnamurthy
Automatic Differentiation (AutoDiff)

y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2)

[Figure: computational graph with input nodes v0 = x1 and v1 = x2, intermediate nodes v2 (ln), v3 (×), v4 (sin), v5 (+), v6 (−), and output y]

• Each node is an input, intermediate, or output variable.
• Computational graph (a DAG) with variable ordering from a topological sort.

slide adapted from T. Chen, H. Shen, A. Krishnamurthy
Automatic Differentiation (AutoDiff)

y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2)

The computational graph is governed by these equations:
  v0 = x1
  v1 = x2
  v2 = ln(v0)
  v3 = v0 · v1
  v4 = sin(v1)
  v5 = v2 + v3
  v6 = v5 − v4
  y  = v6

• Each node is an input, intermediate, or output variable.
• Computational graph (a DAG) with variable ordering from a topological sort.

slide adapted from T. Chen, H. Shen, A. Krishnamurthy
Automatic Differentiation (AutoDiff)

y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2)

Let's see how we can evaluate a function using the computational graph (DNN inference).

Forward Evaluation Trace for f(2, 5):
  v0 = x1        = 2
  v1 = x2        = 5
  v2 = ln(v0)    = ln(2) = 0.693
  v3 = v0 · v1   = 2 × 5 = 10
  v4 = sin(v1)   = sin(5) = −0.959
  v5 = v2 + v3   = 0.693 + 10 = 10.693
  v6 = v5 − v4   = 10.693 + 0.959 = 11.652
  y  = v6        = 11.652

slide adapted from T. Chen, H. Shen, A. Krishnamurthy
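The same forward evaluation trace, written directly as code (a sketch; variable names follow the slide):

```python
import math

def f_trace(x1, x2):
    v0 = x1
    v1 = x2
    v2 = math.log(v0)      # ln(2) = 0.693
    v3 = v0 * v1           # 2 * 5 = 10
    v4 = math.sin(v1)      # sin(5) = -0.959
    v5 = v2 + v3           # 10.693
    v6 = v5 - v4           # 10.693 + 0.959 = 11.652
    return v6

print(f_trace(2, 5))       # 11.652
```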
Automatic Differentiation (AutoDiff)

y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2)

Now let's see how we can evaluate the derivative ∂f(x1, x2)/∂x1 at (x1 = 2, x2 = 5) using the computational graph.

We will do this with forward mode first, by introducing the derivative of each variable node with respect to the input variable.
Automatic Differentiation (AutoDiff)

y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2)

Forward Derivative Trace for ∂f(x1, x2)/∂x1 at (x1 = 2, x2 = 5):
  ∂v0/∂x1 = 1
  ∂v1/∂x1 = 0
  ∂v2/∂x1 = (1/v0) · ∂v0/∂x1                   = 1/2 × 1 = 0.5    (chain rule)
  ∂v3/∂x1 = (∂v0/∂x1) · v1 + v0 · (∂v1/∂x1)    = 1×5 + 2×0 = 5    (product rule)
  ∂v4/∂x1 = (∂v1/∂x1) · cos(v1)                = 0 × cos(5) = 0
  ∂v5/∂x1 = ∂v2/∂x1 + ∂v3/∂x1                  = 0.5 + 5 = 5.5
  ∂v6/∂x1 = ∂v5/∂x1 − ∂v4/∂x1                  = 5.5 − 0 = 5.5
  ∂y/∂x1  = ∂v6/∂x1                            = 5.5
Automatic Differentiation (AutoDiff)

y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2)

We now have:
  ∂f(x1, x2)/∂x1 at (x1 = 2, x2 = 5)  =  5.5

Still need:
  ∂f(x1, x2)/∂x2 at (x1 = 2, x2 = 5)
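Forward mode is often implemented with dual numbers: each variable carries its value together with its derivative with respect to one chosen input, so each forward pass yields the derivative with respect to that input, and ∂f/∂x2 just needs a second pass seeded on x2. The sketch below is my own illustration, not code from the lecture.

```python
import math

class Dual:
    """Value paired with its derivative w.r.t. one chosen input (forward mode)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)      # sum rule
    def __sub__(self, other):
        return Dual(self.val - other.val, self.dot - other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)     # product rule

def d_ln(v):
    return Dual(math.log(v.val), v.dot / v.val)                      # chain rule
def d_sin(v):
    return Dual(math.sin(v.val), v.dot * math.cos(v.val))

def f(x1, x2):
    return d_ln(x1) + x1 * x2 - d_sin(x2)                            # ln(x1) + x1*x2 - sin(x2)

# Seed the derivative on x1: one forward pass gives df/dx1 = 5.5, matching the slide.
print(f(Dual(2.0, 1.0), Dual(5.0, 0.0)).dot)
# Seed on x2 instead: a second pass gives df/dx2 = x1 - cos(x2) = 2 - cos(5) = 1.716.
print(f(Dual(2.0, 0.0), Dual(5.0, 1.0)).dot)
```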