Training a two-layer network

Suppose we have D inputs $x_1, \ldots, x_D$, M hidden units $z_1, \ldots, z_M$, and C outputs $\hat{y}_1, \ldots, \hat{y}_C$.

Model: $\hat{y} = g(W h(V x))$

Cost function we want to minimize:
\[
J(W, V) = \sum_n L\big(y^{(n)}, g(W h(V x^{(n)}))\big)
\]
For simplicity we drop the bias terms. We need the gradients with respect to W and V, $\frac{\partial J}{\partial W}$ and $\frac{\partial J}{\partial V}$.

It is simpler to write this for one instance $(n)$, so we will calculate $\frac{\partial L^{(n)}}{\partial W}$ and $\frac{\partial L^{(n)}}{\partial V}$, where $L^{(n)} = L(y^{(n)}, \hat{y}^{(n)})$, and recover
\[
\frac{\partial J}{\partial W} = \sum_{n=1}^{N} \frac{\partial L^{(n)}}{\partial W},
\qquad
\frac{\partial J}{\partial V} = \sum_{n=1}^{N} \frac{\partial L^{(n)}}{\partial V}.
\]
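To fix the notation, here is a minimal NumPy sketch of the forward pass $\hat{y} = g(W h(V x))$ for a single instance; taking the logistic for $h$ and the softmax for $g$ is an assumption for illustration (these are the choices used in the classification example later).

import numpy as np

def forward(x, W, V):
    # x: (D,) input, V: (M, D) first-layer weights, W: (C, M) second-layer weights
    q = V @ x                          # hidden pre-activations, q_m = sum_d V_{m,d} x_d
    z = 1.0 / (1.0 + np.exp(-q))       # hidden units z_m = h(q_m), logistic h assumed
    u = W @ z                          # output pre-activations, u_c = sum_m W_{c,m} z_m
    e = np.exp(u - u.max())            # subtract the max for numerical stability
    return e / e.sum()                 # outputs yhat = g(u), softmax g assumed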
Gradient calculation

For one instance the network computes
\[
\hat{y}_c = g(u_c), \qquad
u_c = \sum_{m=1}^{M} W_{c,m} z_m, \qquad
z_m = h(q_m), \qquad
q_m = \sum_{d=1}^{D} V_{m,d} x_d,
\]
where the $u_c$ and $q_m$ are the pre-activations of the output and hidden layers, and the loss is $L(y, \hat{y})$.

Using the chain rule,
\[
\frac{\partial L}{\partial W_{c,m}}
= \frac{\partial L}{\partial \hat{y}_c}
  \frac{\partial \hat{y}_c}{\partial u_c}
  \frac{\partial u_c}{\partial W_{c,m}},
\]
where $\frac{\partial L}{\partial \hat{y}_c}$ depends on the loss function, $\frac{\partial \hat{y}_c}{\partial u_c}$ depends on the output activation function, and $\frac{\partial u_c}{\partial W_{c,m}} = z_m$.

Similarly for V,
\[
\frac{\partial L}{\partial V_{m,d}}
= \sum_c
  \frac{\partial L}{\partial \hat{y}_c}
  \frac{\partial \hat{y}_c}{\partial u_c}
  \frac{\partial u_c}{\partial z_m}
  \frac{\partial z_m}{\partial q_m}
  \frac{\partial q_m}{\partial V_{m,d}},
\]
where the first two factors again depend on the loss and the output activation, $\frac{\partial u_c}{\partial z_m} = W_{c,m}$, $\frac{\partial z_m}{\partial q_m}$ depends on the middle-layer activation, and $\frac{\partial q_m}{\partial V_{m,d}} = x_d$.
For regression we use the identity output activation and the squared error loss:
\[
\hat{y} = g(u) = u = Wz, \qquad
L(y, \hat{y}) = \tfrac{1}{2}\lVert y - \hat{y} \rVert_2^2 .
\]
Substituting,
\[
L(y, z) = \tfrac{1}{2}\lVert y - Wz \rVert_2^2 ,
\]
and taking the derivative,
\[
\frac{\partial L}{\partial W_{c,m}} = (\hat{y}_c - y_c)\, z_m ,
\]
which we have already seen in the linear regression lecture.
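The same result follows by filling in the three chain-rule factors above; a short worked derivation for the identity output activation:
\[
\frac{\partial L}{\partial \hat{y}_c} = \hat{y}_c - y_c, \qquad
\frac{\partial \hat{y}_c}{\partial u_c} = 1, \qquad
\frac{\partial u_c}{\partial W_{c,m}} = z_m
\quad\Longrightarrow\quad
\frac{\partial L}{\partial W_{c,m}} = (\hat{y}_c - y_c)\, z_m .
\]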
For binary classification (scalar output, C = 1) we use the logistic output and the cross-entropy loss:
\[
\hat{y} = g(u) = (1 + e^{-u})^{-1}, \qquad
L(y, \hat{y}) = -\big( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \big).
\]
Substituting and simplifying (see the logistic regression lecture),
\[
L(y, u) = y \log(1 + e^{-u}) + (1 - y) \log(1 + e^{u}).
\]
Substituting $u = \sum_m W_m z_m$ in L and taking the derivative,
\[
\frac{\partial L}{\partial W_m} = (\hat{y} - y)\, z_m .
\]
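The intermediate step is the derivative of the loss with respect to the pre-activation:
\[
\frac{\partial L}{\partial u}
= -y\,\frac{e^{-u}}{1 + e^{-u}} + (1 - y)\,\frac{e^{u}}{1 + e^{u}}
= -y\,(1 - \hat{y}) + (1 - y)\,\hat{y}
= \hat{y} - y,
\]
so that $\frac{\partial L}{\partial W_m} = \frac{\partial L}{\partial u}\frac{\partial u}{\partial W_m} = (\hat{y} - y)\, z_m$.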
For multiclass classification (C is the number of classes) we use the softmax output and the cross-entropy loss:
\[
\hat{y} = g(u) = \operatorname{softmax}(u), \qquad
L(y, \hat{y}) = -\sum_k y_k \log \hat{y}_k .
\]
Substituting and simplifying (see the logistic regression lecture),
\[
L(y, u) = -y^\top u + \log \sum_c e^{u_c}.
\]
Substituting $u_c = \sum_m W_{c,m} z_m$ in L and taking the derivative,
\[
\frac{\partial L}{\partial W_{c,m}} = (\hat{y}_c - y_c)\, z_m .
\]
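Again the key intermediate quantity is the derivative with respect to the pre-activation; with a one-hot y,
\[
\frac{\partial L}{\partial u_c} = -y_c + \frac{e^{u_c}}{\sum_k e^{u_k}} = \hat{y}_c - y_c,
\qquad
\frac{\partial L}{\partial W_{c,m}} = \frac{\partial L}{\partial u_c}\frac{\partial u_c}{\partial W_{c,m}} = (\hat{y}_c - y_c)\, z_m .
\]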
Gradient wrt V:
\[
\frac{\partial L}{\partial V_{m,d}}
= \sum_c
  \frac{\partial L}{\partial \hat{y}_c}
  \frac{\partial \hat{y}_c}{\partial u_c}
  \, W_{c,m} \,
  \frac{\partial z_m}{\partial q_m}
  \, x_d .
\]
We already did the first two factors; the remaining term, $\frac{\partial z_m}{\partial q_m}$, depends on the middle-layer activation:

- logistic: $\sigma(q_m)(1 - \sigma(q_m))$
- hyperbolic tangent: $1 - \tanh(q_m)^2$
- ReLU: $0$ if $q_m \le 0$, $1$ if $q_m > 0$

Example with the logistic sigmoid (and the softmax output above):
\[
\frac{\partial J}{\partial V_{m,d}}
= \sum_n \sum_c \big(\hat{y}_c^{(n)} - y_c^{(n)}\big)\, W_{c,m}\,
  \sigma\big(q_m^{(n)}\big)\big(1 - \sigma(q_m^{(n)})\big)\, x_d^{(n)}
= \sum_n \sum_c \big(\hat{y}_c^{(n)} - y_c^{(n)}\big)\, W_{c,m}\,
  z_m^{(n)} \big(1 - z_m^{(n)}\big)\, x_d^{(n)} .
\]
For the biases we simply assume the corresponding input is 1, i.e. $x_0^{(n)} = 1$.
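A minimal sketch of these three activation derivatives in NumPy (the function names are my own, not from the slides); each takes the pre-activation q and returns $h'(q)$ elementwise:

import numpy as np

def dlogistic(q):
    s = 1.0 / (1.0 + np.exp(-q))
    return s * (1.0 - s)              # sigma(q) * (1 - sigma(q))

def dtanh(q):
    return 1.0 - np.tanh(q) ** 2      # 1 - tanh(q)^2

def drelu(q):
    return (q > 0).astype(float)      # 0 for q <= 0, 1 for q > 0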
A common pattern: in both layers the weight gradient is an "error from above" times an "input from below",
\[
\frac{\partial L}{\partial W_{c,m}}
= \underbrace{\frac{\partial L}{\partial u_c}}_{\text{error from above}}
  \;\underbrace{z_m}_{\text{input from below}},
\qquad
\frac{\partial L}{\partial V_{m,d}}
= \underbrace{\frac{\partial L}{\partial q_m}}_{\text{error from above}}
  \;\underbrace{x_d}_{\text{input from below}},
\]
where
\[
\frac{\partial L}{\partial u_c} = \frac{\partial L}{\partial \hat{y}_c}\frac{\partial \hat{y}_c}{\partial u_c},
\qquad
\frac{\partial L}{\partial q_m} = \sum_c \frac{\partial L}{\partial u_c}\, W_{c,m}\, \frac{\partial z_m}{\partial q_m}.
\]
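This pattern is what makes the computation mechanical; a hedged NumPy sketch (my own notation, single instance, column vectors) of the two weight gradients and of the error propagated down to the hidden layer:

import numpy as np

def layer_grads(delta_u, z, x, W, dh_q):
    # delta_u: dL/du, shape (C,) -- "error from above" at the output layer
    # z: hidden activations (M,), x: inputs (D,), W: (C, M), dh_q: h'(q), shape (M,)
    dW = np.outer(delta_u, z)              # dL/dW_{c,m} = delta_u[c] * z[m]
    delta_q = (W.T @ delta_u) * dh_q       # dL/dq_m = (sum_c delta_u[c] W[c,m]) * h'(q_m)
    dV = np.outer(delta_q, x)              # dL/dV_{m,d} = delta_q[m] * x[d]
    return dW, dV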
Example: classification

Iris dataset (D = 2 features + 1 bias), M = 16 hidden units, C = 3 classes.

Model: $\hat{y} = \operatorname{softmax}(u)$, $u_c = \sum_m W_{c,m} z_m$, $z_m = \sigma(q_m)$, $q_m = \sum_d V_{m,d} x_d$.

The cost is the softmax cross-entropy,
\[
J = \sum_{n=1}^{N} \Big( -y^{(n)\top} u^{(n)} + \log \sum_c e^{u_c^{(n)}} \Big)
\]
(the code below averages over the N instances rather than summing).

Helper functions:

def logsumexp(
    Z,  # N x C
    ):
    Zmax = np.max(Z, axis=1)[:, None]
    lse = Zmax + np.log(np.sum(np.exp(Z - Zmax), axis=1))[:, None]
    return lse  # N x 1

def softmax(
    u,  # N x C
    ):
    u_exp = np.exp(u - np.max(u, 1)[:, None])
    return u_exp / np.sum(u_exp, axis=-1)[:, None]

Cost function:

def cost(X,  # N x D
         Y,  # N x C
         W,  # M x C
         V,  # D x M
         ):
    Q = np.dot(X, V)       # N x M   hidden pre-activations
    Z = logistic(Q)        # N x M   hidden activations
    U = np.dot(Z, W)       # N x C   output pre-activations
    Yh = softmax(U)        # N x C   predictions (not needed for the loss itself)
    nll = - np.mean(np.sum(U*Y, 1)[:, None] - logsumexp(U))   # mean over n of -(y^T u - logsumexp(u))
    return nll
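The cost above also calls a logistic helper that is not shown on these slides; a minimal version consistent with the shapes used here would be:

import numpy as np

def logistic(
    Q,  # N x M
    ):
    return 1.0 / (1.0 + np.exp(-Q))   # elementwise sigma(q)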
The gradients derived earlier, for the softmax output with logistic hidden units, are
\[
\frac{\partial L}{\partial W_{c,m}} = (\hat{y}_c - y_c)\, z_m,
\qquad
\frac{\partial L}{\partial V_{m,d}} = \sum_c (\hat{y}_c - y_c)\, W_{c,m}\, z_m (1 - z_m)\, x_d ,
\]
which the gradient function implements in matrix form (K = C = 3 is the number of classes):

def gradients(X,  # N x D
              Y,  # N x K
              W,  # M x K
              V,  # D x M
              ):
    Z = logistic(np.dot(X, V))             # N x M
    N, D = X.shape
    Yh = softmax(np.dot(Z, W))             # N x K
    dY = Yh - Y                            # N x K   error at the output pre-activations
    dW = np.dot(Z.T, dY)/N                 # M x K
    dZ = np.dot(dY, W.T)                   # N x M   error propagated to the hidden layer
    dV = np.dot(X.T, dZ * Z * (1 - Z))/N   # D x M
    return dW, dV

Check your gradient function using a finite-difference approximation that uses the cost function, e.g. scipy.optimize.check_grad.
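One way to run that check; scipy.optimize.check_grad works on a flat parameter vector, so we pack W and V into one vector (the packing helpers are my own, and this assumes the cost and gradients functions above):

import numpy as np
from scipy.optimize import check_grad

def check_gradients(X, Y, M):
    N, D = X.shape
    _, C = Y.shape

    def unpack(theta):
        W = theta[:M*C].reshape(M, C)
        V = theta[M*C:].reshape(D, M)
        return W, V

    def f(theta):                           # scalar cost
        W, V = unpack(theta)
        return cost(X, Y, W, V)

    def g(theta):                           # flattened analytic gradient
        W, V = unpack(theta)
        dW, dV = gradients(X, Y, W, V)
        return np.concatenate([dW.ravel(), dV.ravel()])

    theta0 = np.random.randn(M*C + D*M) * .01
    return check_grad(f, g, theta0)         # a small value (e.g. < 1e-5) means the two agree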
Using GD (gradient descent) for optimization:

def GD(X, Y, M, lr=.1, eps=1e-9, max_iters=100000):
    N, D = X.shape
    N, K = Y.shape
    W = np.random.randn(M, K)*.01
    V = np.random.randn(D, M)*.01
    dW = np.inf*np.ones_like(W)
    t = 0
    while np.linalg.norm(dW) > eps and t < max_iters:
        dW, dV = gradients(X, Y, W, V)
        W = W - lr*dW
        V = V - lr*dV
        t += 1
    return W, V

The resulting decision boundaries are shown on the slide (figure not reproduced here).
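A hedged end-to-end usage sketch (my own glue code, assuming the GD and gradients functions above): load two Iris features, add a bias column, one-hot encode the labels, train, and report the training accuracy.

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, :2]                          # D = 2 features
X = np.column_stack([np.ones(len(X)), X])     # + 1 bias column of ones
Y = np.eye(3)[iris.target]                    # N x 3 one-hot targets

W, V = GD(X, Y, M=16, max_iters=20000)
Z = 1.0 / (1.0 + np.exp(-np.dot(X, V)))       # hidden activations
pred = np.argmax(np.dot(Z, W), axis=1)        # argmax of logits = argmax of softmax
print("training accuracy:", np.mean(pred == iris.target))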
Automating gradient computation

Gradient computation is tedious and mechanical. Can we automate it?

Using numerical differentiation?
\[
\frac{\partial f}{\partial w} \approx \frac{f(w + \epsilon) - f(w)}{\epsilon}
\]
- approximates partial derivatives using finite differences
- needs multiple forward passes (one for each input-output pair, i.e. each partial derivative)
- can be slow and inaccurate
- useful for black-box cost functions or for checking the correctness of gradient functions

Symbolic differentiation: symbolic calculation of the derivatives; it does not identify the computational procedure and the reuse of intermediate values.

Automatic (algorithmic) differentiation is what we want: write code that calculates various functions, e.g. the cost function, and automatically produce (partial) derivatives, e.g. the gradients used in learning.
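A minimal sketch of the numerical approach (using central differences, which are slightly more accurate than the one-sided formula above); f is any black-box scalar function of a parameter vector:

import numpy as np

def numerical_gradient(f, w, eps=1e-6):
    # two forward passes per parameter: 2 * len(w) evaluations of f in total
    g = np.zeros_like(w, dtype=float)
    for i in range(len(w)):
        e = np.zeros_like(w, dtype=float)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)   # central difference
    return g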
Automatic differentiation

Idea: use the chain rule plus the derivatives of simple operations ($\times$, $\sin$, ...), and use a computational graph as a data structure for storing the results of the computation.

Step 1: break the function down into atomic operations. For $L = \frac{1}{2}(y - wx)^2$:
\[
a_1 = w,\quad
a_2 = x,\quad
a_3 = y,\quad
a_4 = a_2 \times a_1,\quad
a_5 = a_3 - a_4,\quad
a_6 = a_5^2,\quad
a_7 = .5 \times a_6 .
\]

Step 2: build a graph with the operations as internal nodes and the input variables as leaf nodes (here $a_1, a_2, a_3$ are the leaves and $a_4, \ldots, a_7$ the internal nodes, with $L = a_7$).

Step 3: there are two ways to use the computational graph to calculate derivatives:
- forward mode: start from the leaves and propagate derivatives upward;
- reverse mode: first, in a bottom-up (forward) pass, calculate the values of all the nodes $a_i$; then, in a top-down (backward) pass, calculate the derivatives.

This second procedure is called backpropagation when applied to neural networks.
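A hedged sketch of step 1 as code: the forward (evaluation) pass for $L = \frac{1}{2}(y - wx)^2$, recorded as the list of atomic values $a_1, \ldots, a_7$ that the graph would store:

def forward_trace(w, x, y):
    a1 = w
    a2 = x
    a3 = y
    a4 = a2 * a1        # w x
    a5 = a3 - a4        # y - w x
    a6 = a5 ** 2        # (y - w x)^2
    a7 = 0.5 * a6       # L
    return [a1, a2, a3, a4, a5, a6, a7]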
Forward mode

Suppose we want the derivative $\frac{\partial y_1}{\partial w_1}$ where
\[
y_1 = \sin(w_1 x + w_0), \qquad
y_2 = \cos(w_1 x + w_0).
\]
We can calculate both $y_1, y_2$ and the derivatives $\frac{\partial y_1}{\partial w_1}, \frac{\partial y_2}{\partial w_1}$ in a single forward pass.

We write $\dot{a}_i = \frac{\partial a_i}{\partial w_1}$; the leaf derivatives are initialized so as to identify which derivative we want. Evaluation and partial derivatives:
\[
\begin{aligned}
a_1 &= w_0, & \dot{a}_1 &= 0\\
a_2 &= w_1, & \dot{a}_2 &= 1\\
a_3 &= x,   & \dot{a}_3 &= 0\\
a_4 &= a_3 \times a_2 = w_1 x, & \dot{a}_4 &= a_3 \dot{a}_2 + \dot{a}_3 a_2 = x\\
a_5 &= a_4 + a_1 = w_1 x + w_0, & \dot{a}_5 &= \dot{a}_4 + \dot{a}_1 = x\\
y_1 = a_6 &= \sin(a_5), & \dot{a}_6 &= \dot{a}_5 \cos(a_5) = x\cos(w_1 x + w_0) = \tfrac{\partial y_1}{\partial w_1}\\
y_2 = a_7 &= \cos(a_5), & \dot{a}_7 &= -\dot{a}_5 \sin(a_5) = -x\sin(w_1 x + w_0) = \tfrac{\partial y_2}{\partial w_1}
\end{aligned}
\]
Note that we get all the partial derivatives $\frac{\partial \square}{\partial w_1}$ in one forward pass.
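Forward mode is easy to mechanize with dual numbers: each quantity carries its value together with its derivative with respect to the chosen input. A minimal sketch (my own, not from the slides):

import math

class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot        # value and d(value)/d(chosen input)
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)   # product rule

def sin(a):
    return Dual(math.sin(a.val), a.dot * math.cos(a.val))

def cos(a):
    return Dual(math.cos(a.val), -a.dot * math.sin(a.val))

# derivative wrt w1: seed its dot to 1 and all other leaves to 0
w0, w1, x = Dual(0.3), Dual(0.5, dot=1.0), Dual(2.0)
y1 = sin(w1 * x + w0)
y2 = cos(w1 * x + w0)
print(y1.dot, y2.dot)   # x*cos(w1*x + w0) and -x*sin(w1*x + w0)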
Forward mode: computational graph

Suppose we want the derivative $\frac{\partial y_1}{\partial w_1}$ where $y_1 = \sin(w_1 x + w_0)$ and $y_2 = \cos(w_1 x + w_0)$. We can represent this computation using a graph; each node stores its value and its derivative $\dot{a}_i = \frac{\partial a_i}{\partial w_1}$.

Leaf nodes:
\[
a_1 = w_0,\ \dot{a}_1 = 0; \qquad
a_2 = w_1,\ \dot{a}_2 = 1; \qquad
a_3 = x,\ \dot{a}_3 = 0.
\]
Internal nodes:
\[
\begin{aligned}
a_4 &= a_3 \times a_2, & \dot{a}_4 &= a_3 \dot{a}_2 + \dot{a}_3 a_2\\
a_5 &= a_4 + a_1, & \dot{a}_5 &= \dot{a}_4 + \dot{a}_1\\
y_1 = a_6 &= \sin(a_5), & \tfrac{\partial y_1}{\partial w_1} = \dot{a}_6 &= \dot{a}_5 \cos(a_5)\\
y_2 = a_7 &= \cos(a_5), & \tfrac{\partial y_2}{\partial w_1} = \dot{a}_7 &= -\dot{a}_5 \sin(a_5)
\end{aligned}
\]
Once the nodes upstream have calculated their values and derivatives, we may discard a node; e.g. once $\dot{a}_6$ and $\dot{a}_7$ are obtained we can discard the values and partial derivatives $a_5, \dot{a}_5, a_4, \dot{a}_4, a_1, \dot{a}_1$.
Reverse mode

Suppose we want the derivative $\frac{\partial y_2}{\partial w_1}$ where $y_2 = \cos(w_1 x + w_0)$.

1) First do a forward pass for evaluation:
\[
\begin{aligned}
a_1 &= w_0\\
a_2 &= w_1\\
a_3 &= x\\
a_4 &= a_3 \times a_2 = w_1 x\\
a_5 &= a_4 + a_1 = w_1 x + w_0\\
y_1 = a_6 &= \sin(a_5) = \sin(w_1 x + w_0)\\
y_2 = a_7 &= \cos(a_5) = \cos(w_1 x + w_0)
\end{aligned}
\]
2) Then use these values to calculate the partial derivatives in a backward pass. We write $\bar{\square} = \frac{\partial y_2}{\partial \square}$:
\[
\begin{aligned}
\bar{a}_7 &= \tfrac{\partial y_2}{\partial y_2} = 1\\
\bar{a}_6 &= \tfrac{\partial y_2}{\partial y_1} = 0\\
\bar{a}_5 &= \bar{a}_7 \tfrac{\partial a_7}{\partial a_5} + \bar{a}_6 \tfrac{\partial a_6}{\partial a_5}
= \bar{a}_6 \cos(a_5) - \bar{a}_7 \sin(a_5) = -\sin(w_1 x + w_0)\\
\bar{a}_4 &= \bar{a}_5 \tfrac{\partial a_5}{\partial a_4} = \bar{a}_5 = -\sin(w_1 x + w_0)
\end{aligned}
\]
Continuing down the graph in the same way gives $\bar{a}_2 = \bar{a}_4\, a_3 = -x \sin(w_1 x + w_0) = \frac{\partial y_2}{\partial w_1}$.
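A hedged sketch of the two passes as code (my own, not from the slides): the forward pass stores every node value, and the backward pass accumulates $\bar{a}_i = \frac{\partial y_2}{\partial a_i}$ down to the leaves, giving the derivatives with respect to $w_1$, $w_0$ and $x$ in a single backward pass.

import math

def reverse_mode(w0, w1, x):
    # forward pass: evaluate and store every node
    a1, a2, a3 = w0, w1, x
    a4 = a3 * a2                   # w1 * x
    a5 = a4 + a1                   # w1 * x + w0
    a6 = math.sin(a5)              # y1
    a7 = math.cos(a5)              # y2

    # backward pass: abar_i = d y2 / d a_i
    a7bar = 1.0                    # d y2 / d y2
    a6bar = 0.0                    # d y2 / d y1
    a5bar = a6bar * math.cos(a5) - a7bar * math.sin(a5)
    a4bar = a5bar                  # a5 = a4 + a1
    a1bar = a5bar
    a2bar = a4bar * a3             # a4 = a3 * a2
    a3bar = a4bar * a2
    return {"y2": a7, "d_w1": a2bar, "d_w0": a1bar, "d_x": a3bar}

print(reverse_mode(0.3, 0.5, 2.0))   # d_w1 should equal -x * sin(w1*x + w0)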