Applied Machine Learning
Gradient Computation & Automatic Differentiation
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives
- using the chain rule to calculate gradients
- automatic differentiation
  - forward mode
  - reverse mode (backpropagation)
Landscape of the cost function
model: a two-layer MLP, $f(x; W, V) = g(W h(V x))$
objective: $\min_{W,V} \sum_n L(y^{(n)}, f(x^{(n)}; W, V))$, where the loss function depends on the task
this is a non-convex optimization problem with exponentially many global optima: given one global optimum, we can produce many critical points (points where the gradient is zero) by
- permuting the hidden units in each layer
- for symmetric activations: negating the input/output of a unit
- for rectifiers: rescaling the input/output of a unit
general beliefs, supported by empirical results and by theoretical results in special settings:
- there are many more saddle points than local minima; saddle points are not stable, and SGD can escape them
- the number of local minima increases at lower costs, so most local optima are close to global optima
strategy: use gradient descent methods (covered earlier in the course)
image credit: https://www.offconvex.org
Jacobian matrix
for $f: \mathbb{R} \to \mathbb{R}$ we have the derivative $\frac{d f(w)}{d w} \in \mathbb{R}$
for $f: \mathbb{R}^D \to \mathbb{R}$ the gradient is the vector of all partial derivatives $\nabla_w f(w) = [\frac{\partial}{\partial w_1} f(w), \ldots, \frac{\partial}{\partial w_D} f(w)]^\top \in \mathbb{R}^D$
for $f: \mathbb{R}^D \to \mathbb{R}^M$ the Jacobian matrix of all partial derivatives is
$J = \begin{bmatrix} \frac{\partial f_1(w)}{\partial w_1} & \cdots & \frac{\partial f_1(w)}{\partial w_D} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_M(w)}{\partial w_1} & \cdots & \frac{\partial f_M(w)}{\partial w_D} \end{bmatrix} \in \mathbb{R}^{M \times D}$
note that we also use $J$ for the cost function
for all three cases we may simply write $\frac{\partial f(w)}{\partial w}$, where $M, D$ will be clear from the context
what if $W$ is a matrix? we assume it is reshaped into a vector for these calculations
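Since the slides use NumPy later on, the Jacobian definition can be checked numerically; the sketch below (with a hypothetical test function, not from the slides) approximates the $M \times D$ Jacobian by central differences:

```python
import numpy as np

def numerical_jacobian(f, w, eps=1e-6):
    """Approximate the M x D Jacobian of f: R^D -> R^M by central differences."""
    w = np.asarray(w, dtype=float)
    M, D = f(w).shape[0], w.shape[0]
    J = np.zeros((M, D))
    for d in range(D):
        e = np.zeros(D)
        e[d] = eps
        # column d holds the partial derivatives with respect to w_d
        J[:, d] = (f(w + e) - f(w - e)) / (2 * eps)
    return J

# hypothetical test function f(w) = [w_0 w_1, sin(w_0)]
f = lambda w: np.array([w[0] * w[1], np.sin(w[0])])
J = numerical_jacobian(f, np.array([1.0, 2.0]))
# analytic Jacobian at (1, 2): [[w_1, w_0], [cos(w_0), 0]]
```

Comparing a hand-derived gradient against this kind of approximation is the standard "gradient check" used throughout the rest of this lecture.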
Chain rule
for $x, y, z \in \mathbb{R}$, $f: x \mapsto z$ and $h: z \mapsto y$, we have
$\frac{dy}{dx} = \frac{dy}{dz} \frac{dz}{dx}$
(the speed of change in $y$ as we change $x$ is the speed of change in $y$ as we change $z$ times the speed of change in $z$ as we change $x$)
more generally, for $x \in \mathbb{R}^D, z \in \mathbb{R}^M, y \in \mathbb{R}^C$:
$\frac{\partial y_c}{\partial x_d} = \sum_{m=1}^{M} \frac{\partial y_c}{\partial z_m} \frac{\partial z_m}{\partial x_d}$
we are looking at all the "paths" through which a change in $x_d$ changes $y_c$, and add their contributions
in matrix form: $\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z} \frac{\partial z}{\partial x}$, where the $C \times D$ Jacobian is the product of the $C \times M$ and $M \times D$ Jacobians
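For linear maps the matrix form of the chain rule can be verified directly: if $z = Ax$ and $y = Bz$, the Jacobians are just $A$ and $B$, and the Jacobian of the composite map is their product. A small NumPy sketch (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, C = 4, 3, 2
A = rng.normal(size=(M, D))  # Jacobian of z = A x   (M x D)
B = rng.normal(size=(C, M))  # Jacobian of y = B z   (C x M)

# chain rule in matrix form: the C x D Jacobian of y(x) = B A x
J = B @ A

# entrywise version: dy_c/dx_d = sum_m (dy_c/dz_m)(dz_m/dx_d)
c, d = 1, 2
entry = sum(B[c, m] * A[m, d] for m in range(M))
```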
Training a two-layer network
suppose we have D inputs $x_1, \ldots, x_D$, M hidden units $z_1, \ldots, z_M$, and C outputs $\hat{y}_1, \ldots, \hat{y}_C$
model: $\hat{y} = g(W h(V x))$ (for simplicity we drop the bias terms)
cost function we want to minimize: $J(W, V) = \sum_n L(y^{(n)}, g(W h(V x^{(n)})))$
we need the gradients wrt W and V: $\frac{\partial J}{\partial W}, \frac{\partial J}{\partial V}$
it is simpler to write these for one instance $(n)$, so we will calculate $\frac{\partial L^{(n)}}{\partial W}$ and $\frac{\partial L^{(n)}}{\partial V}$ and recover $\frac{\partial J}{\partial W} = \sum_{n=1}^{N} \frac{\partial L^{(n)}}{\partial W}$ and $\frac{\partial J}{\partial V} = \sum_{n=1}^{N} \frac{\partial L^{(n)}}{\partial V}$
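A minimal forward pass for this model, assuming a logistic hidden activation $h$ and (for concreteness) an identity output $g$; the names and shapes follow the slide, but the code itself is an illustrative sketch:

```python
import numpy as np

def logistic(q):
    return 1.0 / (1.0 + np.exp(-q))

def forward(x, W, V):
    """One instance: x (D,), V (M x D), W (C x M) -> prediction (C,)."""
    q = V @ x        # hidden pre-activations q_m = sum_d V_{m,d} x_d
    z = logistic(q)  # hidden activations z = h(q)
    u = W @ z        # output pre-activations u_c = sum_m W_{c,m} z_m
    yhat = u         # identity output g; replace with softmax/sigmoid as needed
    return yhat, z, q

rng = np.random.default_rng(1)
D, M, C = 3, 4, 2
x = rng.normal(size=D)
V = rng.normal(size=(M, D))
W = rng.normal(size=(C, M))
yhat, z, q = forward(x, W, V)
```

Returning the intermediate values $z$ and $q$ alongside the prediction is deliberate: the gradient calculations on the following slides reuse them.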
Gradient calculation
using the chain rule, with $\hat{y}_c = g(u_c)$, $u_c = \sum_{m=1}^{M} W_{c,m} z_m$ (output pre-activations), $z_m = h(q_m)$, and $q_m = \sum_{d=1}^{D} V_{m,d} x_d$ (hidden pre-activations):
$\frac{\partial L}{\partial W_{c,m}} = \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial W_{c,m}}$
where $\frac{\partial L}{\partial \hat{y}_c}$ depends on the loss function, $\frac{\partial \hat{y}_c}{\partial u_c}$ depends on the activation function, and $\frac{\partial u_c}{\partial W_{c,m}} = z_m$
similarly for V:
$\frac{\partial L}{\partial V_{m,d}} = \sum_c \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial z_m} \frac{\partial z_m}{\partial q_m} \frac{\partial q_m}{\partial V_{m,d}}$
where $\frac{\partial u_c}{\partial z_m} = W_{c,m}$, $\frac{\partial z_m}{\partial q_m}$ depends on the middle layer activation, and $\frac{\partial q_m}{\partial V_{m,d}} = x_d$
Gradient calculation: regression
using the chain rule $\frac{\partial L}{\partial W_{c,m}} = \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial W_{c,m}}$
for regression: $\hat{y} = g(u) = u = Wz$ and $L(y, \hat{y}) = \frac{1}{2} \|y - \hat{y}\|_2^2$
substituting: $L(y, z) = \frac{1}{2} \|y - Wz\|_2^2$
taking the derivative: $\frac{\partial L}{\partial W_{c,m}} = (\hat{y}_c - y_c) z_m$
we have seen this in the linear regression lecture
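The regression gradient $(\hat{y}_c - y_c) z_m$ can be sanity-checked against a finite-difference approximation of the squared-error loss (random shapes, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
C, M = 2, 3
W = rng.normal(size=(C, M))
z = rng.normal(size=M)
y = rng.normal(size=C)

def loss(W):
    return 0.5 * np.sum((y - W @ z) ** 2)

# analytic gradient from the slide: dL/dW_{c,m} = (yhat_c - y_c) z_m
grad = np.outer(W @ z - y, z)

# central-difference check of one entry
eps = 1e-6
E = np.zeros_like(W)
E[0, 1] = eps
num = (loss(W + E) - loss(W - E)) / (2 * eps)
```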
Gradient calculation: binary classification
using the chain rule $\frac{\partial L}{\partial W_{c,m}} = \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial W_{c,m}}$
for binary classification we have a scalar output (C = 1): $\hat{y} = g(u) = (1 + e^{-u})^{-1}$ with $u = \sum_m W_m z_m$
cross-entropy loss: $L(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$
substituting and simplifying (see the logistic regression lecture): $L(y, u) = y \log(1 + e^{-u}) + (1 - y) \log(1 + e^{u})$
substituting $u$ in $L$ and taking the derivative: $\frac{\partial L}{\partial W_m} = (\hat{y} - y) z_m$
Gradient calculation: multiclass classification
using the chain rule $\frac{\partial L}{\partial W_{c,m}} = \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial W_{c,m}}$
for multiclass classification: $\hat{y} = g(u) = \mathrm{softmax}(u)$, where C is the number of classes
cross-entropy loss: $L(y, \hat{y}) = -\sum_k y_k \log \hat{y}_k$
substituting and simplifying (see the logistic regression lecture): $L(y, u) = -y^\top u + \log \sum_c e^{u_c}$
substituting $u_c = \sum_m W_{c,m} z_m$ in $L$ and taking the derivative: $\frac{\partial L}{\partial W_{c,m}} = (\hat{y}_c - y_c) z_m$
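The same kind of check works for the softmax case: differentiating $L(y, u) = -y^\top u + \log \sum_c e^{u_c}$ gives $\partial L / \partial u_c = \hat{y}_c - y_c$, hence $\partial L / \partial W_{c,m} = (\hat{y}_c - y_c) z_m$. A sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(3)
C, M = 3, 4
W = rng.normal(size=(C, M))
z = rng.normal(size=M)
y = np.array([0.0, 1.0, 0.0])  # one-hot label

def loss(W):
    u = W @ z
    return -y @ u + np.log(np.sum(np.exp(u)))

u = W @ z
yhat = np.exp(u - u.max())
yhat /= yhat.sum()             # softmax, shifted for numerical stability

grad = np.outer(yhat - y, z)   # (yhat_c - y_c) z_m

# central-difference check of one entry
eps = 1e-6
E = np.zeros_like(W)
E[1, 2] = eps
num = (loss(W + E) - loss(W - E)) / (2 * eps)
```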
Gradient calculation: gradient wrt V
we already did the first part of the chain:
$\frac{\partial L}{\partial V_{m,d}} = \sum_c \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} W_{c,m} \frac{\partial z_m}{\partial q_m} x_d$
$\frac{\partial z_m}{\partial q_m}$ depends on the middle layer activation:
- logistic function: $\sigma(q_m)(1 - \sigma(q_m))$
- hyperbolic tangent: $1 - \tanh^2(q_m)$
- ReLU: $0$ for $q_m \le 0$, $1$ for $q_m > 0$
example with a logistic sigmoid hidden layer and softmax output:
$\frac{\partial J}{\partial V_{m,d}} = \sum_n \sum_c (\hat{y}_c^{(n)} - y_c^{(n)}) W_{c,m} \sigma(q_m^{(n)})(1 - \sigma(q_m^{(n)})) x_d^{(n)} = \sum_n \sum_c (\hat{y}_c^{(n)} - y_c^{(n)}) W_{c,m} z_m^{(n)}(1 - z_m^{(n)}) x_d^{(n)}$
for biases we simply assume the corresponding input is 1: $x_0^{(n)} = 1$
Gradient calculation: a common pattern
$\frac{\partial L}{\partial W_{c,m}} = \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial W_{c,m}} = \underbrace{\frac{\partial L}{\partial u_c}}_{\text{error from above}} \cdot \underbrace{z_m}_{\text{input from below}}$
$\frac{\partial L}{\partial V_{m,d}} = \sum_c \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial u_c} \frac{\partial u_c}{\partial z_m} \frac{\partial z_m}{\partial q_m} \frac{\partial q_m}{\partial V_{m,d}} = \underbrace{\frac{\partial L}{\partial q_m}}_{\text{error from above}} \cdot \underbrace{x_d}_{\text{input from below}}$
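The "error from above times input from below" pattern can be written out as a complete backward pass for one instance, here with logistic hidden units and a softmax-cross-entropy output (an illustrative sketch, not the course's reference code):

```python
import numpy as np

def logistic(q):
    return 1.0 / (1.0 + np.exp(-q))

def backprop(x, y, W, V):
    """Per-instance gradients: x (D,), y (C,) one-hot, W (C x M), V (M x D)."""
    q = V @ x                      # hidden pre-activations
    z = logistic(q)                # hidden activations
    u = W @ z                      # output pre-activations
    yhat = np.exp(u - u.max())
    yhat /= yhat.sum()             # softmax
    du = yhat - y                  # error at the output pre-activations
    dW = np.outer(du, z)           # error from above x input from below
    dq = (W.T @ du) * z * (1 - z)  # error pushed through W and h'(q)
    dV = np.outer(dq, x)           # same pattern one layer down
    return dW, dV

rng = np.random.default_rng(4)
D, M, C = 3, 4, 2
x = rng.normal(size=D)
y = np.array([1.0, 0.0])
W = rng.normal(size=(C, M))
V = rng.normal(size=(M, D))
dW, dV = backprop(x, y, W, V)

def loss(W, V):
    u = W @ logistic(V @ x)
    return -y @ u + np.log(np.sum(np.exp(u)))
```

Note how the error term computed for $W$ (`du`) is reused when computing the error for $V$; this reuse is exactly what reverse-mode automatic differentiation exploits.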
Example: classification
Iris dataset (D = 2 features + 1 bias), M = 16 hidden units, C = 3 classes
model: $q_m = \sum_d V_{m,d} x_d$, $z_m = \sigma(q_m)$, $u_c = \sum_m W_{c,m} z_m$, $\hat{y} = \mathrm{softmax}(u)$
cost is softmax-cross-entropy: $J = -\sum_{n=1}^{N} \left( y^{(n)\top} u^{(n)} - \log \sum_c e^{u_c^{(n)}} \right)$ (the code below averages over instances instead of summing)

helper functions:

import numpy as np

def logistic(Q):
    # elementwise sigmoid (definition not shown on the slide)
    return 1.0 / (1.0 + np.exp(-Q))

def logsumexp(Z,  # N x C
              ):
    Zmax = np.max(Z, axis=1)[:, None]
    lse = Zmax + np.log(np.sum(np.exp(Z - Zmax), axis=1))[:, None]
    return lse[:, 0]  # N

def softmax(u,  # N x C
            ):
    u_exp = np.exp(u - np.max(u, 1)[:, None])
    return u_exp / np.sum(u_exp, axis=-1)[:, None]

cost:

def cost(X,  # N x D
         Y,  # N x C
         W,  # M x C
         V,  # D x M
         ):
    Q = np.dot(X, V)  # N x M
    Z = logistic(Q)   # N x M
    U = np.dot(Z, W)  # N x C
    Yh = softmax(U)   # predictions (not needed for the cost itself)
    nll = -np.mean(np.sum(U * Y, 1) - logsumexp(U))
    return nll
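To train the model, the cost above needs matching gradients. The following sketch pairs the slide's shapes (X: N x D, Y: N x C, W: M x C, V: D x M) with a batched backward pass; the helpers are repeated so the block runs on its own, and the gradient corresponds to the mean negative log-likelihood, matching the `np.mean` in the cost:

```python
import numpy as np

def logistic(Q):
    return 1.0 / (1.0 + np.exp(-Q))

def logsumexp(Z):  # Z: N x C -> N
    Zmax = np.max(Z, axis=1)[:, None]
    return (Zmax + np.log(np.sum(np.exp(Z - Zmax), axis=1))[:, None])[:, 0]

def cost(X, Y, W, V):
    U = logistic(X @ V) @ W            # N x C output pre-activations
    return -np.mean(np.sum(U * Y, 1) - logsumexp(U))

def gradients(X, Y, W, V):
    """Batched backprop for the cost above."""
    N = X.shape[0]
    Z = logistic(X @ V)                # N x M hidden activations
    U = Z @ W                          # N x C
    Yh = np.exp(U - np.max(U, 1)[:, None])
    Yh /= np.sum(Yh, 1)[:, None]       # softmax predictions
    dU = (Yh - Y) / N                  # error at the output pre-activations
    dW = Z.T @ dU                      # M x C
    dQ = (dU @ W.T) * Z * (1 - Z)      # error through W and sigma'(Q)
    dV = X.T @ dQ                      # D x M
    return dW, dV

rng = np.random.default_rng(5)
N, D, M, C = 6, 3, 5, 3
X = rng.normal(size=(N, D))
Y = np.eye(C)[rng.integers(0, C, size=N)]  # random one-hot labels
W = rng.normal(size=(M, C))
V = rng.normal(size=(D, M))
dW, dV = gradients(X, Y, W, V)
```

A gradient descent step is then simply `W -= lr * dW; V -= lr * dV` for some learning rate `lr`.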