  1. Deep Learning Gradient-based optimization Caio Corro Université Paris Sud 23 October 2019

  2. Table of contents Recall: neural networks The training loop Backpropagation Parameter initialization Regularization Better optimizers 2 / 64

  3. Recall: neural networks 3 / 64

  4. Neural network
     ◮ x: input features
     ◮ z^{(1)}, z^{(2)}, z^{(3)}: hidden representations
     ◮ z^{(4)}: output logits or class weights
     ◮ p: probability distribution over classes
     ◮ θ = { W^{(1)}, b^{(1)}, ... }: parameters
     ◮ σ: non-linear activation function

     z^{(1)} = σ(W^{(1)} x + b^{(1)})
     z^{(2)} = σ(W^{(2)} z^{(1)} + b^{(2)})
     z^{(3)} = σ(W^{(3)} z^{(2)} + b^{(3)})
     z^{(4)} = σ(W^{(4)} z^{(3)} + b^{(4)})
     p = Softmax(z^{(4)}), i.e. p_i = exp(z^{(4)}_i) / Σ_j exp(z^{(4)}_j)

     [Figure: fully-connected network with inputs x_1, ..., x_4, hidden layers z^{(1)}, z^{(2)}, z^{(3)} of 5 units each, and output distribution p]
     4 / 64
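
A minimal NumPy sketch of this forward pass may help make the notation concrete. The layer sizes (4 inputs, hidden layers of 5 units, 5 classes), the choice of tanh for σ, and the max-subtraction inside the softmax (a standard numerical-stability trick) are illustrative assumptions, not taken from the slide.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtracting the max does not change the result
    return e / e.sum()

def forward(x, params, sigma=np.tanh):
    """p = Softmax(z^(4)) with z^(l) = sigma(W^(l) z^(l-1) + b^(l)), as on the slide."""
    z = x
    for W, b in params:                  # params = [(W1, b1), ..., (W4, b4)]
        z = sigma(W @ z + b)
    return softmax(z)

# Toy usage with the assumed sizes.
rng = np.random.default_rng(0)
sizes = [4, 5, 5, 5, 5]
params = [(rng.normal(size=(n, m)), np.zeros(n)) for m, n in zip(sizes, sizes[1:])]
p = forward(rng.normal(size=4), params)
print(p, p.sum())                        # a probability distribution over 5 classes (sums to 1)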

  5. Representation learning: Computer Vision [Lee et al., 2009] 5 / 64

  6. Representation learning: Natural Language Processing [Voita et al., 2019] 6 / 64

  7. The training loop 7 / 64

  8. The big picture
     Data split and usage
     ◮ Training set: to learn the parameters of the network
     ◮ Development (or dev or validation) set: to monitor the network during training
     ◮ Test set: to evaluate our model at the end
     Generally you don’t have to split the data yourself: standard splits exist to allow benchmarking.

     Training loop
     1. Update the parameters to minimize the loss on the training set
     2. Evaluate the prediction accuracy on the dev set
     3. If not satisfied, go back to 1
     4. Evaluate the prediction accuracy on the test set with the best parameters on dev
     8 / 64

  9. Pseudo-code
     function Evaluate(f, D)
         n = 0
         for x, y ∈ D do
             ŷ = arg max_y f(x; θ)
             if ŷ = y then
                 n = n + 1
         return n / |D|

     function Train(f, θ, T, D)
         bestdev = −∞
         for epoch = 1 to E do
             Shuffle T
             for x, y ∈ T do
                 loss = L(f(x; θ), y)
                 θ = θ − ǫ∇loss
             devacc = Evaluate(f, D)
             if devacc > bestdev then
                 θ̂ = θ
                 bestdev = devacc
         return θ̂
     9 / 64
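
As a hedged illustration, the same loop can be written with PyTorch. Here model is any torch.nn.Module, train_set and dev_set are assumed to be lists of (x, y) pairs where x is a 1-D feature tensor and y a 0-dim class-index tensor, and plain SGD with a fixed step size stands in for the generic update θ = θ − ǫ∇loss.

import copy
import random
import torch

def evaluate(model, data):
    """Accuracy of argmax predictions, as in Evaluate(f, D)."""
    correct = 0
    with torch.no_grad():
        for x, y in data:
            correct += int(model(x).argmax() == y)
    return correct / len(data)

def train(model, train_set, dev_set, epochs=10, eps=0.01):
    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=eps)
    best_dev, best_state = float("-inf"), None
    for epoch in range(epochs):
        random.shuffle(train_set)                    # sampling without replacement
        for x, y in train_set:
            loss = loss_fn(model(x).unsqueeze(0), y.unsqueeze(0))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                         # theta = theta - eps * grad
        dev_acc = evaluate(model, dev_set)
        if dev_acc > best_dev:                       # keep the best parameters on dev
            best_dev = dev_acc
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model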

  10. Further details
      Sampling without replacement
      ◮ shuffle the training set
      ◮ loop over the new order
      Experimentally this works better than "true" sampling, and it also seems to have good theoretical properties [Nagaraj et al., 2019]

      Verbosity
      At each epoch, it is useful to display:
      ◮ mean loss
      ◮ accuracy on training data
      ◮ accuracy on dev data
      ◮ timing information
      ◮ (sometimes) evaluate on dev several times per epoch
      10 / 64

  11. Step-size
      θ^{(t+1)} = θ^{(t)} − ǫ^{(t)} ∇loss
      How to choose the step size ǫ^{(t+1)}?

      ⇒ Convex optimization
      ◮ Nonsummable diminishing step size: Σ_{t=1}^{∞} ǫ^{(t)} = ∞ and lim_{t→∞} ǫ^{(t)} = 0
      ◮ Backtracking/exact line search

      Simple neural network heuristic
      1. Start with a small value, e.g. ǫ = 0.01
      2. If dev accuracy did not improve during the last N epochs: decay the learning rate by multiplying it by a small factor α, e.g. ǫ = α · ǫ with α = 0.1

      Step-size annealing
      ◮ Step decay: multiply ǫ by α ∈ [0, 1] every N epochs
      ◮ Exponential decay: ǫ^{(t)} = ǫ^{(0)} exp(−α · t)
      ◮ 1/t decay: ǫ^{(t)} = ǫ^{(0)} / (1 + α · t)
      11 / 64
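
A sketch of the three annealing schedules as plain Python functions; eps0 and alpha correspond to ǫ^{(0)} and α on the slide, and t is the epoch index.

import math

def step_decay(eps0, alpha, N, t):
    """Multiply epsilon by alpha every N epochs."""
    return eps0 * alpha ** (t // N)

def exponential_decay(eps0, alpha, t):
    """eps(t) = eps(0) * exp(-alpha * t)"""
    return eps0 * math.exp(-alpha * t)

def inverse_time_decay(eps0, alpha, t):
    """eps(t) = eps(0) / (1 + alpha * t)"""
    return eps0 / (1 + alpha * t)

# e.g. with eps(0) = 0.01 and alpha = 0.1:
print([round(exponential_decay(0.01, 0.1, t), 5) for t in range(5)])
# [0.01, 0.00905, 0.00819, 0.00741, 0.0067]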

  12. Backpropagation 12 / 64

  13. Scalar input
      Derivative
      Let f: R → R be a function and x, y ∈ R be variables such that y = f(x).
      For a given x, how does an infinitesimal change of x impact y?
      dy/dx = f′(x) = lim_{ǫ→0} ( f(x + ǫ) − f(x) ) / ǫ

      Linear approximation
      Let f̃: R → R be a function parameterized by a ∈ R, defined as follows:
      f̃(x; a) = f(a) + f′(a) · (x − a)
      Then, f̃(x; a) is an approximation of f at a.
      13 / 64
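
The definition can be checked numerically: for a small ǫ the finite difference (f(x + ǫ) − f(x)) / ǫ approaches f′(x), and f̃(x; a) is close to f(x) near a. The function x² + 2 from the next slide is used here; the choice of x = 3 and ǫ = 1e-6 is an assumption for illustration.

def f(x):
    return x ** 2 + 2

def f_prime(x):
    return 2 * x

# Finite-difference approximation of the derivative at x = 3
x, eps = 3.0, 1e-6
print((f(x + eps) - f(x)) / eps)        # ~6.000001, close to f'(3) = 6

# Linear approximation around a = -6
def f_tilde(x, a):
    return f(a) + f_prime(a) * (x - a)

print(f(-5.9), f_tilde(-5.9, a=-6))     # 36.81 vs 36.8: accurate near a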

  14. Scalar input
      Example
      f(x) = x² + 2
      f′(x) = 2x
      f̃(x; a) = f(a) + f′(a) · (x − a) = a² + 2 + 2a(x − a) = 2ax + 2 − a²

      [Figure: f(x) in black and its linear approximation f̃(x; a = −6) in red, plotted for x ∈ [−10, 10]]

      Intuition: the sign of f′(a) gives the slope of the approximation; we can use this information to move closer to the minimum of f(x).
      14 / 64

  15. Scalar input
      Chain rule
      Let f: R → R and g: R → R be two functions and x, y, z be variables such that:
      z = f(x), y = g(z), i.e. y = g(f(x)) = (g ∘ f)(x)
      For a given x, how does an infinitesimal change of x impact y?
      dy/dx = (dy/dz) · (dz/dx)
      15 / 64

  16. Scalar input
      Example: explicit differentiation
      f(x) = (2x + 1)² = 4x² + 4x + 1
      f′(x) = 8x + 4

      Example: differentiation using the chain rule
      z = 2x + 1, so dz/dx = 2
      y = z² = f(x), so dy/dz = 2z
      dy/dx = (dy/dz) · (dz/dx) = 2z · 2 = 4(2x + 1) = 8x + 4 = f′(x)
      16 / 64
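
This is exactly the rule that reverse-mode automatic differentiation applies; a hedged PyTorch check of the derivative at x = 3 (where 8 · 3 + 4 = 28):

import torch

x = torch.tensor(3.0, requires_grad=True)
z = 2 * x + 1            # dz/dx = 2
y = z ** 2               # dy/dz = 2z
y.backward()             # chain rule: dy/dx = 2z * 2
print(x.grad)            # tensor(28.) = 8 * 3 + 4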

  17. Vector input
      Let f: R^m → R be a function and x ∈ R^m, y ∈ R be variables such that y = f(x).

      Partial derivative
      For a given x, how does an infinitesimal change of x_i impact y?
      ∂y/∂x_i
      i.e. each input x_j, j ≠ i, is considered as a constant.

      Gradient
      For a given x, how does an infinitesimal change of x impact y?
      ∇_x y = [ ∂y/∂x_1, ∂y/∂x_2, ... ]^⊤
      17 / 64

  18. Vector input
      Chain rule
      Let f: R^m → R^n and g: R^n → R be two functions and x ∈ R^m, z ∈ R^n, y ∈ R be variables such that:
      z = f(x), y = g(z)
      For a given x_i, how does an infinitesimal change of x_i impact y?
      ∂y/∂x_i = Σ_j (∂y/∂z_j) · (∂z_j/∂x_i)
      18 / 64

  19. Vector example
      z = Wx + b, i.e. z_j = Σ_i W_{j,i} x_i + b_j, so ∂z_j/∂x_i = W_{j,i}
      y = Σ_j z_j, so ∂y/∂z_j = 1
      ∂y/∂x_i = Σ_j (∂y/∂z_j) · (∂z_j/∂x_i) = Σ_j 1 · W_{j,i}
      19 / 64
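
A small NumPy check of this result: for y = Σ_j z_j with z = Wx + b, the gradient ∇_x y should equal the column sums Σ_j W_{j,i}. The sizes (3 × 4) and the finite-difference tolerance are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=4)

def y(x):
    return (W @ x + b).sum()             # y = sum_j z_j

analytic = W.sum(axis=0)                 # dy/dx_i = sum_j W[j, i]

eps = 1e-6                               # finite-difference check
numeric = np.array([(y(x + eps * e) - y(x)) / eps for e in np.eye(4)])
print(np.allclose(analytic, numeric, atol=1e-4))    # True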

  20. Vector example
      z^{(1)} = ...x...,  z^{(2)} = ...z^{(1)}...,  y = ...z^{(2)}...
      ∂y/∂x_i = Σ_k (∂y/∂z^{(2)}_k) · (∂z^{(2)}_k/∂x_i)
              = Σ_k (∂y/∂z^{(2)}_k) · Σ_j (∂z^{(2)}_k/∂z^{(1)}_j) · (∂z^{(1)}_j/∂x_i)
      ⇒ It is starting to get annoying!
      20 / 64

  21. Jacobian
      Let f: R^m → R^n be a function and x ∈ R^m, y ∈ R^n be variables such that y = f(x).

      Gradient
      For a given x, how does an infinitesimal change of x impact y_j?
      ∇_x y_j = [ ∂y_j/∂x_1, ∂y_j/∂x_2, ... ]^⊤

      Jacobian
      For a given x, how does an infinitesimal change of x impact y?
      J_x y = [ ∂y_1/∂x_1  ∂y_1/∂x_2  ...
                ∂y_2/∂x_1  ∂y_2/∂x_2  ...
                ...        ...        ... ]
      21 / 64

  22. Chain rule using the Jacobian notation
      Let f: R^m → R^n and g: R^n → R be two functions and x ∈ R^m, z ∈ R^n, y ∈ R be variables such that:
      z = f(x), y = g(z)

      Partial notation
      ∂y/∂x_i = Σ_j (∂y/∂z_j) · (∂z_j/∂x_i)

      Gradient + Jacobian notation
      Let ⟨·, ·⟩ be the dot product operation:
      ∇_x y = ⟨J_x z, ∇_z y⟩
      with ∇_x y ∈ R^m, J_x z ∈ R^{n×m}, ∇_z y ∈ R^n; the contraction is over the z dimension, i.e. ∇_x y = (J_x z)^⊤ ∇_z y.
      22 / 64

  23. Forward and backward passes
      Forward pass
      z^{(1)} = f^{(1)}(x; θ^{(1)})
      z^{(2)} = f^{(2)}(z^{(1)}; θ^{(2)})
      z^{(3)} = f^{(3)}(z^{(2)}; θ^{(3)})
      z^{(4)} = f^{(4)}(z^{(3)}; θ^{(4)})
      y = f^{(5)}(z^{(4)}; θ^{(5)})

      Backward pass
      ∇_{z^{(4)}} y and ∇_{θ^{(5)}} y
      ∇_{z^{(3)}} y = ⟨J_{z^{(3)}} z^{(4)}, ∇_{z^{(4)}} y⟩    ∇_{θ^{(4)}} y = ⟨J_{θ^{(4)}} z^{(4)}, ∇_{z^{(4)}} y⟩
      ∇_{z^{(2)}} y = ⟨J_{z^{(2)}} z^{(3)}, ∇_{z^{(3)}} y⟩    ∇_{θ^{(3)}} y = ⟨J_{θ^{(3)}} z^{(3)}, ∇_{z^{(3)}} y⟩
      ∇_{z^{(1)}} y = ⟨J_{z^{(1)}} z^{(2)}, ∇_{z^{(2)}} y⟩    ∇_{θ^{(2)}} y = ⟨J_{θ^{(2)}} z^{(2)}, ∇_{z^{(2)}} y⟩
      ∇_{θ^{(1)}} y = ⟨J_{θ^{(1)}} z^{(1)}, ∇_{z^{(1)}} y⟩
      23 / 64
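
A NumPy sketch of this pattern for a small two-layer computation y = sum(tanh(Wx + b)): gradients are propagated backwards with vector-Jacobian products, writing each ⟨J, ∇⟩ explicitly as Jᵀ∇. The sizes and the tanh non-linearity are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(1)
W, b, x = rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=4)

# Forward pass
z = W @ x + b                      # z = f(x; W, b)
h = np.tanh(z)                     # elementwise non-linearity
y = h.sum()                        # scalar output

# Backward pass: successive vector-Jacobian products
grad_h = np.ones_like(h)                      # dy/dh_j = 1
grad_z = (1 - np.tanh(z) ** 2) * grad_h       # J_z h is diagonal
grad_x = W.T @ grad_z                         # J_x z = W
grad_W = np.outer(grad_z, x)                  # gradients w.r.t. the parameters
grad_b = grad_z

# Finite-difference check of grad_x
eps = 1e-6
numeric = np.array([(np.tanh(W @ (x + eps * e) + b).sum() - y) / eps for e in np.eye(4)])
print(np.allclose(grad_x, numeric, atol=1e-4))    # True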

  24. Computation Graph (CG) 1/2
      [Figure: fine-grained computation graph. x, W^{(1)}, b^{(1)} feed × and + nodes, then σ, producing z^{(1)}; z^{(1)}, W^{(2)}, b^{(2)} produce z^{(2)}; then Softmax, pick(y), log and negation produce L. Each node stores its gradient: ∇_{z^{(1)}} L, ∇_{z^{(2)}} L, ∇_{b^{(1)}} L, ..., ∇_L L.]
      z^{(1)} = σ(W^{(1)} x + b^{(1)})
      z^{(2)} = W^{(2)} z^{(1)} + b^{(2)}
      L = −log( exp(z^{(2)}_y) / Σ_{y′} exp(z^{(2)}_{y′}) )
      24 / 64

  25. Computation Graph (CG) 2/2
      [Figure: coarse-grained computation graph: x → Linear(W^{(1)}, b^{(1)}) → σ → Linear(W^{(2)}, b^{(2)}) → NLL(y) → L]
      z^{(1)} = σ(W^{(1)} x + b^{(1)})
      z^{(2)} = W^{(2)} z^{(1)} + b^{(2)}
      L = −log( exp(z^{(2)}_y) / Σ_{y′} exp(z^{(2)}_{y′}) )
      25 / 64
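
A minimal NumPy sketch of the NLL value computed by the last node of this graph, L = −log( exp(z_y) / Σ_{y′} exp(z_{y′}) ). Subtracting the maximum logit is a standard numerical-stability trick and is not part of the slide.

import numpy as np

def nll(z, y):
    """Negative log-likelihood of gold class y given logits z (z^(2) above)."""
    z = z - z.max()                    # stability shift; does not change the value
    return -(z[y] - np.log(np.exp(z).sum()))

print(nll(np.array([1.0, 2.0, 0.5]), y=1))    # ≈ 0.464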

  26. Computation Graph (CG) implementation
      CG construction / Eager forward pass
      The computation graph is built in topological order (∼ the execution order of the operations):
      ◮ x, z^{(1)}, z^{(2)}, ..., L: Expression nodes
      ◮ W^{(1)}, b^{(1)}, ...: Parameter nodes

      Expression node
      ◮ Values
      ◮ Gradient
      ◮ Backward operation
      ◮ Backpointer(s) to antecedents

      Parameter node
      ◮ Persistent values
      ◮ Gradient

      The backward operation and backpointer(s) are null for input operations.
      26 / 64

  27. Eager forward pass example
      Projection operation z = Wx + b:
      function Linear(x, W, b)
          ⊲ Create node
          z = ExpressionNode()
          ⊲ Compute forward value
          z.value = Wx + b
          ⊲ Set backward operation
          z.d = d_linear
          ⊲ Set backpointers
          z.backptrs = [x, W, b]
          return z

      Non-linear activation function z′ = relu(z):
      function relu(z)
          ⊲ Create node
          z′ = ExpressionNode()
          ⊲ Compute forward value
          z′.value = [ max(0, z_1), max(0, z_2), ... ]
          ⊲ Set backward operation
          z′.d = d_relu
          ⊲ Set backpointers
          z′.backptrs = [z]
          return z′
      27 / 64
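
Below is a hedged, self-contained Python version of this eager construction: an ExpressionNode storing a value, a gradient, a backward operation and backpointers, plus linear and relu builders mirroring the pseudocode. The d_linear and d_relu backward operations are assumptions about how such a library would be completed (the slide only shows the forward side), and for brevity parameters are created with the same node class as inputs.

import numpy as np

class ExpressionNode:
    """Value + gradient + backward operation + backpointers, as described on slide 26."""
    def __init__(self):
        self.value = None
        self.grad = None
        self.d = None            # backward operation (None for inputs and parameters)
        self.backptrs = []       # antecedent nodes

def input_node(value):
    n = ExpressionNode()
    n.value = np.asarray(value, dtype=float)
    return n

def linear(x, W, b):
    """z = Wx + b (the Linear operation of the pseudocode)."""
    z = ExpressionNode()
    z.value = W.value @ x.value + b.value
    z.d = d_linear
    z.backptrs = [x, W, b]
    return z

def d_linear(z):
    x, W, b = z.backptrs
    x.grad = W.value.T @ z.grad              # <J_x z, grad_z> = W^T grad_z
    W.grad = np.outer(z.grad, x.value)
    b.grad = z.grad

def relu(z):
    out = ExpressionNode()
    out.value = np.maximum(0.0, z.value)
    out.d = d_relu
    out.backptrs = [z]
    return out

def d_relu(out):
    (z,) = out.backptrs
    z.grad = (z.value > 0) * out.grad

# Usage: eager forward pass; a full library would then call the d operations
# in reverse topological order, accumulating gradients at shared nodes.
x = input_node([1.0, -2.0])
W = input_node([[0.5, 1.0], [2.0, -1.0]])
b = input_node([0.0, 0.1])
h = relu(linear(x, W, b))
print(h.value)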
