Training Neural Nets
COMPSCI 371D — Machine Learning
Outline

1 The Softmax Simplex
2 Loss and Risk
3 Back-Propagation
4 Stochastic Gradient Descent
5 Regularization
6 Network Depth and Batch Normalization
7 Experiments with SGD
The Softmax Simplex

• Neural-net classifier: $\hat{y} = h(x) : \mathbb{R}^d \to Y$
• The last layer of a neural net used for classification is a soft-max layer: $p = \sigma(z) = \frac{\exp(z)}{\mathbf{1}^T \exp(z)}$
• The net is $p = f(x, w) : X \to P$
• The classifier is $\hat{y} = h(x) = \arg\max p = \arg\max f(x, w)$
• $P$ is the set of all nonnegative real-valued vectors $p \in \mathbb{R}^e$ whose entries add up to 1 (with $e = |Y|$):
  $P \stackrel{\text{def}}{=} \{ p \in \mathbb{R}^e : p \geq 0 \text{ and } \sum_{i=1}^{e} p_i = 1 \}$
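A minimal NumPy sketch of the softmax map and the resulting classifier. This is not from the slides; the shift by max(z) is a standard numerical-stability trick that leaves $\sigma(z)$ unchanged.

```python
import numpy as np

def softmax(z):
    """Map a score vector z in R^e to a point p on the simplex P."""
    ez = np.exp(z - np.max(z))   # subtracting max(z) avoids overflow, result is unchanged
    return ez / ez.sum()

def classify(z):
    """The classifier picks the entry of p (equivalently of z) with the largest value."""
    return int(np.argmax(softmax(z)))

z = np.array([1.0, 3.0, 0.5])
p = softmax(z)        # p >= 0 and p.sum() == 1 (up to rounding)
y_hat = classify(z)   # here y_hat == 1
```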
The Softmax Simplex (continued)

$P \stackrel{\text{def}}{=} \{ p \in \mathbb{R}^e : p \geq 0 \text{ and } \sum_{i=1}^{e} p_i = 1 \}$

[Figure: the simplex $P$ for $e = 2$ and $e = 3$, with its decision regions]

• Decision regions are polyhedral: $P_c = \{ p \in P : p_c \geq p_j \text{ for } j \neq c \}$ for $c = 1, \ldots, e$
• A network transforms images into points in $P$
Loss and Risk (Déjà Vu)

• Ideal loss would be 0-1 loss on the network output $\hat{y}$
• 0-1 loss is constant where it is differentiable!
• Not useful for computing a gradient
• Use the cross-entropy loss on the softmax output $p$ as a proxy loss: $\ell(y, p) = -\log p_y$
• Non-differentiability of ReLU or max-pooling is minor (pointwise), and typically ignored
• Risk, as usual: $L_T(w) = \frac{1}{N} \sum_{n=1}^{N} \ell_n(w)$ where $\ell_n(w) = \ell(y_n, f(x_n, w))$
• We need $\nabla L_T(w)$ and therefore $\nabla \ell_n(w)$
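A small sketch of the proxy loss and the empirical risk. The function `net(x, w)` is a hypothetical stand-in for the network's softmax output $p = f(x, w)$, not an actual course API.

```python
import numpy as np

def cross_entropy(y, p):
    """Proxy loss l(y, p): negative log of the probability assigned to the true class y."""
    return -np.log(p[y])

def empirical_risk(samples, net, w):
    """L_T(w): average cross-entropy over the training set T = [(x_1, y_1), ..., (x_N, y_N)]."""
    return np.mean([cross_entropy(y, net(x, w)) for x, y in samples])
```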
Back-Propagation

[Figure: a chain of layers $f^{(1)}, f^{(2)}, f^{(3)}$ with weights $w^{(1)}, w^{(2)}, w^{(3)}$ mapping $x_n = x^{(0)}$ to $x^{(1)}, x^{(2)}, x^{(3)} = p$, followed by the loss $\ell_n = \ell(y_n, p)$]

• We need $\nabla L_T(w)$ and therefore $\nabla \ell_n(w) = \frac{\partial \ell_n}{\partial w}$
• Computations from $x$ to $\ell_n$ form a chain
• Apply the chain rule
• Every derivative of $\ell_n$ w.r.t. layers before $k$ goes through $x^{(k)}$:
  $\frac{\partial \ell_n}{\partial w^{(k)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial w^{(k)}}$
  $\frac{\partial \ell_n}{\partial x^{(k-1)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial x^{(k-1)}}$ (recursion!)
• Start: $\frac{\partial \ell_n}{\partial x^{(K)}} = \frac{\partial \ell}{\partial p}$
Back-Propagation: Local Jacobians

[Figure: the same chain of layers as before]

• Local computations at layer $k$: $\frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial x^{(k)}}{\partial x^{(k-1)}}$
• Partial derivatives of $f^{(k)}$ with respect to the layer weights and the input to the layer
• Local Jacobian matrices, can compute by knowing what the layer does
• The start of the process can be computed from knowing the loss function: $\frac{\partial \ell_n}{\partial x^{(K)}} = \frac{\partial \ell}{\partial p}$
• Another local Jacobian
• The rest is going recursively from output to input, one layer at a time, accumulating the $\frac{\partial \ell_n}{\partial w^{(k)}}$ into a vector $\frac{\partial \ell_n}{\partial w}$
Back-Propagation: The Forward Pass

[Figure: the same chain of layers as before]

• All local Jacobians, $\frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial x^{(k)}}{\partial x^{(k-1)}}$, are computed numerically for the current values of the weights $w^{(k)}$ and layer inputs $x^{(k-1)}$
• Therefore, we need to know $x^{(k-1)}$ for training sample $n$ and for all $k$
• This is achieved by a forward pass through the network: run the network on input $x_n$ and store $x^{(0)} = x_n, x^{(1)}, \ldots$
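A sketch of the forward pass, assuming the network is given as a list of layer functions `layers[k]` with matching weights `w[k]`. These names are illustrative, not a prescribed interface.

```python
def forward_pass(layers, w, x_n):
    """Run the net on x_n and store every intermediate output x^(0), ..., x^(K)."""
    xs = [x_n]                        # x^(0) = x_n
    for f_k, w_k in zip(layers, w):
        xs.append(f_k(xs[-1], w_k))   # x^(k) = f^(k)(x^(k-1), w^(k))
    return xs                         # kept so the local Jacobians can be evaluated later
```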
Back-Propagation Spelled Out for K = 3

Recursions: $\frac{\partial \ell_n}{\partial w^{(k)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial \ell_n}{\partial x^{(k-1)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial x^{(k-1)}}$

After the forward pass:
$\frac{\partial \ell_n}{\partial x^{(3)}} = \frac{\partial \ell}{\partial p}$
$\frac{\partial \ell_n}{\partial w^{(3)}} = \frac{\partial \ell_n}{\partial x^{(3)}} \frac{\partial x^{(3)}}{\partial w^{(3)}}$
$\frac{\partial \ell_n}{\partial x^{(2)}} = \frac{\partial \ell_n}{\partial x^{(3)}} \frac{\partial x^{(3)}}{\partial x^{(2)}}$
$\frac{\partial \ell_n}{\partial w^{(2)}} = \frac{\partial \ell_n}{\partial x^{(2)}} \frac{\partial x^{(2)}}{\partial w^{(2)}}$
$\frac{\partial \ell_n}{\partial x^{(1)}} = \frac{\partial \ell_n}{\partial x^{(2)}} \frac{\partial x^{(2)}}{\partial x^{(1)}}$
$\frac{\partial \ell_n}{\partial w^{(1)}} = \frac{\partial \ell_n}{\partial x^{(1)}} \frac{\partial x^{(1)}}{\partial w^{(1)}}$
$\left[ \frac{\partial \ell_n}{\partial x^{(0)}} = \frac{\partial \ell_n}{\partial x^{(1)}} \frac{\partial x^{(1)}}{\partial x^{(0)}} \right]$

Stack the results into $\frac{\partial \ell_n}{\partial w} = \left[ \frac{\partial \ell_n}{\partial w^{(1)}}, \frac{\partial \ell_n}{\partial w^{(2)}}, \frac{\partial \ell_n}{\partial w^{(3)}} \right]$

(The Jacobians $\frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial x^{(k)}}{\partial x^{(k-1)}}$ are local; the $\frac{\partial \ell_n}{\partial w^{(k)}}$ are what we want eventually)
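The recursion above, written as a sketch in Python. The layer methods `jac_x` and `jac_w` (the local Jacobians) and the routine `d_loss_d_p` (the Jacobian of the loss w.r.t. $p$) are hypothetical placeholders for whatever each layer and the loss actually compute.

```python
def backward_pass(layers, w, xs, y_n, d_loss_d_p):
    """Back-propagation: accumulate dl_n/dw^(k) for every layer, output to input.
    xs = [x^(0), ..., x^(K)] are the activations stored by the forward pass."""
    g = d_loss_d_p(y_n, xs[-1])          # dl_n/dx^(K) = dl/dp  (a row vector)
    grads = [None] * len(layers)
    for k in reversed(range(len(layers))):
        # dl_n/dw^(k) = dl_n/dx^(k) * dx^(k)/dw^(k)
        grads[k] = g @ layers[k].jac_w(xs[k], w[k])
        # dl_n/dx^(k-1) = dl_n/dx^(k) * dx^(k)/dx^(k-1)   (the recursion)
        g = g @ layers[k].jac_x(xs[k], w[k])
    return grads                          # concatenate to obtain dl_n/dw
```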
Back-Propagation: Computing Local Jacobians

• We need $\frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial x^{(k)}}{\partial x^{(k-1)}}$
• Easier to make a "layer" as simple as possible
• $z = V x + b$ is one layer (Fully Connected (FC), affine part)
• $z = \rho(x)$ (ReLU) is another layer
• Softmax, max-pooling, convolutional, ...
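As a sketch of what keeping layers this simple might look like in code, here are illustrative FC and ReLU layer objects that know their own local Jacobian with respect to the input. The class and method names are assumptions, not a prescribed interface.

```python
import numpy as np

class FCLayer:
    """Affine layer z = V x + b, with weights kept as the pair (V, b) for clarity."""
    def forward(self, x, w):
        V, b = w
        return V @ x + b
    def jac_x(self, x, w):              # dz/dx = V
        V, _ = w
        return V

class ReLULayer:
    """Elementwise z = max(x, 0); no weights."""
    def forward(self, x, w=None):
        return np.maximum(x, 0.0)
    def jac_x(self, x, w=None):         # diagonal Jacobian: 1 where x > 0, else 0
        return np.diag((x > 0).astype(float))
```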
Back-Propagation: Local Jacobians for an FC Layer

$z = V x + b$

• $\frac{\partial z}{\partial x} = V$ (easy!)
• $\frac{\partial z}{\partial w}$: What is $\frac{\partial z}{\partial V}$? Three subscripts: $\frac{\partial z_i}{\partial v_{jk}}$. A 3D tensor?
• For a general package, tensors are the way to go
• Conceptually, it may be easier to vectorize everything:
  $V = \begin{bmatrix} v_{11} & v_{12} & v_{13} \\ v_{21} & v_{22} & v_{23} \end{bmatrix}$, $b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$ $\;\to\;$ $w = [v_{11}, v_{12}, v_{13}, v_{21}, v_{22}, v_{23}, b_1, b_2]^T$
• $\frac{\partial z}{\partial w}$ is a $2 \times 8$ matrix
• With $e$ outputs and $d$ inputs, an $e \times e(d+1)$ matrix
Back-Propagation: Jacobian $\frac{\partial z}{\partial w}$ for an FC Layer

$\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} w_7 \\ w_8 \end{bmatrix}$

• Don't be afraid to spell things out:
  $z_1 = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_7$
  $z_2 = w_4 x_1 + w_5 x_2 + w_6 x_3 + w_8$
  $\frac{\partial z}{\partial w} = \begin{bmatrix} \frac{\partial z_1}{\partial w_1} & \frac{\partial z_1}{\partial w_2} & \cdots & \frac{\partial z_1}{\partial w_8} \\ \frac{\partial z_2}{\partial w_1} & \frac{\partial z_2}{\partial w_2} & \cdots & \frac{\partial z_2}{\partial w_8} \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & x_3 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & x_1 & x_2 & x_3 & 0 & 1 \end{bmatrix}$
• Obvious pattern: repeat $x^T$, staggered, $e$ times
• Then append the $e \times e$ identity at the end
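The staggered pattern can be written compactly with a Kronecker product. A sketch, assuming `x` is the layer input and `e` the number of outputs:

```python
import numpy as np

def fc_jacobian_w(x, e):
    """dz/dw for z = V x + b with w = [rows of V, then b]:
    x^T repeated e times in a staggered block, then the e x e identity appended."""
    d = x.size
    return np.hstack([np.kron(np.eye(e), x.reshape(1, d)), np.eye(e)])  # e x e(d+1)

x = np.array([2.0, -1.0, 0.5])
J = fc_jacobian_w(x, e=2)   # 2 x 8 matrix, matching the example above
```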
Stochastic Gradient Descent: Training

• Local gradients are used in back-propagation
• So we now have $\nabla L_T(w)$
• $\hat{w} = \arg\min L_T(w)$
• $L_T(w)$ is (very) non-convex, so we look for local minima
• $w \in \mathbb{R}^m$ with $m$ very large: no Hessians
• Gradient descent
• Even so, every step calls back-propagation $N = |T|$ times
• Back-propagation computes $m$ derivatives $\nabla \ell_n(w)$
• Computational complexity is $\Omega(mN)$ per step
• Even gradient descent is way too expensive!
Stochastic Gradient Descent: No Line Search

• Line search is out of the question
• Fix some step multiplier $\alpha$, called the learning rate: $w_{t+1} = w_t - \alpha \nabla L_T(w_t)$
• How to pick $\alpha$? Cross-validation is too expensive
• Tradeoffs:
  • $\alpha$ too small: slow progress
  • $\alpha$ too big: jump over minima
• Frequent practice:
  • Start with $\alpha$ relatively large, and monitor $L_T(w)$
  • When $L_T(w)$ levels off, decrease $\alpha$
• Alternative: fixed decay schedule for $\alpha$
• Better (recent) option: change $\alpha$ adaptively (Adam, 2015)
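For the fixed-decay alternative, a minimal sketch of one common schedule (step decay). The drop interval and factor are illustrative values, not recommendations from the slides.

```python
def step_decay(alpha0, t, drop_every=30, factor=0.1):
    """Fixed decay schedule: start at alpha0 and multiply the learning rate
    by `factor` every `drop_every` iterations (or epochs)."""
    return alpha0 * (factor ** (t // drop_every))

# Example: alpha = 0.1 for t in [0, 30), 0.01 for t in [30, 60), and so on.
alpha_t = step_decay(alpha0=0.1, t=45)
```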
Stochastic Gradient Descent: Manual Adjustment of α

• Start with $\alpha$ relatively large, and monitor $L_T(w_t)$
• When $L_T(w_t)$ levels off, decrease $\alpha$
• Typical plots of $L_T(w_t)$ versus iteration index $t$:

[Figure: training risk $L_T(w_t)$ versus iteration index $t$]
Stochastic Gradient Descent: Batch Gradient Descent

• $\nabla L_T(w) = \frac{1}{N} \sum_{n=1}^{N} \nabla \ell_n(w)$
• Taking a macro-step $-\alpha \nabla L_T(w_t)$ is the same as taking the $N$ micro-steps $-\frac{\alpha}{N} \nabla \ell_1(w_t), \ldots, -\frac{\alpha}{N} \nabla \ell_N(w_t)$
• First compute all the $N$ steps at $w_t$, then take all the steps
• Thus, standard gradient descent is a batch method: compute the gradient at $w_t$ using the entire batch of data, then move
• Even with no line search, computing $N$ micro-steps is still expensive
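A sketch of one batch step, assuming a hypothetical `grad_loss(x, y, w)` that returns the back-propagated gradient $\nabla \ell_n(w)$ for one sample.

```python
def batch_step(w, samples, grad_loss, alpha):
    """One step of batch gradient descent: all N micro-gradients are evaluated
    at the same point w_t, and only then is the macro-step taken."""
    N = len(samples)
    g = sum(grad_loss(x, y, w) for x, y in samples) / N   # gradient of L_T at w_t
    return w - alpha * g
```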
Stochastic Gradient Descent: Stochastic Descent

• Taking a macro-step $-\alpha \nabla L_T(w_t)$ is the same as taking the $N$ micro-steps $-\frac{\alpha}{N} \nabla \ell_1(w_t), \ldots, -\frac{\alpha}{N} \nabla \ell_N(w_t)$
• First compute all the $N$ steps at $w_t$, then take all the steps
• Can we use this effort more effectively?
• Key observation: $-\nabla \ell_n(w)$ is a poor estimate of $-\nabla L_T(w)$, but an estimate all the same: micro-steps are correct on average!
• After each micro-step, we are on average in a better place
• How about computing a new micro-gradient after every micro-step?
• Now each micro-step gradient is evaluated at a point that is on average better (lower risk) than in the batch method
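For contrast with the batch step above, a sketch of one epoch of stochastic gradient descent, using the same hypothetical `grad_loss` but re-evaluating the micro-gradient after every micro-step. The random shuffling of samples is common practice and assumed here, not stated on this slide.

```python
import numpy as np

def sgd_epoch(w, samples, grad_loss, alpha, rng=None):
    """One pass over the training set: take a micro-step after every sample,
    each one computed at the current (on average better) point."""
    rng = np.random.default_rng() if rng is None else rng
    for i in rng.permutation(len(samples)):
        x, y = samples[i]
        w = w - alpha * grad_loss(x, y, w)
    return w
```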