Does it make sense to use gradient descent?

Convex function
A function f : Rⁿ → R is convex ⇔ ∀ x₁, x₂ ∈ Rⁿ, ∀ t ∈ [0, 1],
  f(t x₁ + (1 − t) x₂) ≤ t f(x₁) + (1 − t) f(x₂)
If f is twice differentiable, this is equivalent to H = ∇²f(x) being positive semidefinite at every x ∈ Rⁿ, i.e. ∀ v ∈ Rⁿ, vᵀHv ≥ 0.
For a convex function f, all local minima are global minima. Our losses are lower bounded, so these minima exist.
Under mild conditions, gradient descent and stochastic gradient descent converge, typically with step sizes satisfying Σ_t ε_t = ∞, Σ_t ε_t² < ∞ (cf. lectures on convex optimization).
Does it make sense to use gradient descent?

Linear regression with L2 loss is convex
Indeed,
• Given x_i, y_i, L(w) = ½(wᵀx_i − y_i)² is convex:
  ∇_w L = (wᵀx_i − y_i) x_i
  ∇²_w L = x_i x_iᵀ
  and ∀ x ∈ Rⁿ, xᵀ x_i x_iᵀ x = (x_iᵀx)² ≥ 0
• a non-negative weighted sum of convex functions is convex
Linear regression, synthesis

Linear regression
• samples (x_i, y_i), y_i ∈ R
• extend x_i by adding a constant dimension equal to 1, which accounts for the bias
• linear model ŷ = wᵀx
• L2 loss L(ŷ, y) = ½||ŷ − y||²
• by gradient descent:
  ∇_w L(w, x_i, y_i) = (∂L/∂ŷ)(∂ŷ/∂w) = −(y_i − ŷ_i) x_i
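To make the update above concrete, here is a minimal NumPy sketch of stochastic gradient descent for linear regression with the L2 loss; the synthetic data, learning rate and number of epochs are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 3))          # 100 samples, 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3       # synthetic targets

Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append the constant 1 for the bias
w = np.zeros(Xb.shape[1])
eps = 0.05                                     # learning rate

for epoch in range(50):
    for i in rng.permutation(len(y)):
        y_hat = w @ Xb[i]
        grad = -(y[i] - y_hat) * Xb[i]         # ∇_w L = -(y_i - ŷ_i) x_i
        w -= eps * grad
```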
Linear classification, synthesis

Linear binary classification (logistic regression)
• samples (x_i, y_i), y_i ∈ {0, 1}
• extend x_i by adding a constant dimension equal to 1, which accounts for the bias
• linear model wᵀx
• sigmoid transfer function: ŷ_i = σ(wᵀx_i)
• σ(x) = 1/(1 + exp(−x)), σ(x) ∈ (0, 1)
• d/dx σ(x) = σ(x)(1 − σ(x))
• cross-entropy loss L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
• by gradient descent:
  ∇_w L(w, x_i, y_i) = (∂L/∂ŷ)(∂ŷ/∂w) = −(y_i − ŷ_i) x_i
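A possible sketch of the corresponding update for logistic regression; the helper names and the sample values are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce_grad(w, x, y):
    """Gradient of the cross-entropy loss for one sample (x already has the bias 1 appended)."""
    y_hat = sigmoid(w @ x)
    return -(y - y_hat) * x          # same form as for linear regression

# one SGD step with made-up data
w = np.zeros(3)
x = np.array([0.5, -1.2, 1.0])       # last component is the constant 1
y = 1.0
w -= 0.1 * ce_grad(w, x, y)
```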
Linear classification, synthesis

Logistic regression is convex
Indeed,
• Given x_i, y_i = 1: L₁(w) = −log σ(wᵀx_i) = log(1 + exp(−wᵀx_i))
  ∇_w L₁ = −(1 − σ(wᵀx_i)) x_i
  ∇²_w L₁ = σ(wᵀx_i)(1 − σ(wᵀx_i)) x_i x_iᵀ, with σ(wᵀx_i)(1 − σ(wᵀx_i)) > 0
• Given x_i, y_i = 0: L₂(w) = −log(1 − σ(wᵀx_i))
  ∇_w L₂ = σ(wᵀx_i) x_i
  ∇²_w L₂ = σ(wᵀx_i)(1 − σ(wᵀx_i)) x_i x_iᵀ
• a non-negative weighted sum of convex functions is convex
Why L2 loss for linear classification with SGD is bad

Compute the gradient to see why…
• Take the L2 loss L(ŷ, y) = ½||ŷ − y||²
• Take the "linear" model: ŷ_i = σ(wᵀx_i)
• Check that d/dx σ(x) = σ(x)(1 − σ(x))
• Compute the gradient wrt w:
  ∇_w L(w, x_i, y_i) = (∂L/∂ŷ)(∂ŷ/∂w) = −(y_i − ŷ_i) σ(wᵀx_i)(1 − σ(wᵀx_i)) x_i
• If x_i is strongly misclassified (e.g. y_i = 1, wᵀx_i very negative),
  then σ(wᵀx_i)(1 − σ(wᵀx_i)) ≈ 0, i.e. ∇_w L(w, x_i, y_i) ≈ 0
  ⇒ the step is very small while the sample is still misclassified (see the numerical check below)
With a cross-entropy loss, ∇_w L(w, x_i, y_i) is proportional to the error.
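A small numerical check of this effect, under a hypothetical strongly misclassified sample (all values are made up): the L2 gradient is tiny while the cross-entropy gradient keeps full strength.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 1.0])        # hypothetical sample, bias included
y = 1.0
w = np.array([-4.0, -4.0])      # weights that strongly misclassify x: w @ x = -8

y_hat = sigmoid(w @ x)                              # ≈ 3e-4
grad_l2 = -(y - y_hat) * y_hat * (1 - y_hat) * x    # ≈ 3e-4 * x -> almost no update
grad_ce = -(y - y_hat) * x                          # ≈ 1.0  * x -> full-strength update
print(np.linalg.norm(grad_l2), np.linalg.norm(grad_ce))
```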
Linear classification, synthesis

Linear multiclass classification
• samples (x_i, y_i), labels y_i ∈ {0, …, k − 1}
• extend x_i by adding a constant dimension equal to 1, which accounts for the bias
• one linear model per class: w_jᵀx
• softmax transfer function: P(y = j | x) = ŷ_j = exp(w_jᵀx) / Σ_k exp(w_kᵀx)
• generalization of the sigmoid for a vectorial output
• cross-entropy loss L(ŷ, y) = −log ŷ_y
• by gradient descent:
  ∇_{w_j} L(w, x, y) = Σ_k (∂L/∂ŷ_k)(∂ŷ_k/∂w_j) = −(δ_{j,y} − ŷ_j) x
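A possible NumPy sketch of the softmax model and its gradient; the class count, feature sizes and the max-subtraction trick for numerical stability are choices of this example, not prescribed by the slide.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()            # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def grad_wj(W, x, y):
    """Gradient of -log ŷ_y wrt each row w_j of W (x already has the bias 1 appended)."""
    y_hat = softmax(W @ x)                    # shape (k,)
    delta = np.zeros_like(y_hat)
    delta[y] = 1.0
    return -np.outer(delta - y_hat, x)        # row j is -(δ_{j,y} - ŷ_j) x

W = np.zeros((3, 4))                          # 3 classes, 3 features + bias (made-up sizes)
x = np.array([0.2, -0.7, 1.3, 1.0])
W -= 0.1 * grad_wj(W, x, y=2)                 # one SGD step
```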
Perceptron and linear separability

Perceptrons perform linear separation in a predefined, fixed feature space.
The XOR: xor(x₁, x₂) = x₁x̄₂ + x̄₁x₂

x₁  x₂  xor(x₁, x₂)
0   0   0
0   1   1
1   0   1
1   1   0

[Figure: the four XOR samples in the (x₁, x₂) plane, which are not linearly separable]
Can we learn the φ_j(x)?
Radial basis functions (RBF)

RBF (Broomhead, 1988)
• RBF kernel: φ₀(x) = 1, φ_j(x) = exp(−||x − µ_j||² / (2σ_j²))
• for regression (L2 loss) or classification (cross-entropy loss)
• e.g. for regression: ŷ(x) = wᵀφ(x), L(w, x_i, y_i) = ||y_i − wᵀφ(x_i)||²
• What about the centers and variances? [Schwenker, 2001]
  • place them uniformly, randomly, or by vector quantization (k-means, GNG [Fritzke, 1994])
  • two phases: fix the centers/variances, fit the weights
  • three phases: fix the centers/variances, fit the weights, fit everything (∇_µ L, ∇_σ L, ∇_w L)
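As a sketch of the "two phases" scheme (fixed centers, then fit the weights), here is a minimal RBF regression on a noisy sinus; placing one kernel per sample and the value of σ are choices of this example.

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """φ(x): a constant 1 plus one Gaussian bump per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), phi])

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.uniform(-0.1, 0.1, size=30)   # noisy sinus

centers = X.copy()                            # one kernel per sample, kept fixed
Phi = rbf_features(X, centers, sigma=0.1)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # fit the weights under the L2 loss
```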
Radial basis functions (RBF)

RBF are universal approximators [Park, Sandberg (1991)]
Denote by S the family of functions based on RBF in R^d:
  S = { g : R^d → R, g(x) = Σ_i w_i φ_i(x), w ∈ R^N }
Then S is dense in L^p(R^d) for every p ∈ [1, ∞).
Actually, the theorem applies to a larger class of functions φ_i.
Feedforward neural networks (or MLP [Rumelhart, 1986])

[Figure: a feedforward network with input layer 0, hidden layers 1 … L−1 and output layer L, with weights w_ij^(l)]
For every unit i of layer l (with y^(0) = x):
  a_i^(l) = Σ_j w_ij^(l) y_j^(l−1)
  y_i^(l) = g(a_i^(l)) for the hidden layers,   y_i^(L) = f(a_i^(L)) for the output layer
Named MLP for historical reasons. Should be called FNN.
Feedforward neural networks

[Same network diagram and layer equations as the previous slide]
Architecture
• Depth: number of layers, not counting the input (deep = large depth)
• Width: number of units per layer
• weights and biases for each unit
• hidden transfer function g, output transfer function f
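A minimal forward pass for such a network, assuming a ReLU hidden transfer g and an identity output transfer f; the layer sizes and the random weights are made up.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def forward(x, weights, biases, f=lambda a: a, g=relu):
    """Forward pass of an FNN: hidden layers use g, the output layer uses f."""
    y = x
    for W, b in zip(weights[:-1], biases[:-1]):
        y = g(W @ y + b)                 # a^(l) = W y^(l-1) + b,  y^(l) = g(a^(l))
    return f(weights[-1] @ y + biases[-1])

# a 3-4-2 network with made-up weights
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (4, 3)), rng.normal(0, 0.1, (2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward(np.array([1.0, -0.5, 2.0]), weights, biases))
```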
Feedforward neural networks

Architecture: hidden transfer function
• Historically, the hidden transfer g was taken as a sigmoid or tanh.
• Now, mainly Rectified Linear Units (ReLU) or similar: g(x) = max(x, 0)
ReLU is more favorable to the gradient flow than the saturating functions [Krizhevsky(2012), Nair(2010), Jarrett(2009)]
[Figure: sigmoid, tanh and ReLU transfer functions]
Feedforward neural networks

Architecture: output transfer function and loss
• for regression:
  • linear f(x) = x
  • L2 loss L(ŷ, y) = ||y − ŷ||²
• for multiclass classification:
  • softmax ŷ_j = e^{a_j} / Σ_k e^{a_k}
  • negative log-likelihood loss L(ŷ, y) = −log(ŷ_y)
FNN training: error backpropagation

Training by gradient descent
• initialize the weights and biases w₀
• at every iteration, compute: w ← w − ε ∇_w J
How do we get the partial derivatives ∂J/∂w_i?
Fundamentally, use the chain rule within the computational graph linking any variable (inputs, weights, biases) to the output of the loss.
Backprop is usually attributed to [Rumelhart, 1986] but [Werbos, 1981] already introduced the idea.
Computing partial derivatives

Computational graph
A computational graph is a directed graph:
• nodes: variables (weights, inputs, outputs, targets, …)
• edges: operations (ReLU, softmax, wᵀx + b, …, losses, …)
We only need to know, for each operation:
• the partial derivatives wrt its parameters
• the partial derivatives wrt its inputs
Computing partial derivatives ∂J/∂w_i

The chain rule: single path
Suppose there is a single path, e.g. from x_i to u₃. Applying the chain rule:
  ∂u₃/∂x_i = (∂u₃/∂u₂)·(∂u₂/∂u₁)·(∂u₁/∂x_i)
Example with u₁ = w x_i + b, u₂ = y_i − u₁, u₃ = u₂²:
  ∂/∂x_i (y_i − w x_i − b)² = 2u₂ · (−1) · w = −2w(y_i − w x_i − b)
Computing partial derivatives ∂J/∂w_i

The chain rule: multiple paths
Sum over all the paths (e.g. u₃ = u₁ + u₂ with u₁ = w₁ x_i, u₂ = w₂ x_i):
  ∂u₃/∂x_i = Σ_{j∈{1,2}} (∂u₃/∂u_j)(∂u_j/∂x_i) = 1·w₁ + 1·w₂
But it is computationally expensive

There are a lot of paths…
[Figure: a small computational graph with 4 paths from x_i to u₅]
Let us be more efficient: forward-mode differentiation

Forward differentiation
Idea: to compute ∂u₅/∂x_i, propagate ∂·/∂x_i forward.
E.g. for a product node u₁ = z₁ z₂ with z₁ = x_i and z₂ = w₁:
  ∂u₁/∂x_i = (∂u₁/∂z₁)(∂z₁/∂x_i) + (∂u₁/∂z₂)(∂z₂/∂x_i) = z₂·1 + z₁·0 = z₂ = w₁
But how to compute ∂u₅/∂w₁? Well, propagate ∂·/∂w₁. And ∂u₅/∂w₂? Propagate again… or…
Griewank (2010), Who Invented the Reverse Mode of Differentiation?
http://colah.github.io/posts/2015-08-Backprop/
Let us be even more efficient: reverse-mode differentiation

Reverse differentiation
Idea: to compute ∂u₅/∂x_i, backpropagate ∂u₅/∂·.
E.g.
  ∂u₅/∂u₁ = (∂u₅/∂u₃)(∂u₃/∂u₁) + (∂u₅/∂u₄)(∂u₄/∂u₁) = 1·w₃ + 1·w₆
We get ∂u₅/∂x_i, but also ∂u₅/∂w₁, ∂u₅/∂w₂, … all in a single pass!
Griewank (2010), Who Invented the Reverse Mode of Differentiation?
http://colah.github.io/posts/2015-08-Backprop/
FNN training: error backpropagation

In neural networks, reverse-mode differentiation is called error backpropagation.
Training in two phases
• Evaluation of the output: forward propagation
• Evaluation of the gradient: reverse-mode differentiation
Careful: reverse-mode differentiation reuses the activations computed during the forward propagation.
Libraries like Theano augment the computational graph with nodes that compute the gradient numerically by reverse-mode differentiation.
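A sketch of the two phases on a tiny one-hidden-layer network (ReLU hidden units, linear output, L2 loss); the sizes, data and learning rate are made up. The point is that the backward pass reuses the activations cached during the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0])
W1, b1 = rng.normal(0, 0.5, (4, 3)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (1, 4)), np.zeros(1)

# forward propagation (activations are cached for the backward pass)
a1 = W1 @ x + b1
h = np.maximum(a1, 0.0)                    # ReLU hidden layer
y_hat = W2 @ h + b2                        # linear output
loss = 0.5 * np.sum((y_hat - y) ** 2)

# reverse-mode differentiation (backpropagation)
d_yhat = y_hat - y                         # ∂J/∂ŷ
dW2, db2 = np.outer(d_yhat, h), d_yhat
d_h = W2.T @ d_yhat
d_a1 = d_h * (a1 > 0)                      # ReLU gate uses the cached a1
dW1, db1 = np.outer(d_a1, x), d_a1

eps = 0.1
W1 -= eps * dW1; b1 -= eps * db1
W2 -= eps * dW2; b2 -= eps * db2
```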
Universal approximator

Any well-behaved function can be arbitrarily well approximated by a single-hidden-layer FNN.
Intuition
• Take a sigmoid transfer function f(x) = 1/(1 + exp(−α(x − b_i))): this is the hidden layer
• subtract two such activations to get Gaussian-like kernels
[Figure: the difference of two shifted sigmoids gives a localized bump]
• weight such subtractions, and you are back to the RBFs
But then, why deep networks?

Going deeper
• Single-hidden-layer FNNs are universal approximators, but the hidden layer can be arbitrarily large
• A deep network (large number of layers) builds high-level features by composing lower-level features
• A shallow network must learn these high-level features directly
• Image analogy:
  • first layers: extract oriented contours (e.g. Gabor filters)
  • second layers: learn corners by combining contours
  • next layers: build up more and more complex features
• Theoretical works compare the expressiveness of depth-d FNNs with depth-(d−1) FNNs
Learning deep architectures for AI, Bengio (2009), chap. 2; Benefits of depth in neural networks
And why ReLU?

Vanishing/exploding gradient [Hochreiter(1991), Bengio(1994)]
Consider u₂ = f(w u₁)
• Remember, when the gradient is "backpropagated", it involves
  ∂J/∂u₁ = (∂J/∂u₂)(∂u₂/∂u₁) = (∂J/∂u₂) · w · f′(w u₁)
• backpropagated through L layers, this yields factors of the form (w·f′)^L
• with f(x) = 1/(1 + e^{−x}), f′(x) ≤ 1/4 < 1
• if w·f′ ≠ 1, (w·f′)^L → 0 or ∞
• ⇒ the gradient vanishes or explodes
With ReLU, f′(x) ∈ {0, 1}. But you can get dead units.
But the ReLUs can die…

Why do they die?
If the input to a ReLU is negative, the gradient is 0, that's it… the unit is "forever" lost.
And then?
• Add a linear component for negative x: Leaky ReLU, Parametric ReLU [He(2015)]
• Exponential Linear Units [Clevert, Hochreiter(2016)]
[Figure: ReLU, Leaky ReLU and ELU transfer functions]
How to deal with the vanishing/exploding gradient

Preventing vanishing gradients by preserving the gradient flow
• Use ReLU, Leaky ReLU, PReLU, ELU to ensure a good flow of gradient
• Specific architectures:
  • ResNet (CNN): shortcut connections
  • LSTM (RNN): constant error carousel
Preventing exploding gradients
• Gradient clipping [Pascanu, 2013]: clip the norm of the gradient
[Figure: activations of a network of 50 layers of a single sigmoid unit σ(x), illustrating saturation]
Regularization

Like kids, FNNs can do a lot of things, but we must focus their expressiveness.
Chap. 7, [Bengio et al. (2016)]
Regularization

L2 regularization
Add an L2 penalty on the weights, α > 0:
  J(w) = L(w) + (α/2)||w||² = L(w) + (α/2) wᵀw
  ∇_w J = ∇_w L + α w
Example: RBF, 1 kernel per sample, N = 30, noisy sinus.
[Figure: fits and weight values w* for α = 0 and α = 2]
Chap. 7 of [Bengio et al. (2016)] for a geometrical interpretation.
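A minimal sketch of how the L2 penalty enters an SGD step; the gradient of the data term and the hyperparameter values are placeholders.

```python
import numpy as np

def sgd_step_l2(w, grad_L, eps=0.01, alpha=0.1):
    """One SGD step on J(w) = L(w) + (alpha/2)||w||^2: the penalty adds alpha*w to the gradient."""
    return w - eps * (grad_L + alpha * w)

w = np.array([2.0, -3.0, 0.5])          # current weights (made up)
grad_L = np.array([0.1, -0.2, 0.05])    # hypothetical gradient of the data term
w = sgd_step_l2(w, grad_L)
```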
Regularization

L2 regularization
In principle, we should not regularize the bias. Example:
  J(w) = (1/N) Σ_{i=1}^N || y_i − w₀ − Σ_{k≥1} w_k x_{i,k} ||²
  ∇_{w₀} J = 0 ⇒ w₀ = (1/N) Σ_i y_i − Σ_{k≥1} w_k (1/N) Σ_i x_{i,k}
e.g. if your data are centered, i.e. (1/N) Σ_i x_{i,k} = 0, then w₀ = (1/N) Σ_i y_i.
Regularizing the bias might lead to underfitting.
Regularization

L1 regularization promotes sparsity
Add an L1 penalty on the weights:
  J(w) = L(w) + α||w||₁ = L(w) + α Σ_k |w_k|
  ∇_w J = ∇_w L + α sign(w)   (component-wise)
Example: RBF, 1 kernel per sample, N = 30, noisy sinus, α = 0.003.
[Figure: fits and weight values w* for α = 0 and α = 0.003; most weights are driven to zero]
Regularization

Dropout regularization [Srivastava(2014)]
Idea 1: prevent co-adaptation. A unit should be good by itself, not because others are doing part of the job.
Idea 2: combine an exponential number of networks (ensemble).
How:
• For each minibatch, keep hidden and input activations with probability p (p = 0.5 for hidden units, p = 0.8 for inputs). At test time, multiply all the activations by p.
• "Inverted" dropout: scale by 1/p at training time; no scaling at test time (see the sketch below).
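A sketch of the "inverted" variant, assuming activations stored in a NumPy array; the keep probability p = 0.5 follows the slide, the rest is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(h, p=0.5, train=True):
    """Inverted dropout: keep each activation with probability p, rescale by 1/p at train time."""
    if not train:
        return h                        # no scaling needed at test time
    mask = rng.random(h.shape) < p      # which units are kept for this minibatch
    return h * mask / p

h = np.array([0.3, 1.2, -0.7, 2.0, 0.1])
print(inverted_dropout(h, p=0.5, train=True))
print(inverted_dropout(h, p=0.5, train=False))
```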
Regularization (dropout), Srivastava(2014), Hinton(2012)

[Figure: a network with and without dropped units]
Usually applied after fully connected layers (p = 0.5) and on the input layer.
Can be interpreted as training/averaging all the possible subnetworks.
Regularization

Split your data into three sets:
• training set: for training
• validation set: for choosing the hyperparameters
• test set: for estimating the generalization error
Early stopping
Idea: monitor the error on the validation set; it is typically U-shaped. Keep the model with the lowest validation error seen during training.
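A possible skeleton of early stopping; `model`, `train_step` and `val_error` are placeholders for your own training code, and the patience criterion is one common choice, not mandated by the slide.

```python
import copy

def train_with_early_stopping(model, train_step, val_error, n_epochs=200, patience=10):
    """Keep the parameters with the lowest validation error; stop after `patience` epochs without improvement."""
    best_err, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    for epoch in range(n_epochs):
        train_step(model)              # one pass over the training set
        err = val_error(model)         # error on the validation set
        if err < best_err:
            best_err, best_model, since_best = err, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_model
```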
Training by some form of gradient descent

  w(t + 1) ← w(t) − ε ∇_w J(w_t)
* Chap. 8, [Bengio et al. (2016)]
* A. Karpathy: http://cs231n.github.io/neural-networks-3/
But wait…

Does it make sense to apply gradient descent to neural networks?
• we cannot get better than a local minimum?!
• and neural networks lead to non-convex optimization problems, i.e. a lot of local minima (think about the symmetries)
But empirically, most local minima are close to the global minimum with large/deep networks.
Choromanska, 2015: The Loss Surfaces of Multilayer Networks
Dauphin, 2014: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Pascanu, 2014: On the saddle point problem for non-convex optimization
Identifying the critical points

The Hessian matrix
• Matrix of second-order derivatives; informs on the local curvature:
  H(θ) = ∇²_θ J, with entries H_{ij} = ∂²J / ∂θ_i ∂θ_j, i, j = 1, …, n
• for a convex (twice differentiable) function: H is symmetric positive semidefinite
Identifying the type of critical points

Eigenvalues of H
• a critical point is where ∇_θ J = 0
• if all eigenvalues of H are > 0: local minimum
• if all eigenvalues of H are < 0: local maximum
• if H has both positive and negative eigenvalues: saddle point
[Figure: surfaces of x² + y², −(x² + y²) and x² − y²]
Identifying the type of critical points

And if H is degenerate?
If H is degenerate (some eigenvalues = 0, det(H) = 0), we can have:
• a local maximum: f(x, y) = −(x⁴ + y⁴), H(x, y) = [[−12x², 0], [0, −12y²]]
• a local minimum: f(x, y) = x⁴ + y⁴, H(x, y) = [[12x², 0], [0, 12y²]]
• a saddle point: f(x, y) = x³ + y², H(x, y) = [[6x, 0], [0, 2]]
Local minima are not an issue with deep networks

Local minima and their loss
Experiment: one hidden layer, MNIST, trained with SGD. Converges mostly to local minima and some saddle points. The distribution of the test loss of the local minima tends to shrink.
Index α: fraction of negative eigenvalues of the Hessian.
Choromanska, 2015: The Loss Surfaces of Multilayer Networks
Saddle points seem to be the issue

Saddle points and their loss
Experiment: "small MLP", trained with saddle-free Newton (which converges to critical points). Newton's method was used to discover the critical points.
High-loss critical points are saddle points; low-loss critical points are local minima.
Index α: fraction of negative eigenvalues of the Hessian.
Dauphin, 2014: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Training: 1st order methods

  J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀)
  θ ← θ − ε ∇_θ J
Rationale: first-order approximation
  Ĵ(θ) = J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀)
  (θ − θ₀) = −ε ∇_θ J(θ₀) ⇒ Ĵ(θ) = J(θ₀) − ε ||∇_θ J(θ₀)||²
For small ||ε ∇_θ J(θ₀)||: J(θ₀ − ε ∇_θ J(θ₀)) ≤ J(θ₀)
Training, 1st order methods

Minibatch stochastic gradient descent
• start at θ₀
• for every minibatch:
  θ(t + 1) = θ(t) − ε ∇_θ ( (1/M) Σ_i J_i(θ) )
• M = 1: very noisy estimate, stochastic gradient descent
• M = N: true gradient, batch gradient descent
• (minibatch) SGD converges faster
• The trajectory may converge slowly or diverge if ε is not appropriate
Training, 1st order methods

Stochastic gradient descent: example
Take N = 30 samples with y = 3x + 2 + U(−0.1, 0.1).
Let us perform linear regression (ŷ = wx + b, L2 loss) with SGD.
[Figure: the 30 samples in the (x, y) plane]
Training, 1st order methods

Stochastic gradient descent: zigzag
ε = 0.005, b₀ = 10, w₀ = 5. Converges to w* = 2.9975, b* = 1.9882.
[Figure: zigzagging trajectory in the (b, w) plane, and log(J) vs iteration]
Training, 1st order methods

Momentum
Idea: let us damp the oscillations with a low-pass filter on ∇_θ.
• Start at θ₀, v = 0
• for every minibatch:
  v(t + 1) = α v(t) − ε ∇_θ J(θ(t))
  θ(t + 1) = θ(t) + v(t + 1)
Usually, α ≈ 0.9
Experiment on http://distill.pub/2017/momentum/
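A minimal sketch of the momentum update; the gradient value is made up and the starting point mimics (b₀, w₀) = (10, 5) from the slides' example.

```python
import numpy as np

def momentum_step(theta, v, grad, eps=0.005, alpha=0.9):
    """SGD with momentum: the velocity v low-pass filters the gradients."""
    v = alpha * v - eps * grad
    return theta + v, v

theta, v = np.array([10.0, 5.0]), np.zeros(2)   # e.g. (b0, w0)
grad = np.array([3.0, -1.0])                    # hypothetical minibatch gradient
theta, v = momentum_step(theta, v, grad)
```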
Training, 1st order methods

Stochastic gradient descent with momentum
ε = 0.005, α = 0.6, b₀ = 10, w₀ = 5. Converges to w* = 2.9933, b* = 1.9837.
[Figure: trajectory in the (b, w) plane, and log(J) vs iteration]
Advised: set α ∈ {0.5, 0.9, 0.99}
Training, 1st order methods

SGD without/with momentum
[Figure: log(J) vs iteration, without momentum (left) and with momentum (right)]
Training, 1st order methods

Nesterov momentum [Sutskever, PhD Thesis]
Idea: look ahead to potentially correct the update. Based on the Nesterov accelerated gradient.
• Start at θ₀, v = 0
• for every minibatch:
  θ̃(t + 1) = θ(t) + α v(t)
  v(t + 1) = α v(t) − ε ∇_θ J(θ̃(t + 1))
  θ(t + 1) = θ(t) + v(t + 1)
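A sketch of the Nesterov update, applied here to a hypothetical quadratic objective J(θ) = ½||θ||², so that the gradient at the look-ahead point is simply the point itself.

```python
import numpy as np

def nesterov_step(theta, v, grad_at, eps=0.005, alpha=0.8):
    """Nesterov momentum: evaluate the gradient at the look-ahead point theta + alpha*v."""
    lookahead = theta + alpha * v
    v = alpha * v - eps * grad_at(lookahead)
    return theta + v, v

theta, v = np.array([10.0, 5.0]), np.zeros(2)
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad_at=lambda t: t)   # ∇J(θ) = θ for this toy J
```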
Training, 1st order methods

SGD with Nesterov momentum
ε = 0.005, α = 0.8, b₀ = 10, w₀ = 5. Converges to w* = 2.9914, b* = 1.9738.
[Figure: trajectory in the (b, w) plane, and log(J) vs iteration]
In this experiment, Nesterov momentum allowed a larger momentum coefficient: with α = 0.8, plain momentum oscillates strongly.
Training, 1st order methods

SGD / SGD with momentum / SGD with Nesterov momentum
[Figure: log(J) vs iteration for the three methods]
Training: 1st order methods with adaptive learning rates
Training: adapting the learning rate

Learning rate annealing
Some possible schedules:
• linear decay between ε₀ and ε_τ
• halve the learning rate when the validation error stops improving
Training: adapting the learning rate

Adagrad [Duchi, 2011]
• Accumulate the square of the gradients:
  r(t + 1) = r(t) + ∇_θ J(θ(t)) ⊙ ∇_θ J(θ(t))
• Scale the learning rates individually:
  θ(t + 1) = θ(t) − (ε / (δ + √r(t + 1))) ⊙ ∇_θ J(θ(t))
The √· is experimentally critical. δ ≈ [1e−8, 1e−4], for numerical stability.
Small gradients ⇒ bigger learning rate; big gradients ⇒ smaller learning rate.
Accumulating from the beginning is too aggressive: the learning rates decrease too fast.
Training: adapting the learning rate

RMSProp [Hinton, unpublished]
Idea: use an exponentially decaying accumulation of the gradients.
• Accumulate the square of the gradients:
  r(t + 1) = ρ r(t) + (1 − ρ) ∇_θ J(θ(t)) ⊙ ∇_θ J(θ(t))
• Scale the learning rates individually:
  θ(t + 1) = θ(t) − (ε / (δ + √r(t + 1))) ⊙ ∇_θ J(θ(t))
ρ ≈ 0.9.
And some others: Adadelta [Zeiler, 2012], Adam [Kingma, 2014], …
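Possible sketches of the two adaptive schemes side by side; the hyperparameter values follow the slides (δ small, ρ ≈ 0.9), the rest is illustrative.

```python
import numpy as np

def adagrad_step(theta, r, grad, eps=0.01, delta=1e-8):
    """Adagrad: accumulate squared gradients and scale each learning rate individually."""
    r = r + grad * grad
    return theta - (eps / (delta + np.sqrt(r))) * grad, r

def rmsprop_step(theta, r, grad, eps=0.01, rho=0.9, delta=1e-8):
    """RMSProp: same scaling, but with an exponentially decaying accumulation."""
    r = rho * r + (1 - rho) * grad * grad
    return theta - (eps / (delta + np.sqrt(r))) * grad, r

theta, r = np.array([10.0, 5.0]), np.zeros(2)
grad = np.array([3.0, -1.0])                    # hypothetical minibatch gradient
theta, r = rmsprop_step(theta, r, grad)
```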
Training with 1st order methods

So, which one do I use?
[Bengio et al. 2016]: "There is currently no consensus […] no single best algorithm has emerged […] the most popular and actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, Adadelta and Adam."
A. Karpathy
Schaul (2014), Unit Tests for Stochastic Optimization
Training

A glimpse into 2nd order methods
  J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ − θ₀)ᵀ ∇²_θ J(θ₀) (θ − θ₀)
  ∇_θ J(θ₀): gradient vector,  ∇²_θ J(θ₀): Hessian matrix
Idea: use a better local approximation to make a more informed update.
Training: 2nd order methods

Newton's method
From a 2nd order Taylor approximation:
  J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ − θ₀)ᵀ ∇²_θ J(θ₀) (θ − θ₀)
Critical point at ∇_θ J(θ) = 0 ⇒ θ = θ₀ − H⁻¹ ∇_θ J(θ₀)
Critical points (min, max, saddle) are attractors for Newton!
• cool: we can locate critical points
• but: do not use it for optimizing a neural network!
Dauphin (2014): Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Training: 2nd order methods

Second order methods require a larger batch size. Some algorithms:
• Conjugate gradient: no need to compute the Hessian; guaranteed to converge in k steps for a k-dimensional quadratic function
• Saddle-free Newton [Dauphin, 2014]
• Hessian-free optimization (truncated Newton) [Martens, 2010]
• BFGS (quasi-Newton): approximation of H⁻¹, which needs to be stored and is large for deep networks
• L-BFGS: limited-memory BFGS
Initialization and the importance of good activation distributions
Preprocessing your inputs

Gradient descent converges faster if your data are normalized and decorrelated. Denote by x_i ∈ R^d your input data and by x̂_i its normalized version.
Input normalization
• Min-max scaling:
  ∀ i, j: x̂_{i,j} = (x_{i,j} − min_k x_{k,j}) / (max_k x_{k,j} − min_k x_{k,j} + ε)
• Z-score normalization (goal: µ̂_j = 0, σ̂_j = 1):
  ∀ i, j: x̂_{i,j} = (x_{i,j} − µ_j) / (σ_j + ε)
• ZCA whitening (goal: µ̂_j = 0, σ̂_j = 1, (1/(n−1)) X̂ X̂ᵀ = I):
  X̂ = W X, with W = √(n−1) (X Xᵀ)^{−1/2}
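A sketch of z-score normalization; computing µ and σ on the training set only and reusing them on the test set is the usual practice assumed here.

```python
import numpy as np

def zscore_fit(X, eps=1e-8):
    """Per-feature mean and (regularized) standard deviation, computed on the training set."""
    return X.mean(axis=0), X.std(axis=0) + eps

def zscore_apply(X, mu, sigma):
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 3.0, (100, 4))
X_test = rng.normal(5.0, 3.0, (20, 4))
mu, sigma = zscore_fit(X_train)
X_train_hat = zscore_apply(X_train, mu, sigma)
X_test_hat = zscore_apply(X_test, mu, sigma)    # reuse the training statistics
```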
Z-score normalization / standardizing the inputs

Remember our linear regression: y = 3x + 2 + U(−0.1, 0.1), L2 loss, 30 1-D samples.
[Figure: loss contours in the (b, w) plane, with raw inputs (left) and standardized inputs (right)]
With standardized inputs, the gradient always points to the minimum!
The starting point of training is critical

Pretraining
Historically, training deep FNNs was known to be hard, i.e. to give bad generalization errors. The starting point of a gradient descent has a dramatic impact.
• neural history compressors [Schmidhuber, 1991]
• competitive learning [Maclin and Shavlik, 1995]
• unsupervised pretraining based on Boltzmann machines [Hinton, 2006]
• unsupervised pretraining based on autoencoders [Bengio, 2006]
For example, pretraining with autoencoders

Idea: extract features that allow the previous layer's activities to be reconstructed.
Followed by fine-tuning with gradient descent.
Does not appear to be that critical nowadays (because of ReLU-like units and initialization strategies).
Initializing the weights/biases

Thoughts
• initially behave as a linear predictor; non-linearities should be activated by the learning algorithm only if necessary
• units should not extract the same features: symmetry breaking, otherwise they receive the same gradients
Suppose the inputs are standardized; make the outputs and gradients standardized too:
• sigmoid: b = 0, w ∼ N(0, 1/fanin) ⇒ in the linear part
• sigmoid, tanh: b = 0, w ∼ U(−√6/√(n_i + n_o), √6/√(n_i + n_o)) [Glorot, 2010]
• ReLU: b = 0, w ∼ N(0, 2/fanin) [He(2015)]
LeCun initialization

Initialization in the linear regime for the forward pass
Aim: initialize the weights so that f acts in its linear part, i.e. w close to 0.
• Use the symmetric transfer function f(x) = 1.7159 tanh((2/3)x) ⇒ f(1) = 1, f(−1) = −1
• Center, normalize (unit variance) and decorrelate the input dimensions
• initialize the weights from a distribution with µ = 0, σ = 1/√n_i
• set the biases to 0
• This ensures the output of the layer is zero mean, unit variance
Efficient BackProp, LeCun et al. (1998); Generalization and network design strategies, LeCun (1989)
Glorot initialization strategy

Keep the same distribution for the forward and backward pass
• The activations and the gradients should initially have similar distributions across the layers
  • to avoid vanishing/exploding gradients
• The input dimensions should be centered, normalized, uncorrelated
• With a transfer function f such that f′(0) = 1, this gives:
  ∀ i, Var[W_i] = 2 / (fanin + fanout)
Glorot (Xavier) uniform: W ∼ U[−√6/√(n_i + n_o), √6/√(n_i + n_o)], b = 0
Glorot (Xavier) normal: W ∼ N(0, 2/(fanin + fanout)), b = 0
Understanding the difficulty of training deep feedforward neural networks, Glorot, Bengio, JMLR (2010).
He initialization strategy

Designed for rectifier non-linearities (ReLU, PReLU). Keep the same distribution for the forward and backward pass.
• The activations and the gradients should initially have similar distributions across the layers
• The input dimensions should be centered, normalized, uncorrelated
• With a ReLU transfer function, for k × k conv filters on c input channels and d output channels:
  ½ k² c Var[w_l] = 1 (forward),  ½ k² d Var[w_l] = 1 (backward)
He uniform: W ∼ U[−√(6/n_i), √(6/n_i)], b = 0
He normal: W ∼ N(0, 2/n_i)
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, He et al., ICCV (2015).
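Possible implementations of the Glorot-uniform and He-normal rules for fully connected layers; the fan sizes are made up and the (fan_out, fan_in) weight layout is a convention of this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier uniform: variance 2/(fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    """He normal, for ReLU layers: variance 2/fan_in."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W1, b1 = he_normal(256, 128), np.zeros(128)       # hidden ReLU layer
W2, b2 = glorot_uniform(128, 10), np.zeros(10)    # e.g. softmax output layer
```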
Batch normalization [Ioffe, Szegedy(2015)]

Internal covariate shift
Def [Ioffe(2015)]: the change in the distribution of network activations due to the change in network parameters during training.
Experiment: 3 FC layers (100 units), sigmoid, softmax output, MNIST.
Measure: distribution of the activations of the last hidden layer during training ({15, 50, 85}th percentiles).
Batch normalization [Ioffe, Szegedy(2015)]

Batch normalization to prevent covariate shift
Idea: standardize the activations of every layer to keep the same distributions during training.
• The gradient must be aware of this normalization, otherwise we may get parameter explosion (see Ioffe(2015))
• Introduces a differentiable BN normalization layer:
  z = g(Wu + b) → z = g(BN(Wu + b))
  y_i = BN_{γ,β}(x_i) = γ x̂_i + β
  x̂_i = (x_i − µ_B) / √(σ²_B + ε)
  µ_B, σ²_B: minibatch mean, variance
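A sketch of the training-time BN transform for fully connected activations of shape (batch, features); at test time one would instead use population statistics, as discussed on the next slide. Here γ and β are set to the identity transform for illustration.

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Training-time batch normalization over a minibatch X of shape (batch, features)."""
    mu = X.mean(axis=0)                      # minibatch mean, per feature
    var = X.var(axis=0)                      # minibatch variance, per feature
    X_hat = (X - mu) / np.sqrt(var + eps)    # standardized activations
    return gamma * X_hat + beta              # learned scale and shift

X = np.random.default_rng(0).normal(3.0, 2.0, size=(32, 5))
y = batchnorm_forward(X, gamma=np.ones(5), beta=np.zeros(5))
```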
Batch normalization

Train and test time
• Where: everywhere along the network, before the ReLUs
• at training: standardize each unit's activations over a minibatch
• at test:
  • with one sample, standardize over the population
  • use the mean/variance from the training set
  • or standardize over a batch of test samples
Learning is much faster, with better generalization.
Convolutional Neural Networks (CNN)

Neocognitron [Fukushima(1980)], LeNet5 [LeCun(1998)]
Idea: exploiting the structure of the inputs

Ideas
• Features detected by convolutions with local kernels
• parameter sharing, sparse weights ⇒ a strongly regularized FNN (e.g. detecting an oriented edge is translation invariant)
The CNN of LeCun(1998)

Architecture
• (Conv / NonLinear / Pool) × n
• followed by fully connected layers
General architecture of a CNN

Architecture: Conv / ReLU / Pool
[Figure: 3 input channels convolved with K kernels (plus biases) give K feature maps, passed through a ReLU, then max pooling]
• Convolution: depth, kernel size (3×3, 5×5), padding, stride
• Max pooling: size, stride (e.g. (2, 2))
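A naive (and slow) sketch of one Conv/ReLU stage, just to make the indexing explicit; the kernel count, sizes, padding and stride values are made up, and real implementations use optimized convolution routines.

```python
import numpy as np

def conv_relu(x, kernels, bias, stride=1, pad=0):
    """Naive convolution: x is (C, H, W), kernels is (K, C, kH, kW), bias is (K,)."""
    C, H, W = x.shape
    K, _, kH, kW = kernels.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    oH = (H + 2 * pad - kH) // stride + 1
    oW = (W + 2 * pad - kW) // stride + 1
    out = np.zeros((K, oH, oW))
    for k in range(K):                      # one feature map per kernel
        for i in range(oH):
            for j in range(oW):
                patch = xp[:, i*stride:i*stride+kH, j*stride:j*stride+kW]
                out[k, i, j] = np.sum(patch * kernels[k]) + bias[k]
    return np.maximum(out, 0.0)             # ReLU non-linearity

x = np.random.default_rng(0).normal(size=(3, 8, 8))                    # 3 channels
kernels = np.random.default_rng(1).normal(size=(4, 3, 3, 3))           # K = 4 kernels of size 3x3
feature_maps = conv_relu(x, kernels, np.zeros(4), pad=1)               # shape (4, 8, 8)
```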
Recent CNNs

Multi-column DNN, Ciresan(2012)
Ensemble of convolutional neural networks trained with dataset augmentation.
0.23% test misclassification on MNIST. 1.5 million parameters.
Recent CNNs

SuperVision, Krizhevsky(2012)
• top-5 error of 16%, compared to 26% for the runner-up
• several convolutions were stacked without pooling
• trained on 2 GPUs, for a week
• 60 million parameters, dropout, momentum, L2 penalty, dataset augmentation (translations, reflections, PCA)