Deep learning
Optimization and Regularization in deep networks
Hamid Beigy
Sharif University of Technology
October 9, 2019
Table of contents
1. Optimization
2. Model selection
3. Regularization
4. Dataset augmentation
5. Bagging
6. Dropout
7. Reading
Optimization | Batch gradient descent
1. Batch gradient descent computes the gradient of the cost function with respect to the parameters $\theta$ over the entire training set:
   $$\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$$
2. We need to calculate the gradients for the whole dataset to perform just one update.
3. Batch gradient descent can therefore be very slow, and it is intractable for datasets that do not fit in memory.
4. Batch gradient descent also does not allow us to update the model online, i.e. with new examples on the fly.
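As a minimal sketch of the update rule above, the following NumPy code runs batch gradient descent on a small least-squares problem. The cost function $J(\theta) = \frac{1}{2N}\lVert X\theta - y \rVert^2$, the toy data, and the names `batch_gradient_descent`, `eta`, and `n_steps` are assumptions chosen for illustration; they do not come from the slides.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_steps=500):
    """Minimize J(theta) = 1/(2N) * ||X @ theta - y||^2 with full-batch updates."""
    n_examples = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # The gradient uses every training example before a single update is made.
        grad = X.T @ (X @ theta - y) / n_examples
        theta = theta - eta * grad  # theta = theta - eta * grad J(theta)
    return theta

# Toy data (assumed): 20 noise-free examples of y = 1 + 2x.
x = np.linspace(-1.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])  # first column is the bias feature
y = 1.0 + 2.0 * x

theta_bgd = batch_gradient_descent(X, y)
print(theta_bgd)
```

Each of the 500 updates touches all 20 examples, which is what makes the method expensive when the dataset is large.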
Optimization | Stochastic gradient descent
1. Stochastic gradient descent (SGD) performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$:
   $$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)})$$
2. Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update.
3. SGD does away with this redundancy by performing one update at a time.
4. It is therefore usually much faster and can also be used to learn online.
5. SGD performs frequent updates with high variance, which causes the objective function to fluctuate heavily.
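The online-learning property can be illustrated with a short sketch in which examples arrive one at a time and each triggers an update. The squared-error loss, the generator `example_stream`, and the target $y = 1 + 2x$ are hypothetical and only serve the illustration.

```python
import numpy as np

def sgd_step(theta, x_i, y_i, eta):
    """One SGD update for the per-example loss 0.5 * (theta @ x_i - y_i)**2."""
    grad = (theta @ x_i - y_i) * x_i
    return theta - eta * grad

def example_stream(n_examples, seed=0):
    """Hypothetical data stream that yields one (features, label) pair at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_examples):
        x = rng.uniform(-1.0, 1.0)
        yield np.array([1.0, x]), 1.0 + 2.0 * x  # noise-free target y = 1 + 2x

theta_sgd = np.zeros(2)
for x_i, y_i in example_stream(5000):
    # The model is updated as soon as each example arrives; no dataset is stored.
    theta_sgd = sgd_step(theta_sgd, x_i, y_i, eta=0.1)
print(theta_sgd)
```

Because the toy targets contain no noise, the iterates settle on the generating weights; with noisy labels and a fixed learning rate they would keep fluctuating around the minimum.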
Optimization | Mini-batch gradient descent
1. Mini-batch gradient descent combines the best of both worlds and performs an update for every mini-batch of $n$ training examples:
   $$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$$
2. This method reduces the variance of the parameter updates, which can lead to more stable convergence.
3. It can make use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient with respect to a mini-batch very efficient.
4. Common mini-batch sizes range between 50 and 256, but can vary for different applications.
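The mini-batch update can be sketched as follows, again on an assumed least-squares problem. The function name `minibatch_gd`, its parameters, and the toy data are illustrative choices rather than part of the slides.

```python
import numpy as np

def minibatch_gd(X, y, eta=0.1, batch_size=50, n_epochs=300, seed=0):
    """Mini-batch gradient descent for the mean squared error over each batch."""
    rng = np.random.default_rng(seed)
    n_examples = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(n_examples)        # reshuffle once per epoch
        for start in range(0, n_examples, batch_size):
            idx = order[start:start + batch_size]  # examples i : i + n
            X_b, y_b = X[idx], y[idx]
            # One matrix product gives the gradient for the whole mini-batch.
            grad = X_b.T @ (X_b @ theta - y_b) / len(idx)
            theta = theta - eta * grad
    return theta

# Toy data (assumed): 200 noise-free examples of y = 1 + 2x.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta_mb = minibatch_gd(X, y, batch_size=50)
print(theta_mb)
```

In this sketch, `batch_size=len(X)` reduces to batch gradient descent and `batch_size=1` reduces to SGD, so the batch size interpolates between the two earlier methods.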
Optimization | Mini-batch gradient descent (Challenges)
Mini-batch gradient descent does not guarantee good convergence and poses a few challenges that need to be addressed.
1. Choosing a proper learning rate can be difficult.
2. Choosing the parameters of a learning rate schedule (the schedule itself and its thresholds) is difficult.
3. Should the same learning rate be used for all parameters?
4. How can we avoid getting trapped in suboptimal local minima?
Optimization | Momentum
1. Momentum [1] is a method that helps accelerate SGD in the relevant direction and dampens oscillations.
2. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector:
   $$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$$
   $$\theta = \theta - v_t$$

[1] Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1), 145–151.
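A minimal sketch of the momentum update follows. The quadratic test function $J(\theta) = \frac{1}{2}(\theta_0^2 + 25\,\theta_1^2)$, the starting point, and the values of `eta` and `gamma` are assumptions made for illustration.

```python
import numpy as np

def momentum_descent(grad, theta0, eta=0.02, gamma=0.9, n_steps=300):
    """Gradient descent with momentum; gamma=0 gives plain gradient descent."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = gamma * v + eta * grad(theta)  # v_t = gamma * v_{t-1} + eta * grad J(theta)
        theta = theta - v                  # theta = theta - v_t
    return theta

# Assumed test problem: an elongated quadratic bowl with its minimum at the origin.
curvature = np.array([1.0, 25.0])

def grad(theta):
    return curvature * theta

theta_start = np.array([10.0, 1.0])
theta_plain = momentum_descent(grad, theta_start, gamma=0.0)
theta_mom = momentum_descent(grad, theta_start, gamma=0.9)
print(np.linalg.norm(theta_plain), np.linalg.norm(theta_mom))
```

With the same learning rate and number of steps, the run with momentum ends much closer to the minimum on this problem, because the velocity accumulates along the shallow direction.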
Optimization | Nesterov accelerated gradient
1. A ball that rolls down a hill, blindly following the slope, is highly unsatisfactory.
2. We would like a smarter ball, one that has a notion of where it is going so that it knows to slow down before the hill slopes up again.
3. Nesterov accelerated gradient (NAG) [2] is a way to give the momentum term this kind of prescience:
   $$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$$
   $$\theta = \theta - v_t$$
   The value $\theta - \gamma v_{t-1}$ approximates the next position of the parameters, so the gradient is evaluated at this look-ahead point rather than at the current $\theta$.

[2] Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$. Doklady AN SSSR (translated as Soviet Math. Dokl.), vol. 269, pp. 543–54
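The only change relative to classical momentum is where the gradient is evaluated, which the following sketch makes explicit. The quadratic test function and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def nesterov_descent(grad, theta0, eta=0.02, gamma=0.9, n_steps=300):
    """Nesterov accelerated gradient: the gradient is taken at the look-ahead point."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        lookahead = theta - gamma * v          # approximate next position
        v = gamma * v + eta * grad(lookahead)  # v_t = gamma * v_{t-1} + eta * grad J(lookahead)
        theta = theta - v                      # theta = theta - v_t
    return theta

# Assumed test problem: J(theta) = 0.5 * (theta_0**2 + 25 * theta_1**2).
curvature = np.array([1.0, 25.0])

def grad(theta):
    return curvature * theta

theta_nag = nesterov_descent(grad, np.array([10.0, 1.0]))
print(np.linalg.norm(theta_nag))
```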
Optimization | Adagrad
1. Adagrad [3] is an algorithm for gradient-based optimization that adapts the learning rate to each parameter.
2. Adagrad updates the parameters as follows:
   $$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot \nabla_\theta J(\theta_{t,i})$$
   where $G_t \in \mathbb{R}^{d \times d}$ is a diagonal matrix whose diagonal element $(i, i)$ is the sum of the squares of the gradients with respect to $\theta_i$ up to time step $t$, and $\epsilon$ is a smoothing term that avoids division by zero.
3. Adadelta [4] is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate.

[3] Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.
[4] Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv.
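The per-parameter scaling can be sketched in a few lines. Since $G_t$ is diagonal, the sketch stores only its diagonal as a vector. The quadratic test function, the starting point, and the values `eta=0.5` and `eps=1e-8` are assumptions for illustration.

```python
import numpy as np

def adagrad(grad, theta0, eta=0.5, eps=1e-8, n_steps=100):
    """Adagrad with the diagonal of G_t stored as a vector."""
    theta = np.asarray(theta0, dtype=float).copy()
    G = np.zeros_like(theta)                    # running sum of squared gradients
    for _ in range(n_steps):
        g = grad(theta)
        G = G + g ** 2                          # accumulate per-parameter history
        theta = theta - eta / np.sqrt(G + eps) * g
    return theta

# Assumed test problem: J(theta) = 0.5 * (theta_0**2 + 25 * theta_1**2).
curvature = np.array([1.0, 25.0])

def grad(theta):
    return curvature * theta

theta_ada = adagrad(grad, np.array([1.0, 1.0]))
print(theta_ada)
```

Although the two coordinates have curvatures that differ by a factor of 25, both follow nearly the same trajectory in this example, because each step is normalized by that coordinate's own gradient history. The accumulated sum in `G` never shrinks, which is the monotonically decreasing effective learning rate that Adadelta tries to relax.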
Optimization | Adam
1. Adaptive Moment Estimation (Adam) [5] computes adaptive learning rates for each parameter.
2. Adam behaves like a heavy ball with friction, and therefore prefers flat minima in the error surface.
3. Adam keeps exponentially decaying averages of past gradients, $m_t$, and of past squared gradients, $v_t$:
   $$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta_t) \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta J(\theta_t))^2 \end{aligned} \tag{1}$$
   where $m_t$ and $v_t$ are estimates of the first moment and the second moment of the gradients, respectively, and the square is taken elementwise.

[5] Kingma, D. P., and Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.
Optimization | Adam (cont.)
1. $m_0$ and $v_0$ are initialized to 0.
2. As a result, $m_t$ and $v_t$ are biased towards zero, especially during the first time steps.
3. The bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$ are
   $$\begin{aligned} \hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\ \hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \end{aligned} \tag{2}$$
   Adam then updates the parameters as
   $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
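Putting the moment estimates, the bias correction, and the parameter update together gives the following sketch. The defaults for `beta1`, `beta2`, and `eps` are the values suggested in the Adam paper by Kingma and Ba; the learning rate `eta=0.1`, the quadratic test function, and the starting point are assumptions chosen for illustration.

```python
import numpy as np

def adam(grad, theta0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=200):
    """Adam with bias-corrected first and second moment estimates."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)                    # first moment estimate, m_0 = 0
    v = np.zeros_like(theta)                    # second moment estimate, v_0 = 0
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)            # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat
    return theta

# Assumed test problem: J(theta) = 0.5 * (theta_0**2 + 25 * theta_1**2).
curvature = np.array([1.0, 25.0])

def grad(theta):
    return curvature * theta

theta_start = np.array([1.0, -2.0])
theta_first = adam(grad, theta_start, n_steps=1)
theta_adam = adam(grad, theta_start)
print(theta_first, theta_adam)
```

At $t = 1$ the bias correction gives $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the first step moves every coordinate by approximately $\eta$ in the direction opposite to its gradient sign, regardless of the gradient's magnitude. Without the correction, the first steps would be scaled by the near-zero initial estimates.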
Optimization | Other optimization algorithms
1. AdaMax modifies the $v_t$ term of Adam, scaling by the $\ell_\infty$ norm of past gradients instead of the $\ell_2$ norm.
2. Nadam (Nesterov-accelerated Adaptive Moment Estimation) [6] combines Adam and NAG.
3. For more information, see Sebastian Ruder (2017), "An overview of gradient descent optimization algorithms", arXiv.
4. Which optimizer to use?

[6] Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop, (1), 2013–2016.
Model selection
1. Consider a regression problem in which the training set is
   $$S = \{(x_1, t_1), (x_2, t_2), \ldots, (x_N, t_N)\}, \quad t_k \in \mathbb{R},$$
   where
   $$t_k = f(x_k) + \epsilon \quad \forall k = 1, 2, \ldots, N.$$
   Here $f(x_k) \in \mathbb{R}$ is the unknown function and $\epsilon$ is random noise.
2. The goal is to approximate $f(x)$ by a function $g(x)$.
3. The empirical error on the training set $S$ is measured using the cost function
   $$E_E(g(x) \mid S) = \frac{1}{2} \sum_{i=1}^{N} (t_i - g(x_i))^2$$
4. The aim is to find the $g(\cdot)$ that minimizes the empirical error.
5. We assume a hypothesis class for $g(\cdot)$ with a small set of parameters.
6. Assume that $g(x)$ is linear:
   $$g(x) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D$$
Model selection (cont.)
1. Define the following vectors and matrix.
   Data matrix:
   $$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1D} \\ 1 & x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{ND} \end{pmatrix}$$
   The $k$th input vector: $X_k = (1, x_{k1}, x_{k2}, \ldots, x_{kD})^T$
   The weight vector: $W = (w_0, w_1, w_2, \ldots, w_D)^T$
   The target vector: $t = (t_1, t_2, t_3, \ldots, t_N)^T$
2. The empirical error equals
   $$E_E(g(x) \mid S) = \frac{1}{2} \sum_{k=1}^{N} (t_k - W^T X_k)^2$$
3. Setting the gradient $\nabla_W E_E(g(x) \mid S)$ to zero gives
   $$\sum_{k=1}^{N} t_k X_k^T - W^T \sum_{k=1}^{N} X_k X_k^T = 0$$
4. Solving for $W$, we obtain
   $$W^* = (X^T X)^{-1} X^T t$$
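The closed-form solution can be checked numerically with a short sketch. The synthetic data, the ground-truth vector `W_true`, and the choice of $N = 50$, $D = 2$ are assumptions for illustration. The sketch solves the linear system $X^T X\, W = X^T t$ with `np.linalg.solve` instead of forming the inverse explicitly, which yields the same $W^*$ and is usually better conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 2
inputs = rng.normal(size=(N, D))
X = np.column_stack([np.ones(N), inputs])  # data matrix with a leading column of ones
W_true = np.array([1.0, 2.0, -3.0])        # hypothetical ground-truth weights
t = X @ W_true                             # noise-free targets for a clean check

# W* = (X^T X)^{-1} X^T t, computed by solving the normal equations.
W_star = np.linalg.solve(X.T @ X, X.T @ t)

# Cross-check against NumPy's least-squares routine.
W_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)
print(W_star, W_lstsq)
```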
Model selection (cont.)
1. If the linear model is too simple, the model can be a polynomial (a more complex hypothesis set):
   $$g(x) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M$$
2. $M$ is the order of the polynomial.
3. Choosing the right value of $M$ is called model selection.
4. For $M = 1$, the model is too general (it underfits the data).
5. For $M = 9$, the model is too specific (it overfits the data).
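One common way to choose $M$ is to compare the empirical error on the training set with the error on held-out data. The sketch below does this for $M = 1, \ldots, 9$. The target function $\sin(2\pi x)$, the noise level, the ten training points, and the use of a separate validation set are all assumptions for illustration; the slides do not specify them.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares weights for g(x) = w_0 + w_1 x + ... + w_M x^M."""
    X = np.vander(x, M + 1, increasing=True)  # columns 1, x, ..., x^M
    W, *_ = np.linalg.lstsq(X, t, rcond=None)
    return W

def empirical_error(x, t, W):
    """E_E = 1/2 * sum_k (t_k - g(x_k))^2."""
    X = np.vander(x, len(W), increasing=True)
    return 0.5 * np.sum((t - X @ W) ** 2)

rng = np.random.default_rng(0)

def noisy_target(x):
    # Hypothetical unknown function f(x) plus Gaussian noise.
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

x_train = np.linspace(0.0, 1.0, 10)
t_train = noisy_target(x_train)
x_val = np.linspace(0.05, 0.95, 10)
t_val = noisy_target(x_val)

train_err, val_err = {}, {}
for M in range(1, 10):
    W = fit_polynomial(x_train, t_train, M)
    train_err[M] = empirical_error(x_train, t_train, W)
    val_err[M] = empirical_error(x_val, t_val, W)
    print(M, train_err[M], val_err[M])

best_M = min(val_err, key=val_err.get)
print("order with lowest validation error:", best_M)
```

Because the polynomial models are nested, the training error can only decrease as $M$ grows, and with ten training points the order-9 polynomial interpolates them almost exactly. The training error alone therefore always favors the largest $M$, which is why the comparison uses data that was not used for fitting.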