

  1. NPFL114, Lecture 2: Training Neural Networks. Milan Straka, March 11, 2019. Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics. Unless otherwise stated.

  2. Estimators and Bias
An estimator is a rule for computing an estimate of a given value, often an expectation of some random value(s). The bias of an estimator is the difference between the expected value of the estimator and the true value being estimated. If the bias is zero, we call the estimator unbiased; otherwise we call it biased. If we have a sequence of estimates, it also might happen that the bias converges to zero.
Consider the well-known sample estimate of variance. Given independent and identically distributed random variables $x_1, \ldots, x_n$, we might estimate the mean and variance as
$$\hat\mu = \frac{1}{n} \sum_i x_i, \qquad \hat\sigma^2 = \frac{1}{n} \sum_i (x_i - \hat\mu)^2.$$
Such an estimate is biased, because $\mathbb{E}[\hat\sigma^2] = \left(1 - \frac{1}{n}\right) \sigma^2$, but the bias converges to zero with increasing $n$.
Also, an unbiased estimator does not necessarily have small variance – in some cases it can have large variance, so a biased estimator with smaller variance might be preferred.
NPFL114, Lecture 2 | ML Basics · Loss · Gradient Descent · Backpropagation · NN Training · SGDs · Adaptive LR · LR Schedules | 2/39
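The bias of the $\frac{1}{n}$ variance estimator can be checked empirically. The following sketch (not part of the slides; the sample size and number of trials are illustrative) averages the estimator over many i.i.d. samples and compares it with the theoretical expectation $\left(1 - \frac{1}{n}\right)\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 5, 200_000
sigma2 = 4.0  # true variance of the generating distribution

# trials independent samples of size n from N(0, sigma2)
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)
var_hat = ((samples - mu_hat) ** 2).mean(axis=1)  # biased 1/n estimator

expected = (1 - 1 / n) * sigma2  # theoretical expectation (1 - 1/n) * sigma^2
print(var_hat.mean(), expected)  # both close to 3.2, not to 4.0
```

For $n = 5$ the estimator systematically underestimates the variance by a factor of $0.8$, matching the formula above.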

  3. Machine Learning Basics
We usually have a training set, which is assumed to consist of examples generated independently from a data-generating distribution. The goal of optimization is to match the training set as well as possible. However, the main goal of machine learning is to perform well on previously unseen data, i.e., to achieve a low so-called generalization error or test error. We typically estimate the generalization error using a test set of examples independent of the training set, but generated by the same data-generating distribution.

  4. Machine Learning Basics
Challenges in machine learning: underfitting and overfitting.
Figure 5.2, page 113 of Deep Learning Book, http://deeplearningbook.org

  5. Machine Learning Basics
We can control whether a model underfits or overfits by modifying its capacity: its representational capacity and its effective capacity.
Figure 5.3, page 115 of Deep Learning Book, http://deeplearningbook.org
The No free lunch theorem (Wolpert, 1996) states that, averaging over all possible data distributions, every classification algorithm achieves the same overall error when processing unseen examples. In a sense, no machine learning algorithm is universally better than others.

  6. Machine Learning Basics
Any change in a machine learning algorithm that is designed to reduce generalization error but not necessarily its training error is called regularization.
$L_2$ regularization (also called weight decay) penalizes models with large weights (i.e., a penalty of $\|\theta\|^2$).
Figure 5.5, page 119 of Deep Learning Book, http://deeplearningbook.org
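As a minimal sketch (not from the slides; the coefficient name `lam` and the example values are illustrative), the $L_2$ penalty $\lambda \|\theta\|^2$ adds a term to the loss and, correspondingly, $2\lambda\theta$ to its gradient:

```python
import numpy as np

def l2_regularized_loss(data_loss, theta, lam):
    # total loss = data loss + lambda * ||theta||^2
    return data_loss + lam * np.sum(theta ** 2)

def l2_gradient_term(theta, lam):
    # gradient of lambda * ||theta||^2 with respect to theta
    return 2 * lam * theta

theta = np.array([3.0, -4.0])
print(l2_regularized_loss(10.0, theta, lam=0.1))  # 10 + 0.1 * 25 = 12.5
print(l2_gradient_term(theta, lam=0.1))           # [0.6, -0.8]
```

The gradient term pulls every weight toward zero on each update, which is where the name "weight decay" comes from.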

  7. Machine Learning Basics
Hyperparameters are not adapted by the learning algorithm itself. Usually a validation set or development set is used to estimate the generalization error, allowing us to update hyperparameters accordingly.

  8. Loss Function
A model is usually trained in order to minimize the loss on the training data. Assuming that a model computes $f(x; \theta)$ using parameters $\theta$, the mean square error is computed as
$$\frac{1}{m} \sum_{i=1}^m \left( f(x^{(i)}; \theta) - y^{(i)} \right)^2.$$
A common principle used to design loss functions is the maximum likelihood principle.
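The mean square error formula above translates directly into code. This sketch uses made-up predictions and targets purely for illustration:

```python
import numpy as np

def mse(predictions, targets):
    # (1/m) * sum of squared differences between f(x; theta) and y
    return np.mean((predictions - targets) ** 2)

y_pred = np.array([1.0, 2.0, 3.0])  # f(x^(i); theta) for i = 1..3
y_true = np.array([1.0, 2.0, 5.0])  # y^(i)
print(mse(y_pred, y_true))  # (0 + 0 + 4) / 3 = 1.333...
```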

  9. Maximum Likelihood Estimation
Let $\mathbb{X} = \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ be training data drawn independently from the data-generating distribution $p_{\mathrm{data}}$. We denote the empirical data distribution as $\hat{p}_{\mathrm{data}}$.
Let $p_{\mathrm{model}}(x; \theta)$ be a family of distributions. The maximum likelihood estimation of $\theta$ is:
$$\begin{aligned}
\theta_{\mathrm{ML}} &= \arg\max_\theta p_{\mathrm{model}}(\mathbb{X}; \theta) \\
&= \arg\max_\theta \prod_{i=1}^m p_{\mathrm{model}}(x^{(i)}; \theta) \\
&= \arg\min_\theta \sum_{i=1}^m -\log p_{\mathrm{model}}(x^{(i)}; \theta) \\
&= \arg\min_\theta \mathbb{E}_{x \sim \hat{p}_{\mathrm{data}}} \left[ -\log p_{\mathrm{model}}(x; \theta) \right] \\
&= \arg\min_\theta H(\hat{p}_{\mathrm{data}}, p_{\mathrm{model}}(x; \theta)) \\
&= \arg\min_\theta D_{\mathrm{KL}}(\hat{p}_{\mathrm{data}} \| p_{\mathrm{model}}(x; \theta)) + H(\hat{p}_{\mathrm{data}})
\end{aligned}$$
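The step from the sum of negative log-likelihoods to the cross-entropy $H(\hat{p}_{\mathrm{data}}, p_{\mathrm{model}})$ can be verified numerically. In this sketch (the categorical model probabilities and observed data are made up), the per-example average NLL equals the cross-entropy computed from the empirical distribution:

```python
import numpy as np

model_p = np.array([0.1, 0.6, 0.3])  # p_model over 3 classes (fixed theta)
data = np.array([1, 1, 2, 1, 0])     # observed class indices x^(1)..x^(m)

# Average negative log-likelihood over the training examples
nll = -np.mean(np.log(model_p[data]))

# The same value via the empirical distribution p̂_data:
p_hat = np.bincount(data, minlength=3) / len(data)  # [0.2, 0.6, 0.2]
cross_entropy = -np.sum(p_hat * np.log(model_p))

print(nll, cross_entropy)  # identical values
```

This is exactly why minimizing the summed NLL and minimizing $H(\hat{p}_{\mathrm{data}}, p_{\mathrm{model}})$ pick out the same $\theta$.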

  10. Maximum Likelihood Estimation
MLE can be easily generalized to the conditional case, where our goal is to predict $y$ given $x$:
$$\begin{aligned}
\theta_{\mathrm{ML}} &= \arg\max_\theta p_{\mathrm{model}}(\mathbb{Y} \mid \mathbb{X}; \theta) \\
&= \arg\max_\theta \prod_{i=1}^m p_{\mathrm{model}}(y^{(i)} \mid x^{(i)}; \theta) \\
&= \arg\min_\theta \sum_{i=1}^m -\log p_{\mathrm{model}}(y^{(i)} \mid x^{(i)}; \theta)
\end{aligned}$$
The resulting loss function is called negative log-likelihood, or cross-entropy, or Kullback-Leibler divergence.

  11. Properties of Maximum Likelihood Estimation
Assume that the true data-generating distribution $p_{\mathrm{data}}$ lies within the model family $p_{\mathrm{model}}(\cdot; \theta)$, and assume there exists a unique $\theta_{p_{\mathrm{data}}}$ such that $p_{\mathrm{data}} = p_{\mathrm{model}}(\cdot; \theta_{p_{\mathrm{data}}})$.
MLE is a consistent estimator. If we denote $\theta_m$ to be the parameters found by MLE for a training set with $m$ examples generated by the data-generating distribution, then $\theta_m$ converges in probability to $\theta_{p_{\mathrm{data}}}$. Formally, for any $\varepsilon > 0$, $P(\|\theta_m - \theta_{p_{\mathrm{data}}}\| > \varepsilon) \to 0$ as $m \to \infty$.
MLE is in a sense the most statistically efficient estimator. For any consistent estimator, we might consider the average distance of $\theta_m$ and $\theta_{p_{\mathrm{data}}}$, formally $\mathbb{E}_{x_1, \ldots, x_m \sim p_{\mathrm{data}}} \left[ \|\theta_m - \theta_{p_{\mathrm{data}}}\|^2 \right]$. It can be shown (Rao 1945, Cramér 1946) that no consistent estimator has a lower mean squared error than the maximum likelihood estimator.
Therefore, for reasons of consistency and efficiency, maximum likelihood is often considered the preferred estimator for machine learning.

  12. Mean Square Error as MLE
Assume our goal is to perform regression, i.e., to predict $p(y \mid x)$ for $y \in \mathbb{R}$. Let $\hat{y}(x; \theta)$ give a prediction of the mean of $y$.
We define $p(y \mid x)$ as $\mathcal{N}(y; \hat{y}(x; \theta), \sigma^2)$ for a given fixed $\sigma$. Then:
$$\begin{aligned}
\arg\max_\theta \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}; \theta)
&= \arg\min_\theta \sum_{i=1}^m -\log p(y^{(i)} \mid x^{(i)}; \theta) \\
&= \arg\min_\theta -\sum_{i=1}^m \log \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y^{(i)} - \hat{y}(x^{(i)}; \theta))^2}{2\sigma^2}} \\
&= \arg\min_\theta \frac{m}{2} \log(2\pi\sigma^2) + \sum_{i=1}^m \frac{(y^{(i)} - \hat{y}(x^{(i)}; \theta))^2}{2\sigma^2} \\
&= \arg\min_\theta \sum_{i=1}^m \frac{(y^{(i)} - \hat{y}(x^{(i)}; \theta))^2}{2\sigma^2}
= \arg\min_\theta \sum_{i=1}^m (y^{(i)} - \hat{y}(x^{(i)}; \theta))^2.
\end{aligned}$$
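The derivation says that for fixed $\sigma$, the Gaussian negative log-likelihood and the sum of squared errors differ only by an additive constant and a positive scale, so they order candidate predictions identically. A numerical sketch (the targets, $\sigma$, and the two candidate predictions are made up):

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma):
    # sum over examples of -log N(y; y_hat, sigma^2)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (y - y_hat) ** 2 / (2 * sigma**2))

def sse(y, y_hat):
    return np.sum((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])
sigma = 0.7
a = np.array([1.1, 2.0, 2.9])  # good prediction
b = np.array([0.0, 0.0, 0.0])  # bad prediction

# NLL differences are exactly SSE differences scaled by 1/(2 sigma^2):
print(gaussian_nll(y, a, sigma) - gaussian_nll(y, b, sigma))
print((sse(y, a) - sse(y, b)) / (2 * sigma**2))
```

Since the constant term $\frac{m}{2}\log(2\pi\sigma^2)$ cancels in any comparison, minimizing the NLL and minimizing the squared error are the same optimization problem.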

  13. Gradient Descent
Let a model compute $f(x; \theta)$ using parameters $\theta$, and for a given loss function $L$ denote
$$J(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}_{\mathrm{data}}} L(f(x; \theta), y).$$
In order to compute $\arg\min_\theta J(\theta)$, we may use gradient descent:
$$\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$$
Figure 4.1, page 83 of Deep Learning Book, http://deeplearningbook.org
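The update rule can be sketched as a short loop. Here $f$ is a hypothetical one-dimensional linear model $f(x; \theta) = \theta x$ with MSE loss, so the gradient has a closed form; the data and learning rate $\alpha$ are illustrative:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x  # data generated with true theta = 2

theta, alpha = 0.0, 0.05
for _ in range(200):
    # d/dtheta of J(theta) = mean((theta*x - y)^2) is mean(2*(theta*x - y)*x)
    grad = np.mean(2 * (theta * x - y) * x)
    theta -= alpha * grad  # theta <- theta - alpha * grad J(theta)
print(theta)  # converges toward 2.0
```

Each step contracts the error $\theta - 2$ by a constant factor, so the loop converges geometrically to the minimizer.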

  14. Gradient Descent Variants
(Regular) Gradient Descent: we use all training data to compute $J(\theta)$.
Online (or Stochastic) Gradient Descent: we estimate the expectation in $J(\theta)$ using a single randomly sampled example from the training data. Such an estimate is unbiased, but very noisy.
$$J(\theta) = L(f(x; \theta), y) \quad \text{for randomly chosen } (x, y) \text{ from } \hat{p}_{\mathrm{data}}.$$
Minibatch SGD: a trade-off between gradient descent and SGD – the expectation in $J(\theta)$ is estimated using $m$ random independent examples from the training data.
$$J(\theta) = \frac{1}{m} \sum_{i=1}^m L(f(x^{(i)}; \theta), y^{(i)}) \quad \text{for randomly chosen } (x^{(i)}, y^{(i)}) \text{ from } \hat{p}_{\mathrm{data}}.$$
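A minibatch SGD sketch for the same kind of one-dimensional linear model (the dataset size, batch size $m$, and learning rate are illustrative, not from the slides). Each step estimates the gradient from $m$ randomly sampled examples instead of the full training set:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = 3.0 * x  # data generated with true theta = 3

theta, alpha, m = 0.0, 0.1, 32
for _ in range(500):
    idx = rng.integers(0, len(x), size=m)  # sample a minibatch of size m
    xb, yb = x[idx], y[idx]
    # unbiased minibatch estimate of the full gradient
    grad = np.mean(2 * (theta * xb - yb) * xb)
    theta -= alpha * grad
print(theta)  # close to 3.0
```

The per-step gradient is noisy, but because each minibatch estimate is unbiased, the iterates still home in on the same minimizer as full-batch gradient descent, at a fraction of the per-step cost.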
