CS7015 (Deep Learning): Lecture 8
Regularization: Bias-Variance Tradeoff, $\ell_2$ Regularization, Early Stopping, Dataset Augmentation, Parameter Sharing and Tying, Injecting Noise at Input, Ensemble Methods, Dropout
Mitesh M. Khapra
Department of Computer Science and Engineering, Indian Institute of Technology Madras
Acknowledgements
Chapter 7 of the Deep Learning book
Ali Ghodsi's video lectures on Regularization (Lecture 2.1 and Lecture 2.2)
"Dropout: A Simple Way to Prevent Neural Networks from Overfitting"
Module 8.1: Bias and Variance
We will begin with a quick overview of bias, variance and the trade-off between them.
Let us consider the problem of fitting a curve through a given set of points. We consider two models:

Simple (degree 1): $\hat{f}(x) = w_1 x + w_0$

Complex (degree 25): $\hat{f}(x) = \sum_{i=1}^{25} w_i x^i + w_0$

Note that in both cases we are making an assumption about how $y$ is related to $x$. We have no idea about the true relation $f(x)$. The points were drawn from a sinusoidal function (the true $f(x)$), and the training data consists of 100 points.
We sample 25 points from the training data and train a simple and a complex model. We repeat the process $k$ times to train multiple models (each model sees a different sample of the training data). The points were drawn from a sinusoidal function (the true $f(x)$). We make a few observations from these plots; a simulation of this experiment is sketched below.
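The following is a minimal sketch of this experiment, not the lecture's actual code: the exact sinusoid, noise level, and random seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# True model: a sinusoid (assumed form; the lecture does not specify it exactly)
def f(x):
    return np.sin(2 * np.pi * x)

# Training data: 100 noisy points drawn from the true function
x_train = rng.uniform(0, 1, 100)
y_train = f(x_train) + rng.normal(0, 0.3, 100)

k = 20                           # number of repetitions
x_grid = np.linspace(0, 1, 200)  # where we evaluate each fitted model
simple_fits, complex_fits = [], []

for _ in range(k):
    # Each model sees a different sample of 25 training points
    idx = rng.choice(100, size=25, replace=False)
    xs, ys = x_train[idx], y_train[idx]

    # Simple model: degree-1 polynomial; complex model: degree-25 polynomial
    # (polyfit may warn that the degree-25 fit is poorly conditioned)
    simple_fits.append(np.polyval(np.polyfit(xs, ys, 1), x_grid))
    complex_fits.append(np.polyval(np.polyfit(xs, ys, 25), x_grid))

simple_fits = np.array(simple_fits)    # shape (k, 200)
complex_fits = np.array(complex_fits)  # shape (k, 200)
```

Plotting the rows of `simple_fits` and `complex_fits` reproduces the qualitative picture on the slides: the degree-1 fits cluster together, while the degree-25 fits vary wildly across samples.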
Simple models trained on different samples of the data do not differ much from each other. However, they are all very far from the true sinusoidal curve (underfitting). On the other hand, complex models trained on different samples of the data are very different from each other (high variance).
Let $f(x)$ be the true model (sinusoidal in this case) and $\hat{f}(x)$ be our estimate of the model (simple or complex, in this case). Then,

$$\text{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$$

$E[\hat{f}(x)]$ is the average (or expected) value of the model.

In the plots, the green line is the average value of $\hat{f}(x)$ for the simple model, the blue curve is the average value of $\hat{f}(x)$ for the complex model, and the red curve is the true model $f(x)$.

We can see that for the simple model the average value (green line) is very far from the true value $f(x)$ (the sinusoidal function). Mathematically, this means that the simple model has a high bias. On the other hand, the complex model has a low bias.
We now define

$$\text{Variance}(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$$

(the standard definition from statistics). Roughly speaking, it tells us how much the different $\hat{f}(x)$'s (trained on different samples of the data) differ from each other. It is clear that the simple model has a low variance whereas the complex model has a high variance. The sketch below estimates both quantities empirically.
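Continuing the sketch above, we can estimate the bias and variance of each model empirically by averaging over the $k$ fitted models (again an illustrative sketch, not the lecture's code):

```python
# E[f_hat(x)] is approximated by the mean over the k fitted models
mean_simple = simple_fits.mean(axis=0)
mean_complex = complex_fits.mean(axis=0)

# Bias(f_hat(x)) = E[f_hat(x)] - f(x), evaluated on the grid
bias_simple = mean_simple - f(x_grid)
bias_complex = mean_complex - f(x_grid)

# Variance(f_hat(x)) = E[(f_hat(x) - E[f_hat(x)])^2]
var_simple = simple_fits.var(axis=0)
var_complex = complex_fits.var(axis=0)

print("mean |bias|   (simple, complex):",
      np.abs(bias_simple).mean(), np.abs(bias_complex).mean())
print("mean variance (simple, complex):",
      var_simple.mean(), var_complex.mean())
# Expected pattern: simple  -> high bias, low variance;
#                   complex -> low bias, high variance.
```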
In summary (informally):
Simple model: high bias, low variance
Complex model: low bias, high variance
There is always a trade-off between the bias and the variance. Both bias and variance contribute to the mean squared error. Let us see how.
Module 8.2: Train error vs Test error
Consider a new point $(x, y)$ which was not seen during training. If we use the model $\hat{f}(x)$ to predict the value of $y$, then the mean squared error is given by $E[(y - \hat{f}(x))^2]$ (the average squared error in predicting $y$ for many such unseen points). We can show that

$$E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2 \text{ (irreducible error)}$$

A sketch of the proof is given below.
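The slide's link to the proof does not survive extraction; the following is the standard derivation, assuming $y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$, $\text{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ independent of $\hat{f}(x)$:

```latex
\begin{align*}
E[(y - \hat{f}(x))^2]
  &= E[(f(x) + \varepsilon - \hat{f}(x))^2] \\
  &= E[(f(x) - \hat{f}(x))^2] + 2\,E[\varepsilon]\,E[f(x) - \hat{f}(x)] + E[\varepsilon^2] \\
  &= E[(f(x) - \hat{f}(x))^2] + \sigma^2
     && (E[\varepsilon] = 0,\ E[\varepsilon^2] = \sigma^2)
\intertext{and, writing $\bar{f}(x) = E[\hat{f}(x)]$ (the cross term vanishes
because $E[\hat{f}(x) - \bar{f}(x)] = 0$),}
E[(f(x) - \hat{f}(x))^2]
  &= E[(f(x) - \bar{f}(x) + \bar{f}(x) - \hat{f}(x))^2] \\
  &= \underbrace{(f(x) - \bar{f}(x))^2}_{\text{Bias}^2}
   + \underbrace{E[(\hat{f}(x) - \bar{f}(x))^2]}_{\text{Variance}}
\end{align*}
```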
[Figure: train error and test error vs. model complexity. The train error decreases monotonically with model complexity; the test error is U-shaped, high on the left (high bias) and high on the right (high variance), with a sweet spot in between: the perfect tradeoff, i.e., the ideal model complexity.]

The parameters of $\hat{f}(x)$ (all the $w_i$'s) are trained using a training set $\{(x_i, y_i)\}_{i=1}^{n}$. However, at test time we are interested in evaluating the model on a validation (unseen) set which was not used for training. This gives rise to the following two entities of interest: train_err (say, mean squared error) and test_err (say, mean squared error). Typically these errors exhibit the trend shown in the figure above, consistent with the decomposition $E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$ (irreducible error).
Intuitions developed so far: let there be $n$ training points and $m$ test (validation) points. Then

$$\text{train\_err} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$$

$$\text{test\_err} = \frac{1}{m} \sum_{i=n+1}^{n+m} (y_i - \hat{f}(x_i))^2$$

As the model complexity increases, train_err becomes overly optimistic and gives us a wrong picture of how close $\hat{f}$ is to $f$. The validation error gives the real picture of how close $\hat{f}$ is to $f$. We will now concretize this intuition mathematically and eventually show how to account for the optimism in the training error. A short sketch of the two quantities follows below.
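A minimal sketch of the two quantities, reusing the synthetic setup from the earlier code (the validation split and noise level are assumptions):

```python
def mse(y, y_hat):
    """Mean squared error between observations and predictions."""
    return np.mean((y - y_hat) ** 2)

# Held-out validation points from the same sinusoid
x_val = rng.uniform(0, 1, 50)
y_val = f(x_val) + rng.normal(0, 0.3, 50)

for degree in (1, 25):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    test_err = mse(y_val, np.polyval(coeffs, x_val))
    print(f"degree {degree:2d}: train_err={train_err:.4f}  test_err={test_err:.4f}")
```

As the degree grows, train_err keeps shrinking while test_err eventually grows: the training error is overly optimistic.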
Let $D = \{(x_i, y_i)\}_{i=1}^{m+n}$. Then for any point $(x, y)$ we have

$$y_i = f(x_i) + \varepsilon_i$$

which means that $y_i$ is related to $x_i$ by some true function $f$, but there is also some noise $\varepsilon$ in the relation. For simplicity, we assume $\varepsilon \sim N(0, \sigma^2)$, and of course we do not know $f$.

Further, we use $\hat{f}$ to approximate $f$ and estimate its parameters using a training set $T \subset D$, so that $\hat{y}_i = \hat{f}(x_i)$. We are interested in knowing $E[(\hat{f}(x_i) - f(x_i))^2]$, but we cannot estimate this directly because we do not know $f$. We will see how to estimate it empirically using the observations $y_i$ and the predictions $\hat{y}_i$.
$$\begin{aligned}
E[(\hat{y}_i - y_i)^2] &= E[(\hat{f}(x_i) - f(x_i) - \varepsilon_i)^2] && (y_i = f(x_i) + \varepsilon_i) \\
&= E[(\hat{f}(x_i) - f(x_i))^2 - 2\varepsilon_i(\hat{f}(x_i) - f(x_i)) + \varepsilon_i^2] \\
&= E[(\hat{f}(x_i) - f(x_i))^2] - 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))] + E[\varepsilon_i^2] \\
\therefore\ E[(\hat{f}(x_i) - f(x_i))^2] &= E[(\hat{y}_i - y_i)^2] - E[\varepsilon_i^2] + 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]
\end{aligned}$$
We will take a small detour to understand how to empirically estimate an expectation, and then return to our derivation.
Suppose we have observed the goals scored ($z$) in $k$ matches as $z_1 = 2$, $z_2 = 1$, $z_3 = 0$, ..., $z_k = 2$. We can then empirically estimate $E[z]$, i.e., the expected number of goals scored, as

$$E[z] = \frac{1}{k} \sum_{i=1}^{k} z_i$$

Analogy with our derivation: we have a certain number of observations $y_i$ and predictions $\hat{y}_i$, using which we can estimate

$$E[(\hat{y}_i - y_i)^2] = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$

The snippet below makes the analogy concrete.
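In code, an empirical expectation is just a sample mean (the goal counts here are hypothetical, and the snippet reuses the earlier synthetic setup):

```python
# Empirical estimate of E[z] from k observed matches
z = np.array([2, 1, 0, 3, 2])
print("E[z] ~", z.mean())  # (1/k) * sum_i z_i

# Same idea for the squared prediction error of a fitted model
coeffs = np.polyfit(x_train, y_train, 1)
y_hat = np.polyval(coeffs, x_val)
print("E[(y_hat - y)^2] ~", np.mean((y_hat - y_val) ** 2))
```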
... returning to our derivation
$$E[(\hat{f}(x_i) - f(x_i))^2] = E[(\hat{y}_i - y_i)^2] - E[\varepsilon_i^2] + 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]$$

We can empirically evaluate the R.H.S. using training observations or test observations.

Case 1: Using test observations

$$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{m} \sum_{i=n+1}^{n+m} (\hat{y}_i - y_i)^2}_{\text{empirical estimate of error}} - \underbrace{\frac{1}{m} \sum_{i=n+1}^{n+m} \varepsilon_i^2}_{\text{small constant}} + \underbrace{2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\ \text{covariance}(\varepsilon_i,\ \hat{f}(x_i) - f(x_i))}$$

The last term equals the covariance because

$$\text{covariance}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[X(Y - \mu_Y)] \ \text{(if } \mu_X = E[X] = 0\text{)} = E[XY] - \mu_Y E[X] = E[XY]$$
None of the test observations participated in the estimation of $\hat{f}(x)$ (the parameters of $\hat{f}(x)$ were estimated using only the training data).

$$\therefore\ \varepsilon \perp (\hat{f}(x_i) - f(x_i))$$

$$\therefore\ E[\varepsilon_i \cdot (\hat{f}(x_i) - f(x_i))] = E[\varepsilon_i] \cdot E[\hat{f}(x_i) - f(x_i)] = 0 \cdot E[\hat{f}(x_i) - f(x_i)] = 0$$

$$\therefore\ \text{true error} = \text{empirical test error} + \text{small constant}$$

Hence, we should always use a validation set (independent of the training set) to estimate the error. The simulation below illustrates this relation.
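A small simulation sketch of this conclusion, again under the assumed synthetic setup (known $f$, $\sigma = 0.3$, and an arbitrarily chosen degree-3 fit): because we know $f$ here, we can compute the true error directly and compare it with the empirical test error.

```python
sigma = 0.3
coeffs = np.polyfit(x_train, y_train, 3)  # degree chosen arbitrarily

# True error E[(f_hat(x) - f(x))^2], computable only because f is known
x_big = rng.uniform(0, 1, 100_000)
true_err = np.mean((np.polyval(coeffs, x_big) - f(x_big)) ** 2)

# Empirical test error on fresh noisy observations (never used for fitting)
y_big = f(x_big) + rng.normal(0, sigma, x_big.size)
emp_test_err = np.mean((np.polyval(coeffs, x_big) - y_big) ** 2)

print("true error:           ", true_err)
print("empirical test error: ", emp_test_err)
print("difference (~sigma^2):", emp_test_err - true_err)  # ~ 0.09
```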