Learning From Data, Lecture 11: Overfitting
What is overfitting? When does overfitting occur? Stochastic and deterministic noise.
M. Magdon-Ismail, CSCI 4100/6100
Recap: Nonlinear Transforms

X-space is R^d; Z-space is R^d̃. The transform Φ maps

  x = (1, x_1, ..., x_d)  →  z = Φ(x) = (1, Φ_1(x), ..., Φ_d̃(x)).

1. Original data: x_1, ..., x_N ∈ X with labels y_1, ..., y_N.
2. Transform the data: z_n = Φ(x_n) ∈ Z; the labels are unchanged.
3. Separate the data in Z-space: g̃(z) = sign(w̃ᵗz), with weights w̃ = (w̃_0, w̃_1, ..., w̃_d̃). There are no weights in X-space.
4. Classify in X-space (conceptually via 'Φ⁻¹'): g(x) = g̃(Φ(x)) = sign(w̃ᵗΦ(x)).

In X-space, d_vc = d + 1; in Z-space, d_vc = d̃ + 1.
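A minimal numerical sketch of this four-step pipeline; the toy cubic target, the degree-3 transform, and all names are illustrative assumptions, not the lecture's setup:

```python
import numpy as np

def phi(x, degree=3):
    # Nonlinear transform: map scalar x to z = (1, x, x^2, ..., x^degree).
    return np.array([x ** k for k in range(degree + 1)])

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=20)            # 1. original data in X-space
y = np.sign(X ** 3 - 0.3 * X)              #    labels from an assumed toy target

Z = np.array([phi(x) for x in X])          # 2. transform: z_n = Phi(x_n) in Z-space
w_tilde, *_ = np.linalg.lstsq(Z, y, rcond=None)  # 3. linear fit in Z-space

def g(x):
    # 4. classify in X-space: g(x) = sign(w~' Phi(x))
    return np.sign(w_tilde @ phi(x))
```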
Recap: Digits Data, "1" Versus "All"

[Two scatter plots of symmetry vs. average intensity, showing the linear and the 3rd order polynomial decision boundaries.]

Linear model:               E_in = 2.13%, E_out = 2.38%
3rd order polynomial model: E_in = 1.75%, E_out = 1.87%
Superstitions: Myth or Reality?

• Paraskevedekatriaphobia: fear of Friday the 13th. Are future Friday the 13ths really more dangerous?
• OCD [medical journal, citation lost, can you find it?]: a subject performs an action that happens to lead to a good outcome and generalizes it as cause and effect, concluding that the action will always give good results. Having overfit the data, the subject compulsively engages in that activity.

Humans are overfitting machines, very good at "finding coincidences".
An Illustration of Overfitting on a Simple Example

[Plot: a simple quadratic target f, 5 noisy data points, and the 4th order polynomial fit passing exactly through them.]

• Simple quadratic target f.
• 5 data points with a little noise (measurement error).
• 5 data points → a 4th order polynomial fit.

Classic overfitting: a simple target with an excessively complex H. E_in ≈ 0, yet E_out ≫ 0. The noise did us in. (Why?)
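A sketch that reproduces the effect under assumed specifics (quadratic f(x) = x² and noise level 0.1; the lecture's exact numbers are not given):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x ** 2                        # assumed simple quadratic target
X = rng.uniform(-1, 1, 5)                   # 5 data points
y = f(X) + 0.1 * rng.standard_normal(5)     # a little measurement noise

w4 = np.polyfit(X, y, 4)                    # 4th order fit: 5 coefficients, 5 points

x_grid = np.linspace(-1, 1, 1000)
E_in = np.mean((np.polyval(w4, X) - y) ** 2)                # ~0: exact interpolation
E_out = np.mean((np.polyval(w4, x_grid) - f(x_grid)) ** 2)  # >> 0: we fit the noise
print(E_in, E_out)
```

With 5 points and 5 free coefficients, the fit interpolates the data exactly, so it reproduces the noise along with the signal.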
What is Overfitting?

Fitting the data more than is warranted.
Overfitting is Not Just Bad Generalization

[Plot: error vs. VC dimension d_vc; the gap between the out-of-sample and in-sample error curves is bad generalization.]

VC analysis covers bad generalization, but with lots of slack: the VC bound is loose.
Overfitting is Not Just Bad Generalization

[Same plot: past some d_vc, the in-sample error keeps falling while the out-of-sample error rises; that regime is overfitting.]

Overfitting: going for lower and lower E_in results in higher and higher E_out.
Case Study: 2nd vs. 10th Order Polynomial Fit

[Two data sets: a 10th order target f sampled with noise, and a 50th order target f sampled with no noise.]

H_2: 2nd order polynomial fit, a special case of a linear model with feature transform x ↦ (1, x, x², ...).
H_10: 10th order polynomial fit.

Which model do you pick for which problem, and why?
Case Study: 2nd vs. 10th Order Polynomial Fit

[Plots of the 2nd and 10th order fits on both data sets.]

            simple noisy target          complex noiseless target
            2nd order    10th order      2nd order    10th order
E_in        0.050        0.034           0.029        10⁻⁵
E_out       0.127        9.00            0.120        7680

Go figure: the simpler H is better even for the more complex target with no noise.
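A sketch of this experiment; the random 10th order target, noise level, and N = 15 are assumptions for illustration, and the lecture's actual setup (e.g., normalized Legendre polynomial targets) will give different numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
coeffs = rng.standard_normal(11)            # assumed random 10th order target
f = lambda x: np.polyval(coeffs, x)

N = 15
X = rng.uniform(-1, 1, N)
y = f(X) + 0.5 * rng.standard_normal(N)     # noisy observations

x_grid = np.linspace(-1, 1, 2000)
for q in (2, 10):                           # H_2 and H_10
    w = np.polyfit(X, y, q)
    E_in = np.mean((np.polyval(w, X) - y) ** 2)
    E_out = np.mean((np.polyval(w, x_grid) - f(x_grid)) ** 2)
    print(f"H_{q}: E_in = {E_in:.4f}, E_out = {E_out:.4f}")
```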
Is There Really "No Noise" with the Complex f?

[Left: simple f with noise. Right: complex f with no noise. Looking only at the data points, the two cases are hard to tell apart.]

H should match the quantity and quality of the data, not f.
When is H_2 Better than H_10?

[Learning curves for H_2 and H_10: expected E_in and E_out vs. the number of data points N.]

Overfitting: the regime where E_out(H_10) > E_out(H_2).
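Learning curves like these can be estimated by averaging over many independently drawn data sets; a sketch with an assumed sine target and noise level:

```python
import numpy as np

def learning_curve(q, Ns, trials=500, sigma=0.5, seed=3):
    # Average E_in and E_out of a degree-q polynomial fit as a function of N.
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(np.pi * x)         # stand-in target (assumption)
    x_grid = np.linspace(-1, 1, 1000)
    curves = []
    for N in Ns:                            # keep N > q so the fit is determined
        e_in = e_out = 0.0
        for _ in range(trials):
            X = rng.uniform(-1, 1, N)
            y = f(X) + sigma * rng.standard_normal(N)
            w = np.polyfit(X, y, q)
            e_in += np.mean((np.polyval(w, X) - y) ** 2)
            e_out += np.mean((np.polyval(w, x_grid) - f(x_grid)) ** 2)
        curves.append((e_in / trials, e_out / trials))
    return curves

print(learning_curve(2, [15, 30, 60, 120]))   # H_2
print(learning_curve(10, [15, 30, 60, 120]))  # H_10: worse E_out at small N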
Overfit Measure: E_out(H_10) − E_out(H_2)

[Color plot of the overfit measure (scale −0.2 to 0.2) as a function of the number of data points N (80 to 120) and the stochastic noise level σ² (0 to 2).]
Overfit Measure: E_out(H_10) − E_out(H_2)

[Two color plots of the overfit measure (scale −0.2 to 0.2) vs. N: one against the noise level σ² (0 to 2), one against the target complexity Q_f (0 to 100).]

Number of data points ↑  ⇒  overfitting ↓
Noise ↑                  ⇒  overfitting ↑
Target complexity ↑      ⇒  overfitting ↑
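The overfit measure itself can be estimated by Monte Carlo; a sketch, again with an assumed random-polynomial target in place of the lecture's setup:

```python
import numpy as np

def overfit_measure(N, sigma, trials=200, seed=4):
    # Monte Carlo estimate of E_out(H_10) - E_out(H_2) at a given N and noise level.
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(-1, 1, 1000)
    total = 0.0
    for _ in range(trials):
        c = rng.standard_normal(11)          # fresh random 10th order target (assumption)
        target = np.polyval(c, x_grid)
        X = rng.uniform(-1, 1, N)
        y = np.polyval(c, X) + sigma * rng.standard_normal(N)
        e2 = np.mean((np.polyval(np.polyfit(X, y, 2), x_grid) - target) ** 2)
        e10 = np.mean((np.polyval(np.polyfit(X, y, 10), x_grid) - target) ** 2)
        total += e10 - e2
    return total / trials                    # > 0 means H_10 overfits relative to H_2

print(overfit_measure(N=20, sigma=1.0))      # few points, noisy: expect > 0
print(overfit_measure(N=120, sigma=1.0))     # more data: the measure shrinks
```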
Noise

That part of y we cannot model. It has two sources...
Stochastic Noise: Data Error

We would like to learn from y_n = f(x_n). Unfortunately, we only observe

  y_n = f(x_n) + stochastic noise,

and no one can model this noise.

Stochastic noise: fluctuations/measurement errors we cannot model.
Deterministic Noise: Model Error

Let h* be the best approximation to f in H. We would like to learn from y_n = h*(x_n). Unfortunately, we observe y_n = f(x_n), i.e.

  y_n = h*(x_n) + deterministic noise,

and H cannot model this part.

Deterministic noise: the part of f we cannot model.
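A sketch of computing h* and the deterministic noise directly, assuming a sine stand-in for the complex f, H_2 as the model, and a uniform input distribution:

```python
import numpy as np

f = lambda x: np.sin(np.pi * x)              # assumed complex target

x = np.linspace(-1, 1, 10001)                # dense grid approximates E_x[.]
Z = np.vander(x, 3, increasing=True)         # H_2 features: (1, x, x^2)
w_star, *_ = np.linalg.lstsq(Z, f(x), rcond=None)  # h* = argmin over H_2 of E[(h - f)^2]
h_star = Z @ w_star

det_noise = f(x) - h_star                    # the part of f that H_2 cannot model
print("deterministic noise energy:", np.mean(det_noise ** 2))
```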
Stochastic & Deterministic Noise Hurt Learning Stochastic Noise Deterministic Noise f ( x ) h ∗ y y y = h ∗ ( x )+det. noise y = f ( x )+stoch. noise x x source: random measurement errors source: learner’s H cannot model f re-measure y n re-measure y n stochastic noise changes. deterministic noise the same. change H change H stochastic noise the same. deterministic noise changes. We have single D and fixed H so we cannot distinguish M Overfitting : 21 /25 � A c L Creator: Malik Magdon-Ismail Stochastic noise and bias - var − →
Noise and the Bias-Variance Decomposition

  y = f(x) + ε, where ε is the measurement error.

  E[E_out(x)] = E_{D,ε}[ (g(x) − f(x) − ε)² ]
              = E_{D,ε}[ (g(x) − f(x))² − 2(g(x) − f(x))ε + ε² ]
              = (bias + var)  +  0  +  σ².

The cross term averages to zero because ε has zero mean and is independent of the data set D.
Noise and the Bias-Variance Decomposition

  y = f(x) + ε, where ε is the measurement error.

  E[E_out(x)] = σ² + bias + var

σ² is the stochastic noise, bias is the deterministic noise, and var is the indirect impact of the noise.
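The decomposition can be checked numerically; a sketch with an assumed sine target, linear fits, and Gaussian measurement noise:

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(np.pi * x)              # assumed target
sigma, N, trials = 0.3, 10, 2000             # assumed noise level and data size
x_grid = np.linspace(-1, 1, 500)

preds = np.empty((trials, x_grid.size))
for t in range(trials):
    X = rng.uniform(-1, 1, N)
    y = f(X) + sigma * rng.standard_normal(N)
    preds[t] = np.polyval(np.polyfit(X, y, 1), x_grid)  # g for this data set D

g_bar = preds.mean(axis=0)                   # average hypothesis
bias = np.mean((g_bar - f(x_grid)) ** 2)
var = np.mean((preds - g_bar) ** 2)
print("sigma^2 + bias + var =", sigma ** 2 + bias + var)  # ≈ E[E_out]
```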
Noise is the Culprit

Overfitting is the disease; noise is the cause. Learning is led astray by fitting the noise more than the signal.

Cures:
• Regularization: putting on the brakes.
• Validation: a reality check from peeking at E_out (the bottom line).
Regularization

[Two plots of data, target, and fit: without regularization (the wild 4th order fit from before) and with regularization (a much tamer fit, close to the target).]

No regularization → regularization!
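As a teaser of what regularization does mechanically, a sketch of weight decay (ridge regression) on the earlier toy setup; the penalty λ is an assumed knob, and λ = 0 recovers the unregularized, overfitting fit:

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: x ** 2                         # same toy setup as the earlier illustration
X = rng.uniform(-1, 1, 5)
y = f(X) + 0.1 * rng.standard_normal(5)

Z = np.vander(X, 5, increasing=True)         # 4th order features (1, x, ..., x^4)
lam = 1e-2                                   # assumed regularization strength
w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(5), Z.T @ y)  # (Z'Z + lam*I)^-1 Z'y
w_unreg = np.polyfit(X, y, 4)                # lam = 0: the wild, overfitting fit
```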