Bias-variance trade-off. Crossvalidation. Regularization. Petr Pošík (PowerPoint presentation transcript)


  1. CZECH TECHNICAL UNIVERSITY IN PRAGUE, Faculty of Electrical Engineering, Department of Cybernetics
     Bias-variance trade-off. Crossvalidation. Regularization.
     Petr Pošík
     P. Pošík © 2015, Artificial Intelligence – 1 / 13

  2. How to evaluate a predictive model?

  3. Model evaluation
     Fundamental question: What is a good measure of “model quality” from the machine-learning standpoint?
     ■ We have various measures of model error:
       ■ For regression tasks: MSE, MAE, ...
       ■ For classification tasks: misclassification rate, measures based on the confusion matrix, ...
     ■ Some of them can be regarded as finite approximations of the Bayes risk.
     ■ Are these functions good approximations when measured on the data the models were trained on?
     [Figure: two panels. Left: the candidate models f(x) = x and f(x) = x^3 − 3x^2 + 3x plotted over the same training points. Right: a fitted linear model f(x) = −0.09 + 0.99x and a fitted cubic model f(x) = 0.00 − 0.31x + 1.67x^2 − 0.51x^3 on the same data.]
     Using MSE only, both models (left panel) are equivalent!!!
     Using MSE only, the cubic model (right panel) is better than linear!!!
     A basic method of evaluation is model validation on a different, independent data set from the same source, i.e. on testing data (a small sketch of this step follows below).
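
A minimal Python sketch of this point (not part of the original slides). The concrete training points x = 0, 1, 2 and the testing points are assumptions, chosen so that the two models from the left panel coincide on the training data; the true relation is assumed to be y = x.

    import numpy as np

    def mse(y_true, y_pred):
        # Mean squared error: a finite approximation of the squared-loss Bayes risk.
        return float(np.mean((y_true - y_pred) ** 2))

    h_lin = lambda x: x                        # f(x) = x
    h_cub = lambda x: x**3 - 3*x**2 + 3*x      # f(x) = x^3 - 3x^2 + 3x

    x_tr = np.array([0.0, 1.0, 2.0])           # training inputs (assumed)
    y_tr = x_tr                                # both hypotheses pass exactly through these points
    print(mse(y_tr, h_lin(x_tr)), mse(y_tr, h_cub(x_tr)))      # 0.0 0.0 -> indistinguishable

    x_tst = np.array([-0.5, 0.5, 1.5, 2.5])    # independent testing inputs from the same range
    y_tst = x_tst
    print(mse(y_tst, h_lin(x_tst)), mse(y_tst, h_cub(x_tst)))  # 0.0 vs. a clearly nonzero error

On the training data the two hypotheses are indistinguishable by MSE; only the held-out testing data reveals that the cubic one generalizes worse.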

  4. Validation on testing data
     Example: Polynomial regression with varying degree: X ∼ U(−1, 3), Y ∼ X^2 + N(0, 1).
     [Figure: six panels, each showing the training data, the testing data and the fitted polynomial:]
       Polynomial degree 0: training error 8.319, testing error 6.901
       Polynomial degree 1: training error 2.013, testing error 2.841
       Polynomial degree 2: training error 0.647, testing error 0.925
       Polynomial degree 3: training error 0.645, testing error 0.919
       Polynomial degree 5: training error 0.611, testing error 0.979
       Polynomial degree 9: training error 0.545, testing error 1.067
     (A rough re-creation of this experiment in code follows below.)
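
A rough re-creation of the experiment above, not from the slides: the training/testing sample sizes and the random seed are assumptions, so the exact error values will differ from those in the figure.

    import numpy as np

    rng = np.random.default_rng(0)                 # seed chosen arbitrarily (assumption)

    def sample(n):
        x = rng.uniform(-1.0, 3.0, size=n)         # X ~ U(-1, 3)
        y = x**2 + rng.normal(0.0, 1.0, size=n)    # Y ~ X^2 + N(0, 1)
        return x, y

    x_tr, y_tr = sample(20)                        # training set (size assumed)
    x_tst, y_tst = sample(100)                     # independent testing set from the same source

    test_error = {}
    for deg in (0, 1, 2, 3, 5, 9):
        coeffs = np.polyfit(x_tr, y_tr, deg)       # least-squares polynomial fit of the given degree
        tr = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
        tst = np.mean((y_tst - np.polyval(coeffs, x_tst)) ** 2)
        test_error[deg] = tst
        print(f"degree {deg}: training MSE {tr:.3f}, testing MSE {tst:.3f}")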

  5. Training and testing error
     [Figure: training error and testing error (MSE) plotted against the polynomial degree (0–10). Slide sidebar with the lecture outline: How to evaluate a predictive model? (Model evaluation, Training and testing error, Overfitting, Bias vs Variance, Crossvalidation, How to determine a suitable model flexibility, How to prevent overfitting?); Regularization.]
     ■ The training error decreases with increasing model flexibility.
     ■ The testing error is minimal for a certain degree of model flexibility (picking that degree is sketched below).
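
Continuing the sketch after slide 4 (it reuses the test_error dictionary built there), the suitable flexibility is simply the degree whose testing error is minimal; on data generated from X^2 this should typically come out as 2.

    best_degree = min(test_error, key=test_error.get)   # degree with the smallest testing MSE
    print("degree with minimal testing error:", best_degree)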

  6. Overfitting
     Definition of overfitting:
     ■ Let H be a hypothesis space.
     ■ Let h ∈ H and h′ ∈ H be two different hypotheses from this space.
     ■ Let Err_Tr(h) be an error of the hypothesis h measured on the training dataset (training error).
     ■ Let Err_Tst(h) be an error of the hypothesis h measured on the testing dataset (testing error).
     ■ We say that h is overfitted if there is another h′ for which
       Err_Tr(h) < Err_Tr(h′) ∧ Err_Tst(h) > Err_Tst(h′)
     ■ “When overfitted, the model works well for the training data, but fails for new (testing) data.”
     ■ Overfitting is a general phenomenon affecting all kinds of inductive learning.
     [Figure: model error vs. model flexibility, with one curve for training data and one for testing data.]
     We want models and learning algorithms with a good generalization ability, i.e.
     ■ we want models that encode only the patterns valid in the whole domain, not those that learned the specifics of the training data,
     ■ we want algorithms able to find only the patterns valid in the whole domain and ignore the specifics of the training data.
     (The definition is transcribed into code below.)
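
A direct transcription of the definition into Python (the helper name is ours, not from the slides); the numbers in the example are the degree-9 and degree-2 errors from the earlier figure.

    def is_overfitted(err_tr_h, err_tst_h, other_hypotheses):
        # h is overfitted if some other hypothesis h' has a larger training error
        # but a smaller testing error: Err_Tr(h) < Err_Tr(h') and Err_Tst(h) > Err_Tst(h').
        return any(err_tr_h < err_tr and err_tst_h > err_tst
                   for err_tr, err_tst in other_hypotheses)

    # Degree-9 polynomial (train 0.545, test 1.067) vs. degree-2 polynomial (train 0.647, test 0.925):
    print(is_overfitted(0.545, 1.067, [(0.647, 0.925)]))   # True -> the degree-9 model is overfitted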

  7. Bias vs Variance
     [Figure: three panels from the earlier polynomial-regression example:]
       Polynomial degree 1: training error 2.013, testing error 2.841 – High bias: model not flexible enough (Underfit)
       Polynomial degree 2: training error 0.647, testing error 0.925 – “Just right” model (Good fit)
       Polynomial degree 9: training error 0.545, testing error 1.067 – High variance: model flexibility too high (Overfit)
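
The slide only names the two failure modes; as a reference point (not on the slide), the standard squared-error decomposition behind the terms, with f the true function, f̂_D the model fitted on a training set D, and σ² the noise variance, is:

    \mathbb{E}_{D,\varepsilon}\left[ \left( y - \hat{f}_D(x) \right)^2 \right]
      = \underbrace{\left( \mathbb{E}_D[\hat{f}_D(x)] - f(x) \right)^2}_{\text{bias}^2}
      + \underbrace{\mathbb{E}_D\left[ \left( \hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)] \right)^2 \right]}_{\text{variance}}
      + \underbrace{\sigma^2}_{\text{noise}}

An inflexible model (degree 1) has a large bias term; a too-flexible model (degree 9) has a small bias but a large variance term, which is why its testing error rises again.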
