Holdout and Cross-Validation




  1. Holdout and Cross-Validation Methods for Overfitting Avoidance
  - Decision trees
    - Reduced-error pruning
    - Cost-complexity pruning
  - Neural networks
    - Early stopping
    - Adjusting regularizers via cross-validation
  - Nearest neighbor
    - Choose the number of neighbors
  - Support vector machines
    - Choose C
    - Choose σ for Gaussian kernels

  2. Reduced-Error Pruning
  Given a data set S:
  - Subdivide S into S_train and S_dev
  - Build a tree using S_train
  - Pass all of the S_dev examples through the tree and estimate the error rate of each node using S_dev
  - Convert a node to a leaf if it would have lower estimated error than the sum of the errors of its children
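Reduced-error pruning is simple enough to sketch directly. The node layout below (a binary split per node, with a majority-class label stored at every node) is our own assumption, not something specified in the slides:

```python
# Sketch of reduced-error pruning on an assumed minimal tree representation.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # feature index tested at this node
        self.threshold = threshold  # split threshold for that feature
        self.left, self.right = left, right
        self.label = label          # majority-class prediction stored at this node

    def is_leaf(self):
        return self.left is None and self.right is None


def route_and_count(root, x, y, errors):
    """Send one dev example down the tree, counting a mistake at every node it visits."""
    node = root
    while True:
        errors[id(node)] = errors.get(id(node), 0) + int(node.label != y)
        if node.is_leaf():
            return
        node = node.left if x[node.feature] <= node.threshold else node.right


def reduced_error_prune(root, X_dev, y_dev):
    """Prune bottom-up: collapse a node when its own dev error beats its children's total."""
    errors = {}
    for x, y in zip(X_dev, y_dev):
        route_and_count(root, x, y, errors)

    def prune(node):
        if node.is_leaf():
            return errors.get(id(node), 0)
        subtree_error = prune(node.left) + prune(node.right)
        own_error = errors.get(id(node), 0)
        if own_error < subtree_error:   # lower estimated error as a leaf: convert it
            node.left = node.right = None
            return own_error
        return subtree_error

    prune(root)
    return root
```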

  3. Reduced-Error Pruning Example (figure only)

  4. Cost-Complexity Pruning
  The CART system (Breiman et al., 1984) employs cost-complexity pruning:

      J(Tree, S) = ErrorRate(Tree, S) + α |Tree|

  where |Tree| is the number of nodes in the tree and α is a parameter that controls the tradeoff between the error rate and the penalty. α is set by cross-validation.
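Read literally, the objective is easy to compute for a fitted tree. A minimal sketch using scikit-learn's node count for |Tree|, following the slide's definition (CART and scikit-learn's own pruner actually penalize the number of leaves, a minor variation):

```python
from sklearn.tree import DecisionTreeClassifier

def cost_complexity(tree: DecisionTreeClassifier, X, y, alpha: float) -> float:
    """J(Tree, S) = ErrorRate(Tree, S) + alpha * |Tree|, with |Tree| = number of nodes."""
    error_rate = 1.0 - tree.score(X, y)              # misclassification rate on S
    return error_rate + alpha * tree.tree_.node_count
```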

  5. Determining Important Values of α
  Goal: identify a finite set of candidate values for α, then evaluate them via cross-validation.
  - Set α_0 = 0; k = 0
  - Train on S to produce tree T
  - Repeat until T is completely pruned:
    - determine the next larger value α_{k+1} that would cause a node to be pruned from T
    - prune this node
    - k := k + 1
  This can be done efficiently.
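scikit-learn implements this enumeration (weakest-link pruning) directly; a sketch, using a built-in dataset as a stand-in for S:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)       # stand-in for the training set S

# Grow the full tree, then recover the increasing sequence of alpha values at
# which successive subtrees would be pruned away (starting from alpha_0 = 0).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
candidate_alphas = path.ccp_alphas               # finite candidate set for cross-validation
```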

  6. Choosing α by Cross-Validation
  - Divide S into 10 subsets S_0, …, S_9
  - In fold v:
    - Train a tree on the union of S_i for all i ≠ v
    - For each α_k, prune the tree to that level and measure the error rate on S_v
  - Compute ε_k as the average error rate over the 10 folds when α = α_k
  - Choose the α_k that minimizes ε_k; call it α* and let ε* be the corresponding error rate
  - Prune the original tree according to α*
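A sketch of the full procedure with scikit-learn, which refits a tree per α in each fold rather than pruning one tree to successive levels (equivalent for the purpose of choosing α*); the dataset is a stand-in:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)       # stand-in for S

alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

# epsilon_k: average error rate over 10 folds for each candidate alpha_k
cv_errors = [1.0 - cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                                   X, y, cv=10).mean()
             for a in alphas]

k_star = int(np.argmin(cv_errors))
alpha_star, eps_star = alphas[k_star], cv_errors[k_star]

# Prune the tree grown on all of S according to alpha*
final_tree = DecisionTreeClassifier(ccp_alpha=alpha_star, random_state=0).fit(X, y)
```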

  7. The 1-SE Rule for Setting α
  - Compute a confidence interval on ε* and let U be the upper bound of this interval
  - Choose the largest α_k (i.e., the smallest pruned tree) whose ε_k ≤ U. If we use Z = 1 for the confidence interval computation, this is called the 1-SE rule, because the bound is one "standard error" above ε*
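A sketch of the 1-SE selection on top of the previous step, keeping the per-fold error rates so a standard error can be computed (the dataset and variable names are ours):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)       # stand-in for S
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

# One row of 10 per-fold error rates for every candidate alpha_k
fold_errors = np.array([1.0 - cross_val_score(
    DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=10) for a in alphas])
eps = fold_errors.mean(axis=1)
se = fold_errors.std(axis=1, ddof=1) / np.sqrt(fold_errors.shape[1])

U = eps.min() + 1.0 * se[eps.argmin()]           # Z = 1: one standard error above eps*

# 1-SE rule: among the alphas whose CV error is within the bound, take the
# largest one, i.e. the most heavily pruned tree (alphas are sorted ascending).
alpha_1se = alphas[np.where(eps <= U)[0].max()]
```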

  8. Notes on Decision Tree Pruning
  - Cost-complexity pruning usually gives the best results in experimental studies
  - Pessimistic pruning is the most efficient (it does not require holdout or cross-validation), and it is quite robust
  - Reduced-error pruning is rarely used, because it consumes training data
  - Pruning is more important for regression trees than for classification trees
  - Pruning has relatively little effect for classification trees. There are only a small number of possible prunings of a tree, and usually the serious errors made by the tree-growing process (i.e., splitting on the wrong features) cannot be repaired by pruning.
    - Ensemble methods work much better than pruning

  9. Holdout Methods for Neural Networks
  - Early stopping using a development set
  - Adjusting regularizers using a development set or via cross-validation:
    - amount of weight decay
    - number of hidden units
    - learning rate
    - number of epochs
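As one concrete example of the second bullet, a sketch that tunes the weight-decay strength of scikit-learn's MLPClassifier (its `alpha` parameter is an L2 penalty) on a development split; the candidate grid and network size are our assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)                 # stand-in dataset
X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, test_size=0.25, random_state=0)

best_decay, best_err = None, float("inf")
for decay in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]:               # candidate weight-decay values
    net = MLPClassifier(hidden_layer_sizes=(20,), alpha=decay,
                        max_iter=2000, random_state=0).fit(X_tr, y_tr)
    err = 1.0 - net.score(X_dev, y_dev)                    # error rate on the development set
    if err < best_err:
        best_decay, best_err = decay, err
```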

  10. Early Stopping Using an Evaluation Set
  (figure: development-set and test-set error versus training epochs)
  - Split S into S_train and S_dev
  - Train on S_train; after every epoch, evaluate on S_dev. If the error rate is the best observed so far, save the weights
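A minimal sketch of the loop, assuming hypothetical helpers `train_one_epoch`, `dev_error`, and `get_weights` that are not defined in the slides:

```python
import copy

def early_stopping_train(model, S_train, S_dev, max_epochs=200):
    """Train on S_train, checkpointing whenever the dev-set error reaches a new best."""
    best_err, best_weights, best_epoch = float("inf"), None, 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model, S_train)              # hypothetical: one pass over S_train
        err = dev_error(model, S_dev)                # hypothetical: error rate on S_dev
        if err < best_err:                           # best observed so far: save the weights
            best_err, best_epoch = err, epoch
            best_weights = copy.deepcopy(get_weights(model))   # hypothetical weight accessor
    return best_weights, best_epoch, best_err
```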

  11. Reconstituted Early Stopping
  - Recombine S_train and S_dev to produce S
  - Train on S and stop at the point (number of epochs or mean squared error) identified using S_dev
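Continuing the early-stopping sketch above (same hypothetical helpers), the reconstituted variant retrains on the recombined data for the epoch count chosen on S_dev:

```python
def reconstituted_early_stopping(model, S_train, S_dev, best_epoch):
    """Retrain on all of S = S_train + S_dev for the number of epochs chosen on S_dev."""
    S = list(S_train) + list(S_dev)                  # recombine the two splits
    for _ in range(best_epoch):
        train_one_epoch(model, S)                    # hypothetical: one pass over S
    return model
```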

  12. Reconstituted Early Stopping
  (figure: development-set and test-set error versus training epochs)
  - We can stop either when the MSE on the training set matches the predicted optimal MSE or when the number of epochs matches the predicted optimal number of epochs
  - Experimental studies show little or no advantage for reconstituted early stopping. Most people just use simple holdout

  13. Nearest Neighbor: Choosing k
  (figure: error rate versus k for the development set, the test set, and LOOCV)
  - k = 9 gives the best performance on the development set and on the test set; k = 13 gives the best performance based on leave-one-out cross-validation
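A sketch of the leave-one-out selection of k with scikit-learn (the dataset and the candidate range of k are stand-ins):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)                 # stand-in dataset

ks = list(range(1, 22, 2))                                 # odd values of k to avoid ties
loocv_errors = [1.0 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                      X, y, cv=LeaveOneOut()).mean()
                for k in ks]
best_k = ks[int(np.argmin(loocv_errors))]
```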

  14. SVM: Choosing C and σ
  (BR data set; 100 examples; Valentini 2003)

  15. 20% Label Noise (figure only)

  16. BR Data Set: Varying σ for Fixed C (figure only)
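A sketch of the corresponding grid search with scikit-learn's RBF-kernel SVC. Note that scikit-learn parameterizes the Gaussian kernel by gamma, where gamma = 1 / (2σ²); the dataset and the grid values here are stand-ins:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)                 # stand-in dataset

param_grid = {"C": [0.1, 1, 10, 100, 1000],                # candidate values of C
              "gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1]}        # gamma = 1 / (2 * sigma**2)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10).fit(X, y)
best_C, best_gamma = search.best_params_["C"], search.best_params_["gamma"]
cv_error = 1.0 - search.best_score_                        # cross-validated error at the best setting
```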

  17. Summary
  - Holdout methods are the best way to choose a classifier:
    - Reduced-error pruning for trees
    - Early stopping for neural networks
  - Cross-validation methods are the best way to set a regularization parameter:
    - Cost-complexity pruning parameter α
    - Neural network weight decay setting
    - Number k of nearest neighbors in k-NN
    - C and σ for SVMs
