Designing ML Experiments
Steven J Zeil
Old Dominion Univ.
Fall 2010

Introduction
Questions:
Assessment of the expected error of a learning algorithm: is the error rate of 1-NN less than 2%?
Comparing the expected errors of two algorithms: is k-NN more accurate than MLP?
Training/validation/test sets

Outline
1 Introduction
2 Training
  Response Surface Design
  Cross-Validation & Resampling
3 Measuring Classifier Performance
4 Comparing Classifiers
  Comparing Two Classifiers
  Comparing Multiple Classifiers
  Comparing Over Multiple Datasets

Algorithm Preference
Criteria (application-dependent):
Misclassification error, or risk (loss functions)
Training time/space complexity
Testing time/space complexity
Interpretability
Easy programmability
Factors and Response
Select the desired response function, measuring the desired criteria (from the preference criteria above).

Response Surface Design
For approximating and maximizing the response function in terms of the controllable factors.

Resampling and K-Fold Cross-Validation
The need for multiple training/validation sets: {T_i, V_i}_i are the training/validation sets of fold i.
K-fold cross-validation: divide X into k parts X_1, ..., X_k:
V_1 = X_1, T_1 = X_2 ∪ X_3 ∪ ... ∪ X_k = X − X_1
V_2 = X_2, T_2 = X_1 ∪ X_3 ∪ ... ∪ X_k = X − X_2
...
V_k = X_k, T_k = X − X_k
Any two training sets T_i, T_j share k − 2 of the k parts.
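To make the fold construction concrete, here is a minimal NumPy sketch (not from the original slides) of building the K pairs (T_i, V_i); the function name k_fold_splits and the shuffle-then-partition approach are illustrative assumptions.

import numpy as np

def k_fold_splits(X, k, seed=0):
    """Partition X into k parts X_1..X_k; fold i uses V_i = X_i and T_i = X - X_i."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))        # shuffle indices before partitioning
    parts = np.array_split(idx, k)       # the k disjoint parts X_i (as index arrays)
    folds = []
    for i in range(k):
        v_idx = parts[i]                                   # V_i = X_i
        t_idx = np.concatenate(parts[:i] + parts[i + 1:])  # T_i = X - X_i
        folds.append((X[t_idx], X[v_idx]))
    return folds

# Example: 10-fold split of 100 instances; any two training sets share k - 2 parts.
folds = k_fold_splits(np.arange(100), k=10)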
5x2 Cross-Validation
Perform 2-fold cross-validation 5 times, using 5 different divisions of the data into halves X_1^(i) and X_2^(i):
V_1 = X_1^(1), T_1 = X_2^(1)
V_2 = X_1^(2), T_2 = X_2^(2)
V_3 = X_1^(3), T_3 = X_2^(3)
V_4 = X_1^(4), T_4 = X_2^(4)
V_5 = X_1^(5), T_5 = X_2^(5)

Measuring Classifier Performance
Confusion matrix for a two-class ("Yes"/"No") problem:

                  Predicted Yes           Predicted No
True Yes          TP: true positive       FN: false negative
True No           FP: false positive      TN: true negative

Error rate = # of errors / # of instances = (FN + FP) / N
Recall = # of found positives / # of positives = TP / (TP + FN) = sensitivity = hit rate
Precision = # of found positives / # of found = TP / (TP + FP)
Specificity = TN / (TN + FP)
False alarm rate = FP / (FP + TN) = 1 − Specificity

Receiver Operating Characteristics
(Figure: ROC curve.)
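As a hedged illustration of the measures above, here is a short NumPy sketch, assuming binary 0/1 labels with 1 as the positive ("Yes") class; the function names are illustrative, not from the slides.

import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary 0/1 labels, with 1 the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp, fn, fp, tn

def performance_measures(y_true, y_pred):
    """Compute the measures above from the confusion-matrix counts."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    n = tp + fn + fp + tn
    return {
        "error rate":       (fn + fp) / n,
        "recall":           tp / (tp + fn),   # sensitivity / hit rate
        "precision":        tp / (tp + fp),
        "specificity":      tn / (tn + fp),
        "false alarm rate": fp / (fp + tn),   # = 1 - specificity
    }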
McNemar's Test
Single training/validation set; H_0: µ_0 = µ_1 (the two classifiers have the same expected error).
e_00: number of examples misclassified by both
e_01: number of examples misclassified by 1 but not by 2
e_10: number of examples misclassified by 2 but not by 1
e_11: number of examples correctly classified by both
Under H_0 we expect e_01 = e_10, and
(|e_01 − e_10| − 1)² / (e_01 + e_10) ∼ χ²_1
Accept H_0 with confidence 1 − α if the statistic is < χ²_{α,1}.

K-fold Cross-Validated Paired t Test
Use K-fold cross-validation to get K training/validation folds.
p_i^1, p_i^2: errors of classifiers 1 and 2 on fold i
p_i = p_i^1 − p_i^2: paired difference on fold i
H_0: p_i has mean 0
m = (1/K) Σ_{i=1}^K p_i,   s² = Σ_{i=1}^K (p_i − m)² / (K − 1)
√K · m / s ∼ t_{K−1}
Accept H_0 if the statistic is in (−t_{α/2,K−1}, t_{α/2,K−1}).

Comparing L > 2 Classifiers
Analysis of variance (ANOVA)
H_0: µ_1 = µ_2 = ... = µ_L
Errors of L algorithms on K folds: X_ij ∼ N(µ_j, σ²), j = 1, ..., L, i = 1, ..., K

Comparing L > 2 Classifiers (cont.)
ANOVA constructs two estimates of σ² (with m_j the mean error of algorithm j over the K folds and m the overall mean):
If H_0 is true, σ̂²_b = K Σ_{j=1}^L (m_j − m)² / (L − 1), and SS_b / σ² ∼ χ²_{L−1}, where SS_b = K Σ_j (m_j − m)².
Regardless of the truth of H_0, σ̂²_w = Σ_j Σ_i (X_ij − m_j)² / (L(K − 1)), and SS_w / σ² ∼ χ²_{L(K−1)}, where SS_w = Σ_j Σ_i (X_ij − m_j)².
Under H_0, σ̂²_b / σ̂²_w ∼ F_{L−1, L(K−1)}.
Accept H_0 if the ratio is < F_{α, L−1, L(K−1)}.
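The following sketch implements the three tests above using NumPy and SciPy critical values; the function names and the convention of passing per-fold error rates as arrays are assumptions for illustration, not part of the slides.

import numpy as np
from scipy import stats

def mcnemar_test(e01, e10, alpha=0.05):
    """McNemar's test on a single validation set.
    e01: examples misclassified by classifier 1 but not 2; e10: the reverse.
    Returns the chi-square statistic and True if H_0 (equal error) is accepted."""
    statistic = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
    return statistic, statistic < stats.chi2.ppf(1 - alpha, df=1)

def cv_paired_t_test(p1, p2, alpha=0.05):
    """K-fold cross-validated paired t test.
    p1, p2: per-fold error rates of classifiers 1 and 2 (length-K arrays)."""
    p = np.asarray(p1) - np.asarray(p2)          # paired differences p_i
    K = len(p)
    m = p.mean()
    s = np.sqrt(((p - m) ** 2).sum() / (K - 1))
    t_statistic = np.sqrt(K) * m / s
    return t_statistic, abs(t_statistic) < stats.t.ppf(1 - alpha / 2, df=K - 1)

def anova_f_test(errors, alpha=0.05):
    """One-way ANOVA over a K x L matrix of errors (K folds, L algorithms)."""
    X = np.asarray(errors)
    K, L = X.shape
    m_j = X.mean(axis=0)                         # per-algorithm means
    m = X.mean()                                 # overall mean
    ss_b = K * ((m_j - m) ** 2).sum()            # between-algorithm sum of squares
    ss_w = ((X - m_j) ** 2).sum()                # within-algorithm sum of squares
    f_statistic = (ss_b / (L - 1)) / (ss_w / (L * (K - 1)))
    return f_statistic, f_statistic < stats.f.ppf(1 - alpha, dfn=L - 1, dfd=L * (K - 1))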
Comparing Over Multiple Datasets
Comparing two algorithms:
Sign test: count how many times A beats B over N datasets, and check whether this could have happened by chance if A and B really had the same error rate.
Comparing multiple algorithms:
Kruskal-Wallis test: calculate the average rank of each algorithm over the N datasets, and check whether these could have arisen by chance if all algorithms had equal error.
If Kruskal-Wallis rejects, we do pairwise post hoc tests to find which algorithms have a significant rank difference.
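A minimal sketch of the two dataset-level comparisons, assuming per-dataset error rates are available as arrays; the helper names are illustrative, and SciPy's kruskal is used for the rank-based test.

import numpy as np
from scipy import stats

def sign_test(err_a, err_b, alpha=0.05):
    """Count how often A beats B over N datasets and test p = 1/2 (two-sided binomial)."""
    err_a, err_b = np.asarray(err_a), np.asarray(err_b)
    wins = int(np.sum(err_a < err_b))
    losses = int(np.sum(err_a > err_b))
    n = wins + losses                    # ties are dropped
    p_value = min(1.0, 2 * stats.binom.cdf(min(wins, losses), n, 0.5))
    return p_value, p_value >= alpha     # accept H_0 (same error rate) if p >= alpha

def kruskal_wallis(*per_algorithm_errors, alpha=0.05):
    """Rank-based test that all algorithms have equal error over the N datasets."""
    h_statistic, p_value = stats.kruskal(*per_algorithm_errors)
    return p_value, p_value >= alpha     # if rejected, follow with pairwise post hoc tests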