STAT 339: Evaluating a Classifier
3 February 2017
Colin Reimer Dawson
Questions/Administrative Business?
◮ Everyone enrolled who intends to be?
◮ Any technical difficulties?
◮ Anything else?
Outline
◮ Evaluating a Supervised Learning Method
◮ Classification Performance
◮ Validation and Test Sets
Types of Learning
◮ Supervised Learning: learning to make predictions when you have many examples of "correct answers"
  ◮ Classification: the answer is a category / label
  ◮ Regression: the answer is a number
◮ Unsupervised Learning: finding structure in unlabeled data
◮ Reinforcement Learning: finding actions that maximize long-run reward (not part of this course)
Classification and Regression
If t is a categorical output, then we are doing classification.
If t is a quantitative output, then we are doing regression.
NB: "Logistic regression" is really a classification method, in this taxonomy.
K-Nearest Neighbors Algorithm
1. Given a training set $\mathcal{D} = \{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$, a test point $\mathbf{x}$, and a distance function $d$, compute the distances $d_n = d(\mathbf{x}, \mathbf{x}_n)$, $n = 1, \dots, N$.
2. Find the K "nearest neighbors" in $\mathcal{D}$ to $\mathbf{x}$ (the K training points with the smallest $d_n$).
3. Classify the test point based on a "plurality vote" of the K nearest neighbors.
4. In the event of a tie, apply a chosen tie-breaking procedure (e.g., choose the overall most frequent class, increase K, etc.).
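To make the steps concrete, here is a minimal sketch in base R (the function and variable names are illustrative, not from the course materials): it computes Euclidean distances, takes a plurality vote among the K nearest training points, and breaks ties at random.

```r
## Minimal KNN classifier sketch in base R (illustrative, not official course code).
## train_x: N x d matrix of features; train_t: factor of labels; x: one test point.
knn_classify <- function(train_x, train_t, x, K = 3) {
  # Step 1: Euclidean distance from the test point to every training point
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))
  # Step 2: indices of the K nearest neighbors
  nn <- order(d)[1:K]
  # Step 3: plurality vote among the neighbors' labels
  votes <- table(train_t[nn])
  winners <- names(votes)[votes == max(votes)]
  # Step 4: tie-breaking (here: pick one of the tied classes at random)
  sample(winners, 1)
}

## Example with the built-in iris data, using the two features from the figures
X <- as.matrix(iris[, c("Sepal.Width", "Sepal.Length")])
knn_classify(X, iris$Species, c(3.0, 6.0), K = 5)
```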
K-Nearest Neighbors for Iris Data
Figure: KNN decision regions for the Iris data, plotted over Sepal.Width (x-axis) and Sepal.Length (y-axis), for K = 1, 3, 5, 11, 21, and K = N.
Flexibility vs. Robustness
◮ Small K: highly flexible (can fit arbitrarily complex patterns in the data), but not robust (highly sensitive to noise and to idiosyncrasies of the particular sample).
◮ Larger K: mitigates sensitivity to noise, etc., but at the expense of flexibility.
Variants of KNN
◮ "Soft" KNN: Retain the vote share for each class, instead of simply taking the max, to do "soft" classification.
◮ "Kernel" KNN: Use a "kernel" function that decays with distance to weight the votes of the neighbors by their nearness (see the sketch below).
◮ Beyond R^d: KNN can be used for objects such as strings, trees, and graphs, simply by defining a suitable distance metric.
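A sketch of the kernel-weighted ("soft") variant, assuming a Gaussian kernel with bandwidth h; the kernel choice and all names are illustrative.

```r
## Kernel-weighted KNN voting (illustrative sketch): each of the K neighbors
## contributes a vote weighted by a Gaussian kernel of its distance, and the
## per-class weight totals give a "soft" classification (vote shares).
knn_kernel_vote <- function(train_x, train_t, x, K = 5, h = 1) {
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))
  nn <- order(d)[1:K]
  w <- exp(-(d[nn] / h)^2 / 2)            # kernel weights decay with distance
  scores <- tapply(w, train_t[nn], sum)   # total weight per class
  scores[is.na(scores)] <- 0              # classes absent among the neighbors
  scores / sum(scores)                    # vote shares ("soft" output)
}

X <- as.matrix(iris[, c("Sepal.Width", "Sepal.Length")])
knn_kernel_vote(X, iris$Species, c(3.0, 6.0), K = 5, h = 0.5)
```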
Choices to Make Using KNN
◮ What distance measure? (Euclidean (L2), Manhattan (L1), Chebyshev (L∞), edit distance (L0), ...) Always standardize your features (e.g., convert to z-scores) so the dimensions are on comparable scales when computing distances (see the sketch below).
◮ What value of K?
◮ What kernel (and what kernel parameters), if any?
◮ What tie-breaking procedure (if doing hard classification)?
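A small sketch of the standardization step, using R's built-in scale(); the key point is that test points must be standardized using the training means and standard deviations. The variable names and the test point are illustrative.

```r
## Standardize features (z-scores) before computing distances, so that no
## single feature dominates the Euclidean distance.
X <- as.matrix(iris[, c("Sepal.Width", "Sepal.Length")])
X_std <- scale(X)   # subtract column means, divide by column SDs

## A new point must be standardized with the *training* means and SDs:
x_new <- c(3.0, 6.0)
x_std <- (x_new - attr(X_std, "scaled:center")) / attr(X_std, "scaled:scale")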
Evaluating a Supervised Learning Method
Two Kinds of Evaluation
1. How do we select which free "parameters," like K or the kernel decay rate, are best?
2. How do we know how good a job our final method has done?
Two Choices to Be Made
1. How do we quantify performance?
2. What data do we use to measure performance?
Quantifying Classification Performance: Misclassification Rate
◮ One possible metric: the misclassification rate: what proportion of cases does the classifier get incorrect?
$$ \text{Misclassification Rate} = \frac{1}{N} \sum_{n=1}^{N} I(\hat{t}_n \neq t_n) $$
where $\hat{t}_n$ is the classifier's output for training point $n$, and $I(A)$ returns 1 if $A$ is true and 0 otherwise.
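In code, the misclassification rate is just the average of the 0/1 errors; a one-line sketch (names illustrative):

```r
## Proportion of predictions that disagree with the true labels.
misclass_rate <- function(t_hat, t_true) mean(t_hat != t_true)

misclass_rate(c("a", "b", "b"), c("a", "a", "b"))  # 1/3
```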
Other Classification Measures
For binary classification problems with an asymmetry between the classes (e.g., positive and negative instances), there are four possible outcomes:

                     Classification
                       +                 −
  Truth   +    True Positive     False Negative
          −    False Positive    True Negative

Table: Possible outcomes for a binary classifier

We can measure four component success rates:
$$ \text{Recall/Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Precision/Pos. Pred. Value} = \frac{TP}{TP + FP} $$
$$ \text{Specificity} = \frac{TN}{TN + FP} \qquad \text{Neg. Pred. Value} = \frac{TN}{TN + FN} $$
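A sketch that computes all four rates from predicted and true labels, assuming the classes are coded "+" and "-" (an illustrative convention, not one fixed by the slides):

```r
## Four component success rates from a 2x2 confusion matrix.
confusion_rates <- function(t_hat, t_true) {
  TP <- sum(t_hat == "+" & t_true == "+")
  FN <- sum(t_hat == "-" & t_true == "+")
  FP <- sum(t_hat == "+" & t_true == "-")
  TN <- sum(t_hat == "-" & t_true == "-")
  c(sensitivity = TP / (TP + FN),   # recall / true positive rate
    precision   = TP / (TP + FP),   # positive predictive value
    specificity = TN / (TN + FP),
    npv         = TN / (TN + FN))   # negative predictive value
}

t_true <- c("+", "+", "-", "-", "-", "+")
t_hat  <- c("+", "-", "-", "+", "-", "+")
confusion_rates(t_hat, t_true)
```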
F-measures
$$ F_1 = \left[ \frac{1}{2}\left( \frac{1}{\text{Recall}} + \frac{1}{\text{Precision}} \right) \right]^{-1} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} $$
$$ F_\beta = \left[ \frac{1}{1+\beta^2}\left( \frac{\beta^2}{\text{Recall}} + \frac{1}{\text{Precision}} \right) \right]^{-1} = \frac{(1+\beta^2) \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \beta^2 \cdot \text{Precision}} $$
$F_\beta$ aggregates recall (sensitivity / true positive rate) and precision (positive predictive value), with a "cost parameter" $\beta$ used to emphasize or de-emphasize recall.
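A direct translation of the closed form into a small R function (illustrative):

```r
## F_beta from recall and precision. beta = 1 gives the usual F1 score;
## beta > 1 emphasizes recall, beta < 1 emphasizes precision.
f_beta <- function(recall, precision, beta = 1) {
  (1 + beta^2) * recall * precision / (recall + beta^2 * precision)
}

f_beta(0.8, 0.6)            # F1
f_beta(0.8, 0.6, beta = 2)  # weights recall more heavily
```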
Receiver Operating Characteristic (ROC) Curve
Figure: Example of an ROC curve. As the classifier becomes more willing to say "+", both the true positive and false positive rates go up. Ideally, the false positive rate rises much more slowly (the curve hugs the upper left).
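One way to trace out an ROC curve is to sweep a decision threshold over the classifier's scores; a sketch, assuming higher scores mean "more positive" and classes coded "+"/"-" (names and data are illustrative):

```r
## For each threshold, classify as "+" when the score is at least that high,
## then record the false positive rate (x-axis) and true positive rate (y-axis).
roc_points <- function(score, t_true) {
  thresholds <- sort(unique(score), decreasing = TRUE)
  t(sapply(thresholds, function(th) {
    t_hat <- ifelse(score >= th, "+", "-")
    c(fpr = sum(t_hat == "+" & t_true == "-") / sum(t_true == "-"),
      tpr = sum(t_hat == "+" & t_true == "+") / sum(t_true == "+"))
  }))
}

score  <- c(0.9, 0.8, 0.7, 0.55, 0.4, 0.2)
t_true <- c("+", "+", "-", "+", "-", "-")
roc_points(score, t_true)
# plot(roc_points(score, t_true), type = "s", xlab = "FPR", ylab = "TPR")
```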
Overfitting and the Test Set
◮ Fitting and evaluating on the same data (for most evaluation metrics) results in overfitting.
◮ Overfitting occurs when a learning algorithm mistakes noise for signal and incorporates idiosyncrasies of the training set into its decision rule.
◮ To combat overfitting, use different data for evaluation than for fitting. This "held-out data" is called a test set (see the split sketched below).
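A sketch of randomly holding out a test set, using the iris data; the 70/30 split is an arbitrary illustrative choice.

```r
## Randomly hold out 30% of the cases as a test set; fit on the rest.
set.seed(1)
n <- nrow(iris)
test_idx <- sample(n, size = round(0.3 * n))  # indices of held-out cases
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]
# Fit the classifier on `train` only; report the evaluation metric on `test` only.
```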
Train vs. Test Error (KNN on Iris Data)
Figure: Training and test misclassification error for KNN on the Iris data as a function of K (K from 0 to 50).
Validation vs. Test Set
◮ If we still have decisions left to make, we should not look at the final test set. (Why not?)
◮ If we select the best version of our method by optimizing on the test set, then we have no measure of absolute performance: test set performance is overly optimistic because it is cherry-picked.
◮ Instead, take the training set and (randomly) subdivide it into a training set and a validation set. Use the training set to fit the classifier; use the validation set to evaluate it and guide "higher-order" decisions.
Validation vs. Test Error
Figure: Training, validation, and test error for KNN on the Iris data as a function of K (K from 0 to 50).
Drawbacks of the Simple Validation Approach
◮ Sacrificing training data degrades performance.
◮ If the validation set is too small, decisions will be based on noisy information.
◮ Partial solution: divide the training set into K equal parts, or "folds"; give each fold a chance to serve as the validation set, and average the generalization performance across folds.
◮ This yields "K-fold cross-validation" (note: this K is a completely separate choice from the K in KNN).
K-fold Cross-Validation Algorithm
A. For each method, M, under consideration:
  1. Divide the training set into K "folds" with (approximately) equal numbers of cases per fold. (Keep the test set "sealed.")
  2. For $k = 1, \dots, K$:
    (a) Designate fold k the "validation set" and folds $1, \dots, k-1, k+1, \dots, K$ the training set.
    (b) "Train" the algorithm on the training set to yield classification rule $c_k$, and compute the error rate, $\text{Err}_k$, on the validation set, e.g.
    $$ \text{Err}_k(M) = \frac{1}{|\text{Validation}|} \sum_{i \in \text{Validation}} I(c_k(\mathbf{x}_i) \neq t_i) $$
  3. Return the mean error rate across folds:
    $$ \overline{\text{Err}}(M) = \frac{1}{K} \sum_{k=1}^{K} \text{Err}_k(M) $$
B. Select the M with the lowest $\overline{\text{Err}}(M)$.
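A sketch of the full procedure for choosing the number of neighbors in KNN, assuming the `class` package (a recommended package that ships with R) is available for its knn() function; fold assignment and the candidate K values are illustrative.

```r
## K-fold cross-validation error for KNN with a given number of neighbors.
library(class)

cv_error <- function(X, t, K_knn, n_folds = 10) {
  n <- nrow(X)
  fold <- sample(rep(1:n_folds, length.out = n))   # random, roughly equal folds
  errs <- sapply(1:n_folds, function(k) {
    val <- which(fold == k)                        # fold k is the validation set
    pred <- knn(train = X[-val, ], test = X[val, ], cl = t[-val], k = K_knn)
    mean(pred != t[val])                           # validation error for fold k
  })
  mean(errs)                                       # average across folds
}

## Compare candidate values of K (for KNN) by their cross-validation error
X <- scale(as.matrix(iris[, 1:4]))
sapply(c(1, 3, 5, 11, 21), function(K) cv_error(X, iris$Species, K))
```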
Cross-Validation Error
Figure: Training error, 10-fold cross-validation error, and test error for KNN on the Iris data as a function of K (K from 0 to 40).