 
Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Model Evaluation

Overview
- Introduction
- Classification performance
- Regression performance
- Cross-validation and tuning
- Additional notes
- Monitoring and maintenance

The analytics process

It’s all about generalization
- You have trained a model (e.g. a decision tree) on a particular data set
- This is your train data (a.k.a. development, estimation): used to build the model
- Performance on your train data gives you an initial idea of your model’s validity
  - But not much more than that
- Much more important: ensure this model will do well on unseen data (out-of-time, out-of-sample, out-of-population)
  - As predictive models are going to be “put to work”
  - Validation needed!
- Test (a.k.a. hold-out) data: used to objectively measure performance!
- Strict separation between training and test set needed!

It’s all about generalization
- At the very least, use a test set
- Typically 1/3 of the data
- Stratification: same class distribution in training and test set

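A minimal sketch of a stratified hold-out split, using only the Python standard library. The function name `stratified_split`, the seed, and the 10% positive toy data set are assumptions for illustration; in practice a library routine would normally be used instead.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=1/3, seed=0):
    """Return (train_idx, test_idx) so each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # random assignment within the class
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical data set: 30 positives, 270 negatives (10% positive).
labels = ["yes"] * 30 + ["no"] * 270
train_idx, test_idx = stratified_split(labels)
test_positive_share = sum(labels[i] == "yes" for i in test_idx) / len(test_idx)
```

Here the test set holds one third of each class, so `test_positive_share` matches the 10% share of the full data set.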
What do we want to validate?
- Out-of-sample
- Out-of-time
- Out-of-population
It is not possible to foresee everything that will happen in the future, as you are by definition limited to the data you have now
But it is your duty to be as thorough as possible

Classification Performance

Confusion matrix
Threshold: 0.50

True Label   Prediction   Predicted Label   Correct?
no           0.11         no                Correct
no           0.2          no                Correct
yes          0.85         yes               Correct
yes          0.84         yes               Correct
yes          0.8          yes               Correct
no           0.65         yes               Incorrect
yes          0.44         no                Incorrect
no           0.1          no                Correct
yes          0.32         no                Incorrect
yes          0.87         yes               Correct
yes          0.61         yes               Correct
yes          0.6          yes               Correct
yes          0.78         yes               Correct
no           0.61         yes               Incorrect

Confusion matrix
Depends on the threshold!

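To illustrate the dependence on the threshold, here is a small sketch that counts the confusion matrix cells for the 14 observations of the example table. The helper name `confusion_counts` and the convention "predict yes when the score is at least the threshold" are assumptions.

```python
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
scores = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44,
          0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]

def confusion_counts(labels, scores, threshold=0.5):
    """Count tp, fp, tn, fn; an observation is predicted 'yes' if score >= threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(labels, scores):
        predicted_yes = s >= threshold
        if predicted_yes and y == "yes":
            counts["tp"] += 1
        elif predicted_yes:
            counts["fp"] += 1
        elif y == "no":
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

at_050 = confusion_counts(labels, scores, 0.50)
at_070 = confusion_counts(labels, scores, 0.70)
```

Moving the threshold from 0.50 to 0.70 removes both false positives but turns two true positives into false negatives.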
Metrics
These depend on the confusion matrix, and hence on the threshold!

Common metrics
- Accuracy = (tp + tn) / total = (7 + 3) / 14 = 0.71
- Balanced accuracy = (recall + specificity) / 2 = 0.5 * tp / (tp + fn) + 0.5 * tn / (tn + fp) = 0.5 * 0.78 + 0.5 * 0.60 = 0.69
- Recall (sensitivity) = tp / (tp + fn) = 7 / 9 = 0.78
  “How many of the actual positives did we predict as such?”
- Precision = tp / (tp + fp) = 7 / 9 = 0.78
  “How many of the predicted positives are actually positive?”

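The arithmetic above can be checked with a few lines of Python. The counts are those of the example at threshold 0.50 (tp = 7, fp = 2, tn = 3, fn = 2); the variable names are illustrative.

```python
# Confusion matrix cells of the running example at threshold 0.50.
tp, fp, tn, fn = 7, 2, 3, 2

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)                      # a.k.a. sensitivity
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2
precision = tp / (tp + fp)

for name, value in [("accuracy", accuracy), ("recall", recall),
                    ("specificity", specificity),
                    ("balanced accuracy", balanced_accuracy),
                    ("precision", precision)]:
    print(f"{name}: {value:.2f}")
```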
Tuning the threshold
For each possible threshold $t \in T$, with $T$ the set of all predicted probabilities, we can obtain a confusion matrix and hence different metrics
So which threshold to pick?
(Recall here our discussion on “well-calibrated” classifiers)
(Note: one could also define multiple thresholds)

True Label   Prediction
no           0.11
no           0.2
yes          0.85
yes          0.84
yes          0.8
no           0.65
yes          0.44
no           0.1
yes          0.32
yes          0.87
yes          0.61
yes          0.6
yes          0.78
no           0.61

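One way to pick a threshold is to try every predicted probability as a candidate and keep the one that scores best on the metric of interest. The sketch below does this for accuracy on the example data; the choice of accuracy, the `>=` convention, and the function name `accuracy_at` are assumptions, and any other metric could be swapped in.

```python
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
scores = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44,
          0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]

def accuracy_at(threshold):
    """Accuracy when predicting 'yes' for every score >= threshold."""
    correct = sum((s >= threshold) == (y == "yes") for y, s in zip(labels, scores))
    return correct / len(labels)

# T: the set of all predicted probabilities, used as candidate thresholds.
candidates = sorted(set(scores))
best_threshold = max(candidates, key=accuracy_at)
best_accuracy = accuracy_at(best_threshold)
```

On this small sample the best candidate is well below 0.50, which shows why the default threshold should not be taken for granted. In practice the threshold should be tuned on data that is kept separate from the final test set.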
Tuning the model?
- For most models, it’s hard to push them towards optimizing your metric of choice
- They’ll often inherently optimize for accuracy given the training set
- In most cases, you will be interested in something else
  - The class imbalance present in the training set might conflict with a model’s notion of accuracy
  - You might want to focus on recall or precision, or…
- What can we do?
  - Tune the threshold on your metric of interest
  - Adjust the model parameters
  - Adjust the target definition
  - Sample/filter the data set
  - Apply misclassification costs
  - Apply instance weighting (easy way to do this: duplicate instances)
  - Adjust the loss function (if the model supports doing so, and even then the loss is often still tied to accuracy)

Tuning the threshold

Applying misclassification costs
Let’s go on a small detour…
- Let us illustrate the basic problem with a setting you’ll encounter often: a binary classification problem where the class of interest (the positive class) happens rarely compared to the negative class
- Say fraud only occurs in 1% of cases in the training data
- Almost all techniques you run out of the box will show this in your confusion matrix:

                     Actual Negative   Actual Positive
Predicted Negative   TN: 99            FN: 1
Predicted Positive   FP: 0             TP: 0

Applying misclassification costs
What’s happening here?

                     Actual Negative   Actual Positive
Predicted Negative   TN: 99            FN: 1
Predicted Positive   FP: 0             TP: 0

- Remember that the model will optimize for accuracy, and gets an accuracy of 99%
- That’s why you should never believe people who only report accuracy
- “No worries, I’ll just pick a stricter threshold”
  - Doesn’t always work!
- How do I tell my model that I am willing to make some mistakes on the negative side to catch the positives?

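A quick check of why accuracy alone is misleading here, using the counts from the confusion matrix above. The variable names are illustrative.

```python
# Everything is predicted negative: 99 true negatives, 1 missed positive.
tn, fn, fp, tp = 99, 1, 0, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
# Precision is undefined when nothing is predicted positive.
precision = tp / (tp + fp) if (tp + fp) > 0 else None

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision}")
```

The model is 99% accurate while catching none of the positives, so recall exposes what accuracy hides.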
Applying misclassification costs
What we would like to do is set misclassification costs as such:

                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 5
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0

Mispredicting a positive as a negative is 5 times as bad as mispredicting a negative as a positive
How to determine the costs?
- Use real average observed costs (hard to find in many settings)
- Expert estimate
- Inverse class distribution…

Applying misclassification costs
Inverse class distribution: 99% negative versus 1% positive

$C(1, 0) = 0.99 \times \frac{1}{0.99} = 1$
$C(0, 1) = 0.99 \times \frac{1}{0.01} = 99$

                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 99
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0

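The inverse class distribution costs can be computed as sketched below. The function name `inverse_class_costs` is hypothetical, and scaling by the majority proportion (so that the cheapest error costs 1) is an assumption generalized from the 99% versus 1% example.

```python
def inverse_class_costs(p_negative, p_positive):
    """Return (C(1,0), C(0,1)): the false-positive and false-negative costs."""
    scale = max(p_negative, p_positive)        # assumption: cheapest error costs 1
    cost_false_positive = scale / p_negative   # C(1, 0): actual negative predicted positive
    cost_false_negative = scale / p_positive   # C(0, 1): actual positive predicted negative
    return cost_false_positive, cost_false_negative

c_fp, c_fn = inverse_class_costs(0.99, 0.01)
print(round(c_fp, 2), round(c_fn, 2))
```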
Applying misclassification costs
With a given cost matrix (no matter how we define it), we can then calculate the expected loss

                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 5
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0

$l(x, j) = \sum_k p(k \mid x)\, C(j, k)$ is the expected loss for classifying an observation $x$ as class $j$
For binary classification:
$l(x, 0) = p(0 \mid x)\, C(0, 0) + p(1 \mid x)\, C(0, 1) = p(1 \mid x)\, C(0, 1)$ (here)
$l(x, 1) = p(0 \mid x)\, C(1, 0) + p(1 \mid x)\, C(1, 1) = p(0 \mid x)\, C(1, 0)$ (here)

Applying misclassification costs
Classify an observation as positive if the expected loss for classifying it as a positive observation is smaller than the expected loss for classifying it as a negative observation:
$l(x, 1) < l(x, 0)$ → classify as positive (1), negative (0) otherwise

                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 5
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0

Example: cost-insensitive classifier predicts $p(1 \mid x) = 0.22$
$l(x, 0) = p(0 \mid x)\, C(0, 0) + p(1 \mid x)\, C(0, 1) = 0.78 \times 0 + 0.22 \times 5 = 1.10$
$l(x, 1) = p(0 \mid x)\, C(1, 0) + p(1 \mid x)\, C(1, 1) = 0.78 \times 1 + 0.22 \times 0 = 0.78$
→ Classify as positive

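The decision rule can be sketched in a few lines. The cost matrix is indexed as `cost[j][k]` (predicted class j, actual class k) to mirror C(j, k); the function names `expected_loss` and `classify` are illustrative.

```python
def expected_loss(p_positive, cost, predicted):
    """l(x, j) = sum_k p(k|x) * C(j, k), with cost[j][k] = C(j, k)."""
    p = [1.0 - p_positive, p_positive]          # p(0|x), p(1|x)
    return sum(p[k] * cost[predicted][k] for k in (0, 1))

def classify(p_positive, cost):
    """Return 1 if predicting positive has the smaller expected loss, else 0."""
    loss_pos = expected_loss(p_positive, cost, 1)
    loss_neg = expected_loss(p_positive, cost, 0)
    return 1 if loss_pos < loss_neg else 0

cost = [[0, 5],   # predicted negative: C(0,0), C(0,1)
        [1, 0]]   # predicted positive: C(1,0), C(1,1)
loss_neg = expected_loss(0.22, cost, 0)
loss_pos = expected_loss(0.22, cost, 1)
label = classify(0.22, cost)
```

With these costs an observation with a predicted probability of only 0.22 is still classified as positive, because missing a positive is five times as costly as a false alarm.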
Applying misclassification costs
$l(x, 1) = l(x, 0)$
$p(0 \mid x)\, C(1, 0) + p(1 \mid x)\, C(1, 1) = p(0 \mid x)\, C(0, 0) + p(1 \mid x)\, C(0, 1)$
$p(0 \mid x) = 1 - p(1 \mid x)$
$p(1 \mid x) = \frac{C(1, 0) - C(0, 0)}{C(1, 0) - C(0, 0) + C(0, 1) - C(1, 1)} = T_{CS}$
Remark: when $C(1, 0) = C(0, 1) = 1$ and $C(1, 1) = C(0, 0) = 0$, then $T_{CS} = \frac{1 - 0}{1 - 0 + 1 - 0} = 0.5$

Applying misclassification costs

                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 5
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0

Example: cost-insensitive classifier predicts $p(1 \mid x) = 0.22$
$l(x, 0) = p(0 \mid x)\, C(0, 0) + p(1 \mid x)\, C(0, 1) = 0.78 \times 0 + 0.22 \times 5 = 1.10$
$l(x, 1) = p(0 \mid x)\, C(1, 0) + p(1 \mid x)\, C(1, 1) = 0.78 \times 1 + 0.22 \times 0 = 0.78$
→ Classify as positive
$T_{CS} = \frac{1}{1 + 5} = 0.1667 \le 0.22$

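The cost-sensitive threshold follows directly from the cost matrix. A minimal sketch, with `cost[j][k]` standing for C(j, k) and the function name `cost_sensitive_threshold` chosen for illustration:

```python
def cost_sensitive_threshold(cost):
    """T_CS = (C(1,0) - C(0,0)) / (C(1,0) - C(0,0) + C(0,1) - C(1,1))."""
    c00, c01 = cost[0]   # predicted negative row
    c10, c11 = cost[1]   # predicted positive row
    return (c10 - c00) / (c10 - c00 + c01 - c11)

t_5 = cost_sensitive_threshold([[0, 5], [1, 0]])
t_sym = cost_sensitive_threshold([[0, 1], [1, 0]])
t_99 = cost_sensitive_threshold([[0, 99], [1, 0]])

# Same decision as comparing expected losses: positive if p(1|x) reaches T_CS.
is_positive = 0.22 >= t_5
```

Comparing the predicted probability with this threshold gives the same decision as comparing the two expected losses, without retraining the model.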
Sampling approaches
From the above, a new cost-sensitive class distribution can be obtained based on the cost-sensitive threshold as follows:
New number of positive observations: $n'_1 = n_1 \frac{1 - T_{CS}}{T_{CS}}$
Or, new number of negative observations: $n'_0 = n_0 \frac{T_{CS}}{1 - T_{CS}}$
E.g. using 1 positive versus 99 negative (class inverse cost matrix):

                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 99
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0

$T_{CS} = \frac{1}{1 + 99} = 0.01$
$n'_1 = 1 \times \frac{1 - 0.01}{0.01} = 99$, or: $n'_0 = 99 \times \frac{0.01}{1 - 0.01} = 1$

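The two resampling formulas can be evaluated as below. The function name `resampled_counts` is hypothetical; only one of the two classes is resampled in practice, the function simply returns both options.

```python
def resampled_counts(n_pos, n_neg, t_cs):
    """Return (n'_1, n'_0): the positive count if oversampling positives,
    and the negative count if undersampling negatives instead."""
    new_pos = n_pos * (1 - t_cs) / t_cs
    new_neg = n_neg * t_cs / (1 - t_cs)
    return new_pos, new_neg

# 1 positive versus 99 negatives, inverse class cost matrix -> T_CS = 0.01.
new_pos, new_neg = resampled_counts(n_pos=1, n_neg=99, t_cs=0.01)
print(round(new_pos), round(new_neg))
```

Either option yields a 1:1 class ratio: 99 positives against the original 99 negatives, or 1 negative against the original 1 positive.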
Sampling approaches
We now arrive at a nice conclusion:
“Sampling the data set so the minority class is equal to the majority class boils down to biasing the classifier in the same way as when you would use a cost matrix constructed from the inverse class imbalance”

Oversampling (upsampling)

Undersampling (downsampling)

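A rough sketch of random oversampling and undersampling to a 1:1 ratio, using only the standard library. The function names, the seed, and the toy records are assumptions; oversampling here draws minority observations with replacement, undersampling draws majority observations without replacement.

```python
import random

def oversample_minority(majority, minority, seed=0):
    """Duplicate random minority observations until both classes are equal in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def undersample_majority(majority, minority, seed=0):
    """Keep a random subset of the majority class, as large as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Hypothetical training set: 99 negatives, 1 positive.
negatives = [("neg", i) for i in range(99)]
positives = [("pos", 0)]
over_neg, over_pos = oversample_minority(negatives, positives)
under_neg, under_pos = undersample_majority(negatives, positives)
```

Resampling should be applied to the training data only, so that the test set keeps the original class distribution.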
Smart sampling
SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2002)
- Oversample the minority class by creating synthetic examples
- Step 1: For each minority class observation, determine its k (e.g., 1) nearest neighbors
- Step 2: Synthetic examples are generated between the observation and its neighbors
- Can be combined with undersampling the majority class

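The two steps can be sketched as follows. This is a simplified illustration of the interpolation idea, not the reference SMOTE implementation: the function name `smote_sketch`, the Euclidean distance, creating one synthetic point per neighbor, and the toy points are all assumptions.

```python
import math
import random

def smote_sketch(minority, k=1, seed=0):
    """Create one synthetic point per (observation, neighbor) pair by interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: the k nearest minority neighbors of x (excluding x itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 2: a random point on the segment between x and each neighbor.
        for nb in neighbours:
            gap = rng.random()
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Hypothetical minority observations, all on the line y = 2x.
minority = [(1.0, 2.0), (2.0, 4.0), (4.0, 8.0)]
new_points = smote_sketch(minority)
```

Because each synthetic point lies on a segment between two minority observations, the new points here also fall on the line y = 2x.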
Smart sampling
See e.g. imblearn: https://imbalanced-learn.readthedocs.io/en/stable/

Sampling approaches
- Note: combinations of over/downsampling possible
- You can also try oversampling the minority class above the 1:1 level (would boil down to using even more extreme costs in the cost matrix)
- Very closely related to the field of “cost-sensitive learning”
  - Setting misclassification costs (some implementations allow this as well)
  - Cost-sensitive logistic regression
  - Cost-sensitive decision trees (use modified entropy and information gain measures)
  - Cost-sensitive evaluation measures (e.g. Average Misclassification Cost)