Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] - Model Evaluation - PowerPoint PPT Presentation



slide-1
SLIDE 1

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Model Evaluation

slide-2
SLIDE 2

Overview

- Introduction
- Classification performance
- Regression performance
- Cross-validation and tuning
- Additional notes
- Monitoring and maintenance

2

slide-3
SLIDE 3

The analytics process

3

slide-4
SLIDE 4

It’s all about generalization

You have trained a model on a particular data set (e.g. a decision tree). This is your train data (a.k.a. development or estimation data): used to build the model.

Performance on your train data gives you an initial idea of your model’s validity, but not much more than that.

Much more important: ensure this model will do well on unseen data (out-of-time, out-of-sample, out-of-population).

As predictive models are going to be “put to work”, validation is needed!

Test (a.k.a. hold-out) data: used to objectively measure performance. A strict separation between training and test set is needed!

4

slide-5
SLIDE 5

It’s all about generalization

At the very least, use a test set

Typically 1/3 of the data. Stratification: same class distribution in training and test sets

5

slide-6
SLIDE 6

What do we want to validate?

Out-of-sample Out-of-time Out-of-population

Not possible to foresee everything that will happen in the future, as you are by definition limited to the data you have now

But it is your duty to be as thorough as possible

6

slide-7
SLIDE 7

Classification Performance

7

slide-8
SLIDE 8

Threshold: 0.50

True Label   Prediction   Predicted Label   Correct?
no           0.11         no                Correct
no           0.20         no                Correct
yes          0.85         yes               Correct
yes          0.84         yes               Correct
yes          0.80         yes               Correct
no           0.65         yes               Incorrect
yes          0.44         no                Incorrect
no           0.10         no                Correct
yes          0.32         no                Incorrect
yes          0.87         yes               Correct
yes          0.61         yes               Correct
yes          0.60         yes               Correct
yes          0.78         yes               Correct
no           0.61         yes               Incorrect

Confusion matrix

8

slide-9
SLIDE 9

Confusion matrix

Depends on the threshold! 9

slide-10
SLIDE 10

Metrics

These depend on the confusion matrix, and hence on the threshold! 10

slide-11
SLIDE 11

Accuracy = (tp + tn) / total = (3 + 7) / 14 = 0.71

Recall (sensitivity) = tp / (tp + fn) = 7 / 9 = 0.78 — “How much of the positives did we predict as such?”

Specificity = tn / (tn + fp) = 3 / 5 = 0.60

Balanced accuracy = (recall + specificity) / 2 = (0.5 × tp) / (tp + fn) + (0.5 × tn) / (tn + fp) = 0.5 × 0.78 + 0.5 × 0.60 = 0.69

Precision = tp / (tp + fp) = 7 / 9 = 0.78 — “How many of the predicted positives are actually positive?”

Common metrics

11
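As a sanity check, the metrics above can be reproduced from the slide 8 example using only the Python standard library (a sketch, not a production implementation):

```python
# Slide 8 example: 14 true labels with predicted probabilities.
y_true = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
scores = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44,
          0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]

threshold = 0.50
y_pred = ["yes" if s >= threshold else "no" for s in scores]

# Confusion matrix cells
tp = sum(t == "yes" and p == "yes" for t, p in zip(y_true, y_pred))
tn = sum(t == "no" and p == "no" for t, p in zip(y_true, y_pred))
fp = sum(t == "no" and p == "yes" for t, p in zip(y_true, y_pred))
fn = sum(t == "yes" and p == "no" for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / (tp + tn + fp + fn)      # (7 + 3) / 14
recall = tp / (tp + fn)                         # sensitivity, 7 / 9
specificity = tn / (tn + fp)                    # 3 / 5
precision = tp / (tp + fp)                      # 7 / 9
balanced_accuracy = (recall + specificity) / 2
```

In practice you would use a library (e.g. scikit-learn’s metrics module) rather than hand-rolling these.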

slide-12
SLIDE 12

True Label   Prediction
no           0.11
no           0.20
yes          0.85
yes          0.84
yes          0.80
no           0.65
yes          0.44
no           0.10
yes          0.32
yes          0.87
yes          0.61
yes          0.60
yes          0.78
no           0.61

(Recall here our discussion on “well-calibrated” classifiers) (Note: one could also define multiple thresholds)

Tuning the threshold

For each possible threshold t ∈ T, with T the set of all predicted probabilities, we can obtain a confusion matrix and hence different metrics. So which threshold to pick?

12

slide-13
SLIDE 13

Tuning the model?

For most models, it’s hard to push them towards optimizing your metric of choice

They’ll often inherently optimize for accuracy given the training set. In most cases, however, you will be interested in something else:

The class imbalance present in the training set might conflict with a model’s notion of accuracy. You might want to focus on recall or precision, or…

What can we do?

- Tuning the threshold on your metric of interest
- Adjust the model parameters
- Adjust the target definition
- Sample/filter the data set
- Apply misclassification costs
- Apply instance weighting (easy way to do this: duplicate instances)
- Adjust the loss function (if the model supports doing so, and even then oftentimes related to accuracy concerns)

13

slide-14
SLIDE 14

Tuning the threshold

14
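Such a sweep can be sketched as follows on the running example. The choice of F1 as the metric of interest is purely illustrative; you would plug in whatever metric matters in your setting:

```python
# Sweep all candidate thresholds (the set T of predicted
# probabilities) and pick the one maximizing the chosen metric.
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
scores = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44,
          0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]

def f1_at(threshold):
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t and p for t, p in zip(y_true, pred))
    fp = sum((not t) and p for t, p in zip(y_true, pred))
    fn = sum(t and (not p) for t, p in zip(y_true, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Best threshold over all candidate cut-offs
best = max(sorted(set(scores)), key=f1_at)
```

On this toy data the sweep picks 0.32 as the cut-off: every positive is then caught at the price of two false positives.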

slide-15
SLIDE 15

Applying misclassification costs

Let’s go on a small detour… Let us illustrate the basic problem with a setting you’ll encounter often: a binary classification problem where the class of interest (the positive class) happens rarely compared to the negative class

Say fraud only occurs in 1% of cases in the training data

Almost all techniques you run out of the box will show this in your confusion matrix:

                     Actual Negative   Actual Positive
Predicted Negative   TN: 99            FN: 1
Predicted Positive   FP: 0             TP: 0

15

slide-16
SLIDE 16

Applying misclassification costs

What’s happening here?

                     Actual Negative   Actual Positive
Predicted Negative   TN: 99            FN: 1
Predicted Positive   FP: 0             TP: 0

Remember that the model will optimize for accuracy, and gets an accuracy of 99%. That’s why you should never believe people who only report on accuracy.

“No worries, I’ll just pick a stricter threshold”

Doesn’t always work! How do I tell my model that I am willing to make some mistakes on the negative side to catch the positives?

16

slide-17
SLIDE 17

Applying misclassification costs

What we would like to do is set misclassification costs as such:

                     Actual Negative   Actual Positive
Predicted Negative   C(0,0) = 0        C(0,1) = 5
Predicted Positive   C(1,0) = 1        C(1,1) = 0

Mispredicting a positive as a negative is 5 times as bad as mispredicting a negative as a positive. How to determine the costs?

- Use real average observed costs (hard to find in many settings)
- Expert estimate
- Inverse class distribution
- …

17

slide-18
SLIDE 18

Applying misclassification costs

Inverse class distribution:

99% negative versus 1% positive:

C(1, 0) = 0.99 / 0.99 = 1
C(0, 1) = 0.99 / 0.01 = 99

                     Actual Negative   Actual Positive
Predicted Negative   C(0,0) = 0        C(0,1) = 99
Predicted Positive   C(1,0) = 1        C(1,1) = 0

18

slide-19
SLIDE 19

Applying misclassification costs

With a given cost matrix (no matter how we define it), we can then calculate the expected loss

                     Actual Negative   Actual Positive
Predicted Negative   C(0,0) = 0        C(0,1) = 5
Predicted Positive   C(1,0) = 1        C(1,1) = 0

l(x, j) is the expected loss for classifying an observation x as class j:

l(x, j) = Σₖ p(k|x) C(j, k)

For binary classification:

l(x, 0) = p(0|x) C(0, 0) + p(1|x) C(0, 1) = (here) p(1|x) C(0, 1)
l(x, 1) = p(0|x) C(1, 0) + p(1|x) C(1, 1) = (here) p(0|x) C(1, 0)

19

slide-20
SLIDE 20

Applying misclassification costs

Classify an observation as positive if the expected loss for classifying it as a positive observation is smaller than the expected loss for classifying it as a negative observation:

l(x, 1) < l(x, 0) → classify as positive (1), negative (0) otherwise

                     Actual Negative   Actual Positive
Predicted Negative   C(0,0) = 0        C(0,1) = 5
Predicted Positive   C(1,0) = 1        C(1,1) = 0

Example: a cost-insensitive classifier predicts p(1|x) = 0.22:

l(x, 0) = p(0|x) C(0, 0) + p(1|x) C(0, 1) = 0.78 × 0 + 0.22 × 5 = 1.10
l(x, 1) = p(0|x) C(1, 0) + p(1|x) C(1, 1) = 0.78 × 1 + 0.22 × 0 = 0.78

→ Classify as positive

20
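The decision rule above can be sketched in a few lines, using the cost matrix and the p(1|x) = 0.22 example from the slides; this also computes the equivalent cost-sensitive threshold TCS derived on the next slides:

```python
# Cost matrix C[(predicted, actual)] from the slides.
C = {(0, 0): 0.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 0.0}

def expected_loss(j, p1):
    """l(x, j) = sum_k p(k|x) * C(j, k), for binary classes k in {0, 1}."""
    p0 = 1.0 - p1
    return p0 * C[(j, 0)] + p1 * C[(j, 1)]

p1 = 0.22
l0 = expected_loss(0, p1)    # 0.78 * 0 + 0.22 * 5 = 1.10
l1 = expected_loss(1, p1)    # 0.78 * 1 + 0.22 * 0 = 0.78
label = 1 if l1 < l0 else 0  # lower expected loss: classify as positive

# Equivalent view: the cost-sensitive threshold TCS
TCS = (C[(1, 0)] - C[(0, 0)]) / (
    C[(1, 0)] - C[(0, 0)] + C[(0, 1)] - C[(1, 1)])
# = 1 / (1 + 5), and indeed p1 = 0.22 >= TCS
```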

slide-21
SLIDE 21

Applying misclassification costs

Remark: when l(x, 1) = l(x, 0) and p(0|x) = 1 − p(1|x), then

p(0|x) C(0, 0) + p(1|x) C(0, 1) = p(0|x) C(1, 0) + p(1|x) C(1, 1)

p(1|x) = (C(1, 0) − C(0, 0)) / (C(1, 0) − C(0, 0) + C(0, 1) − C(1, 1)) = TCS

With C(1, 0) = C(0, 1) = 1 and C(1, 1) = C(0, 0) = 0:

TCS = (1 − 0) / (1 − 0 + 1 − 0) = 0.5

21

slide-22
SLIDE 22

Applying misclassification costs

                     Actual Negative   Actual Positive
Predicted Negative   C(0,0) = 0        C(0,1) = 5
Predicted Positive   C(1,0) = 1        C(1,1) = 0

Example: a cost-insensitive classifier predicts p(1|x) = 0.22:

l(x, 0) = p(0|x) C(0, 0) + p(1|x) C(0, 1) = 0.78 × 0 + 0.22 × 5 = 1.10
l(x, 1) = p(0|x) C(1, 0) + p(1|x) C(1, 1) = 0.78 × 1 + 0.22 × 0 = 0.78

TCS = 1 / (1 + 5) = 0.1667 ≤ 0.22 → Classify as positive

22

slide-23
SLIDE 23

Sampling approaches

From the above, a new cost-sensitive class distribution can be obtained based on the cost-sensitive threshold as follows:

New positive number of observations: n′₁ = n₁ × (1 − TCS) / TCS
Or, new negative number of observations: n′₀ = n₀ × TCS / (1 − TCS)

E.g. using 1 positive versus 99 negative (class inverse cost matrix):

                     Actual Negative   Actual Positive
Predicted Negative   C(0,0) = 0        C(0,1) = 99
Predicted Positive   C(1,0) = 1        C(1,1) = 0

TCS = 1 / (1 + 99) = 0.01

n′₁ = 1 × (1 − 0.01) / 0.01 = 99, or:
n′₀ = 99 × 0.01 / (1 − 0.01) = 1

23
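The resampling counts above can be sketched directly from the formulas, with the 1-positive-versus-99-negatives example:

```python
# Cost-sensitive resampling counts derived from the threshold TCS.
# Costs from the inverse class distribution: TCS = 1 / (1 + 99).
n_pos, n_neg = 1, 99
TCS = 1 / (1 + 99)

# Either oversample the positives ...
n_pos_new = n_pos * (1 - TCS) / TCS   # 1 * 0.99 / 0.01 = 99
# ... or undersample the negatives:
n_neg_new = n_neg * TCS / (1 - TCS)   # 99 * 0.01 / 0.99 = 1
```

Either way, the resampled training set ends up with a 1:1 class balance.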

slide-24
SLIDE 24

Sampling approaches

We now arrive at a nice conclusion: Sampling the data set so the minority class is equal to the majority class boils down to biasing the classifier in the same way as when you would use a cost matrix constructed from the inverse class imbalance


24

slide-25
SLIDE 25

Oversampling (upsampling)

25

slide-26
SLIDE 26

Undersampling (downsampling)

26

slide-27
SLIDE 27

Smart sampling

SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2002)

Oversample minority class by creating synthetic examples

Step 1: For each minority class observation, determine its k (e.g., 1) nearest neighbors. Step 2: Synthetic examples are generated as random points on the line segments between the instance and its neighbors

Can be combined with undersampling majority class

27
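Libraries such as imblearn (next slide) provide production-quality implementations; purely to illustrate the two steps above, here is a minimal pure-Python sketch (k = 1, numeric features only, made-up data points):

```python
import random

def smote(minority, n_synthetic, seed=0):
    """SMOTE-style sketch: interpolate between a minority instance
    and its single nearest minority neighbor (k = 1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        a = rng.choice(minority)
        # Step 1: nearest minority neighbor of a, excluding a itself
        b = min((x for x in minority if x is not a),
                key=lambda x: sum((ai - xi) ** 2 for ai, xi in zip(a, x)))
        # Step 2: random point on the segment between a and b
        gap = rng.random()
        synthetic.append(tuple(ai + gap * (bi - ai)
                               for ai, bi in zip(a, b)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0)]
new_points = smote(minority, n_synthetic=4)
```

Every synthetic point lies between two existing minority points, which is exactly why SMOTE cannot invent genuinely new kinds of positives.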

slide-28
SLIDE 28

Smart sampling

See e.g. imblearn : https://imbalanced-learn.readthedocs.io/en/stable/ 28

slide-29
SLIDE 29

Sampling approaches

Note: combinations of over-/downsampling are possible

You can also try oversampling the minority class above the 1:1 level (this would boil down to using even more extreme costs in the cost matrix). Very closely related is the field of “cost-sensitive learning”:

- Setting misclassification costs (some implementations allow this as well)
- Cost-sensitive logistic regression
- Cost-sensitive decision trees (uses modified entropy and information gain measures)
- Cost-sensitive evaluation measures (e.g. Average Misclassification Cost)

29

slide-30
SLIDE 30

Sampling approaches

Only on your training set! Test set remains untouched

- Basically, a way to indicate to the learner: both classes are equally important
- On the test set, you can use AUC or the metric you are actually interested in
- Note that the accuracy on the test set after up/downsampling will most likely be lower than in the “just always predict the majority class” case
- I.e. your model will now start to identify cases as being fraudulent… some of these will be false positives: the price to pay to get out the true positives. Remember the precision versus recall trade-off
- Experimentation with the right amounts of over/undersampling is required: it depends on the setting
- SMOTE and other intelligent sampling techniques work well, but are not magic: you’ll still need some positives… Also, don’t expect SMOTE to create “hidden, future, …” cases of positive instances

Class imbalance occurs in many settings! 30

slide-31
SLIDE 31

Sampling approaches

Some techniques also support instance weighting: not defined per cell in the confusion matrix but per instance

Indicate that some instances are more important to get right Similar derivation is possible here: a rough approach consists of duplicating instance rows that are deemed more important Again: biasing the training in the same way Again: only in the training data (the fact that some instances are more important can then be evaluated with a corresponding evaluation scheme during testing)

31

slide-32
SLIDE 32

Sampling approaches

Sampling biases the training set and the probability ranges your model outputs. This is fine if you’re only interested in a ranking, but it distorts a calibrated view on the probabilities

In case this is important, you can unbias the probability output using (Saerens et al., 2002):

p_unbiased(Cᵢ|x) = [ p(Cᵢ)/pₛ(Cᵢ) × pₛ(Cᵢ|x) ] / Σⱼ₌₁..ₘ [ p(Cⱼ)/pₛ(Cⱼ) × pₛ(Cⱼ|x) ]

With Cᵢ class i, pₛ(Cᵢ|x) the biased probability (on the sampled data set), pₛ(Cᵢ) the prior probability (proportion) of class Cᵢ on the sampled training data set, and p(Cᵢ) the original prior (proportion) before sampling (e.g. 1% vs. 99%)

32
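A sketch of this prior correction: reweight each biased posterior by original prior over sampled prior, then renormalize. The 50/50-resampled versus 1%/99%-original priors below are illustrative numbers, not from the slides:

```python
def unbias(posteriors, sampled_priors, original_priors):
    """Saerens et al. (2002) prior correction for resampled training data."""
    weighted = [p_s * (p / q) for p_s, p, q
                in zip(posteriors, original_priors, sampled_priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Model trained on a 50/50 resampled set outputs p(pos|x) = 0.60,
# but the original data was 99% negative vs 1% positive:
biased = [0.40, 0.60]  # [negative, positive]
corrected = unbias(biased,
                   sampled_priors=[0.5, 0.5],
                   original_priors=[0.99, 0.01])
```

The corrected positive probability drops from 0.60 to roughly 0.015, restoring a calibrated view.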

slide-33
SLIDE 33

Example

library(caret)
library(tidyverse)
library(magrittr)
library(ROCR)
library(PRROC)
library(ROSE)

data <- read.csv('data.csv')
table(data$TARGET)
# 0    1
# 4748 252

train.index <- createDataPartition(data$TARGET, p = .7, list = FALSE)
train <- data[ train.index,]
test  <- data[-train.index,]

dtree <- train(TARGET ~ ., data = train, method = "rpart", tuneLength = 10)
predictions <- predict(dtree, test, type = 'prob')

33

slide-34
SLIDE 34

Example

train.sampled <- ROSE(TARGET ~ ., data = train, p = 0.5)$data
table(train.sampled$TARGET)
# 0    1
# 1754 1747

dtree.sampled <- train(TARGET ~ ., data = train.sampled, method = "rpart", tuneLength = 10)
predictions <- predict(dtree.sampled, test, type = 'prob')

34

slide-35
SLIDE 35

Example

After rescaling: 35

slide-36
SLIDE 36

(Back to) classification performance

Let’s get back on track. We have seen in any case that accuracy is often not the only metric we should focus on.

Recall and precision concerns are often much more important, but they depend on the threshold. We have already seen a recall/precision curve. Others?

36

slide-37
SLIDE 37

ROC curve

- Make a table with sensitivity and specificity for each possible cut-off
- The receiver operating characteristic (ROC) curve plots sensitivity (tp rate) versus 1 − specificity (fp rate) for each possible cut-off
- A perfect model has sensitivity of 1 and specificity of 1 (i.e. the upper left corner)
- The ROC curve can be summarized by the area underneath it (area under (RO) curve, AUC)
- AUC represents the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative instance (Hanley and McNeil, 1983)

37
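The probabilistic interpretation of AUC can be verified by brute force on the slide 8 example; fine for 45 pairs, though real implementations sort the scores instead:

```python
# AUC as P(score of random positive > score of random negative),
# counting ties as half a win (Hanley and McNeil, 1983).
pos = [0.85, 0.84, 0.8, 0.44, 0.32, 0.87, 0.61, 0.6, 0.78]  # true "yes"
neg = [0.11, 0.2, 0.65, 0.1, 0.61]                          # true "no"

wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
```

On this example the pairwise count gives 37.5 wins out of 45 pairs, i.e. an AUC of about 0.83.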

slide-38
SLIDE 38

True Label   Prediction
no           0.11
no           0.20
yes          0.85
yes          0.84
yes          0.80
no           0.65
yes          0.44
no           0.10
yes          0.32
yes          0.87
yes          0.61
yes          0.60
yes          0.78
no           0.61

ROC curve

38

slide-39
SLIDE 39

ROC curve

39

slide-40
SLIDE 40

ROC curve

But are they completely the same? 40

slide-41
SLIDE 41

ROC curve

Precision-recall or ROC curve?

- Both use sensitivity, or recall (the true positive rate)
- The precision-recall curve uses precision: the fraction of predicted positives that were indeed positive. The ROC curve uses 1 − specificity, the false positive rate: the fraction of negatives in the data set that were predicted as positive
- A key difference is thus that precision doesn’t account for the true negatives. Precision measures the probability that a sample classified as positive is actually positive. The false positive rate measures the ratio of false positives within the negative samples. These two are not the same, and generally speaking, the ROC curve is preferred
- It is important to keep your setting in mind as well, especially when dealing with imbalanced data sets: the true negatives most likely take up a significant part of the confusion matrix, so the false positive rate tends to increase more slowly, whereas precision is unaffected by this
- In other words, precision measures the probability of correct detection of positive cases, whereas the false positive rate, together with the true positive rate, measures the ability to distinguish between the classes

41

slide-42
SLIDE 42

ROC curve

Precision-recall or ROC curve?

- If the positive class is the minority class, and the ability to detect it is our main focus (correct detection of negatives being less important), precision and recall (and the corresponding precision-recall curve) can be preferred
- If the positive class is the majority class, or we want to give equal weight to our model’s ability to correctly detect both classes, sensitivity and specificity (and the corresponding ROC curve) should be preferred
- Note that both curves can also be constructed by switching the class labels around, e.g. when the negative class is the minority class or you want to focus on your model’s ability to correctly detect the negative cases

42

slide-43
SLIDE 43

ROC curve

https://arxiv.org/pdf/1812.01388.pdf

43

slide-44
SLIDE 44

ROC curve

ROC curve can be summarized by the area underneath (area under (RO) curve, AUC) Similar AUC for precision-recall curve exists

But: visual inspection and understanding required!

You might only be interested in a certain area of the curve “Weighted” approaches exist, but not commonly known about

Also see:

http://www.rduin.nl/presentations/ROC%20Tutorial%20Peter%20Flach/ROCtutorialPartI.pdf
https://stats.stackexchange.com/questions/225210/accuracy-vs-area-under-the-roc-curve/225221#225221

44

slide-45
SLIDE 45

Cumulative accuracy profile (CAP)

- Sort the population from high to low score
- Measure the (cumulative) percentage of positives for each score decile
- Also known as: Lorenz curve, Power curve, Captured Event Plot

Note: AR = 2 × AUC − 1 45

slide-46
SLIDE 46

Lift

- Assume a random model handing out random probabilities
- Take the top n%
- See how many of them were indeed “yes”, e.g. 10 / 100
- Now do the same for your model, giving e.g. 60 / 100
- The lift of your model over random is 60 / 10 = 6
- Lift of 1: random sorting
- Can be done over distinct groups instead of cumulative

Recall/precision at n: same concept, for top ranked n observations

Especially important if shortlists need to be delivered E.g. common in the setting of recommender systems

46
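The cumulative lift computation above can be sketched on the slide 8 example; `lift_at` and the 30% cut-off are illustrative choices, not from the slides (the slide’s 60/100 versus 10/100 example would give a lift of 6):

```python
def lift_at(y_true, scores, fraction):
    """Lift of the model's top `fraction` of cases over a random ordering."""
    # Sort cases by model score, highest first
    ranked = [t for _, t in sorted(zip(scores, y_true), reverse=True)]
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(ranked[:n_top]) / n_top        # positives in the top n%
    base_rate = sum(y_true) / len(y_true)         # positives overall (random)
    return top_rate / base_rate

y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
scores = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44,
          0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]
top_lift = lift_at(y_true, scores, 0.3)  # all of the top 30% are positives
```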

slide-47
SLIDE 47

H-measure

- A coherent alternative to the area under the ROC curve (Hand, 2009)
- The area under the ROC curve (AUC) is a very widely used measure of performance for classification and diagnostic rules. It has the appealing property of being objective, requiring no subjective input from the user
- On the other hand, the AUC has disadvantages. For example, the AUC can give potentially misleading results if ROC curves cross
- It is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers. This means that using the AUC is equivalent to using different metrics to evaluate different classification rules

Nice alternative, lesser used 47

slide-48
SLIDE 48

Kolmogorov-Smirnov (KS) distance

Separation measure: the distance between the cumulative score distributions P(s|y = 1) and P(s|y = 0):

KS = maxₛ |P(s|y = 1) − P(s|y = 0)|

48
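The KS distance can be sketched on the running slide 8 example by evaluating both empirical cumulative distributions at every observed score:

```python
pos = [0.85, 0.84, 0.8, 0.44, 0.32, 0.87, 0.61, 0.6, 0.78]  # scores, y = 1
neg = [0.11, 0.2, 0.65, 0.1, 0.61]                          # scores, y = 0

def cdf(sample, s):
    """Empirical P(score <= s) for the given sample."""
    return sum(x <= s for x in sample) / len(sample)

# Maximum vertical gap between the two cumulative distributions
ks = max(abs(cdf(pos, s) - cdf(neg, s)) for s in sorted(pos + neg))
```

Here the maximum gap occurs at s = 0.2, where 60% of the negatives but none of the positives have been accumulated, so KS = 0.6.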

slide-49
SLIDE 49

Mahalanobis distance

Measure the Mahalanobis distance between the two mean scores:

M = |μₚ − μₙ| / σ

With σ the (pooled) standard deviation. Better than the Euclidean distance because it takes the distribution (standard deviation) of the scores into account.

Closely related is the divergence measure D:

D = (μₚ − μₙ)² / (½ (σₚ² + σₙ²))

49

slide-50
SLIDE 50

Regression Performance

50

slide-51
SLIDE 51

Regression performance

Hypothesis tests on the coefficients with confidence intervals:

H₀: β₁ = 0, Hₐ⁺: β₁ > 0, Hₐ⁻: β₁ < 0

r²: coefficient of determination: the proportion of variation in y explained (“captured”) by the regression model:

r² = 1 − SSE / Syy

with Syy = Σᵢ (yᵢ − ȳ)² and SSE = Σᵢ (yᵢ − ŷᵢ)²

51

slide-52
SLIDE 52

Scatter plot

Scatter plot between the predicted value ŷ and the true value y

Calculate e.g. the Pearson correlation

52

slide-53
SLIDE 53

Other measures

AIC (Akaike Information Criterion):

- A relative estimate of the information lost when a given model is used to represent the process that generates the data
- A trade-off between the goodness of fit and the complexity of the model

BIC (Bayesian Information Criterion), a.k.a. Schwarz criterion:

- Closely related to AIC

r² adjusted:

- Adjusted for the number of predictors in the model: r²ₐ = 1 − (1 − r²) (n − 1) / (n − k)
- Increases only if a new term improves the model more than would be expected by chance, decreases otherwise (r² would continue to increase even after dumping useless features in)
- Most implementations implement this, even if they might call it r²

Others: deviance information criterion, Hannan-Quinn information criterion, Jensen-Shannon divergence, Kullback-Leibler divergence, minimum message length, …

Look at Mean Squared Error, Mean Absolute Deviation, Root Mean Squared Error, …:

MSE = Σᵢ (yᵢ − ŷᵢ)² / n = SSE / n
MAD = Σᵢ |yᵢ − ŷᵢ| / n
RMSE = √MSE (the standard deviation for an unbiased model)

Note: cost-sensitive measures and tuning exist here as well (e.g. “BSZ tuning”, Bansal, Sinha, and Zhao):

AMC = Σᵢ C(yᵢ − ŷᵢ) / n

slide-54
SLIDE 54

Regression performance

54

slide-55
SLIDE 55

Regression error characteristic (REC) curve

This is a regression variant of the ROC curve Plots the error tolerance on the X-axis versus the percentage of points predicted within the tolerance on the Y-axis The resulting curve estimates the cumulative distribution function of the error The error on the X-axis can be defined as the squared error or the absolute deviation or something else

55

slide-56
SLIDE 56

Regression performance

Perform some basic validation checks

Check residuals of the model Check variables with extreme coefficients (especially when applying regularization) Check the sign of the coefficients

Note that this applies for basically any model: don’t just train and look at the AUC, take a look at the top misclassified instances, would they be hard for you as well? How about the top correctly classified ones? Take a look at variable importance, position of features in tree, splitting points 56

slide-57
SLIDE 57

Regression performance

There’s a difference between “predicting the future” and “extrapolating from training data”!

Use the appropriate technique

Also applies to all model types 57

slide-58
SLIDE 58

Cross-validation and Tuning

58

slide-59
SLIDE 59

Cross-validation and tuning

59

slide-60
SLIDE 60

Cross-validation and tuning

Decision trees with early stopping: 60

slide-61
SLIDE 61

Cross-validation and tuning

General train-valid-test split: … how to prevent lucky hits? 61

slide-62
SLIDE 62

Cross-validation and tuning

62

slide-63
SLIDE 63

Cross-validation and tuning

63

slide-64
SLIDE 64

Cross-validation and tuning

… what is the final model here? 64

slide-65
SLIDE 65

Cross-validation and tuning

65

slide-66
SLIDE 66

Cross-validation and tuning

# Note that we are scaling the predictors
glmnet_model <- train(annual_pm ~ ., data = dplyr::select(lur, -site_id),
                      preProcess = c("center", "scale"),
                      method = "glmnet",
                      trControl = tr)

arrange(glmnet_model$results, RMSE) %>% head

##   alpha      lambda     RMSE  Rsquared    RMSESD RsquaredSD
## 1  0.10 0.330925285 1.046882 0.8213086 0.3711204  0.1662474 <--
## 2  1.00 0.033092528 1.057797 0.8151413 0.3165820  0.1661203
## 3  0.55 0.033092528 1.058651 0.8152392 0.3179481  0.1677805
## 4  0.10 0.033092528 1.067397 0.8131885 0.3243109  0.1708488
## 5  1.00 0.003309253 1.073726 0.8113261 0.3224757  0.1711788
## 6  0.55 0.003309253 1.073969 0.8109472 0.3231762  0.1722758

66

slide-67
SLIDE 67

Cross-validation and tuning

Cross-validation is a way to protect against overfitting and to ensure validity by adding diversity through repeated runs

Prevent lucky hits

Many different types exist:

Repeated (nested) cross validation Repeated out of time Leave one out cross-validation (an extreme form of cross-validation)

Cross validation is mainly for tuning, test set for true evaluation!

Note: preprocessing steps should be fit on the training folds only, within each CV run

67
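The basic building block of all these schemes, a plain (unstratified) k-fold index split, can be sketched as follows; real libraries add stratification, repetition, and nesting on top of this:

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train, test) index lists: each index lands in exactly
    one test fold; the remaining folds form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(10, 5))
```

Leave-one-out cross-validation is simply the extreme case k = n.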

slide-68
SLIDE 68

Additional Notes

68

slide-69
SLIDE 69

What about multiclass?

Concept of confusion matrix still applies

https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

But: metrics somewhat harder to calculate (multiple “positive” classes possible here, so potentially multiple ROC curves that can be constructed and inspected!)

Averaging techniques across the curves

69

slide-70
SLIDE 70

What about multiclass?

What if your technique only supports binary classification to begin with? One simple approach is a transformation to binary:

One-vs.-all (one-vs.-rest):

- Contrast every class against all other classes
- For k classes, build k classifiers
- Assign a new observation using the highest posterior probability

One-vs.-one:

- Contrast every class against every (single) other class: a pairwise approach
- For k classes, build k(k − 1)/2 classifiers
- Assign a new observation using the majority voting rule

70
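One-vs.-rest can be sketched generically around any binary learner. Here `fit_binary` is a deliberately trivial stand-in (a centroid-distance scorer) and the data points are made up; the wrapper logic is the point:

```python
def fit_binary(X, y_binary):
    """Toy binary 'model': score = negative squared distance to the
    mean of the positive examples. Stand-in for a real learner."""
    pos = [x for x, t in zip(X, y_binary) if t]
    centroid = [sum(c) / len(pos) for c in zip(*pos)]
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, centroid))

def one_vs_rest(X, y, classes):
    """Build k binary scorers (class vs. rest); predict via max score."""
    scorers = {c: fit_binary(X, [label == c for label in y])
               for c in classes}
    def predict(x):
        return max(classes, key=lambda c: scorers[c](x))
    return predict

X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (11, 0)]
y = ["a", "a", "b", "b", "c", "c"]
predict = one_vs_rest(X, y, classes=["a", "b", "c"])
```

One-vs.-one would instead train a scorer per class pair and take a majority vote.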

slide-71
SLIDE 71

One-vs.-all

71

slide-72
SLIDE 72

One-vs.-one

72

slide-73
SLIDE 73

What about multilabel?

Evaluation: specific definitions exist for precision, recall, Jaccard index, Hamming loss, i.e. adapted to incorporate the fact that an instance can have multiple labels. What if your technique does not support multilabel learning?

Transform into binary classification (“binary relevance method”)

- Independently train one binary classifier for each label (instance has label: yes/no)
- The combined model then predicts all labels for which the respective classifiers predict a positive result (“has label”)
- Not the same as one-vs.-one or one-vs.-all
- Does not consider label relationships, but simple
- Alternatives exist: e.g. classifier chaining, see e.g. scikit.ml: http://scikit.ml/

Transform into multi-class problem

- Based on making the powerset over the labels
- E.g., if the possible labels are Dog, Cat, Duck, the label powerset representation of this problem is a multi-class classification problem with the classes a:[0 0 0], b:[1 0 0], c:[0 1 0], d:[0 0 1], e:[1 1 0], f:[1 0 1], g:[0 1 1], h:[1 1 1], where e.g. [1 0 1] denotes an example where labels Dog and Duck are present and label Cat is absent
- Simple, but leads to an explosion of classes!
- Better: ensemble methods or neural network based approaches

73
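The label powerset transformation for the Dog/Cat/Duck example above can be sketched in a couple of lines:

```python
from itertools import product

# Every binary label combination becomes one multi-class class.
labels = ["Dog", "Cat", "Duck"]
powerset_classes = list(product([0, 1], repeat=len(labels)))
# 2^3 = 8 classes; (1, 0, 1) means Dog and Duck present, Cat absent

def to_powerset_class(label_vector):
    """Map a multilabel vector to its single powerset class index."""
    return powerset_classes.index(tuple(label_vector))
```

The 2^k growth in classes is exactly the explosion mentioned above: 10 labels already yield 1024 classes.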

slide-74
SLIDE 74

Validation is hard

74

slide-75
SLIDE 75

Validation is hard

What if final test set evaluation gives bad results? (Throw away the whole project? Hunt for new data set?)

You should, but it happens Be sure to know the risks

Should feature engineering and transformation be done on the whole data set? (“It’s so hard not to”)

Definitely not. (Python packages are often more sensible in this regard)

Even when waiting to use the final test set, too much re-use of the same train/validation split leads to hidden overtraining (“I’ll just make a small parameter tuning”)

So do too many parameter combination runs (over-usage of the same data). Suddenly, the test set result will be disappointing

Not using a test set is out of the question

Remember: it is your only guarantee of generalization. If data is really lacking, simple single-shot parametric models can be considered, but definitely with risks

Some models try to avoid overfitting by themselves (see later: bootstrapping) Also, if scores are too good to be true, they most likely are (target variable “leakage”)

75

slide-76
SLIDE 76

http://scikit-learn.org/stable/modules/calibration.html
http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression

- As seen above, some models can give you poor estimates of the class probabilities, and some do not even support probability prediction
- Sampling the training set also biases the probability distribution
- Logistic regression returns well-calibrated predictions by default, as it directly optimizes log-loss. In contrast, other methods return biased probabilities, with different biases per method
- E.g. methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1, because variance in the underlying base models will bias predictions that should be near zero or one away from these values
- Calibration methods exist to fix this

Probability calibration

76

slide-77
SLIDE 77

Monitoring and Maintenance

77

slide-78
SLIDE 78

Monitoring

Validation doesn’t stop at deployment

Input data:

- Distributions: check categorical levels, check missing values
- System stability index

Output predictions:

- Hard to monitor unless true outcomes are tracked
- But we can monitor the prediction distribution

https://www.dataminingapps.com/2016/10/what-is-a-system-stability-index-ssi-and-how-can-it-be-used-to-monitor-population- stability/

78
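A stability index over binned score (or input) distributions can be sketched as follows, using the common (p_new − p_old) · ln(p_new / p_old) form; see the linked article for the exact SSI definition used in the course:

```python
from math import log

def stability_index(expected, actual):
    """Stability index between two binned population fractions
    (e.g. score deciles at development time vs. at deployment)."""
    return sum((a - e) * log(a / e) for e, a in zip(expected, actual))

# Identical distributions give 0; a shifted population gives > 0
drift = stability_index([0.5, 0.3, 0.2], [0.3, 0.3, 0.4])
```

A small index means the scored population still looks like the development population; a large one is a typical retraining trigger.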

slide-79
SLIDE 79

Monitoring

What to report, which performance metrics

- “Does AUC matter?”
- Excel, scorecard, traffic lights
- API (REST)
- Oftentimes the prediction probability is combined with another factor: risk, consequence, damage, value…

79

slide-80
SLIDE 80

Monitoring

Monitoring your population at deployment…

The goal is to set up a host of warnings which initiate a retraining (maintenance) trigger

80

slide-81
SLIDE 81

Monitoring

Visibility and Monitoring for Machine Learning Models There’s a great paper that I highly recommend you read by this guy named D. Sculley, who is a professor at Tufts, engineer at Google. He says machine learning is the high interest credit card of technical debt because machine learning is basically spaghetti code that you deploy on purpose. That’s essentially what machine learning is. You’re taking a bunch of data, generating a bunch of numbers and then putting it in a rush intentionally. And then trying to figure out, reverse engineer how does this thing actually work. There are a bunch of terrible downstream consequences to this. It’s a risky thing to do. So you only want to do it when you absolutely have to. http://blog.launchdarkly.com/visibility-and-monitoring-for-machine-learning-models/

What’s your ML test score? A rubric for production ML systems

https://research.google.com/pubs/pub45742.html

Hidden Technical Debt in Machine Learning Systems

https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems

81

slide-82
SLIDE 82

Monitoring

What’s your ML test score? A rubric for production ML systems

82

slide-83
SLIDE 83

The road to data science maturity

Domino Data Labs, https://www.dominodatalab.com/resources/data-science-maturity-model/

83

slide-84
SLIDE 84

Data science platforms as the solution?

A lot of “data science platforms” entered the market in previous years

H2O Domino Databricks Dataiku Anaconda MLflow CometML …

84

slide-85
SLIDE 85

Data science platforms as the solution?

Most of these focus on the data scientist in the role of a model developer:

Versioning: for models (but also data?) Collaboration Scalable execution Multiple language/environment support

But it should also be about:

Reproducibility (model, data, environment freezing) Acyclic dependency graphs Monitoring Scheduling Checks, warning that retraining is in order Models as data

(More on this in the tooling session) 85