Advanced Analytics in Business [D0S07a] / Big Data Platforms & Technologies [D0S06a]: Model Evaluation
Overview
Introduction Classification performance Regression performance Cross-validation and tuning Additional notes Monitoring and maintenance
2
The analytics process
3
It’s all about generalization
You have trained a model on a particular data set (e.g. a decision tree). This is your train data (a.k.a. development or estimation data): used to build the model
Performance on your train data gives you an initial idea of your model's validity
But not much more than that
Much more important: ensure this model will do well on unseen data (out-of-time, out-of-sample, out-of-population)
As predictive models are going to be "put to work": validation needed!
Test (a.k.a. hold-out) data: used to objectively measure performance! A strict separation between training and test set is needed! 4
It’s all about generalization
At the very least, use a test set
Typically 1/3 of the data
Stratification: the same class distribution in training and test
5
What do we want to validate?
Out-of-sample Out-of-time Out-of-population
Not possible to foresee everything that will happen in the future, as you are by definition limited to the data you have now
But your duty to be as thorough as possible
6
Classification Performance
7
Threshold: 0.50

True label   Prediction   Predicted label   Correct?
no           0.11         no                correct
no           0.20         no                correct
yes          0.85         yes               correct
yes          0.84         yes               correct
yes          0.80         yes               correct
no           0.65         yes               incorrect
yes          0.44         no                incorrect
no           0.10         no                correct
yes          0.32         no                incorrect
yes          0.87         yes               correct
yes          0.61         yes               correct
yes          0.60         yes               correct
yes          0.78         yes               correct
no           0.61         yes               incorrect
Confusion matrix
8
Confusion matrix
Depends on the threshold! 9
Metrics
These depend on the confusion matrix, and hence on the threshold! 10
Accuracy = (TP + TN) / total = (7 + 3) / 14 = 0.71
Recall (sensitivity) = TP / (TP + FN) = 7 / 9 = 0.78: "How many of the actual positives did we predict as such?"
Precision = TP / (TP + FP) = 7 / 9 = 0.78: "How many of the predicted positives are actually positive?"
Specificity = TN / (TN + FP) = 3 / 5 = 0.60
Balanced accuracy = (recall + specificity) / 2 = 0.5 × 0.78 + 0.5 × 0.60 = 0.69
Common metrics
11
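The metrics above can be verified with a short Python sketch (no libraries needed); the data is the 14-row prediction table from the slides:

```python
# Recompute the slide's metrics at threshold 0.50.
true = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]
score = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44, 0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]

threshold = 0.50
pred = ["yes" if s > threshold else "no" for s in score]

tp = sum(1 for t, p in zip(true, pred) if t == "yes" and p == "yes")
tn = sum(1 for t, p in zip(true, pred) if t == "no" and p == "no")
fp = sum(1 for t, p in zip(true, pred) if t == "no" and p == "yes")
fn = sum(1 for t, p in zip(true, pred) if t == "yes" and p == "no")

accuracy = (tp + tn) / (tp + tn + fp + fn)      # (7 + 3) / 14
recall = tp / (tp + fn)                          # 7 / 9
precision = tp / (tp + fp)                       # 7 / 9
specificity = tn / (tn + fp)                     # 3 / 5
balanced_accuracy = (recall + specificity) / 2
```
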
True label   Prediction
no           0.11
no           0.20
yes          0.85
yes          0.84
yes          0.80
no           0.65
yes          0.44
no           0.10
yes          0.32
yes          0.87
yes          0.61
yes          0.60
yes          0.78
no           0.61
(Recall here our discussion on “well-calibrated” classifiers) (Note: one could also define multiple thresholds)
Tuning the threshold
For each possible threshold t ∈ T, with T the set of all predicted probabilities, we can obtain a confusion matrix and hence different metrics
So which threshold to pick?
12
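Picking a threshold can be automated by sweeping every predicted probability as a candidate and scoring each one. A minimal Python sketch on the deck's 14 predictions, using F1 as an illustrative metric of interest (the slide leaves the metric open):

```python
# Sweep all predicted probabilities as candidate thresholds and pick the
# one that maximizes the chosen metric (F1 here, for illustration).
true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
score = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44, 0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]

def f1_at(threshold):
    pred = [1 if s >= threshold else 0 for s in score]
    tp = sum(p and t for p, t in zip(pred, true))
    fp = sum(p and not t for p, t in zip(pred, true))
    fn = sum((not p) and t for p, t in zip(pred, true))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

best_threshold = max(set(score), key=f1_at)
```

On this toy data the sweep lands on 0.32, a lower threshold than the default 0.50: it accepts two extra false positives to recover the missed positives.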
Tuning the model?
For most models, it's hard to push them towards optimizing your metric of choice
They'll often inherently optimize for accuracy given the training set. In most cases, you will be interested in something else
The class imbalance present in the training set might conflict with a model's notion of accuracy. You might want to focus on recall or precision, or…
What can we do?
Tuning the threshold on your metric of interest
Adjusting the model parameters
Adjusting the target definition
Sampling/filtering the data set
Applying misclassification costs
Applying instance weighting (an easy way to do this: duplicate instances)
Adjusting the loss function (if the model supports doing so, and even then it is oftentimes still tied to an accuracy concern)
13
Tuning the threshold
14
Applying misclassification costs
Let's go on a small detour and illustrate the basic problem with a setting you'll encounter often: a binary classification problem where the class of interest (the positive class) occurs rarely compared to the negative class
Say fraud only occurs in 1% of cases in the training data
Almost all techniques you run out of the box will show this in your confusion matrix:

                     Actual negative   Actual positive
Predicted negative   TN: 99            FN: 1
Predicted positive   FP: 0             TP: 0
15
Applying misclassification costs
What's happening here?

                     Actual negative   Actual positive
Predicted negative   TN: 99            FN: 1
Predicted positive   FP: 0             TP: 0

Remember that the model will optimize for accuracy, and gets an accuracy of 99%
That's why you should never believe people who only report on accuracy
“No worries, I’ll just pick a stricter threshold”
Doesn’t always work! How do I tell my model that I am willing to make some mistakes on the negative side to catch the positives?
16
Applying misclassification costs
What we would like to do is set misclassification costs, with C(j, k) the cost of predicting class j when the true class is k:

                     Actual negative   Actual positive
Predicted negative   C(0, 0) = 0       C(0, 1) = 5
Predicted positive   C(1, 0) = 1       C(1, 1) = 0

Mispredicting a positive as a negative is 5 times as bad as mispredicting a negative as a positive
How to determine the costs?
Use real average observed costs (hard to find in many settings)
Expert estimate
Inverse class distribution…
17
Applying misclassification costs
Inverse class distribution: 99% negative versus 1% positive

C(1, 0) = 0.99 / 0.99 = 1
C(0, 1) = 0.99 / 0.01 = 99

                     Actual negative   Actual positive
Predicted negative   C(0, 0) = 0       C(0, 1) = 99
Predicted positive   C(1, 0) = 1       C(1, 1) = 0
18
Applying misclassification costs
With a given cost matrix (no matter how we define it), we can then calculate the expected loss

                     Actual negative   Actual positive
Predicted negative   C(0, 0) = 0       C(0, 1) = 5
Predicted positive   C(1, 0) = 1       C(1, 1) = 0

l(x, j) is the expected loss for classifying an observation x as class j:

l(x, j) = Σk p(k|x) C(j, k)

For binary classification:

l(x, 0) = p(0|x) C(0, 0) + p(1|x) C(0, 1) = (here) p(1|x) C(0, 1)
l(x, 1) = p(0|x) C(1, 0) + p(1|x) C(1, 1) = (here) p(0|x) C(1, 0)
19
Applying misclassification costs
Classify an observation as positive if the expected loss for classifying it as positive is smaller than the expected loss for classifying it as negative:

l(x, 1) < l(x, 0) → classify as positive (1), negative (0) otherwise

                     Actual negative   Actual positive
Predicted negative   C(0, 0) = 0       C(0, 1) = 5
Predicted positive   C(1, 0) = 1       C(1, 1) = 0

Example: a cost-insensitive classifier predicts p(1|x) = 0.22:

l(x, 0) = p(0|x) C(0, 0) + p(1|x) C(0, 1) = 0.78 × 0 + 0.22 × 5 = 1.10
l(x, 1) = p(0|x) C(1, 0) + p(1|x) C(1, 1) = 0.78 × 1 + 0.22 × 0 = 0.78

→ Classify as positive
20
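The expected-loss decision rule above can be sketched in a few lines of Python; C[j][k] follows the slides' convention (the cost of predicting class j when the true class is k), using the same cost matrix with C(0, 1) = 5:

```python
# Expected-loss decision rule: classify as positive when l(x,1) < l(x,0).
C = [[0, 5],   # predict negative: C(0,0)=0, C(0,1)=5
     [1, 0]]   # predict positive: C(1,0)=1, C(1,1)=0

def expected_loss(p1, j):
    """Expected loss of predicting class j when p(1|x) = p1."""
    p0 = 1 - p1
    return p0 * C[j][0] + p1 * C[j][1]

def classify(p1):
    return 1 if expected_loss(p1, 1) < expected_loss(p1, 0) else 0

# Slide example: p(1|x) = 0.22 -> l(x,0) = 1.10, l(x,1) = 0.78 -> positive
```

Note how classify(0.22) returns positive even though 0.22 is far below the naive 0.50 threshold: the asymmetric costs shift the effective threshold down to TCS = 1/6.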
Applying misclassification costs
Remark: when l(x, 1) = l(x, 0), then

p(0|x) C(1, 0) + p(1|x) C(1, 1) = p(0|x) C(0, 0) + p(1|x) C(0, 1)

and with p(0|x) = 1 − p(1|x), solving for p(1|x) gives the cost-sensitive threshold:

p(1|x) = (C(1, 0) − C(0, 0)) / (C(1, 0) − C(0, 0) + C(0, 1) − C(1, 1)) = TCS

With C(1, 0) = C(0, 1) = 1 and C(1, 1) = C(0, 0) = 0:

TCS = (1 − 0) / (1 − 0 + 1 − 0) = 0.5
21
Applying misclassification costs
                     Actual negative   Actual positive
Predicted negative   C(0, 0) = 0       C(0, 1) = 5
Predicted positive   C(1, 0) = 1       C(1, 1) = 0

Example: a cost-insensitive classifier predicts p(1|x) = 0.22:

l(x, 0) = p(0|x) C(0, 0) + p(1|x) C(0, 1) = 0.78 × 0 + 0.22 × 5 = 1.10
l(x, 1) = p(0|x) C(1, 0) + p(1|x) C(1, 1) = 0.78 × 1 + 0.22 × 0 = 0.78

TCS = 1 / (1 + 5) = 0.1667 ≤ 0.22 → Classify as positive
22
Sampling approaches
From the above, a new cost-sensitive class distribution can be obtained based on the cost-sensitive threshold as follows:

New number of positive observations: n′1 = n1 × (1 − TCS) / TCS
Or, new number of negative observations: n′0 = n0 × TCS / (1 − TCS)

E.g. using 1 positive versus 99 negatives (class-inverse cost matrix):

                     Actual negative   Actual positive
Predicted negative   C(0, 0) = 0       C(0, 1) = 99
Predicted positive   C(1, 0) = 1       C(1, 1) = 0

TCS = 1 / (1 + 99) = 0.01

n′1 = 1 × (1 − 0.01) / 0.01 = 99, or:
n′0 = 99 × 0.01 / (1 − 0.01) = 1
23
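The derivation above can be checked numerically; a small Python sketch computing TCS and the implied resampling counts (function names are made up for this example):

```python
# Cost-sensitive threshold and the resampling counts it implies:
#   n1' = n1 * (1 - TCS) / TCS   (new positive count)
#   n0' = n0 * TCS / (1 - TCS)   (new negative count)
def tcs(C):
    # C[j][k]: cost of predicting class j when the true class is k
    return (C[1][0] - C[0][0]) / (C[1][0] - C[0][0] + C[0][1] - C[1][1])

def resampled_counts(n0, n1, C):
    t = tcs(C)
    return n0 * t / (1 - t), n1 * (1 - t) / t

# Inverse class distribution for 1 positive vs. 99 negatives:
C = [[0, 99], [1, 0]]
n0_new, n1_new = resampled_counts(99, 1, C)  # approximately 1 and 99
```
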
Sampling approaches
We now arrive at a nice conclusion: sampling the data set so the minority class count equals the majority class count boils down to biasing the classifier in the same way as using a cost matrix constructed from the inverse class imbalance
24
Oversampling (upsampling)
25
Undersampling (downsampling)
26
Smart sampling
SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2002)
Oversample minority class by creating synthetic examples
Step 1: for each minority class observation, determine the k (e.g. 1) nearest neighbors
Step 2: generate synthetic examples by interpolating between the instance and its neighbors
Can be combined with undersampling majority class
27
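A minimal SMOTE-style sketch in plain Python, for illustration only (the real imblearn implementation is more elaborate); the function name, its arguments, and the toy points are all made up for this example:

```python
# SMOTE-style oversampling (after Chawla et al., 2002): for each synthetic
# point, pick a minority observation, find its k nearest minority neighbors,
# and interpolate a new point between the observation and one neighbor.
import random

def smote(minority, k=1, n_new=None, rng=None):
    rng = rng or random.Random(0)
    n_new = n_new if n_new is not None else len(minority)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x (excluding x itself)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
new_points = smote(minority, k=2, n_new=3)
```

Because each synthetic point lies on a segment between two existing minority points, it stays inside the minority region rather than being drawn at random.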
Smart sampling
See e.g. imblearn : https://imbalanced-learn.readthedocs.io/en/stable/ 28
Sampling approaches
Note: combinations of over- and undersampling are possible
You can also try oversampling the minority class beyond the 1:1 level (this boils down to using even more extreme costs in the cost matrix)
Very closely related to the field of "cost-sensitive learning":
Setting misclassification costs (some implementations allow this as well) Cost sensitive logistic regression Cost sensitive decision trees (uses modified entropy and information gain measures) Cost sensitive evaluation measures (e.g. Average Misclassification Cost)
29
Sampling approaches
Only on your training set! Test set remains untouched
Basically, a way to indicate to the learner that both classes are equally important
On the test set, you can use AUC or the metric you are actually interested in
Note that the accuracy on the test set after up/down sampling will most likely be lower than in the "just always predict the majority class" case
I.e. your model will now start to identify cases as being fraudulent; some of these will be false positives: the price to pay to get out the true positives
Remember the precision versus recall trade-off
Experimentation with the right amounts of over/undersampling is required: it depends on the setting
SMOTE and other intelligent sampling techniques work well, but are not magic: you'll still need some positives
Also, don't expect SMOTE to create "hidden, future, …" cases of positive instances
Class imbalance occurs in many settings! 30
Sampling approaches
Some techniques also support instance weighting: not defined per cell in the confusion matrix but per instance
Indicate that some instances are more important to get right Similar derivation is possible here: a rough approach consists of duplicating instance rows that are deemed more important Again: biasing the training in the same way Again: only in the training data (the fact that some instances are more important can then be evaluated with a corresponding evaluation scheme during testing)
31
Sampling approaches
Sampling biases the training set and the probability ranges your model outputs. This is fine if you're only interested in a ranking, but it distorts a calibrated view on the probabilities
In case this is important, you can unbias the probability output using (Saerens et al., 2002):

p_unbiased(Ci|x) = [ (p(Ci) / ps(Ci)) × ps(Ci|x) ] / [ Σj=1..m (p(Cj) / ps(Cj)) × ps(Cj|x) ]

with Ci class i, ps(Ci|x) the biased probability (on the sampled data set), ps(Ci) the prior probability (proportion) of class Ci on the sampled training data set, and p(Ci) the original prior (proportion) before sampling (e.g. 1% vs. 99%)
32
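The Saerens correction amounts to reweighting each biased posterior by p(C)/ps(C) and renormalizing; a Python sketch (the function name and the example priors are illustrative assumptions):

```python
# Saerens et al. (2002) prior correction: reweight the biased posteriors
# by the ratio of true to sampled class priors, then renormalize.
def unbias(posteriors, sampled_priors, true_priors):
    weights = [p / ps for p, ps in zip(true_priors, sampled_priors)]
    scaled = [w * post for w, post in zip(weights, posteriors)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Model trained on a 50/50 resampled set, but the real prior is 1% positive:
corrected = unbias([0.3, 0.7], sampled_priors=[0.5, 0.5], true_priors=[0.99, 0.01])
```

The corrected positive probability drops far below the raw 0.7, reflecting how rare positives really are in the original population.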
Example
library(caret)
library(tidyverse)
library(magrittr)
library(ROCR)
library(PRROC)
library(ROSE)

data <- read.csv('data.csv')
table(data$TARGET)
#    0    1
# 4748  252

train.index <- createDataPartition(data$TARGET, p = .7, list = FALSE)
train <- data[ train.index, ]
test  <- data[-train.index, ]

dtree <- train(TARGET ~ ., data = train, method = "rpart", tuneLength = 10)
predictions <- predict(dtree, test, type = 'prob')
33
Example
train.sampled <- ROSE(TARGET ~ ., data = train, p = 0.5)$data
table(train.sampled$TARGET)
#    0    1
# 1754 1747

dtree.sampled <- train(TARGET ~ ., data = train.sampled, method = "rpart", tuneLength = 10)
predictions <- predict(dtree.sampled, test, type = 'prob')
34
Example
After rescaling: 35
(Back to) classification performance
Let's get back on track. We have seen in any case that accuracy is often not the only metric we should focus on:
Recall and precision concerns are much more important
These depend on the threshold, however
We have already seen a recall/precision curve. Others?
36
ROC curve
Make a table with sensitivity and specificity for each possible cut-off
The receiver operating characteristic (ROC) curve plots sensitivity (TP rate) versus 1 − specificity (FP rate) for each possible cut-off
A perfect model has a sensitivity of 1 and a specificity of 1 (i.e. the upper left corner)
The ROC curve can be summarized by the area underneath it (area under the curve, AUC)
The AUC represents the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative instance (Hanley and McNeil, 1983)
37
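The probabilistic interpretation of the AUC gives a direct (if O(n²)) way to compute it without building the curve, shown here as a Python sketch on the deck's 14 predictions:

```python
# AUC via its probabilistic interpretation (Hanley and McNeil, 1983): the
# fraction of (positive, negative) pairs where the positive instance scores
# higher than the negative one; ties count for half.
def auc(true, score):
    pos = [s for t, s in zip(true, score) if t == 1]
    neg = [s for t, s in zip(true, score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
score = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44, 0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]
```

Note the result is threshold-free: unlike accuracy or F1, no cut-off has to be chosen first.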
True label   Prediction
no           0.11
no           0.20
yes          0.85
yes          0.84
yes          0.80
no           0.65
yes          0.44
no           0.10
yes          0.32
yes          0.87
yes          0.61
yes          0.60
yes          0.78
no           0.61
ROC curve
38
ROC curve
39
ROC curve
But are they completely the same? 40
ROC curve
Precision-recall or ROC curve?
Both use sensitivity, a.k.a. recall (the true positive rate)
The precision-recall curve uses precision: the rate of predicted positives that were indeed positive. The ROC curve uses 1 − specificity: one minus the rate of negatives in the data set that were predicted as such, i.e. the false positive rate
A key difference is thus that precision doesn't account for the true negatives. Precision measures the probability that a sample classified as positive is actually positive; the false positive rate measures the ratio of false positives within the negative samples. These two are not the same, and generally speaking, the ROC curve is preferred
Keep your setting in mind, however, especially when dealing with imbalanced data sets: the true negatives most likely take up a significant part of the confusion matrix, so the false positive rate tends to increase more slowly, whereas precision is unaffected by the true negatives
In other words: precision measures the probability of correct detection of positive cases, whereas the false positive rate, together with the true positive rate, measures the ability to distinguish between the classes
41
ROC curve
Precision-recall or ROC curve?
If the positive class is the minority class, and the ability to detect it is our main focus (correct detection of negatives being less important), precision and recall (and the corresponding precision-recall curve) can be preferred
If the positive class is the majority class, or we want to give equal weight to our model's ability to correctly detect both classes, sensitivity and specificity (and the corresponding ROC curve) should be preferred
Note that both curves can also be constructed by switching the class labels around, e.g. when the negative class is the minority class or you want to focus on your model's ability to correctly detect the negative cases
42
ROC curve
https://arxiv.org/pdf/1812.01388.pdf
43
ROC curve
The ROC curve can be summarized by the area underneath it (area under the curve, AUC); a similar area under the precision-recall curve exists
But: visual inspection and understanding remain required!
You might only be interested in a certain area of the curve
"Weighted" approaches exist, but are not commonly known about
Also see:
http://www.rduin.nl/presentations/ROC%20Tutorial%20Peter%20Flach/ROCtutorialPartI.pdf
https://stats.stackexchange.com/questions/225210/accuracy-vs-area-under-the-roc-curve/225221#225221
44
Cumulative accuracy profile (CAP)
Sort population from high to low score Measure (cumulative) percentage of positives for each score decile Lorenz curve, Power curve, Captured Event Plot
Note: AR = 2 × AUC − 1 45
Lift
Assume a random model handing out random probabilities
Take the top n%
See how many of them were indeed "yes", e.g. 10 / 100
Now do the same for your model, which gives e.g. 60 / 100
The lift of your model over random is 60 / 10 = 6
Lift of 1: random sorting
Can be done over distinct groups instead of cumulatively
Recall/precision at n: same concept, for top ranked n observations
Especially important if shortlists need to be delivered E.g. common in the setting of recommender systems
46
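The lift computation above can be sketched in Python; `lift_at` is a made-up helper name, and the data reuses the deck's 14 predictions:

```python
# Lift at the top n%: positive rate in the model's top-ranked slice divided
# by the base positive rate (the slide's 60/100 vs 10/100 idea).
def lift_at(true, score, frac):
    ranked = [t for _, t in sorted(zip(score, true), reverse=True)]
    top = ranked[:max(1, int(len(ranked) * frac))]
    top_rate = sum(top) / len(top)
    base_rate = sum(true) / len(true)
    return top_rate / base_rate

true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
score = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44, 0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]
```

Here the top 30% of the ranking is all positives, so the lift is 1 / (9/14) ≈ 1.56; taking 100% of the data always gives a lift of exactly 1.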
H measure
A coherent alternative to the area under the ROC curve (Hand, 2009)
The AUC is a very widely used measure of performance for classification and diagnostic rules, with the appealing property of being objective: it requires no subjective input from the user
On the other hand, the AUC has disadvantages. For example, it can give potentially misleading results if ROC curves cross
It is also fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers. This means that using the AUC is equivalent to using different metrics to evaluate different classification rules
A nice alternative, though lesser used 47
Kolmogorov-Smirnov (KS) distance
Separation measure: the distance between the cumulative score distributions P(s|y = 1) and P(s|y = 0):

KS = maxs |P(s|y = 1) − P(s|y = 0)|
48
Mahalanobis distance
Measure the Mahalanobis distance between the two mean scores:

M = |μp − μn| / σ

with σ the (pooled) standard deviation. Better than the Euclidean distance because it takes the distribution (standard deviation) of the scores into account

Closely related is the divergence measure D:

D = (μp − μn)² / (½ (σp² + σn²))
49
Regression Performance
50
Regression performance
Hypothesis tests on the coefficients, with confidence intervals:

H0: β1 = 0,  HA+: β1 > 0,  HA−: β1 < 0

r²: the coefficient of determination, the proportion of variation in y explained ("captured") by the regression model:

r² = 1 − SSE / Syy,  with  Syy = Σi (yi − ȳ)²  and  SSE = Σi (yi − ŷi)²
51
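A quick Python check of the r² definition above, on made-up numbers:

```python
# r^2 = 1 - SSE/Syy, computed by hand from the slide's definitions.
def r_squared(y, y_hat):
    y_bar = sum(y) / len(y)
    syy = sum((yi - y_bar) ** 2 for yi in y)                  # total variation
    sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))   # residual variation
    return 1 - sse / syy

y = [3.0, 5.0, 7.0, 9.0]
y_hat = [3.5, 4.5, 7.5, 8.5]
r2 = r_squared(y, y_hat)  # 1 - 1/20 = 0.95
```
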
Scatter plot
Scatter plot between the predicted and the true value of y
Calculate e.g. the Pearson correlation
52
Other measures
AIC (Akaike Information Criterion):
A relative estimate of the information lost when a given model is used to represent the process that generates the data
A trade-off between the goodness of fit and the complexity of the model; a closely related quantity for regression is Akaike's final prediction error, (SSE / n) × ((n + k) / (n − k))

BIC (Bayesian Information Criterion), a.k.a. Schwarz criterion:
Closely related to AIC

Adjusted r²: r²a = 1 − (1 − r²) × ((n − 1) / (n − k))
Adjusted for the number of predictors k in the model
Increases only if a new term improves the model more than would be expected by chance, decreases otherwise (plain r² would continue to increase even after dumping useless features in)
Most implementations provide this, even if they might simply call it r²

Others: deviance information criterion, Hannan-Quinn information criterion, Jensen-Shannon divergence, Kullback-Leibler divergence, minimum message length, …

Also look at:
MSE = Σi (yi − ŷi)² / n
MAD = Σi |yi − ŷi| / n
RMSE = √MSE (the standard deviation of the errors for an unbiased model)

Note: cost-sensitive measures and tuning exist here as well (e.g. "BSZ tuning", Bansal, Sinha, and Zhao): AMC = Σi C(yi − ŷi) / n
Regression performance
54
Regression error characteristic (REC) curve
This is a regression variant of the ROC curve Plots the error tolerance on the X-axis versus the percentage of points predicted within the tolerance on the Y-axis The resulting curve estimates the cumulative distribution function of the error The error on the X-axis can be defined as the squared error or the absolute deviation or something else
55
Regression performance
Perform some basic validation checks
Check residuals of the model Check variables with extreme coefficients (especially when applying regularization) Check the sign of the coefficients
Note that this applies to basically any model: don't just train and look at the AUC. Take a look at the top misclassified instances: would they be hard for you as well? How about the top correctly classified ones? Take a look at variable importance, the position of features in the tree, and the splitting points 56
Regression performance
There’s a difference between “predicting the future” and “extrapolating from training data”!
Use the appropriate technique
Also applies to all model types 57
Cross-validation and Tuning
58
Cross-validation and tuning
59
Cross-validation and tuning
Decision trees with early stopping: 60
Cross-validation and tuning
General train-valid-test split: … how to prevent lucky hits? 61
Cross-validation and tuning
62
Cross-validation and tuning
63
Cross-validation and tuning
… what is the final model here? 64
Cross-validation and tuning
65
Cross-validation and tuning
# Note that we are scaling the predictors
glmnet_model <- train(annual_pm ~ ., data = dplyr::select(lur, -site_id),
                      preProcess = c("center", "scale"),
                      method = "glmnet", trControl = tr)
arrange(glmnet_model$results, RMSE) %>% head
##   alpha      lambda     RMSE  Rsquared    RMSESD RsquaredSD
## 1  0.10 0.330925285 1.046882 0.8213086 0.3711204  0.1662474  <--
## 2  1.00 0.033092528 1.057797 0.8151413 0.3165820  0.1661203
## 3  0.55 0.033092528 1.058651 0.8152392 0.3179481  0.1677805
## 4  0.10 0.033092528 1.067397 0.8131885 0.3243109  0.1708488
## 5  1.00 0.003309253 1.073726 0.8113261 0.3224757  0.1711788
## 6  0.55 0.003309253 1.073969 0.8109472 0.3231762  0.1722758
66
Cross-validation and tuning
Cross-validation is a way to protect against overfitting and to ensure valid estimates by adding diversity over repeated runs
Prevents lucky hits
Many different types exist:
Repeated (nested) cross-validation
Repeated out-of-time validation
Leave-one-out cross-validation (an extreme form of cross-validation)
Cross-validation is mainly for tuning; the test set is for the true evaluation!
Note: preprocessing steps should be fitted only on the training folds within each CV run
67
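A hand-rolled k-fold sketch in Python makes the mechanics explicit (real code would use caret or scikit-learn); the "model" here is just the training mean, as a stand-in learner, and all names are made up for this example:

```python
# k-fold cross-validation: split the data into k folds, train on k-1 folds,
# evaluate on the held-out fold, and average the k scores.
def k_fold_cv(data, k, fit, metric):
    folds = [data[i::k] for i in range(k)]  # simple interleaved folds
    scores = []
    for i in range(k):
        valid = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = fit(train)
        scores.append(metric(model, valid))
    return sum(scores) / len(scores)

# Toy example: the "model" is the training mean; the metric is the negative
# mean squared error on the held-out fold.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fit = lambda train: sum(train) / len(train)
metric = lambda m, valid: -sum((v - m) ** 2 for v in valid) / len(valid)
cv_score = k_fold_cv(data, 3, fit, metric)
```

Because each fold's model never sees its own validation points, the averaged score is a less "lucky" estimate than a single split would give.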
Additional Notes
68
What about multiclass?
Concept of confusion matrix still applies
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
But: metrics somewhat harder to calculate (multiple “positive” classes possible here, so potentially multiple ROC curves that can be constructed and inspected!)
Averaging techniques across the curves
69
What about multiclass?
What if your technique only supports binary classification to begin with? One simple approach is a transformation to binary:
One-vs.-all (one-vs.-rest):
Contrast every class against all other classes For k classes, build k classifiers Assign a new observation using the highest posterior probability
One-vs.-one:
Contrast every class against every (single) other class Pairwise approach For k classes, build k(k-1)/2 classifiers Assign a new observation using the majority voting rule
70
One-vs.-all
71
One-vs.-one
72
What about multilabel?
Evaluation: specific definitions for precision, recall, Jaccard index, Hamming loss, i.e. adapted to incorporate the fact that an instance can have multiple labels What if your technique does not support it?
Transform into binary classification (“binary relevance method”)
Independently training one binary classifier for each label (instance has label yes/no) The combined model then predicts all labels for this sample for which the respective classifiers predict a positive result (“has label”) Not the same as one-vs.-one or one-vs.-all Does not consider label relationships, but simple Alternatives exist: e.g. classifier chaining, see e.g. scikit.ml : http://scikit.ml/
Transform into multi-class problem
Based on making the powerset over the labels E.g., if possible labels are Dog, Cat, Duck, the label powerset representation of this problem is a multi-class classification problem with the classes a:[0 0 0], b:[1 0 0], c:[0 1 0], d:[0 0 1], e:[1 1 0], f:[1 0 1], g:[0 1 1], h:[1 1 1] where e.g. [1 0 1] denotes an example where labels Dog and Duck are present and label Cat is absent Simple but leads to an explosion of classes! Better: ensemble methods or neural network based approaches
73
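The label powerset transformation above can be sketched in a few lines of Python; `to_powerset` is a hypothetical helper name, and the rows follow the slide's Dog/Cat/Duck example:

```python
# Label powerset: map each observed label combination to one multi-class id.
def to_powerset(label_rows):
    classes = {}   # label combination (as a tuple) -> class id
    targets = []
    for row in label_rows:
        key = tuple(row)
        if key not in classes:
            classes[key] = len(classes)
        targets.append(classes[key])
    return targets, classes

rows = [[1, 0, 1],   # Dog and Duck present, Cat absent
        [0, 0, 0],
        [1, 0, 1],
        [1, 1, 1]]
targets, classes = to_powerset(rows)
```

Note that only the combinations that actually occur become classes here; enumerating the full powerset of L labels would give 2^L classes, which is exactly the explosion the slide warns about.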
Validation is hard
74
Validation is hard
What if final test set evaluation gives bad results? (Throw away the whole project? Hunt for new data set?)
You should, but it happens Be sure to know the risks
Should feature engineering and transformation be done on the whole data set? ("It's so hard not to")
Definitely not (Python packages are often more sensible in this regard)
Even when waiting to use the final test set, too much re-use of the same train/validation split leads to hidden overtraining ("I'll just do a small parameter tuning")
So do too many parameter-combination runs (over-usage of the same data)
Suddenly, the test set result will be disappointing
Not using a test set is out of the question
Remember: only guarantee for generalization If data is really lacking, simple single-shot parametric models can be considered, but definitely with risks
Some models try to avoid overfitting by themselves (see later: bootstrapping) Also, if scores are too good to be true, they most likely are (target variable “leakage”)
75
http://scikit-learn.org/stable/modules/calibration.html
http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression
As seen above, some models give poor estimates of the class probabilities, and some do not support probability prediction at all
Sampling the training set also biases the probability distribution
Logistic regression returns well-calibrated predictions by default, as it directly optimizes log-loss; the other methods return biased probabilities, with different biases per method
E.g. methods such as bagging and random forests, which average predictions from a base set of models, can have difficulty making predictions near 0 and 1, because variance in the underlying base models will bias predictions that should be near zero or one away from these values
Calibration methods exist to fix this
Probability calibration
76
Monitoring and Maintenance
77
Monitoring
Validation doesn’t stop at deployment
Input data:
Distributions, categorical levels, missing values
System stability index
Output predictions:
Hard to monitor unless true outcomes are tracked
But we can monitor the prediction distribution
https://www.dataminingapps.com/2016/10/what-is-a-system-stability-index-ssi-and-how-can-it-be-used-to-monitor-population-stability/
78
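The system stability index from the link above compares the binned distribution of a variable (or of the output scores) between training time and production; a Python sketch, where the bin fractions and the warning thresholds are illustrative assumptions (values above roughly 0.10 to 0.25 are commonly taken to signal drift):

```python
# System/population stability index over binned distributions:
#   SSI = sum over bins of (observed - expected) * ln(observed / expected)
import math

def stability_index(expected_fracs, observed_fracs, eps=1e-6):
    ssi = 0.0
    for e, o in zip(expected_fracs, observed_fracs):
        e, o = max(e, eps), max(o, eps)  # avoid log(0) for empty bins
        ssi += (o - e) * math.log(o / e)
    return ssi

train_dist = [0.25, 0.25, 0.25, 0.25]  # score-bin fractions at training time
new_dist = [0.10, 0.20, 0.30, 0.40]    # fractions observed in production
ssi = stability_index(train_dist, new_dist)
```

Identical distributions give an SSI of exactly 0; the shifted toy distribution above scores about 0.23, which would trip a retraining warning under the rule of thumb.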
Monitoring
What to report, which performance metrics
"Does AUC matter?"
Excel, scorecard, traffic lights
API (REST)
Oftentimes the prediction probability is combined with another factor: risk, consequence, damage, value…
79
Monitoring
Monitoring your population at deployment…
The goal is to set up a host of warnings which initiate a retraining (maintenance) trigger
80
Monitoring
Visibility and Monitoring for Machine Learning Models
There's a great paper that I highly recommend you read by this guy named D. Sculley, who is a professor at Tufts and an engineer at Google. He says machine learning is the high-interest credit card of technical debt, because machine learning is basically spaghetti code that you deploy on purpose. That's essentially what machine learning is. You're taking a bunch of data, generating a bunch of numbers, and then putting it into production intentionally. And then trying to figure out, reverse engineer, how this thing actually works. There are a bunch of terrible downstream consequences to this. It's a risky thing to do. So you only want to do it when you absolutely have to.
http://blog.launchdarkly.com/visibility-and-monitoring-for-machine-learning-models/
What’s your ML test score? A rubric for production ML systems
https://research.google.com/pubs/pub45742.html
Hidden Technical Debt in Machine Learning Systems
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
81
Monitoring
What’s your ML test score? A rubric for production ML systems
82
The road to data science maturity
Domino Data Labs, https://www.dominodatalab.com/resources/data-science-maturity-model/
83
Data science platforms as the solution?
A lot of "data science platforms" have entered the market in recent years:
H2O Domino Databricks Dataiku Anaconda MLflow CometML …
84
Data science platforms as the solution?
Most of these focus on the data scientist in the role of a model developer:
Versioning: for models (but also data?) Collaboration Scalable execution Multiple language/environment support
But it should also be about:
Reproducibility (model, data, environment freezing) Acyclic dependency graphs Monitoring Scheduling Checks, warning that retraining is in order Models as data
(More on this in the tooling session) 85