EXECUTIVE BRIEFING: WHY MACHINE LEARNED MODELS CRASH AND BURN IN PRODUCTION (AND WHAT TO DO ABOUT IT)
Dr. David Talby
MODEL DEVELOPMENT ≠ SOFTWARE DEVELOPMENT
1. The moment you put a model in production, it starts degrading
GARBAGE IN, GARBAGE OUT [Sanders & Saxe, Sophos Group, Proceedings of Blackhat 2017] “The greatest model, trained on data inconsistent with the data it actually faces in the real world, will at best perform unreliably, and at worst fail catastrophically.”
CONCEPT DRIFT: AN EXAMPLE
Sources of drift: • Locality (epidemics) • Seasonality • Changes in the hospital / population • Impact of deploying the system • A combination of all of the above
At scale: Medical claims > 4.7 billion • Pharmacy claims > 1.2 billion • Providers > 500,000 • Patients > 120 million
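One concrete way to watch for drift like this is sketched below: compare a feature's distribution in recent production traffic against the training data with a two-sample Kolmogorov-Smirnov test. The claim-amount feature, window sizes, and alert threshold are illustrative assumptions, not details from the talk.

```python
# A minimal drift-monitoring sketch: flag when a production feature's
# distribution no longer looks like the training distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
# Stand-in for the training data and for a recent production window.
train_claim_amounts = rng.lognormal(mean=5.0, sigma=1.0, size=50_000)
# Seasonal / epidemic shift in production: same shape, higher typical amounts.
prod_claim_amounts = rng.lognormal(mean=5.3, sigma=1.1, size=10_000)

stat, p_value = ks_2samp(train_claim_amounts, prod_claim_amounts)
if p_value < 0.01:
    print(f"drift alert: KS={stat:.3f}, p={p_value:.1e}; consider retraining")
else:
    print("feature distribution still looks consistent with training data")
```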
[D. Sculley et al., Google, NIPS 2015]
HOW FAST DEPENDS ON THE PROBLEM (MUCH MORE THAN ON YOUR ALGORITHM)
Examples along the spectrum, from always changing to never changing:
• Cyber security • Automated trading • Real-time ad bidding models/rules • Online social networking • Google or Amazon search • Banking & eCommerce fraud models • Natural language & social behavior models • Political & economic models • Physical models: face recognition, voice recognition, climate models
SO PUT THE RIGHT PLATFORM IN PLACE (MEASURE, RETRAIN, REDEPLOY)
Matching approaches along the same spectrum, from always changing to never changing (a retrain-and-promote sketch follows below):
• Real-time online learning via passive feedback
• Active learning via active feedback
• Daily/weekly automated batch retraining
• Automated ‘challenger’ online evaluation & deployment
• Automated ensemble, boosting & feature selection techniques
• Hand-crafted machine learned models
• Hand-crafted rules
• Traditional scientific method: test a hypothesis
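A minimal sketch of the ‘challenger’ retrain-and-evaluate step, assuming scikit-learn-style estimators with predict_proba; the function name and promotion margin are illustrative, not from the talk.

```python
# Retrain a challenger on a recent data window, compare it to the serving
# champion on the newest labeled data, and promote only on a clear win.
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

def retrain_and_maybe_promote(champion, recent_X, recent_y, eval_X, eval_y,
                              min_improvement=0.01):
    """Train a challenger on a fresh window; promote it only if it clearly wins."""
    challenger = clone(champion).fit(recent_X, recent_y)
    champion_auc = roc_auc_score(eval_y, champion.predict_proba(eval_X)[:, 1])
    challenger_auc = roc_auc_score(eval_y, challenger.predict_proba(eval_X)[:, 1])
    if challenger_auc >= champion_auc + min_improvement:
        return challenger, challenger_auc   # redeploy the challenger
    return champion, champion_auc           # keep serving the current model

# Call this from a daily/weekly scheduler and log both scores over time,
# so the retraining cadence matches how fast the problem actually changes.
```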
2. You rarely get to deploy the same model twice
REUSING MODELS IS A REPUTATION HAZARD
Model | Model's goal | Sample size | Context
LACE index (2010) | 30-day mortality or readmission | 4,812 | 11 hospitals in Ontario, 2002-2006
Charlson comorbidity index (1987) | 1-year mortality | 607 | 1 hospital in NYC, April 1984
Elixhauser comorbidity index (1998) | Hospital charges, length of stay & in-hospital mortality | 1,779,167 | 438 hospitals in CA, 1992
Cotter PE, Bhalla VK, Wallis SJ, Biram RW. Predicting readmissions: poor performance of the LACE index in an older UK population. Age Ageing. 2012 Nov;41(6):784-9.
DON’T ASSUME YOU’RE READY FOR YOUR NEXT CUSTOMER
• Healthcare / natural language: clinical coding for outpatient radiology • Infer procedure code (CPT), 90% overlap
• Cyber security / deep learning: detect malicious URLs • Train on one dataset, test on others
[Chart: precision and recall of the specialized vs. non-specialized model]
IT’S NOT ABOUT HOW ACCURATE YOUR MODEL IS (IT’S ABOUT HOW FAST YOU CAN TUNE IT ON MY DATA) [D. Sculley et al., Google, NIPS 2015]
3. It’s really hard to know how well you’re doing
HOW OPTIMIZELY (ALMOST) GOT ME FIRED [Peter Borden, SumAll, June 2014] “it seemed we were only seeing about 10%-15% of the predicted lift, so we decided to run a little experiment. And that’s when the wheels totally flew off the bus.”
THE PITFALLS OF A/B TESTING [Alice Zheng, Dato, June 2015]
• How many false positives can we tolerate?
• What does the p-value mean?
• Separation of experiences
• How many observations do we need?
• Multiple models, multiple hypotheses
• Which metric?
• How much change counts as real change?
• Is the distribution of the metric Gaussian?
• How long to run the test?
• One- or two-sided test?
• Are the variances equal?
• Catching distribution drift
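Several of these questions have concrete, checkable answers. Below is a minimal sketch, assuming a Bernoulli (conversion-style) metric and the normal approximation: a two-sided two-proportion z-test plus a rough per-arm sample-size estimate. The numbers are made up for illustration.

```python
# Two-sided two-proportion z-test, plus the per-arm sample size needed to
# detect a given absolute lift with the usual alpha/power settings.
from math import sqrt, ceil
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))          # two-sided: count both tails
    return z, p_value

def required_sample_size(p_baseline, min_detectable_lift, alpha=0.05, power=0.8):
    """Per-arm sample size to detect an absolute lift (normal approximation)."""
    p_treat = p_baseline + min_detectable_lift
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    var = p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)
    return ceil(var * (z_alpha + z_beta) ** 2 / min_detectable_lift ** 2)

# Example: 2,000 users per arm, 3.0% vs 3.6% conversion.
z, p = two_proportion_ztest(60, 2000, 72, 2000)
print(f"z={z:.2f}, p={p:.3f}")             # not significant at alpha=0.05
print(required_sample_size(0.03, 0.006))   # users per arm to detect +0.6 pp
```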
FIVE PUZZLING OUTCOMES EXPLAINED [Ron Kohavi et al., Microsoft, August 2012]
• The primacy and novelty effects
• Regression to the mean
• Best practice: A/A testing
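The A/A-testing best practice is easy to sketch: run the experiment pipeline with both arms drawn from the same distribution and check that “significant” results appear at roughly the alpha rate. The simulation below is self-contained and illustrative only.

```python
# A/A test simulation: both arms receive identical treatment, so any
# "significant" difference is a false positive by construction.
from math import sqrt
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
alpha, runs, n, p_true = 0.05, 2000, 2000, 0.03
false_positives = 0
for _ in range(runs):
    conv_a = rng.binomial(n, p_true)        # arm A: true rate 3%
    conv_b = rng.binomial(n, p_true)        # arm B: identical treatment
    p_a, p_b = conv_a / n, conv_b / n
    p_pool = (conv_a + conv_b) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * (2 / n))
    p_value = 2 * norm.sf(abs(p_b - p_a) / se)
    false_positives += p_value < alpha

# Should print something close to alpha (0.05); a much higher rate means the
# metric, randomization, or logging pipeline is broken before any real A/B test.
print(false_positives / runs)
```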
4. Often, the real modeling work only starts in production
SEMI-SUPERVISED LEARNING
IN NUMBERS
50+ schemes (and counting) • 99.9999% ‘good’ messages • 6+ months per case
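A minimal sketch of one common semi-supervised recipe, self-training (pseudo-labeling) with scikit-learn: a small set of hand-labeled cases plus a large unlabeled pool. The synthetic data, the imbalance, and the confidence threshold are illustrative assumptions; the talk does not specify this exact approach.

```python
# Self-training: fit on the few hand-labeled messages, then pseudo-label the
# unlabeled pool wherever the model is confident, and refit iteratively.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in for a message stream: ~10% "bad", the rest "good".
X, y = make_classification(n_samples=20_000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pretend only ~5% of the training pool was reviewed by an analyst;
# scikit-learn's convention marks unlabeled examples with the label -1.
rng = np.random.default_rng(0)
y_partial = np.where(rng.random(len(y_train)) < 0.05, y_train, -1)

# Pseudo-label only the examples the base model is very confident about.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.95)
model.fit(X_train, y_partial)

hand_labeled = int((y_partial != -1).sum())
total_labeled = int((model.transduction_ != -1).sum())
print("hand-labeled:", hand_labeled, "pseudo-labeled:", total_labeled - hand_labeled)
print("held-out accuracy:", model.score(X_test, y_test))
```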
ADVERSARIAL LEARNING
5. Your best people are needed on the project after going to production
SOFTWARE DEVELOPMENT
DESIGN: The most important, hardest-to-change technical decisions are made here.
BUILD & TEST: The riskiest & most reused code components are built and tested first.
DEPLOY: The first deployment is hands-on; then we automate it and iterate to build lower-priority features.
OPERATE: Ongoing, repetitive tasks are either automated away or handed off to support & operations.
MODEL DEVELOPMENT
EXPERIMENT: Design & run as many experiments as fast as possible, with new inputs, features & labeled data.
MODEL: Feature engineering, model selection & optimization are done for the 1st model built.
DEPLOY & AUTOMATE: Automate the retrain or active learning pipeline, including online metrics & feedback collection.
MEASURE: Online metrics are key in production, since results will often differ from offline ones.
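A minimal sketch of the MEASURE step: join logged production predictions with outcome labels that arrive later, and compute an online metric to compare against the offline evaluation. The log schema (request_id, score, label) and the pandas-based join are assumptions for illustration, not the talk's actual pipeline.

```python
# Compute an online metric by joining the serving log with delayed feedback.
import pandas as pd
from sklearn.metrics import roc_auc_score

predictions = pd.DataFrame({          # written by the serving layer
    "request_id": [1, 2, 3, 4, 5],
    "score":      [0.91, 0.12, 0.55, 0.08, 0.73],
})
feedback = pd.DataFrame({             # outcomes arrive hours or days later
    "request_id": [1, 2, 3, 5],
    "label":      [1, 0, 0, 1],
})

# Inner join: only score the requests whose outcome is already known.
joined = predictions.merge(feedback, on="request_id", how="inner")
online_auc = roc_auc_score(joined["label"], joined["score"])
print(f"online AUC on {len(joined)} labeled requests: {online_auc:.3f}")
# Track this over time; a widening gap versus the offline AUC is a drift signal.
```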
To conclude…
MODEL DEVELOPMENT ≠ SOFTWARE DEVELOPMENT
• Rethink your development process
• Set the right expectations with your customers
• Deploy a platform & plan for the DataOps effort in production
THANK YOU! david@pacific.ai @davidtalby in/davidtalby