EXECUTIVE BRIEFING: WHY MACHINE LEARNED MODELS CRASH AND BURN IN PRODUCTION (AND WHAT TO DO ABOUT IT)
Dr. David Talby
MODEL DEVELOPMENT ≠ SOFTWARE DEVELOPMENT
1. The moment you put a model in production, it starts degrading
GARBAGE IN, GARBAGE OUT [Sanders & Saxe, Sophos Group, Proceedings of Blackhat 2017] “The greatest model, trained on data inconsistent with the data it actually faces in the real world, will at best perform unreliably, and at worst fail catastrophically.”
CONCEPT DRIFT: AN EXAMPLE
Sources of drift: • Locality (epidemics) • Seasonality • Changes in the hospital / population • Impact of deploying the system • A combination of all of the above
At scale: Medical claims > 4.7 billion • Pharmacy claims > 1.2 billion • Providers > 500,000 • Patients > 120 million
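One concrete way to watch for drift like this is sketched below: compare a feature's distribution in recent production traffic against the training data with a two-sample Kolmogorov-Smirnov test. The claim-amount feature, window sizes, and alert threshold are illustrative assumptions, not details from the talk.

```python
# A minimal drift-monitoring sketch: flag when a production feature's
# distribution no longer looks like the training distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
# Stand-in for the training data and for a recent production window.
train_claim_amounts = rng.lognormal(mean=5.0, sigma=1.0, size=50_000)
# Seasonal / epidemic shift in production: same shape, higher typical amounts.
prod_claim_amounts = rng.lognormal(mean=5.3, sigma=1.1, size=10_000)

stat, p_value = ks_2samp(train_claim_amounts, prod_claim_amounts)
if p_value < 0.01:
    print(f"drift alert: KS={stat:.3f}, p={p_value:.1e}; consider retraining")
else:
    print("feature distribution still looks consistent with training data")
```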
[D. Sculley et al., Google, NIPS 2015]
HOW FAST DEPENDS ON THE PROBLEM (MUCH MORE THAN ON YOUR ALGORITHM)
Examples along the spectrum, from always changing to never changing:
• Cyber security • Automated trading • Real-time ad bidding models/rules • Online social networking • Google or Amazon search • Banking & eCommerce fraud models • Natural language & social behavior models • Political & economic models • Physical models: face recognition, voice recognition, climate models
SO PUT THE RIGHT PLATFORM IN PLACE (MEASURE, RETRAIN, REDEPLOY)
Matching approaches along the same spectrum, from always changing to never changing (a retrain-and-promote sketch follows below):
• Real-time online learning via passive feedback
• Active learning via active feedback
• Daily/weekly automated batch retraining
• Automated ‘challenger’ online evaluation & deployment
• Automated ensemble, boosting & feature selection techniques
• Hand-crafted machine learned models
• Hand-crafted rules
• Traditional scientific method: test a hypothesis
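A minimal sketch of the ‘challenger’ retrain-and-evaluate step, assuming scikit-learn-style estimators with predict_proba; the function name and promotion margin are illustrative, not from the talk.

```python
# Retrain a challenger on a recent data window, compare it to the serving
# champion on the newest labeled data, and promote only on a clear win.
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

def retrain_and_maybe_promote(champion, recent_X, recent_y, eval_X, eval_y,
                              min_improvement=0.01):
    """Train a challenger on a fresh window; promote it only if it clearly wins."""
    challenger = clone(champion).fit(recent_X, recent_y)
    champion_auc = roc_auc_score(eval_y, champion.predict_proba(eval_X)[:, 1])
    challenger_auc = roc_auc_score(eval_y, challenger.predict_proba(eval_X)[:, 1])
    if challenger_auc >= champion_auc + min_improvement:
        return challenger, challenger_auc   # redeploy the challenger
    return champion, champion_auc           # keep serving the current model

# Call this from a daily/weekly scheduler and log both scores over time,
# so the retraining cadence matches how fast the problem actually changes.
```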
2. You rarely get to deploy the same model twice
REUSING MODELS IS A REPUTATION HAZARD
Model | Model's goal | Sample size | Context
LACE index (2010) | 30-day mortality or readmission | 4,812 | 11 hospitals in Ontario, 2002-2006
Charlson comorbidity index (1987) | 1-year mortality | 607 | 1 hospital in NYC, April 1984
Elixhauser comorbidity index (1998) | Hospital charges, length of stay & in-hospital mortality | 1,779,167 | 438 hospitals in CA, 1992
Cotter PE, Bhalla VK, Wallis SJ, Biram RW. Predicting readmissions: poor performance of the LACE index in an older UK population. Age Ageing. 2012 Nov;41(6):784-9.
DON’T ASSUME YOU’RE READY FOR YOUR NEXT CUSTOMER
• Healthcare / natural language: clinical coding for outpatient radiology • Infer procedure code (CPT), 90% overlap
• Cyber security / deep learning: detect malicious URLs • Train on one dataset, test on others
[Chart: precision and recall of the specialized vs. non-specialized model]
IT’S NOT ABOUT HOW ACCURATE YOUR MODEL IS (IT’S ABOUT HOW FAST YOU CAN TUNE IT ON MY DATA) [D. Sculley et al., Google, NIPS 2015]
3. It’s really hard to know how well you’re doing
HOW OPTIMIZELY (ALMOST) GOT ME FIRED [Peter Borden, SumAll, June 2014] “it seemed we were only seeing about 10%-15% of the predicted lift, so we decided to run a little experiment. And that’s when the wheels totally flew off the bus.”
THE PITFALLS OF A/B TESTING [Alice Zheng, Dato, June 2015]
• How many false positives can we tolerate?
• What does the p-value mean?
• Separation of experiences
• How many observations do we need?
• Multiple models, multiple hypotheses
• Which metric?
• How much change counts as real change?
• Is the distribution of the metric Gaussian?
• How long to run the test?
• One- or two-sided test?
• Are the variances equal?
• Catching distribution drift
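Several of these questions have concrete, checkable answers. Below is a minimal sketch, assuming a Bernoulli (conversion-style) metric and the normal approximation: a two-sided two-proportion z-test plus a rough per-arm sample-size estimate. The numbers are made up for illustration.

```python
# Two-sided two-proportion z-test, plus the per-arm sample size needed to
# detect a given absolute lift with the usual alpha/power settings.
from math import sqrt, ceil
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))          # two-sided: count both tails
    return z, p_value

def required_sample_size(p_baseline, min_detectable_lift, alpha=0.05, power=0.8):
    """Per-arm sample size to detect an absolute lift (normal approximation)."""
    p_treat = p_baseline + min_detectable_lift
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    var = p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)
    return ceil(var * (z_alpha + z_beta) ** 2 / min_detectable_lift ** 2)

# Example: 2,000 users per arm, 3.0% vs 3.6% conversion.
z, p = two_proportion_ztest(60, 2000, 72, 2000)
print(f"z={z:.2f}, p={p:.3f}")             # not significant at alpha=0.05
print(required_sample_size(0.03, 0.006))   # users per arm to detect +0.6 pp
```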
FIVE PUZZLING OUTCOMES EXPLAINED [Ron Kohavi et al., Microsoft, August 2012]
• The primacy and novelty effects
• Regression to the mean
• Best practice: A/A testing
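The A/A-testing best practice is easy to sketch: run the experiment pipeline with both arms drawn from the same distribution and check that “significant” results appear at roughly the alpha rate. The simulation below is self-contained and illustrative only.

```python
# A/A test simulation: both arms receive identical treatment, so any
# "significant" difference is a false positive by construction.
from math import sqrt
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
alpha, runs, n, p_true = 0.05, 2000, 2000, 0.03
false_positives = 0
for _ in range(runs):
    conv_a = rng.binomial(n, p_true)        # arm A: true rate 3%
    conv_b = rng.binomial(n, p_true)        # arm B: identical treatment
    p_a, p_b = conv_a / n, conv_b / n
    p_pool = (conv_a + conv_b) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * (2 / n))
    p_value = 2 * norm.sf(abs(p_b - p_a) / se)
    false_positives += p_value < alpha

# Should print something close to alpha (0.05); a much higher rate means the
# metric, randomization, or logging pipeline is broken before any real A/B test.
print(false_positives / runs)
```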
4. Often, the real modeling work only starts in production
SEMI-SUPERVISED LEARNING
IN NUMBERS
50+ schemes (and counting) • 99.9999% ‘good’ messages • 6+ months per case
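A minimal sketch of one common semi-supervised recipe, self-training (pseudo-labeling) with scikit-learn: a small set of hand-labeled cases plus a large unlabeled pool. The synthetic data, the imbalance, and the confidence threshold are illustrative assumptions; the talk does not specify this exact approach.

```python
# Self-training: fit on the few hand-labeled messages, then pseudo-label the
# unlabeled pool wherever the model is confident, and refit iteratively.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in for a message stream: ~10% "bad", the rest "good".
X, y = make_classification(n_samples=20_000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pretend only ~5% of the training pool was reviewed by an analyst;
# scikit-learn's convention marks unlabeled examples with the label -1.
rng = np.random.default_rng(0)
y_partial = np.where(rng.random(len(y_train)) < 0.05, y_train, -1)

# Pseudo-label only the examples the base model is very confident about.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.95)
model.fit(X_train, y_partial)

hand_labeled = int((y_partial != -1).sum())
total_labeled = int((model.transduction_ != -1).sum())
print("hand-labeled:", hand_labeled, "pseudo-labeled:", total_labeled - hand_labeled)
print("held-out accuracy:", model.score(X_test, y_test))
```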
ADVERSARIAL LEARNING
5. Your best people are needed on the project after going to production
SOFTWARE DEVELOPMENT
DESIGN: The most important, hardest-to-change technical decisions are made here.
BUILD & TEST: The riskiest & most reused code components are built and tested first.
DEPLOY: The first deployment is hands-on; then we automate it and iterate to build lower-priority features.
OPERATE: Ongoing, repetitive tasks are either automated away or handed off to support & operations.
MODEL DEVELOPMENT
EXPERIMENT: Design & run as many experiments as fast as possible, with new inputs, features & labeled data.
MODEL: Feature engineering, model selection & optimization are done for the 1st model built.
DEPLOY & AUTOMATE: Automate the retrain or active learning pipeline, including online metrics & feedback collection.
MEASURE: Online metrics are key in production, since results will often differ from offline ones.
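A minimal sketch of the MEASURE step: join logged production predictions with outcome labels that arrive later, and compute an online metric to compare against the offline evaluation. The log schema (request_id, score, label) and the pandas-based join are assumptions for illustration, not the talk's actual pipeline.

```python
# Compute an online metric by joining the serving log with delayed feedback.
import pandas as pd
from sklearn.metrics import roc_auc_score

predictions = pd.DataFrame({          # written by the serving layer
    "request_id": [1, 2, 3, 4, 5],
    "score":      [0.91, 0.12, 0.55, 0.08, 0.73],
})
feedback = pd.DataFrame({             # outcomes arrive hours or days later
    "request_id": [1, 2, 3, 5],
    "label":      [1, 0, 0, 1],
})

# Inner join: only score the requests whose outcome is already known.
joined = predictions.merge(feedback, on="request_id", how="inner")
online_auc = roc_auc_score(joined["label"], joined["score"])
print(f"online AUC on {len(joined)} labeled requests: {online_auc:.3f}")
# Track this over time; a widening gap versus the offline AUC is a drift signal.
```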
To conclude…
MODEL DEVELOPMENT ≠ SOFTWARE DEVELOPMENT
• Rethink your development process
• Set the right expectations with your customers
• Deploy a platform & plan for the DataOps effort in production
THANK YOU! david@pacific.ai @davidtalby in/davidtalby