

  1. ML

  2. Alice was excited! Lots of tutorials · Loads of resources · Endless examples · Fast-paced research

  3. How to even data science?

  4. How to even data science? (image: https://miro.medium.com/max/1552/1*Nv2NNALuokZEcV6hYEHdGA.png)

  5. Challenge: How to make this work in the real world?

  6. Machine Learning’s Surprises: A Checklist for Developers when Building ML Systems

  7. Hi, I’m Jade Abbott @alienelf masakhane.io

  8. Hi, I’m Jade Abbott

  9. Surprises while... trying to deploy the model · after deployment of the model · trying to improve the model

  10. Some context ❖ I won’t be talking about training machine learning models ❖ I won’t be talking about which models to choose ❖ I work primarily in deep learning & NLP ❖ I am a one-person ML team working in a startup context ❖ I work in a normal world where data is scarce and we need to collect more

  11. The Problem: match “I want to meet...” requests with “I can provide...” offers and decide “Yes, they should meet” or “No, they shouldn’t”. Architecture: Embedding + LSTM + Downstream NN

  12. The Problem: “I want to meet... someone to look after my cat” vs. “I can provide... pet sitting / cat breeding / software development / chef lessons”. Architecture: Language Model + Downstream Task

  13. The Problem: “I want to meet... someone to look after my cat” vs. “I can provide... pet sitting / cat breeding / software development / chef lessons”

  14. The Problem: “I want to meet... someone to look after my cat” vs. “I can provide... pet sitting / cat breeding / software development / chef lessons”, with The Model in between deciding “Yes, they should meet” or “No, they shouldn’t”

  15. Surprises trying to deploy the model

  16. Expectations: train & evaluate model → CI/CD → model API → unit tests → user testing

  17. Surprise #1: Is the model good enough?

  18. 75% Accuracy

  19. Performance Metrics ❖ Business needs to understand it ❖ Active discussion about pros & cons ❖ Get sign-off ❖ Threshold selection strategy
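One way to make the threshold selection strategy concrete: once the business has signed off on a minimum precision, pick the operating point from the precision-recall curve. A minimal sketch; the `pick_threshold` helper and the 0.9 precision floor are illustrative assumptions, not from the talk:

```python
# A minimal threshold-selection sketch. Assumptions: a binary classifier
# that outputs probabilities, and a business-agreed precision floor
# (0.9 here is illustrative).
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_scores, min_precision=0.9):
    """Among thresholds meeting the precision floor, pick the one with
    the highest recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    ok = precision[:-1] >= min_precision  # thresholds has one fewer entry
    if not ok.any():
        raise ValueError("no threshold reaches the agreed precision floor")
    qualifying = np.flatnonzero(ok)
    best = qualifying[np.argmax(recall[:-1][ok])]
    return thresholds[best]
```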

  20. Surprise #2: Can we trust it?

  21. Husky/Dog Classifier · Skin Cancer Detection (image sources: 1. https://visualsonline.cancer.gov/details.cfm?imageid=9288 2. https://arxiv.org/pdf/1602.04938.pdf)

  22. Husky/Dog Classifier · Skin Cancer Detection (image sources: 1. https://visualsonline.cancer.gov/details.cfm?imageid=9288 2. https://arxiv.org/pdf/1602.04938.pdf)

  23. Explanations

  24. https://github.com/marcotcr/lime https://pair-code.github.io/what-if-tool/
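For a text model like the matcher above, LIME can attribute a single prediction to individual tokens. A hedged sketch of its text explainer; `predict_proba` is stubbed here so the sketch runs, and in practice would wrap the real model:

```python
# A hedged sketch of explaining one prediction with LIME
# (https://github.com/marcotcr/lime).
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    """Assumed wrapper: list[str] -> (n, 2) class probabilities."""
    return np.tile([0.3, 0.7], (len(texts), 1))  # stub output

explainer = LimeTextExplainer(class_names=["no match", "match"])
explanation = explainer.explain_instance(
    "I can provide pet sitting",  # the prediction we want explained
    predict_proba,
    num_features=6,               # top contributing tokens
)
print(explanation.as_list())      # [(token, weight), ...]
```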

  25. Surprise #3: Will this model harm users?

  26. “Racial bias in a medical algorithm favors white patients over sicker black patients” ~ Washington Post

  27. “Racist robots, as I invoke them here, represent a much broader process: social bias embedded in technical artifacts, the allure of objectivity without public accountability” ~ Ruha Benjamin @ruha9

  28. “What are the unintended consequences of designing systems at scale on the basis of existing patterns of society?” ~ M.C. Elish & Danah Boyd, Don’t Believe Every AI You See @m_c_elish @zephoria

  29. ❖ Word2Vec has known gender and race biases ❖ It’s in English ❖ Is it robust to spelling errors? ❖ How does it perform with malicious data?

  30. Make it measurable! ❖ Word2Vec has known gender and race biases ❖ It’s in English ❖ Is it robust to spelling errors? ❖ How does it perform with malicious data?

  31. https://pair-code.github.io http://aif360.mybluemix.net https://github.com/fairlearn/fairlearn https://github.com/jphall663/awesome-machine-learning-interpretability
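“Make it measurable” can start as simply as computing the chosen metric per demographic group. A sketch with fairlearn’s MetricFrame; the data below is a toy stand-in and `gender` is an assumed sensitive feature:

```python
# A sketch of measuring a metric per group with fairlearn
# (https://github.com/fairlearn/fairlearn).
from fairlearn.metrics import MetricFrame
from sklearn.metrics import recall_score

y_true = [1, 1, 0, 1, 0, 1]               # toy labels
y_pred = [1, 0, 0, 1, 0, 1]               # toy predictions
gender = ["f", "f", "f", "m", "m", "m"]   # assumed sensitive feature

frame = MetricFrame(metrics=recall_score, y_true=y_true, y_pred=y_pred,
                    sensitive_features=gender)
print(frame.by_group)      # recall for each group
print(frame.difference())  # largest gap between groups
```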

  32. Expectations: train & evaluate model → CI/CD → model API → unit tests → user testing

  33. Reality: choose a useful metric → evaluate model → choose threshold → explain predictions → fairness framework → model API → unit tests → user testing

  34. Surprises after deploying the model

  35. Expectations — the agile cycle: user logs a bug or submits a complaint → bug tracking tool → bug triage → reproduce, debug, fix, release → user testing; the alternative is user drop-off

  36. Surprise #5: “I want to meet a doctor” / “I can provide marijuana and other drugs which improves health”

  37. Surprise #5: The model has some “bugs”

  38. Surprise #5 continued... ❖ What is a model “bug”? ❖ How to fix the bug? ❖ When is the “bug” fixed? ❖ How do I guard against regressions? ❖ “Bug” priority?

  39. Surprise #5: “I want to meet a doctor” / “I can provide marijuana and other drugs which improves health”

  40. Add to your test set — describing the “bugs”:

      I can provide...                                   I want to meet...          Prediction  Target  Outcome
      marijuana and other drugs which improves health    a doctor                   YES         NO      False Positive
      marijuana                                          a doctor                   NO          NO      True Negative
      drugs for cancer patients                          a doctor                   YES         NO      False Positive
      general practitioner services                      a doctor                   NO          YES     False Negative
      medicine                                           a drug addiction sponsor   YES         YES     True Positive
      medicine                                           a pharmacist               YES         YES     True Positive
      illegal drugs                                      a drug dealer              YES         NO      False Positive
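Turning each “bug” into a named example in the test set makes regressions checkable automatically. A sketch under the assumption of a `model.predict(provide, meet)` API returning "YES"/"NO" (illustrative, not from the talk); the slice names reuse those from slide 41:

```python
# A sketch of keeping "bugs" as named regression examples in the test
# set (pytest-style; `model` would come from a fixture).
REGRESSIONS = [
    {"name": "drugs-doctors-false-pos",
     "provide": "I can provide marijuana and other drugs which improves health",
     "meet": "I want to meet a doctor",
     "target": "NO"},
    {"name": "gp-services-false-neg",
     "provide": "I can provide general practitioner services",
     "meet": "I want to meet a doctor",
     "target": "YES"},
]

def test_regressions(model):
    failures = [r["name"] for r in REGRESSIONS
                if model.predict(r["provide"], r["meet"]) != r["target"]]
    assert not failures, f"regressed on: {failures}"
```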

  41. Is my “bug” fixed? (chart: classification error for each problem slice — politicians-false-neg, designers-too-general, drugs-doctors-false-pos, tech-too-general — per candidate model over time)

  42. How do we triage these “bugs”?

  43. How do we triage these “bugs”? % Users Affected × Normalized Error × Harm

  44. How do we triage these “bugs”?

      Problem                        Impact
      the-arts-too-general           2.931529
      health-more-specific           1.53985
      brand-marketing-social-media   1.285735
      developer                      1.054248
      1-services                     0.960129
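The impact column above is just the slide-43 product. A sketch of computing it; all the per-problem input numbers here are illustrative placeholders, not the talk’s data:

```python
# A sketch of the slide-43 triage score:
# impact = % users affected x normalized error x harm.
def impact(pct_users_affected, normalized_error, harm):
    return pct_users_affected * normalized_error * harm

problems = {
    "the-arts-too-general": impact(0.40, 0.70, 10.5),  # illustrative inputs
    "health-more-specific": impact(0.10, 0.50, 30.8),
}
for name, score in sorted(problems.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {score:.6f}")
```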

  45. Surprise #6: Is this new model better than my old model?

  46. Alice replied, rather shyly, “I—I hardly know, sir, just at present—at least I know who I was when I got up this morning, but I think I must have changed several times since then.”

  47. Why is model comparison hard?

  48. Living Test Set (chart: scores 0.8 vs 0.75)

  49. Re-evaluate ALL models (chart: scores 0.72 vs 0.75)
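Re-scoring every candidate on the *current* living test set is what keeps the numbers comparable as the set grows. A sketch; `load_model` is an assumed model-repository lookup, not a real library call:

```python
# Score every candidate model on the current living test set.
from sklearn.metrics import accuracy_score

def leaderboard(model_ids, X_test, y_test):
    scores = {}
    for model_id in model_ids:
        model = load_model(model_id)  # assumed model-repository helper
        scores[model_id] = accuracy_score(y_test, model.predict(X_test))
    return sorted(scores.items(), key=lambda kv: -kv[1])
```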

  50. Surprise #7: I demoed the model yesterday and it went off-script! What changed?

  51. Surprise #7: Why is the model doing something differently today?

  52. What changed? ❖ My data? ❖ My model? ❖ My preprocessing?

  53. Experiment — how to figure out what changed? A metadata store links every run across the repositories: experiment 3 → data ea2541df (Data Repository), code da1341bb (Code Repository, desc: “Added feature to training pipeline”), model model-3 (Model Repository), results 3 (Results Repository), run via CI/CD (run_on: 10-10-2019, completed_on: 11-10-2019)
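The record itself can be as plain as a dict, one per training run, with fields mirroring the slide; how it is persisted (file, database, tracking server) is left open:

```python
# One metadata-store record per training run; field names mirror slide 53.
experiment = {
    "experiment": 3,
    "data": "ea2541df",    # data-repository version
    "code": "da1341bb",    # code-repository commit
    "desc": "Added feature to training pipeline",
    "model": "model-3",    # model-repository entry
    "results": 3,          # results-repository entry
    "run_on": "10-10-2019",
    "completed_on": "11-10-2019",
}
```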

  54. Expectations — the agile cycle: user logs a bug or submits a complaint → bug tracking tool → prioritization → reproduce, debug, fix → user testing; the alternative is user drop-off

  55. Actual: user reports problem → model bug tracking tool → identify problem patterns → describe bug with test → calculate priority → triage. “Agile Sprint”: pick problem → retrain (gather more data for the problem, change model, create features) → evaluate model against other models and evaluate individual problems → select model

  56. Surprises maintaining and improving the model over time

  57. Expectation: pick an issue → generate/select unlabelled patterns → get them labelled → add to data set → retrain

  58. Surprise #8: User behaviour drifts

  59. Now what? ● Regularly sample data from production for training ● Regularly refresh your test set
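A minimal sketch of the first bullet, assuming `production_logs` is an iterable of recent user inputs; the labelling queue it feeds is left open:

```python
# Regularly sample production traffic for labelling, so training and
# test data track what users actually send.
import random

def sample_for_labelling(production_logs, k=200, seed=None):
    """Uniformly sample k recent production inputs for the label queue."""
    rng = random.Random(seed)
    return rng.sample(list(production_logs), k)
```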

  60. Surprise #9: Data labellers are rarely experts

  61. Surprise #10: The model is not robust

  62. Surprise #10: The model knows when it’s uncertain

  63. Techniques for detecting robustness & uncertainty ❖ Softmax predictions that are uncertain ❖ Dropout at inference ❖ Add noise to data and see how much the output changes
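The second technique, dropout at inference (often called Monte Carlo dropout), can be sketched in a few lines of PyTorch; `model` is assumed to contain nn.Dropout layers:

```python
# A sketch of Monte Carlo dropout: keep dropout active, run several
# stochastic forward passes, and treat the spread as uncertainty.
# Note: .train() also switches batch-norm layers, so real code should
# enable dropout modules selectively.
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, passes=20):
    model.train()  # keep dropout active during the forward passes
    probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(passes)])
    model.eval()
    mean = probs.mean(dim=0)  # averaged prediction
    std = probs.std(dim=0)    # large spread => the model is uncertain
    return mean, std
```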

  64. Surprise #11: Changing and updating the data so often gets messy

  65. Needed to check the following: ● Data Leakage ● Duplicates ● Distributions
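A sketch of those three checks with pandas, assuming `train`, `test`, and an older data version `old_train` are DataFrames with "text" and "label" columns (column names are assumptions):

```python
# Leakage, duplicates, and label-distribution drift between versions.
import pandas as pd

def check_dataset(train, test, old_train):
    # Data leakage: the same example must never sit in both splits.
    leaked = set(train["text"]) & set(test["text"])
    # Duplicates inside a split silently over-weight examples.
    dupes = int(train.duplicated(subset="text").sum())
    # Distributions: has the label balance shifted between data versions?
    drift = (train["label"].value_counts(normalize=True)
             - old_train["label"].value_counts(normalize=True)).abs()
    return {"leaked": len(leaked), "duplicates": dupes,
            "max_label_drift": float(drift.max())}
```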

  66. Expectation: pick an issue → generate/select unlabelled patterns → get them labelled → add to data set → retrain

  67. Actual: pick problem → generate/select unlabelled data (the model tells you which patterns it’s uncertain about) → get data labelled on a crowdsourced platform → review a sample from each labeller (approve / reject; escalate conflicting labels to an expert data-label platform) → add to a branch of the data under Data Version Control → CI/CD runs tests on the dataset → merge into the dataset → new data!

  68. The Checklist — First Release: careful metric selection · threshold selection strategy · explain predictions · fairness framework

  69. The Checklist — After First Release: ML problem tracker · problem triage strategy · reproducible training · comparable results · result management · be able to answer why

  70. The Checklist — Long-term improvements & maintenance: data refresh strategy · Data Version Control · CI/CD for data · data labeller platform + strategy · metrics for robustness & uncertainty
