The need for clinical (and trialist) commonsense in AI algorithm design
Samuel Finlayson, MD-PhD Candidate, Harvard-MIT
We’re all really excited about machine learning, and we should be.
[Figure sources: eyediagnosis.net; en.wikipedia.org/wiki/File:ImageNet_error_rate_history_(just_systems).svg]
For all the excitement, the clinical benefit of AI is still largely hypothetical
• Very few prospective trials of medical AI have been reported in any specialty
• Per Eric Topol’s review, only 4 as of 1/2019
• Good news: 2 of the 4 were in ophthalmology!
• Many models struggle to reproduce their findings in new patient populations
• No trials, to my knowledge, have demonstrated improved clinical outcomes
[Example from figure: CC-Cruiser reported 98.87% accuracy in a small trial 1, but 87.4% (vs. 99.1% for physicians) in trial 2]
Goal for this tutorial: Equip attendees to identify common pitfalls in medical AI that make informed clinical experts essential to development and deployment
Review: The ML development pipeline
Model Design → Model Training → Model Evaluation → Model Deployment → Clinical Integration → Decision-making Impact
Data feeding the pipeline: Train/Val Data and Labels and Test Data and Labels (retrospective data); Prospective Data (at deployment)
What do we need clinical experts to be asking?
Key questions to ask during dataset curation
How might our model be tainted with information from the future?
Hypothetical example #1:
• Plan: train an ML algorithm to detect diabetic retinopathy (DR)
• A postdoc downloads all fundus images from your clinical database, using discharge diagnoses to gather DR cases and healthy controls
What could go wrong? (Hint: see figure)
[Figure source: endotext.com]
How might our model be tainted with information from the future?
Answer:
• Laser scars are present in the images!
• The model may learn to “diagnose” the treatment instead of the disease
• This is one example of label leakage, a very common problem
[Figure source: endotext.com]
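To make the fix concrete: a minimal sketch, assuming a hypothetical pandas schema (the patient_id, exam_date, and treatment_date columns are illustrative stand-ins, not from the talk), of excluding any image acquired on or after a patient’s first laser treatment so that treatment artifacts cannot leak into the label:

```python
# Hedged sketch, hypothetical schema: drop post-treatment images so the
# model cannot "diagnose" laser scars instead of the disease itself.
import pandas as pd

# Hypothetical extracts: one row per image / per treatment event.
images = pd.read_csv("fundus_images.csv", parse_dates=["exam_date"])
treatments = pd.read_csv("laser_treatments.csv", parse_dates=["treatment_date"])

# Earliest laser treatment per patient, if any.
first_tx = (
    treatments.groupby("patient_id", as_index=False)["treatment_date"]
    .min()
    .rename(columns={"treatment_date": "first_treatment_date"})
)
images = images.merge(first_tx, on="patient_id", how="left")

# Keep never-treated patients plus images taken strictly before the
# first treatment, so no evidence of therapy reaches the model.
pre_treatment = images[
    images["first_treatment_date"].isna()
    | (images["exam_date"] < images["first_treatment_date"])
]
```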
How might our test set be contaminated with information from our training set?
Hypothetical example #1 (cont’d):
• The postdoc tries again, limiting images to exams prior to treatment
• All case and control images are split randomly into a train and a test set
What could go wrong? (Hint: see figure)
[Figure: “Training Image 1” and “Test Image 1”. Image source: Wikipedia]
How might our test set be contaminated with information from our training set?
Answer:
• Images from the same patients appear in both the train and test sets!
• Test-set metrics will therefore overestimate model accuracy, providing limited evidence of performance on unseen patients
• This is one example of train-test set leakage
[Figure: “Training Image 1” and “Test Image 1” drawn from the same patient. Image source: Wikipedia]
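The usual remedy is to split by patient rather than by image. A minimal sketch using scikit-learn’s GroupShuffleSplit, with toy arrays standing in for real image features and labels:

```python
# Patient-level train/test split: no patient contributes images to
# both sets. X, y, and patient_ids are toy stand-ins, aligned by row.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))               # toy image features
y = rng.integers(0, 2, size=100)             # toy labels
patient_ids = rng.integers(0, 30, size=100)  # several images per patient

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# Sanity check: the two splits share no patients.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```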
How might our model be confounded?
Hypothetical example #2:
• You build an ML classifier to detect optic disc edema for neurologic screening
• Images are gathered from the ED and the outpatient clinic with no regard to their site of origin
How could this data acquisition process lead to confounding?
How might our model be confounded?
(One) answer:
• Imaging models have been shown to depend on “non-imaging” variables
• In ophthalmology, we know that age, sex, etc. are trivially predicted by models from images
• The problem is very acute with drug, billing, and text data
[Figure: model performance with no matching, matching on patient features, and matching on patient features + healthcare process. Source: Badgeley et al., 2018]
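One way a team might probe for this kind of confounding (a sketch under assumptions, not a method from the talk): check how well the site of origin can be predicted from the model’s inputs or learned features. The arrays below are synthetic stand-ins:

```python
# Confounding audit sketch: if ED vs. clinic is easily predicted from
# the features, the classifier may be detecting the healthcare process
# rather than optic disc edema.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 32))  # e.g., penultimate-layer features
site = rng.integers(0, 2, size=500)    # 0 = outpatient clinic, 1 = ED

auc = cross_val_score(
    LogisticRegression(max_iter=1000),
    features, site, cv=5, scoring="roc_auc",
).mean()

# An AUC well above 0.5 means site information is baked into the
# features, and site-driven shortcuts deserve a closer look.
print(f"Site-of-origin AUC: {auc:.2f}")
```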
Key questions to ask during model evaluation
Is our model performance consistent across patient subpopulations?
Hypothetical example #3:
• At the request of reviewer #2, your team evaluates model performance stratified by race, finding large differences (see plot on right)
• You gather more cases from underrepresented groups and retrain the model, but it doesn’t improve the situation
What could be happening?
[Figure: model error vs. race. Source: Chen et al., NeurIPS 2018]
Is our model performance consistent across subpopulations?
Answer:
• Not all model bias is created equal
• Different biases require different solutions
• The fix could require more data, more features, or different models
• See the brilliant Chen et al., NeurIPS 2018
[Figure: model error vs. race. Source: Chen et al., NeurIPS 2018]
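Before deciding on a fix, it helps to quantify the gaps. A minimal sketch of subgroup-stratified evaluation on toy data; a real audit would use the study’s actual labels, predictions, and demographic fields:

```python
# Stratified evaluation sketch: report the error rate and sample size
# per subgroup rather than a single pooled metric.
import pandas as pd

df = pd.DataFrame({
    "y_true": [0, 1, 1, 0, 1, 0, 1, 1],  # toy labels
    "y_pred": [0, 1, 0, 0, 1, 1, 1, 0],  # toy predictions
    "group":  ["A", "A", "B", "B", "B", "C", "C", "C"],
})

per_group = (
    df.assign(error=df["y_true"] != df["y_pred"])
      .groupby("group")["error"]
      .agg(error_rate="mean", n="size")
)
print(per_group)
# Small n per group means wide uncertainty: more data may shrink
# variance-driven gaps, but bias-driven gaps need different fixes.
```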
Key questions to ask during model deployment
How might the data we feed our model change over time?
Hypothetical example #4:
• Your highly accurate ML tool suddenly begins to fail several years after clinical deployment
• The IT team insists the model has not changed
What might be going on?
How might the data we feed our model change over time?
Answer:
• Clinical performance is not fixed!
• Changes in the input data can disrupt model performance: dataset shift
• Model evaluation and development must be an ongoing process
[Figure: performance degrades after a new EHR system is installed. Source: Nestor et al., 2018]
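One hedged sketch of what ongoing monitoring could look like: compare the live distribution of each model input against a frozen training-time baseline, here with a two-sample Kolmogorov-Smirnov test on synthetic data (the test choice and threshold are illustrative, not from the talk):

```python
# Dataset-shift monitoring sketch: flag inputs whose current
# distribution has drifted from the training-time baseline.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, size=5000)  # feature at training time
current = rng.normal(loc=0.4, size=5000)   # same feature after, e.g., an EHR change

stat, p_value = ks_2samp(baseline, current)
if p_value < 0.01:  # illustrative threshold; tune for the use case
    print(f"Possible dataset shift (KS={stat:.3f}, p={p_value:.1e})")
```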
Key questions to ask as we assess model impact
Can we anticipate any unintended consequences?
• Diagnosis does not equal outcomes! (Welch, 2017)
• Mismatched incentives → adversarial behavior (Finlayson et al., 2019)
Conclusions
• Many of the most pernicious challenges of medical machine learning are study design problems
• What sources of leakage, bias, and confounding might be baked into the design?
• How does the target population compare with the study population?
• How might populations evolve over time, and how should they be monitored?
• Can we anticipate any unintended consequences of deployment?
• Clinicians and clinical researchers (trialists, epidemiologists, biostatisticians) have been asking similar questions for decades
• Delivering on the promise of medical ML requires a true partnership between clinical research and machine learning expertise
Thank you
Invitation to speak: Michael Abramoff
Feedback on presentation: the lab team of Isaac Kohane, DBMI at Harvard