Data Council 4/17/19 | Alexander Ratner Accelerating Machine Learning with Training Data Management
Alex Ratner, Stanford University
Training data is the key ingredient in ML. But it’s created and managed in manual, ad hoc ways.
KEY RESEARCH QUESTION: Can we add mathematical & systems structure to the way people build & manage training sets today?
Running Example: Chest X-Ray Triage
Figure label: “Abnormal”
Motivation: Case prioritization, e.g. for low-resource hospitals
[Dunnmon et al., Radiology 2018; Dunnmon & Ratner et al., 2019; Khandewala et al., NeurIPS ML4H 2017]
Running Example: Chest X-Ray Triage
Pipeline: Unlabeled data (multi-modal) → Training set creation → Model development → Model (e.g. ResNet)
Model development: 2-3 days; ± 1 point due to model choice
Model dev is often radically easier today!
(All scores: ROC AUC) [Dunnmon et al., Radiology 2018; Dunnmon & Ratner et al., 2019; Khandewala et al., NeurIPS ML4H 2017]
Running Example: Chest X-Ray Triage
Pipeline: Unlabeled data (multi-modal) → Training set creation → Model development → Model (e.g. ResNet)
Training set creation: 8 months; ± 9 points due to training set size, ± 8 points due to training set quality
Model development: 2-3 days; ± 1 point due to model choice
Training data is often the key differentiator
(All scores: ROC AUC) [Dunnmon et al., Radiology 2018; Dunnmon & Ratner et al., 2019; Khandewala et al., NeurIPS ML4H 2017]
Challenges of Training Data Management
• Volume is critical
  • But training data is largely hand-labeled: slow & expensive
• Quality is critical
  • But this is challenging to assess
• Flexibility is critical
  • But training sets are completely static
  • Example label schemas: Y ∈ {“Abnormal”, “Normal”} vs. Y ∈ {“Urgent”, “Emergent”, “Normal”}
Our research: building systems that
1. Let users specify training sets in higher-level, programmatic ways
2. Clean and integrate this input
3. Use as training data for ML models
A new way to specify ML models, in hours rather than months
Our Research: Training Data Management Systems
Pipeline: Unlabeled data → Labeling → Augmentation → Multi-Task Supervision → Model
This talk: Three systems that support and accelerate critical steps of training data creation & management
Our Research: Training Data Management Systems
1. Labeling (Snorkel): Programmatically label training data
Our Research: Training Data Management Systems
2. Augmentation (TANDA): Programmatically transform training data
Our Research: Training Data Management Systems
3. Multi-Task Supervision (MeTaL): Programmatically integrate training data across multiple tasks (Y_1, Y_2, Y_3)
Our Research: Training Data Management Systems
Deployments: Industry, Government, Medicine
Our Research: Training Data Management Systems
1. Labeling (Snorkel)  2. Augmentation (TANDA)  3. Multi-Task Supervision (MeTaL)
Problem: Hand-labeling is slow, expensive, & static
Idea: Enable users to label training data programmatically
KEY TECHNICAL IDEA: View training set labeling as a noisy programmatic process that we can model
The Snorkel Pipeline (snorkel.stanford.edu)
Pipeline: UNLABELED DATA → LABELING FUNCTIONS → LABEL MODEL → TRAINING DATABASE → END MODEL

    import re

    def LF_short_report(x):
        # Very short reports tend to be normal.
        if len(x.words) < 15:
            return "NORMAL"

    def LF_off_shelf_classifier(x):
        # Reuse an existing classifier as a noisy label source.
        if off_shelf_classifier(x) == 1:
            return "NORMAL"

    def LF_pneumo(x):
        # Fires if the text contains "pneumo" (e.g. pneumonia, pneumothorax).
        if re.search(r"pneumo.*", x.text):
            return "ABNORMAL"

    def LF_ontology(x):
        # Fires if any word is in a disease ontology (DISEASES assumed to be a set).
        if DISEASES & set(x.words):
            return "ABNORMAL"

• LABELING FUNCTIONS: Users write labeling functions to heuristically label data
• LABEL MODEL: Snorkel cleans and combines the LF labels
• END MODEL: The resulting training database is used to train an ML model
Note: No hand-labeled training data!
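The first two pipeline steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Snorkel's API: `make_report`, the tiny `DISEASES` set, and the lowercase function names are hypothetical, and the combining step is an unweighted majority vote used as a simple stand-in for the label model, which instead estimates LF accuracies.

```python
import re
from collections import Counter
from types import SimpleNamespace

ABSTAIN = None  # as on the slide: an LF that returns nothing abstains

DISEASES = {"pneumonia", "pneumothorax", "effusion"}  # hypothetical mini-ontology


def make_report(text):
    """Hypothetical report object exposing .text and .words."""
    return SimpleNamespace(text=text, words=re.findall(r"[a-z]+", text.lower()))


def lf_short_report(x):
    return "NORMAL" if len(x.words) < 15 else ABSTAIN


def lf_pneumo(x):
    return "ABNORMAL" if re.search(r"pneumo.*", x.text) else ABSTAIN


def lf_ontology(x):
    return "ABNORMAL" if DISEASES & set(x.words) else ABSTAIN


LFS = [lf_short_report, lf_pneumo, lf_ontology]


def apply_lfs(reports, lfs=LFS):
    """Step 1: one row per report, one column per labeling function."""
    return [[lf(x) for lf in lfs] for x in reports]


def majority_vote(row):
    """Step 2 (stand-in): unweighted vote over non-abstaining LFs; ties abstain."""
    votes = Counter(v for v in row if v is not ABSTAIN)
    if not votes:
        return ABSTAIN
    (top, n), *rest = votes.most_common()
    if rest and rest[0][1] == n:
        return ABSTAIN
    return top


reports = [
    make_report("Lungs are clear."),
    make_report("Findings: Focal consolidation and pneumothorax."),
]
for row in apply_lfs(reports):
    print(row, "->", majority_vote(row))
```

In this toy run only the short-report heuristic fires on the first report, so it is labeled NORMAL; on the second report the two ABNORMAL votes outweigh the short-report vote. A majority vote treats every LF as equally reliable, which is exactly the assumption that modeling LF accuracies is meant to relax.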
Snorkel: Real-World Deployments (snorkel.stanford.edu)
Industry, Government, Science & Medicine
In many cases: From person-months of hand-labeling to hours
(1) Writing Labeling Functions
Pipeline: UNLABELED DATA → LABELING FUNCTIONS → LABEL MODEL → TRAINING DATABASE → END MODEL
Step 1: Users write labeling functions (LF_short_report, LF_off_shelf_classifier, LF_pneumo, LF_ontology) to heuristically label data. Snorkel then cleans and combines the LF labels, and the resulting training database is used to train an ML model.
(1) Writing Labeling Functions

    import re

    def LF_short_report(x):
        if len(x.words) < 15:
            return "NORMAL"

    def LF_off_shelf_classifier(x):
        if off_shelf_classifier(x) == 1:
            return "NORMAL"

    def LF_pneumo(x):
        if re.search(r"pneumo.*", x.text):
            return "ABNORMAL"

    def LF_ontology(x):
        # DISEASES is assumed to be a set of disease terms.
        if DISEASES & set(x.words):
            return "ABNORMAL"

Labeling function: λ : 𝒳 ↦ 𝒴 ∪ {0}, mapping data (𝒳) to labels (𝒴) or abstain (0)
A simple abstraction for expressing domain heuristics or other noisy label sources
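One way to read the signature λ : 𝒳 ↦ 𝒴 ∪ {0} in code is as a wrapper that turns a slide-style heuristic (returns a label string, or nothing) into a function that always returns an integer, with 0 reserved for abstain. The sketch below is an assumption-laden illustration: the `as_lambda` decorator and the integer encoding in `LABELS` are hypothetical and are not taken from the talk or from Snorkel.

```python
import re
from typing import Callable, Optional

ABSTAIN = 0                             # the extra element in Y ∪ {0}
LABELS = {"ABNORMAL": 1, "NORMAL": 2}   # assumed integer encoding of Y


def as_lambda(heuristic: Callable[[str], Optional[str]]) -> Callable[[str], int]:
    """Wrap a heuristic so it maps every input to a label in Y or to ABSTAIN."""
    def lf(x: str) -> int:
        out = heuristic(x)
        return ABSTAIN if out is None else LABELS[out]
    lf.__name__ = heuristic.__name__
    return lf


@as_lambda
def lf_short_report(text):
    # Domain heuristic: short reports tend to be normal.
    if len(text.split()) < 15:
        return "NORMAL"


@as_lambda
def lf_pneumo(text):
    # Pattern match: any mention containing "pneumo".
    if re.search(r"pneumo.*", text):
        return "ABNORMAL"


print(lf_short_report("Lungs are clear."))   # a label from Y
print(lf_pneumo("Lungs are clear."))         # abstain
```

Making abstain an explicit value means every labeling function is total: it returns something for every data point, so downstream code can tell "no opinion" apart from a label.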
Simple Example: Pattern Matching
Report: “Indication: Chest pain. Findings: Focal consolidation and pneumothorax.”

    def LF_pneumo(x):
        if re.search(r"pneumo.*", x.text):
            return "ABNORMAL"

Labeling functions (LFs) are simple UDFs for expressing domain expertise
Simple Example: Pattern Matching
Report: “Indication: Chest pain. Findings: No focal consolidation or pneumothorax …”

    def LF_pneumo(x):
        if re.search(r"pneumo.*", x.text):
            return "ABNORMAL"

LFs can also be noisy; we can estimate their accuracies to handle this (next)
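The noise in this example is easy to reproduce. The sketch below applies the same pattern to the two report snippets from the slides, passed as plain strings rather than report objects (an assumption made to keep the example self-contained; `lf_pneumo` taking a string is hypothetical).

```python
import re


def lf_pneumo(text):
    """Same pattern as LF_pneumo, applied to a plain string."""
    if re.search(r"pneumo.*", text):
        return "ABNORMAL"
    return None  # abstain


positive = "Indication: Chest pain. Findings: Focal consolidation and pneumothorax."
negated = "Indication: Chest pain. Findings: No focal consolidation or pneumothorax …"

for report in (positive, negated):
    print(lf_pneumo(report), "<-", report)
```

Both reports are labeled ABNORMAL, because the pattern sees the word "pneumothorax" but not the negation in front of it. The labeling function is useful but imperfect, which is why its accuracy has to be accounted for when its votes are combined.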
A Simple Formalism for Weak Supervision Strategies
• Pattern matching [e.g. Hearst 1992, Zhang 2017]

    def LF_pneumo(x):
        if re.search(r"pneumo.*", x.text):
            return "ABNORMAL"

• Distant supervision [e.g. Mintz 2009]

    def LF_ontology(x):
        if DISEASES & set(x.words):
            return "ABNORMAL"

• Domain heuristics

    def LF_short_report(x):
        if len(x.words) < 15:
            return "NORMAL"

• Functions of features [e.g. Varma 2017]

    def LF_circular_mass(x):
        c = off_shelf_circle_finder(x)[0]
        if c.radius > 1:
            return "ABNORMAL"

And many others: crowdsourcing, other models, etc.