Understanding and Visualizing Data Iteration in Machine Learning
Fred Hohman, Georgia Tech, @fredhohman
Kanit Wongsuphasawat, Apple, @kanitw
Mary Beth Kery, Carnegie Mellon University, @mbkery
Kayur Patel, Apple, @foil
Convolutional Neural Network on MNIST
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
Test accuracy: 0.81
How to improve performance?
A. Different architecture
B. Tweak hyperparameter
C. Adjust loss function
D. Get an ML PhD
Add more data!
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
Add 10x data
Test accuracy: 0.81 → 0.99 ✓
“clean and grow the training set”
“repeated this process until accuracy improved enough”
This is a data iteration!
https://twitter.com/karpathy/status/1231378194948706306
[Diagram labels: World, Data, Model; Data Iteration, Model Iteration]
Understanding and Visualizing Data Iteration
Contributions
• Identify common data iterations and challenges through practitioner interviews at Apple
• CHAMELEON: Interactive visualization for data iteration
• Case studies on real datasets
Interviews to Understand Data Iteration Practice
• Semi-structured interviews with ML researchers, engineers, and managers at Apple
• 23 practitioners across 13 teams
Participant Information (domain: specialization; # of people)
• Computer vision: large-scale classification, object detection, video analysis, visual search; 8
• Natural language processing: text classification, question answering, language understanding; 8
• Applied ML + Systems: platform and infrastructure, crowdsourcing, annotation, deployment; 5
• Sensors: activity recognition; 1
“Most of the time, we improve performance more by adding additional data or cleaning data rather than changing the model [code].” — Applied ML practitioner in computer vision
Interviews to Understand Data Iteration Practice
Findings Summary
Why do Data Iteration?
• Data improves performance
• Data bootstraps modeling
• The world changes, so must your data
Data Iteration Frequency
• Models: monthly → daily
• Data: monthly → per minute
Entangled Iterations
• Separate model and data iterations to ensure fair comparisons
Interviews to Understand Data Iteration Practice
Common Data Iterations
+ Add sampled instances: Gather more data randomly sampled from population
+ Add specific instances: Gather more data intentionally for specific label or feature range
+ Add synthetic instances: Gather more data by creating synthetic data or augmenting existing data
+ Add labels: Add and enrich instance annotations
- Remove instances: Remove noisy and erroneous outliers
~ Modify features, labels: Clean, edit, or fix data
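The six common data iterations above can each be read as a small function from one dataset version to the next. The following is a minimal sketch under that reading, not code from the talk: the `Instance` type, all function names, and the toy values are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Instance:
    """Hypothetical record: a feature tuple plus an optional label."""
    features: tuple
    label: Optional[str] = None


def add_sampled(data: List[Instance], pool: List[Instance], k: int,
                rng: random.Random) -> List[Instance]:
    """+ Add sampled instances: draw k instances at random from the pool."""
    return data + rng.sample(pool, k)


def add_specific(data: List[Instance], pool: List[Instance],
                 keep: Callable[[Instance], bool]) -> List[Instance]:
    """+ Add specific instances: take only pool instances matching a predicate."""
    return data + [x for x in pool if keep(x)]


def add_synthetic(data: List[Instance],
                  augment: Callable[[Instance], Instance]) -> List[Instance]:
    """+ Add synthetic instances: append one augmented copy of every instance."""
    return data + [augment(x) for x in data]


def add_labels(data: List[Instance],
               annotate: Callable[[Instance], str]) -> List[Instance]:
    """+ Add labels: fill in annotations that are still missing."""
    return [x if x.label is not None else replace(x, label=annotate(x))
            for x in data]


def remove_instances(data: List[Instance],
                     is_bad: Callable[[Instance], bool]) -> List[Instance]:
    """- Remove instances: drop noisy or erroneous outliers."""
    return [x for x in data if not is_bad(x)]


def modify(data: List[Instance],
           fix: Callable[[Instance], Instance]) -> List[Instance]:
    """~ Modify features, labels: clean, edit, or fix every instance."""
    return [fix(x) for x in data]


# Toy walk-through with made-up values; each step yields a new data version.
rng = random.Random(0)
pool = [Instance((float(i),), "pos" if i % 2 else "neg") for i in range(10)]
v1 = [Instance((1.0,), "pos"), Instance((100.0,), None)]
v2 = add_sampled(v1, pool, 3, rng)
v3 = add_specific(v2, pool, lambda x: x.label == "neg")
v4 = add_labels(v3, lambda x: "neg")
v5 = remove_instances(v4, lambda x: x.features[0] > 50)
v6 = modify(v5, lambda x: replace(x, features=(x.features[0] / 10,)))
v7 = add_synthetic(v6, lambda x: replace(x, features=(x.features[0] + 0.01,)))
```

Returning a new list at every step keeps earlier versions intact, which makes it possible to compare versions afterwards.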
Interviews to Understand Data Iteration Practice
Challenges of Data Iteration
• Tracking experimental and iteration history
• When to “unfreeze” data versions
• When to stop collecting data
• Manual failure case analysis
• Building data blacklists
CHAMELEON: Understanding and Visualizing Data Iteration
• Retroactively track and explore data iterations and metrics over versions
• Attribute model metric change to data iterations
• Understand model sensitivity over data versions
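As a rough illustration of tracking metrics over versions and attributing a metric change to a data iteration, the sketch below diffs consecutive version records. It is not CHAMELEON's implementation: the `DataVersion` record, the `metric_deltas` helper, and every number are made up, and the attribution only holds if the model code is held fixed between versions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataVersion:
    """Hypothetical log entry for one data version."""
    name: str        # version label
    change: str      # the data iteration that produced this version
    n_train: int     # number of training instances
    accuracy: float  # test accuracy of the same model trained on this version


def metric_deltas(versions: List[DataVersion]) -> List[Tuple[str, str, int, float]]:
    """Pair each version with its predecessor and report what changed."""
    rows = []
    for prev, cur in zip(versions, versions[1:]):
        rows.append((
            cur.name,
            cur.change,
            cur.n_train - prev.n_train,
            round(cur.accuracy - prev.accuracy, 4),
        ))
    return rows


# Made-up history, used only to exercise the helper.
history = [
    DataVersion("1.1", "initial collection", 1000, 0.70),
    DataVersion("1.2", "+ add sampled instances", 1500, 0.76),
    DataVersion("1.3", "- remove instances", 1450, 0.75),
]
rows = metric_deltas(history)
best = max(rows, key=lambda r: r[3])  # iteration with the largest accuracy gain
```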
CHAMELEON: Understanding and Visualizing Data Iteration
“Overlaid diverging histogram” per feature
[Figure labels: Count, Binned feature, Correct, Incorrect]
Compare feature distributions by:
• Training and testing splits
• Performance (e.g., correct vs. incorrect predictions)
• Data versions
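The counts behind a diverging histogram for one feature can be sketched as follows. This assumes equal-width bins, with correct predictions counted upward and incorrect predictions counted downward from a shared baseline; that orientation is a guess from the figure labels, and `diverging_histogram` is a hypothetical helper rather than CHAMELEON's code.

```python
from typing import List, Sequence, Tuple


def diverging_histogram(values: Sequence[float], correct: Sequence[bool],
                        n_bins: int, lo: float, hi: float
                        ) -> Tuple[List[int], List[int]]:
    """Bin one feature; return (counts of correct, negated counts of incorrect)."""
    width = (hi - lo) / n_bins
    up = [0] * n_bins    # correct predictions, drawn above the baseline
    down = [0] * n_bins  # incorrect predictions, drawn below the baseline
    for v, ok in zip(values, correct):
        # Clamp so that v == hi (or anything out of range) lands in an edge bin.
        i = max(0, min(int((v - lo) / width), n_bins - 1))
        if ok:
            up[i] += 1
        else:
            down[i] -= 1
    return up, down


# Made-up feature values and prediction outcomes.
values = [0.1, 0.2, 0.6, 0.9, 1.0, 0.55]
correct = [True, True, False, True, False, False]
up, down = diverging_histogram(values, correct, n_bins=2, lo=0.0, hi=1.0)
```

Overlaying then amounts to computing these counts for each split or data version with the same `lo`, `hi`, and `n_bins`, so the bars line up bin for bin.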
CHAMELEON Visualizations
• Aggregated Embedding
• Prediction Change Matrix
• Sensitivity Histogram
[Summary panel: correct instances: 154, incorrect instances: 60, total instances: 214, accuracy: 0.720]
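The slide only names these views, so the sketch below is one plausible reading and not the tool's definition. It assumes a prediction change matrix counts how per-instance correctness moves between two data versions, and that sensitivity is the number of times an instance's correctness flips across versions. All function names and example data are hypothetical; only the summary-panel numbers come from the slide.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def accuracy(n_correct: int, n_total: int) -> float:
    """Accuracy rounded to three decimals, as in the summary panel."""
    return round(n_correct / n_total, 3)


def prediction_change_matrix(before: Sequence[bool], after: Sequence[bool]
                             ) -> Dict[Tuple[bool, bool], int]:
    """Count instances by (correct before, correct after) for two versions."""
    counts = Counter(zip(before, after))
    return {(b, a): counts[(b, a)]
            for b in (True, False) for a in (True, False)}


def flip_counts(history: Sequence[Sequence[bool]]) -> List[int]:
    """For each instance, count correctness flips across consecutive versions."""
    return [sum(x != y for x, y in zip(h, h[1:])) for h in history]


# The summary panel's numbers are consistent with correct / total.
panel_accuracy = accuracy(154, 154 + 60)

# Made-up correctness for five instances under two data versions.
before = [True, True, False, False, True]
after = [True, False, True, False, True]
matrix = prediction_change_matrix(before, after)

# Made-up correctness of three instances over three data versions.
flips = flip_counts([[True, True, True],
                     [True, False, True],
                     [False, False, True]])
sensitivity_histogram = Counter(flips)  # flip count -> number of instances
```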
Demo
Case Study I: Sensor Prediction
• 64,502 instances
• Collected over 2 months
• 20 features
• Visualization challenges prior data collection beliefs
• Finding failure cases
• Interface utility
[Figure: feature histograms for data versions 1.1, 1.5, 1.9, 1.13, 1.18. A feature’s long-tailed, multi-modal distribution shape solidifies over collection time: 1,442 → 64,205 instances]
Case Study II: Learning from Logs
• 48,000 instances
• Collected over 6 months
• 34 features
• Inspecting performance across features
• Capturing data processing changes
• Encouraging instance-level analysis
[Figure: Filtering across features quickly finds data subsets to compare against global distributions. Summary panel for a filtered subset: correct instances: 125, incorrect instances: 12, total instances: 137, accuracy: 0.912]
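Filtering across features and comparing the subset against the full dataset can be sketched as below. The rows, the feature names `latency` and `size`, and both helpers are made up for illustration; nothing here reflects the actual logs dataset or the tool's code.

```python
from typing import Dict, List, Tuple


def filter_rows(rows: List[dict],
                ranges: Dict[str, Tuple[float, float]]) -> List[dict]:
    """Keep rows whose value for every named feature lies in [lo, hi]."""
    return [r for r in rows
            if all(lo <= r[f] <= hi for f, (lo, hi) in ranges.items())]


def summarize(rows: List[dict]) -> dict:
    """Build the same kind of summary panel shown on the slides."""
    total = len(rows)
    n_correct = sum(r["correct"] for r in rows)
    return {
        "correct": n_correct,
        "incorrect": total - n_correct,
        "total": total,
        "accuracy": round(n_correct / total, 3) if total else None,
    }


# Made-up instances: two hypothetical features plus prediction correctness.
rows = [
    {"latency": 10, "size": 1, "correct": True},
    {"latency": 20, "size": 5, "correct": True},
    {"latency": 80, "size": 5, "correct": False},
    {"latency": 90, "size": 9, "correct": False},
    {"latency": 15, "size": 2, "correct": False},
]
overall = summarize(rows)
subset = summarize(filter_rows(rows, {"latency": (0, 50)}))
```

A large gap between `subset["accuracy"]` and `overall["accuracy"]` flags a feature range worth inspecting at the instance level.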
Opportunities for Future ML Iteration Tools
• Interfaces for both data and model iteration
• Data iteration tooling to help experimental handoff
• Data as a shared connection across user expertise
• Visualizing probabilistic labels from data programming
• Visualizations for other data types
Understanding and Visualizing Data Iteration in Machine Learning
fredhohman.com/papers/chameleon
Fred Hohman, Georgia Tech, @fredhohman
Kanit Wongsuphasawat, Apple, @kanitw
Mary Beth Kery, Carnegie Mellon University, @mbkery
Kayur Patel, Apple, @foil