Bootstrapping Labels for One-Hundred Million Images Jimmy Whitaker
We are drowning in data • Source: Data Never Sleeps 2.0, DOMO (2014) 4/5/16 GTC 2016 2
Ripe Opportunities • Many problems to solve • Limitless amounts of image data • Deep learning is pushing the state of the art everywhere • GPUs are making everything possible
The Problem • Deep learning is data-driven • ImageNet has 1.2 million training examples • Few large, labeled image datasets exist • It's expensive to label data • Our datasets contain 100M+ images • Few people are qualified to label them • The customer data is highly sensitive • Labeling requires subject-matter expertise
Ever labeled data? • Not as easy as it seems: • It's repetitive • Labelers become less accurate over time • One day computers will do it all for you? • But not yet • Can some of this effort be automated?
Many Approaches • Mechanical Turk: costly, time consuming • Clustering: expensive; how many clusters? what features to use? • Pre-trained classifiers: what if pre-trained classifiers don't work well on the data? • Active learning: iterative labeling; an open problem • Can we combine these into something useful?
The Goal • Inspired by our image-similarity experience and Jeremy Howard's TED talk • Use machines to filter the noise • Reduce repetitive tasks • Leverage the human labeler • Understand the data • Label iteratively • Allow exploration
Our Approach
Our Approach • Compare image hashes to filter duplicate images
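The slide does not say which hash is used. As a minimal sketch, assuming a simple average hash (one bit per pixel, set when the pixel is brighter than the image mean) and a Hamming-distance threshold, duplicate filtering could look like the following. The names `average_hash`, `hamming`, `filter_duplicates`, and `max_distance` are hypothetical, and the tiny 2x2 grids stand in for real images that would first be resized to a small fixed size.

```python
from typing import List, Sequence

Image = Sequence[Sequence[int]]  # grayscale pixel rows, values 0-255


def average_hash(pixels: Image) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def filter_duplicates(images: List[Image], max_distance: int = 1) -> List[int]:
    """Return indices of images kept after dropping near-duplicates."""
    kept: List[int] = []
    kept_hashes: List[int] = []
    for i, img in enumerate(images):
        h = average_hash(img)
        # Keep the image only if its hash is far from every hash kept so far.
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept.append(i)
            kept_hashes.append(h)
    return kept


# Hypothetical thumbnails (assumption: images are already tiny and grayscale).
a = [[10, 200], [220, 30]]
b = [[12, 198], [221, 29]]   # slightly re-encoded copy of a
c = [[200, 10], [30, 220]]   # a different image
kept = filter_duplicates([a, b, c])
```

This pairwise comparison is quadratic in the number of kept images, so at the scale of 100M+ images the hashes would presumably need to be bucketed or indexed rather than compared one by one.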
Our Approach • Prevents over-focusing on one portion of feature space
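The slide does not describe how the spread across feature space is achieved. One possible way, shown here purely as an assumption, is greedy farthest-point sampling: each new image shown to the labeler is the one farthest from everything already selected. The function name `farthest_point_sample` and the 2-D feature vectors are hypothetical; real CNN features would have many more dimensions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def farthest_point_sample(features: List[Vector], k: int, start: int = 0) -> List[int]:
    """Greedily pick up to k indices, each as far as possible from those already picked."""
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    nearest = [math.dist(f, features[start]) for f in features]
    while len(chosen) < min(k, len(features)):
        nxt = max(range(len(features)), key=lambda i: nearest[i])
        if nearest[nxt] == 0.0:
            break  # every remaining point coincides with a chosen one
        chosen.append(nxt)
        nearest = [min(d, math.dist(f, features[nxt])) for d, f in zip(nearest, features)]
    return chosen


# Hypothetical 2-D features: a dense cluster near the origin plus two outliers.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (-4.0, 3.0)]
picked = farthest_point_sample(features, k=3)
```

With these toy features the sample takes one point from the dense cluster and then both outliers, rather than three points from the cluster.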
Our Approach • Label images on the boundary of the class
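Labeling on the class boundary resembles uncertainty sampling from active learning. The slide gives no selection rule, so the sketch below assumes a binary classifier that outputs a probability per image and picks the images whose probability is closest to 0.5. The name `boundary_candidates` and the scores are hypothetical.

```python
from typing import Dict, List


def boundary_candidates(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k image ids whose class probability is closest to 0.5."""
    ranked = sorted(scores, key=lambda image_id: abs(scores[image_id] - 0.5))
    return ranked[:k]


# Hypothetical classifier outputs: P(image belongs to the class).
scores = {"img_a": 0.97, "img_b": 0.52, "img_c": 0.08, "img_d": 0.41, "img_e": 0.66}
to_label = boundary_candidates(scores, k=2)
```

Confident predictions such as `img_a` and `img_c` are skipped, so the labeler's time goes to the examples the current model is least sure about.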
Our Approach • Improve CNN features for labeled classes
GUI
Hardware • Cirrascale GB5670 • 56 CPU cores • 8x NVIDIA Tesla K80 • 512 GB DDR4 • 1 TB SSD
Benefits • Create large, labeled datasets • High quality • Allows data exploration • Dramatic time reduction • ~3-5x faster initially • Multiplicative efficiency gains • Flexible framework • Perform data science with images