
Bootstrapping Labels for One-Hundred Million Images, Jimmy Whitaker



  1. Bootstrapping Labels for One-Hundred Million Images Jimmy Whitaker

  2. We are drowning in data
     Data Never Sleeps 2.0 - DOMO (2014)
     4/5/16 GTC 2016

  3. Ripe Opportunities
     • Many problems to solve
     • Limitless amounts of image data
     • Deep learning pushing the state of the art everywhere
     • GPUs making everything possible

  4. The Problem
     • Deep learning is data-driven
     • ImageNet has 1.2 million training examples
     • Few large, labeled image datasets exist
     • It’s expensive to label data
     • Our datasets have 100M+ images
     • Few people are qualified to label
     • Highly sensitive customer data
     • Necessary subject matter expertise

  5. Ever labeled data?
     • Not as easy as it seems:
       • It’s repetitive
       • Less accurate over time
     • One day computers will do it all for you?
       • But not yet
     • Can some of this effort be automated?

  6. Many Approaches
     • Mechanical Turk
       • Costly
       • Time consuming
     • Clustering
       • Expensive
       • How many clusters?
       • What features to use?
     • Pre-trained classifiers
       • What if pre-trained classifiers don’t work well on the data?
     • Active learning
       • Iterative labeling
       • Open problem
     Can we combine these into something useful?

  7. The Goal
     • Inspired by our image-similarity experience and Jeremy Howard’s TED talk
     • Use machines to filter the noise
     • Reduce repetitive tasks
     • Leverage the human labeler
     • Understand the data
     • Label iteratively
     • Allow exploration

  8. Our Approach

  9. Our Approach
     Compare image hashes to filter duplicate images
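The slide does not say which hash is used, so the following is only a minimal sketch of hash-based duplicate filtering under stated assumptions: an average hash (one bit per pixel, set when the pixel is brighter than the mean) compared by Hamming distance. The names `average_hash`, `hamming`, `filter_duplicates`, and the `max_distance` threshold are hypothetical, and images are assumed to arrive as small grayscale thumbnails already downsampled by some image library.

```python
from typing import Dict, List, Sequence

# Assumption: each image is a tiny grayscale thumbnail (rows of 0-255 values).
Gray = Sequence[Sequence[int]]


def average_hash(thumb: Gray) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the thumbnail mean."""
    pixels = [p for row in thumb for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def filter_duplicates(thumbs: Dict[str, Gray], max_distance: int = 2) -> List[str]:
    """Keep the first image of each near-duplicate group.

    This is a quadratic scan for illustration only; at one hundred million
    images an index over the hashes would be needed instead.
    """
    kept: List[str] = []
    kept_hashes: List[int] = []
    for name, thumb in thumbs.items():
        h = average_hash(thumb)
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept.append(name)
            kept_hashes.append(h)
    return kept


# Toy data: a 4x4 gradient, a slightly brighter copy, and its inverse.
base = [[(r * 4 + c) * 16 for c in range(4)] for r in range(4)]
brighter = [[min(255, p + 5) for p in row] for row in base]
inverted = [[255 - p for p in row] for row in base]
print(filter_duplicates({"a.png": base, "a_copy.png": brighter, "b.png": inverted}))
# prints ['a.png', 'b.png']
```

The brighter copy hashes to the same bits as the original because an average hash only records each pixel's relation to the mean, which is why a uniform brightness shift is treated as a duplicate here.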

  10. Our Approach

  11. Our Approach

  12. Our Approach

  13. Our Approach
      Prevents over-focusing on one portion of feature space
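The transcript does not record how the pipeline spreads its attention across feature space. As one possible illustration only, and not the method from the talk, greedy farthest-point sampling picks each new image to be as far as possible from those already chosen, so a dense region cannot absorb the whole labeling budget. The function name and the 2-D toy features are hypothetical; real features would be CNN embeddings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def farthest_point_sample(features: List[Vector], k: int, start: int = 0) -> List[int]:
    """Greedily pick up to k indices, each as far as possible from those already picked."""
    if not features or k <= 0:
        return []
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    nearest = [math.dist(f, features[start]) for f in features]
    while len(chosen) < min(k, len(features)):
        idx = max(range(len(features)), key=nearest.__getitem__)
        if nearest[idx] == 0.0:
            break  # every remaining point coincides with a chosen one
        chosen.append(idx)
        nearest = [min(d, math.dist(f, features[idx])) for d, f in zip(nearest, features)]
    return chosen


# Toy features: a tight cluster near the origin, a pair near (5, 5), one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(farthest_point_sample(points, k=3))
# prints [0, 5, 3]
```

Only one image from the cluster near the origin is selected, even though that cluster holds half of the points.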

  14. Our Approach

  15. Our Approach
      Label images on the boundary of the class
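Labeling on the class boundary resembles uncertainty sampling from active learning, which slide 6 lists among the candidate approaches. The sketch below assumes a binary classifier that emits a score in [0, 1] per image and sends the labeler the images whose scores sit closest to the decision threshold. The function name, image ids, and scores are made up for illustration.

```python
from typing import Dict, List


def boundary_candidates(scores: Dict[str, float], budget: int, threshold: float = 0.5) -> List[str]:
    """Return up to `budget` image ids whose score is closest to the decision threshold."""
    ranked = sorted(scores, key=lambda name: abs(scores[name] - threshold))
    return ranked[:budget]


# Hypothetical classifier scores for five unlabeled images.
scores = {
    "img_001": 0.98,
    "img_002": 0.52,
    "img_003": 0.07,
    "img_004": 0.41,
    "img_005": 0.66,
}
print(boundary_candidates(scores, budget=2))
# prints ['img_002', 'img_004']
```

Confident predictions such as `img_001` and `img_003` are skipped, so the human labeler spends time only where the classifier is least certain.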

  16. Our Approach
      Improve CNN features for labeled classes

  17. GUI

  18. Hardware
      • Cirrascale GB5670
      • 56 CPU cores
      • 8x NVIDIA K80
      • 512 GB DDR4
      • 1 TB SSD

  19. Benefits
      • Create large, labeled datasets
      • High quality
      • Allows data exploration
      • Dramatic time reduction
      • ~3-5x faster initially
      • Multiplicative efficiency gains
      • Flexible framework
      • Perform data science with images

  20. CONFIDENTIAL
