

  1. Making deep neural networks robust to label noise: a loss correction approach
  Giorgio Patrini, 23 July 2017, CVPR, Honolulu
  Joint work with Alessandro Rozza, Aditya Krishna Menon, Richard Nock and Lizhen Qu
  ANU, Data61, Waynaut, University of Sydney
  Code: github.com/giorgiop/loss-correction

  2. Label noise: motivations
  “Data science becomes the art of extracting labels out of thin air” [Malach & Shalev-Shwartz 17]

  3. Label noise: motivations
  “Data science becomes the art of extracting labels out of thin air” [Malach & Shalev-Shwartz 17]
  • Labels from Web queries
  • Crowdsourcing (e.g.: is this animal a jaguar, a leopard or a cheetah?)

  4. Previous work (sample)
  • Noise-aware deep nets (CV)
    – Good performance on specific domains, scalable
    – Heuristics
    – In many cases, need some clean labels
    [Sukhbaatar et al. ICLR15, Krause et al. ECCV16, Xiao et al. CVPR15]
  • Theoretically robust loss functions (ML)
    – Theoretically sound
    – Unrealistic assumptions: knowing the noise distribution!
    [Natarajan et al. NIPS13, Patrini et al. ICML16]
  • Estimating the noise from noisy data [Menon et al. ICML15]

  5. Contributions
  • Two procedures for loss correction, agnostic to loss, architecture and dataset.
  • Theoretical guarantee: the same model is learned as without noise (in expectation).
  • Noise estimation, using the same deep net.
  • Tests on MNIST, CIFAR-10/100 and IMDB with multiple networks (CNN, ResNets, LSTM, …). SOTA on the data of [Xiao et al. 15].

  6. Supervised learning
  • Sample from p(x, y)
  • c-class classification: y ∈ {e_j : j = 1, …, c}
  • Learn a neural network p(y | x)

  7. Supervised learning
  • Sample from p(x, y)
  • c-class classification: y ∈ {e_j : j = 1, …, c}
  • Learn a neural network p(y | x)
  • Minimize the empirical risk associated with a loss ℓ(y, p(y | x)):
      argmin_{p(y | x)} E_S ℓ(y, p(y | x))
  • Let ℓ(p(y | x)) = ( ℓ(e_1, p(y | x)), …, ℓ(e_c, p(y | x)) )^⊤

  8. Asymmetric label noise
  • Sample from p(x, ỹ)
  • Corruption by asymmetric noise, defined by a transition matrix T ∈ [0, 1]^{c×c} with
      T_ij = p(ỹ = e_j | y = e_i)
  • Feature-independent noise: ỹ depends on x only through y (x → y → ỹ)

  9. Asymmetric label noise
  • Sample from p(x, ỹ)
  • Corruption by asymmetric noise, defined by a transition matrix T ∈ [0, 1]^{c×c} with
      T_ij = p(ỹ = e_j | y = e_i)
  • Feature-independent noise: ỹ depends on x only through y (x → y → ỹ)
  • How can we be robust to such noise?
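To make the noise model concrete, here is a minimal NumPy sketch that samples noisy labels from a row-stochastic transition matrix; the helper name `corrupt_labels` and the 3-class T below are illustrative, not taken from the paper or its code.

```python
import numpy as np

def corrupt_labels(y, T, rng=None):
    """Sample noisy labels from clean ones: T[i, j] = p(y_tilde = e_j | y = e_i)."""
    rng = np.random.default_rng() if rng is None else rng
    c = T.shape[0]
    return np.array([rng.choice(c, p=T[i]) for i in y])

# Toy 3-class example: class 0 flips to class 1 with probability 0.3
T = np.array([[0.7, 0.3, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
y_clean = np.array([0, 0, 1, 2, 0])
y_noisy = corrupt_labels(y_clean, T)
```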

  10. Backward loss correction
  • c-class version of [Natarajan et al. 13]:
      ℓ←(p(y | x)) = T⁻¹ ℓ(p(y | x))
  • Rationale: a linear combination of the per-class losses, weighted by the inverse of the noise probabilities
  • “One step back” in the Markov chain T
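A minimal NumPy sketch of the backward correction for cross-entropy, assuming softmax outputs and a known (or estimated) non-singular T; the function name is mine, not the repository's API.

```python
import numpy as np

def backward_corrected_ce(probs, y_noisy, T):
    """Backward correction: the loss for noisy label i is entry i of T^{-1} l(p(y|x)).

    probs:   (n, c) predicted p(y | x) (softmax outputs)
    y_noisy: (n,)   observed noisy labels
    T:       (c, c) noise transition matrix, assumed non-singular
    """
    T_inv = np.linalg.inv(T)
    per_class_loss = -np.log(np.clip(probs, 1e-12, 1.0))   # l(e_j, p(y|x)) for every class j
    # pick the row of T^{-1} indexed by the noisy label and combine the per-class losses with it
    return (T_inv[y_noisy] * per_class_loss).sum(axis=1).mean()
```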

  11. Backward loss correction: theory
  • Theorem: if T is non-singular, ℓ← is unbiased. It follows that the models learned with and without noise are the same in expectation:
      argmin_{p(y | x)} E_{x, ỹ} ℓ←(ỹ, p(y | x)) = argmin_{p(y | x)} E_{x, y} ℓ(y, p(y | x))
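The unbiasedness can be checked numerically: for a fixed prediction, averaging the backward-corrected loss over the noise distribution of ỹ recovers the clean loss. A small NumPy check, with an arbitrary 3-class T of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
T = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])          # any non-singular row-stochastic matrix works
T_inv = np.linalg.inv(T)

probs = rng.dirichlet(np.ones(3))         # a fixed prediction p(y | x)
loss_vec = -np.log(probs)                 # l(e_j, p(y | x)) for each class j
y = 1                                     # the clean label

# E_{y_tilde | y}[ l_backward(y_tilde, p) ] = sum_i T[y, i] * (T^{-1} l)_i = l(y, p)
expected_corrected = T[y] @ (T_inv @ loss_vec)
print(np.isclose(expected_corrected, loss_vec[y]))   # True
```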

  12. Forward loss correction
  • Inspired by [Sukhbaatar et al. 15]: “absorb” the noise in a top linear layer that emulates T:
      ℓ→(p(y | x)) = ℓ(T^⊤ p(y | x))
  • Rationale: compare the noisy labels with “noisified” predictions
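The forward counterpart, again as a NumPy sketch with illustrative naming: the predictions are “noisified” by T^⊤ before the usual cross-entropy is applied.

```python
import numpy as np

def forward_corrected_ce(probs, y_noisy, T):
    """Forward correction: cross-entropy between the noisy labels and T^T p(y | x).

    probs:   (n, c) predicted p(y | x) (softmax outputs)
    y_noisy: (n,)   observed noisy labels
    T:       (c, c) noise transition matrix
    """
    noisy_probs = probs @ T                     # entry (i, j): (T^T p(y|x_i))_j = p(y_tilde = e_j | x_i)
    picked = noisy_probs[np.arange(len(y_noisy)), y_noisy]
    return -np.log(np.clip(picked, 1e-12, 1.0)).mean()
```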

  13. Forward loss correction: theory
  • Theorem: if T is non-singular, ℓ→ is such that the models learned with and without noise are the same in expectation*:
      argmin_{p(y | x)} E_{x, ỹ} ℓ→(ỹ, p(y | x)) = argmin_{p(y | x)} E_{x, y} ℓ(y, p(y | x))
  * Technically, the loss needs to be proper composite here; cross-entropy and square loss are OK.

  14. Noise estimation
  • c-class extension of [Menon et al. 15]

  15. Noise estimation
  • c-class extension of [Menon et al. 15]
  • Hypothesis: there exist some “perfect examples”, and the net can model p(ỹ | x) very well

  16. Noise estimation
  • c-class extension of [Menon et al. 15]
  • Hypothesis: there exist some “perfect examples”, and the net can model p(ỹ | x) very well
  • First, train and obtain p(ỹ | x)
  • Then estimate T̂ by
      x̄^i = argmax_x p(ỹ = e_i | x),   T̂_ij = p(ỹ = e_j | x̄^i)   ∀ i, j

  17. Noise estimation
  • c-class extension of [Menon et al. 15]
  • Hypothesis: there exist some “perfect examples”, and the net can model p(ỹ | x) very well
  • First, train and obtain p(ỹ | x)
  • Then estimate T̂ by
      x̄^i = argmax_x p(ỹ = e_i | x),   T̂_ij = p(ỹ = e_j | x̄^i)   ∀ i, j
  • Rationale: mistakes on “perfect examples” must be due to the noise
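A NumPy sketch of this estimator, assuming we already have the softmax outputs of a net trained on noisy labels; the function and argument names are mine.

```python
import numpy as np

def estimate_T(noisy_probs):
    """Estimate T from a net trained on noisy labels.

    noisy_probs: (n, c) softmax outputs p(y_tilde | x) over a sample of the data.
    Row i of the estimate is the full predictive distribution at the example the
    net is most confident about for class i (its "perfect example").
    """
    c = noisy_probs.shape[1]
    T_hat = np.empty((c, c))
    for i in range(c):
        x_bar = np.argmax(noisy_probs[:, i])   # index of the most confident example for class i
        T_hat[i] = noisy_probs[x_bar]          # T_hat[i, j] = p(y_tilde = e_j | x_bar^i)
    return T_hat
```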

  18. Recap: the algorithm
  (1) Train the network on the noisy data to obtain T̂:
      argmin_{p(y | x)} E_{x, ỹ} ℓ(ỹ, p(y | x)) → p(ỹ | x) → T̂
  (2) Re-train the network with the backward/forward corrected loss, e.g.
      argmin_{p(y | x)} E_{x, ỹ} ℓ←(ỹ, p(y | x))
  No change in back-propagation; only the loss differs.
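Putting the two stages together, here is a self-contained toy sketch in PyTorch (chosen only for brevity; this is not the authors' released implementation): a linear softmax model on random data, trained first with plain cross-entropy to estimate T̂, then re-trained with the forward correction. The architecture, optimizer and data are placeholders.

```python
import torch
import torch.nn.functional as F

def train(x, y, T=None, epochs=200):
    """Fit a linear softmax classifier; if T is given, apply the forward correction."""
    c = int(y.max()) + 1
    model = torch.nn.Linear(x.shape[1], c)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        probs = F.softmax(model(x), dim=1)
        if T is not None:                          # forward correction: compare y_tilde with T^T p(y|x)
            probs = probs @ T
        loss = F.nll_loss(torch.log(probs.clamp_min(1e-12)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

def estimate_T(model, x):
    """T_hat[i] = p(y_tilde | x_bar^i), with x_bar^i the most confident example for class i."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
    return probs[probs.argmax(dim=0)]

x = torch.randn(512, 20)                           # toy features
y_noisy = torch.randint(0, 5, (512,))              # toy noisy labels

model_0 = train(x, y_noisy)                        # (1) fit p(y_tilde | x) with plain cross-entropy
T_hat = estimate_T(model_0, x)                     #     and read off T_hat
model_1 = train(x, y_noisy, T=T_hat)               # (2) re-train with the corrected loss;
                                                   #     back-propagation itself is unchanged
```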

  19. Empirics: models and datasets
  • Goal: show robustness independently of architecture and dataset
  • Simulated noise:
    – MNIST: 2x fully connected layers, dropout
    – IMDB: word embedding + LSTM
    – CIFAR-10/100: various ResNets
  • Real noise:
    – Clothing1M [Xiao et al. 15], 50-ResNet

  20. Inject sparse, asymmetric T
  • True T (10 classes): the identity, except that class 2 flips to 7, 3 flips to 8, 5 and 6 flip to each other, and 7 flips to 1, each with probability .7 (each corrupted class keeps its own label with probability .3).
  • Estimated T̂: recovers the same flip pattern, with estimated flip probabilities .67, .65, .71, .73 and .75 on the five corrupted rows; the uncorrupted classes are estimated as clean, and every off-pattern entry is ε < 10⁻⁶.
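For reference, the flip pattern above is easy to build programmatically; a small NumPy helper (the function name is mine):

```python
import numpy as np

def sparse_asymmetric_T(c, flips, p):
    """Row-stochastic T: each source class in `flips` keeps its label with
    probability 1 - p and flips to its target class with probability p;
    all other classes stay clean."""
    T = np.eye(c)
    for src, dst in flips.items():
        T[src, src] = 1.0 - p
        T[src, dst] = p
    return T

# The pattern on this slide: 2 -> 7, 3 -> 8, 5 <-> 6, 7 -> 1, each with probability .7
T = sparse_asymmetric_T(10, {2: 7, 3: 8, 5: 6, 6: 5, 7: 1}, 0.7)
```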

  21. Experiments with real noise
  Clothing1M [Xiao et al. CVPR15]
  • Train set: 1M noisily labeled images + 50k cleanly labeled images
  • Test set: 10k cleanly labeled images

  22. Experiments with real noise: Clothing1M
  #   model       loss            init      training   accuracy (%)
  1   AlexNet     cross-entropy   ImageNet  50k        72.63
  2   AlexNet     cross-entropy   #1        1M, 50k    76.22
  3   2x AlexNet  cross-entropy   #1        1M, 50k    78.24
  4   50-ResNet   cross-entropy   ImageNet  1M         68.94
  5   50-ResNet   backward        ImageNet  1M         69.13
  6   50-ResNet   forward         ImageNet  1M         69.84
  7   50-ResNet   cross-entropy   ImageNet  50k        75.19
  8   50-ResNet   cross-entropy   #6        50k        80.38
  Recipe for SOTA (our method):
  • Pre-train with the forward-corrected loss on the 1M noisy labels
  • Fine-tune with cross-entropy on the 50k clean labels
