
Loss factorization, weakly supervised learning and label noise robustness



  1. Loss factorization, weakly supervised learning and label noise robustness
  Giorgio Patrini, Frank Nielsen, Richard Nock, Marcello Carioni
  Australian National University, Data61 (ex NICTA), Ecole Polytechnique, Sony CS Labs, Max Planck Institute of Mathematics in the Sciences

  2-4. In 1 slide
  [Figure: the logistic loss log(1 + e^{-x}) with its even part ℓ_e(x) and odd part ℓ_o(x).]
  • Loss functions factor, ℓ(x) = ℓ_e(x) + ℓ_o(x), and so do their risks, isolating a sufficient statistic µ for the labels.
  • Weakly supervised learning: (1) estimate µ and (2) plug it into ℓ(x) and call standard algorithms. E.g., SGD:
      θ_{t+1} ← θ_t − η ∇ℓ(±⟨θ_t, x_i⟩) − (1/2) η a µ
  • For asymmetric label noise with rates p_+, p_-, an unbiased estimator, computed on the noisy sample, is
      µ̂ = E_{(x, ỹ)} [ x · (ỹ − (p_- − p_+)) / (1 − p_- − p_+) ]
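
A worked instance of the decomposition sketched above (not spelled out on the slide): the logistic loss in the figure splits into an even part and a linear odd part,

    \log(1 + e^{-x}) = \log\big( e^{-x/2} (e^{x/2} + e^{-x/2}) \big)
                     = \log\big( e^{x/2} + e^{-x/2} \big) - \tfrac{x}{2}
                     = \ell_e(x) + \ell_o(x),
    \quad \text{with } \ell_e(x) = \log\big(e^{x/2} + e^{-x/2}\big) \text{ even and } \ell_o(x) = -\tfrac{x}{2},

so the logistic loss is an a-lol with a = −1/2.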

  5. Preliminary
  • Binary classification: S = {(x_i, y_i), i ∈ [m]} sampled from D over R^d × {−1, 1}, where [m] = {1, ..., m}
  • Learn a linear (or kernel) model h ∈ H
  • Minimize the empirical risk associated with a surrogate loss ℓ(x):
      argmin_{h ∈ H} E_S[ℓ(y h(x))] = argmin_{h ∈ H} R_{S,ℓ}(h)

  6-7. Mean operator & linear-odd losses
  • Mean operator:
      µ_S ≐ E_S[y x] = (1/m) Σ_{i=1}^m y_i x_i
  • a-linear-odd loss (a-lol), for a generic argument x:
      (1/2) (ℓ(x) − ℓ(−x)) = ℓ_o(x) = a x , for some a ∈ R
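
A minimal numpy sketch of the two definitions above, on illustrative data (none of the names below come from the paper): it computes the mean operator of a toy sample and checks numerically that the logistic loss is an a-lol with a = −1/2.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy labelled sample: m points in R^d with arbitrary labels.
    m, d = 200, 5
    X = rng.normal(size=(m, d))
    y = rng.choice([-1.0, 1.0], size=m)

    # Mean operator: mu_S = (1/m) * sum_i y_i * x_i.
    mu_S = (y[:, None] * X).mean(axis=0)
    print(mu_S.shape)  # (5,)

    # a-lol check for the logistic loss: (1/2)*(l(x) - l(-x)) equals a*x with a = -1/2.
    logistic = lambda x: np.log1p(np.exp(-x))
    xs = np.linspace(-3.0, 3.0, 7)
    print(np.allclose(0.5 * (logistic(xs) - logistic(-xs)), -0.5 * xs))  # True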

  8-9. Loss factorization
  • Linear model h (neither smoothness nor convexity of ℓ is required)
  • Linear-odd loss: (1/2) (ℓ(x) − ℓ(−x)) = ℓ_o(x) = a x
  • Define the "double sample" S_{2x} ≐ {(x_i, σ), i ∈ [m], σ ∈ {±1}}
  Then:
      R_{S,ℓ}(h) = (1/2) R_{S_{2x},ℓ}(h) + a · h(µ_S)

  10-13. Loss factorization: proof
      R_{S,ℓ}(h) = E_S[ ℓ(y h(x)) ]
                 = (1/2) E_S[ ℓ(y h(x)) + ℓ(−y h(x)) + ℓ(y h(x)) − ℓ(−y h(x)) ]          (odd + even)
                 = (1/2) E_{S_{2x}}[ ℓ(σ h(x)) ] + E_S[ ℓ_o(h(y x)) ]
                 = (1/2) R_{S_{2x},ℓ}(h) + a · h(µ_S)                                    (ℓ_o and h linear; µ_S sufficient for the labels y)
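
A numerical sanity check of the factorization just proved, for the logistic loss and a random linear model; the double-sample term is written out as an explicit average over the 2m sign-augmented points, and the data and names are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    m, d = 300, 4
    X = rng.normal(size=(m, d))
    y = rng.choice([-1.0, 1.0], size=m)
    theta = rng.normal(size=d)                # a linear model h(x) = <theta, x>

    logistic = lambda x: np.log1p(np.exp(-x))
    a = -0.5                                  # the logistic loss is a (-1/2)-lol

    margins = X @ theta                       # h(x_i)
    lhs = np.mean(logistic(y * margins))      # R_{S,l}(h)

    # Label-free part: average of l(sigma*h(x_i)) over the 2m points of S_2x,
    # matching the (1/2)*R_{S_2x,l}(h) term on the slide ...
    even_term = 0.5 * np.mean(logistic(margins) + logistic(-margins))
    # ... plus the label-dependent part, carried entirely by the mean operator.
    mu_S = (y[:, None] * X).mean(axis=0)
    rhs = even_term + a * (theta @ mu_S)

    print(np.allclose(lhs, rhs))  # True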

  14-15. Linear-odd losses: examples
  • Logistic loss & exponential family:
      Σ_{i=1}^m log(1 + e^{−2 y_i ⟨θ, x_i⟩}) = Σ_{i=1}^m log Σ_{y ∈ Y} e^{y ⟨θ, x_i⟩} − m ⟨θ, µ_S⟩
  • Examples of a-lols (odd term ℓ_o(x) = a x):

      loss            ℓ(x)                            ℓ_o(x)
      ρ-loss (ρ ≥ 0)  ρ|x| − ρx + 1                   −ρx
      unhinged        1 − x                           −x
      perceptron      max(0, −x)                      −x/2
      double-hinge    max(−x, (1/2) max(0, 1 − x))    −x/2
      spl             a_ℓ + ℓ*(−x)/b_ℓ                −x/(2 b_ℓ)
      logistic        log(1 + e^{−x})                 −x/2
      square          (1 − x)^2                       −2x
      Matsushita      √(1 + x^2) − x                  −x
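
One way to read the table: for every listed loss, half of ℓ(x) − ℓ(−x) is exactly linear in x. The sketch below transcribes a few rows (the loss definitions and the slopes a are my reading of the table, not taken verbatim from the paper) and verifies the linearity numerically.

    import numpy as np

    xs = np.linspace(0.1, 4.0, 40)   # positive grid; the odd part is determined by x > 0

    losses = {
        # name: (loss l(x), expected slope a of the odd part l_o(x) = a*x)
        "unhinged":     (lambda x: 1.0 - x,                                        -1.0),
        "perceptron":   (lambda x: np.maximum(0.0, -x),                            -0.5),
        "double-hinge": (lambda x: np.maximum(-x, 0.5 * np.maximum(0.0, 1.0 - x)), -0.5),
        "logistic":     (lambda x: np.log1p(np.exp(-x)),                           -0.5),
        "square":       (lambda x: (1.0 - x) ** 2,                                 -2.0),
        "matsushita":   (lambda x: np.sqrt(1.0 + x ** 2) - x,                      -1.0),
    }

    for name, (loss, a) in losses.items():
        odd = 0.5 * (loss(xs) - loss(-xs))       # l_o(x)
        print(name, np.allclose(odd / xs, a))    # slope is constant and equals a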

  16-18. Generalization bound
  • Loss ℓ is a-lol and L-Lipschitz
  • Bounds: R^d ⊇ X = {x : ‖x‖_2 ≤ X < ∞} and H = {θ : ‖θ‖_2 ≤ B < ∞}
  • Bounded loss: c(X, B) ≐ max_{y ∈ {±1}} ℓ(y X B)
  • Let θ̂ = argmin_{θ ∈ H} R_{S,ℓ}(θ). Then for any δ > 0, with probability at least 1 − δ:
      R_{D,ℓ}(θ̂) − inf_{θ ∈ H} R_{D,ℓ}(θ) ≤ 4 (√2 + 1) XBL / √m + c(X, B) L √( log(1/δ) / (2m) ) + 2 |a| B ‖µ_D − µ_S‖_2
  • The first two terms measure the complexity of the space H and the non-linearity of the loss; the last term is the only one where the labels enter, and concentration of the mean operator bounds it by 2 |a| XB √( (d/m) log(2d/δ) ).
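
For a feel of how the bound behaves, a small sketch that evaluates its three terms for illustrative constants (X = B = L = 1, logistic loss, and the concentration bound above in place of ‖µ_D − µ_S‖_2); the function and the numbers are mine, not the paper's.

    import numpy as np

    def excess_risk_bound(m, d, delta, X=1.0, B=1.0, L=1.0, a=-0.5):
        """Evaluate the three terms of the bound above for the logistic loss (illustrative)."""
        c_XB = np.log1p(np.exp(X * B))   # c(X, B) = max_y l(y*X*B) for the logistic loss
        complexity = 4 * (np.sqrt(2) + 1) * X * B * L / np.sqrt(m)
        deviation = c_XB * L * np.sqrt(np.log(1 / delta) / (2 * m))
        mean_op = 2 * abs(a) * X * B * np.sqrt((d / m) * np.log(2 * d / delta))
        return complexity, deviation, mean_op

    # every term decays like 1/sqrt(m)
    for m in (100, 10_000, 1_000_000):
        print(m, [round(float(t), 4) for t in excess_risk_bound(m, d=20, delta=0.05)])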

  19-20. Weakly supervised learning
  • D → (corrupt) → D̃ → (sample) → S̃
  • Weak labels may be wrong (noisy), missing, multi-instance, etc.
  • 2-step approach:
      (1) Estimate µ from the weakly labelled sample S̃
      (2) Plug it into ℓ and call any algorithm for risk minimization on S_{2x}

  21-22. Example: SGD (step 2)
  Algorithm 1: µ-SGD
      Input: S_{2x}, µ, ℓ an a-lol; θ_0 ← 0
      For t = 1, 2, ... until convergence:
          Pick i ∈ {1, ..., |S_{2x}|} at random
          η ← 1/t
          Pick any v ∈ ∂ℓ(y_i ⟨θ_t, x_i⟩)   (subgradient w.r.t. θ)
          θ_{t+1} ← θ_t − η (v + a µ / 2)   ← the only change w.r.t. standard SGD
      Output: θ_{t+1}
  • In the paper: proximal algorithms
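
A minimal numpy sketch of µ-SGD as written above, specialised to the logistic loss (a = −1/2). The function name, the fixed number of passes used as a stopping rule, and the data layout are my own choices, not the paper's.

    import numpy as np

    def mu_sgd(X, mu, a=-0.5, epochs=50, rng=None):
        """mu-SGD on the label-free double sample S_2x = {(x_i, sigma): sigma in {-1, +1}}.

        X  : (m, d) array of unlabeled inputs
        mu : estimated mean operator, shape (d,)
        a  : the loss's linear-odd coefficient (logistic loss: a = -1/2)
        """
        if rng is None:
            rng = np.random.default_rng(0)
        m, d = X.shape
        # Materialise S_2x: every point appears once with sigma = +1 and once with sigma = -1.
        X2 = np.vstack([X, X])
        s2 = np.concatenate([np.ones(m), -np.ones(m)])
        theta = np.zeros(d)
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(2 * m):
                t += 1
                eta = 1.0 / t
                margin = s2[i] * (theta @ X2[i])
                # (sub)gradient of the logistic loss log(1 + exp(-margin)) w.r.t. theta
                v = -s2[i] * X2[i] / (1.0 + np.exp(margin))
                theta = theta - eta * (v + a * mu / 2.0)
        return theta

    # usage sketch: estimate mu from whatever weak supervision is available, then
    # theta_hat = mu_sgd(X, mu_hat)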

  23. A unifying approach
  Learning from label proportions with
  • logistic loss [N. Quadrianto et al. '09]
  • symmetric proper loss [G. Patrini et al. '14]
  Learning with noisy labels with
  • logistic loss [Gao et al. '16]

  24-25. Asymmetric label noise
  • Sample S̃ = {(x_i, ỹ_i)}_{i=1}^m corrupted by asymmetric noise rates p_+, p_-.
  • By the method of [Natarajan et al. '13], an unbiased estimator of µ_S is
      µ̂ = E_{S̃} [ x · (ỹ − (p_- − p_+)) / (1 − p_- − p_+) ]
  • This is step (1); then run µ-SGD for step (2).
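
A simulation sketch of step (1): corrupt clean labels with asymmetric rates p_+, p_-, apply the corrected estimator above, and compare with the clean mean operator (all data and names are illustrative).

    import numpy as np

    rng = np.random.default_rng(2)

    m, d = 200_000, 3
    X = rng.normal(size=(m, d))
    y = np.sign(X[:, 0] + 0.3 * rng.normal(size=m))   # clean labels

    # Corrupt with asymmetric noise: flip positives w.p. p_plus, negatives w.p. p_minus.
    p_plus, p_minus = 0.3, 0.1
    flip = np.where(y > 0, rng.random(m) < p_plus, rng.random(m) < p_minus)
    y_tilde = np.where(flip, -y, y)

    mu_clean = (y[:, None] * X).mean(axis=0)

    # Noise-corrected, unbiased estimator of the mean operator [Natarajan et al. '13].
    w = (y_tilde - (p_minus - p_plus)) / (1.0 - p_minus - p_plus)
    mu_hat = (w[:, None] * X).mean(axis=0)

    print(np.round(mu_clean, 3), np.round(mu_hat, 3))  # close for large m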

  26-27. Generalization bound under noise
  • Same as before, except that now θ̂ = argmin_{θ ∈ H} R̂_{S̃,ℓ}(θ)
  • Then for any δ > 0, with probability at least 1 − δ:
      R_{D,ℓ}(θ̂) − inf_{θ ∈ H} R_{D,ℓ}(θ) ≤ 4 (√2 + 1) XBL / √m + c(X, B) L √( log(2/δ) / (2m) ) + ( 2 |a| XB / (1 − p_- − p_+) ) √( (d/m) log(2d/δ) )
  • The complexity terms are untouched; noise affects the linear term only.
