Loss factorization, weakly supervised learning and label noise robustness

Giorgio Patrini, Frank Nielsen, Richard Nock, Marcello Carioni

Australian National University, Data61 (ex NICTA), École Polytechnique, Sony CS Labs, Max Planck Institute for Mathematics in the Sciences
In 1 slide

[Figure: the logistic loss log(1 + e^{−x}) together with its even part ℓ_e(x) and odd part ℓ_o(x).]

Loss functions factor as ℓ(x) = ℓ_e(x) + ℓ_o(x), and so do their risks, isolating a sufficient statistic μ for the labels.

Weakly supervised learning: (1) estimate μ and (2) plug it into ℓ(x) and call standard algorithms. E.g., SGD:

$$\theta_{t+1} \leftarrow \theta_t - \eta\,\nabla\ell(\pm\langle\theta_t, x_i\rangle) - \tfrac{1}{2}\,\eta\, a\,\mu$$

For asymmetric label noise with rates p₊, p₋, an unbiased estimator is

$$\hat\mu = \mathbb{E}_{(x,\tilde y)}\!\left[\, x\;\frac{\tilde y - (p_- - p_+)}{1 - p_- - p_+}\,\right]$$
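For reference (the deck uses ℓ_e and ℓ_o without spelling them out), the even/odd split behind the factorization is the standard one:

```latex
% Even/odd decomposition of a loss \ell: \ell_e is even, \ell_o is odd,
% and for an a-linear-odd loss \ell_o(x) is moreover linear, equal to a x.
\ell_e(x) = \frac{\ell(x) + \ell(-x)}{2}, \qquad
\ell_o(x) = \frac{\ell(x) - \ell(-x)}{2}, \qquad
\ell(x) = \ell_e(x) + \ell_o(x).
```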
Preliminary

• Binary classification: S = {(xᵢ, yᵢ), i ∈ [m]}, with [m] = {1, …, m}, sampled from D over R^d × {−1, 1}
• Learn a linear (or kernel) model h ∈ H
• Minimize the empirical risk associated with a surrogate loss ℓ:

$$\underset{h \in \mathcal{H}}{\operatorname{argmin}}\; \mathbb{E}_S\!\left[\ell(y\, h(x))\right] \;=\; \underset{h \in \mathcal{H}}{\operatorname{argmin}}\; R_{S,\ell}(h)$$
Mean operator & linear-odd losses

• Mean operator:

$$\mu_S := \mathbb{E}_S[y\, x] = \frac{1}{m}\sum_{i=1}^m y_i x_i$$

• a-linear-odd loss (a-lol), for any a ∈ R and generic argument x:

$$\tfrac{1}{2}\left(\ell(x) - \ell(-x)\right) = \ell_o(x) = a\,x$$
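A minimal numpy sketch (mine, not code from the paper) of these two definitions: it computes the empirical mean operator on synthetic data and checks numerically that the logistic loss log(1 + e^{−x}) is an a-lol with a = −1/2.

```python
# Sketch: mean operator mu_S = (1/m) sum_i y_i x_i, and a numerical check that
# the odd part of the logistic loss is -x/2, i.e. the loss is a-lol with a = -1/2.
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))           # features
y = rng.choice([-1.0, 1.0], size=m)   # labels in {-1, +1}

mu_S = (y[:, None] * X).mean(axis=0)  # mean operator, a d-dimensional vector

logistic = lambda z: np.logaddexp(0.0, -z)    # log(1 + e^{-z}), numerically stable
z = np.linspace(-3.0, 3.0, 7)
odd_part = 0.5 * (logistic(z) - logistic(-z))
assert np.allclose(odd_part, -0.5 * z)        # ell_o(z) = a*z with a = -1/2
```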
Loss factorization

• Linear model h (neither smoothness nor convexity of ℓ is required)
• Linear-odd loss: ½(ℓ(x) − ℓ(−x)) = ℓ_o(x) = ax
• Define the "double sample": S_{2x} := {(xᵢ, σ), i ∈ [m], σ ∈ {±1}}

$$R_{S,\ell}(h) \;=\; \tfrac{1}{2}\, R_{S_{2x},\ell}(h) \;+\; a \cdot h(\mu_S)$$
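A small numerical check of the factorization (again my own sketch, with the logistic loss, a = −1/2, and a linear model). It assumes the risk over the double sample S_{2x} averages over the m original points while summing over the two signs, which is what the ½ factor presupposes.

```python
# Check R_{S,l}(theta) = 1/2 * R_{S_2x,l}(theta) + a * <theta, mu_S> numerically
# for the logistic loss (a = -1/2) and a linear model h(x) = <theta, x>.
import numpy as np

rng = np.random.default_rng(1)
m, d = 500, 4
X = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)
theta = rng.normal(size=d)

loss = lambda z: np.logaddexp(0.0, -z)   # logistic loss
a = -0.5                                 # its linear-odd coefficient

R_S = loss(y * (X @ theta)).mean()       # labeled empirical risk

scores = X @ theta
# risk on the label-free double sample: average over i, sum over the two signs
R_S2x = (loss(scores) + loss(-scores)).mean()
mu_S = (y[:, None] * X).mean(axis=0)

assert np.isclose(R_S, 0.5 * R_S2x + a * (theta @ mu_S))
```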
Loss factorization: proof

$$
\begin{aligned}
R_{S,\ell}(h) &= \mathbb{E}_S\!\left[\ell(y\, h(x))\right] \\
&= \tfrac{1}{2}\,\mathbb{E}_S\!\left[\ell(y h(x)) + \ell(-y h(x)) + \ell(y h(x)) - \ell(-y h(x))\right] && \text{(odd + even)} \\
&= \tfrac{1}{2}\,\mathbb{E}_{S_{2x}}\!\left[\ell(\sigma\, h(x))\right] + \mathbb{E}_S\!\left[\ell_o(h(y\, x))\right] \\
&= \tfrac{1}{2}\, R_{S_{2x},\ell}(h) + a \cdot h(\mu_S) && \text{(}\ell_o \text{ and } h \text{ linear; sufficiency of } \mu \text{ for } y\text{)}
\end{aligned}
$$

Here the ½ works out because the risk on S_{2x} is averaged over the m original points and summed over the two signs σ.
Linear-odd losses: examples

• Logistic loss & exponential family:

$$\sum_{i=1}^m \log\!\left(1 + e^{-2 y_i \langle\theta, x_i\rangle}\right) \;=\; \sum_{i=1}^m \log \sum_{y \in \mathcal{Y}} e^{y \langle\theta, x_i\rangle} \;-\; m\,\langle\theta, \mu_S\rangle$$

(follows from log(1 + e^{−2yz}) = log(e^{z} + e^{−z}) − yz and Σᵢ yᵢ⟨θ, xᵢ⟩ = m⟨θ, μ_S⟩)

• Examples:

  loss            ℓ(x)                            odd term ℓ_o(x)
  a-lol (generic) ℓ(x)                            ax
  ρ-loss (ρ ≥ 0)  ρ|x| − ρx + 1                   −ρx
  unhinged        1 − x                           −x
  perceptron      max(0, −x)                      −x/2
  double-hinge    max(−x, ½ max(0, 1 − x))        −x/2
  spl             a_ℓ + ℓ*(−x)/b_ℓ                −x/(2 b_ℓ)
  logistic        log(1 + e^{−x})                 −x/2
  square          (1 − x)²                        −2x
  Matsushita      √(1 + x²) − x                   −x
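A quick numerical sanity check of the odd-term column for a few rows of the table (losses as written above; the helper names are mine):

```python
# Verify that the listed odd term equals (l(x) - l(-x)) / 2 for a few losses.
import numpy as np

x = np.linspace(-4.0, 4.0, 81)
examples = {
    "unhinged":   (lambda z: 1.0 - z,                    lambda z: -z),
    "perceptron": (lambda z: np.maximum(0.0, -z),        lambda z: -z / 2.0),
    "logistic":   (lambda z: np.logaddexp(0.0, -z),      lambda z: -z / 2.0),
    "square":     (lambda z: (1.0 - z) ** 2,             lambda z: -2.0 * z),
    "Matsushita": (lambda z: np.sqrt(1.0 + z ** 2) - z,  lambda z: -z),
}
for name, (ell, ell_odd) in examples.items():
    assert np.allclose(0.5 * (ell(x) - ell(-x)), ell_odd(x)), name
```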
Generalization bound

• Loss ℓ is an a-lol and L-Lipschitz
• Bounds: R^d ⊇ X = {x : ‖x‖₂ ≤ X < ∞} and H = {θ : ‖θ‖₂ ≤ B < ∞}
• Bounded loss: c(X, B) := max_{y∈{±1}} ℓ(yXB)
• Let θ̂ = argmin_{θ∈H} R_{S,ℓ}(θ). Then for any δ > 0, with probability at least 1 − δ:

$$R_{D,\ell}(\hat\theta) - \inf_{\theta\in\mathcal H} R_{D,\ell}(\theta) \;\le\; \frac{(\sqrt{2}+1)\,XBL}{\sqrt{m}} \;+\; 4\,c(X,B)\,L\,\sqrt{\frac{1}{2m}\log\frac{1}{\delta}} \;+\; 2|a|\,B\,\|\mu_D - \mu_S\|_2$$

The first two terms measure the complexity of the space H (they come from the even, non-linear part of the loss); the last term isolates the mean operator and is in turn bounded, with high probability, by 2|a|XB·√((d/m)·log(2d/δ)).
Weakly supervised learning

D → (corruption) → D̃ → (sample) → S̃

• Weak labels may be wrong (noisy), missing, multi-instance, etc.
• 2-step approach:
  (1) Estimate μ from the weakly labeled S̃
  (2) Plug it into ℓ and call any algorithm for risk minimization on S_{2x}
Example: SGD (step 2)

Algorithm 1: μ-SGD
  Input: S_{2x}, μ, ℓ an a-lol; θ_0 ← 0
  For t = 1, 2, … until convergence:
    Pick i ∈ {1, …, |S_{2x}|} at random
    η ← 1/t
    Pick any v ∈ ∂ℓ(y_i⟨θ_t, x_i⟩)
    θ_{t+1} ← θ_t − η (v + aμ/2)
  Output: θ_{t+1}

The additive term aμ/2 in the update is the only change with respect to standard SGD.

• In the paper: proximal algorithms
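A minimal numpy sketch of μ-SGD for the logistic loss (a = −1/2). It is an illustration under assumptions of mine, not the paper's code: it runs a fixed number of shuffled passes instead of a convergence test, and `mu_sgd` is a name I chose.

```python
# mu-SGD as in Algorithm 1, specialized to the logistic loss (a = -1/2).
# Gradients are taken on the label-free double sample S_2x = {(x_i, sigma)};
# the additive term a*mu/2 is the only change w.r.t. plain SGD.
import numpy as np

def mu_sgd(X, mu, a=-0.5, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    X2 = np.vstack([X, X])                              # double sample: features
    s2 = np.concatenate([np.ones(m), -np.ones(m)])      # and both signs
    theta, t = np.zeros(d), 0
    for _ in range(epochs):
        for i in rng.permutation(2 * m):
            t += 1
            eta = 1.0 / t
            z = s2[i] * (X2[i] @ theta)
            v = -s2[i] * X2[i] / (1.0 + np.exp(z))      # gradient of log(1 + e^{-z})
            theta = theta - eta * (v + a * mu / 2.0)
    return theta
```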
A unifying approach

Learning from label proportions with
• logistic loss [Quadrianto et al. '09]
• symmetric proper losses [Patrini et al. '14]

Learning with noisy labels with
• logistic loss [Gao et al. '16]
Asymmetric label noise

Sample corrupted by asymmetric noise rates p₊, p₋:  S̃ = {(xᵢ, ỹᵢ)}_{i=1}^m

By the method of [Natarajan et al. '13], an unbiased estimator of μ_S is

$$\hat\mu_{\tilde S} = \mathbb{E}_{\tilde S}\!\left[\, x\;\frac{\tilde y - (p_- - p_+)}{1 - p_- - p_+}\,\right]$$

This is step (1); then run μ-SGD for step (2).
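A sketch of step (1) under this noise model (assuming the noise rates p₊, p₋ are known, as on the slide; the function name is mine). It pairs with the `mu_sgd` sketch shown earlier for step (2).

```python
# Plug-in estimator of the mean operator from noisy labels y_tilde and known
# noise rates p_plus, p_minus; unbiased in expectation over the noise process.
import numpy as np

def mu_hat(X, y_tilde, p_plus, p_minus):
    w = (y_tilde - (p_minus - p_plus)) / (1.0 - p_minus - p_plus)
    return (w[:, None] * X).mean(axis=0)

# Step (2) would then be, e.g.:  theta = mu_sgd(X, mu_hat(X, y_tilde, p_plus, p_minus))
```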
Generalization bound under noise

• Same as before, except that now θ̂ = argmin_{θ∈H} R̂_{S̃,ℓ}(θ). Then for any δ > 0, with probability at least 1 − δ:

$$R_{D,\ell}(\hat\theta) - \inf_{\theta\in\mathcal H} R_{D,\ell}(\theta) \;\le\; \frac{(\sqrt{2}+1)\,XBL}{\sqrt{m}} \;+\; 4\,c(X,B)\,L\,\sqrt{\frac{1}{2m}\log\frac{2}{\delta}} \;+\; \frac{2|a|\,XB}{1 - p_- - p_+}\,\sqrt{\frac{d}{m}\log\frac{2d}{\delta}}$$

The complexity terms are untouched: the noise rates affect the linear (mean-operator) term only, through the factor 1/(1 − p₋ − p₊).