
Loss factorization, weakly supervised learning and label noise robustness



  1. Loss factorization, weakly supervised learning and label noise robustness
  Giorgio Patrini, Frank Nielsen, Richard Nock, Marcello Carioni
  Australian National University, Data61 (ex NICTA), Ecole Polytechnique, Sony CS Labs, Max Planck Institute of Mathematics in the Sciences

  2-4. In 1 slide
  [Figure: the logistic loss log(1 + e^{-x}) with its even part ℓ_e(x) and odd part ℓ_o(x).]
  • Loss functions factor, ℓ(x) = ℓ_e(x) + ℓ_o(x), and so do their risks, isolating a sufficient statistic µ for the labels.
  • Weakly supervised learning: (1) estimate µ and (2) plug it into ℓ(x) and call standard algorithms. E.g., SGD:
      θ_{t+1} ← θ_t − η ∇ℓ(±⟨θ_t, x_i⟩) − (1/2) η a µ
  • For asymmetric label noise with rates p_+, p_-, an unbiased estimator, computed on the noisy sample, is
      µ̂ = E_{(x, ỹ)} [ x · (ỹ − (p_- − p_+)) / (1 − p_- − p_+) ]
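
A worked instance of the decomposition sketched above (not spelled out on the slide): the logistic loss in the figure splits into an even part and a linear odd part,

    \log(1 + e^{-x}) = \log\big( e^{-x/2} (e^{x/2} + e^{-x/2}) \big)
                     = \log\big( e^{x/2} + e^{-x/2} \big) - \tfrac{x}{2}
                     = \ell_e(x) + \ell_o(x),
    \quad \text{with } \ell_e(x) = \log\big(e^{x/2} + e^{-x/2}\big) \text{ even and } \ell_o(x) = -\tfrac{x}{2},

so the logistic loss is an a-lol with a = −1/2.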

  5. Preliminary
  • Binary classification: S = {(x_i, y_i), i ∈ [m]} sampled from D over R^d × {−1, 1}, where [m] = {1, ..., m}
  • Learn a linear (or kernel) model h ∈ H
  • Minimize the empirical risk associated with a surrogate loss ℓ(x):
      argmin_{h ∈ H} E_S[ℓ(y h(x))] = argmin_{h ∈ H} R_{S,ℓ}(h)

  6-7. Mean operator & linear-odd losses
  • Mean operator:
      µ_S ≐ E_S[y x] = (1/m) Σ_{i=1}^m y_i x_i
  • a-linear-odd loss (a-lol), for a generic argument x:
      (1/2) (ℓ(x) − ℓ(−x)) = ℓ_o(x) = a x , for some a ∈ R
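
A minimal numpy sketch of the two definitions above, on illustrative data (none of the names below come from the paper): it computes the mean operator of a toy sample and checks numerically that the logistic loss is an a-lol with a = −1/2.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy labelled sample: m points in R^d with arbitrary labels.
    m, d = 200, 5
    X = rng.normal(size=(m, d))
    y = rng.choice([-1.0, 1.0], size=m)

    # Mean operator: mu_S = (1/m) * sum_i y_i * x_i.
    mu_S = (y[:, None] * X).mean(axis=0)
    print(mu_S.shape)  # (5,)

    # a-lol check for the logistic loss: (1/2)*(l(x) - l(-x)) equals a*x with a = -1/2.
    logistic = lambda x: np.log1p(np.exp(-x))
    xs = np.linspace(-3.0, 3.0, 7)
    print(np.allclose(0.5 * (logistic(xs) - logistic(-xs)), -0.5 * xs))  # True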

  8-9. Loss factorization
  • Linear model h (neither smoothness nor convexity of ℓ is required)
  • Linear-odd loss: (1/2) (ℓ(x) − ℓ(−x)) = ℓ_o(x) = a x
  • Define the "double sample" S_{2x} ≐ {(x_i, σ), i ∈ [m], σ ∈ {±1}}
  Then:
      R_{S,ℓ}(h) = (1/2) R_{S_{2x},ℓ}(h) + a · h(µ_S)

  10-13. Loss factorization: proof
      R_{S,ℓ}(h) = E_S[ ℓ(y h(x)) ]
                 = (1/2) E_S[ ℓ(y h(x)) + ℓ(−y h(x)) + ℓ(y h(x)) − ℓ(−y h(x)) ]          (odd + even)
                 = (1/2) E_{S_{2x}}[ ℓ(σ h(x)) ] + E_S[ ℓ_o(h(y x)) ]
                 = (1/2) R_{S_{2x},ℓ}(h) + a · h(µ_S)                                    (ℓ_o and h linear; µ_S sufficient for the labels y)
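
A numerical sanity check of the factorization just proved, for the logistic loss and a random linear model; the double-sample term is written out as an explicit average over the 2m sign-augmented points, and the data and names are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    m, d = 300, 4
    X = rng.normal(size=(m, d))
    y = rng.choice([-1.0, 1.0], size=m)
    theta = rng.normal(size=d)                # a linear model h(x) = <theta, x>

    logistic = lambda x: np.log1p(np.exp(-x))
    a = -0.5                                  # the logistic loss is a (-1/2)-lol

    margins = X @ theta                       # h(x_i)
    lhs = np.mean(logistic(y * margins))      # R_{S,l}(h)

    # Label-free part: average of l(sigma*h(x_i)) over the 2m points of S_2x,
    # matching the (1/2)*R_{S_2x,l}(h) term on the slide ...
    even_term = 0.5 * np.mean(logistic(margins) + logistic(-margins))
    # ... plus the label-dependent part, carried entirely by the mean operator.
    mu_S = (y[:, None] * X).mean(axis=0)
    rhs = even_term + a * (theta @ mu_S)

    print(np.allclose(lhs, rhs))  # True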

  14-15. Linear-odd losses: examples
  • Logistic loss & exponential family:
      Σ_{i=1}^m log(1 + e^{−2 y_i ⟨θ, x_i⟩}) = Σ_{i=1}^m log Σ_{y ∈ Y} e^{y ⟨θ, x_i⟩} − m ⟨θ, µ_S⟩
  • Examples of a-lols (odd term ℓ_o(x) = a x):

      loss            ℓ(x)                            ℓ_o(x)
      ρ-loss (ρ ≥ 0)  ρ|x| − ρx + 1                   −ρx
      unhinged        1 − x                           −x
      perceptron      max(0, −x)                      −x/2
      double-hinge    max(−x, (1/2) max(0, 1 − x))    −x/2
      spl             a_ℓ + ℓ*(−x)/b_ℓ                −x/(2 b_ℓ)
      logistic        log(1 + e^{−x})                 −x/2
      square          (1 − x)^2                       −2x
      Matsushita      √(1 + x^2) − x                  −x
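
One way to read the table: for every listed loss, half of ℓ(x) − ℓ(−x) is exactly linear in x. The sketch below transcribes a few rows (the loss definitions and the slopes a are my reading of the table, not taken verbatim from the paper) and verifies the linearity numerically.

    import numpy as np

    xs = np.linspace(0.1, 4.0, 40)   # positive grid; the odd part is determined by x > 0

    losses = {
        # name: (loss l(x), expected slope a of the odd part l_o(x) = a*x)
        "unhinged":     (lambda x: 1.0 - x,                                        -1.0),
        "perceptron":   (lambda x: np.maximum(0.0, -x),                            -0.5),
        "double-hinge": (lambda x: np.maximum(-x, 0.5 * np.maximum(0.0, 1.0 - x)), -0.5),
        "logistic":     (lambda x: np.log1p(np.exp(-x)),                           -0.5),
        "square":       (lambda x: (1.0 - x) ** 2,                                 -2.0),
        "matsushita":   (lambda x: np.sqrt(1.0 + x ** 2) - x,                      -1.0),
    }

    for name, (loss, a) in losses.items():
        odd = 0.5 * (loss(xs) - loss(-xs))       # l_o(x)
        print(name, np.allclose(odd / xs, a))    # slope is constant and equals a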

  16-18. Generalization bound
  • Loss ℓ is a-lol and L-Lipschitz
  • Bounds: R^d ⊇ X = {x : ‖x‖_2 ≤ X < ∞} and H = {θ : ‖θ‖_2 ≤ B < ∞}
  • Bounded loss: c(X, B) ≐ max_{y ∈ {±1}} ℓ(y X B)
  • Let θ̂ = argmin_{θ ∈ H} R_{S,ℓ}(θ). Then for any δ > 0, with probability at least 1 − δ:
      R_{D,ℓ}(θ̂) − inf_{θ ∈ H} R_{D,ℓ}(θ) ≤ 4 (√2 + 1) XBL / √m + c(X, B) L √( log(1/δ) / (2m) ) + 2 |a| B ‖µ_D − µ_S‖_2
  • The first two terms measure the complexity of the space H and the non-linearity of the loss; the last term is the only one where the labels enter, and concentration of the mean operator bounds it by 2 |a| XB √( (d/m) log(2d/δ) ).
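
For a feel of how the bound behaves, a small sketch that evaluates its three terms for illustrative constants (X = B = L = 1, logistic loss, and the concentration bound above in place of ‖µ_D − µ_S‖_2); the function and the numbers are mine, not the paper's.

    import numpy as np

    def excess_risk_bound(m, d, delta, X=1.0, B=1.0, L=1.0, a=-0.5):
        """Evaluate the three terms of the bound above for the logistic loss (illustrative)."""
        c_XB = np.log1p(np.exp(X * B))   # c(X, B) = max_y l(y*X*B) for the logistic loss
        complexity = 4 * (np.sqrt(2) + 1) * X * B * L / np.sqrt(m)
        deviation = c_XB * L * np.sqrt(np.log(1 / delta) / (2 * m))
        mean_op = 2 * abs(a) * X * B * np.sqrt((d / m) * np.log(2 * d / delta))
        return complexity, deviation, mean_op

    # every term decays like 1/sqrt(m)
    for m in (100, 10_000, 1_000_000):
        print(m, [round(float(t), 4) for t in excess_risk_bound(m, d=20, delta=0.05)])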

  19-20. Weakly supervised learning
  • D → (corrupt) → D̃ → (sample) → S̃
  • Weak labels may be wrong (noisy), missing, multi-instance, etc.
  • 2-step approach:
      (1) Estimate µ from the weakly labelled sample S̃
      (2) Plug it into ℓ and call any algorithm for risk minimization on S_{2x}

  21-22. Example: SGD (step 2)
  Algorithm 1: µ-SGD
      Input: S_{2x}, µ, ℓ an a-lol; θ_0 ← 0
      For t = 1, 2, ... until convergence:
          Pick i ∈ {1, ..., |S_{2x}|} at random
          η ← 1/t
          Pick any v ∈ ∂ℓ(y_i ⟨θ_t, x_i⟩)   (subgradient w.r.t. θ)
          θ_{t+1} ← θ_t − η (v + a µ / 2)   ← the only change w.r.t. standard SGD
      Output: θ_{t+1}
  • In the paper: proximal algorithms
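
A minimal numpy sketch of µ-SGD as written above, specialised to the logistic loss (a = −1/2). The function name, the fixed number of passes used as a stopping rule, and the data layout are my own choices, not the paper's.

    import numpy as np

    def mu_sgd(X, mu, a=-0.5, epochs=50, rng=None):
        """mu-SGD on the label-free double sample S_2x = {(x_i, sigma): sigma in {-1, +1}}.

        X  : (m, d) array of unlabeled inputs
        mu : estimated mean operator, shape (d,)
        a  : the loss's linear-odd coefficient (logistic loss: a = -1/2)
        """
        if rng is None:
            rng = np.random.default_rng(0)
        m, d = X.shape
        # Materialise S_2x: every point appears once with sigma = +1 and once with sigma = -1.
        X2 = np.vstack([X, X])
        s2 = np.concatenate([np.ones(m), -np.ones(m)])
        theta = np.zeros(d)
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(2 * m):
                t += 1
                eta = 1.0 / t
                margin = s2[i] * (theta @ X2[i])
                # (sub)gradient of the logistic loss log(1 + exp(-margin)) w.r.t. theta
                v = -s2[i] * X2[i] / (1.0 + np.exp(margin))
                theta = theta - eta * (v + a * mu / 2.0)
        return theta

    # usage sketch: estimate mu from whatever weak supervision is available, then
    # theta_hat = mu_sgd(X, mu_hat)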

  23. A unifying approach
  Learning from label proportions with
  • logistic loss [N. Quadrianto et al. '09]
  • symmetric proper loss [G. Patrini et al. '14]
  Learning with noisy labels with
  • logistic loss [Gao et al. '16]

  24-25. Asymmetric label noise
  • Sample S̃ = {(x_i, ỹ_i)}_{i=1}^m corrupted by asymmetric noise rates p_+, p_-.
  • By the method of [Natarajan et al. '13], an unbiased estimator of µ_S is
      µ̂ = E_{S̃} [ x · (ỹ − (p_- − p_+)) / (1 − p_- − p_+) ]
  • This is step (1); then run µ-SGD for step (2).
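
A simulation sketch of step (1): corrupt clean labels with asymmetric rates p_+, p_-, apply the corrected estimator above, and compare with the clean mean operator (all data and names are illustrative).

    import numpy as np

    rng = np.random.default_rng(2)

    m, d = 200_000, 3
    X = rng.normal(size=(m, d))
    y = np.sign(X[:, 0] + 0.3 * rng.normal(size=m))   # clean labels

    # Corrupt with asymmetric noise: flip positives w.p. p_plus, negatives w.p. p_minus.
    p_plus, p_minus = 0.3, 0.1
    flip = np.where(y > 0, rng.random(m) < p_plus, rng.random(m) < p_minus)
    y_tilde = np.where(flip, -y, y)

    mu_clean = (y[:, None] * X).mean(axis=0)

    # Noise-corrected, unbiased estimator of the mean operator [Natarajan et al. '13].
    w = (y_tilde - (p_minus - p_plus)) / (1.0 - p_minus - p_plus)
    mu_hat = (w[:, None] * X).mean(axis=0)

    print(np.round(mu_clean, 3), np.round(mu_hat, 3))  # close for large m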

  26-27. Generalization bound under noise
  • Same as before, except that now θ̂ = argmin_{θ ∈ H} R̂_{S̃,ℓ}(θ)
  • Then for any δ > 0, with probability at least 1 − δ:
      R_{D,ℓ}(θ̂) − inf_{θ ∈ H} R_{D,ℓ}(θ) ≤ 4 (√2 + 1) XBL / √m + c(X, B) L √( log(2/δ) / (2m) ) + ( 2 |a| XB / (1 − p_- − p_+) ) √( (d/m) log(2d/δ) )
  • The complexity terms are untouched; noise affects the linear term only.
