Learning from Corrupted Binary Labels via Class-Probability Estimation
Aditya Krishna Menon, Brendan van Rooyen, Cheng Soon Ong, Robert C. Williamson
National ICT Australia and The Australian National University
Learning from binary labels
[Figure: instances marked + and −; given such a sample, predict the label of a new instance]
Learning from noisy labels
[Figure: the same instances, but with some + and − labels flipped]
Learning from positive and unlabelled data
[Figure: a few instances marked +; all other instances unlabelled (?)]
Learning from binary labels
nature → S ∼ Dⁿ → learner
Goal: good classification wrt distribution D
Learning from corrupted labels
nature → S ∼ Dⁿ → corruptor → S̄ ∼ D̄ⁿ → learner
Goal: good classification wrt (unobserved) distribution D
Paper summary
Can we learn a good classifier from corrupted samples?
Prior work: in special cases (with a rich enough model), yes!
- can treat samples as if uncorrupted! (Elkan and Noto, 2008), (Zhang and Lee, 2008), (Natarajan et al., 2013), (du Plessis and Sugiyama, 2014), ...
This work: unified treatment via class-probability estimation
- analysis for a general class of corruptions
Assumed corruption model
Learning from binary labels: distributions
Fix instance space X (e.g. ℝᴺ)
Underlying distribution D over X × {±1}
Constituent components of D:
  (P(x), Q(x), π) = (P[X = x | Y = 1], P[X = x | Y = −1], P[Y = 1])
  (M(x), η(x)) = (P[X = x], P[Y = 1 | X = x])
Learning from corrupted binary labels
nature → S ∼ Dⁿ → corruptor → S̄ ∼ D̄ⁿ → learner
Samples from corrupted distribution D̄ = (P̄, Q̄, π̄), where
  P̄ = (1 − α)·P + α·Q
  Q̄ = β·P + (1 − β)·Q
and π̄ is arbitrary
- α, β are noise rates
- mutually contaminated distributions (Scott et al., 2013)
Goal: good classification wrt (unobserved) distribution D
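To make the corruption model concrete, here is a minimal sampling sketch (my own illustration; the function and variable names, and the Gaussian example, are not from the paper):

import numpy as np

def corrupt_sample(P_sample, Q_sample, alpha, beta, pi_bar, n, rng):
    # Corrupted positives are drawn from Pbar = (1 - alpha)*P + alpha*Q,
    # corrupted negatives from Qbar = beta*P + (1 - beta)*Q, and the
    # corrupted base rate pi_bar is arbitrary.
    X, y = [], []
    for _ in range(n):
        if rng.random() < pi_bar:            # emit a corrupted positive
            use_P = rng.random() < 1 - alpha
            y.append(+1)
        else:                                # emit a corrupted negative
            use_P = rng.random() < beta
            y.append(-1)
        X.append(P_sample(rng) if use_P else Q_sample(rng))
    return np.array(X), np.array(y)

# Example: two 1-d Gaussian class-conditionals with mild contamination.
rng = np.random.default_rng(0)
P = lambda r: r.normal(+1.0, 1.0)
Q = lambda r: r.normal(-1.0, 1.0)
X_bar, y_bar = corrupt_sample(P, Q, alpha=0.1, beta=0.2, pi_bar=0.5, n=2000, rng=rng)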
Special cases

Label noise: labels flipped w.p. ρ
  π̄ = (1 − 2ρ)·π + ρ
  α = π̄⁻¹·(1 − π)·ρ
  β = (1 − π̄)⁻¹·π·ρ

PU learning: observe M instead of Q
  π̄ = arbitrary
  P̄ = 1·P + 0·Q
  Q̄ = M = π·P + (1 − π)·Q

[Figures: noisy labels (left); positive and unlabelled data (right)]
Corrupted class-probabilities
Structure of corrupted class-probabilities underpins analysis

Proposition. For any D, D̄,
  η̄(x) = φ_{α,β,π}(η(x))
where φ_{α,β,π} is strictly monotone for fixed α, β, π.

Follows from Bayes' rule:
  η̄(x)/(1 − η̄(x)) = (π̄/(1 − π̄)) · P̄(x)/Q̄(x)
                   = (π̄/(1 − π̄)) · [(1 − α)·P(x)/Q(x) + α] / [β·P(x)/Q(x) + (1 − β)]
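The odds identity above translates directly into code. A sketch, assuming both the clean base rate π and the corrupted base rate π̄ are available (π̄ is observable from corrupted labels); the helper name phi is mine:

def phi(eta, alpha, beta, pi, pi_bar):
    # Corrupted class-probability as a function of the clean one: recover
    # r = P(x)/Q(x) from eta via Bayes' rule, then apply the contaminated
    # odds. Assumes 0 < eta < 1.
    r = (eta / (1 - eta)) * ((1 - pi) / pi)
    odds = (pi_bar / (1 - pi_bar)) * ((1 - alpha) * r + alpha) / (beta * r + (1 - beta))
    return odds / (1 + odds)

# Sanity check against the label-noise special case on the next slide:
# phi should reduce to (1 - 2*rho)*eta + rho.
pi, rho = 0.4, 0.1
pi_bar = (1 - 2 * rho) * pi + rho
alpha = (1 - pi) * rho / pi_bar
beta = pi * rho / (1 - pi_bar)
assert abs(phi(0.7, alpha, beta, pi, pi_bar) - ((1 - 2 * rho) * 0.7 + rho)) < 1e-12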
Corrupted class-probabilities: special cases

Label noise:
  η̄(x) = (1 − 2ρ)·η(x) + ρ
  ρ unknown (Natarajan et al., 2013)

PU learning:
  η̄(x) = π̄·η(x) / (π̄·η(x) + (1 − π̄)·π)
  π unknown (Ward et al., 2009)
Roadmap
Exploit monotone relationship between η̄ and η
nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → η̂ ≈ η̄ → ? → classifier
Classification with noise rates
Class-probabilities and classification
Many classification measures optimised by sign(η(x) − t):
- 0-1 error → t = 1/2
- Balanced error → t = π
- F-score → optimal t depends on D (Lipton et al., 2014; Koyejo et al., 2014)
We can relate this to thresholding of η̄!
Corrupted class-probabilities and classification
By the monotone relationship,
  η(x) > t ⟺ η̄(x) > φ_{α,β,π}(t)
Threshold η̄ at φ_{α,β,π}(t) → optimal classification on D
Can translate into a regret bound, e.g. for 0-1 loss
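Continuing the sketch (phi as defined earlier), classification for a target threshold t needs only the corrupted threshold φ(t):

import numpy as np

def classify_corrupted(eta_bar_hat, t, alpha, beta, pi, pi_bar):
    # Threshold estimates of eta_bar at phi(t); by monotonicity this agrees
    # with thresholding the clean eta at t.
    t_bar = phi(t, alpha, beta, pi, pi_bar)
    return np.where(eta_bar_hat > t_bar, +1, -1)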
Story so far
Classification scheme requires:
- η̄ → class-probability estimation (kernel logistic regression)
- t → if unknown, alternate approach (see poster)
- α, β, π → can we estimate these?
nature → D → corruptor → D̄ → class-prob estimator → η̂; noise estimator (?) → α̂, β̂, π̂; classifier: sign(η̂(x) − φ_{α̂,β̂,π̂}(t))
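One plausible realisation of the class-probability estimator box: the slides name kernel logistic regression, which can be approximated with scikit-learn's Nystroem features plus ordinary logistic regression (a substitution on my part; the exact experimental setup is not stated on these slides):

from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Fit on the corrupted sample (X_bar, y_bar), e.g. from the earlier sketch.
cpe = make_pipeline(Nystroem(gamma=1.0, n_components=200, random_state=0),
                    LogisticRegression(C=1.0, max_iter=1000))
cpe.fit(X_bar.reshape(-1, 1), y_bar)
eta_bar_hat = cpe.predict_proba(X_bar.reshape(-1, 1))[:, 1]   # column 1 = P(Y = +1)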
Estimating noise rates: some bad news
π strongly non-identifiable!
- π̄ allowed to be arbitrary (e.g. PU learning)
α, β non-identifiable without assumptions (Scott et al., 2013)
Can we estimate α, β under assumptions?
Weak separability assumption
Assume that D is "weakly separable":
  min_{x∈X} η(x) = 0    max_{x∈X} η(x) = 1
i.e. ∃ deterministically +'ve and −'ve instances
- weaker than full separability
Assumed range of η constrains observed range of η̄!
Estimating noise rates

Proposition. Pick any weakly separable D. Then, for any D̄,
  α = η̄_min·(η̄_max − π̄) / (π̄·(η̄_max − η̄_min))
  β = (1 − η̄_max)·(π̄ − η̄_min) / ((1 − π̄)·(η̄_max − η̄_min))
where η̄_min = min_{x∈X} η̄(x) and η̄_max = max_{x∈X} η̄(x)

α, β can be estimated from corrupted data alone
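The proposition is plug-in: replace η̄_min, η̄_max, π̄ with sample estimates. A sketch continuing from above (it uses the raw min/max of η̂; quantiles may be preferable for robustness, which is my suggestion rather than the slides'):

def estimate_noise_rates(eta_bar_hat, y_bar):
    # pi_bar is directly observable from the corrupted labels.
    pi_bar = (y_bar == 1).mean()
    eta_min, eta_max = eta_bar_hat.min(), eta_bar_hat.max()
    alpha_hat = eta_min * (eta_max - pi_bar) / (pi_bar * (eta_max - eta_min))
    beta_hat = (1 - eta_max) * (pi_bar - eta_min) / ((1 - pi_bar) * (eta_max - eta_min))
    return alpha_hat, beta_hat, pi_bar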
Estimating noise rates: special cases

Label noise:
  ρ = 1 − η̄_max = η̄_min
  π = (π̄ − η̄_min) / (η̄_max − η̄_min)

PU learning:
  α = 0 = η̄_min
  β = π = (π̄/(1 − π̄)) · (1 − η̄_max)/η̄_max

(Elkan and Noto, 2008), (Liu and Tao, 2014); c.f. mixture proportion estimate of (Scott et al., 2013)
In these cases, π can be estimated as well
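A quick numeric check of the PU column, with fabricated values π = 0.3 and π̄ = 0.5: the η̄_max formula inverts back to π exactly.

# In PU learning, eta_bar_max = pi_bar / (pi_bar + (1 - pi_bar) * pi);
# inverting recovers the clean base rate.
pi, pi_bar = 0.3, 0.5
eta_bar_max = pi_bar / (pi_bar + (1 - pi_bar) * pi)          # = 10/13
pi_recovered = (pi_bar / (1 - pi_bar)) * (1 - eta_bar_max) / eta_bar_max
assert abs(pi_recovered - pi) < 1e-12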
Story so far
Optimal classification in general requires α, β, π
nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → η̂; range of η̂ → noise estimator → α̂, β̂, π̂; classifier: sign(η̂(x) − φ_{α̂,β̂,π̂}(t))