Linking losses for density ratio and class-probability estimation
Aditya Krishna Menon and Cheng Soon Ong
Data61 and The Australian National University
Class-probability estimation (CPE)
From labelled instances, estimate the probability of an instance being +'ve, e.g. using logistic regression
[Figure: labelled +/− instances, with an estimated probability of 0.6 for one instance]
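As a minimal sketch of CPE via logistic regression (the toy 1-D data and step size here are hypothetical choices, not from the slides):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical 1-D data: negatives cluster near -2, positives near +2.
data = [(-2.5, -1), (-2.0, -1), (-1.5, -1), (1.5, 1), (2.0, 1), (2.5, 1)]

# Fit a linear scorer s(x) = w*x + b by gradient descent on the logistic loss.
w, b = 0.0, 0.0
for _ in range(2000):
    gw = gb = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)   # current estimate of P(Y = 1 | X = x)
        t = (y + 1) / 2          # map label in {-1, +1} to {0, 1}
        gw += (p - t) * x
        gb += (p - t)
    w -= 0.5 * gw / len(data)
    b -= 0.5 * gb / len(data)

eta_hat = sigmoid(w * 2.0 + b)   # class-probability estimate at x = 2
```

The sigmoid of the learned score plays the role of the probability estimate, foreshadowing the link functions discussed later.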
Density ratio estimation (DRE)
Given samples from densities p, q, estimate the density ratio r = p / q
[Figure: densities p, q and their ratio r]
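To make the target of DRE concrete, a sketch with two hypothetical Gaussian densities (the particular means and variances are illustrative only):

```python
import math

def gauss(x, mu, sigma):
    # density of N(mu, sigma^2) at x
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical densities: p = N(0, 1), q = N(0, 2^2).
def p(x): return gauss(x, 0.0, 1.0)
def q(x): return gauss(x, 0.0, 2.0)

def r(x):
    # the density ratio r = p / q, the target of DRE
    return p(x) / q(x)
```

In practice p and q are unknown and only samples are available; the point of DRE methods is to estimate r without first estimating each density.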
Application: covariate shift adaptation
Marginal training distribution ≠ marginal test distribution
[Figure: training vs. test marginal distributions of +/− instances]
Can overcome by reweighting training instances:
use the ratio between the test and train densities,
e.g. in a weighted class-probability estimator
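The reweighting idea can be sketched as follows (the train/test marginals and the logistic loss choice are hypothetical; in practice the weight is itself estimated by DRE):

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical train/test marginals over instances x.
def p_train(x): return gauss(x, 0.0, 1.0)
def p_test(x):  return gauss(x, 1.0, 1.0)

def importance_weight(x):
    # ratio of test to train densities: upweights regions the test set favours
    return p_test(x) / p_train(x)

def weighted_risk(s, sample):
    # importance-weighted empirical logistic risk of a scorer s on training data
    return sum(importance_weight(x) * math.log(1 + math.exp(-y * s(x)))
               for x, y in sample) / len(sample)
```

Minimising the weighted risk approximates minimising the risk under the test marginal, which is exactly why an accurate density ratio estimate matters here.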
This paper
Formal link between CPE and DRE:
- existing DRE approaches → implicitly performing CPE
- CPE → Bregman minimisation for DRE
- new application of DRE losses to "top ranking"
[Diagram: proper losses (logistic, exponential, square hinge) bridging CPE and DRE (Bregman, KLIEP, LSIF, ranking)]
DRE and CPE: formally
Distributions for learning with binary labels
Fix an instance space X (e.g. R^n). Let D be a distribution over X × {±1}, with P(Y = 1) = 1/2 and
(P(x), Q(x)) = (P(X = x | Y = 1), P(X = x | Y = −1))
(M(x), η(x)) = (P(X = x), P(Y = 1 | X = x))
[Figures: class conditionals; marginal and class-probability function]
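The two equivalent views of D can be sketched numerically (the Gaussian class-conditionals are a hypothetical choice; the prior π = 1/2 matches the slide):

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

PI = 0.5  # P(Y = 1), as fixed on the slide

# Hypothetical class-conditional densities:
def P(x): return gauss(x, 1.0, 1.0)    # density of X given Y = +1
def Q(x): return gauss(x, -1.0, 1.0)   # density of X given Y = -1

def M(x):
    # marginal density of X
    return PI * P(x) + (1 - PI) * Q(x)

def eta(x):
    # class-probability function, via Bayes' rule
    return PI * P(x) / M(x)
```

Either pair (P, Q) or (M, η) fully specifies D; CPE targets the second view, DRE the first.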
Scorers, losses, risks
A scorer is any s : X → R, e.g. a linear scorer s : x ↦ ⟨w, x⟩
A loss is any ℓ : {±1} × R → R₊, e.g. the logistic loss ℓ : (y, v) ↦ log(1 + e^{−yv})
The risk of scorer s wrt loss ℓ and distribution D is
L(s; D, ℓ) = E_{(X,Y)∼D}[ℓ(Y, s(X))],
i.e. the average loss on a random sample
[Figures: a linear scorer; the logistic loss]
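The empirical counterpart of the risk is just an average; a minimal sketch (the sample and the scorer s(x) = 2x are hypothetical):

```python
import math

def logistic_loss(y, v):
    return math.log(1 + math.exp(-y * v))

def empirical_risk(s, sample, loss):
    # average loss of scorer s over a labelled sample drawn from D
    return sum(loss(y, s(x)) for x, y in sample) / len(sample)

# A hypothetical sample and a linear scorer s(x) = 2x:
sample = [(-1.0, -1), (-0.5, -1), (0.5, 1), (1.2, 1)]
risk = empirical_risk(lambda x: 2 * x, sample, logistic_loss)
```

A scorer aligned with the labels achieves lower empirical risk than the trivial zero scorer, whose logistic risk is log 2.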
CPE versus DRE
Given samples S ∼ D^N, with D = (P, Q) = (M, η):
Class-probability estimation (CPE): estimate η, the class-probability function
Density ratio estimation (DRE): estimate r = p / q, the class-conditional density ratio
[Figures: class-probability estimate; densities p, q and ratio r]
CPE approaches: proper composite losses
For suitable S ⊆ R^X, find
argmin_{s ∈ S} L(s; D, ℓ)
where ℓ is such that, for some invertible Ψ : [0, 1] → R,
argmin_{s ∈ R^X} L(s; D, ℓ) = Ψ ∘ η,
and estimate η̂ = Ψ^{−1} ∘ s.
Such an ℓ is called strictly proper composite with link Ψ
Examples of proper composite losses
Logistic loss: Ψ^{−1} : v ↦ σ(v)
Exponential loss: Ψ^{−1} : v ↦ σ(2v)
Square hinge loss: Ψ^{−1} : v ↦ min(max(0, (v + 1)/2), 1)
[Figure: plots of the three losses]
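The three inverse links above can be written down directly (σ is the sigmoid; the dictionary is just an illustrative packaging):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Inverse links Psi^{-1} mapping a real-valued score to a probability estimate:
inverse_links = {
    "logistic":     lambda v: sigmoid(v),
    "exponential":  lambda v: sigmoid(2 * v),
    "square_hinge": lambda v: min(max(0.0, (v + 1) / 2), 1.0),
}
```

All three map a score of 0 to the uninformative probability 1/2; they differ in how quickly scores saturate to probabilities 0 and 1.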
DRE approaches: divergence minimisation
For suitable S ⊆ R^X, find
KLIEP (Sugiyama et al., 2008): argmin_{s ∈ S} KL(p ‖ q ⊙ s), a constrained KL minimisation
LSIF (Kanamori et al., 2009): argmin_{s ∈ S} E_{X∼Q}[(r(X) − s(X))²], a direct least squares minimisation
Roadmap
We begin by showing that existing DRE losses implicitly perform CPE
[Diagram: proper losses bridging CPE (logistic, exponential, square hinge) and DRE (KLIEP, LSIF)]
Existing DRE losses are proper composite
Existing DRE approaches as loss minimisation
Suppose D = (P, Q).
KLIEP (Sugiyama et al., 2008): argmin_{s ∈ S} E_{(X,Y)∼D}[ℓ(Y, s(X))], where
ℓ(−1, v) = a · v and ℓ(1, v) = −log v, for suitable a > 0
LSIF (Kanamori et al., 2009): argmin_{s ∈ S} E_{(X,Y)∼D}[ℓ(Y, s(X))], where
ℓ(−1, v) = (1/2) · v² and ℓ(1, v) = −v
These are no ordinary losses
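The two partial losses can be transcribed directly from the slide (function and argument names here are illustrative; note the KLIEP loss is only defined for scores v > 0, and the LSIF loss is unbounded below, hence "no ordinary losses"):

```python
import math

# Partial losses as on the slide:
def kliep_loss(y, v, a=1.0):
    # KLIEP: l(1, v) = -log v (requires v > 0), l(-1, v) = a * v
    return -math.log(v) if y == 1 else a * v

def lsif_loss(y, v):
    # LSIF: l(1, v) = -v, l(-1, v) = (1/2) * v^2
    return -v if y == 1 else 0.5 * v ** 2
```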
Existing DRE approaches as CPE
For u ∈ [0, 1], let Ψ_dr : u ↦ u / (1 − u).
Lemma. The LSIF loss is strictly proper composite with link Ψ_dr. The KLIEP loss with a > 0 is strictly proper composite with link a^{−1} · Ψ_dr.
KLIEP and LSIF perform CPE in disguise!
Proof
For LSIF and KLIEP (with a = 1),
ℓ′(1, v) / ℓ′(−1, v) = −1/v,
so that
f(v) = (1 − ℓ′(1, v) / ℓ′(−1, v))^{−1} = v / (1 + v) = Ψ_dr^{−1}(v).
Proper compositeness follows from (Reid and Williamson, 2010). The link Ψ_dr is especially suitable for DRE...
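The proof can be checked numerically for the LSIF loss: the minimiser v* of the conditional risk η · ℓ(1, v) + (1 − η) · ℓ(−1, v) at a fixed η should satisfy f(v*) = v*/(1 + v*) = η (the grid search and the choice η = 0.7 are illustrative):

```python
# Numerical check: the LSIF conditional-risk minimiser inverts via
# f(v) = v / (1 + v), i.e. the link is Psi_dr : u -> u / (1 - u).
def lsif_loss(y, v):
    return -v if y == 1 else 0.5 * v ** 2

def f(v):
    # f(v) = (1 - l'(1,v)/l'(-1,v))^{-1}, with l'(1,v)/l'(-1,v) = -1/v
    return 1.0 / (1.0 + 1.0 / v)

eta = 0.7
# minimise the conditional risk eta*l(1,v) + (1-eta)*l(-1,v) over a grid of scores
best_v = min((eta * lsif_loss(1, v) + (1 - eta) * lsif_loss(-1, v), v)
             for v in [i / 1000.0 for i in range(1, 10001)])[1]
```

The minimiser lands at η/(1 − η), confirming that mapping the optimal LSIF score back through Ψ_dr^{−1} recovers the class probability.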
Another view of Ψ_dr
Bayes' rule shows the targets of DRE and CPE are linked: since P(Y = 1) = 1/2,
(∀x ∈ X) r(x) = p(x) / q(x) = η(x) / (1 − η(x)) = Ψ_dr(η(x))
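This identity holds pointwise and is easy to verify on a toy example (the Gaussian class-conditionals are hypothetical; the equal priors match the slides):

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical class-conditionals, with equal priors P(Y = 1) = 1/2:
def p(x): return gauss(x, 1.0, 1.0)
def q(x): return gauss(x, -1.0, 1.0)

def r(x):      return p(x) / q(x)            # DRE target
def eta(x):    return p(x) / (p(x) + q(x))   # CPE target (equal priors)
def psi_dr(u): return u / (1 - u)            # the link Psi_dr
```

So pushing any class-probability estimate through Ψ_dr yields a density ratio estimate, and vice versa; this is the bridge the paper exploits.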