Boosted Density Estimation Remastered

Zac Cranko (1, 2) and Richard Nock (2, 1, 3)

1 The Australian National University   2 CSIRO Data61   3 The University of Sydney
Quick Summary

• Learn a density function incrementally
• Use classifiers for the incremental updates (similar to GAN discriminators)
• Unlike other state-of-the-art attempts, achieve strong (geometric) convergence results using only a weak learning assumption on the classifiers (details in the paper!)
The GAN-style discriminator objective:

\[
\sup_{D : \mathcal{X} \to (0,1)} \mathbb{E}_{Q_0}[\log D] + \mathbb{E}_{P}[\log(1 - D)]
\]
Take \(f(t) \overset{\text{def}}{=} t \log t - (t+1)\log(t+1)\) and \(\varphi(D) \overset{\text{def}}{=} \frac{D}{1-D}\). Then

\[
\begin{aligned}
\sup_{D : \mathcal{X} \to (0,1)} \mathbb{E}_{Q_0}[\log D] + \mathbb{E}_{P}[\log(1 - D)]
&= \sup_{D : \mathcal{X} \to (0,1)} \mathbb{E}_{Q_0}[f' \circ \varphi \circ D] - \mathbb{E}_{P}[f^{*} \circ f' \circ \varphi \circ D] \\
&= \sup_{d : \mathcal{X} \to (0,\infty)} \mathbb{E}_{Q_0}[f' \circ d] - \mathbb{E}_{P}[f^{*} \circ f' \circ d] \\
&= \mathbb{E}_{Q_0}\!\left[f' \circ \tfrac{\mathrm{d}P}{\mathrm{d}Q_0}\right] - \mathbb{E}_{P}\!\left[f^{*} \circ f' \circ \tfrac{\mathrm{d}P}{\mathrm{d}Q_0}\right]
\end{aligned}
\]

Recall (change of measure): for any integrable \(g\),

\[
\int g(x)\, P(\mathrm{d}x) = \int g(x)\, \frac{\mathrm{d}P}{\mathrm{d}Q_0}(x)\, Q_0(\mathrm{d}x).
\]
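As a quick numerical sanity check of the reduction above (a minimal sketch of ours, not the paper's code; NumPy is assumed, and \(f^{*}(u) = -\log(1 - e^{u})\), valid for \(u < 0\), is the convex conjugate of this \(f\)): with this \(f\) and \(\varphi\) one has \(f' \circ \varphi \circ D = \log D\) and \(f^{*} \circ f' \circ \varphi \circ D = -\log(1 - D)\).

```python
import numpy as np

# Definitions from the slide: f(t) = t log t - (t + 1) log(t + 1), phi(D) = D / (1 - D).
def f_prime(t):
    # f'(t) = log(t / (t + 1))
    return np.log(t) - np.log(t + 1.0)

def f_conj(u):
    # Convex conjugate of f: f*(u) = -log(1 - e^u), defined for u < 0.
    return -np.log(1.0 - np.exp(u))

def phi(D):
    return D / (1.0 - D)

D = np.linspace(0.01, 0.99, 99)                                  # discriminator outputs in (0, 1)
assert np.allclose(f_prime(phi(D)), np.log(D))                   # f' o phi o D = log D
assert np.allclose(f_conj(f_prime(phi(D))), -np.log(1.0 - D))    # f* o f' o phi o D = -log(1 - D)
print("composition identities hold")
```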
Main Idea

\[
d_1 \in \operatorname*{argmax}_{d' : \mathcal{X} \to (0,\infty)} \; \mathbb{E}_{Q_0}[f' \circ d'] - \mathbb{E}_{P}[f^{*} \circ f' \circ d']
\]

1. Find d_1 as above
2. Multiply: d_1(x) Q_0(dx) recovers P(dx)
3. Finished. Get a job at a hedge fund next door

Unfortunately it is not so simple: in practice we can only approximately solve the maximisation. Sadface. (A toy sketch of this one-shot recipe follows below.)
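For intuition only, here is a minimal sketch of the one-shot recipe on a 1-D toy problem (our illustration, not the authors' code; scikit-learn's LogisticRegression, the Gaussian choices for P and Q_0, and the helper names q0_density and d1 are all assumptions): a probabilistic classifier separating samples of P from samples of Q_0 yields the ratio estimate d_1(x) ≈ D(x)/(1 − D(x)), which is then multiplied onto the known density of Q_0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy setup: P is the unknown target, Q_0 is an initial guess we can sample from
# and whose density q0 we know in closed form (standard normal here).
x_p  = rng.normal(loc=2.0, scale=1.0, size=(5000, 1))   # samples from P
x_q0 = rng.normal(loc=0.0, scale=1.0, size=(5000, 1))   # samples from Q_0

def q0_density(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Train a classifier to separate P (label 1) from Q_0 (label 0).
X = np.vstack([x_p, x_q0])
y = np.concatenate([np.ones(len(x_p)), np.zeros(len(x_q0))])
clf = LogisticRegression().fit(X, y)

# If D(x) = Pr(label = 1 | x), then D / (1 - D) estimates dP/dQ_0 (equal class priors).
def d1(x):
    D = clf.predict_proba(x.reshape(-1, 1))[:, 1]
    return D / (1.0 - D)

# "Multiply": the estimated density of P is d_1(x) * q0(x).
grid = np.linspace(-4.0, 6.0, 200)
p_hat = d1(grid) * q0_density(grid)
print(p_hat.sum() * (grid[1] - grid[0]))   # should be close to 1 if the estimate is good
```

With equal sample sizes from each distribution the classifier's odds estimate the density ratio directly; with unequal sizes one would rescale by the ratio of class priors.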
Solution

\[
d_t \in \operatorname*{argmax}_{d' : \mathcal{X} \to (0,\infty)} \; \mathbb{E}_{Q_{t-1}}[f' \circ d'] - \mathbb{E}_{P}[f^{*} \circ f' \circ d']
\]

\[
Q_t(\mathrm{d}x) \overset{\text{def}}{=} \frac{1}{Z_t}\, \tilde{Q}_t(\mathrm{d}x), \qquad \tilde{Q}_t(\mathrm{d}x) = d_t^{\alpha_t}(x)\, Q_{t-1}(\mathrm{d}x), \qquad Z_t = \int \mathrm{d}\tilde{Q}_t
\]

1. Some step-size parameters α_t ∈ (0, 1)
2. Treat the updates as classifiers: d_t = exp ∘ c_t

• The classifiers distinguish samples originating from P from samples of Q_{t−1}, as in a GAN
• However, unlike a GAN, there is not necessarily a simple, fast sampler for Q_{t−1}; there is, however, a closed-form density function

Convergence of Q_t → P in KL divergence under a weak learning assumption on the updates-as-classifiers. With additional minimal assumptions: geometric convergence. (A toy sketch of the boosting loop follows below.)
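A minimal numerical sketch of this loop on the same 1-D toy problem (again our illustration, not the paper's algorithm verbatim; the fixed step size α_t = 0.5, the grid-based computation of Z_t, and the use of LogisticRegression log-odds as c_t are all assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy 1-D target P and initial model Q_0 (standard normal), tracked on a grid.
grid = np.linspace(-6.0, 8.0, 2001)
dx = grid[1] - grid[0]
p = np.exp(-0.5 * (grid - 2.0) ** 2) / np.sqrt(2 * np.pi)      # density of P (N(2, 1))
q = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)              # density of Q_0 (N(0, 1))

x_p = rng.normal(2.0, 1.0, size=5000)                          # samples from P
alpha = 0.5                                                    # fixed step size (assumption)

for t in range(1, 6):
    # Draw (approximate) samples from Q_{t-1} using its grid density.
    x_q = rng.choice(grid, size=5000, p=q * dx / np.sum(q * dx))

    # Classifier c_t distinguishing P (label 1) from Q_{t-1} (label 0); d_t = exp o c_t.
    X = np.concatenate([x_p, x_q]).reshape(-1, 1)
    y = np.concatenate([np.ones(5000), np.zeros(5000)])
    clf = LogisticRegression().fit(X, y)
    c_t = clf.decision_function(grid.reshape(-1, 1))           # classifier log-odds on the grid
    d_t = np.exp(c_t)                                          # d_t = exp(c_t)

    # Multiplicative update: Q_t proportional to d_t^alpha * Q_{t-1}, renormalised by Z_t.
    q_tilde = d_t ** alpha * q
    Z_t = np.sum(q_tilde) * dx
    q = q_tilde / Z_t

    kl = np.sum(p * np.log(p / q)) * dx                        # KL(P, Q_t) on the grid
    print(f"t = {t}: KL(P, Q_t) = {kl:.4f}")
```

Because the model density q is tracked in closed form on the grid, sampling from Q_{t−1} here is only a convenience for training the classifier; the update itself needs just the density.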
Experiments

[Figure: discriminator accuracy (0.5 if Q_t → P) and KL(P, Q_t) (lower is better), plotted against boosting iteration t = 0, ..., 5, with density snapshots at t = 0, 1, 2, 3.]
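The accuracy diagnostic in the figure can be mimicked in the toy setting above (our sketch; the held-out LogisticRegression and the function name discriminator_accuracy are assumptions): train a fresh classifier on samples of P versus samples of the current Q_t and report its held-out accuracy, which should approach 0.5 as Q_t → P.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def discriminator_accuracy(x_p, x_q, seed=0):
    """Held-out accuracy of a fresh classifier separating samples of P from samples of Q_t.

    Values near 0.5 indicate the two sample sets are indistinguishable (Q_t close to P)."""
    X = np.concatenate([x_p, x_q]).reshape(-1, 1)
    y = np.concatenate([np.ones(len(x_p)), np.zeros(len(x_q))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = LogisticRegression().fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Example: a recovered model scores ~0.5, a poor one well above it.
rng = np.random.default_rng(0)
print(discriminator_accuracy(rng.normal(2, 1, 5000), rng.normal(2, 1, 5000)))  # ~0.5
print(discriminator_accuracy(rng.normal(2, 1, 5000), rng.normal(0, 1, 5000)))  # well above 0.5
```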
Thanks for listening, come chat to us at poster #161. (Bring beer!) 7