Boosted Density Estimation Remastered

Zac Cranko and Richard Nock



  1. Boosted Density Estimation Remastered. Zac Cranko (1,2) and Richard Nock (2,1,3). (1) The Australian National University; (2) CSIRO Data61; (3) The University of Sydney

  2. Quick Summary
     • Learn a density function incrementally
     • Use classifiers for the incremental updates (similar to GAN discriminators)
     • Unlike other state-of-the-art attempts, achieve strong (geometric) convergence results using a weak learning assumption on the classifiers (in the paper!)

  3. $$\sup_{D:\mathcal{X}\to(0,1)} \mathbb{E}_{Q_0}[\log D] + \mathbb{E}_{P}[\log(1-D)]$$

  5. Take $f(t) \overset{\text{def}}{=} t\log t - (t+1)\log(t+1)$ and $\phi(D) \overset{\text{def}}{=} \frac{D}{1-D}$. Then

     $$
     \begin{aligned}
     \sup_{D:\mathcal{X}\to(0,1)} \mathbb{E}_{Q_0}[\log D] + \mathbb{E}_{P}[\log(1-D)]
       &= \sup_{D:\mathcal{X}\to(0,1)} \mathbb{E}_{Q_0}[f'\circ\phi\circ D] - \mathbb{E}_{P}[f^{*}\circ f'\circ\phi\circ D] \\
       &= \sup_{d:\mathcal{X}\to(0,\infty)} \mathbb{E}_{Q_0}[f'\circ d] - \mathbb{E}_{P}[f^{*}\circ f'\circ d] \\
       &= \mathbb{E}_{Q_0}\!\left[f'\circ\frac{\mathrm{d}P}{\mathrm{d}Q_0}\right] - \mathbb{E}_{P}\!\left[f^{*}\circ f'\circ\frac{\mathrm{d}P}{\mathrm{d}Q_0}\right]
     \end{aligned}
     $$

     The first equality uses $f'\circ\phi\circ D = \log D$ and $f^{*}\circ f'\circ\phi\circ D = -\log(1-D)$.

     Recall: $\forall f:\ \int f(x)\,P(\mathrm{d}x) = \int f(x)\,\frac{\mathrm{d}P}{\mathrm{d}Q_0}(x)\,Q_0(\mathrm{d}x)$
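The change of variables on this slide can be checked numerically. The sketch below is not from the slides: it assumes the closed form $f^{*}(s) = -\log(1 - e^{s})$ for the convex conjugate of $f$ (valid for $s < 0$), and the helper names `f_conj`, `phi` and `check` are hypothetical.

```python
import math

def f(t):
    """Generator from the slide: f(t) = t log t - (t + 1) log(t + 1)."""
    return t * math.log(t) - (t + 1) * math.log(t + 1)

def f_prime(t):
    """Derivative of f: log t - log(t + 1)."""
    return math.log(t) - math.log(t + 1)

def f_conj(s):
    """Assumed closed form of the convex conjugate: f*(s) = -log(1 - exp(s)), s < 0."""
    return -math.log(1.0 - math.exp(s))

def phi(D):
    """Map a classifier output D in (0, 1) to a ratio d = D / (1 - D) in (0, inf)."""
    return D / (1.0 - D)

def check(D, tol=1e-9):
    d = phi(D)
    s = f_prime(d)
    # f' o phi o D = log D
    ok_first = abs(s - math.log(D)) < tol
    # f* o f' o phi o D = -log(1 - D)
    ok_second = abs(f_conj(s) + math.log(1.0 - D)) < tol
    # Fenchel equality at the maximiser: f*(f'(d)) = d * f'(d) - f(d)
    ok_fenchel = abs(f_conj(s) - (d * s - f(d))) < tol
    return ok_first and ok_second and ok_fenchel

results = [check(D) for D in (0.05, 0.3, 0.5, 0.9)]
print(results)
```

Each entry of `results` reports whether all three identities hold at that value of `D`, up to the tolerance.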

  9. Main Idea

     $$d_1 \in \operatorname*{argmax}_{d':\mathcal{X}\to(0,\infty)} \mathbb{E}_{Q_0}[f'\circ d'] - \mathbb{E}_{P}[f^{*}\circ f'\circ d']$$

     1. Find $d_1$ as above
     2. Multiply $d_1(x)\,Q_0(\mathrm{d}x)$ to find $P(\mathrm{d}x)$
     3. Finished. Get a job at a hedge fund next door

     Unfortunately this is not so simple, since in practice we can only approximately solve the maximisation. Sadface.
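A toy illustration of steps 1 and 2, and of why an inexact $d_1$ is a problem. Everything concrete here is an assumption made for the sketch: the two Gaussian densities, the grid, and the square-root "approximate" ratio `d_approx`, which simply stands in for an imperfect solution of the maximisation.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Grid on [-8, 8]; dx turns density values into approximate probabilities.
xs = [-8 + 16 * i / 2000 for i in range(2001)]
dx = xs[1] - xs[0]

p = [normal_pdf(x, 1.0, 0.7) for x in xs]   # target density P (assumed)
q0 = [normal_pdf(x, 0.0, 2.0) for x in xs]  # initial density Q_0 (assumed)

# Step 1: with an exact solve, d_1 is the density ratio dP/dQ_0.
d1 = [pv / qv for pv, qv in zip(p, q0)]

# Step 2: multiplying Q_0 by d_1 recovers P pointwise.
recovered = [dv * qv for dv, qv in zip(d1, q0)]
max_err = max(abs(rv - pv) for rv, pv in zip(recovered, p))

# Hypothetical imperfect ratio: the product is no longer P, and it does not
# even integrate to 1, so it is not a density without renormalisation.
d_approx = [dv ** 0.5 for dv in d1]
mass_approx = sum(dv * qv for dv, qv in zip(d_approx, q0)) * dx

print(max_err, mass_approx)
```

With the exact ratio the recovery error is at the level of floating-point rounding, while the imperfect ratio yields total mass below 1.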

  12. Solution

      $$d_t \in \operatorname*{argmax}_{d':\mathcal{X}\to(0,\infty)} \mathbb{E}_{Q_{t-1}}[f'\circ d'] - \mathbb{E}_{P}[f^{*}\circ f'\circ d']$$

      $$\tilde{Q}_t(\mathrm{d}x) = d_t^{\alpha_t}(x)\cdot\tilde{Q}_{t-1}(\mathrm{d}x), \qquad Q_t = \frac{1}{Z_t}\tilde{Q}_t, \quad\text{where } Z_t \overset{\text{def}}{=} \int \mathrm{d}\tilde{Q}_t$$

      1. Some step size parameters $\alpha_t \in (0,1)$
      2. Treat the updates as classifiers $d_t = \exp\circ c_t$

      • The classifiers are distinguishing between samples originating from $P$ and $Q_{t-1}$, like in a GAN
      • However, unlike a GAN, there is not necessarily a simple fast sampler for $Q_{t-1}$, but there is a closed-form density function

      Convergence of $Q_t \to P$ in KL-divergence with a weak learning assumption on the updates as classifiers. With additional minimal assumptions: geometric convergence.
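The update rule can be sketched on a one-dimensional grid. This is only an illustration under strong assumptions, not the authors' implementation: the densities are made-up Gaussians, the step size is fixed at `alpha = 0.5`, and instead of training a classifier the code uses the exact log-ratio $\log(\mathrm{d}P/\mathrm{d}Q_{t-1})$ as $c_t$. It also normalises after every round, which gives the same $Q_t$ as normalising the running product once.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def kl(p, q, dx):
    """Grid approximation of KL(P, Q) = integral of p log(p / q)."""
    return sum(pv * math.log(pv / qv) for pv, qv in zip(p, q)) * dx

xs = [-8 + 16 * i / 2000 for i in range(2001)]
dx = xs[1] - xs[0]

p = [normal_pdf(x, 1.0, 0.7) for x in xs]  # target density P (assumed)
q = [normal_pdf(x, 0.0, 2.0) for x in xs]  # initial density Q_0 (assumed)

alpha = 0.5                                 # fixed step size in (0, 1), an assumption
history = [kl(p, q, dx)]

for t in range(1, 6):
    # Stand-in for a trained classifier: the exact log density ratio.
    c = [math.log(pv / qv) for pv, qv in zip(p, q)]
    d = [math.exp(cv) for cv in c]                        # d_t = exp o c_t
    q_tilde = [dv ** alpha * qv for dv, qv in zip(d, q)]  # unnormalised update
    Z = sum(q_tilde) * dx                                 # normalising constant
    q = [v / Z for v in q_tilde]                          # Q_t
    history.append(kl(p, q, dx))

print(history)
```

In this toy setting `history` decreases at every round. Because each $Q_t$ is an explicit product of the $d_s^{\alpha_s}$ and $Q_0$, its density can be evaluated in closed form up to the constant $Z_t$, even though no sampler is constructed.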

  13. Experiments. [Figure: plots of classifier accuracy (0.5 if $Q_t \to P$) and of KL($P$, $Q_t$) (lower is better); panels labelled t = 0, t = 1, t = 2, t = 3.]
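The "0.5 if $Q_t \to P$" annotation can be made concrete. The slide does not say how accuracy was measured; the sketch below assumes balanced samples from $P$ and $Q_t$ and scores the best possible classifier, whose accuracy is $\frac{1}{2}\int \max(p, q)\,\mathrm{d}x$. The Gaussian densities and the helper `bayes_accuracy` are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_accuracy(p, q, dx):
    """Best achievable accuracy when telling P from Q with equal class priors."""
    return 0.5 * sum(max(pv, qv) for pv, qv in zip(p, q)) * dx

xs = [-8 + 16 * i / 2000 for i in range(2001)]
dx = xs[1] - xs[0]

p = [normal_pdf(x, 1.0, 0.7) for x in xs]   # target density P (assumed)
q0 = [normal_pdf(x, 0.0, 2.0) for x in xs]  # a poor model Q_0 (assumed)

acc_far = bayes_accuracy(p, q0, dx)   # clearly above chance
acc_same = bayes_accuracy(p, p, dx)   # chance level when the model equals P

print(acc_far, acc_same)
```

When the model matches $P$ exactly, no classifier can beat chance, so accuracy sitting at 0.5 is consistent with $Q_t$ having converged.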

  14. Thanks for listening; come chat to us at poster #161. (Bring beer!)
