

  1. Approximating likelihood ratios with calibrated classifiers. Gilles Louppe, June 22, 2016, MLHEP, Lund.

  2. Joint work with Kyle Cranmer (New York University) and Juan Pavez (Federico Santa María University). See (Cranmer et al., 2015) for full details.

  3. Studying the constituents of the universe. (c) Jorge Cham

  4. Collecting data. (c) Jorge Cham

  5. Testing for new physics: compare p(data | theory + X) against p(data | theory), i.e. the ratio p(data | theory + X) / p(data | theory). (c) Jorge Cham

  6. Likelihood-free setup
  • Complex simulator p parameterized by θ;
  • Samples x ∼ p can be generated on demand;
  • ... but the likelihood p(x | θ) cannot be evaluated!

  7. Simple hypothesis testing
  • Assume some observed data D = {x_1, ..., x_n};
  • Test a null θ = θ0 against an alternative θ = θ1;
  • The Neyman-Pearson lemma states that the most powerful test statistic is
        λ(D; θ0, θ1) = ∏_{x ∈ D} p_X(x | θ0) / p_X(x | θ1);
  • ... but neither p_X(x | θ0) nor p_X(x | θ1) can be evaluated!

  8. Straight approximation
  1. Approximate p_X(x | θ0) and p_X(x | θ1) individually, using density estimation algorithms;
  2. Evaluate their ratio r(x; θ0, θ1) = p_X(x | θ0) / p_X(x | θ1).
  This works fine for low-dimensional data, but because of the curse of dimensionality it is in general a difficult problem. Moreover, it is not even necessary!
  "When solving a problem of interest, do not solve a more general problem as an intermediate step." – Vladimir Vapnik
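  To make the straight approach concrete, here is a minimal sketch of it in one dimension, using kernel density estimation. The Gaussian toy densities, sample sizes, and bandwidth are illustrative assumptions, not choices from the slides; in high dimensions both density estimates degrade, which is exactly the problem described above.

```python
# Straight approximation: estimate each density separately with KDE,
# then take the pointwise ratio. Works in 1D, degrades in high dimensions.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
x0 = rng.normal(loc=0.0, scale=1.0, size=(1000, 1))  # draws from p(x | theta_0)
x1 = rng.normal(loc=1.0, scale=1.0, size=(1000, 1))  # draws from p(x | theta_1)

kde0 = KernelDensity(bandwidth=0.3).fit(x0)
kde1 = KernelDensity(bandwidth=0.3).fit(x1)

def ratio(x):
    """Pointwise estimate of r(x) = p(x | theta_0) / p(x | theta_1)."""
    # score_samples returns log-densities, so the ratio is exp of their difference.
    return np.exp(kde0.score_samples(x) - kde1.score_samples(x))

print(ratio(np.array([[0.0], [1.0]])))
```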

  9. Likelihood ratio invariance under change of variable
  Theorem. The likelihood ratio is invariant under the change of variable U = s(X), provided s(x) is monotonic with r(x):
      r(x) = p_X(x | θ0) / p_X(x | θ1) = p_U(s(x) | θ0) / p_U(s(x) | θ1).
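  A sketch of why this holds, filling in the step the slide leaves implicit: monotonicity of s with r means r is constant on each level set of s, so the ratio of pushed-forward densities collapses to r(x).

```latex
% Sketch: with U = s(X), the density of U under theta is an integral of
% p_X over the level set of s. Because s is monotonic with r, r is
% constant on each level set {x' : s(x') = s(x)}, so
\begin{align*}
\frac{p_U(s(x) \mid \theta_0)}{p_U(s(x) \mid \theta_1)}
  &= \frac{\int_{\{x' : s(x') = s(x)\}} p_X(x' \mid \theta_0)\, dx'}
          {\int_{\{x' : s(x') = s(x)\}} p_X(x' \mid \theta_1)\, dx'} \\
  &= \frac{\int_{\{x' : s(x') = s(x)\}} r(x)\, p_X(x' \mid \theta_1)\, dx'}
          {\int_{\{x' : s(x') = s(x)\}} p_X(x' \mid \theta_1)\, dx'}
   = r(x).
\end{align*}
```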

  10. Approximating likelihood ratios with classifiers
  • A classifier trained to distinguish x ∼ p_0 from x ∼ p_1 approximates
        s*(x) = p_X(x | θ1) / (p_X(x | θ0) + p_X(x | θ1)),
    which is monotonic with r(x).
  • Estimating p(s(x) | θ) is now easy, since the change of variable s(x) projects x into a 1D space where only the informative content of the ratio is preserved. This can be carried out using density estimation or calibration algorithms (histograms, KDE, isotonic regression, etc.).
  • Disentangle training from calibration.
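  A minimal sketch of the two stages for a fixed pair (θ0, θ1): train a classifier, then calibrate p(s(x) | θ) with 1D histograms. The logistic-regression model, Gaussian toys, and binning are illustrative assumptions; the carl toolbox (Louppe et al., 2016) provides the actual implementation.

```python
# Stage 1: train s(x) to separate the two samples.
# Stage 2: calibrate p(s(x) | theta) in 1D with histograms, then form the ratio.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
x0 = rng.normal(0.0, 1.0, size=(5000, 1))  # x ~ p(x | theta_0), label y = 0
x1 = rng.normal(1.0, 1.0, size=(5000, 1))  # x ~ p(x | theta_1), label y = 1

X = np.vstack([x0, x1])
y = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])
clf = LogisticRegression().fit(X, y)       # s(x) ~ p1 / (p0 + p1)

# Calibration: histograms of s(x) under each hypothesis.
bins = np.linspace(0.0, 1.0, 51)
h0, _ = np.histogram(clf.predict_proba(x0)[:, 1], bins=bins, density=True)
h1, _ = np.histogram(clf.predict_proba(x1)[:, 1], bins=bins, density=True)

def ratio_hat(x):
    """r_hat(x) = p(s(x) | theta_0) / p(s(x) | theta_1), via the histograms."""
    s = clf.predict_proba(x)[:, 1]
    idx = np.clip(np.digitize(s, bins) - 1, 0, len(h0) - 1)
    return h0[idx] / np.maximum(h1[idx], 1e-12)

print(ratio_hat(np.array([[0.0], [1.0]])))
```

  Note how the two stages are decoupled, as the last bullet says: any sufficiently good classifier can be used for training, and any 1D density estimator or calibration method can be swapped in afterwards.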

  11. Inference and composite hypothesis testing
  Approximated likelihood ratios can be used for inference, since
      θ̂ = argmax_θ p(D | θ)
        = argmax_θ ∏_{x ∈ D} p(x | θ) / p(x | θ1)
        = argmax_θ ∏_{x ∈ D} p(s(x; θ, θ1) | θ) / p(s(x; θ, θ1) | θ1)    (1)
  where θ1 is fixed and s(x; θ, θ1) is a family of classifiers parameterized by (θ, θ1). Accordingly, generalized (or profile) likelihood ratio tests can be evaluated in the same way.

  12. Parameterized learning
  For inference, we need to build a family s(x; θ, θ1) of classifiers.
  • One could build a classifier s independently for every (θ, θ1). But this is computationally expensive and would not guarantee a smooth evolution of s(x; θ, θ1) as θ varies.
  • Solution: build a single parameterized classifier instead, where the parameters are additional input features (Cranmer et al., 2015; Baldi et al., 2016); see the sketch after this slide.
      T := {}
      while size(T) < N do
          draw θ0 ∼ π_Θ0; draw x ∼ p(x | θ0); T := T ∪ {((x, θ0, θ1), y = 0)};
          draw θ1 ∼ π_Θ1; draw x ∼ p(x | θ1); T := T ∪ {((x, θ0, θ1), y = 1)};
      end while
      Learn a single classifier s(x; θ0, θ1) from T.
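  A minimal Python rendering of this loop, under stated assumptions: `simulate` is a placeholder for the intractable simulator, the uniform proposals stand in for π_Θ0 and π_Θ1, and both parameter draws are made before the two appends (a cosmetic reordering of the pseudocode).

```python
# Build a training set where (theta_0, theta_1) are appended as input
# features, then learn one parameterized classifier over the whole family.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)

def simulate(theta, size=1):
    # Placeholder for the intractable simulator p(x | theta).
    return rng.normal(loc=theta, scale=1.0, size=(size, 1))

N = 5000
rows, labels = [], []
while len(rows) < N:
    theta0 = rng.uniform(-1.0, 1.0)   # theta_0 ~ pi_Theta0
    theta1 = rng.uniform(-1.0, 1.0)   # theta_1 ~ pi_Theta1
    x = simulate(theta0)[0]
    rows.append(np.concatenate([x, [theta0, theta1]]))  # ((x, theta0, theta1), y=0)
    labels.append(0)
    x = simulate(theta1)[0]
    rows.append(np.concatenate([x, [theta0, theta1]]))  # ((x, theta0, theta1), y=1)
    labels.append(1)

s = MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=500)
s.fit(np.array(rows), np.array(labels))   # single classifier s(x; theta0, theta1)
```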

  13. Example: inference from multidimensional data
  Assume 5D data x generated from the following process p_0:
  1. z := (z0, z1, z2, z3, z4), such that
     z0 ∼ N(μ = α, σ = 1),
     z1 ∼ N(μ = β, σ = 3),
     z2 ∼ Mixture(½ N(μ = −2, σ = 1), ½ N(μ = 2, σ = 0.5)),
     z3 ∼ Exponential(λ = 3), and
     z4 ∼ Exponential(λ = 0.5);
  2. x := R z, where R is a fixed semi-positive definite 5 × 5 matrix defining a fixed projection of z into the observed space.
  [Figure: pairwise scatter plots of the observed data D over the features X0–X4.]
  Our goal is to infer the values of α and β based on D. Check out (Louppe et al., 2016) to reproduce this example.
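  A minimal sketch of this generative process. The matrix R below is an arbitrary fixed positive semi-definite matrix built for illustration; the original example fixes its own R (see the carl repository). Note that NumPy parameterizes the exponential by scale = 1/λ.

```python
# Toy 5D generator: latent z with the distributions above, observed x = R z.
import numpy as np

A = np.random.RandomState(42).normal(size=(5, 5))
R = A @ A.T   # fixed positive semi-definite 5x5 projection (illustrative choice)

def generate(alpha, beta, n, rng):
    z0 = rng.normal(alpha, 1.0, size=n)
    z1 = rng.normal(beta, 3.0, size=n)
    comp = rng.uniform(size=n) < 0.5   # equal-weight mixture component
    z2 = np.where(comp, rng.normal(-2.0, 1.0, size=n),
                        rng.normal(2.0, 0.5, size=n))
    z3 = rng.exponential(scale=1.0 / 3.0, size=n)   # rate lambda = 3
    z4 = rng.exponential(scale=1.0 / 0.5, size=n)   # rate lambda = 0.5
    z = np.column_stack([z0, z1, z2, z3, z4])
    return z @ R.T                                   # x = R z, row by row

D = generate(alpha=1.0, beta=-1.0, n=1000, rng=np.random.RandomState(0))
```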

  14. Example: inference from multidimensional data
  Recipe:
  1. Build a single parameterized classifier s(x; θ0, θ1), in this case a 2-layer NN trained on 5+2 features, with the alternative fixed to θ1 = (α = 0, β = 0).
  2. Find the approximated MLE (α̂, β̂) by solving Eqn. 1, using likelihood scans or optimization. Since the generator is inexpensive, p(s(x; θ0, θ1) | θ) can be calibrated on the fly, for every candidate (α, β), e.g. using histograms; a sketch of such a scan follows this slide.
  3. Construct the log-likelihood ratio (LLR) statistic
      −2 log Λ(α, β) = −2 log [ p(D | α, β) / p(D | α̂, β̂) ].
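  A minimal, self-contained sketch of step 2 for a single 1D parameter θ, with on-the-fly histogram calibration at each candidate value, as the slide suggests for an inexpensive generator. The Gaussian toy simulator and grid are illustrative, and for brevity this sketch retrains a small classifier at each grid point instead of reusing one parameterized classifier.

```python
# Likelihood scan: for each candidate theta, calibrate the ratio on the fly
# and evaluate -2 log of the approximated likelihood ratio of D.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
simulate = lambda theta, n: rng.normal(theta, 1.0, size=(n, 1))

theta_true, theta1 = 0.8, 0.0
D = simulate(theta_true, 200)           # "observed" data

def approx_nll(theta, n_cal=20000, n_bins=30):
    """-2 sum_x log r_hat(x; theta, theta1), calibrated at this theta."""
    x0, x1 = simulate(theta, n_cal), simulate(theta1, n_cal)
    clf = LogisticRegression().fit(np.vstack([x0, x1]),
                                   np.r_[np.zeros(n_cal), np.ones(n_cal)])
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    h0, _ = np.histogram(clf.predict_proba(x0)[:, 1], bins=bins, density=True)
    h1, _ = np.histogram(clf.predict_proba(x1)[:, 1], bins=bins, density=True)
    s = clf.predict_proba(D)[:, 1]
    idx = np.clip(np.digitize(s, bins) - 1, 0, n_bins - 1)
    log_r = (np.log(np.maximum(h0[idx], 1e-12))
             - np.log(np.maximum(h1[idx], 1e-12)))
    return -2.0 * log_r.sum()

grid = np.linspace(0.4, 1.2, 17)
scan = [approx_nll(t) for t in grid]
print("approximate MLE:", grid[int(np.argmin(scan))])
```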

  15. [Figure: exact −2 log Λ(α, β) (left) versus the approximate LLR smoothed by a Gaussian Process (right), as functions of α and β; the exact and approximate MLEs both lie close to the true value (α = 1, β = −1).]

  16. Diagnostics
  In practice r̂(ŝ(x; θ0, θ1)) will not be exact. Diagnostic procedures are needed to assess the quality of this approximation.
  1. For inference, the value of the MLE θ̂ should be independent of the value of θ1 used in the denominator of the ratio.
  2. Train a classifier to distinguish between unweighted samples from p(x | θ0) and samples from p(x | θ1) weighted by r̂(ŝ(x; θ0, θ1)); if the ratio is exact, the two samples are indistinguishable and the ROC curve lies on the diagonal. A sketch follows this slide.
  [Figure: left, −2 log Λ(θ) scans for the exact ratio and for approximations with θ1 = (α = 0, β = 1), (α = 1, β = −1), and (α = 0, β = −1), with a ±1σ band; right, ROC curves for classifiers separating p(x | θ0) from p(x | θ1) samples reweighted by the exact ratio, by the approximate ratio, or not at all.]
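  A minimal sketch of diagnostic 2. Gaussian toys stand in for the simulator, and the exact ratio is used as a stand-in for r̂ so the expected outcome (AUC near 0.5) is visible; in practice you would plug in your estimated ratio instead.

```python
# Diagnostic: reweight p(x | theta_1) samples by the ratio and check that a
# fresh classifier can no longer separate them from p(x | theta_0).
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
x0 = rng.normal(0.0, 1.0, size=(5000, 1))   # samples from p(x | theta_0)
x1 = rng.normal(1.0, 1.0, size=(5000, 1))   # samples from p(x | theta_1)

# Stand-in for r_hat: the exact ratio p0/p1 for these Gaussian toys.
w = norm.pdf(x1, 0.0, 1.0).ravel() / norm.pdf(x1, 1.0, 1.0).ravel()

X = np.vstack([x0, x1])
y = np.r_[np.zeros(5000), np.ones(5000)]
sample_w = np.r_[np.ones(5000), w * 5000 / w.sum()]   # normalize the weights

clf = LogisticRegression().fit(X, y, sample_weight=sample_w)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1], sample_weight=sample_w)
print("AUC with reweighting:", auc)   # close to 0.5 if the ratio is accurate
```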

  17. Density ratio estimation
  The density ratio r(x; θ0, θ1) = p(x | θ0) / p(x | θ1) appears in many other fundamental statistical inference problems, including
  • transfer learning,
  • outlier detection,
  • divergence estimation,
  • ...
  For all of them, the proposed approximation can be used as a drop-in replacement!

  18. Transfer learning
  Under the assumption that train and test data are drawn i.i.d. from the same distribution p,
      (1/N) Σ_{x_i} L(φ(x_i)) → ∫ L(φ(x)) p(x) dx
  as the amount of training data increases, i.e. as N → ∞. Minimizing L over the training data is therefore a good strategy.

  19. Transfer learning
  If train and test data are instead drawn from different distributions, then
      (1/N) Σ_{x_i} L(φ(x_i)) → ∫ L(φ(x)) p_train(x) dx
  as the amount of training data increases, i.e. as N → ∞. But we want to be good on the test data, i.e., minimize
      ∫ L(φ(x)) p_test(x) dx.
  Minimizing L over the training data is therefore a bad strategy!

  20. Importance weighting
  Reweight the samples by p_test(x_i) / p_train(x_i), such that
      (1/N) Σ_{x_i} [p_test(x_i) / p_train(x_i)] L(φ(x_i)) → ∫ [p_test(x) / p_train(x)] L(φ(x)) p_train(x) dx = ∫ L(φ(x)) p_test(x) dx
  as the amount of training data increases, i.e. as N → ∞. Again, p_test(x_i) / p_train(x_i) cannot be evaluated directly, but approximated likelihood ratios can be used as a drop-in replacement; see the sketch after this slide.
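  A minimal sketch of classifier-based importance weighting under covariate shift. The shifted Gaussian toys, the sine regression target, and the ridge model are illustrative assumptions; the key step is turning the classifier output s into the ratio via s / (1 − s), valid here because the two samples have equal size.

```python
# Estimate p_test/p_train with a classifier, then reweight the training loss.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.RandomState(0)
x_train = rng.normal(0.0, 1.0, size=(2000, 1))   # draws from p_train
x_test = rng.normal(1.0, 1.0, size=(2000, 1))    # draws from p_test (unlabeled)
f = lambda x: np.sin(2 * x).ravel()              # toy regression target
y_train = f(x_train) + 0.1 * rng.normal(size=2000)

# If s(x) ~ p_test / (p_train + p_test), then p_test/p_train = s / (1 - s)
# (up to class-prior factors, which cancel for equal sample sizes).
d = LogisticRegression().fit(np.vstack([x_train, x_test]),
                             np.r_[np.zeros(2000), np.ones(2000)])
s = d.predict_proba(x_train)[:, 1]
w = s / np.clip(1.0 - s, 1e-12, None)            # estimated importance weights

model = Ridge().fit(x_train, y_train, sample_weight=w)  # weighted training
```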

  21. Example: p_0 (α = −2, β = 2) versus p_1 (α = 0, β = 0). [Figure: samples from p_0 compared against samples from p_1.]

  22. Example: p_0 versus r̂ · p_1. [Figure: after reweighting the p_1 samples by r̂, their distribution closely matches p_0.]

  23. Summary
  • We proposed an approach for approximating likelihood ratios in the likelihood-free setup.
  • Evaluating likelihood ratios reduces to supervised learning; the two problems are deeply connected.
  • The approach is an alternative to Approximate Bayesian Computation, without the need to define a prior over the parameters.

  24. References
  Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., and Whiteson, D. (2016). Parameterized machine learning for high-energy physics. arXiv preprint arXiv:1601.07913.
  Cranmer, K., Pavez, J., and Louppe, G. (2015). Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.
  Louppe, G., Cranmer, K., and Pavez, J. (2016). carl: a likelihood-free inference toolbox. http://dx.doi.org/10.5281/zenodo.47798, https://github.com/diana-hep/carl.
