Approximating likelihood ratios with calibrated classifiers
Gilles Louppe
MLHEP, Lund, June 22, 2016
Joint work with Kyle Cranmer (New York University) and Juan Pavez (Federico Santa María University). See the paper (Cranmer et al., 2015) for full details.
Studying the constituents of the universe
[Cartoon: (c) Jorge Cham]
Collecting data
[Cartoon: (c) Jorge Cham]
Testing for new physics

    p(data | theory + X) / p(data | theory)

[Cartoon: (c) Jorge Cham]
Likelihood-free setup

• Complex simulator p parameterized by θ;
• Samples x ∼ p can be generated on demand;
• ... but the likelihood p(x | θ) cannot be evaluated!
Simple hypothesis testing

• Assume some observed data D = {x1, ..., xn};
• Test a null θ = θ0 against an alternative θ = θ1;
• The Neyman-Pearson lemma states that the most powerful test statistic is

    λ(D; θ0, θ1) = ∏_{x ∈ D} p_X(x | θ0) / p_X(x | θ1);

• ... but neither p_X(x | θ0) nor p_X(x | θ1) can be evaluated!
Straight approximation

1. Approximate p_X(x | θ0) and p_X(x | θ1) individually, using density estimation algorithms;
2. Evaluate their ratio

    r(x; θ0, θ1) = p_X(x | θ0) / p_X(x | θ1).

This works fine for low-dimensional data, but because of the curse of dimensionality it is in general a difficult problem. Moreover, it is not even necessary!

"When solving a problem of interest, do not solve a more general problem as an intermediate step." – Vladimir Vapnik
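For concreteness, here is a minimal sketch of this naive two-step approach on toy 1D data, assuming scikit-learn is available; the Gaussians and the bandwidth are illustrative stand-ins, not part of the original example.

    # Naive "straight approximation": estimate each density, then take the ratio.
    import numpy as np
    from sklearn.neighbors import KernelDensity

    x0 = np.random.normal(0.0, 1.0, size=(5000, 1))   # stand-in for x ~ p(x | θ0)
    x1 = np.random.normal(0.5, 1.0, size=(5000, 1))   # stand-in for x ~ p(x | θ1)

    kde0 = KernelDensity(bandwidth=0.2).fit(x0)        # estimate of p_X(x | θ0)
    kde1 = KernelDensity(bandwidth=0.2).fit(x1)        # estimate of p_X(x | θ1)

    def r_naive(x):
        # Ratio of the two estimated densities; degrades quickly as dimension grows.
        return np.exp(kde0.score_samples(x) - kde1.score_samples(x))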
Likelihood ratio invariance under change of variable

Theorem. The likelihood ratio is invariant under the change of variable U = s(X), provided s(x) is monotonic with r(x):

    r(x) = p_X(x | θ0) / p_X(x | θ1) = p_U(s(x) | θ0) / p_U(s(x) | θ1).
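As a quick sanity check (not the general proof, which is given in Cranmer et al., 2015), consider the one-dimensional case where s is differentiable and strictly monotonic, so the standard change-of-variables formula applies and the Jacobian cancels in the ratio:

    p_U(s(x) | θ) = p_X(x | θ) / |s′(x)|
    ⇒ p_U(s(x) | θ0) / p_U(s(x) | θ1) = p_X(x | θ0) / p_X(x | θ1) = r(x).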
Approximating likelihood ratios with classifiers

• A classifier trained to distinguish x ∼ p0 from x ∼ p1 approximates

    s*(x) = p_X(x | θ1) / (p_X(x | θ0) + p_X(x | θ1)),

which is monotonic with r(x).
• Estimating p(s(x) | θ) is now easy, since the change of variable s(x) projects x into a 1D space where only the informative content of the ratio is preserved. This can be carried out using density estimation or calibration algorithms (histograms, KDE, isotonic regression, etc.).
• This disentangles training from calibration.
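The recipe can be sketched end-to-end in a few lines; the toy Gaussians, network architecture, and histogram binning below are illustrative choices, not those of the paper.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Toy stand-ins for simulator output under θ0 (y = 0) and θ1 (y = 1)
    x0 = np.random.normal(0.0, 1.0, size=(20000, 1))
    x1 = np.random.normal(0.5, 1.0, size=(20000, 1))
    X = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])

    # 1) Training: s(x) ≈ p_X(x | θ1) / (p_X(x | θ0) + p_X(x | θ1))
    clf = MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=500).fit(X, y)

    # 2) Calibration: estimate the 1D densities p(s(x) | θ0) and p(s(x) | θ1)
    bins = np.linspace(0.0, 1.0, 51)
    h0, _ = np.histogram(clf.predict_proba(x0)[:, 1], bins=bins, density=True)
    h1, _ = np.histogram(clf.predict_proba(x1)[:, 1], bins=bins, density=True)

    def r_hat(x):
        # Approximate ratio r(x; θ0, θ1) = p(s(x) | θ0) / p(s(x) | θ1)
        s = clf.predict_proba(x)[:, 1]
        idx = np.clip(np.digitize(s, bins) - 1, 0, len(h0) - 1)
        return h0[idx] / np.maximum(h1[idx], 1e-12)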
Inference and composite hypothesis testing

Approximated likelihood ratios can be used for inference, since

    θ̂ = arg max_θ p(D | θ)
       = arg max_θ ∏_{x ∈ D} p(x | θ) / p(x | θ1)
       = arg max_θ ∏_{x ∈ D} p(s(x; θ, θ1) | θ) / p(s(x; θ, θ1) | θ1),    (1)

where θ1 is fixed and s(x; θ, θ1) is a family of classifiers parameterized by (θ, θ1).

Accordingly, generalized (or profile) likelihood ratio tests can be evaluated in the same way.
Parameterized learning

For inference, we need to build a family s(x; θ, θ1) of classifiers.
• One could build a classifier s independently for all θ, θ1. But this is computationally expensive and would not guarantee a smooth evolution of s(x; θ, θ1) as θ varies.
• Solution: build a single parameterized classifier instead, where the parameters are additional input features (Cranmer et al., 2015; Baldi et al., 2016).

    T := {}
    while size(T) < N do
        Draw θ0 ∼ π_Θ0; draw x ∼ p(x | θ0); T := T ∪ {((x, θ0, θ1), y = 0)}
        Draw θ1 ∼ π_Θ1; draw x ∼ p(x | θ1); T := T ∪ {((x, θ0, θ1), y = 1)}
    end while
    Learn a single classifier s(x; θ0, θ1) from T.
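In Python, the same construction might look as follows; the toy simulator and the uniform proposal distributions are placeholders for the real simulator and for π_Θ0, π_Θ1.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def simulator(theta, n=1):
        # Hypothetical stand-in for the real simulator p(x | θ): 1D Gaussian with mean θ
        return np.random.normal(loc=theta, scale=1.0, size=(n, 1))

    def draw_theta():
        # Illustrative proposal distribution over the parameter of interest
        return np.random.uniform(-1.0, 1.0)

    N = 50000
    rows, labels = [], []
    while len(rows) < N:
        theta0, theta1 = draw_theta(), draw_theta()
        rows.append(np.concatenate([simulator(theta0)[0], [theta0, theta1]])); labels.append(0)
        rows.append(np.concatenate([simulator(theta1)[0], [theta0, theta1]])); labels.append(1)

    # A single classifier s(x; θ0, θ1) taking the parameters as extra input features
    s = MLPClassifier(hidden_layer_sizes=(40, 40), max_iter=300).fit(np.asarray(rows), np.asarray(labels))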
Example: Inference from multidimensional data

Let us assume 5D data x generated from the following process p0:
1. z := (z0, z1, z2, z3, z4), such that
   z0 ∼ N(μ = α, σ = 1),
   z1 ∼ N(μ = β, σ = 3),
   z2 ∼ Mixture(1/2 N(μ = −2, σ = 1), 1/2 N(μ = 2, σ = 0.5)),
   z3 ∼ Exponential(λ = 3), and
   z4 ∼ Exponential(λ = 0.5);
2. x := Rz, where R is a fixed semi-positive definite 5×5 matrix defining a fixed projection of z into the observed space.

[Figure: pairwise scatter plots of the observed data D over the components X0–X4.]

Our goal is to infer the values of α and β based on D. Check out (Louppe et al., 2016) to reproduce this example.
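A toy sketch of this generative process is given below; the projection matrix R used in the actual example is defined in (Louppe et al., 2016), so an arbitrary fixed matrix stands in for it here.

    import numpy as np

    def generate(alpha, beta, n, R=None, rng=np.random):
        # Latent variables z0..z4 as described above
        z0 = rng.normal(alpha, 1.0, n)
        z1 = rng.normal(beta, 3.0, n)
        comp = rng.uniform(size=n) < 0.5
        z2 = np.where(comp, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 0.5, n))
        z3 = rng.exponential(scale=1.0 / 3.0, size=n)   # Exponential(λ = 3)
        z4 = rng.exponential(scale=2.0, size=n)          # Exponential(λ = 0.5)
        z = np.column_stack([z0, z1, z2, z3, z4])
        R = np.eye(5) if R is None else R                # placeholder projection
        return z @ R.T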
Example: Inference from multidimensional data

Recipe:
1. Build a single parameterized classifier s(x; θ0, θ1), in this case a 2-layer NN trained on 5+2 features, with the alternative fixed to θ1 = (α = 0, β = 0).
2. Find the approximate MLE (α̂, β̂) by solving Eqn. 1, using likelihood scans or through optimization. Since the generator is inexpensive, p(s(x; θ0, θ1) | θ) can be calibrated on-the-fly, for every candidate (α, β), e.g. using histograms.
3. Construct the log-likelihood ratio (LLR) statistic

    −2 log Λ(α, β) = −2 log [ p(D | α, β) / p(D | α̂, β̂) ].
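A rough sketch of such a likelihood scan is shown below, assuming a parameterized classifier clf over the 5+2 features (x, α, β) and a simulator(theta, n) wrapper around the generative process; the sample sizes, binning, and grid ranges are illustrative.

    import numpy as np

    def approx_nllr(D, theta, theta1, clf, simulator, bins=np.linspace(0.0, 1.0, 51)):
        # −2 Σ_x log [ p(s(x; θ, θ1) | θ) / p(s(x; θ, θ1) | θ1) ], with both 1D
        # score densities calibrated on-the-fly using histograms.
        def s(x):
            return clf.predict_proba(np.hstack([x, np.tile(theta, (len(x), 1))]))[:, 1]
        h_num, _ = np.histogram(s(simulator(theta, 20000)), bins, density=True)
        h_den, _ = np.histogram(s(simulator(theta1, 20000)), bins, density=True)
        idx = np.clip(np.digitize(s(D), bins) - 1, 0, len(h_num) - 1)
        return -2.0 * np.sum(np.log(np.maximum(h_num[idx], 1e-12)) -
                             np.log(np.maximum(h_den[idx], 1e-12)))

    # Grid scan: the approximate MLE (α̂, β̂) minimizes approx_nllr over the grid,
    # and the LLR of step 3 is obtained by subtracting that minimum.
    # alphas = np.linspace(0.9, 1.15, 26); betas = np.linspace(-1.4, -0.6, 33)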
[Figure: exact −2 log Λ(α, β) (left) compared with the approximate LLR smoothed by a Gaussian Process (right), scanned over α and β; the true point is α = 1, β = −1, and both the exact and the approximate MLE are marked.]
Diagnostics

In practice r̂(ŝ(x; θ0, θ1)) will not be exact. Diagnostic procedures are needed to assess the quality of this approximation.
1. For inference, the value of the MLE θ̂ should be independent of the value of θ1 used in the denominator of the ratio.
2. Train a classifier to distinguish between unweighted samples from p(x | θ0) and samples from p(x | θ1) weighted by r̂(ŝ(x; θ0, θ1)); if the approximation is good, the two samples are indistinguishable and the ROC curve stays close to the diagonal. A minimal sketch of this check follows.

[Figure, left: −2 log Λ(θ) as a function of α for the exact ratio and for approximations built with several choices of θ1, with a ±1σ band. Figure, right: ROC curves of a classifier separating p(x | θ0) samples from p(x | θ1) samples reweighted by the exact ratio, by the approximate ratio, or not reweighted at all.]
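Below is a minimal sketch of diagnostic 2, assuming x0 and x1 are samples generated under θ0 and θ1 and r_hat is the approximate ratio built earlier; an AUC close to 0.5 indicates that the reweighted θ1 sample mimics p(x | θ0) well. Evaluating on a held-out split would be a cleaner version of the same check.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def diagnostic_reweighting(x0, x1, r_hat):
        # Discriminate unweighted x0 ~ p(x | θ0) from x1 ~ p(x | θ1) weighted by r̂(x)
        X = np.vstack([x0, x1])
        y = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])
        w = np.concatenate([np.ones(len(x0)), r_hat(x1)])
        clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
        return roc_auc_score(y, clf.predict_proba(X)[:, 1], sample_weight=w)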
Density ratio estimation

The density ratio

    r(x; θ0, θ1) = p(x | θ0) / p(x | θ1)

appears in many other fundamental statistical inference problems, including
• transfer learning,
• outlier detection,
• divergence estimation,
• ...
For all of them, the proposed approximation can be used as a drop-in replacement!
Transfer learning

Under the assumption that train and test data are drawn iid from the same distribution p,

    (1/N) Σ_{x_i} L(ϕ(x_i)) → ∫ L(ϕ(x)) p(x) dx

as training data increases, i.e. as N → ∞. Minimizing L over training data is therefore a good strategy.
Transfer learning

If instead the training data are drawn iid from p_train while the test data follow a different distribution p_test,

    (1/N) Σ_{x_i} L(ϕ(x_i)) → ∫ L(ϕ(x)) p_train(x) dx

as training data increases, i.e. as N → ∞. But we want to be good on test data, i.e., minimize

    ∫ L(ϕ(x)) p_test(x) dx.

Minimizing L over training data is therefore a bad strategy!
Importance weighting

Reweight the training samples by p_test(x_i) / p_train(x_i), such that

    (1/N) Σ_{x_i} [p_test(x_i) / p_train(x_i)] L(ϕ(x_i)) → ∫ [p_test(x) / p_train(x)] L(ϕ(x)) p_train(x) dx = ∫ L(ϕ(x)) p_test(x) dx

as training data increases, i.e. as N → ∞. Again, p_test(x_i) / p_train(x_i) cannot be evaluated directly, but approximated likelihood ratios can be used as a drop-in replacement, as sketched below.
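For instance, here is a sketch of importance-weighted training under covariate shift, where w_hat(x) approximates p_test(x) / p_train(x), e.g. via the calibrated-classifier construction above trained to separate test-like from train-like samples; the estimator and names are illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_reweighted(x_train, y_train, w_hat):
        # Train on p_train but weight each example by the approximate ratio
        # w_hat(x) ≈ p_test(x) / p_train(x), so the objective targets p_test.
        weights = w_hat(x_train)
        model = LogisticRegression(max_iter=1000)
        model.fit(x_train, y_train, sample_weight=weights)
        return model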
Example

p0: α = −2, β = 2 versus p1: α = 0, β = 0.

[Figure: pairwise scatter plots comparing samples from p0 with samples from p1.]
Example

[Figure: pairwise scatter plots comparing samples from p0 with samples from p1 reweighted by r̂.]
Summary

• We proposed an approach for approximating likelihood ratios in the likelihood-free setup.
• Evaluating likelihood ratios reduces to supervised learning; the two problems are deeply connected.
• The approach is an alternative to Approximate Bayesian Computation, without the need to define a prior over the parameters.
References

Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., and Whiteson, D. (2016). Parameterized Machine Learning for High-Energy Physics. arXiv preprint arXiv:1601.07913.

Cranmer, K., Pavez, J., and Louppe, G. (2015). Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.

Louppe, G., Cranmer, K., and Pavez, J. (2016). carl: a likelihood-free inference toolbox. http://dx.doi.org/10.5281/zenodo.47798, https://github.com/diana-hep/carl.