Guiding New Physics Searches with Unsupervised Learning
[DS, Jacques - 1807.06038]
Andrea De Simone (andrea.desimone@sissa.it)
IML Working Group, CERN, 2018-10-12
> New Physics?
Searches for New Physics Beyond the Standard Model have been negative so far… MAYBE:
1. New Physics (NP) is not accessible at the LHC: new particles are too light/heavy or interact too weakly.
2. We have not explored all the possibilities: new physics may be buried under large backgrounds (bkg) or hiding behind unusual signatures.
> New Physics?
“Don’t want to miss a thing” (in data):
• take a closer look at current data
• get ready for upcoming data from the next run
Model-independent search: searches for specific models may be
- time-consuming
- insensitive to unexpected/unknown processes
> New Statistical Test
Want a statistical test for NP which is:
1. model-independent: no assumption about the underlying physical model used to interpret the data (more general)
2. non-parametric: compares two samples as a whole, not just their means etc. (fewer assumptions, no maximum-likelihood estimation)
3. un-binned: the high-dimensional feature space is partitioned without rectangular bins (retains the full multi-dimensional information of the data)
> Outline
1. Statistical test of dataset compatibility
   • Nearest-Neighbors Two-Sample Test
   • Identify Discrepancies
   • Include Uncertainties
2. Applications to High-Energy Physics
> Two-sample Test [a.k.a. “homogeneity test”]
Two sets of points $x_i, x'_i \in \mathbb{R}^D$:
Trial: $T = \{x_1, \ldots, x_{N_T}\} \overset{\text{iid}}{\sim} p_T$ (e.g. real measured data)
Benchmark: $B = \{x'_1, \ldots, x'_{N_B}\} \overset{\text{iid}}{\sim} p_B$ (e.g. simulated SM bkg)
The probability distributions $p_B$, $p_T$ are unknown.
> Two-sample Test
Are B and T drawn from the same probability distribution?
In some cases this is easy to tell… in others it is hard!
> Two-sample Test
RECIPE:
1. Density Estimator: reconstruct the PDFs from the samples
2. Test Statistic (TS): measure the “distance” between the PDFs
3. TS distribution: associate probabilities to the TS under the null hypothesis $H_0: p_B = p_T$
4. p-value: accept/reject $H_0$
> 1. Density Estimator
Divide the space into square bins?
✓ easy
✓ can use simple statistics (e.g. $\chi^2$)
✘ hard/slow/impossible in high D
Need an un-binned multivariate approach.
Find PDF estimators $\hat{p}_B(x)$, $\hat{p}_T(x)$, e.g. based on the densities of points: $\hat{p}_{B,T}(x) = \rho_{B,T}(x) / N_{B,T}$
Nearest Neighbors!
[Schilling - 1986][Henze - 1988][Wang et al. - 2005,2006][Dasu et al. - 2006][Perez-Cruz - 2008][Sugiyama et al. - 2011][Kremer et al. - 2015]
> 1. Density Estimator
• Fix an integer K.
• Choose a query point $x_j$ in T and draw it in B.
• Find the distance $r_{j,B}$ of the K-th nearest neighbor (NN) of $x_j$ in B.
• Find the distance $r_{j,T}$ of the K-th NN of $x_j$ in T.
• Estimate the PDFs:
  $\hat{p}_B(x_j) = \frac{1}{N_B} \frac{K}{\omega_D \, r_{j,B}^D}$
  $\hat{p}_T(x_j) = \frac{1}{N_T - 1} \frac{K}{\omega_D \, r_{j,T}^D}$
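The estimator above can be sketched in a few lines of standard-library Python. This is only an illustration, not the NN2ST package code: the function names are hypothetical, $\omega_D$ is taken to be the volume of the unit ball in D dimensions, the distance is assumed Euclidean, and the brute-force neighbor search would normally be replaced by a KD-tree for realistic sample sizes.

```python
import math

def unit_ball_volume(dim):
    """Volume omega_D of the unit ball in `dim` dimensions."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)

def kth_nn_distance(x, sample, k, skip_self=False):
    """Euclidean distance from x to its k-th nearest neighbor in `sample`.

    With skip_self=True the smallest distance is dropped, which assumes
    x itself is a member of `sample` (distance zero).
    """
    dists = sorted(math.dist(x, y) for y in sample)
    if skip_self:
        dists = dists[1:]
    return dists[k - 1]

def knn_densities(x, bench, trial, k):
    """Return (p_B(x), p_T(x)) for a query point x taken from `trial`."""
    dim = len(x)
    omega = unit_ball_volume(dim)
    r_b = kth_nn_distance(x, bench, k)
    r_t = kth_nn_distance(x, trial, k, skip_self=True)
    p_b = k / (len(bench) * omega * r_b ** dim)
    p_t = k / ((len(trial) - 1) * omega * r_t ** dim)
    return p_b, p_t

# Tiny 1D toy samples (illustrative values only).
trial = [(0.0,), (1.0,), (2.0,)]
bench = [(0.5,), (1.5,), (2.5,), (3.5,)]
p_b, p_t = knn_densities(trial[1], bench, trial, k=1)
print(p_b, p_t)
```

The factor $N_T - 1$ in the trial density reflects that the query point is excluded from its own neighbor search.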
> 2. Test Statistic
• Measure of the “distance” between 2 PDFs
• Define the Test Statistic (detects under-/over-densities):
  $\mathrm{TS}(B, T) = \frac{1}{N_T} \sum_{j=1}^{N_T} \log \frac{\hat{p}_T(x_j)}{\hat{p}_B(x_j)}$
• Related to the Kullback-Leibler divergence as $\mathrm{TS}(B, T) = \hat{D}_\mathrm{KL}(\hat{p}_T \,\|\, \hat{p}_B)$, where
  $D_\mathrm{KL}(p \,\|\, q) \equiv \int_{\mathbb{R}^D} p(x) \log \frac{p(x)}{q(x)} \, dx$
• From the NN-estimated PDFs:
  $\mathrm{TS}_\text{obs} = \frac{D}{N_T} \sum_{j=1}^{N_T} \log \frac{r_{j,B}}{r_{j,T}} + \log \frac{N_B}{N_T - 1}$
• Theorem: this estimator converges to $D_\mathrm{KL}(p_T \,\|\, p_B)$ in the large-sample limit [Wang et al. - 2005,2006]
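A minimal sketch of the NN form of the test statistic, written with the standard library only. The function names are hypothetical and not taken from the NN2ST package; Euclidean distances are assumed, and the code assumes no exact duplicate points (a zero distance would make the logarithm diverge).

```python
import math

def kth_nn_distance(x, sample, k, skip_self=False):
    """Euclidean distance from x to its k-th nearest neighbor in `sample`."""
    dists = sorted(math.dist(x, y) for y in sample)
    if skip_self:  # drop the zero distance of x to itself
        dists = dists[1:]
    return dists[k - 1]

def nn_test_statistic(bench, trial, k):
    """TS_obs = (D / N_T) * sum_j log(r_jB / r_jT) + log(N_B / (N_T - 1))."""
    dim = len(trial[0])
    n_b, n_t = len(bench), len(trial)
    log_ratio_sum = 0.0
    for x in trial:
        r_b = kth_nn_distance(x, bench, k)
        r_t = kth_nn_distance(x, trial, k, skip_self=True)
        log_ratio_sum += math.log(r_b / r_t)
    return dim * log_ratio_sum / n_t + math.log(n_b / (n_t - 1))

# Toy 1D samples chosen so that every ratio r_B / r_T equals 1/2.
trial = [(0.0,), (1.0,), (2.0,)]
bench = [(0.5,), (1.5,), (2.5,), (3.5,)]
ts_obs = nn_test_statistic(bench, trial, k=1)
print(ts_obs)
```

In this toy case the sum contributes $\log(1/2)$ and the constant contributes $\log(4/2)$, so the two terms cancel up to rounding.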
> 3. Test Statistic Distribution
How is TS distributed? Permutation test!
• Assume $p_B = p_T$.
• Union set: $U = T \cup B$.
• Randomly reshuffle $U$ and split it into $(\tilde{B}, \tilde{T})$.
• Compute the test statistic $\mathrm{TS}_n$ on $(\tilde{B}, \tilde{T})$.
• Repeat many times.
Distribution of TS under $H_0$: $f(\mathrm{TS} \mid H_0) \leftarrow \{\mathrm{TS}_n\}$ [asymptotically normal with zero mean]
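The permutation procedure can be sketched as follows. This is an illustration under assumptions, not the package implementation: the helper names, the seed handling, and the choice to keep the reshuffled samples at the original sizes $N_B$ and $N_T$ are all assumptions made here.

```python
import math
import random

def kth_nn_distance(x, sample, k, skip_self=False):
    dists = sorted(math.dist(x, y) for y in sample)
    if skip_self:  # drop the zero distance of x to itself
        dists = dists[1:]
    return dists[k - 1]

def nn_test_statistic(bench, trial, k):
    dim, n_b, n_t = len(trial[0]), len(bench), len(trial)
    total = sum(
        math.log(kth_nn_distance(x, bench, k)
                 / kth_nn_distance(x, trial, k, skip_self=True))
        for x in trial
    )
    return dim * total / n_t + math.log(n_b / (n_t - 1))

def permutation_null(bench, trial, k, n_perm, seed=0):
    """Sample the TS distribution under H0 by reshuffling the union set."""
    rng = random.Random(seed)
    pooled = list(bench) + list(trial)
    n_b = len(bench)
    null_ts = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # First N_B points play the benchmark, the rest play the trial.
        null_ts.append(nn_test_statistic(pooled[:n_b], pooled[n_b:], k))
    return null_ts

# Toy data: both samples drawn from the same 2D standard normal.
rng = random.Random(1)
bench = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
trial = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
null_ts = permutation_null(bench, trial, k=3, n_perm=50)
print(len(null_ts), sum(null_ts) / len(null_ts))
```

Each permutation repeats the full neighbor search, which is why this step dominates the running time.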
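> 4. p-value
• Mean and variance of the TS distribution $f(\mathrm{TS} \mid H_0)$: $\hat{\mu}$, $\hat{\sigma}^2$
• Standardize the TS: $\mathrm{TS} \to \mathrm{TS}' \equiv \frac{\mathrm{TS} - \hat{\mu}}{\hat{\sigma}}$
• TS' is distributed according to $f'(\mathrm{TS}' \mid H_0) = \hat{\sigma} \, f(\hat{\mu} + \hat{\sigma} \, \mathrm{TS}' \mid H_0)$
• Two-sided p-value: $p = 2 \int_{|\mathrm{TS}'_\text{obs}|}^{+\infty} f'(\mathrm{TS}' \mid H_0) \, d\mathrm{TS}'$
• Equivalent significance: $Z \equiv \Phi^{-1}(1 - p/2)$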
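One way to turn a list of permutation values into a p-value and a significance is sketched below. Two choices here are assumptions rather than something stated on the slide: the integral is replaced by the empirical upper-tail fraction of the standardized permutation values, and the p-value is floored at one over the number of permutations so that $\Phi^{-1}$ stays finite. The function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def p_value_and_z(ts_obs, null_ts):
    """Two-sided empirical p-value and equivalent Gaussian significance Z."""
    mu = mean(null_ts)
    sigma = stdev(null_ts)
    ts_std = (ts_obs - mu) / sigma                   # TS'_obs
    null_std = [(t - mu) / sigma for t in null_ts]   # samples of TS' under H0
    # Empirical estimate of the upper-tail integral beyond |TS'_obs|.
    tail = sum(1 for t in null_std if t >= abs(ts_std)) / len(null_std)
    p = min(1.0, 2.0 * tail)
    p = max(p, 1.0 / len(null_std))  # assumed floor: avoids p = 0
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return p, z

# Hand-made null values, for illustration only.
null_ts = [-2.0, -1.0, 0.0, 1.0, 2.0]
p, z = p_value_and_z(1.5, null_ts)
print(p, z)
```

With only five null values the resolution of the p-value is coarse; in practice the number of permutations sets the smallest p-value that can be resolved this way.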
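> 2D Gaussian Example
$p_B = \mathcal{N}(\mu_B, \Sigma_B)$, $p_T = \mathcal{N}(\mu_T, \Sigma_T)$, with $\Sigma_B = \Sigma_T = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
Case 1: $\mu_B = (1.0, 1.0)^\top$, $\mu_T = (1.2, 1.2)^\top$
Case 2: $\mu_B = (1.0, 1.0)^\top$, $\mu_T = (1.15, 1.15)^\top$
Settings: $K = 5$, $N_\text{perm} = 1000$
[Figures omitted; plot label: exact KL divergence]
Takeaway: more data, more power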
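For reference, the exact KL divergence in this example can be worked out by hand from the standard closed form for two Gaussians with a common covariance matrix. The numbers below follow from the means and covariance listed above; they are computed here and are not read off the omitted figures.

```latex
D_\mathrm{KL}\big(\mathcal{N}(\mu_T, \Sigma) \,\|\, \mathcal{N}(\mu_B, \Sigma)\big)
  = \tfrac{1}{2} (\mu_T - \mu_B)^\top \Sigma^{-1} (\mu_T - \mu_B)

% Sigma = identity, so the divergence is half the squared distance between the means:
\mu_T = (1.2, 1.2)^\top:   \quad \tfrac{1}{2}\,(0.2^2 + 0.2^2)   = 0.04
\mu_T = (1.15, 1.15)^\top: \quad \tfrac{1}{2}\,(0.15^2 + 0.15^2) = 0.0225
```

Because the covariances are equal, the divergence is symmetric in this example, so the order of the two distributions does not change these values.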
> NN2ST: Summary
INPUT:
• Trial sample: $T = \{x_1, \ldots, x_{N_T}\} \overset{\text{iid}}{\sim} p_T$
• Benchmark sample: $B = \{x'_1, \ldots, x'_{N_B}\} \overset{\text{iid}}{\sim} p_B$
  (with $x_i, x'_i \in \mathbb{R}^D$; $p_B$, $p_T$ unknown)
• $K$: number of nearest neighbors
• $N_\text{perm}$: number of permutations
OUTPUT:
• p-value of the null hypothesis $H_0: p_B = p_T$ [check compatibility between 2 samples]
> NN2ST: Summary
[Flow diagram]
Benchmark sample + Trial sample → K-NN density ratio estimation → Test Statistic $\mathrm{TS}_\text{obs}$
Permutation test → TS distribution → p-value (tails beyond $\pm|\mathrm{TS}_\text{obs}|$)
Python code: github.com/de-simone/NN2ST
> Where are the discrepancies?
Bonus: characterize regions with significant discrepancies
1. “Score” field over T:
   $Z(x_j) \equiv \frac{u(x_j) - \bar{u}}{\sigma_u}$, with $u(x_j) \equiv \log \frac{r_{j,B}}{r_{j,T}}$, so that $\mathrm{TS}_\text{obs} = D \, \bar{u} + \text{const}$
2. Identify points where $Z(x) > c$. They contribute the most to a large $\mathrm{TS}_\text{obs}$: high-discrepancy (anomalous) regions.
3. Apply a clustering algorithm to group them.
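Steps 1 and 2 can be sketched as below, given the K-th NN distances already computed for each trial point. The function names are hypothetical, and the use of the population standard deviation for $\sigma_u$ is an assumption, since the slide does not say which convention is used. The clustering of step 3 is left out.

```python
import math
from statistics import mean, pstdev

def discrepancy_scores(r_b, r_t):
    """Z(x_j) = (u_j - mean(u)) / sigma_u, with u_j = log(r_jB / r_jT)."""
    u = [math.log(b / t) for b, t in zip(r_b, r_t)]
    u_bar = mean(u)
    sigma_u = pstdev(u)  # assumed convention: population standard deviation
    return [(v - u_bar) / sigma_u for v in u]

def anomalous_indices(z, c):
    """Indices of trial points whose score exceeds the threshold c."""
    return [j for j, z_j in enumerate(z) if z_j > c]

# Illustrative distances: the last trial point sits far from any benchmark
# point compared with its distance to other trial points.
r_b = [1.0, 1.0, 1.0, 1.0, 8.0]
r_t = [1.0, 1.0, 1.0, 1.0, 1.0]
z = discrepancy_scores(r_b, r_t)
print(z)
print(anomalous_indices(z, c=1.5))
```

A large positive score marks a point where the trial sample is locally denser than the benchmark, i.e. a candidate excess.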
> Sample Uncertainties
How to include sample uncertainties?
1. Model the feature uncertainties $F_B(x)$, $F_T(x)$ [e.g. zero-mean gaussians]
2. Build new samples by adding random noise sampled from $F_{B,T}$:
   $T_u = \{x_i + \Delta x_i\}_{i=1}^{N_T}$
   $B_u = \{x'_i + \Delta x'_i\}_{i=1}^{N_B}$
3. Compute the TS on the new samples: $\mathrm{TS}_u \equiv \mathrm{TS}(B_u, T_u) = \mathrm{TS}_\text{obs} + U$
4. Repeat many times to reconstruct $f(U)$
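The four steps can be sketched as follows. The noise model used here (independent zero-mean Gaussians with standard deviation proportional to each feature value) is one possible choice of $F_{B,T}$ and is an assumption of this sketch, as are all function names; it is not the NN2ST package code.

```python
import math
import random

def kth_nn_distance(x, sample, k, skip_self=False):
    dists = sorted(math.dist(x, y) for y in sample)
    if skip_self:  # drop the zero distance of x to itself
        dists = dists[1:]
    return dists[k - 1]

def nn_test_statistic(bench, trial, k):
    dim, n_b, n_t = len(trial[0]), len(bench), len(trial)
    total = sum(
        math.log(kth_nn_distance(x, bench, k)
                 / kth_nn_distance(x, trial, k, skip_self=True))
        for x in trial
    )
    return dim * total / n_t + math.log(n_b / (n_t - 1))

def smear(sample, rel_err, rng):
    """Add zero-mean Gaussian noise with sigma = rel_err * |x_i| per feature."""
    return [tuple(x_i + rng.gauss(0.0, rel_err * abs(x_i)) for x_i in x)
            for x in sample]

def uncertainty_shifts(bench, trial, k, rel_err, n_rep, seed=0):
    """Samples of U = TS(B_u, T_u) - TS_obs over repeated smearings."""
    rng = random.Random(seed)
    ts_obs = nn_test_statistic(bench, trial, k)
    return [
        nn_test_statistic(smear(bench, rel_err, rng),
                          smear(trial, rel_err, rng), k) - ts_obs
        for _ in range(n_rep)
    ]

# Toy 2D Gaussian samples centered at (1, 1); values are illustrative.
rng = random.Random(2)
bench = [(rng.gauss(1, 1), rng.gauss(1, 1)) for _ in range(50)]
trial = [(rng.gauss(1, 1), rng.gauss(1, 1)) for _ in range(40)]
shifts = uncertainty_shifts(bench, trial, k=3, rel_err=0.1, n_rep=20)
print(min(shifts), max(shifts))
```

The collected values of U approximate $f(U)$; with zero relative error every shift is exactly zero, which is a quick sanity check of the procedure.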
> Sample Uncertainties
How to include sample uncertainties?
• $f(\mathrm{TS}_u)$ is a convolution: $f(\mathrm{TS}_u \mid H_0) = f(\mathrm{TS} \mid H_0) * f(U)$
• $f(\mathrm{TS}_u)$ is more spread out than $f(\mathrm{TS})$
• the p-value is computed from $f(\mathrm{TS}_u)$
• weaker significance, power degradation
> 2D Gaussian with Uncertainties
B, T gaussian samples: $p_B = \mathcal{N}(\mu_B, \Sigma_B)$, $p_T = \mathcal{N}(\mu_T, \Sigma_T)$
$\mu_B = (1.0, 1.0)^\top$, $\mu_T = (1.15, 1.15)^\top$, $\Sigma_B = \Sigma_T = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
Gaussian uncorrelated errors (diagonal covariance) with fixed relative uncertainty: $\sigma_i = \epsilon \, x_i$ for each feature component $i$
> NN2ST: Summary
✓ general, model-independent
✓ fast, no optimization
  [ $N_{B,T}$ = 20k, $K$ = 5, $N_\text{perm}$ = 1k, $D$ = 2: t ~ 2 mins
    $N_{B,T}$ = 20k, $K$ = 5, $N_\text{perm}$ = 1k, $D$ = 8: t ~ 50 mins ]
✓ sensitive to unspecified signals
✓ useful when no variable can separate sig/bkg
✓ helps finding signal regions, optimal cuts, …
✓ flexible to incorporate uncertainties
✘ need to run for each sample pair
✘ permutation test is the bottleneck