Introduction to TMVA and Primary Electron Track Determination Erin Conley SNB/LE Working Group Meeting June 20, 2018 6/20/2018 1
Introduction
• Goal: determine the primary electron (reconstructed) track in ν_e CC interactions
– Not always obvious; having a concrete, general method would be useful!
– Using MARLEY events made by J. Stock in May 2017
• TMVA provides methods based on machine learning to help reach this goal.
[Figure: 30.25 MeV event display, time (ticks) vs. wire]
TMVA: Introduction
• TMVA: Toolkit for Multivariate Data Analysis
– Framework in ROOT to be used for classification and regression problems
– Various multivariate analysis (MVA) methods available
• Two independent phases:
– Training phase: MVA methods trained, tested, evaluated
– Application phase: chosen MVA methods applied to classification problem
• Need to worry about overtraining: too many model parameters tuned on too few data points (too few degrees of freedom) leads to an unrealistic increase in classification performance on the training sample
• Data pre-processing available to, e.g., decorrelate or “Gaussianize” variables
TMVA Output
Characteristics about input variables:
• Distributions for signal, background input variables
• Distributions for transformed variables (e.g., decorrelated variables)
• Correlation plots + matrix to understand linear correlations between variables
Use these plots to choose optimal combinations of variables, data pre-processing strategy, etc.
MVA method performance plots:
• MVA method classifier outputs
– Kolmogorov-Smirnov test statistic to determine whether overtraining occurred (rule of thumb: want ρ_KS ≳ 0.01)
• Optimal cut for MVA method classifiers
• Classification probabilities + PDFs
• Probability integral transformation (rarity)
• Receiver operating characteristic (ROC) curves
Use these plots to compare MVA method performances, choose optimal cuts on data, etc.
MARLEY Simulations: Preparing for TMVA
• Used BackTracker to determine which tracks were made by primary electron
– Used 2D hits associated with tracks
– Multiple tracks in the event can be made by the primary electron
– Some tracks are partially made by primary electrons (e.g., track with 10 hits → primary electron produced 6 hits)
• For the purposes of preliminary TMVA tests:
– Signal: tracks that had 75% or more of their hits produced by the primary electron
– Background: all other tracks
• Used full 30.25 MeV MARLEY simulation (10,000 events)
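The signal/background definition above can be sketched in a few lines of Python. This is only an illustration of the 75% rule: `TrackHits` and its fields are hypothetical names, not LArSoft types, and the hit counts are assumed to have already been obtained from BackTracker.

```python
from dataclasses import dataclass

SIGNAL_FRACTION = 0.75  # threshold quoted on the slide


@dataclass
class TrackHits:
    """Hypothetical per-track summary; not a LArSoft type."""
    n_hits: int          # 2D hits associated with the track
    n_primary_hits: int  # of those, hits attributed to the primary electron


def primary_fraction(track: TrackHits) -> float:
    """Fraction of the track's hits produced by the primary electron."""
    if track.n_hits == 0:
        return 0.0
    return track.n_primary_hits / track.n_hits


def is_signal(track: TrackHits) -> bool:
    """Signal if at least 75% of the hits come from the primary electron."""
    return primary_fraction(track) >= SIGNAL_FRACTION


# Made-up tracks; the first is the 10-hit / 6-hit example from the slide.
tracks = [TrackHits(10, 6), TrackHits(8, 8), TrackHits(4, 3)]
labels = ["signal" if is_signal(t) else "background" for t in tracks]
print(labels)  # ['background', 'signal', 'signal']
```

With this definition the slide's example track (6 of 10 hits from the primary electron) lands in the background sample.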
Determining Primary Track “By Eye”
• Scanned event displays of 100 events in 30.25 MeV MARLEY data
– These events had 2.61 reconstructed tracks on average (pmtracktc)
• Out of the 100 events…
– In 2 events, there were no reconstructed tracks
– In 2 events, I failed to identify the primary track
– In 10 events, I correctly identified at least one primary track but…
• Misidentified another track
• Failed to identify all primary tracks
– In 86 events, I correctly identified all primary tracks
• Out of the 86 events where I was 100% correct…
– 14 events contained one track
Variables Used from MARLEY Simulations
1. Track length: as given in the recob::Track object
2. “Charge deposition”: sum of integral values of all hits in a track
– recob::Hit::Integral(): integral under calibrated signal waveform
3. “Path time”: difference between max/min peak times in the track
– recob::Hit::PeakTime(): time of signal peak (ticks)
4. “Summed RMS”: sum of RMS of all hits in a track
– recob::Hit::RMS(): RMS of hit shape (ticks)
• Also used calorimetry information from tracks:
5. “Summed dQdx”: sum of dQdx values on collection plane
6. “Calo KE”: kinetic energy of track on collection plane
– Potential issue: not all tracks have calorimetry information (bug?)
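The three hit-based variables (2 to 4) are simple reductions over a track's hits. A minimal Python sketch, assuming the hit quantities have already been read out of the recob::Hit objects; the `Hit` dataclass and the function names are hypothetical, and the numbers are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """Hypothetical stand-in for the recob::Hit quantities named on the slide."""
    integral: float   # value of recob::Hit::Integral()
    peak_time: float  # value of recob::Hit::PeakTime(), in ticks
    rms: float        # value of recob::Hit::RMS(), in ticks


def charge_deposition(hits: List[Hit]) -> float:
    """Sum of the hit integrals over the track."""
    return sum(h.integral for h in hits)


def path_time(hits: List[Hit]) -> float:
    """Difference between the latest and earliest hit peak times (ticks)."""
    if not hits:
        return 0.0
    times = [h.peak_time for h in hits]
    return max(times) - min(times)


def summed_rms(hits: List[Hit]) -> float:
    """Sum of the hit RMS values over the track (ticks)."""
    return sum(h.rms for h in hits)


# Made-up hits for one track.
hits = [Hit(120.0, 1500.0, 3.0), Hit(95.5, 1512.0, 2.5), Hit(80.0, 1507.0, 4.0)]
print(charge_deposition(hits), path_time(hits), summed_rms(hits))  # 295.5 12.0 9.5
```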
Input Variable Distributions
Signal: tracks with 75-100% of their hits made by primary electron
Background: all other tracks
[Figure: normalized distributions, (1/N) dN per bin, of the six input variables for signal and background. Panels: length, timeofint, chargedepo, summedrms, summeddqdx, caloke.]
Decorrelated Variable Distributions
[Figure: normalized distributions, (1/N) dN per bin, of the six 'Deco'-transformed input variables for signal and background. Panels: length (Deco), timeofint (Deco), chargedepo (Deco), summedrms (Deco), summeddqdx (Deco), caloke (Deco).]
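To illustrate what a linear decorrelation does, here is a toy two-variable sketch in plain Python. It multiplies the data by the inverse of the symmetric square root of the sample covariance matrix, which is the kind of transform the TMVA Users Guide describes for its decorrelation option; this is not TMVA's code, the closed-form square root only works for 2x2 matrices, and the input numbers are made up.

```python
import math


def covariance_2d(xs, ys):
    """Sample covariance entries (a, b, d) of the matrix [[a, b], [b, d]]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    d = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return a, b, d


def inverse_sqrt_2x2(a, b, d):
    """Entries of the inverse symmetric square root of [[a, b], [b, d]]."""
    s = math.sqrt(a * d - b * b)    # sqrt(det C); requires det C > 0
    t = math.sqrt(a + d + 2.0 * s)  # normalisation so that R * R == C
    p, q, r = (a + s) / t, b / t, (d + s) / t  # R = sqrt(C) = [[p, q], [q, r]]
    det = p * r - q * q
    return r / det, -q / det, p / det  # R^-1 = [[w11, w12], [w12, w22]]


def decorrelate(xs, ys):
    """Apply x' = C^(-1/2) x to every (x, y) pair."""
    w11, w12, w22 = inverse_sqrt_2x2(*covariance_2d(xs, ys))
    new_x = [w11 * x + w12 * y for x, y in zip(xs, ys)]
    new_y = [w12 * x + w22 * y for x, y in zip(xs, ys)]
    return new_x, new_y


# Toy, strongly correlated pair of variables (made-up numbers).
length = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
charge = [1.2, 1.9, 3.4, 3.9, 5.3, 5.8]
deco_length, deco_charge = decorrelate(length, charge)
print(covariance_2d(deco_length, deco_charge))  # approximately (1, 0, 1)
```

After the transform the two variables have unit variance and zero linear correlation, which is why the decorrelated panels all sit on similar, order-one axis ranges.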
Track Determination + TMVA
• Number of “signal” events: 10942
• Number of “background” events: 8736
• Trained on 8442 signal, 6236 background events; tested on 2500 signal, 2500 background events
– Tried to minimize testing sample; the more training, the better!
• Tested ~5 different MVA methods so far, including cut-based analysis, likelihood estimator, boosted decision trees
– TMVA ranks MVA methods by best signal efficiency
– Use ROC curve to determine MVA performance
– Will only show BDT results (cut-based, likelihood results in backup)
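The training/testing bookkeeping above amounts to holding out a fixed number of events per class. A small Python sketch of such a split, using the event counts from the slide; the function name, the random seeds, and the integer placeholders for tracks are assumptions, and TMVA performs its own splitting internally.

```python
import random


def split_train_test(events, n_test, seed=0):
    """Shuffle a copy of `events` and hold out `n_test` of them for testing."""
    if n_test > len(events):
        raise ValueError("test sample larger than available events")
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # (training, testing)


# Event counts from the slide; the integer IDs are placeholders for real tracks.
signal = range(10942)
background = range(8736)

sig_train, sig_test = split_train_test(signal, n_test=2500)
bkg_train, bkg_test = split_train_test(background, n_test=2500, seed=1)

print(len(sig_train), len(sig_test))  # 8442 2500
print(len(bkg_train), len(bkg_test))  # 6236 2500
```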
Boosted Decision Tree (BDT) Method
• Structured like binary tree; “yes/no” decisions taken on one variable at a time until stop criterion reached
– Splits phase space into many regions → eventually classified as signal or background
– Boosted: extends to several trees → “forest”
• Purposes of track determination: BDT with decorrelated variables
[Figure: schematic of decision tree. Leaf nodes at bottom are labeled “signal” and “background” after binary splits are made; these labels depend on the majority of events that end up in nodes.]
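The "one yes/no decision, then label each leaf by majority" idea can be shown with a single-split tree (a decision stump). This Python sketch is an illustration only: it picks the cut with the fewest misclassified events, which is a simplification of the separation criteria a real BDT uses, it does no boosting, and the track lengths are made up.

```python
from collections import Counter


def majority(labels):
    """Most common label in a node; this is how a leaf gets its name."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(values, labels):
    """Find the cut `value < threshold` that misclassifies the fewest events."""
    best = None
    for threshold in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v < threshold]
        right = [lab for v, lab in zip(values, labels) if v >= threshold]
        if not left or not right:
            continue  # not a real split
        left_label, right_label = majority(left), majority(right)
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        raise ValueError("need at least two distinct values to split")
    return best[1:]  # (threshold, left leaf label, right leaf label)


def classify(value, stump):
    """Send a value down the one-split tree and return its leaf label."""
    threshold, left_label, right_label = stump
    return left_label if value < threshold else right_label


# Made-up track lengths: short tracks are background, long ones signal.
lengths = [1.0, 1.5, 2.0, 2.5, 6.0, 7.5, 9.0, 12.0]
labels = ["background"] * 4 + ["signal"] * 4
stump = fit_stump(lengths, labels)
print(stump)                                       # (6.0, 'background', 'signal')
print(classify(8.0, stump), classify(1.2, stump))  # signal background
```

A full tree repeats this split inside each leaf until a stop criterion is met, and boosting combines many such trees into the "forest" mentioned above.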
ROC Curve for TMVA Methods
• Shows true positive rate versus false positive rate for different possible cutoff points
– Plotted here as background rejection (1 − false positive rate) versus signal efficiency (true positive rate)
• Use ROC curve to compare MVA performances
• The larger the area/integral, the better the performance
• ROC integral values:
– BDT: 0.945
– Likelihood: 0.942
– Cuts: 0.934
• From the integral values, we see that MVA methods are comparable, BDT is performing well!
[Figure: background rejection versus signal efficiency for the BDT, Likelihood, and Cuts methods]
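For reference, a ROC integral of this kind can be computed directly from classifier scores. The Python sketch below scans every possible cut, builds the (signal efficiency, background rejection) points, and integrates with the trapezoidal rule. The function names and toy scores are made up, and TMVA's own calculation works on binned histograms, so its numbers can differ slightly.

```python
def roc_points(signal_scores, background_scores):
    """(signal efficiency, background rejection) pairs, ordered by rising efficiency.

    An event passes a cut when `score >= cut`.
    """
    cuts = sorted(set(signal_scores) | set(background_scores))
    points = []
    for cut in cuts:  # rising cut: efficiency falls, rejection rises
        eff = sum(s >= cut for s in signal_scores) / len(signal_scores)
        rej = sum(b < cut for b in background_scores) / len(background_scores)
        points.append((eff, rej))
    points.append((0.0, 1.0))  # cut above every score
    points.reverse()           # now ordered from efficiency 0 to 1
    return points


def roc_integral(points):
    """Trapezoidal area under background rejection versus signal efficiency."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy classifier outputs: one signal event scores below one background event.
sig = [0.9, 0.8, 0.7, 0.4]
bkg = [0.6, 0.3, 0.2, 0.1]
area = roc_integral(roc_points(sig, bkg))
print(area)  # 0.9375
```

In this toy case 15 of the 16 signal-background pairs are ordered correctly, which is where 0.9375 comes from; a perfect classifier gives 1 and random guessing gives 0.5.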
BDT Classifier Output + Cuts
[Figure: TMVA overtraining check for classifier BDT. BDT response distributions, (1/N) dN/dx, for signal and background in the test and training samples. Kolmogorov-Smirnov test: signal (background) probability = 0.055 (0.032).]
• TMVA convention: signal events at larger classifier values, background at smaller
• Note the KS test probabilities are above 0.01; indicates no overtraining occurred
[Figure: cut efficiencies and optimal cut value. Signal efficiency, background efficiency, signal purity, signal efficiency*purity, and significance S/√(S+B) versus cut value applied on BDT output. For 1000 signal and 1000 background events the maximum S/√(S+B) is 27.77 when cutting at -0.06.]
• Gives us an idea of our performance when we apply BDT to other datasets (e.g., future MARLEY simulations)
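The optimal cut quoted above comes from scanning the classifier output and maximizing S/√(S+B). A minimal Python sketch of that scan, assuming 1000 expected signal and 1000 expected background events as on the slide; the function names and toy BDT responses are made up and are not the values from this analysis.

```python
import math


def significance(n_sig, n_bkg):
    """S / sqrt(S + B); zero when nothing passes the cut."""
    total = n_sig + n_bkg
    return n_sig / math.sqrt(total) if total > 0 else 0.0


def best_cut(signal_scores, background_scores, n_sig_expected, n_bkg_expected):
    """Scan cuts `score >= cut`; return (cut, significance) at the maximum."""
    best = (None, 0.0)
    for cut in sorted(set(signal_scores) | set(background_scores)):
        sig_eff = sum(s >= cut for s in signal_scores) / len(signal_scores)
        bkg_eff = sum(b >= cut for b in background_scores) / len(background_scores)
        value = significance(sig_eff * n_sig_expected, bkg_eff * n_bkg_expected)
        if value > best[1]:
            best = (cut, value)
    return best


# Toy BDT responses: signal at larger values, background at smaller.
sig = [0.5, 0.3, 0.2, 0.1, -0.1]
bkg = [0.0, -0.2, -0.3, -0.4, -0.5]
cut, value = best_cut(sig, bkg, n_sig_expected=1000, n_bkg_expected=1000)
print(cut, round(value, 2))
```

Because efficiencies are scaled to the expected yields, the same scan can be rerun with different assumed signal and background counts without retraining.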