Introduction to TMVA and Primary Electron Track Determination
Erin Conley
SNB/LE Working Group Meeting
June 20, 2018

Introduction
• Goal: determine primary electron (reconstructed) track in ν_e CC interactions
  – Not always obvious; having a concrete, general method would be useful!
  – Using MARLEY events made by J. Stock in May 2017
• TMVA provides methods based on machine learning to help reach this goal.
[Figure: 30.25 MeV event display, time (ticks) vs. wire]

TMVA: Introduction
• TMVA: Toolkit for Multivariate Data Analysis
  – Framework in ROOT to be used for classification and regression problems
  – Various multivariate analysis (MVA) methods available
• Two independent phases:
  – Training phase: MVA methods trained, tested, evaluated
  – Application phase: chosen MVA methods applied to classification problem
• Need to worry about overtraining: too many model parameters tuned on too few training events leads to an unrealistic increase in classification performance on the training sample that does not carry over to independent data
• Data pre-processing available to, e.g., decorrelate or “Gaussianize” variables
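
As a rough illustration of the decorrelation pre-processing mentioned above, here is a minimal pure-Python sketch for two variables. It multiplies each event by the inverse square root of the sample covariance matrix, so the transformed variables have unit variance and no linear correlation. This only sketches the general idea and is not TMVA's implementation; the function names and the toy "length" and "charge" values are assumptions made for illustration.

```python
import math
import random

def covariance_2d(xs, ys):
    """Sample covariance entries (cxx, cxy, cyy) of two variables."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cxx, cxy, cyy

def inverse_sqrt_2x2(cxx, cxy, cyy):
    """Inverse of the symmetric square root of a 2x2 positive-definite matrix."""
    s = math.sqrt(cxx * cyy - cxy * cxy)   # sqrt(det C)
    t = math.sqrt(cxx + cyy + 2.0 * s)     # trace of sqrt(C)
    # Closed form for 2x2 matrices: sqrt(C) = (C + s*I) / t
    a, b, d = (cxx + s) / t, cxy / t, (cyy + s) / t
    det = a * d - b * b
    return d / det, -b / det, a / det      # entries of the inverse

def decorrelate(xs, ys):
    """Apply the inverse square root of the covariance matrix to every event."""
    ia, ib, id_ = inverse_sqrt_2x2(*covariance_2d(xs, ys))
    new_x = [ia * x + ib * y for x, y in zip(xs, ys)]
    new_y = [ib * x + id_ * y for x, y in zip(xs, ys)]
    return new_x, new_y

# Toy data (hypothetical): charge strongly correlated with length.
random.seed(1)
length = [random.gauss(10.0, 3.0) for _ in range(2000)]
charge = [50.0 * l + random.gauss(0.0, 40.0) for l in length]
deco_length, deco_charge = decorrelate(length, charge)
cov_after = covariance_2d(deco_length, deco_charge)  # close to (1, 0, 1)
```

Only linear correlations are removed this way, which is why the correlation plots are still worth checking after the transformation.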

TMVA Output
Characteristics about input variables:
• Distributions for signal, background input variables
• Distributions for transformed variables (e.g., decorrelated variables)
• Correlation plots + matrix to understand linear correlations between variables
Use these plots to choose optimal combinations of variables, data pre-processing strategy, etc.
MVA method performance plots:
• MVA method classifier outputs
  – Kolmogorov-Smirnov test to determine whether overtraining occurred (rule of thumb: want ρ_KS ≳ 0.01)
• Optimal cut for MVA method classifiers
• Classification probabilities + PDFs
• Probability integral transformation (rarity)
• Receiver operating characteristic (ROC) curves
Use these plots to compare MVA method performances, choose optimal cuts on data, etc.
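
The overtraining check compares the classifier output on the training sample with the output on the test sample. A minimal sketch of a two-sample Kolmogorov-Smirnov comparison in plain Python follows. It uses the standard asymptotic series for the Kolmogorov probability; it is not TMVA's own code, and the function names and toy response values are assumptions for illustration.

```python
import math

def ks_statistic(sample_a, sample_b):
    """Maximum distance between the two empirical cumulative distributions."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_probability(sample_a, sample_b):
    """Asymptotic probability that both samples share one parent distribution."""
    n, m = len(sample_a), len(sample_b)
    z = ks_statistic(sample_a, sample_b) * math.sqrt(n * m / (n + m))
    if z < 0.2:
        return 1.0  # the series converges poorly here; the probability is ~1
    total = 0.0
    for k in range(1, 101):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
    return max(0.0, min(1.0, 2.0 * total))

# Toy classifier responses (hypothetical values).
train_response = [0.31, 0.12, 0.45, 0.28, 0.39, 0.22]
test_response = [0.29, 0.15, 0.41, 0.33, 0.36, 0.20]
p_ks = ks_probability(train_response, test_response)
```

With the rule of thumb from the slide, a value of `p_ks` well below 0.01 would be the warning sign that training and test outputs disagree.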

MARLEY Simulations: Preparing for TMVA
• Used BackTracker to determine which tracks were made by primary electron
  – Used 2D hits associated with tracks
  – Multiple tracks in the event can be made by the primary electron
  – Some tracks are partially made by primary electrons (e.g., track with 10 hits → primary electron produced 6 hits)
• For the purposes of preliminary TMVA tests:
  – Signal: tracks that had 75% or more of their hits produced by the primary electron
  – Background: all other tracks
• Used full 30.25 MeV MARLEY simulation (10,000 events)
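
The signal/background definition above can be sketched in a few lines. This assumes each track is reduced to a list of booleans, one per hit, saying whether the hit was produced by the primary electron; that representation and the function names are hypothetical and stand in for whatever the BackTracker query returns.

```python
def primary_hit_fraction(hit_is_primary):
    """Fraction of a track's hits produced by the primary electron."""
    if not hit_is_primary:
        return 0.0
    return sum(hit_is_primary) / len(hit_is_primary)

def label_track(hit_is_primary, threshold=0.75):
    """'signal' if at least `threshold` of the hits are primary, else 'background'."""
    if primary_hit_fraction(hit_is_primary) >= threshold:
        return "signal"
    return "background"

# The example from the slide: a 10-hit track with 6 primary-electron hits.
partial_track = [True] * 6 + [False] * 4
# A hypothetical cleaner track: 9 of 10 hits from the primary electron.
clean_track = [True] * 9 + [False] * 1

partial_label = label_track(partial_track)  # 60% primary hits
clean_label = label_track(clean_track)      # 90% primary hits
```

Under the 75% threshold the 6-of-10 example track counts as background, even though the primary electron made most of its hits.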

Determining Primary Track “By Eye”
• Scanned event displays of 100 events in 30.25 MeV MARLEY data
  – These events had 2.61 reconstructed tracks on average (pmtracktc)
• Out of the 100 events…
  – 2 events had no reconstructed tracks
  – 2 events where I failed to identify the primary track
  – 10 events where I correctly identified at least one primary track but…
    • Misidentified another track
    • Failed to identify all primary tracks
  – 86 events where I correctly identified all primary tracks
• Out of the 86 events where I was 100% correct…
  – 14 events contained one track

Variables Used from MARLEY Simulations
1. Track length: as given in the recob::Track object
2. “Charge deposition”: sum of integral values of all hits in a track
  – recob::Hit::Integral(): integral under calibrated signal waveform
3. “Path time”: difference between max/min peak times in the track
  – recob::Hit::PeakTime(): time of signal peak (ticks)
4. “Summed RMS”: sum of RMS of all hits in a track
  – recob::Hit::RMS(): RMS of hit shape (ticks)
• Also used calorimetry information from tracks:
5. “Summed dQdx”: sum of dQdx values on collection plane
6. “Calo KE”: kinetic energy of track on collection plane
  – Potential issue: not all tracks have calorimetry information (bug?)
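
The three hit-based variables (2 to 4) are simple reductions over a track's hits. A minimal sketch follows, assuming a small `Hit` record whose fields stand in for the values returned by recob::Hit::Integral(), recob::Hit::PeakTime() and recob::Hit::RMS(). The Python class, the dictionary keys and the toy numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    integral: float   # stands in for recob::Hit::Integral()
    peak_time: float  # stands in for recob::Hit::PeakTime(), in ticks
    rms: float        # stands in for recob::Hit::RMS(), in ticks

def track_variables(hits):
    """Hit-based variables for one track (requires at least one hit)."""
    peak_times = [h.peak_time for h in hits]
    return {
        "charge_deposition": sum(h.integral for h in hits),
        "path_time": max(peak_times) - min(peak_times),
        "summed_rms": sum(h.rms for h in hits),
    }

# Toy track with three hits (hypothetical values).
toy_hits = [
    Hit(integral=100.0, peak_time=1200.0, rms=3.0),
    Hit(integral=150.0, peak_time=1215.0, rms=4.0),
    Hit(integral=50.0, peak_time=1190.0, rms=2.5),
]
toy_variables = track_variables(toy_hits)
```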

Input Variable Distributions
Signal: tracks with 75-100% of their hits made by primary electron
Background: all other tracks
[Figure: normalized (1/N) dN distributions of signal and background for the six input variables: length, timeofint, chargedepo, summedrms, summeddqdx, caloke]

Decorrelated Variable Distributions
[Figure: normalized (1/N) dN distributions of signal and background for the six “Deco”-transformed input variables: length, timeofint, chargedepo, summedrms, summeddqdx, caloke]

Track Determination + TMVA
• Number “signal” events: 10942
• Number “background” events: 8736
• Trained on 8442 signal, 6236 background events; tested on 2500 signal, 2500 background events
  – Tried to minimize testing sample; the more training, the better!
• Tested ~5 different MVA methods so far, including cut-based analysis, likelihood estimator, boosted decision trees
  – TMVA ranks MVA methods by best signal efficiency
  – Use ROC curve to determine MVA performance
  – Will only show BDT results (cut-based, likelihood results in backup)
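
The sample sizes above follow from holding out 2500 events of each class for testing. A minimal sketch of such a split is below; the random shuffling and the function name are assumptions, since the slides do not say how TMVA was configured to pick the events.

```python
import random

def split_train_test(events, n_test, seed=0):
    """Shuffle a copy of the events and hold out the first n_test for testing."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# Integers stand in for the tracks; only the counts come from the slide.
signal_tracks = list(range(10942))
background_tracks = list(range(8736))

signal_train, signal_test = split_train_test(signal_tracks, 2500)
background_train, background_test = split_train_test(background_tracks, 2500)
```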

Boosted Decision Tree (BDT) Method
• Structured like binary tree; “yes/no” decisions taken on one variable at a time until stop criterion reached
  – Splits phase space into many regions → eventually classified as signal or background
  – Boosted: extends to several trees → “forest”
• Purposes of track determination: BDT with decorrelated variables
[Figure: schematic of decision tree. Leaf nodes at bottom are labeled “signal” and “background” after binary splits are made; these labels depend on the majority of events that end up in nodes]
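
To make the tree structure concrete, here is a minimal sketch of a single, unboosted decision tree in plain Python. Each node takes a yes/no decision on one variable, splitting stops at a maximum depth or when a node is pure, and each leaf is labeled by the majority of training events in it. The Gini impurity as split criterion, the depth limit and the toy (length, charge) values are assumptions for illustration; boosting into a forest is not shown.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (1 = signal)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels):
    """Return (score, variable index, threshold) with the lowest weighted impurity."""
    best = None
    for var in range(len(rows[0])):
        for threshold in sorted({r[var] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[var] < threshold]
            right = [l for r, l in zip(rows, labels) if r[var] >= threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, var, threshold)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split until the node is pure or max_depth is reached."""
    split = None
    if depth < max_depth and gini(labels) > 0.0:
        split = best_split(rows, labels)
    if split is None:
        # Leaf: label by the majority of training events that ended up here.
        return "signal" if 2 * sum(labels) >= len(labels) else "background"
    _, var, threshold = split
    below = [(r, l) for r, l in zip(rows, labels) if r[var] < threshold]
    above = [(r, l) for r, l in zip(rows, labels) if r[var] >= threshold]
    return (
        var,
        threshold,
        build_tree([r for r, _ in below], [l for _, l in below], depth + 1, max_depth),
        build_tree([r for r, _ in above], [l for _, l in above], depth + 1, max_depth),
    )

def classify(tree, row):
    """Follow the yes/no decisions down to a leaf label."""
    while not isinstance(tree, str):
        var, threshold, below, above = tree
        tree = below if row[var] < threshold else above
    return tree

# Toy training tracks (hypothetical): (length, charge deposition), 1 = signal.
toy_rows = [(2.0, 300.0), (3.0, 500.0), (4.0, 450.0),
            (12.0, 2500.0), (15.0, 3100.0), (18.0, 2900.0)]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_tree = build_tree(toy_rows, toy_labels)
long_track_label = classify(toy_tree, (14.0, 2800.0))
short_track_label = classify(toy_tree, (2.5, 400.0))
```

A boosted version would train many such trees on reweighted events and combine their votes, which is what turns the single tree into a forest.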

ROC Curve for TMVA Methods
• Shows true positive rate versus false positive rate for different possible cutoff points (TMVA plots this as background rejection, i.e., 1 − false positive rate, versus signal efficiency)
• Use ROC curve to compare MVA performances
• The larger the area/integral, the better the performance
• From the integral values, we see that MVA methods are comparable, BDT is performing well!
ROC integral values:
• BDT: 0.945
• Likelihood: 0.942
• Cuts: 0.934
[Figure: background rejection versus signal efficiency for MVA methods BDT, Likelihood, Cuts]
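
The ROC integral can be reproduced from classifier outputs by scanning every possible cut and integrating background rejection over signal efficiency. The sketch below uses the trapezoidal rule; the function names and toy scores are hypothetical, and TMVA's own binning may give slightly different numbers.

```python
def roc_points(signal_scores, background_scores):
    """(signal efficiency, background rejection) pairs, efficiency increasing."""
    cuts = sorted(set(signal_scores) | set(background_scores))
    points = [(1.0, 0.0)]  # cut below all scores: every event is kept
    for cut in cuts:
        eff_s = sum(s > cut for s in signal_scores) / len(signal_scores)
        rej_b = sum(b <= cut for b in background_scores) / len(background_scores)
        points.append((eff_s, rej_b))
    # Cuts were scanned upward, so efficiency fell; reverse to make it rise.
    return points[::-1]

def roc_integral(points):
    """Area under the rejection-versus-efficiency curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy classifier outputs (hypothetical): perfectly separated classes.
separated = roc_integral(roc_points([0.5, 0.6, 0.7], [-0.3, -0.2, 0.1]))
# Identical distributions: no separation power.
overlapping = roc_integral(roc_points([0.1, 0.2], [0.1, 0.2]))
```

An integral of 1 means perfect separation and 0.5 means no separation, so values like 0.945 sit close to the ideal end of that range.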

BDT Classifier Output + Cuts
[Figure, left: TMVA overtraining check for classifier BDT. (1/N) dN/dx versus BDT response for signal and background, test and training samples. Kolmogorov-Smirnov test: signal (background) probability = 0.055 (0.032)]
• TMVA convention: signal events at larger classifier values, background at smaller
• Note the KS test probabilities are above 0.01; indicates no overtraining occurred
[Figure, right: cut efficiencies and optimal cut value. Signal efficiency, background efficiency, signal purity, signal efficiency*purity and significance S/√(S+B) versus cut value applied on BDT output. For 1000 signal and 1000 background events the maximum S/√(S+B) is 27.77 when cutting at -0.06]
• Gives us an idea of our performance when we apply BDT to other datasets (e.g., future MARLEY simulations)
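
The optimal cut in the right-hand plot comes from scanning cut values and maximizing the significance S/√(S+B) for an assumed 1000 signal and 1000 background events. A minimal sketch of that scan follows; the function name and toy scores are hypothetical, and events are kept when their score is at or above the cut.

```python
import math

def best_significance_cut(signal_scores, background_scores,
                          n_signal=1000, n_background=1000):
    """Scan cuts and return (cut, significance) maximizing S / sqrt(S + B)."""
    best_cut, best_significance = None, 0.0
    for cut in sorted(set(signal_scores) | set(background_scores)):
        eff_s = sum(s >= cut for s in signal_scores) / len(signal_scores)
        eff_b = sum(b >= cut for b in background_scores) / len(background_scores)
        s = eff_s * n_signal        # expected signal events passing the cut
        b = eff_b * n_background    # expected background events passing the cut
        if s + b == 0:
            continue
        significance = s / math.sqrt(s + b)
        if significance > best_significance:
            best_cut, best_significance = cut, significance
    return best_cut, best_significance

# Toy BDT responses (hypothetical values).
toy_signal = [0.1, 0.2, 0.3, 0.4]
toy_background = [-0.4, -0.3, -0.2, 0.15]
toy_cut, toy_significance = best_significance_cut(toy_signal, toy_background)
```

The efficiencies are measured on the test sample and then scaled to the assumed event counts, so the best cut shifts if the expected signal-to-background ratio changes.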
