scikit-learn to TMVA: XML converter tool


  1. scikit-learn to TMVA: XML converter tool
     Yuriy Ilchenko (U. of Texas), Nazim Huseynov (JINR)
     IML LHC Machine Learning WG Meeting, Feb 03, 2015

  2. History
     • ttbar production with non-prompt leptons is a major background for a few ttH channels
     • Idea: use an MVA, a boosted decision tree (BDT), to separate prompt from non-prompt leptons
     • Employ TMVA from ROOT
     • List of input variables - object level only
       • pt, eta, sigd0PV, z0SinTheta, etcone20/pt, ptcone20/pt
     • Compare BDT performance against the standard analysis cuts
       • ROC curve (BDT) vs. a point (cuts)
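The comparison of a ROC curve with a single cuts working point can be sketched as follows. This is a minimal illustration, not the analysis code: the scores, the `cuts_point` value, and the helper names `roc_points` and `rejection_at_efficiency` are all made up for the example.

```python
def roc_points(signal_scores, background_scores):
    """Return (signal efficiency, background rejection) pairs, one per threshold."""
    thresholds = sorted(set(signal_scores) | set(background_scores))
    points = []
    for t in thresholds:
        # An event is selected when its BDT score is at or above the threshold.
        sig_eff = sum(s >= t for s in signal_scores) / len(signal_scores)
        bkg_rej = sum(b < t for b in background_scores) / len(background_scores)
        points.append((sig_eff, bkg_rej))
    return points


def rejection_at_efficiency(points, target_eff):
    """Best background rejection among thresholds keeping at least target_eff."""
    candidates = [rej for eff, rej in points if eff >= target_eff]
    return max(candidates) if candidates else 0.0


# Toy BDT scores (invented for illustration only).
signal = [0.9, 0.8, 0.75, 0.6, 0.4]
background = [0.7, 0.5, 0.3, 0.2, 0.1]

# Hypothetical (signal efficiency, background rejection) of the standard cuts.
cuts_point = (0.8, 0.6)

curve = roc_points(signal, background)
bdt_rejection = rejection_at_efficiency(curve, cuts_point[0])
print("BDT rejection at the cuts efficiency:", bdt_rejection)
print("BDT curve lies above the cuts point:", bdt_rejection > cuts_point[1])
```

If the curve passes above the cuts point, the BDT rejects more background at the same signal efficiency.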

  3. TMVA - electrons
     [Figure: "Background rejection versus Signal efficiency" ROC curves, MVA Method: BDT, for the 10% sample and the 33% sample; a zoomed-in panel for each sample shows the BDT curve together with the cuts point.]

  4. TMVA - muons
     [Figure: "Background rejection versus Signal efficiency" ROC curve, MVA Method: BDT, with a zoomed-in panel showing the cuts point; panels labeled 10% sample and 33% sample.]
     TMVA error shown on the slide:
     --- <ERROR> BDT : YOUR tree has only 1 Node... kind of a funny *tree*. I cannot boost such a thing... if after 1 step the error rate is == 0.5
     • no results :(
     • Decided to try an alternative MVA library

  5. scikit-learn
     • “sklearn” - popular open-source library for data analysis written in Python
     • Implements all major models - decision trees, neural networks, etc.
     • Supported by an international community of developers
     Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
     www.scikit-learn.org

  6. sklearn - electrons
     [Figure: panels for the 10% sample and the 33% sample, each with a zoomed-in view showing the cuts point.]

  7. sklearn - muons
     [Figure: panels for the 10% sample and the 33% sample, each with a zoomed-in view showing the cuts point.]

  8. sklearn to TMVA
     • Problem: no sklearn available in ATLAS software
     • Solution: convert a classifier trained with scikit-learn to the xml format readable by TMVA Reader
     • Perk: apply the BDT in ATLAS independently of scikit-learn
     [Diagram labels: skTMVA converter; For Training; For Testing]

  9. skTMVA converter
     • skTMVA - sklearn to TMVA converter
     • part of the koza4ok package, which also contains a ROC-curve calculation and some other tools
     • written in Python
     • on GitHub - https://github.com/yuraic/koza4ok
     • What’s supported?
       • BDT binary classification
       • AdaBoost, Gradient Boosting
       • xml format only

  10. skTMVA in action
      • Getting the converter:
        git clone https://github.com/yuraic/koza4ok.git
      • Set up the repository:
        source setup_koza4ok.sh
      • And in your Python code:
        [Code screenshot; annotations: scikit-learn model; output TMVA xml file; TMVA input variables and their type (variable order matters!)]
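Since the variable order matters, one way to guard against a mismatch is to build the TMVA variable list directly from the training column order and check it before converting. The sketch below is an illustration under assumptions: the helper `check_variable_order` is hypothetical, the type code "F" is assumed to denote a float variable, and the commented-out converter call is recalled from the koza4ok examples rather than taken from the slide, so verify it against the repository.

```python
# Column order of the feature matrix used to train the scikit-learn model.
# The names come from the input-variable list earlier in the talk.
training_columns = ["pt", "eta", "sigd0PV", "z0SinTheta", "etcone20/pt", "ptcone20/pt"]

# (name, type) pairs to hand to the converter; "F" is assumed to mean float.
tmva_variables = [(name, "F") for name in training_columns]


def check_variable_order(columns, variables):
    """Raise if the TMVA variable list does not follow the training column order."""
    names = [name for name, _type in variables]
    if names != list(columns):
        raise ValueError("variable order mismatch: %r vs %r" % (names, list(columns)))
    return True


check_variable_order(training_columns, tmva_variables)

# Assumed call, not shown in the slide text; check the koza4ok repository:
# from skTMVA import convert_bdt_sklearn_tmva
# convert_bdt_sklearn_tmva(bdt, tmva_variables, "bdt_sklearn_to_tmva_example.xml")
```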

  11. skTMVA in practice
      In the koza4ok/example folder:
      • Training - no input data is required, data is generated on the fly
        • bdt_sklearn_to_tmva_AdaBoost.py
        • bdt_sklearn_to_tmva_Grad.py
      • Testing and validation - draw ROC curves by TMVA and scikit-learn and overlay them
        • validate_sklearn_to_tmva.py
      • Two files are created when running the examples
        • bdt_sklearn_to_tmva_example.pkl - stores the scikit-learn model
        • bdt_sklearn_to_tmva_example.xml - converted TMVA xml file

  12. Summary
      Summary
      • skTMVA - scikit-learn to TMVA converter
      • supports BDT binary classification - AdaBoost, Gradient Boosting
      • saves to xml file
      • comes with examples and validation code
      • web: https://github.com/yuraic/koza4ok
      Plans
      • Convert scikit-learn model to a standalone C++ file
      Contact us
      • Yuriy Ilchenko (core development) - ilchenko@physics.utexas.edu
      • Nazim Huseynov (validation, testing) - nguseynov@jinr.ru

  13. Backup

  14. Decision Tree in scikit-learn and TMVA
      • TMVA variable description is in the back-up slides (or google)
      • sklearn tree structure is described at http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html
      [Diagram labels: scikit-learn Decision Tree; apply skTMVA converter]
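For orientation, scikit-learn stores a fitted tree as parallel arrays (`children_left`, `children_right`, `feature`, `threshold`) where a leaf has child index -1, and a sample goes left when its feature value is less than or equal to the threshold. The sketch below walks hand-written toy arrays in that layout and turns them into nested nodes, which is the shape a nested xml tree needs. The values, the per-leaf `leaf_score` list (a simplification of scikit-learn's `value` array), and the helpers `to_nested` and `predict` are assumptions for illustration, not the converter's code.

```python
# Toy arrays in the layout scikit-learn exposes on `estimator.tree_`.
# Node 0 is the root; nodes 1 and 2 are leaves (child index -1).
children_left = [1, -1, -1]
children_right = [2, -1, -1]
feature = [0, -2, -2]            # -2 marks "no split" on a leaf
threshold = [34.1, -2.0, -2.0]
leaf_score = [None, -1.0, 1.0]   # simplified: one score per leaf


def to_nested(node=0):
    """Convert the flat arrays into a nested dict, starting at `node`."""
    if children_left[node] == -1:
        return {"leaf": leaf_score[node]}
    return {
        "feature": feature[node],
        "threshold": threshold[node],
        "left": to_nested(children_left[node]),
        "right": to_nested(children_right[node]),
    }


def predict(tree, x):
    """Follow scikit-learn's rule: go left when x[feature] <= threshold."""
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


nested = to_nested()
print(predict(nested, [10.0]), predict(nested, [50.0]))
```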

  15. TMVA minimal weights xml
      Example: a single tree encoded in a TMVA xml file
      [Annotated xml screenshot; annotations: Describe Variables; Maps var to VarIndex; Tree weight (AdaBoost); Tree number; Tree structure as a bunch of included nodes]
      <GeneralInfo> and <Options> - removed, don’t affect BDT score

  16. TMVA BDT xml parameters
      • Variables section
        • variable Min, Max values show no effect on the output BDT score
      • BinaryTree section - node parameters
        • IVar="0" - refers to a variable defined by VarIndex in the Variables section
        • pos="s" - root node, pos="l" - left, pos="r" - right
        • Cut="3.4095886230468750e+01" - node cut value
        • nType - node type; compared against NodePurityLimit, which is set in the TMVA BDT config parameters
          • nType="-1" - terminal background node
          • nType="1" - terminal signal node
          • nType="0" - intermediate node
        • cType - cut type
          • cType="0" - if node variable > cut value, then go left; otherwise go right
          • cType="1" - if node variable > cut value, then go right; otherwise go left
        • purity - S/(S+B); S - number of signal events, B - number of background events
        • res="…" and rms="…" - regression predictions (used in Gradient Boosting)
        • NCoef="0" - always zero; some Fisher coefficients, not sure what they are for
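The node parameters above are enough to walk a single tree by hand. The following sketch parses a small xml fragment with Python's standard `xml.etree.ElementTree` and applies the pos, Cut, cType and nType rules exactly as listed on this slide. The fragment is a made-up toy tree, the `boostWeight` and `itree` attribute names are assumptions, and `evaluate` is a hypothetical helper, not TMVA's or skTMVA's code.

```python
import xml.etree.ElementTree as ET

# Toy single-tree fragment; only the node attributes described on the slide are used.
TREE_XML = """
<BinaryTree boostWeight="1.0" itree="0">
  <Node pos="s" IVar="0" Cut="3.4095886230468750e+01" cType="1" nType="0">
    <Node pos="l" IVar="-1" Cut="0" cType="1" nType="-1"/>
    <Node pos="r" IVar="-1" Cut="0" cType="1" nType="1"/>
  </Node>
</BinaryTree>
"""


def evaluate(node, values):
    """Return the terminal nType (+1 signal, -1 background) reached by `values`."""
    ntype = int(node.get("nType"))
    if ntype != 0:
        return ntype  # terminal node
    above = values[int(node.get("IVar"))] > float(node.get("Cut"))
    if node.get("cType") == "1":
        go = "r" if above else "l"   # cType=1: variable > cut goes right
    else:
        go = "l" if above else "r"   # cType=0: variable > cut goes left
    for child in node:
        if child.get("pos") == go:
            return evaluate(child, values)
    raise ValueError("intermediate node without a %r child" % go)


root = ET.fromstring(TREE_XML).find("Node")
print(evaluate(root, [50.0]), evaluate(root, [10.0]))
```

Under these rules, scikit-learn's convention (go left when the value is less than or equal to the threshold) corresponds to cType="1".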
