experience with rdataframe
play

Experience with RDataFrame Spotlight on interactive/exploratory use - PowerPoint PPT Presentation

Experience with RDataFrame Spotlight on interactive/exploratory use Oliver Lantwin [ oliver.lantwin@cern.ch ] December 6, 2018 Intro Who am I? Physicist (PhD student) working on the SHiP experiment, and its software Future experiment,


  1. Experience with RDataFrame Spotlight on interactive/exploratory use Oliver Lantwin [ oliver.lantwin@cern.ch ] December 6, 2018

  2. Intro

  3. Who am I? › Physicist (PhD student) working on the SHiP experiment, and its software › Future experiment, so work includes: › Physics studies using toy simulation › Full simulation using Geant4 and co › TGeo › Reconstruction, Digitisation › Analysis › small dedicated software for test beams → Plenty of places to play around with new ROOT features without breaking anything › Personally, I have used both C++ and Python , as well as the scientifjc python packages extensively (including pandas ) Oliver Lantwin (Imperial College London) December 6, 2018 Intro 2

  4. Why am I giving this talk? › Been playing around with RDataFrame a lot recently: › Dedicated small physics studies › Procrastination › In contact with Enrico, Danilo &co. with bug reports, suggestions &c.: › RVec<T>::operator== etc. return RVec<int> instead of RVec<bool> ? → Fixed › SaveGraph should return the dot spec by default → Fixed (demo to follow) › Automatic histogram naming (for the lazy physicist) › redefjnition of columns causes seg violation → Fixed (now useful error message) Oliver Lantwin (Imperial College London) December 6, 2018 Intro 3

  5. Killer feature: Toy MC

  6. TGenPhaseSpace sans RDF weight = event.Generate(); Killer feature: Toy MC December 6, 2018 Oliver Lantwin (Imperial College London) f->Close(); tree.Write(); } tree.Fill(); nul = *event.GetDecay(2); l = *event.GetDecay(1); pi = *event.GetDecay(0); for ( auto && i: ROOT::MakeSeq(1000000)){ TLorentzVector K_L{P_K*TMath::Cos(angle), 0, P_K*TMath::Sin(angle), TMath::Sqrt(P_K*P_K+m_K*m_K)}; tree.Branch(”w”, &weight); tree.Branch(”nul”, &nul); tree.Branch(”l”, &l); tree.Branch(”pi”, &pi); TTree tree(”phasespace”, ”K_L phasespace”); TLorentzVector pi, l, nul; double weight; auto f = TFile::Open(”K_L_phasespace.root”, ”recreate”); double masses[] = {m_pi, m_mu, m_numu}; TGenPhaseSpace event; 4 event.SetDecay(K_L, 3, masses);

  7. With RDF phasespace.Snapshot(”phasespace”, ”K_L_phasespace.root”); Killer feature: Toy MC December 6, 2018 Oliver Lantwin (Imperial College London) › no TTree boilerplate › more expressive › Much shorter Pro: .Define(”nu”, ”*event.GetDecay(2)”); TLorentzVector K_L{P_K*TMath::Cos(angle), 0, P_K*TMath::Sin(angle), TMath::Sqrt(P_K*P_K+m_K*m_K)}; .Define(”l”, ”*event.GetDecay(1)”) .Define(”pi”, ”*event.GetDecay(0)”) .Define(”weight”, ”event.Generate()”) auto phasespace = df ROOT::RDataFrame df(100000); double masses[] = {m_pi, m_mu, m_numu}; TGenPhaseSpace event; 5 event.SetDecay(K_L, 3, masses);

  8. With RDF phasespace.Snapshot(”phasespace”, ”K_L_phasespace.root”); Killer feature: Toy MC December 6, 2018 Oliver Lantwin (Imperial College London) with slot › Not parallelisable ( event object) problematic (would need explicit lambda for each Define › Not obvious to me that Define s always happen in order Reliant on side effects and implementation details: Con: .Define(”nu”, ”*event.GetDecay(2)”); TLorentzVector K_L{P_K*TMath::Cos(angle), 0, P_K*TMath::Sin(angle), TMath::Sqrt(P_K*P_K+m_K*m_K)}; .Define(”l”, ”*event.GetDecay(1)”) .Define(”pi”, ”*event.GetDecay(0)”) .Define(”weight”, ”event.Generate()”) auto phasespace = df ROOT::RDataFrame df(100000); double masses[] = {m_pi, m_mu, m_numu}; TGenPhaseSpace event; 6 event.SetDecay(K_L, 3, masses);

  9. Maybe future RDF ? (better suggestions welcome!) } Killer feature: Toy MC December 6, 2018 Oliver Lantwin (Imperial College London) › similar to ROOT-9766 › easily parallelisable using array of TGenPhaseSpace and (Multi)DefineSlot . › (nearly) purely functional phasespace.Snapshot(”phasespace”, ”K_L_phasespace.root”); .MultiDefine({”weight”, ”pi”, ”l”, ”nu”}, genFunc); auto phasespace = df return make_tuple(weight, pi, l, nu) TLorentzVector K_L{P_K*TMath::Cos(angle), 0, P_K*TMath::Sin(angle), TMath::Sqrt(P_K*P_K+m_K*m_K)}; auto nu = *event.GetDecay(1); auto l = *event.GetDecay(1); auto pi = *event.GetDecay(0); auto weight = event.Generate(); auto genFunc = [event](){ ROOT::RDataFrame df(100000); double masses[] = {m_pi, m_mu, m_numu}; TGenPhaseSpace event; 7 event.SetDecay(K_L, 3, masses);

  10. Toy MC in RDF › Trivial to adapt to most simple generation routines (inspired by Enrico’s Pythia8 example) › Can Snapshot and continue analysing right away (e.g. in notebook), no need to context switch and can easily change toy MC parameters etc. and rerun entire chain! › Code above is a real example recently used for an estimation Oliver Lantwin (Imperial College London) December 6, 2018 Killer feature: Toy MC 8

  11. Nasty nested loops

  12. SHiP simulation data model File Size 32000 bytes Compression= 1.91 * *............................................................................* *Br 1 :MCTrack.fUniqueID : UInt_t fUniqueID[cbmroot.Stack.MCTrack_] * *Entries : 100 : Total Size= 27425 bytes = *Baskets : 548 * *Baskets : 1 : Basket Size= 32000 bytes Compression= 48.79 * *............................................................................* ... TBrowser Oliver Lantwin (Imperial College London) December 6, 2018 Nasty nested loops 1 : Basket Size= 463 * TTree of TClonesArray s of C++ classes with scalar member variables * ROOT-9674 ****************************************************************************** *Tree :cbmsim : /cbmroot * *Entries : 100 : Total = 4199856 bytes File Size = 1868961 * : = : Tree compression factor = 2.20 * ****************************************************************************** *Br 0 :MCTrack : Int_t cbmroot.Stack.MCTrack_ * *Entries : 100 : Total Size= 11377 bytes File Size 9 › TClonesArray member are split, but types are not correctly identifjed →

  13. Nasty Nested Loops This data model is easy to work with using a looping approach: for event in tree: for p in particles: do_stuff() But pandas , uproot and RDataFrame struggle with it. Been brainstorming with Enrico, Jim etc. on how this could be improved in general (see also https://github.com/bluehood/NNLOops, https://github.com/scikit-hep/awkward-array) Oliver Lantwin (Imperial College London) December 6, 2018 Nasty nested loops 10

  14. Example: PyROOT h = TH2D(”h”, ”;P[GeV/c];p_{T}[GeV/c]”, 100, 0, 100, 100, 0, 3) for track in tracks: if abs(track.GetPdgCode() == 13): pt = hypot(track.GetPx(), track.GetPy()) h.Fill(p, pT) Pro: Easy to understand Con: Pure python loops Oliver Lantwin (Imperial College London) December 6, 2018 Nasty nested loops 11 p = hypot(track.GetPz(), pt)

  15. TTree::Draw() ? ”>>h(100, 0, 100, 100, 0, 3)”, Nasty nested loops December 6, 2018 Oliver Lantwin (Imperial College London) Con: Not true C++ , very fragile, doesn’t work for more complicated examples Pro: Faster, still fairly obvious to ROOT users what happens ”abs(MCTrack.GetPdgCode())==13”, ”colz”) ”+MCTrack.GetPz()*MCTrack.GetPz())” tree->Draw(”sqrt(” ”+MCTrack.GetPy()*MCTrack.GetPy()” ”MCTrack.GetPx()*MCTrack.GetPx()” ”:sqrt(” ”)” ”+MCTrack.GetPy()*MCTrack.GetPy()” ”MCTrack.GetPx()*MCTrack.GetPx()” 12

  16. Contortions with RDF return Map(muons, [](const ShipMCTrack& muon){ Nasty nested loops December 6, 2018 Oliver Lantwin (Imperial College London) Con: A lot of repetition, boiler plate; colleagues less happy with C++ will complain Pro: Fast , Safe .Histo2D({”h”, ”;P[GeV/c];p_{T}[GeV/c]”, 100, 0, 100, 100, 0, 3},”Muons_P”,”Muons_pt”); },{”Muons”}) return sqrt( muon.GetPx()*muon.GetPx()+ muon.GetPy()*muon.GetPy()+ muon.GetPz()*muon.GetPz()); }); .Define(”Muons_P”, [](const RVec<ShipMCTrack> &muons){ auto h = df },{”Muons”}) return sqrt( muon.GetPx()*muon.GetPx()+ muon.GetPy()*muon.GetPy()); }); return Map(muons, [](const ShipMCTrack& muon){ .Define(”Muons_pt”, [](const RVec<ShipMCTrack> &muons){ }, {”Tracks”}) return abs(t.GetPdgCode())==13; }); return Filter(ts, [](const ShipMCTrack& t){ .Define(”Muons”, [](const RVec<ShipMCTrack> & ts){ .Define(”Tracks”, clones_converter<ShipMCTrack>, {”MCTrack”}) 13

  17. Why this way? › Operating on RVec<ShipMCTrack> allows us to use Filter s, Map s and track[isMuon] niceties, but causes trouble when we want to use getter methods etc. › Alternative approach: › Use split feature to get scalar properties as RVec s directly › Use RVec columnar calculations Oliver Lantwin (Imperial College London) December 6, 2018 Nasty nested loops 14

  18. Second try Caveat: Does not yet work auto h = df.Define(”isMuon”, ”abs(MCTrack.fPdgCode)==13”) .Define(”pt”, ”sqrt(MCTrack.fPx*MCTrack.fPx+MCTrack.fPy+MCTrack.fPy)”) .Define(”P”, ”sqrt(pt*pt+MCTrack.fPz+MCTrack.fPz)”) .Define(”muon_P”, ”P[isMuon]”) .Define(”muon_pt”, ”pt[isMuon]”) .Histo2D({”h”, ”;P[GeV/c];p_{T}[GeV/c]”, 100, 0, 100, 100, 0, 3}, ”muon_P”,”muon_pt”); Pro: Much less verbose Con: Would have to read split TClonesArray branches into RVec , ignore class member visibility ( ROOT seems to do this in many cases already) Oliver Lantwin (Imperial College London) December 6, 2018 Nasty nested loops 15

Recommend


More recommend