
Automated Parallel Calculation of Collaborative Statistical Models

Automated Parallel Calculation of Collaborative Statistical Models in RooFit. Patrick Bos, IEEE eScience, Amsterdam, 31 October 2018.


  1. Automated Parallel Calculation of Collaborative Statistical Models in RooFit Patrick Bos IEEE eScience, Amsterdam, 31 October 2018

  2. Automated Parallel Computation of Collaborative Statistical Models. Physics: Wouter Verkerke (PI), Vince Croft, Carsten Burgard. eScience: Patrick Bos (yours truly), Inti Pelupessy, Jisk Attema

  3. Particle physics

  4. High energy proton collisions. Images from http://atlas.physicsmasterclasses.org

  5. CERN Large Hadron Collider • LHC @ CERN: ATLAS, CMS, LHCb • 10 PB/yr of p-p collision data • Reduced to kB-MBs of binned & unbinned events

  6. Research questions: Higgs properties; physics beyond the Standard Model • Supersymmetry? • Dark matter? • …

  7. RooFit: Collaborative Statistical Modeling

  8. Collaborative Statistical Modeling • RooFit: build models together • Teams: 10-100 physicists • Collaborations: ~3000 → ~100 teams • Exascale collaboration: $10^{15}$ synaptic connections × $10^{3}$ brains = $10^{18}$ (exa) • 1 goal • Pretty impressive to an outsider

  9. Collaborative Statistical Modeling with RooFit. Higgs @ ATLAS: 20k+ nodes, 125k hours. Expression tree of C++ objects for mathematical components (variables, operators, functions, integrals, datasets, etc.). Couple with data, event “observables”. Making RooFit faster (~30x; ~h → ~m) • More efficient collaboration • Faster iteration/debugging • Faster feedback between teams • Next level physics modeling ambitions, retaining interactive workflow (exascale complexity!): 1. Complex likelihood models, e.g. a) Higgs fit to all channels, ~200 datasets, O(1000) parameters, now O(few) hours; b) EFT framework: again 10-100x more expensive 2. Unbinned ML fits with very large data samples 3. Unbinned ML fits with MC-style numeric integrals

  10. Goals and Design: Make fitting in RooFit faster using automated parallel calculation

  11. Making fitting in RooFit faster: how? • Serial: benchmarks show no obvious bottlenecks; RooFit is already highly optimized (pre-calculation/memoization, MPFE) • Parallel

  12. Fitting method: Quasi-Newton minimization (+ handle multiple minima, physics allowed ranges, confidence intervals). Minuit: minimize PDF $f(x; \theta)$ • Quasi-Newton MIGRAD method (figure: left, Newton; right, gradient descent) • Gradient + line-search • gradient for N parameters $\theta$: $\partial f / \partial \theta$ ≈ 2N $f$ calls → parallelize 2N • line-search: descend along gradient direction, 2-3 $f$ calls → parallelize $f$
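
The "2N calls" count on this slide is what a two-point finite-difference gradient costs: two evaluations of the objective per parameter. The sketch below illustrates that count with a fixed-step central difference. It is a stand-in, not Minuit's actual derivative code (which adapts its step sizes); the names `Objective` and `numerical_gradient` and the fixed `step` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical objective type: maps a parameter vector to a scalar.
using Objective = std::function<double(const std::vector<double>&)>;

// Central-difference gradient with a fixed step (an assumption of this sketch).
// Each parameter needs two evaluations of f, so N parameters cost 2N calls.
// The N pairs of calls are independent of each other, which is what makes the
// gradient a natural unit of work to spread over workers.
std::vector<double> numerical_gradient(const Objective& f,
                                       const std::vector<double>& p,
                                       double step,
                                       std::size_t* n_calls = nullptr) {
    assert(step > 0.0);
    std::vector<double> grad(p.size());
    std::size_t calls = 0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        std::vector<double> up = p;
        std::vector<double> down = p;
        up[i] += step;
        down[i] -= step;
        grad[i] = (f(up) - f(down)) / (2.0 * step);
        calls += 2;
    }
    if (n_calls != nullptr) *n_calls = calls;
    return grad;
}
```

Calling `numerical_gradient(f, p, 1e-4, &calls)` on a two-parameter function leaves `calls` at 4. In a parallel version each loop iteration would become one task.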

  13. Faster fitting: (how) can we do it? Levels of parallelism: 1. Gradient (parameter partial derivatives) in minimizer 2. Likelihood ($f$) components 3. Integrals (normalization) & other expensive shared components. [Diagram labels: “Vector”; likelihood: events; likelihood: (unequal) components; integrals etc.]

  14. Faster fitting: (how) can we do it? Heterogeneous: sizes, types • Multiple strategies • How to split up? Small components → need low latency/overhead • Large components as well… • Run time depends on optimizations, differs per parameter, hard to predict • How to divide over cores? Load balancing → task-based approach: work stealing • … both for likelihood-level and gradient-level parallelization. [Diagram labels: likelihood: events; likelihood: (unequal) components; integrals etc.]
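
To see why unequal, hard-to-predict task durations favour a task-based approach, the sketch below compares the total run time (makespan) of a fixed up-front split with a scheme where whichever worker is idle takes the next task. This is a toy model under stated assumptions: durations are known numbers, there is no communication overhead, and pulling from a shared queue stands in for work stealing. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Static split: each worker gets one contiguous block of ceil(n / workers) tasks,
// decided before anything runs. Returns the time at which the slowest worker finishes.
double makespan_static(const std::vector<double>& durations, std::size_t n_workers) {
    assert(n_workers > 0);
    const std::size_t block = (durations.size() + n_workers - 1) / n_workers;
    double longest = 0.0;
    for (std::size_t start = 0; start < durations.size(); start += block) {
        const std::size_t stop = std::min(start + block, durations.size());
        double busy = 0.0;
        for (std::size_t i = start; i < stop; ++i) busy += durations[i];
        longest = std::max(longest, busy);
    }
    return longest;
}

// Dynamic assignment: tasks are handed out in order, each to the worker that
// becomes idle first. Returns the time at which the last worker finishes.
double makespan_dynamic(const std::vector<double>& durations, std::size_t n_workers) {
    assert(n_workers > 0);
    std::vector<double> free_at(n_workers, 0.0);
    for (double d : durations) {
        *std::min_element(free_at.begin(), free_at.end()) += d;
    }
    return *std::max_element(free_at.begin(), free_at.end());
}
```

For durations `{5, 5, 1, 1, 1, 1}` on two workers, the static split puts both long tasks on the same worker and finishes at 11, while the dynamic scheme finishes at 7.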

  15. Design: MultiProcess task-stealing framework • Task-stealing worker pool executes Job tasks • Job = likelihood component, … • No threads, process-based: BidirMMapPipe handles fork, mmap, pipes • Master: main RooFit process, submits Jobs to queue, waits for results (or does other things in between) • Queue loop: act on input from Master or Workers (mainly to avoid loop back to Master on request in Master / user code): collect/distribute Jobs and results • Worker loop: Worker requests Job task → Queue pops task → Worker executes task → Worker sends result … until Job done, then Queue sends results. [Diagram: Master ↔ Queue ↔ Worker 1, Worker 2, ... connected by ipc pipes]
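
The request/pop/execute/send loop described on this slide can be sketched with plain POSIX calls. The example below is a much-simplified illustration, not the RooFit implementation: it uses `fork`, `pipe` and `poll` directly instead of BidirMMapPipe, the parent process plays both the Master and Queue roles, and the "task" is squaring one vector element. All names (`Report`, `run_task`, `worker_loop`, `evaluate_with_worker_pool`) are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Message from worker to queue: the finished task (-1 for the first request) and its result.
struct Report { int task; double value; };

// Stand-in for evaluating one likelihood component.
double run_task(const std::vector<double>& x, int task) { return x[task] * x[task]; }

// Worker loop: report, ask for the next task, execute it; stop on a negative task index.
[[noreturn]] void worker_loop(const std::vector<double>& x, int from_queue, int to_queue) {
    Report report{-1, 0.0};
    for (;;) {
        if (write(to_queue, &report, sizeof report) != static_cast<ssize_t>(sizeof report)) _exit(1);
        int task = -1;
        if (read(from_queue, &task, sizeof task) != static_cast<ssize_t>(sizeof task) || task < 0) _exit(0);
        report = Report{task, run_task(x, task)};
    }
}

std::vector<double> evaluate_with_worker_pool(const std::vector<double>& x, int n_workers) {
    assert(n_workers > 0);
    std::vector<int> to_worker(n_workers), from_fd(n_workers);
    std::vector<pollfd> from_worker(n_workers);
    std::vector<pid_t> pids(n_workers);
    for (int w = 0; w < n_workers; ++w) {
        int down[2], up[2];  // down: queue -> worker, up: worker -> queue
        if (pipe(down) != 0 || pipe(up) != 0) std::abort();
        pids[w] = fork();
        if (pids[w] < 0) std::abort();
        if (pids[w] == 0) {  // child: becomes a worker with its own copy of x
            close(down[1]);
            close(up[0]);
            worker_loop(x, down[0], up[1]);
        }
        close(down[0]);
        close(up[1]);
        to_worker[w] = down[1];
        from_fd[w] = up[0];
        from_worker[w] = pollfd{up[0], POLLIN, 0};
    }

    // Queue loop: whichever worker reports first gets the next task.
    std::vector<double> results(x.size(), 0.0);
    const int n_tasks = static_cast<int>(x.size());
    int next_task = 0;
    int active = n_workers;
    while (active > 0) {
        if (poll(from_worker.data(), from_worker.size(), -1) < 0) std::abort();
        for (int w = 0; w < n_workers; ++w) {
            if (from_worker[w].revents == 0) continue;
            Report report;
            bool keep = false;
            if (read(from_worker[w].fd, &report, sizeof report) == static_cast<ssize_t>(sizeof report)) {
                if (report.task >= 0) results[report.task] = report.value;
                const int reply = next_task < n_tasks ? next_task++ : -1;  // -1 tells the worker to stop
                const bool sent = write(to_worker[w], &reply, sizeof reply) == static_cast<ssize_t>(sizeof reply);
                keep = sent && reply >= 0;
            }
            if (!keep) {  // worker was told to stop (or went away): stop polling it
                from_worker[w].fd = -1;
                --active;
            }
        }
    }
    for (int w = 0; w < n_workers; ++w) {
        close(to_worker[w]);
        close(from_fd[w]);
        waitpid(pids[w], nullptr, 0);
    }
    return results;
}
```

Because workers only receive a new task when they report back, a slow task does not hold up the others. The sketch does not resubmit tasks from a worker that dies, and sends one small message per task, which is exactly the kind of latency and overhead concern raised for small components.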

  16. Parallel performance (MPFE & MP): likelihood fits (unbinned, binned); gradients

  17. Parallel likelihood fits: unbinned, MPFE. Run-time vs N(cores): simple N-dim Gaussian, many events • Before: max ~2x • Now (with CPU pinning fixed): max ~20x (more for larger fits). [Plot: minimization wall time [s] vs number of workers/CPUs; curves: measured (not pinned), measured (CPUs pinned), expected (ideal)]
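
CPU pinning means restricting a process to one core so the scheduler does not migrate it. The slide does not show how RooFit does this; the sketch below is a Linux-specific illustration using `sched_getaffinity` and `sched_setaffinity` from glibc, with hypothetical helper names. It picks the n-th CPU from the set the process is currently allowed to use, rather than assuming CPU numbers start at 0.

```cpp
#include <cassert>
#include <sched.h>

// Number of CPUs the calling process may currently run on, or -1 on error.
int allowed_cpu_count() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof set, &set) != 0) return -1;
    return CPU_COUNT(&set);
}

// Pin the calling process to the n-th CPU (0-based) of its current affinity mask.
// Returns the CPU number it was pinned to, or -1 if n is out of range or the call fails.
int pin_to_allowed_cpu(int n) {
    assert(n >= 0);
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0) return -1;
    int seen = 0;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed)) continue;
        if (seen++ == n) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            return sched_setaffinity(0, sizeof one, &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}
```

In a forked worker pool, each worker would call something like `pin_to_allowed_cpu(worker_index)` right after the fork, before it narrows its own mask.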

  18. Parallel likelihood fits: certain classes of models, e.g. binned fits with Beeston-Barlow modelling of template uncertainties. Run-time vs N(cores): certain types of binned fits. [Plot: CPU time (s) (single core); curves: actual performance (under investigation), expected performance (ideal parallelization)]

  19. Fitting method: Quasi-Newton minimization (+ handle multiple minima, physics allowed ranges, confidence intervals). Minuit: minimize PDF $f(x; \theta)$ • Quasi-Newton MIGRAD method (figure: left, Newton; right, gradient descent) • Gradient + line-search • gradient for N parameters $\theta$: $\partial f / \partial \theta$ ≈ 2N $f$ calls → parallelize 2N • line-search: descend along gradient direction, 2-3 $f$ calls → parallelize $f$ • Important: serial & parallel results same • non-trivial, Minuit internal transformations
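
One example of an internal transformation in Minuit is the sine mapping it applies to parameters with two-sided limits, as described in the Minuit manual: the minimizer works in an unbounded internal variable and converts to the bounded external value when calling the user function. A gradient computed outside Minuit has to account for this mapping to reproduce the serial result. The sketch below shows the mapping and its chain-rule factor; the function names are hypothetical, and whether this is the only transformation the slide has in mind is not stated in the source.

```cpp
#include <cassert>
#include <cmath>

// External (bounded to [lo, hi]) -> internal (unbounded) parameter value.
double to_internal(double external, double lo, double hi) {
    assert(lo < hi && external >= lo && external <= hi);
    return std::asin(2.0 * (external - lo) / (hi - lo) - 1.0);
}

// Internal -> external: always lands inside [lo, hi].
double to_external(double internal, double lo, double hi) {
    assert(lo < hi);
    return lo + 0.5 * (hi - lo) * (std::sin(internal) + 1.0);
}

// d(external)/d(internal): multiply an external-space derivative by this
// factor to obtain the derivative with respect to the internal variable.
double d_external_d_internal(double internal, double lo, double hi) {
    assert(lo < hi);
    return 0.5 * (hi - lo) * std::cos(internal);
}
```

For a parameter limited to [0, 10] with external value 2.5, the internal value is asin(-0.5) and mapping it back returns 2.5 up to rounding.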

  20. Gradient parallelization. First benchmarks: “ggF model” (gluon-gluon fusion → Higgs boson), MIGRAD fit • realistic, non-trivial (265 parameters) • scaling not perfect and erratic (+/- 5s), likely caused by communication protocol, under investigation. Timings: RooMinimizer: 28s; MultiProcess GradMinimizer: 1 worker 33s, 2 workers 20s, 3 workers 15s, 4 workers 14s, 6 workers 17s (…), 8 workers 11s
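
The timings on this slide can be turned into speedup and parallel efficiency figures. The sketch below assumes the flattened timing row pairs up in order (RooMinimizer 28s as the serial reference, then 33s, 20s, 15s, 14s, 17s and 11s for 1, 2, 3, 4, 6 and 8 workers); that pairing is a reading of the extracted text, and the helper names are hypothetical.

```cpp
#include <cassert>

// One benchmark point: number of workers and wall time in seconds.
struct Timing { int workers; double seconds; };

// Speedup relative to a reference run (here: the serial RooMinimizer time).
double speedup(double reference_seconds, double seconds) {
    assert(seconds > 0.0);
    return reference_seconds / seconds;
}

// Parallel efficiency: speedup divided by the number of workers (1.0 = ideal).
double efficiency(double reference_seconds, const Timing& t) {
    assert(t.workers > 0);
    return speedup(reference_seconds, t.seconds) / t.workers;
}
```

With these numbers, 8 workers give a speedup of 28/11 ≈ 2.5 and an efficiency of about 0.32, while a single worker at 33s is slower than the serial run, which shows the fixed overhead of the framework.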

  21. Conclusions • Interactive study of complex LHC physics fits (e.g. Higgs) requires parallelization • We improved scaling performance of likelihood-level parallelization • Bottlenecks still exist for certain classes of models • New flexible framework: multi-level parallelization (likelihood, gradient) • First working version; now analyzing and tuning performance

  22. Let’s stay in touch +31 (0)6 10 79 58 74 egpbos p.bos@esciencecenter.nl linkedin.com/in/egpbos www.esciencecenter.nl blog.esciencecenter.nl

  23. Encore

  24. Future work • Load balancing: PDF timings change dynamically due to RooFit precalculation strategies … not a problem for numerical integrals • Analytical derivatives (automated? CLAD)

  25. Minuit confidence intervals

  26. Numerical integrals • “Analytical” integrals • Forced numerical (Monte Carlo) integrals (Higgs fits didn’t have them)

  27. Numerical integrals. [Plots: individual NI timings (variation in runs and iterations); sum of slowest integrals/cores per iteration over the entire run, maxima and minima (single core total runtime: 3.2s)]

  28. Faster fitting: MultiProcess design • RooFit::MultiProcess::Vector<YourSerialClass>: serial class: likelihood (e.g. RooNLLVar) or gradient (Minuit); interface: subclass + MP; define “vector elements”; group elements into tasks (to be executed in parallel) • RooFit::MultiProcess::SharedArg<T> • RooFit::MultiProcess::TaskManager

  29. Faster fitting: MultiProcess design • RooFit::MultiProcess::Vector<YourSerialClass> • RooFit::MultiProcess::SharedArg<T>: normalization integrals or other shared expensive objects; parallel task definition specific to type of object … design in progress • RooFit::MultiProcess::TaskManager

  30. Faster fitting: MultiProcess design • RooFit::MultiProcess::Vector<YourSerialClass> • RooFit::MultiProcess::SharedArg<T> • RooFit::MultiProcess::TaskManager: queue gathers tasks and communicates with worker pool; workers steal tasks from queue; worker pool: forked processes (BidirMMapPipe): performant and already used in RooFit; no thread-safety concerns, instead: communication concerns … flexible design, implementation can be replaced (e.g. TBB)

  31. MultiProcess for users

          vector<double> x {1, 4, 5, 6.48074};
          xSquaredSerial xsq_serial(x);
          size_t N_workers = 4;
          xSquaredParallel xsq_parallel(N_workers, x);

          // get the same results, but now faster:
          xsq_serial.get_result();
          xsq_parallel.get_result();

          // use parallelized version in your existing functions
          void some_function(xSquaredSerial* xsq);
          some_function(&xsq_parallel); // no problem!

  32. MultiProcess usage for devs. [Class diagram: MP::TaskManager, MP::Job, MP::Vector, serial class, parallelized class]

          template <class T> class MP::Vector : public T, public MP::Job
          class Parallel : public MP::Vector<Serial>
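
The inheritance pattern on this slide (the parallel class derives from both the serial class and a job interface) is what lets a parallelized object be passed wherever the serial one is expected. The sketch below fills in that pattern under loud assumptions: `Job`, `Vector`, `xSquaredSerial` and `xSquaredParallel` are minimal hypothetical definitions written for this example, not the RooFit classes, and `run_all_tasks` simply runs every task in the calling process where a real task manager would hand them to workers.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for MP::Job: work that can be cut into numbered tasks.
class Job {
public:
    virtual ~Job() = default;
    virtual std::size_t n_tasks() const = 0;
    virtual void evaluate_task(std::size_t task) = 0;
};

// Mixin in the shape shown on the slide: Vector<T> is both a T and a Job.
template <class T>
class Vector : public T, public Job {
public:
    template <class... Args>
    explicit Vector(std::size_t n_workers, Args&&... args)
        : T(std::forward<Args>(args)...), n_workers_(n_workers) {}

protected:
    // Placeholder for submitting the Job to a task manager: runs all tasks in turn, in-process.
    void run_all_tasks() {
        for (std::size_t i = 0; i < n_tasks(); ++i) evaluate_task(i);
    }
    std::size_t n_workers_;  // stored for illustration only; unused by this in-process placeholder
};

// Serial class: squares every element in one go.
class xSquaredSerial {
public:
    explicit xSquaredSerial(std::vector<double> x) : x_(std::move(x)), result_(x_.size()) {}
    virtual ~xSquaredSerial() = default;
    virtual std::vector<double> get_result() {
        for (std::size_t i = 0; i < x_.size(); ++i) result_[i] = x_[i] * x_[i];
        return result_;
    }

protected:
    std::vector<double> x_;
    std::vector<double> result_;
};

// Parallelized class: one task per element, same public interface as the serial class.
class xSquaredParallel : public Vector<xSquaredSerial> {
public:
    xSquaredParallel(std::size_t n_workers, std::vector<double> x)
        : Vector<xSquaredSerial>(n_workers, std::move(x)) {}
    std::size_t n_tasks() const override { return x_.size(); }
    void evaluate_task(std::size_t task) override { result_[task] = x_[task] * x_[task]; }
    std::vector<double> get_result() override {
        run_all_tasks();
        return result_;
    }
};
```

Because `xSquaredParallel` is an `xSquaredSerial`, a function declared as `void some_function(xSquaredSerial* xsq)` accepts `&xsq_parallel` unchanged, provided the serial class exposes the relevant methods as virtual (an assumption of this sketch).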
