
Testing AutoFDO for Geant4

Nathalie Rauschmayr, IT-CF-FPP



  1. Testing AutoFDO for Geant4. Nathalie Rauschmayr, IT-CF-FPP, with help from Benedikt Hegner and Shahzad Malik Muzaffar.

  5. Introduction. Idea: autotuning as a loop of compile, run, and feedback into the next compile. The concept has existed for some time as Profile Guided Optimization.

  7. Introduction. Why it helps to improve performance: LHC code consists of a lot of branches and dependencies. [Figure: example from Geant4, G4MTRunManager::InitializePhysics()]

  8. Introduction. Profile Guided Optimization is useful for:
     • Code that contains a lot of branches that are difficult to predict at compile time
     • Performance-sensitive code
     • Running the same code over and over again

  9. Introduction. Profile Guided Optimization:
     • Uses profiling to improve runtime performance
     • Analyses code sections that are frequently executed
     • Based on the profiles, the compiler might change:
       • Inlining
       • Virtual call speculation
       • Register allocation
       • Basic block optimization
       • Function layout
       • Conditional branch optimization
       • Dead code separation
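
As a toy illustration of the function-layout and dead-code-separation items, the sketch below orders functions by sample count and splits them into a hot and a cold group. The function names, the sample counts, and the 1% threshold are all made up for illustration; a real compiler works on basic blocks and call graphs rather than on a flat table like this.

```python
def split_hot_cold(samples, cold_fraction=0.01):
    """Order functions by sample count and split them into hot and cold lists.

    samples: dict mapping function name -> number of profile samples.
    cold_fraction: functions with less than this share of all samples
                   are treated as cold (hypothetical threshold).
    """
    total = sum(samples.values())
    ordered = sorted(samples, key=lambda name: samples[name], reverse=True)
    hot = [f for f in ordered if total and samples[f] / total >= cold_fraction]
    cold = [f for f in ordered if f not in hot]
    return hot, cold


# Hypothetical sample counts, not measured data.
profile = {
    "init_physics": 40,
    "track_step": 9000,
    "compute_xsec": 950,
    "error_path": 10,
}
hot, cold = split_hot_cold(profile)
```

With these numbers the two functions that dominate the samples end up in the hot list, and the initialization and error-handling functions are placed in the cold list.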

  10. Introduction. Two approaches for Profile Guided Optimization (PGO):
     • Modify the binary (instrumentation)
     • Monitor the unaltered binary (sampling with perf)
       • AutoFDO transforms perf profiles into a format that gcc/clang can use for Feedback Directed Optimization (FDO)
       • Developed by Google: https://github.com/google/autofdo

  12. Difference between sampling and instrumentation. Instrumentation-based PGO:
     • Instrumentation: gcc -fprofile-generate test.c -o test
     • Run the instrumented binary (profile files: test.gcno, test.gcda)
     • Recompile: gcc -fprofile-use test.c -o test
     • Production environment
     Disadvantages:
     • Tedious dual compilation
     • Produces a lot of small output files (in the case of Geant4: 1698 files, each smaller than 100 KB)
     • Cannot easily be run in a production environment
     • The instrumented binary might be significantly slower
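
To see how many profile files an instrumented run leaves behind (the slide reports 1698 for Geant4), something like the following could be used. It is a minimal sketch: the helper name is made up, and it only assumes that the run writes files with the .gcda suffix somewhere under a build directory.

```python
from pathlib import Path


def summarize_gcda(build_dir):
    """Return (file count, total bytes, largest file in bytes) for *.gcda files."""
    # rglob walks the directory tree recursively.
    sizes = [p.stat().st_size for p in Path(build_dir).rglob("*.gcda")]
    return len(sizes), sum(sizes), max(sizes, default=0)
```

Calling `summarize_gcda("build")` on a hypothetical build directory returns the three numbers as a tuple; an empty or profile-free directory gives `(0, 0, 0)`.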

  14. Difference between sampling and instrumentation. Sampling-based FDO (AutoFDO):
     • Create production binary: gcc -O3 -ggdb -frecord-compilation-info-in-elf -D DEBUG test.c -o test
     • Run production binary with perf: perf record -b -e cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=1000009/pp ./test
     • Convert perf profile: create_gcov --binary=./test --profile=perf.data --gcov=test.gcov -gcov_version=1
     • Recompile with converted perf profile: gcc -O3 -fauto-profile=test.gcov test.c -o test
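
The four steps can be collected into a small driver. The sketch below only builds the command lines as argument lists and does not run them. The function name and its parameters are invented here; the underscores in the event name, in create_gcov and in -gcov_version are restored on the assumption that the slide extraction dropped them; and a single profile file name is used for both the convert and the recompile step. Running the commands for real assumes perf, AutoFDO's create_gcov, and a gcc build that accepts the flags shown on the slide.

```python
def autofdo_commands(source="test.c", binary="test", profile="test.gcov"):
    """Return the AutoFDO steps as a list of (label, argv) pairs."""
    # Hardware event string as given on the slide (taken-branch sampling).
    event = ("cpu/event=0xc4,umask=0x20,"
             "name=br_inst_retired_near_taken,period=1000009/pp")
    exe = "./" + binary
    return [
        ("create production binary",
         ["gcc", "-O3", "-ggdb", "-frecord-compilation-info-in-elf",
          "-D", "DEBUG", source, "-o", binary]),
        ("run production binary with perf",
         ["perf", "record", "-b", "-e", event, exe]),
        ("convert perf profile",
         ["create_gcov", "--binary=" + exe, "--profile=perf.data",
          "--gcov=" + profile, "-gcov_version=1"]),
        ("recompile with converted profile",
         ["gcc", "-O3", "-fauto-profile=" + profile, source, "-o", binary]),
    ]


# Dry run: print each step instead of executing it.
for label, argv in autofdo_commands():
    print(label + ": " + " ".join(argv))
```

Passing each `argv` to `subprocess.run(argv, check=True)` would execute the steps in order; keeping them as lists avoids shell quoting issues with the event string.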

  15. Difference between sampling and instrumentation. AutoFDO compared to instrumentation-based PGO:
     • Profile data can be obtained in the production environment
     • Works on optimized builds
     • Provides a tool to merge profiles from multiple runs
     • Only one output file per run
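
The merge tool mentioned in this list works on AutoFDO's own profile format. The sketch below only illustrates the idea of merging, by summing per-location sample counts from several runs; the dictionary layout and the location keys are invented and do not reflect the actual file format.

```python
from collections import Counter


def merge_profiles(*runs):
    """Sum sample counts per code location across several profiling runs."""
    merged = Counter()
    for run in runs:
        merged.update(run)  # Counter.update adds counts instead of replacing them
    return dict(merged)


# Hypothetical (function, line offset) -> sample count tables from two runs.
run_a = {("track_step", 3): 120, ("compute_xsec", 1): 40}
run_b = {("track_step", 3): 80, ("init_physics", 7): 2}
merged = merge_profiles(run_a, run_b)
```

Locations seen in both runs get their counts added, while locations seen in only one run are carried over unchanged.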

  16. General caveats:
     • The sample needs to be representative of the typical usage scenarios
       • Otherwise PGO could possibly slow down performance
     • Many profiles and runs are needed
     • Unbiased branches
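
The point about unbiased branches can be made concrete with a toy calculation: a branch that goes either way about half the time gives the compiler little to optimize for, however good the profile is. In the sketch below the function name, the counts, and the 0.9 cut-off are illustrative assumptions.

```python
def branch_bias(taken, not_taken, cutoff=0.9):
    """Classify a branch from its profile counts.

    Returns (probability that the branch is taken, label), where the label is
    "biased" if one direction reaches the cutoff share and "unbiased" otherwise.
    """
    total = taken + not_taken
    if total == 0:
        return None, "no samples"
    p_taken = taken / total
    label = "biased" if max(p_taken, 1 - p_taken) >= cutoff else "unbiased"
    return p_taken, label
```

For example, `branch_bias(990, 10)` is labelled biased, while `branch_bias(50, 50)` is labelled unbiased.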

  17. Testcases. Applications:
     • CMS Detector Simulation (FullCMS)
     • Simulation step of CMSSW using a static build of Geant4 (cmsRun)
     Input data/workflow needs to be representative:
     • How many events are needed as training data?
     • What if the job configuration changes?
     • What if the job type changes?

  18. Testcases.

     | Training data  | Run            | Number of events |
     |----------------|----------------|------------------|
     | FullCMS run    | FullCMS run    | 100, 500, 1k     |
     | cmsRun config1 | cmsRun config1 | 20, 50, 100      |
     | cmsRun config1 | cmsRun config2 | 20, 50, 100      |
     | cmsRun config2 | cmsRun config2 | 20, 50, 100      |
     | FullCMS run    | cmsRun config2 | 1k               |

     • FullCMS: Geant4 example with particle gun
     • cmsRun config1: TTbar event generation and simulation (CMSSW_7_3_1)
     • cmsRun config2: Wjets event generation and simulation (CMSSW_7_3_1)
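
The test matrix on this slide can be written down as data, which makes it easy to enumerate every (training, run, events) combination. This is only a bookkeeping sketch; the tuple layout and names are choices made here, and 1k is written as 1000.

```python
# (training workload, run workload, event counts used for the runs)
TEST_MATRIX = [
    ("FullCMS", "FullCMS", [100, 500, 1000]),
    ("cmsRun config1", "cmsRun config1", [20, 50, 100]),
    ("cmsRun config1", "cmsRun config2", [20, 50, 100]),
    ("cmsRun config2", "cmsRun config2", [20, 50, 100]),
    ("FullCMS", "cmsRun config2", [1000]),
]


def expand(matrix):
    """Yield one (training, run, n_events) tuple per measurement."""
    for training, run, event_counts in matrix:
        for n_events in event_counts:
            yield training, run, n_events


cases = list(expand(TEST_MATRIX))
```

Each entry in `cases` corresponds to one run of the table, before multiplying by the different training-set sizes used on the later slides.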

  19. CMS Full Detector Simulation.

     | Training data          | Run     | Number of events |
     |------------------------|---------|------------------|
     | FullCMS run 100 events | FullCMS | 100, 500, 1k     |
     | FullCMS run 500 events | FullCMS | 100, 500, 1k     |
     | FullCMS run 1k events  | FullCMS | 100, 500, 1k     |

     [Figure: two bar charts, "Processing 100 events" and "Processing 500 events", runtime in s for Normal and for AutoFDO trained on 100, 500 and 1000 events. The AutoFDO bars carry labels between -8.9% and -11.5%.]

  20. CMS Full Detector Simulation. [Figure: bar chart "Processing 1000 events", runtime in s (axis from 1,150 to 1,400) for Normal and for AutoFDO trained on 100, 500 and 1000 events. The AutoFDO bars carry the labels -10.3%, -10.7% and -11.4%.]
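
The percentage labels in these plots are presumably relative runtime changes with respect to the normal build. A minimal sketch of that calculation follows; the function name is made up, and the two runtimes in the example are hypothetical round numbers, not values read off the plots.

```python
def relative_change_percent(t_normal, t_optimized):
    """Relative runtime change of the optimized build versus the normal build, in percent.

    Negative values mean the optimized build is faster.
    """
    return (t_optimized - t_normal) / t_normal * 100.0


# Hypothetical runtimes in seconds.
change = relative_change_percent(1400.0, 1260.0)
print(f"{change:+.1f}%")  # -10.0%
```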

  21. Simulation step of CMSSW using BigProducts. Used CMSSW_7_3_1:
     • SLC6, kernel 3.16
     • gcc 4.8
     • Uses BigProducts by default (developed by Shahzad)
     • pluginSimulation.so: linked against static Geant4 libraries
     • Obtain the perf profile for cmsRun, but then optimize only pluginSimulation.so
     Testcase: TTbar
     • Step 1: event generation and simulation
     • 20, 50, 100 events

  22. Simulation step of CMSSW using BigProducts.

     | Training data             | Run            | Number of events |
     |---------------------------|----------------|------------------|
     | cmsRun 20 events config1  | cmsRun config1 | 20, 50, 100      |
     | cmsRun 50 events config1  | cmsRun config1 | 20, 50, 100      |
     | cmsRun 100 events config1 | cmsRun config1 | 20, 50, 100      |

     [Figure: two bar charts, "Processing 20 events" and "Processing 50 events", runtime in s for Normal and for AutoFDO trained on 20, 50 and 100 events. The AutoFDO bars carry labels between -6.1% and -8.4%.]

  23. Simulation step of CMSSW using BigProducts. [Figure: bar chart "Processing 100 events", runtime in s (axis from 2,500 to 2,700) for Normal and for AutoFDO trained on 20, 50 and 100 events. The AutoFDO bars carry the labels -6.5%, -7.4% and -7.4%.]

  24. Simulation step of CMSSW using BigProducts. cmsRun config2: took Pythia configurations from Wjet_Pt_3000_3500_14TeV_cfi.py in CMSSW_8_1_X. Training data: cmsRun 100 events config1; run: cmsRun config2; number of events: 20, 50, 100. [Figure: three bar charts, "Processing 20 events", "Processing 50 events" and "Processing 100 events", runtime in s for Normal and for AutoFDO trained on 100 events. The AutoFDO bars carry the labels -8.9%, -12.5% and -11.9%.]

  25. Simulation step of CMSSW using BigProducts.

     | Training data             | Run            | Number of events |
     |---------------------------|----------------|------------------|
     | cmsRun 100 events config1 | cmsRun config2 | 20, 50, 100      |
     | cmsRun 100 events config2 | cmsRun config2 | 20, 50, 100      |

     [Figure: three bar charts, "Processing 20 events", "Processing 50 events" and "Processing 100 events", runtime in s for Normal and for two AutoFDO builds, each trained on 100 events. The AutoFDO bars carry the labels -8.9%, -12.5%, -11.9%, -9.2%, -8.5% and -7.3%.]

  26. Simulation step of CMSSW using BigProducts. Training data: fullcms 100 events; run: cmsRun job config2; number of events: 20, 50, 100. [Figure: three bar charts, "Processing 20 events", "Processing 50 events" and "Processing 100 events", runtime in s for Normal and for AutoFDO trained on 100 events. The AutoFDO bars carry the labels -3.8%, -4.8% and -5.1%.]
