Using OpenMP for HEP Framework Algorithm Scheduling


  1. Using OpenMP for HEP Framework Algorithm Scheduling
     FERMILAB-SLIDES-19-068-CMS-SCD
     In partnership with: Dr Christopher D Jones, Dr Patrick Gartung
     CHEP 2019, 4 November 2019
     This manuscript has been authored by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics.

  2. Outline
     - Motivation
     - OpenMP Review
     - Demonstrator Frameworks
     - Experiment Setup
     - Results

  3. Motivation
     - Why bother with OpenMP when we are already using Intel's Threading Building Blocks (TBB)?
     - HPC and supercomputing centers traditionally use OpenMP for threading.
     - When communicating with HPC specialists, we are often asked about OpenMP.
     - Utilization of HPC centers for HEP will only increase over time.
     - We need to either use OpenMP or have a clear reason not to.

  4. OpenMP Review
     - OpenMP is an extension to a compiler, not a library; it uses compiler pragma statements.
     - Implementations of features vary considerably across compilers.
     - OpenMP 4.5 constructs used here: omp parallel, omp for, omp task, omp taskloop.

  5. OpenMP Construct: omp parallel
       #pragma omp parallel
       { ... }
     - Starts the threads used in the following block.
     - Once assigned, those threads can only be used by that parallel construct.
     - At the end of the block, the job waits until all assigned threads have finished the block.
     - The number of threads for each parallel block is controlled by the environment variable OMP_NUM_THREADS or by calling omp_set_num_threads.
     - The maximum number of threads for the job is controlled by the environment variable OMP_THREAD_LIMIT.
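
     A minimal, self-contained sketch of the construct described above, assuming a C++ compiler with OpenMP enabled (e.g. -fopenmp); the printed message is purely illustrative:

       // Each thread of the team executes the block once; the job waits at the
       // closing brace until all of them are done.
       #include <cstdio>
       #include <omp.h>

       int main() {
         omp_set_num_threads(4);   // same effect as OMP_NUM_THREADS=4
         #pragma omp parallel
         {
           std::printf("hello from thread %d of %d\n",
                       omp_get_thread_num(), omp_get_num_threads());
         }
         return 0;
       }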

  6. OpenMP Construct: omp for
       #pragma omp for
       for (int i = 0; i < N; ++i) { ... }
     - Distributes the loop iterations to the threads associated with the innermost parallel block.
     - By default, the calling thread waits until all iterations have completed.
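
     A minimal sketch of a worksharing loop inside a parallel region; the doubling of the data is only a stand-in for real work:

       #include <vector>
       #include <omp.h>

       void scaleAll(std::vector<double>& data) {
         const int N = static_cast<int>(data.size());
         #pragma omp parallel     // start a team of threads
         {
           #pragma omp for        // distribute iterations over that team
           for (int i = 0; i < N; ++i) {
             data[i] *= 2.0;      // each index handled by exactly one thread
           }
           // implicit barrier: all iterations finish before threads continue
         }
       }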

  7. OpenMP Construct: nested parallel blocks
     - Support for concurrent nested parallel blocks is implementation defined.
     - Nesting is also controlled by the environment variable OMP_NESTED or by calling omp_set_nested.
     - Nested parallel blocks have as many threads as the outer blocks, until the maximum number of threads is reached.
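
     A sketch of enabling nested regions with the OpenMP 4.5 API (omp_set_nested); whether the inner regions actually receive additional threads is implementation defined:

       #include <cstdio>
       #include <omp.h>

       int main() {
         omp_set_nested(1);        // same effect as OMP_NESTED=true
         omp_set_num_threads(3);   // 3 threads per parallel region
         #pragma omp parallel      // outer team of 3 threads
         {
           #pragma omp parallel    // each outer thread may start its own team of 3
           {
             std::printf("outer thread %d, inner thread %d\n",
                         omp_get_ancestor_thread_num(1), omp_get_thread_num());
           }
         }
         return 0;
       }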

  8. OpenMP Construct: nested parallel blocks — example 1
       omp_set_num_threads(3);
       #pragma omp parallel for
       for (int i = 0; i < 3; ++i) {
         #pragma omp parallel for
         for (int j = 0; j < 3; ++j) {
           doWork(i, j);
         }
       }
     - Maximum of 9 threads per job.
     - The main thread waits until the nested parallel regions have finished.
     (The slide shows a timeline of the main thread, the three i-loop threads, and their nested j-loop threads.)

  9. OpenMP Construct: nested parallel blocks — example 2
     - Same as before, except a maximum of 6 threads per job.
     - Finished threads cannot be used by other parallel blocks.
     (The slide shows the corresponding timeline of the main thread and the nested teams.)

  10. OpenMP Construct: omp task
        #pragma omp task
        { ... }
      - All code in the block is put into a task object.
      - An untied task can be run by any thread of the innermost parallel section.
      - When a task completes, another task can be scheduled on the thread.
      - The new task must be from the same parallel section.
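
      A minimal sketch of task creation and execution; doWork here is a trivial stand-in for real work:

        #include <cstdio>
        #include <omp.h>

        void doWork(int i) {
          std::printf("task %d ran on thread %d\n", i, omp_get_thread_num());
        }

        int main() {
          #pragma omp parallel
          #pragma omp single            // one thread creates the tasks...
          {
            for (int i = 0; i < 8; ++i) {
              #pragma omp task untied firstprivate(i)
              doWork(i);                // ...any thread of the team may run them
            }
            #pragma omp taskwait        // wait for all tasks created above
          }
          return 0;
        }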

  11. OpenMP Construct: omp taskloop
        #pragma omp taskloop
        for (int i = 0; i < N; ++i) { ... }
      - Creates OpenMP tasks for the loop iterations.
      - The calling thread may run other tasks while waiting for all taskloop tasks to end, i.e. implementations may do task stealing.
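
      A minimal sketch of taskloop usage (OpenMP 4.5); the grainsize clause and the doubling of the data are illustrative only:

        #include <vector>
        #include <omp.h>

        void scaleWithTasks(std::vector<double>& data) {
          const int N = static_cast<int>(data.size());
          #pragma omp parallel
          #pragma omp single
          {
            #pragma omp taskloop grainsize(16)   // iterations become tasks
            for (int i = 0; i < N; ++i) {
              data[i] *= 2.0;
            }
            // taskloop waits for its tasks here (no nogroup clause), but the
            // waiting thread may execute other tasks in the meantime.
          }
        }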

  12. Demonstrator Frameworks
      - Created simplified OpenMP, TBB, and single-threaded frameworks.
      - The frameworks can process multiple events concurrently.
      - Work is done via Modules: Modules generate data and put it into events; one Module can depend on data from other Modules.
      - Modules are wrapped in OpenMP or TBB tasks; Module tasks only start once the needed data are available.
      - Modules may use parallel for constructs internally, which allows testing of nested parallelism (see the sketch after this slide).
      - Code available at https://github.com/Dr15Jones/toy-mt-framework
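
      The actual code is in the repository above; the following is only a hypothetical sketch of the pattern this slide describes (a module with internal parallelism, wrapped in a task once its input is ready). The names Event, ClusterModule, produce, and schedule are illustrative, not the toy-mt-framework API:

        #include <vector>
        #include <omp.h>

        struct Event {
          std::vector<double> hits;
          std::vector<double> clusters;
        };

        struct ClusterModule {
          void produce(Event& ev) const {
            const int N = static_cast<int>(ev.hits.size());
            ev.clusters.resize(ev.hits.size());
            #pragma omp taskloop                  // module-internal parallelism
            for (int i = 0; i < N; ++i) {
              ev.clusters[i] = 2.0 * ev.hits[i];  // stand-in for real work
            }
          }
        };

        // Wrap the module in a task; in the real framework this would only be
        // created once the data the module depends on is available.
        void schedule(Event& ev, const ClusterModule& m) {
          #pragma omp task shared(ev, m)
          m.produce(ev);
        }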

  13. Experimental Setup
      - Compiled the TBB and OpenMP frameworks with gcc 8 and clang 7, which have very different OpenMP 4.5 implementations.
      - Created a Module call graph that emulates CMS reconstruction: same module dependencies, with module run times taken from 100 different events.
      - The experiment varied: the number of threads; the number of concurrent events (== number of threads); the number of events processed in a job (= number of threads * 100); and the amount of module-internal parallelism.
      - Measurements were done on an Intel KNL machine.

  14. Module Perfect Parallelism
      - All modules are concurrent capable.
      - TBB results using gcc and clang are identical.
      - Also ran as many single-threaded jobs as the number of threads.
      - OpenMP and TBB have the same results.
      (Plot: event throughput (ev/sec) vs. number of threads & concurrent events, 0-256, for TBB, OpenMP clang, OpenMP gcc, and N single-threaded jobs.)

  15. One Serial Module with No Internal Parallelism
      - Simulate the behavior of output by serializing event access to the output module.
      - All other modules are as before.
      - Jobs quickly hit the Amdahl's law limit.
      (Plot: event throughput (ev/s) vs. number of threads & concurrent events, 0-32, for TBB, OpenMP clang, OpenMP gcc, and N single-threaded jobs.)

  16. Serial Module with Internal Parallelism: Task Stealing
      - Allow the output module to use parallelism: a for loop with 100 iterations.
      - TBB uses tbb::parallel_for, which does task stealing by default.
      - OpenMP uses taskloop: clang does task stealing, gcc does not.
      - Task stealing hurts throughput. (The two loop constructs are sketched after this slide.)
      (Plot: event throughput (ev/s) vs. number of threads & concurrent events, 0-256, for TBB, OpenMP clang, OpenMP gcc, and N single-threaded jobs.)
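
      An illustrative comparison of the two internal-parallelism choices named on this slide; writeChunk is a hypothetical stand-in for the output module's work, and both functions are assumed to be called from within an already-running task or parallel region:

        #include <cstdio>
        #include <tbb/parallel_for.h>
        #include <omp.h>

        void writeChunk(int i) { std::printf("chunk %d\n", i); }  // stand-in

        // TBB: tbb::parallel_for steals tasks by default, so the calling
        // thread may pick up unrelated work while it waits.
        void outputWithTBB() {
          tbb::parallel_for(0, 100, [](int i) { writeChunk(i); });
        }

        // OpenMP: taskloop; whether the waiting thread steals other tasks
        // differs between the clang and gcc runtimes.
        void outputWithTaskloop() {
          #pragma omp taskloop
          for (int i = 0; i < 100; ++i) {
            writeChunk(i);
          }
        }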

  17. Serial Module with Internal Parallelism: No Task Stealing
      - Make all versions avoid task stealing: TBB uses arenas, OpenMP uses omp for (the only way in the API to guarantee no stealing).
      - For each (maximum) number of threads, ran many jobs varying omp_set_num_threads and chose the value with the highest throughput.
      - Even when picking the best working point for OpenMP, TBB's automatic behavior gives the best results. (The two no-stealing approaches are sketched after this slide.)
      (Plot: event throughput (ev/s) vs. number of threads & concurrent events, 0-256, for TBB, OpenMP clang & gcc, and N single-threaded jobs.)
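
      Illustrative no-stealing variants of the same internal loop; writeChunk is again a hypothetical stand-in and the team/arena size of 4 is arbitrary:

        #include <cstdio>
        #include <tbb/parallel_for.h>
        #include <tbb/task_arena.h>
        #include <omp.h>

        void writeChunk(int i) { std::printf("chunk %d\n", i); }  // stand-in

        // TBB: run the loop inside its own arena so its tasks stay isolated
        // from the rest of the job.
        void outputWithArena() {
          static tbb::task_arena arena(4);
          arena.execute([] {
            tbb::parallel_for(0, 100, [](int i) { writeChunk(i); });
          });
        }

        // OpenMP: a parallel region with a worksharing omp for; the loop is
        // bound to that team, so its threads cannot steal unrelated tasks.
        void outputWithOmpFor() {
          #pragma omp parallel num_threads(4)
          {
            #pragma omp for
            for (int i = 0; i < 100; ++i) {
              writeChunk(i);
            }
          }
        }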

  18. Conclusion
      - It is possible to create a HEP framework using OpenMP, but our investigation finds it would be less optimal than using TBB.
      - Compiler implementation variations make portable performance hard: gcc's taskloop does not do task stealing; clang's taskloop does task stealing with no way to disable it.
      - OpenMP has composability difficulties: parallel blocks do not share threads; nested parallelism uses a fixed allocation of threads; it is very hard to tune how many threads to use at each nested parallel level.

  19. Backup Slides

  20. Task Stealing Problem
        #pragma omp taskloop
        for (int i = 0; i < 2; ++i) {
          doWork(i);
        }
        makeTasks();
      - E.g. a waiting thread steals a long-running task.
      - makeTasks() cannot start until the stolen task finishes.
      (The slide shows a timeline of the main thread running doWork(i), the stolen task, and the delayed makeTasks().)

  21. Scanning Job Results for omp for Usage
      - A selection of throughput vs. omp_set_num_threads plots.
      - The maximum number of threads == number of concurrent events was kept for each measurement.
      (Plots: throughput vs. omp_set_num_threads for gcc8 and clang7, at max threads 32, 48, 64, 96, and 128.)
