
SC’20 Booth Talk Series
LB4OMP: A Load Balancing Portfolio for OpenMP
Jonas H. Müller Korndörfer, PhD Student
University of Basel, Switzerland

Outline
⟡ Motivation
⟡ History
⟡ LB4OMP
⟡ Scheduling Techniques
⟡ Performance Measurement Features
⟡ Usage
⟡ Performance Evaluation
⟡ Take Home Messages

Florina Ciorba, Ali Mohammed, Ahmed Eleliemy, Jonas H. Müller Korndörfer
hpc.dmi.unibas.ch
MLS: Multilevel Scheduling in Large Scale High Performance Computers

LB4OMP: A Load Balancing Portfolio for OpenMP, J. H. Müller Korndörfer, SC’20 OpenMP Booth Talk

Motivation
⟡ Load imbalance in OpenMP codes
  ⟡ Lowers performance
  ⟡ Wastes resources and energy
  ⟡ Increases waiting times in job queues
⟡ Implementations of the state-of-the-art load balancing techniques from the literature are missing
  ⟡ Hinders research on novel load balancing algorithms
  ⟡ Hinders performance optimization

History
IWOMP 2018: “OpenMP loop scheduling revisited: making a case for more schedules”, Ciorba, Florina M., Christian Iwainsky, and Patrick Buder.
⟡ GNU OpenMP runtime library
ISPDC 2019: “Exploring loop scheduling enhancements in OpenMP: an LLVM case study”, Kasielke, Franziska, Ronny Tschüter, Christian Iwainsky, Markus Velten, Florina M. Ciorba, and Ioana Banicescu.
⟡ LLVM OpenMP runtime library
SC19 poster: “A Runtime Approach for Dynamic Load Balancing of OpenMP Parallel Loops in LLVM”, Müller Korndörfer, Jonas H., Ciorba, Florina M., Yilmaz, A., Iwainsky, C., Doerfert, J., Finkel, H., Kale, V. and Klemm, M.
⟡ LLVM OpenMP runtime library
Today: LB4OMP
⟡ LLVM OpenMP runtime library
⟡ Swiss Army knife for load balancing in OpenMP

In a Nutshell
⟡ LB4OMP: A Load Balancing Portfolio for OpenMP
⟡ Bridges the gap between the state-of-the-art load balancing literature and the state of the practice in multithreaded applications
⟡ Enhanced LLVM OpenMP runtime library
⟡ Dynamic and non-adaptive self-scheduling techniques (9)
⟡ Dynamic and adaptive self-scheduling techniques (8)
⟡ Performance measurement features
✓ LB4OMP: github.com/unibas-dmi-hpc/LB4OMP

LB4OMP: Scheduling Techniques
⟡ Static scheduling technique (1)
  ⟡ Static → OpenMP standard
⟡ Dynamic and non-adaptive self-scheduling techniques (9)
  ⟡ SS: Dynamic, 1 → OpenMP standard
  ⟡ GSS: Guided → OpenMP standard
  ⟡ FSC: Fixed size chunk → requires profiling
  ⟡ TSS: Trapezoid self-scheduling → LLVM
  ⟡ FAC: Factoring → requires profiling
  ⟡ mFAC: modified implementation of FAC → requires profiling
  ⟡ FAC2: practical variant of factoring
  ⟡ TAP: Tapering → requires profiling
  ⟡ WF2: practical variant of weighted factoring
⟡ Dynamic and adaptive self-scheduling techniques (8)
  ⟡ BOLD → requires profiling
  ⟡ AWF-B,C,D,E: Adaptive weighted factoring and its variants
  ⟡ AF: Adaptive factoring
  ⟡ mAF: modified implementation of AF
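To make the difference between two of the non-adaptive techniques concrete, the sketch below computes the sequence of chunk sizes they would hand out. It uses the common textbook formulations (GSS: each chunk is the remaining iterations divided by the thread count, rounded up; FAC2: each batch hands out one equal chunk per thread, together covering about half of the remaining iterations). This is an assumption for illustration, not LB4OMP's actual implementation, and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return (a + b - 1) // b


def gss_chunks(n_iters, n_threads):
    """Guided self-scheduling (textbook form): chunk = ceil(remaining / P)."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        size = ceil_div(remaining, n_threads)
        chunks.append(size)
        remaining -= size
    return chunks


def fac2_chunks(n_iters, n_threads):
    """FAC2 (textbook form): per batch, P chunks of size ceil(remaining / (2P))."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        size = ceil_div(remaining, 2 * n_threads)
        for _ in range(n_threads):
            if remaining == 0:
                break
            take = min(size, remaining)  # the last chunk may be smaller
            chunks.append(take)
            remaining -= take
    return chunks
```

For 100 iterations on 4 threads, `gss_chunks(100, 4)` starts with a chunk of 25 and shrinks after every request, whereas `fac2_chunks(100, 4)` starts with four chunks of 13 and halves the chunk size only between batches.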

LB4OMP: Performance Measurement Features
⟡ Each thread: loop occurrence, location, iterations, thread ID, thread execution time
⟡ Each parallel loop: location, iterations, parallel loop execution time
⟡ Each scheduling round: location, lower bound, upper bound, chunk size, thread ID
  Warning: it can produce very large files!
⟡ Profiling: location, mean iteration execution time, standard deviation of the execution time of all iterations

LB4OMP: Usage
⟡ Basic configuration
  ⟡ Do the target OpenMP loops in the application contain the schedule(runtime) clause? If yes, no recompilation is required
  ⟡ Add the path to the compiled LB4OMP to the linker environment variable
    In Linux: LD_LIBRARY_PATH=path/LB4OMP
  ⟡ OMP_SCHEDULE=technique,chunk
  ⟡ KMP_CPU_SPEED=clock frequency in MHz
    In Linux: cat /proc/cpuinfo
⟡ Performance measurement features configuration
  ⟡ Each thread: KMP_TIME_LOOPS=path/file
  ⟡ Each parallel loop: KMP_TIME_LOOPS=path/file
  ⟡ Each scheduling round: KMP_PRINT_CHUNKS=1
  ⟡ Profiling: OMP_SCHEDULE=profiling, KMP_PROFILE_DATA=path/file
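The basic configuration above amounts to setting a handful of environment variables before launching the application. A minimal sketch of a launcher helper follows; only the variable names (OMP_SCHEDULE, KMP_CPU_SPEED, LD_LIBRARY_PATH, KMP_TIME_LOOPS, KMP_PRINT_CHUNKS) come from the slide. The function name `lb4omp_env`, its parameters, and the choice to prepend rather than overwrite LD_LIBRARY_PATH are assumptions made for illustration.

```python
import os


def lb4omp_env(technique, chunk, cpu_mhz, lib_path,
               time_loops_file=None, print_chunks=False, base_env=None):
    """Build an environment dict for running an OpenMP application with LB4OMP.

    `technique` must be a schedule name accepted by the installed library;
    this helper does not validate it.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_SCHEDULE"] = f"{technique},{chunk}"
    env["KMP_CPU_SPEED"] = str(cpu_mhz)  # clock frequency in MHz
    # Prepend so the LB4OMP build is found before any other OpenMP runtime.
    old = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{lib_path}:{old}" if old else lib_path
    if time_loops_file is not None:
        env["KMP_TIME_LOOPS"] = time_loops_file  # timing output file
    if print_chunks:
        env["KMP_PRINT_CHUNKS"] = "1"  # warning: can produce very large files
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run`, which makes it straightforward to sweep over several techniques without recompiling, provided the target loops use schedule(runtime).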

Performance Evaluation
⟡ Application
  ⟡ SPHYNX, executed 5 times for each configuration, available at astro.physik.unibas.ch/people/ruben-cabezon/sphynx.html
  ⟡ 2 main OpenMP loops, each with 1,000,000 iterations, executed 20 times for each SPHYNX execution
    ⟡ L0, find neighbours
    ⟡ L1, gravity calculation
⟡ Node types
  ⟡ Type A, Intel Broadwell E5-2640 v4 (2 sockets, 10 cores each)
  ⟡ Type B, Intel Xeon Phi KNL 7210 (1 socket, 64 cores)
  ⟡ Type C, Intel Xeon E5-2690 v3 (1 socket, 12 cores)
⟡ Metrics
  ⟡ Parallel execution time
  ⟡ Parallel execution time per loop
  ⟡ Loss of performance compared to Best, the best combination of scheduling techniques per loop
SPHYNX and LB4OMP were compiled with Intel compiler version 19.0.1.144.
The threads were always configured with OMP_PLACES=cores and OMP_PROC_BIND=close.
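The third metric can be sketched as follows. The slides do not give a formula, so this assumes that Best selects the fastest technique for each loop independently and that the loss of a single technique is the relative increase of its total loop time over the Best total. The function names are hypothetical.

```python
def best_combination(times):
    """times: {technique: {loop: seconds}}.

    Returns ({loop: fastest technique}, total seconds of that combination).
    """
    loops = list(next(iter(times.values())).keys())
    choice = {loop: min(times, key=lambda t: times[t][loop]) for loop in loops}
    total = sum(times[choice[loop]][loop] for loop in loops)
    return choice, total


def loss_vs_best(times):
    """Percentage by which each single technique is slower than Best."""
    _, best_total = best_combination(times)
    return {
        technique: 100.0 * (sum(per_loop.values()) - best_total) / best_total
        for technique, per_loop in times.items()
    }
```

With made-up timings such as `{"tech_a": {"L0": 4.0, "L1": 6.0}, "tech_b": {"L0": 5.0, "L1": 5.0}}` (placeholders, not measurements from the talk), Best pairs tech_a on L0 with tech_b on L1, and each technique used alone is slower than that combination.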

Performance Evaluation (20 threads)
[Figure legend: OpenMP standard, LLVM, LB4OMP, Best combination, most time-consuming loop, best-performing scheduling technique per loop]
⟡ Best is a combination of scheduling techniques
  ⟡ In this case, FSC achieves the highest performance alone
⟡ Performance degradation (xx.xx%) from executing the application with a single scheduling technique
⟡ Best achieves up to 13.32% higher performance than the best standard technique, in this case GSS
⟡ mAF alone achieves up to 9.59% higher performance than the best standard technique, in this case GSS

Performance Evaluation (64 threads)
[Figure legend: OpenMP standard, LLVM, LB4OMP, Best combination, most time-consuming loop, best-performing scheduling technique per loop]
⟡ Best is a combination of scheduling techniques
  ⟡ In this case, FSC and mAF
⟡ This time the performance of FSC alone is 23.03% lower than Best
⟡ AF alone achieves up to 0.75% higher performance than the best standard technique, in this case GSS
⟡ The adaptive techniques achieve comparable or higher performance than the best standard technique, in this case GSS

Performance Evaluation (12 threads)
[Figure legend: OpenMP standard, LLVM, LB4OMP, Best combination, most time-consuming loop, best-performing scheduling technique per loop]
⟡ Best is a combination of scheduling techniques
  ⟡ In this case, FSC and mAF
⟡ FSC alone is practically the best, achieving only 0.01% lower performance than Best
⟡ Best achieves up to 12.89% higher performance than the best standard technique alone, in this case GSS
⟡ mAF alone achieves up to 10.67% higher performance than the best standard technique, in this case GSS
⟡ The performance of the adaptive techniques remains constant across different platforms

Take Home Messages
⟡ The LB4OMP portfolio bridges the gap between the load balancing literature and practice in multithreaded applications
⟡ LB4OMP contains 14 ready-to-use dynamic (and adaptive) self-scheduling techniques in addition to those in the OpenMP standard
⟡ First and necessary step towards an auto-tuning load balancing approach in OpenMP
⟡ Loops frequently differ, presenting divergent load balancing needs
⟡ The Best combination of scheduling techniques frequently outperforms the use of a single technique
⟡ It is impractical to achieve Best only by experimentation
⟡ The dynamic and adaptive self-scheduling techniques are a promising alternative for achieving performance close to Best
⟡ What’s next?
  ⟡ Patch and upstream the scheduling techniques to LLVM
  ⟡ Ongoing work on an automated approach to achieve Best performance
LB4OMP: github.com/unibas-dmi-hpc/LB4OMP

openmp.org: OpenMP API specs, forum, reference guides, and more
link.openmp.org/sc20: Videos and PDFs of OpenMP SC’20 presentations
