Argobots and its Application to Charm++
Sangmin Seo
Assistant Computer Scientist, Argonne National Laboratory
sseo@anl.gov
April 19, 2016
Charm++ Workshop 2016

Argo Concurrency Team
• Argonne National Laboratory (ANL)
  – Pavan Balaji (co-lead)
  – Sangmin Seo
  – Abdelhalim Amer
  – Marc Snir
  – Pete Beckman (PI)
• University of Illinois at Urbana-Champaign (UIUC)
  – Laxmikant Kale (co-lead)
  – Prateek Jindal
  – Jonathan Lifflander
• University of Tennessee, Knoxville (UTK)
  – George Bosilca
  – Thomas Herault
  – Damien Genet
• Pacific Northwest National Laboratory (PNNL)
  – Sriram Krishnamoorthy
Past Team Members:
• Cyril Bordage (UIUC)
• Esteban Meneses (University of Pittsburgh)
• Huiwei Lu (ANL)
• Yanhua Sun (UIUC)

Massive On-node Parallelism
• The number of cores is increasing
• Massive on-node parallelism is inevitable
• Existing solutions do not effectively deal with such parallelism with respect to on-node threading/tasking systems or with respect to off-node communication in the presence of such tasks/threads
• How to exploit?
[Figure: core-level parallelism]

Shortcomings today? Pthreads (1/2)
Nesting

    int in[1000][1000], out[1000][1000];

    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        petsc_voodoo(i);
    }

    void petsc_voodoo(int x)
    {
        #pragma omp parallel for
        for (int j = 0; j < 1000; j++)
            out[x][j] = cosine(in[x][j]);
    }

[Figure: execution time for 36 threads in the outer loop. Series: GCC/pthreads, GCC/Argobots ULTs, GCC/Argobots tasks. x-axis: # OMP threads | Argobots ULTs/tasks (inner loop), 1 to 35. y-axis: time (s), 0.0 to 3.5; lower is better.]

Why is traditional OpenMP’s performance so bad? The compiler cannot analyze petsc_voodoo to know whether the function might ever block or yield, so it has to assume that it might. Each nested parallel region therefore needs its own stack so that it can be suspended and resumed. Creating additional Pthreads for each nesting is the simplest way to achieve this.

Shortcomings today? Pthreads (2/2)
• Tasks of the application are mapped to a group of Pthreads
• How about these communications? Wait or context switch?
• Work units intermixed with blocking calls (such as communication calls) can cause idle cores
• Need lightweight mechanisms to switch tasks!
[Figure: application tasks (computation and communication) are mapped and scheduled onto Pthreads, which run on cores (C)]

Outline
• Background
• Argobots
• Charm++ with Argobots
• Other Programming Models
• Summary

User-Level Threads (ULTs)
• What is a user-level thread (ULT)?
  – Provides thread semantics in user space
  – Execution model: cooperative timesharing
    • More than one ULT can be mapped to a single kernel thread
    • ULTs on the same OS thread do not execute concurrently
• Where to use?
  – To better overlap computation and communication/IO
  – To exploit fine-grained task parallelism
[Figure: timeline of ULT 1 and ULT 2 alternating through context switches; ULTs are mapped onto kernel threads, which run on cores]
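The cooperative timesharing described on this slide can be illustrated with a minimal sketch. It does not use Argobots; it assumes a Linux/glibc system where the legacy `<ucontext.h>` calls (`getcontext`, `makecontext`, `swapcontext`) are available. All names (`run_ults`, `ult_yield`, `ult_body`, the stack size) are hypothetical and chosen only for this illustration. Two ULTs share one kernel thread and hand control back to a tiny round-robin loop each time they yield, so they never run concurrently.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

#define NUM_ULTS 2
#define STEPS 3
#define ULT_STACK_SIZE (64 * 1024)   /* assumed stack size, per ULT */

static ucontext_t sched_ctx;               /* the kernel thread's own flow */
static ucontext_t ult_ctx[NUM_ULTS];       /* one saved context per ULT */
static char ult_stack[NUM_ULTS][ULT_STACK_SIZE];
static int ult_done[NUM_ULTS];
static int current;                        /* index of the running ULT */
static char *trace;                        /* records execution order */
static size_t trace_len, trace_cap;

static void trace_put(char c)
{
    if (trace_len + 1 < trace_cap) {
        trace[trace_len++] = c;
        trace[trace_len] = '\0';
    }
}

/* Cooperative yield: save this ULT and resume the scheduler loop. */
static void ult_yield(void)
{
    swapcontext(&ult_ctx[current], &sched_ctx);
}

static void ult_body(int id)
{
    for (int step = 0; step < STEPS; step++) {
        trace_put((char)('A' + id));   /* "work" done by this ULT */
        ult_yield();                   /* let the other ULT run */
    }
    ult_done[id] = 1;                  /* returning resumes uc_link */
}

/* Runs both ULTs to completion on the calling kernel thread. */
void run_ults(char *out, size_t cap)
{
    assert(out != NULL && cap > 0);
    trace = out;
    trace_len = 0;
    trace_cap = cap;
    out[0] = '\0';

    for (int i = 0; i < NUM_ULTS; i++) {
        ult_done[i] = 0;
        getcontext(&ult_ctx[i]);
        ult_ctx[i].uc_stack.ss_sp = ult_stack[i];
        ult_ctx[i].uc_stack.ss_size = ULT_STACK_SIZE;
        ult_ctx[i].uc_link = &sched_ctx;   /* where to go when the ULT ends */
        makecontext(&ult_ctx[i], (void (*)(void))ult_body, 1, i);
    }

    int remaining = NUM_ULTS;
    while (remaining > 0) {                /* round-robin over live ULTs */
        for (int i = 0; i < NUM_ULTS; i++) {
            if (ult_done[i])
                continue;
            current = i;
            swapcontext(&sched_ctx, &ult_ctx[i]);
            if (ult_done[i])
                remaining--;
        }
    }
}
```

Calling `run_ults(buf, sizeof buf)` with a 16-byte buffer leaves the interleaved order of the two ULTs in `buf`. Each switch here is a user-space register save and restore with no kernel scheduling involved, which is the property the slide relies on.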

Pthreads vs. ULTs
[Figure: average create and join time per thread (ns, log scale from 1 to 100,000) versus number of threads (1 to 2048). Series: pthread, ULT (Argobots).]
• Average time for creating and joining one thread
  – pthread: 6.6us - 21.2us (avg. 34,953 cycles)
  – ULT (Argobots): 78ns - 130ns (avg. 191 cycles)
  – ULT is 64x - 233x faster than Pthread
• How fast is ULT?
  – L1$ access: 1.112ns, L2$ access: 5.648ns, memory access: 18.4ns
  – Context switch (2 processes): 1.64us
* measured using LMbench3

Growing Interests in ULTs
• ULT and task libraries
  – Converse threads, Qthreads, MassiveThreads, Nanos++, Maestro, GnuPth, StackThreads/MP, Protothreads, Capriccio, StateThreads, TiNy-threads, etc.
• OS support
  – Windows fibers, Solaris threads
• Language and programming models
  – Cilk, OpenMP task, C++11 task, C++17 coroutine proposal, Stackless Python, Go coroutines, etc.
• Pros
  – Easy to use with Pthreads-like interface
• Cons
  – Runtime tries to do something smart (e.g., work-stealing)
  – This may conflict with the characteristics and demands of applications

Argobots
A low-level lightweight threading and tasking framework (http://collab.cels.anl.gov/display/argobots/)
Overview
• Separation of mechanisms and policies
• Massive parallelism
  – Exec. Streams guarantee progress
  – Work Units execute to completion
    • User-level threads (ULTs) vs. Tasklet
• Clearly defined memory semantics
  – Consistency domains
    • Provide Eventual Consistency
  – Software can manage consistency
Argobots Innovations
• Enabling technology, but not a policy maker
  – High-level languages/libraries such as OpenMP, Charm++ have more information about the user application (data locality, dependencies)
• Explicit model:
  – Enables dynamism, but always managed by high-level systems
[Figure: programming models (MPI, OpenMP, Charm++, PaRSEC, …) layered on Argobots. A shared pool and private pools hold lightweight work units (U: user-level thread, T: tasklet) and feed execution streams, each running on a processor core.]
* Team members: Sangmin Seo, Abdelhalim Amer, Pavan Balaji (ANL), Laxmikant Kale, Prateek Jindal (UIUC)

Argobots Execution Model
• Execution Streams (ES)
  – Sequential instruction stream
    • Can consist of one or more work units
  – Mapped efficiently to a hardware resource
  – Implicitly managed progress semantics
    • One blocked ES cannot block other ESs
• User-level Threads (ULTs)
  – Independent execution units in user space
  – Associated with an ES when running
  – Yieldable and migratable
  – Can make blocking calls
• Tasklets
  – Atomic units of work
  – Asynchronous completion via notifications
  – Not yieldable, migratable before execution
  – Cannot make blocking calls
• Scheduler
  – Stackable scheduler with pluggable strategies
• Synchronization primitives
  – Mutex, condition variable, barrier, future
• Events
  – Communication triggers
[Figure: Argobots execution model. ES 1 through ES n, each with a scheduler (Sched) and pools. Legend: pool, ULT (U), tasklet (T), event (E), scheduler (S).]

Explicit Mapping ULT/Tasklet to ES
• The user needs to map work units to ESs
• No smart scheduling, no work-stealing unless the user wants to use them
• Benefits
  – Allow locality optimization
    • Execute work units on the same ES
  – No expensive lock is needed between ULTs on the same ES
    • They do not run concurrently
    • A flag is enough
[Figure: ES 1 runs U0, U1, U2, U3; ES 2 runs T1, T2, U4, U5]

Stackable Scheduler with Pluggable Strategies
• Associated with an ES
• Can handle ULTs and tasklets
• Can handle schedulers
  – Allows schedulers to be stacked hierarchically
• Can handle asynchronous events
• Users can write schedulers
  – Provides mechanisms, not policies
  – Replace the default scheduler
    • E.g., FIFO, LIFO, Priority Queue, etc.
• ULT can explicitly yield to another ULT
  – Avoid scheduler overhead
[Figure: a scheduler (Sched) with pools of ULTs (U), tasklets (T), events (E), and schedulers (S), contrasting yield() with yield_to(target)]
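The split between mechanism and policy on this slide can be sketched in plain C. This is not the Argobots scheduler API; every name here (`pool`, `pick_fn`, `sched_run`, `run_sub_sched`, `demo_stacked`) is hypothetical. The scheduler loop is the mechanism, the strategy is a function pointer that only decides which work unit runs next, and a scheduler can itself be pushed into a pool as a work unit, which is one way to picture stacking.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_CAP 16

typedef void (*work_fn)(void *arg);
typedef struct { work_fn fn; void *arg; } work_unit;
typedef struct { work_unit items[POOL_CAP]; size_t count; } pool;

/* A strategy (policy) only answers: which unit runs next? */
typedef size_t (*pick_fn)(const pool *p);

size_t pick_fifo(const pool *p) { (void)p; return 0; }
size_t pick_lifo(const pool *p) { return p->count - 1; }

void pool_push(pool *p, work_fn fn, void *arg)
{
    assert(p->count < POOL_CAP);
    p->items[p->count].fn = fn;
    p->items[p->count].arg = arg;
    p->count++;
}

/* Scheduler loop (mechanism): runs units until the pool is empty. */
void sched_run(pool *p, pick_fn pick)
{
    while (p->count > 0) {
        size_t i = pick(p);
        work_unit u = p->items[i];
        for (size_t k = i + 1; k < p->count; k++)   /* close the gap */
            p->items[k - 1] = p->items[k];
        p->count--;
        u.fn(u.arg);                                /* run to completion */
    }
}

/* A scheduler packaged as a work unit, so schedulers can be stacked. */
typedef struct { pool *p; pick_fn pick; } sub_sched;

void run_sub_sched(void *arg)
{
    sub_sched *s = arg;
    sched_run(s->p, s->pick);
}

/* Demo work unit: appends one character to a trace string. */
typedef struct { char *out; size_t *len; char tag; } record_arg;

static void record(void *arg)
{
    record_arg *r = arg;
    r->out[(*r->len)++] = r->tag;
    r->out[*r->len] = '\0';
}

/* Outer FIFO scheduler whose second unit is an inner LIFO scheduler. */
void demo_stacked(char out[5])
{
    size_t len = 0;
    pool outer = { .count = 0 }, inner = { .count = 0 };
    record_arg a = { out, &len, 'a' }, b = { out, &len, 'b' };
    record_arg x = { out, &len, 'x' }, y = { out, &len, 'y' };
    sub_sched nested = { &inner, pick_lifo };

    out[0] = '\0';
    pool_push(&inner, record, &x);
    pool_push(&inner, record, &y);
    pool_push(&outer, record, &a);
    pool_push(&outer, run_sub_sched, &nested);
    pool_push(&outer, record, &b);
    sched_run(&outer, pick_fifo);
}
```

Swapping `pick_fifo` for `pick_lifo` changes the execution order without touching `sched_run`. A priority queue strategy would be another `pick_fn` that scans the pool for the highest-priority unit.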

Performance: Create/Join Time
• Ideal scalability
  – If the ULT runtime is perfectly scalable, the time should be the same regardless of the number of ESs
[Figure: create/join time per ULT (cycles, log scale up to 10,000) versus number of execution streams (workers), 1 to 72. Series: Qthreads, MassiveThreads (H), MassiveThreads (W), Argobots (ULT), Argobots (Tasklet).]

Charm++ with Argobots
Jonathan Lifflander, Prateek Jindal, Yanhua Sun, Laxmikant Kale
University of Illinois at Urbana-Champaign (UIUC)

Charm++ with Argobots
• Goals
  – Test the completeness and performance of Argobots with the Charm++ programming model
  – Take advantage of Argobots features (tasklets, stackable schedulers, etc.) without modifying application codes
  – For Charm++ applications, interoperate with applications written in other models (MPI, Cilk, etc.)
[Figure: two software stacks. Charm++ infrastructure: mini-apps and real world applications; Charm++ model; intelligent runtime; Converse runtime (threading, messaging, scheduler); communication libraries (MPI, uGNI, PAMI, Verbs). Charm++ with Argobots: the same stack with Argobots (ULTs, Tasks, scheduling, etc.) in place of the Converse runtime.]
* Team members: Laxmikant Kale, Jonathan Lifflander, Prateek Jindal (UIUC)

Replacing the Converse Runtime with Argobots
[Figure: Converse runtime (threading, messaging, scheduler) replaced by Argobots (ULTs, Tasks, scheduling, etc.)]
• Converse
  – The active messaging layer in Charm++
• Approaches
  – Each Charm++ Pthread inside a node (including the communication thread) is implemented as an Argobots ES
    • Create an ES for every Converse instance
  – A custom Argobots scheduler is created instead of using the Converse scheduler
  – Converse messages are enqueued into Argobots pools as tasklets
  – Converse threads (CthThread) are implemented on top of Argobots ULTs, with condition variables to implement suspend/resume
• Only 180 lines of code had to be changed!
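The idea of enqueuing active messages into a pool as tasklets can be pictured with a small sketch. It uses neither the Converse nor the Argobots API; the names (`message`, `tasklet_pool`, `register_handler`, `enqueue_message`, `drain_pool`, `demo_messages`) are hypothetical. Each message carries the index of its handler, the pool is a plain ring buffer, and a scheduler loop pops messages and runs each handler to completion, which matches the "cannot yield, cannot block" nature of a tasklet.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLERS 4
#define QUEUE_CAP 8

/* An active message: names its handler and carries a payload. */
typedef struct { int handler_id; int payload; } message;
typedef void (*handler_fn)(const message *msg, long *state);

/* Ring buffer standing in for a pool of tasklets. */
typedef struct { message *slots[QUEUE_CAP]; size_t head, count; } tasklet_pool;

static handler_fn handlers[MAX_HANDLERS];

void register_handler(int id, handler_fn fn)
{
    assert(id >= 0 && id < MAX_HANDLERS);
    handlers[id] = fn;
}

/* Receiving a message: wrap it as a tasklet by pushing it into the pool. */
void enqueue_message(tasklet_pool *pool, message *msg)
{
    assert(pool->count < QUEUE_CAP);
    pool->slots[(pool->head + pool->count) % QUEUE_CAP] = msg;
    pool->count++;
}

/* Scheduler loop: run each tasklet to completion, in arrival order.
 * Returns the number of tasklets executed. */
size_t drain_pool(tasklet_pool *pool, long *state)
{
    size_t executed = 0;
    while (pool->count > 0) {
        message *msg = pool->slots[pool->head];
        pool->head = (pool->head + 1) % QUEUE_CAP;
        pool->count--;
        assert(handlers[msg->handler_id] != NULL);
        handlers[msg->handler_id](msg, state);   /* no yield, no blocking */
        executed++;
    }
    return executed;
}

static void add_handler(const message *m, long *s) { *s += m->payload; }
static void mul_handler(const message *m, long *s) { *s *= m->payload; }

/* Three messages processed in order: ((0 + 5) * 3) + 2. */
long demo_messages(void)
{
    tasklet_pool pool = { .head = 0, .count = 0 };
    message msgs[3] = { { 0, 5 }, { 1, 3 }, { 0, 2 } };
    long state = 0;

    register_handler(0, add_handler);
    register_handler(1, mul_handler);
    for (int i = 0; i < 3; i++)
        enqueue_message(&pool, &msgs[i]);

    size_t executed = drain_pool(&pool, &state);
    assert(executed == 3);
    return state;
}
```

Because a handler never suspends, it needs no stack of its own. Work that must suspend and resume is the case the slide assigns to ULTs with condition variables instead.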
