9th Annual Workshop on Charm++ and its Applications

Improving Charm++ Performance with a NUMA-aware Load Balancer

Laércio Lima Pilla (1,2), Christiane Pousa (2), Daniel Cordeiro (2,3), Abhinav Bhatele (4), Philippe O. A. Navaux (1), Jean-François Méhaut (2), Laxmikant V. Kale (4)
(1) Federal University of Rio Grande do Sul – Porto Alegre, Brazil
(2) Grenoble University – Grenoble, France
(3) University of São Paulo – São Paulo, Brazil
(4) University of Illinois at Urbana-Champaign – Urbana, IL, USA
Summary
How we used NUMA architectural information to build a Charm++ load balancer and obtained improvements in overall performance.
Agenda
• NUMA
• Our Load Balancer: NumaLB
• Experimental Setup
• Results
• Concluding Remarks
UMA vs. NUMA
• Uniform Memory Access (UMA): centralized shared memory, uniform latencies, data placement does not matter
• Non-Uniform Memory Access (NUMA): distributed shared memory, non-uniform latencies, data placement matters
[Diagram: in the UMA machine all processors (P) reach a single memory (M) through one interconnection; in the NUMA machine each group of processors has its own memory module, and all modules form a single shared address space]
NUMA
• Careful data placement reduces latencies: local accesses instead of remote accesses across the interconnect
• It also reduces contention and improves bandwidth: accesses are spread over the memory nodes instead of piling up on one
[Diagrams: a machine with six NUMA nodes (M0–M5), each with its own group of cores (C), contrasting remote accesses with local ones]
NUMA
• Charm++ does not consider these characteristics
• Physical organization: several NUMA nodes (M0–M5), each with its own memory and cores
• Charm++'s vision: a flat UMA machine with a single memory M, no memory hierarchy, no locality
[Diagram: the physical six-node organization on the left versus Charm++'s UMA view on the right]
Load Balancer
• Application data (from the Charm++ LB framework)
  – Processor load: execution time
  – Chare load: execution time
  – Communication graph: size and number of messages
• NUMA topology (from archTopology, our library)
  – Core to NUMA node (socket) hierarchy mapping
  – NUMA factor: NUMAfactor(i, j) = (read latency from node i to node j) / (read latency on node i)
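To make the NUMA factor concrete, here is a minimal C++ sketch of how such a table could be computed from measured read latencies. The latency values and the data layout are illustrative assumptions only; they are not archTopology's actual interface or measurements from the machines in this talk.

    // Minimal sketch (hypothetical data, not archTopology's real interface):
    // derive a NUMA factor table from measured read latencies.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        // latency[i][j]: measured read latency (arbitrary units) from NUMA node i
        // to memory on NUMA node j; values below are made up for illustration.
        std::vector<std::vector<double>> latency = {
            {100.0, 140.0, 160.0},
            {140.0, 100.0, 140.0},
            {160.0, 140.0, 100.0},
        };
        const std::size_t nodes = latency.size();

        // NUMAfactor(i, j) = latency(i -> j) / latency(i -> i):
        // 1.0 for local accesses, larger than 1.0 for remote ones.
        std::vector<std::vector<double>> numaFactor(nodes, std::vector<double>(nodes));
        for (std::size_t i = 0; i < nodes; ++i)
            for (std::size_t j = 0; j < nodes; ++j)
                numaFactor[i][j] = latency[i][j] / latency[i][i];

        for (std::size_t i = 0; i < nodes; ++i) {
            for (std::size_t j = 0; j < nodes; ++j)
                std::printf("%5.2f ", numaFactor[i][j]);
            std::printf("\n");
        }
        return 0;
    }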
Load Balancer
• Heuristic
  – Task mapping is NP-hard
  – No initial assumptions about the application
• List scheduling
  – Put the chares in a priority list ordered by load
  – Greedily assign each one to the core with the smallest cost
• Improve performance
  – by reducing load imbalance
  – by reducing remote communication costs
  – while avoiding migrations (data movement costs)
Load Balancer
• Cost function: cost(c, p) = load(p) + α × (rcomm(c, p) × NUMAfactor − lcomm(c, p))
  where
  – c: chare
  – p: core
  – load(p): load (execution time) on core p
  – rcomm(c, p): number of messages sent by chare c to chares on other NUMA nodes
  – lcomm(c, p): number of messages sent by chare c to chares on the same NUMA node
  – α: communication weight
• (A simplified code sketch combining this cost function with the mapping algorithm follows the next slide.)
NumaLB's Algorithm
Input: C, the set of chares; P, the set of cores; M, the current mapping of chares to cores
Output: M', the new mapping of chares to cores
 1. M' ← M
 2. while C ≠ Ø do                              (for each chare, heaviest first)
 3.    c ← v | v ∈ arg max u∈C load(u)          (take the heaviest chare)
 4.    C ← C \ {c}
 5.    p ← q | q ∈ P ∧ (c, q) ∈ M               (get its current core)
 6.    load(p) ← load(p) − load(c)              (remove its load from that core)
 7.    M' ← M' \ {(c, p)}                       (remove it from the mapping)
 8.    p' ← q | q ∈ arg min r∈P cost(c, r)      (find the core with the smallest cost)
 9.    load(p') ← load(p') + load(c)            (add the chare's load to the new core)
10.    M' ← M' ∪ {(c, p')}                      (map the chare to the new core)
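As a complement to the pseudocode, the listing below is a simplified, self-contained C++ sketch of the same greedy loop combined with the cost function from the previous slide. The toy instance, the fixed α and NUMA factor, and the cores-per-node layout are all illustrative assumptions; the real NumaLB operates inside the Charm++ LB framework on its own data structures.

    // Simplified sketch of NumaLB-style greedy mapping (hypothetical toy instance;
    // not the actual Charm++ implementation).
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <map>
    #include <vector>

    struct Chare {
        double load;              // measured execution time
        int core;                 // core it is currently mapped to
        std::map<int, int> msgs;  // peer chare id -> number of messages exchanged
    };

    int main() {
        const double ALPHA = 1.0;        // communication weight (assumed value)
        const double NUMA_FACTOR = 1.5;  // remote/local latency ratio (assumed value)
        const int CORES_PER_NODE = 2;    // cores per NUMA node (assumed layout)
        const int numCores = 4;

        // Toy instance: 4 chares on 4 cores spread over 2 NUMA nodes.
        std::vector<Chare> chares = {
            {200.0, 0, {{1, 8}}},
            { 50.0, 1, {{0, 8}}},
            {150.0, 2, {{3, 4}}},
            {100.0, 3, {{2, 4}}},
        };
        auto node = [&](int core) { return core / CORES_PER_NODE; };

        std::vector<double> coreLoad(numCores, 0.0);
        std::vector<int> mapping(chares.size());
        for (std::size_t c = 0; c < chares.size(); ++c) {
            mapping[c] = chares[c].core;
            coreLoad[chares[c].core] += chares[c].load;
        }

        // cost(c, p) = load(p) + alpha * (rcomm(c, p) * NUMA_FACTOR - lcomm(c, p))
        auto cost = [&](std::size_t c, int p) {
            double rcomm = 0.0, lcomm = 0.0;
            for (const auto& [peer, count] : chares[c].msgs) {
                if (node(mapping[peer]) == node(p)) lcomm += count;  // same NUMA node
                else rcomm += count;                                 // remote NUMA node
            }
            return coreLoad[p] + ALPHA * (rcomm * NUMA_FACTOR - lcomm);
        };

        // List scheduling: visit the chares from heaviest to lightest.
        std::vector<std::size_t> order(chares.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
            return chares[a].load > chares[b].load;
        });

        for (std::size_t c : order) {
            coreLoad[mapping[c]] -= chares[c].load;        // take its load off its old core
            int best = 0;
            for (int p = 1; p < numCores; ++p)
                if (cost(c, p) < cost(c, best)) best = p;  // core with the smallest cost
            coreLoad[best] += chares[c].load;
            mapping[c] = best;                             // record the new placement
        }

        for (std::size_t c = 0; c < mapping.size(); ++c)
            std::printf("chare %zu -> core %d\n", c, mapping[c]);
        return 0;
    }

Removing a chare's load from its current core before evaluating the costs lets that core compete on equal footing with the others, which is what allows the balancer to leave a chare in place when migrating it would not pay off.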
Experimental Setup
• 2 NUMA machines
• 3 Charm++ benchmarks
• 4 other Charm++ load balancers
• Statistical confidence of 95% (see the sketch after this slide)
  – 5% relative error
  – Student's t-distribution
  – Minimum of 25 executions
• Performance
  – Gains: average iteration time (baseline = no LB)
  – Costs: load balancing overhead
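As an illustration of the 95% confidence / 5% relative error criterion, here is a small C++ sketch that checks whether a set of iteration-time samples is precise enough using Student's t-distribution. The sample values are made up, and the t quantile is hard-coded for 25 executions (24 degrees of freedom).

    // Sketch of the stopping criterion: 95% confidence interval with at most
    // 5% relative error, computed over hypothetical iteration times.
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        // Hypothetical average iteration times (ms) from 25 independent executions.
        std::vector<double> samples(25, 22.0);
        for (std::size_t i = 0; i < samples.size(); ++i)
            samples[i] += 0.5 * std::sin(static_cast<double>(i));  // add some spread

        const std::size_t n = samples.size();
        double mean = 0.0;
        for (double x : samples) mean += x;
        mean /= static_cast<double>(n);

        double var = 0.0;
        for (double x : samples) var += (x - mean) * (x - mean);
        var /= static_cast<double>(n - 1);  // sample variance
        const double stddev = std::sqrt(var);

        // Two-sided Student's t quantile at 95% confidence, 24 degrees of freedom.
        const double T_95_24 = 2.064;
        const double halfWidth = T_95_24 * stddev / std::sqrt(static_cast<double>(n));
        const double relativeError = halfWidth / mean;

        std::printf("mean = %.3f ms, 95%% CI half-width = %.3f ms, relative error = %.2f%%\n",
                    mean, halfWidth, 100.0 * relativeError);
        std::puts(relativeError <= 0.05 ? "precision reached" : "need more executions");
        return 0;
    }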
Experimental Setup: Machines
• NUMA16
  – AMD Opteron, 8 × 2 cores @ 2.2 GHz
  – 1 MB private L2 cache
  – 32 GB main memory
  – Low latency for memory access
  – Crossbar interconnect
  – NUMA factor: 1.1 – 1.5
[Diagram: 8 NUMA nodes, each with one memory bank (M) and two cores (C), each core with a private L2 cache]
Experimental Setup: Machines
• NUMA32
  – Intel Xeon X7560, 4 × 8 cores @ 2.27 GHz
  – 256 KB private L2 cache
  – 24 MB shared L3 cache
  – 64 GB main memory
  – QuickPath interconnect
  – NUMA factor: 1.36 – 3.6
[Diagram: 4 NUMA nodes, each with one memory bank (M), a shared L3 cache, and eight cores (C), each core with a private L2 cache]
Experimental Setup: Benchmarks
• kNeighbor
  – Synthetic iterative benchmark in which each chare communicates with k other chares at every step
  – Completely I/O (communication) bound
  – 200 chares, 16 KB messages, k = 8
• lb_test
  – Synthetic unbalanced benchmark with different possible communication patterns
  – 200 chares, random communication graph, load between 50 and 200 ms
• jacobi2D
  – Unbalanced two-dimensional five-point stencil
  – 100 chares, 32² data array
Experimental Setup: LBs
• GreedyLB: iteratively maps the most loaded chares to the least loaded cores
• RecBipartLB: recursive bipartition of the communication graph, using a breadth-first traversal until each group gathers the required load
• MetisLB: graph partitioning algorithms from METIS
• ScotchLB: graph partitioning algorithms from SCOTCH
• None of them considers the current chare mapping
Results: kNeighbor
[Bar chart: average iteration time in ms (smaller is better) for Baseline, NumaLB, GreedyLB, MetisLB, RecBipartLB, and ScotchLB; baseline at 32.0 ms (NUMA16) and 26.9 ms (NUMA32), balanced runs between roughly 21.5 and 22.8 ms on NUMA16 and between 13.5 and 17.9 ms on NUMA32]
• No sensible difference among the LBs: all of them reach a homogeneous distribution of chares
• Improvements of up to 30% on NUMA16 and 45% on NUMA32 over the baseline
• Grouping chares and migrating them together to the same core exploits the shared cache and yields faster communication
Results: lb_test
[Bar chart: average iteration time in s for Baseline, NumaLB, GreedyLB, MetisLB, RecBipartLB, and ScotchLB; baseline at 1.01 s (NUMA16) and 0.60 s (NUMA32), balanced runs between roughly 0.83 and 0.93 s on NUMA16 and between 0.43 and 0.51 s on NUMA32]
• Best performance obtained by the communication-aware LBs
• Best average performance: improvements of 17% on NUMA16 and 28% on NUMA32 over the baseline
Results: jacobi2D
[Bar chart: average iteration time in s for Baseline, NumaLB, GreedyLB, MetisLB, RecBipartLB, and ScotchLB; baseline at 1.74 s (NUMA16) and 0.42 s (NUMA32), balanced runs between roughly 1.03 and 1.31 s on NUMA16 and between 0.27 and 0.40 s on NUMA32]
• Best performance: improvements of 41% on NUMA16 and 36% on NUMA32, obtained by keeping proximity among chares at the NUMA-node scale
• ScotchLB shows similar performance
Results: jacobi2D – Projections
• jacobi2D on NUMA16: 2 steps before load balancing, 4 steps after
• MetisLB: 75% efficiency; NumaLB: 93.5% efficiency
• The smaller the idle parts in the timeline, the higher the efficiency
[Screenshots: Projections timeline views comparing the runs with MetisLB and NumaLB]