9th Annual Workshop on Charm++ and its Applications

Improving Charm++ Performance with a NUMA-aware Load Balancer

Laércio Lima Pilla (1,2), Christiane Pousa (2), Daniel Cordeiro (2,3), Abhinav Bhatele (4), Philippe O. A. Navaux (1), Jean-François Méhaut (2), Laxmikant V. Kale (4)
(1) Federal University of Rio Grande do Sul – Porto Alegre, Brazil
(2) Grenoble University – Grenoble, France
(3) University of São Paulo – São Paulo, Brazil
(4) University of Illinois at Urbana-Champaign – Urbana, IL, USA
Summary
How we used NUMA architectural information to build a Charm++ load balancer and obtained improvements in overall performance.
Agenda
• NUMA
• Our Load Balancer: NumaLB
• Experimental Setup
• Results
• Concluding Remarks
UMA vs. NUMA
• Uniform Memory Access (UMA): centralized shared memory, uniform latencies, data placement does not matter
• Non-Uniform Memory Access (NUMA): distributed shared memory, non-uniform latencies, data placement matters
[Diagram: in the UMA machine all processors (P) reach a single memory (M) through one interconnection; in the NUMA machine each group of processors has its own memory module, and all modules form a single shared address space]
NUMA
• Careful data placement reduces latencies: local accesses instead of remote accesses across the interconnect
• It also reduces contention and improves bandwidth: accesses are spread over the memory nodes instead of piling up on one
[Diagrams: a machine with six NUMA nodes (M0–M5), each with its own group of cores (C), contrasting remote accesses with local ones]
NUMA
• Charm++ does not consider these characteristics
• Physical organization: several NUMA nodes (M0–M5), each with its own memory and cores
• Charm++'s vision: a flat UMA machine with a single memory M, no memory hierarchy, no locality
[Diagram: the physical six-node organization on the left versus Charm++'s UMA view on the right]
Load Balancer
• Application data (from the Charm++ LB framework)
  – Processor load: execution time
  – Chare load: execution time
  – Communication graph: size and number of messages
• NUMA topology (from archTopology, our library)
  – Core to NUMA node (socket) hierarchy mapping
  – NUMA factor: NUMAfactor(i, j) = (read latency from node i to node j) / (read latency on node i)
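To make the NUMA factor concrete, here is a minimal C++ sketch of how such a table could be computed from measured read latencies. The latency values and the data layout are illustrative assumptions only; they are not archTopology's actual interface or measurements from the machines in this talk.

    // Minimal sketch (hypothetical data, not archTopology's real interface):
    // derive a NUMA factor table from measured read latencies.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        // latency[i][j]: measured read latency (arbitrary units) from NUMA node i
        // to memory on NUMA node j; values below are made up for illustration.
        std::vector<std::vector<double>> latency = {
            {100.0, 140.0, 160.0},
            {140.0, 100.0, 140.0},
            {160.0, 140.0, 100.0},
        };
        const std::size_t nodes = latency.size();

        // NUMAfactor(i, j) = latency(i -> j) / latency(i -> i):
        // 1.0 for local accesses, larger than 1.0 for remote ones.
        std::vector<std::vector<double>> numaFactor(nodes, std::vector<double>(nodes));
        for (std::size_t i = 0; i < nodes; ++i)
            for (std::size_t j = 0; j < nodes; ++j)
                numaFactor[i][j] = latency[i][j] / latency[i][i];

        for (std::size_t i = 0; i < nodes; ++i) {
            for (std::size_t j = 0; j < nodes; ++j)
                std::printf("%5.2f ", numaFactor[i][j]);
            std::printf("\n");
        }
        return 0;
    }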
Load Balancer
• Heuristic
  – Task mapping is NP-hard
  – No initial assumptions about the application
• List scheduling
  – Put the chares in a priority list ordered by load
  – Greedily assign each one to the core with the smallest cost
• Improve performance
  – by reducing load imbalance
  – by reducing remote communication costs
  – while avoiding migrations (data movement costs)
Load Balancer
• Cost function: cost(c, p) = load(p) + α × (rcomm(c, p) × NUMAfactor − lcomm(c, p))
  where
  – c: chare
  – p: core
  – load(p): load (execution time) on core p
  – rcomm(c, p): number of messages sent by chare c to chares on other NUMA nodes
  – lcomm(c, p): number of messages sent by chare c to chares on the same NUMA node
  – α: communication weight
• (A simplified code sketch combining this cost function with the mapping algorithm follows the next slide.)
NumaLB's Algorithm
Input: C, the set of chares; P, the set of cores; M, the current mapping of chares to cores
Output: M', the new mapping of chares to cores
 1. M' ← M
 2. while C ≠ Ø do                              (for each chare, heaviest first)
 3.    c ← v | v ∈ arg max u∈C load(u)          (take the heaviest chare)
 4.    C ← C \ {c}
 5.    p ← q | q ∈ P ∧ (c, q) ∈ M               (get its current core)
 6.    load(p) ← load(p) − load(c)              (remove its load from that core)
 7.    M' ← M' \ {(c, p)}                       (remove it from the mapping)
 8.    p' ← q | q ∈ arg min r∈P cost(c, r)      (find the core with the smallest cost)
 9.    load(p') ← load(p') + load(c)            (add the chare's load to the new core)
10.    M' ← M' ∪ {(c, p')}                      (map the chare to the new core)
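As a complement to the pseudocode, the listing below is a simplified, self-contained C++ sketch of the same greedy loop combined with the cost function from the previous slide. The toy instance, the fixed α and NUMA factor, and the cores-per-node layout are all illustrative assumptions; the real NumaLB operates inside the Charm++ LB framework on its own data structures.

    // Simplified sketch of NumaLB-style greedy mapping (hypothetical toy instance;
    // not the actual Charm++ implementation).
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <map>
    #include <vector>

    struct Chare {
        double load;              // measured execution time
        int core;                 // core it is currently mapped to
        std::map<int, int> msgs;  // peer chare id -> number of messages exchanged
    };

    int main() {
        const double ALPHA = 1.0;        // communication weight (assumed value)
        const double NUMA_FACTOR = 1.5;  // remote/local latency ratio (assumed value)
        const int CORES_PER_NODE = 2;    // cores per NUMA node (assumed layout)
        const int numCores = 4;

        // Toy instance: 4 chares on 4 cores spread over 2 NUMA nodes.
        std::vector<Chare> chares = {
            {200.0, 0, {{1, 8}}},
            { 50.0, 1, {{0, 8}}},
            {150.0, 2, {{3, 4}}},
            {100.0, 3, {{2, 4}}},
        };
        auto node = [&](int core) { return core / CORES_PER_NODE; };

        std::vector<double> coreLoad(numCores, 0.0);
        std::vector<int> mapping(chares.size());
        for (std::size_t c = 0; c < chares.size(); ++c) {
            mapping[c] = chares[c].core;
            coreLoad[chares[c].core] += chares[c].load;
        }

        // cost(c, p) = load(p) + alpha * (rcomm(c, p) * NUMA_FACTOR - lcomm(c, p))
        auto cost = [&](std::size_t c, int p) {
            double rcomm = 0.0, lcomm = 0.0;
            for (const auto& [peer, count] : chares[c].msgs) {
                if (node(mapping[peer]) == node(p)) lcomm += count;  // same NUMA node
                else rcomm += count;                                 // remote NUMA node
            }
            return coreLoad[p] + ALPHA * (rcomm * NUMA_FACTOR - lcomm);
        };

        // List scheduling: visit the chares from heaviest to lightest.
        std::vector<std::size_t> order(chares.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
            return chares[a].load > chares[b].load;
        });

        for (std::size_t c : order) {
            coreLoad[mapping[c]] -= chares[c].load;        // take its load off its old core
            int best = 0;
            for (int p = 1; p < numCores; ++p)
                if (cost(c, p) < cost(c, best)) best = p;  // core with the smallest cost
            coreLoad[best] += chares[c].load;
            mapping[c] = best;                             // record the new placement
        }

        for (std::size_t c = 0; c < mapping.size(); ++c)
            std::printf("chare %zu -> core %d\n", c, mapping[c]);
        return 0;
    }

Removing a chare's load from its current core before evaluating the costs lets that core compete on equal footing with the others, which is what allows the balancer to leave a chare in place when migrating it would not pay off.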
Experimental Setup
• 2 NUMA machines
• 3 Charm++ benchmarks
• 4 other Charm++ load balancers
• Statistical confidence of 95% (see the sketch after this slide)
  – 5% relative error
  – Student's t-distribution
  – Minimum of 25 executions
• Performance
  – Gains: average iteration time (baseline = no LB)
  – Costs: load balancing overhead
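As an illustration of the 95% confidence / 5% relative error criterion, here is a small C++ sketch that checks whether a set of iteration-time samples is precise enough using Student's t-distribution. The sample values are made up, and the t quantile is hard-coded for 25 executions (24 degrees of freedom).

    // Sketch of the stopping criterion: 95% confidence interval with at most
    // 5% relative error, computed over hypothetical iteration times.
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        // Hypothetical average iteration times (ms) from 25 independent executions.
        std::vector<double> samples(25, 22.0);
        for (std::size_t i = 0; i < samples.size(); ++i)
            samples[i] += 0.5 * std::sin(static_cast<double>(i));  // add some spread

        const std::size_t n = samples.size();
        double mean = 0.0;
        for (double x : samples) mean += x;
        mean /= static_cast<double>(n);

        double var = 0.0;
        for (double x : samples) var += (x - mean) * (x - mean);
        var /= static_cast<double>(n - 1);  // sample variance
        const double stddev = std::sqrt(var);

        // Two-sided Student's t quantile at 95% confidence, 24 degrees of freedom.
        const double T_95_24 = 2.064;
        const double halfWidth = T_95_24 * stddev / std::sqrt(static_cast<double>(n));
        const double relativeError = halfWidth / mean;

        std::printf("mean = %.3f ms, 95%% CI half-width = %.3f ms, relative error = %.2f%%\n",
                    mean, halfWidth, 100.0 * relativeError);
        std::puts(relativeError <= 0.05 ? "precision reached" : "need more executions");
        return 0;
    }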
Experimental Setup: Machines
• NUMA16
  – AMD Opteron, 8 × 2 cores @ 2.2 GHz
  – 1 MB private L2 cache
  – 32 GB main memory
  – Low latency for memory access
  – Crossbar interconnect
  – NUMA factor: 1.1 – 1.5
[Diagram: 8 NUMA nodes, each with one memory bank (M) and two cores (C), each core with a private L2 cache]
Experimental Setup: Machines
• NUMA32
  – Intel Xeon X7560, 4 × 8 cores @ 2.27 GHz
  – 256 KB private L2 cache
  – 24 MB shared L3 cache
  – 64 GB main memory
  – QuickPath interconnect
  – NUMA factor: 1.36 – 3.6
[Diagram: 4 NUMA nodes, each with one memory bank (M), a shared L3 cache, and eight cores (C), each core with a private L2 cache]
Experimental Setup: Benchmarks
• kNeighbor
  – Synthetic iterative benchmark in which each chare communicates with k other chares at every step
  – Completely I/O (communication) bound
  – 200 chares, 16 KB messages, k = 8
• lb_test
  – Synthetic unbalanced benchmark with different possible communication patterns
  – 200 chares, random communication graph, load between 50 and 200 ms
• jacobi2D
  – Unbalanced two-dimensional five-point stencil
  – 100 chares, 32² data array
Experimental Setup: LBs
• GreedyLB: iteratively maps the most loaded chares to the least loaded cores
• RecBipartLB: recursive bipartition of the communication graph, using a breadth-first traversal until each group gathers the required load
• MetisLB: graph partitioning algorithms from METIS
• ScotchLB: graph partitioning algorithms from SCOTCH
• None of them considers the current chare mapping
Results: kNeighbor
[Bar chart: average iteration time in ms (smaller is better) for Baseline, NumaLB, GreedyLB, MetisLB, RecBipartLB, and ScotchLB; baseline at 32.0 ms (NUMA16) and 26.9 ms (NUMA32), balanced runs between roughly 21.5 and 22.8 ms on NUMA16 and between 13.5 and 17.9 ms on NUMA32]
• No sensible difference among the LBs: all of them reach a homogeneous distribution of chares
• Improvements of up to 30% on NUMA16 and 45% on NUMA32 over the baseline
• Grouping chares and migrating them together to the same core exploits the shared cache and yields faster communication
Results: lb_test
[Bar chart: average iteration time in s for Baseline, NumaLB, GreedyLB, MetisLB, RecBipartLB, and ScotchLB; baseline at 1.01 s (NUMA16) and 0.60 s (NUMA32), balanced runs between roughly 0.83 and 0.93 s on NUMA16 and between 0.43 and 0.51 s on NUMA32]
• Best performance obtained by the communication-aware LBs
• Best average performance: improvements of 17% on NUMA16 and 28% on NUMA32 over the baseline
Results: jacobi2D
[Bar chart: average iteration time in s for Baseline, NumaLB, GreedyLB, MetisLB, RecBipartLB, and ScotchLB; baseline at 1.74 s (NUMA16) and 0.42 s (NUMA32), balanced runs between roughly 1.03 and 1.31 s on NUMA16 and between 0.27 and 0.40 s on NUMA32]
• Best performance: improvements of 41% on NUMA16 and 36% on NUMA32, obtained by keeping proximity among chares at the NUMA-node scale
• ScotchLB shows similar performance
Results: jacobi2D – Projections
• jacobi2D on NUMA16: 2 steps before load balancing, 4 steps after
• MetisLB: 75% efficiency; NumaLB: 93.5% efficiency
• The smaller the idle parts in the timeline, the higher the efficiency
[Screenshots: Projections timeline views comparing the runs with MetisLB and NumaLB]