A Case for NUMA-aware Contention Management on Multicore Systems


1. A Case for NUMA-aware Contention Management on Multicore Systems. Sergey Blagodurov (sergey_blagodurov@sfu.ca), Sergey Zhuravlev (sergey_zhuravlev@sfu.ca), Mohammad Dashti (mohammad_dashti@sfu.ca), Alexandra Fedorova (alexandra_fedorova@sfu.ca). USENIX ATC'11, Scheduling session, 15th of June.

2. [Diagram: one NUMA domain of an AMD Opteron 8356 (Barcelona) system — four cores, each with private L1/L2 caches, a shared L3 cache, a system request interface, a crossbar switch, a memory controller attached to memory node 0, and HyperTransport links to the other domains.]

3. [Diagram: an AMD Opteron system with four NUMA domains (16 cores). Each domain has four cores with private L1/L2 caches, a shared L3 cache, a memory controller (MC) attached to a local memory node, and HyperTransport (HT) links to the other domains.]

4. [Diagram, same system: contention for the shared last-level cache (CA).]

5. [Diagram, same system: contention for the memory controller (MC).]

6. [Diagram, same system: contention for the inter-domain interconnect (IC).]

7. [Diagram, same system, with a thread A placed on one of the cores: remote access latency (RL).]

8. [Diagram, same system, with threads A and B placed on cores: isolating memory controller contention (MC).]

9. Dominant degradation factors: memory controller (MC) and interconnect (IC) contention are the key factors hurting performance.

10. Contention-aware scheduling. Characterization method: given two threads, decide whether they will hurt each other's performance if co-scheduled. Scheduling algorithm: separate threads that are expected to interfere. (A sketch of this two-part loop follows.)
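A minimal Python sketch, not from the paper, of how these two parts fit into a periodic scheduling loop. The three callbacks (a miss-rate sampler, an assignment policy, and an affinity enforcer) are hypothetical hooks; the sketch only shows the structure.

    import time

    def contention_aware_loop(threads, num_domains, sample_missrates,
                              compute_assignment, apply_assignment,
                              interval=1.0):
        """Skeleton of a contention-aware scheduler.

        Every `interval` seconds: (1) characterize threads (e.g. by LLC miss
        rate), (2) compute a placement that separates threads expected to
        interfere, (3) enforce that placement. All three callbacks are
        hypothetical hooks, not a real API.
        """
        while True:
            missrate = sample_missrates(threads)                             # characterization
            placement = compute_assignment(threads, missrate, num_domains)   # policy
            apply_assignment(placement)                                      # enforcement
            time.sleep(interval)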

11. Characterization method. Limited observability: we do not know for sure whether threads compete, or how severely, and the hardware does not tell us. Trial and error is infeasible on large systems: we cannot try all possible combinations, and even sampling becomes difficult. A good trade-off: measure the LLC miss rate. This assumes that threads with high miss rates interfere; it does not account for the impact of cache contention, but it works well because cache contention is not the dominant factor.
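A minimal sketch of the kind of per-thread LLC miss-rate sampling this characterization relies on. The `read_counter` hook and the event names are placeholders (on Linux one would wire this to perf_event_open or a similar counter interface), so treat it as an illustration of the metric rather than a real measurement API.

    import time

    def llc_missrate(read_counter, tid, interval=0.1):
        """Estimate LLC misses per 1000 retired instructions for thread `tid`.

        `read_counter(tid, event)` is a hypothetical hook that returns the
        current raw value of a hardware counter for that thread.
        """
        m0 = read_counter(tid, "LLC_MISSES")
        i0 = read_counter(tid, "INSTRUCTIONS")
        time.sleep(interval)
        m1 = read_counter(tid, "LLC_MISSES")
        i1 = read_counter(tid, "INSTRUCTIONS")
        return 1000.0 * (m1 - m0) / max(i1 - i0, 1)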

12. Our previous work: Distributed Intensity (DI-Plain), an algorithm for UMA systems. Sort threads by LLC miss rate; the goal is to isolate threads that compete for shared resources. Migrate competing threads to different domains. [Diagram: co-scheduling the most memory-intensive threads in the same domain (e.g. A with B, X with Y) gives high contention; pairing each intensive thread with a non-intensive one (A with Y, B with X) gives low contention.]
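A minimal Python sketch of the DI-Plain placement idea as reconstructed from the slide: sort threads by LLC miss rate and pair the most intensive remaining thread with the least intensive one, so the heavy memory users land in different domains. Function and variable names are illustrative, not the authors' code.

    def di_plain_assign(threads, missrate, num_domains):
        """Assign threads to domains, pairing the most intensive remaining
        thread with the least intensive one so heavy memory users end up apart."""
        ordered = sorted(threads, key=lambda t: missrate[t], reverse=True)
        domains = {d: [] for d in range(num_domains)}
        lo, hi, d = 0, len(ordered) - 1, 0
        while lo <= hi:
            domains[d].append(ordered[lo])        # most intensive remaining
            if hi > lo:
                domains[d].append(ordered[hi])    # least intensive remaining
            lo, hi, d = lo + 1, hi - 1, (d + 1) % num_domains
        return domains

    # Hypothetical example with the slide's four threads:
    rates = {"A": 200, "B": 150, "X": 7, "Y": 2}
    print(di_plain_assign(list(rates), rates, num_domains=2))
    # {0: ['A', 'Y'], 1: ['B', 'X']} -- the two intensive threads are separated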

13. [Diagram, same system, with threads A and B: failing to migrate memory along with the thread leaves the MC contention in place and introduces RL.]

14. [Charts: % improvement over DEFAULT for SPEC CPU 2006 and SPEC MPI 2007.] DI-Plain hurts performance on NUMA systems because it does not migrate memory.

15. Solution #1: Distributed Intensity with memory migration (DI-Migrate). Sort threads by LLC miss rate; the goal is to isolate threads that compete for shared resources and to pull each thread's memory to the local node upon migration. Migrate competing threads, along with their memory, to different domains. [Diagram analogous to slide 12, with memory placed on the node local to each thread.]
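A minimal sketch, assuming a Linux environment, of the extra step DI-Migrate adds: when a thread is moved to another domain, its resident pages are moved to that domain's memory node as well. Using the `taskset` and `migratepages` utilities (the latter wraps the Linux migrate_pages(2) system call) is an assumption about one possible mechanism, not the paper's implementation.

    import subprocess

    def migrate_thread_with_memory(pid, old_node, new_node, cpu_list):
        """Move a thread to another NUMA domain and pull its memory with it.

        Illustrative only: pins `pid` to the destination domain's cores, then
        moves its pages from `old_node` to `new_node` so its accesses stay
        local (avoiding RL) instead of crossing the interconnect.
        """
        # Pin the thread to the cores of the destination domain.
        subprocess.run(["taskset", "-pc", cpu_list, str(pid)], check=True)
        # Migrate its resident pages to the destination domain's memory node.
        subprocess.run(["migratepages", str(pid), str(old_node), str(new_node)],
                       check=True)

    # Hypothetical example: move pid 1234 from domain 0 (node 0) to domain 1
    # (node 1, cores 4-7).
    # migrate_thread_with_memory(1234, old_node=0, new_node=1, cpu_list="4-7")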

16. [Charts: % improvement over DEFAULT for SPEC CPU 2006 (low migration rate) and SPEC MPI 2007 (high migration rate).] DI-Migrate performs too many migrations for MPI, and migrations are expensive on NUMA systems.

17. [Diagram, same system, with threads A and B: migrating too frequently causes IC contention.]

18. Solution #2: Distributed Intensity NUMA Online (DINO). DI-Migrate sorts threads by miss rate and migrates a thread and its memory whenever the thread's position in the sorted array changes, e.g. when the miss rates [2, 5, 7, 12, 21, 35, 47, 110, 150, 200] change to [1, 3, 7, 15, 27, 51, 78, 92, 170, 190]. DINO instead sorts threads into classes (C1: rate <= 10; C2: 10 < rate <= 100; C3: rate > 100) and migrates only when a thread jumps from one class to another. (A comparison sketch of the two triggers follows.)
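To make the contrast concrete, here is a minimal Python sketch (not from the paper) of the two migration triggers. The class thresholds come from the slide; the thread names and numeric example are illustrative.

    def missrate_class(rate):
        """DINO's three intensity classes (thresholds taken from the slide)."""
        if rate <= 10:
            return "C1"
        if rate <= 100:
            return "C2"
        return "C3"

    def di_migrate_triggers(old, new):
        """DI-Migrate-style trigger: flag every thread whose position in the
        miss-rate-sorted array changed between two sampling intervals."""
        rank = lambda rates: {t: i for i, (t, _) in
                              enumerate(sorted(rates.items(), key=lambda kv: kv[1]))}
        old_rank, new_rank = rank(old), rank(new)
        return [t for t in old if old_rank[t] != new_rank[t]]

    def dino_triggers(old, new):
        """DINO-style trigger: flag a thread only when it jumps between classes."""
        return [t for t in old if missrate_class(old[t]) != missrate_class(new[t])]

    # Hypothetical miss rates from two consecutive sampling intervals:
    old = {"A": 2, "B": 47, "C": 35, "D": 150}
    new = {"A": 3, "B": 51, "C": 78, "D": 170}
    print(di_migrate_triggers(old, new))  # ['B', 'C'] -- they swapped positions
    print(dino_triggers(old, new))        # []         -- no thread changed class

Because small fluctuations in miss rate reorder the array without changing any class, DINO avoids most of the migrations DI-Migrate would perform.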

19. There is only a loose correlation between miss rate and degradation, so most migrations will not pay off.

20. [Chart: average number of memory migrations per hour of execution for DI-Migrate and DINO.] DINO significantly reduces the number of migrations.

21. DINO results. [Charts: % improvement over DEFAULT for SPEC CPU 2006, SPEC MPI 2007, and LAMP.]
