A Case for NUMA-aware Contention Management on Multicore Systems


1. A Case for NUMA-aware Contention Management on Multicore Systems. Sergey Blagodurov (sergey_blagodurov@sfu.ca), Sergey Zhuravlev (sergey_zhuravlev@sfu.ca), Mohammad Dashti (mohammad_dashti@sfu.ca), Alexandra Fedorova (alexandra_fedorova@sfu.ca). USENIX ATC'11, Scheduling session, 15th of June.

2. [Diagram: one NUMA domain of an AMD Opteron 8356 (Barcelona) system — four cores, each with private L1/L2 caches, a shared L3 cache, a system request interface, a crossbar switch, a memory controller attached to memory node 0, and HyperTransport links to the other domains.]

3. [Diagram: an AMD Opteron system with four NUMA domains (16 cores). Each domain has four cores with private L1/L2 caches, a shared L3 cache, a memory controller (MC) attached to a local memory node, and HyperTransport (HT) links to the other domains.]

4. [Diagram, same system: contention for the shared last-level cache (CA).]

5. [Diagram, same system: contention for the memory controller (MC).]

6. [Diagram, same system: contention for the inter-domain interconnect (IC).]

7. [Diagram, same system, with a thread A placed on one of the cores: remote access latency (RL).]

8. [Diagram, same system, with threads A and B placed on cores: isolating memory controller contention (MC).]

9. Dominant degradation factors: memory controller (MC) and interconnect (IC) contention are the key factors hurting performance.

10. Contention-aware scheduling. Characterization method: given two threads, decide whether they will hurt each other's performance if co-scheduled. Scheduling algorithm: separate threads that are expected to interfere. (A sketch of this two-part loop follows.)
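A minimal Python sketch, not from the paper, of how these two parts fit into a periodic scheduling loop. The three callbacks (a miss-rate sampler, an assignment policy, and an affinity enforcer) are hypothetical hooks; the sketch only shows the structure.

    import time

    def contention_aware_loop(threads, num_domains, sample_missrates,
                              compute_assignment, apply_assignment,
                              interval=1.0):
        """Skeleton of a contention-aware scheduler.

        Every `interval` seconds: (1) characterize threads (e.g. by LLC miss
        rate), (2) compute a placement that separates threads expected to
        interfere, (3) enforce that placement. All three callbacks are
        hypothetical hooks, not a real API.
        """
        while True:
            missrate = sample_missrates(threads)                             # characterization
            placement = compute_assignment(threads, missrate, num_domains)   # policy
            apply_assignment(placement)                                      # enforcement
            time.sleep(interval)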

11. Characterization method. Limited observability: we do not know for sure whether threads compete, or how severely, and the hardware does not tell us. Trial and error is infeasible on large systems: we cannot try all possible combinations, and even sampling becomes difficult. A good trade-off: measure the LLC miss rate. This assumes that threads with high miss rates interfere; it does not account for the impact of cache contention, but it works well because cache contention is not the dominant factor.
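A minimal sketch of the kind of per-thread LLC miss-rate sampling this characterization relies on. The `read_counter` hook and the event names are placeholders (on Linux one would wire this to perf_event_open or a similar counter interface), so treat it as an illustration of the metric rather than a real measurement API.

    import time

    def llc_missrate(read_counter, tid, interval=0.1):
        """Estimate LLC misses per 1000 retired instructions for thread `tid`.

        `read_counter(tid, event)` is a hypothetical hook that returns the
        current raw value of a hardware counter for that thread.
        """
        m0 = read_counter(tid, "LLC_MISSES")
        i0 = read_counter(tid, "INSTRUCTIONS")
        time.sleep(interval)
        m1 = read_counter(tid, "LLC_MISSES")
        i1 = read_counter(tid, "INSTRUCTIONS")
        return 1000.0 * (m1 - m0) / max(i1 - i0, 1)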

12. Our previous work: Distributed Intensity (DI-Plain), an algorithm for UMA systems. Sort threads by LLC miss rate; the goal is to isolate threads that compete for shared resources. Migrate competing threads to different domains. [Diagram: co-scheduling the most memory-intensive threads in the same domain (e.g. A with B, X with Y) gives high contention; pairing each intensive thread with a non-intensive one (A with Y, B with X) gives low contention.]
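A minimal Python sketch of the DI-Plain placement idea as reconstructed from the slide: sort threads by LLC miss rate and pair the most intensive remaining thread with the least intensive one, so the heavy memory users land in different domains. Function and variable names are illustrative, not the authors' code.

    def di_plain_assign(threads, missrate, num_domains):
        """Assign threads to domains, pairing the most intensive remaining
        thread with the least intensive one so heavy memory users end up apart."""
        ordered = sorted(threads, key=lambda t: missrate[t], reverse=True)
        domains = {d: [] for d in range(num_domains)}
        lo, hi, d = 0, len(ordered) - 1, 0
        while lo <= hi:
            domains[d].append(ordered[lo])        # most intensive remaining
            if hi > lo:
                domains[d].append(ordered[hi])    # least intensive remaining
            lo, hi, d = lo + 1, hi - 1, (d + 1) % num_domains
        return domains

    # Hypothetical example with the slide's four threads:
    rates = {"A": 200, "B": 150, "X": 7, "Y": 2}
    print(di_plain_assign(list(rates), rates, num_domains=2))
    # {0: ['A', 'Y'], 1: ['B', 'X']} -- the two intensive threads are separated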

13. [Diagram, same system, with threads A and B: failing to migrate memory along with the thread leaves the MC contention in place and introduces RL.]

14. [Charts: % improvement over DEFAULT for SPEC CPU 2006 and SPEC MPI 2007.] DI-Plain hurts performance on NUMA systems because it does not migrate memory.

15. Solution #1: Distributed Intensity with memory migration (DI-Migrate). Sort threads by LLC miss rate; the goal is to isolate threads that compete for shared resources and to pull each thread's memory to the local node upon migration. Migrate competing threads, along with their memory, to different domains. [Diagram analogous to slide 12, with memory placed on the node local to each thread.]
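A minimal sketch, assuming a Linux environment, of the extra step DI-Migrate adds: when a thread is moved to another domain, its resident pages are moved to that domain's memory node as well. Using the `taskset` and `migratepages` utilities (the latter wraps the Linux migrate_pages(2) system call) is an assumption about one possible mechanism, not the paper's implementation.

    import subprocess

    def migrate_thread_with_memory(pid, old_node, new_node, cpu_list):
        """Move a thread to another NUMA domain and pull its memory with it.

        Illustrative only: pins `pid` to the destination domain's cores, then
        moves its pages from `old_node` to `new_node` so its accesses stay
        local (avoiding RL) instead of crossing the interconnect.
        """
        # Pin the thread to the cores of the destination domain.
        subprocess.run(["taskset", "-pc", cpu_list, str(pid)], check=True)
        # Migrate its resident pages to the destination domain's memory node.
        subprocess.run(["migratepages", str(pid), str(old_node), str(new_node)],
                       check=True)

    # Hypothetical example: move pid 1234 from domain 0 (node 0) to domain 1
    # (node 1, cores 4-7).
    # migrate_thread_with_memory(1234, old_node=0, new_node=1, cpu_list="4-7")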

16. [Charts: % improvement over DEFAULT for SPEC CPU 2006 (low migration rate) and SPEC MPI 2007 (high migration rate).] DI-Migrate performs too many migrations for MPI, and migrations are expensive on NUMA systems.

17. [Diagram, same system, with threads A and B: migrating too frequently causes IC contention.]

18. Solution #2: Distributed Intensity NUMA Online (DINO). DI-Migrate sorts threads by miss rate and migrates a thread and its memory whenever the thread's position in the sorted array changes, e.g. when the miss rates [2, 5, 7, 12, 21, 35, 47, 110, 150, 200] change to [1, 3, 7, 15, 27, 51, 78, 92, 170, 190]. DINO instead sorts threads into classes (C1: rate <= 10; C2: 10 < rate <= 100; C3: rate > 100) and migrates only when a thread jumps from one class to another. (A comparison sketch of the two triggers follows.)
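To make the contrast concrete, here is a minimal Python sketch (not from the paper) of the two migration triggers. The class thresholds come from the slide; the thread names and numeric example are illustrative.

    def missrate_class(rate):
        """DINO's three intensity classes (thresholds taken from the slide)."""
        if rate <= 10:
            return "C1"
        if rate <= 100:
            return "C2"
        return "C3"

    def di_migrate_triggers(old, new):
        """DI-Migrate-style trigger: flag every thread whose position in the
        miss-rate-sorted array changed between two sampling intervals."""
        rank = lambda rates: {t: i for i, (t, _) in
                              enumerate(sorted(rates.items(), key=lambda kv: kv[1]))}
        old_rank, new_rank = rank(old), rank(new)
        return [t for t in old if old_rank[t] != new_rank[t]]

    def dino_triggers(old, new):
        """DINO-style trigger: flag a thread only when it jumps between classes."""
        return [t for t in old if missrate_class(old[t]) != missrate_class(new[t])]

    # Hypothetical miss rates from two consecutive sampling intervals:
    old = {"A": 2, "B": 47, "C": 35, "D": 150}
    new = {"A": 3, "B": 51, "C": 78, "D": 170}
    print(di_migrate_triggers(old, new))  # ['B', 'C'] -- they swapped positions
    print(dino_triggers(old, new))        # []         -- no thread changed class

Because small fluctuations in miss rate reorder the array without changing any class, DINO avoids most of the migrations DI-Migrate would perform.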

19. There is only a loose correlation between miss rate and degradation, so most migrations will not pay off.

20. [Chart: average number of memory migrations per hour of execution for DI-Migrate and DINO.] DINO significantly reduces the number of migrations.

21. DINO results. [Charts: % improvement over DEFAULT for SPEC CPU 2006, SPEC MPI 2007, and LAMP.]
