Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems

Mohammad Dashti (1), Alexandra Fedorova (1), Justin Funston (1), Fabien Gaud (1), Renaud Lachaize (2), Baptiste Lepers (3), Vivien Quéma (4), Mark Roth (1)

(1) Simon Fraser University  (2) Université Joseph Fourier  (3) CNRS  (4) Grenoble INP

March 19, 2013
New multicore machines are NUMA

[Figure: a four-node machine; each node has cores, a memory controller (C), and local DRAM. Local access: 240 cycles / 5.5 GB/s; remote access: 300 cycles / 2.8 GB/s]
Well-known issue: remote access latency overhead

[Figure: a thread on one node accessing memory on a remote node pays 300 cycles]

◮ Impacts performance by at most 30%
New issue: memory controller and interconnect congestion

[Figure: under congestion, the same remote access costs 1200 cycles]
Current solutions

◮ Try to improve locality
  ◮ Thread scheduling and page migration (USENIX ATC’11)
  ◮ Thread clustering (EuroSys’07)
  ◮ Page replication (ASPLOS’96)
  ◮ Etc.
◮ But the main problem is MC/interconnect congestion
MC/Interconnect congestion impact on performance

◮ 16 threads, one per core
◮ Memory either allocated on first touch or interleaved

Example: Streamcluster

[Figure: per-node share of memory requests. First-touch scenario: one node serves 97% of requests, the others about 1% each; interleave scenario: 25% per node]
MC/Interconnect congestion impact on performance (2)

◮ Up to 100% performance difference between the best and worst policy

[Figure: performance difference (%) between best and worst policy for BT, CG, DC, EP, FT, IS, LU, MG, SP, UA, bodytrack, facesim, fluidanimate, streamcluster, swaptions, x264, kmeans, matrixmult, PCA, wrmem; the best policy is first touch for some applications and interleaving for others]
Why do applications benefit from interleaving? (1)

Streamcluster:

                                         Interleaving   First touch
  Local access ratio                     25%            25%
  Memory latency (cycles)                471            1169
  Memory controller imbalance            7%             200%
  Interconnect imbalance                 21%            86%
  Perf. improvement over first touch     105%           -

⇒ Interconnect and memory controller congestion drive up memory access latency
Why do applications benefit from interleaving? (2)

PCA:

                                         Interleaving   First touch
  Local access ratio                     25%            33%
  Memory latency (cycles)                480            665
  Memory controller imbalance            4%             154%
  Interconnect imbalance                 19%            64%
  Perf. improvement over first touch     38%            -

⇒ Balancing the load on memory controllers is more important than improving locality
Conclusions

◮ Balance is more important than locality
◮ Memory controller and interconnect congestion can drive up access latency
◮ Always manually interleaving memory is NOT the way to go

[Figure: performance of manual interleaving relative to default Linux (%) for BT, CG, DC, EP, FT, IS, LU, MG, SP, UA; several applications lose up to 40%]

⇒ Need a new solution
Carrefour: a new memory traffic management algorithm

◮ First goal: balance memory pressure on the interconnect and MCs
◮ Second goal: improve locality
Mechanism #1: Page relocation

[Figure: a page is migrated to the node of the thread accessing it]

◮ Pros: better locality, lower interconnect load, balanced load on MCs
◮ Con: cannot be applied if the region is shared by multiple threads
Mechanism #2: Page replication

[Figure: a copy of the page is created on each node that accesses it]

◮ Pros: better locality, lower interconnect load, balanced load on MCs
◮ Cons: higher memory consumption, expensive synchronization
Mechanism #3: Page interleaving

[Figure: the pages of a region are spread round-robin across all nodes]

◮ Pros: balanced load on the interconnect, balanced load on MCs
◮ Con: can decrease locality
Carrefour in detail

◮ Goal: combine these techniques to:
  1. Balance memory pressure
  2. Increase locality

[Figure: decision flow — per-application profiling (memory intensity, memory imbalance, local access ratio, memory read ratio) feeds per-application decisions (is memory congested? enable migration / interleaving / replication?), which combine with per-page metrics (read/write ratio, set of accessing nodes) to migrate, interleave, or replicate each page]
Carrefour in detail (2)

[Figure: same decision flow, annotating the per-application profiling (hardware counters, HWC) and the per-page metrics (IBS) as the expensive steps]

◮ Accurate and low-overhead page access statistics
  ◮ Adaptive IBS sampling
  ◮ Include cache accesses
  ◮ Use hardware counter feedback
Carrefour in detail (3)

◮ Efficient page replication
  ◮ Careful implementation (fine-grained locks)
  ◮ Avoid data synchronization
Evaluation

◮ Carrefour is implemented in Linux 3.6
◮ Machines
  ◮ 16 cores, 4 nodes, 64 GB of RAM
  ◮ 24 cores, 4 nodes, 64 GB of RAM
◮ Benchmarks (23 applications)
  ◮ PARSEC
  ◮ FaceRec
  ◮ Metis (Map/Reduce)
  ◮ NAS
◮ Compare Carrefour to
  ◮ Linux (default)
  ◮ Linux AutoNUMA
  ◮ Manual interleaving
Performance

[Figure: performance improvement with respect to Linux (%) for AutoNUMA and Carrefour across the benchmark applications; Carrefour gains up to 270% and rarely loses more than a few percent]

⇒ Carrefour significantly improves performance!
Carrefour overhead

  Configuration   Maximum overhead vs. default
  AutoNUMA        25%
  Carrefour       4%

◮ Carrefour's average overhead when no decisions are taken: 2%
Conclusion

◮ In modern NUMA systems:
  ◮ Remote latency overhead is not the main bottleneck
  ◮ MC and interconnect congestion can drive up memory latency
◮ Carrefour: a memory traffic management algorithm
  ◮ First goal: balance memory pressure on the interconnect and MCs
  ◮ Second goal: improve locality
◮ Performance:
  ◮ Improves performance significantly (up to 270%)
  ◮ Outperforms other solutions