Thread and Memory Placement on NUMA Systems: Asymmetry Matters


  1. Thread and Memory Placement on NUMA Systems: Asymmetry Matters
     Baptiste Lepers, Alexandra Fedorova (Simon Fraser University), Vivien Quéma (Grenoble INP)
     ATC 2015

  2. Introduction
     Current thread and memory placement minimizes hop count (e.g. in Linux).
     Contributions:
     ◮ Interconnect links are asymmetric; bandwidth matters more than hop count.
     ◮ AsymSched, an algorithm that dynamically places threads and memory.

  3. Inter-node bandwidths for 4 AMD Opteron 6272 processors
     [Figure: interconnect topology of nodes 0 to 7. Legend: 8b link, 16b link, 16b/8b link.]

  4. Measurements
     Applications running on 3 nodes, with different node placements.
     [Two copies of the interconnect topology diagram. Legend: 8b link, 16b link, 16b/8b link.]
     [Figure 2: Performance difference between the best and worst thread placement with respect to the average thread placement. Y-axis: Perf. improvement relative to average placement (%). Series: Worst Placement, Best Placement.]
     [Figure 3: Difference in latency of memory accesses between the best and worst thread placement. Y-axis: Latency of memory accesses compared to average placement (cycles). Series: Worst Placement, Best Placement.]

  5. More Measurements
     streamcluster running on 2 nodes, with different node placements.
     Columns: nodes used; execution time (s); diff with 0-1 (%); latency of memory accesses in cycles (compared to 0-1, %); % of accesses via 2-hop links; bandwidth to the “master” node (MB/s).
     The node and master-thread-node labels are legible only for the first two rows.

     Nodes   Time (s)   Diff (%)   Latency (cycles)   2-hop (%)   Bandwidth (MB/s)
     0-1     148        0%         750                0           5598
     0-4     228        56%        1169 (56%)         0           2999
             228        56%        1179 (57%)         0           2973
             168        15%        855 (14%)          0           4329
             340        133%       1527 (104%)        98          1915
             185        27%        1040 (39%)         98          3741
             340        133%       1601 (113%)        98          1903
             228        56%        1206 (61%)         98          2884
             185        27%        1020 (36%)         0           3748
             338        132%       1614 (115%)        98          1928
             338        132%       1612 (115%)        98          1891
             230        58%        1200 (60%)         0           2880
             167        15%        867 (16%)          98          3748
             225        54%        1220 (63%)         0           3014
             230        58%        1205 (60%)         0           2959
             226        55%        1203 (60%)         98          2880

  6. AsymSched
     ◮ User-level thread and memory placement manager
     ◮ Continuously measures communication
     ◮ Decides every second whether threads or memory should be migrated

  7. AsymSched – Measurement
     ◮ Reads a hardware counter (data accesses from CPU to node)
     ◮ No counter is available for CPU-to-CPU communication
     ◮ For decision making, it assumes:
       ◮ Threads on the same node share data.
       ◮ Between nodes with ’high’ communication, threads of the same application share data.
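
The two assumptions on this slide can be sketched as a small grouping step. This is an illustrative sketch only, not AsymSched's actual code: the function name `infer_clusters`, the layout of the `accesses` counter readings, and the `threshold` that defines 'high' communication are all hypothetical. Same-node grouping is also restricted to threads of the same application here, which is an assumption based on the clustering rule on the Decision slide.

```python
from collections import defaultdict

def infer_clusters(thread_node, thread_app, accesses, threshold):
    """Group threads that are presumed to share data.

    thread_node: dict thread id -> node the thread runs on
    thread_app:  dict thread id -> application id
    accesses:    dict (src_node, dst_node) -> number of data accesses from
                 CPUs on src_node to memory on dst_node (hypothetical layout)
    threshold:   traffic above which two nodes count as communicating
                 'highly' (assumed value, not given in the slides)
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Assumption 1: threads of one application on the same node share data,
    # so each (application, node) pair starts as one group.
    groups = {(thread_app[t], thread_node[t]) for t in thread_node}

    # Assumption 2: groups of the same application on two nodes with high
    # traffic between them share data, so they are merged.
    for app_a, node_a in groups:
        for app_b, node_b in groups:
            if app_a == app_b and node_a != node_b:
                traffic = (accesses.get((node_a, node_b), 0)
                           + accesses.get((node_b, node_a), 0))
                if traffic > threshold:
                    union((app_a, node_a), (app_b, node_b))

    clusters = defaultdict(set)
    for t in thread_node:
        clusters[find((thread_app[t], thread_node[t]))].add(t)
    return sorted(sorted(c) for c in clusters.values())
```

For example, with threads 0 and 1 of one application on node 0, thread 2 of the same application on node 1, and heavy traffic between nodes 0 and 1, the three threads end up in one cluster, while a thread of another application on node 1 stays separate.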

  8. AsymSched – Decision
     ◮ Puts threads of the same application that share data into clusters.
     ◮ Each cluster gets a weight $C_w = \log(\#\text{remote memory accesses})$.
     ◮ For each placement (mapping of clusters to nodes), compute $P_w = \sum_{C \in \text{Clusters}} C_w \cdot (\text{max bandwidth for } C)$.
     ◮ Select the placements whose $P_w \geq 90\%$ of the maximal $P_w$. Of those, choose the one with the fewest page migrations.
     ◮ If the cost of memory migration (assuming 0.3 s per GB) is too high, do not apply the placement.
     ◮ Because of symmetry, not all placements need to be tested. “Obviously bad” placements are also ignored.
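
The selection rule on this slide can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the function names, the candidate dictionary layout, and the `max_migration_seconds` cut-off are hypothetical (the slide only says the cost must not be "too high"), and the natural logarithm is an assumption because the slide does not state the base.

```python
import math

def placement_weight(remote_accesses, bandwidth):
    """P_w = sum over clusters of C_w * (max bandwidth for C),
    with C_w = log(#remote memory accesses). Natural log is assumed."""
    return sum(math.log(remote_accesses[c]) * bandwidth[c]
               for c in remote_accesses)

def choose_placement(remote_accesses, candidates, max_migration_seconds,
                     seconds_per_gb=0.3):
    """Pick a placement following the rule on the slide.

    remote_accesses: dict cluster -> number of remote memory accesses (>= 1)
    candidates: list of dicts (hypothetical layout) with keys
        'name'      label of the placement
        'bandwidth' dict cluster -> max bandwidth under this placement
        'pages'     number of page migrations the placement requires
        'gb'        amount of memory to migrate, in GB
    max_migration_seconds: hypothetical limit for a 'too high' cost
    Returns the chosen candidate, or None if migration is too expensive.
    """
    weights = [placement_weight(remote_accesses, c["bandwidth"])
               for c in candidates]
    best = max(weights)
    # Keep placements within 90% of the best weight.
    good = [c for c, w in zip(candidates, weights) if w >= 0.9 * best]
    # Among those, prefer the one with the fewest page migrations.
    choice = min(good, key=lambda c: c["pages"])
    # Estimated migration cost, using 0.3 s per GB from the slide.
    if choice["gb"] * seconds_per_gb > max_migration_seconds:
        return None
    return choice
```

A placement that is slightly below the best weight but needs far fewer page migrations wins under this rule, which is the point of the 90% tolerance.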

  9. AsymSched – Migration
     ◮ Uses dynamic (lazy) migration.
     ◮ If more than 90% of accesses still go to the old node after 2 seconds, it does a full migration.
     ◮ Full migration uses a special system call that is faster than migrate_pages because it stops the application and needs fewer locks.

     Application      Migrated memory (GB)   Avg. time, Linux syscall (ms)   Avg. time, fast migration (ms)
     cg.B             0.17                   860                             51
     ft.C             2.5                    12700                           380
     is.D             20                     101000                          3050
     sp.A             0.1                    490                             30
     streamcluster    0.15                   750                             45
     graph500         0.3                    1500                            90
     specJBB          10                     50500                           1500
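
The table can be turned into a cost per GB to compare the two migration paths. The sketch below only does arithmetic on the numbers from the slide; the helper names are hypothetical, and `needs_full_migration` is one reading of the 90% rule, not the actual AsymSched check. From these numbers, the Linux syscall costs about 5 s per GB, while the fast migration costs roughly 0.15 to 0.3 s per GB.

```python
# (migrated GB, Linux syscall ms, fast migration ms), copied from the slide
MIGRATION = {
    "cg.B": (0.17, 860, 51),
    "ft.C": (2.5, 12700, 380),
    "is.D": (20, 101000, 3050),
    "sp.A": (0.1, 490, 30),
    "streamcluster": (0.15, 750, 45),
    "graph500": (0.3, 1500, 90),
    "specJBB": (10, 50500, 1500),
}

def seconds_per_gb(gb, ms):
    """Migration cost normalized to seconds per GB."""
    return ms / 1000.0 / gb

def needs_full_migration(old_node_accesses, total_accesses, threshold=0.9):
    """Rule from the slide: after the 2 s lazy phase, fall back to a full
    migration if more than 90% of accesses still go to the old node."""
    if total_accesses == 0:
        return False
    return old_node_accesses / total_accesses > threshold

if __name__ == "__main__":
    for app, (gb, linux_ms, fast_ms) in MIGRATION.items():
        print(f"{app:14s} linux {seconds_per_gb(gb, linux_ms):5.2f} s/GB   "
              f"fast {seconds_per_gb(gb, fast_ms):5.2f} s/GB   "
              f"speedup {linux_ms / fast_ms:4.1f}x")
```

For the smaller migrations (cg.B, sp.A, streamcluster, graph500) the fast path works out to 0.3 s per GB, the same figure the Decision slide uses as its cost estimate.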

  10. Evaluation – 1 application on 3 nodes
     [Figure 4: Performance difference between the best and worst static thread placement, dynamic memory placement, and AsymSched. Y-axis: Perf. improvement relative to average placement (%). Series: Worst placement, Best placement, Dynamic Memory Placement Only, AsymSched.]
     [Figure 5: Memory latency under the best and worst static thread placement, dynamic memory placement, and AsymSched. Y-axis: Latency of memory accesses compared to average placement (cycles). Same series as Figure 4.]
     Benchmarks in both figures: bt.B.x, cg.C.x, ep.C.x, ft.C.x, is.D.x, lu.B.x, mg.C.x, sp.A.x, ua.B.x, swaptions, kmeans, matrixmultiply, wc, wr, wrmem, graph500, specjbb, streamcluster, pca, facerec.

  11. Evaluation – 3 applications
     [Figure: Perf. improvement relative to average placement (%). Series: Worst Thread Placement, Best Thread Placement, Dynamic Memory Placement, AsymSched.]
     [Figure: Latency of memory accesses compared to average placement (cycles). Same series.]
     X-axis labels: specjbb-3, graph500-3, matrixmultiply-2, streamcluster-3, graph500-3, specjbb-2, streamcluster-3, streamcluster-3, streamcluster-2, specjbb-5, matrixmultiply-3, specjbb-5, streamcluster-3.

  12. Discussion
     ◮ What’s the matter with memory migration?
     ◮ How well would this work without the magic constants?
     ◮ What if the number of threads is not a multiple of the number of cores in a NUMA domain?
