NUMA obliviousness through memory mapping

  1. NUMA obliviousness through memory mapping. Mrunal Gawade, Martin Kersten, CWI Amsterdam. DaMoN 2015 (1st June 2015), Melbourne, Australia.

  2. NUMA architecture: Intel Xeon E5-4657L v2 @ 2.40GHz

  3. Memory mapping
     What is it? The operating system maps disk files into memory, e.g. executable file mapping.
     How is it done? Via the mmap() and munmap() system calls.
     Relevance for the database world? In-memory columnar storage: column disk files are mapped to memory.
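
Not from the slides: a minimal C sketch of the mmap()/munmap() calls named above, mapping a hypothetical column file (the name lineitem_quantity.col and its raw-int layout are invented for illustration) the way a memory-mapped columnar store keeps its columns accessible:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical column file; assumed to hold raw 32-bit ints. */
        int fd = open("lineitem_quantity.col", O_RDONLY);
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return EXIT_FAILURE; }

        /* Map the whole file read-only; the OS faults pages in lazily. */
        void *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }
        close(fd); /* the mapping stays valid after close */

        /* Touching a value triggers the page fault that loads it. */
        printf("first value: %d\n", ((const int *)base)[0]);

        munmap(base, st.st_size);
        return EXIT_SUCCESS;
    }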

  4. Motivation: NUMA effects on memory-mapped columnar storage under an analytic workload.

  5. TPC-H Q1 (4 sockets, 100GB, MonetDB)
     [Bar chart: execution time in seconds (0-35) against the sockets on which memory is allocated (combinations drawn from sockets 0-3).]
     The database server process is started with numactl -N 0,1 -m <sockets>, where the memory sockets are varied between 0 and 3.

  6. Contributions
     ● NUMA-oblivious execution (shared everything) performs relatively well compared to NUMA-aware execution (shared nothing), shown on an SQL workload.
     ● Insights into the effect of memory mapping on NUMA obliviousness, using micro-benchmarks.
     ● A distributed database system across multiple sockets (shared nothing) reduces remote memory accesses.

  7. NUMA-oblivious vs NUMA-aware plans
     ● NUMA_Obliv (shared everything): default parallel plans in MonetDB; the "Lineitem" and "Orders" tables are sliced.
     ● NUMA_Shard (a variation of NUMA_Obliv): shard-aware plans in MonetDB; only the "Lineitem" table is sharded into 4 pieces (on orderkey) and sliced.
     ● NUMA_Distr (shared nothing): socket-aware plans in MonetDB; the "Lineitem" and "Orders" tables are sharded into 4 pieces (on orderkey) and sliced; dimension tables are replicated.

  8. System configuration
     ● Intel Xeon E5-4657L v2 @ 2.40GHz, 4 sockets, 12 cores per socket (96 threads in total with Hyper-Threading)
     ● Caches: L1 = 32KB, L2 = 256KB, shared L3 = 30MB
     ● 1TB four-channel DDR3 memory (256 GB per socket)
     ● OS: Fedora 20
     ● Data set: TPC-H 100GB
     ● Tools: numactl, Intel PCM, Linux perf
     ● MonetDB, an open-source system with memory-mapped columnar storage

  9. TPC-H performance
     [Bar chart: execution time in seconds (0-6) for TPC-H queries 4, 6, 15, and 19 under NUMA_Obliv, NUMA_Shard, and NUMA_Distr.]
     NUMA_Shard is a variation of NUMA_Obliv with a sharded & partitioned "orders" table.

  10. Micro-experiments on a modified Q6
     Why Q6? select count(*) from lineitem where l_quantity > 24000000;
     ● A selection on the "lineitem" table
     ● Easily parallelizable (sketched below)
     ● NUMA-effects analysis is easy (read-only query)
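
A sketch of the "easily parallelizable" point (not MonetDB's actual operator code): each thread counts the qualifying values in its slice of the mapped column, here expressed with an OpenMP reduction; compile with -fopenmp:

    #include <stddef.h>
    #include <stdint.h>

    /* Count values above a threshold, as in the modified Q6 predicate. */
    int64_t count_gt(const int32_t *l_quantity, size_t n, int32_t threshold)
    {
        int64_t count = 0;
        /* Each thread scans a contiguous slice of the column; the
         * per-thread partial counts are summed by the reduction clause. */
        #pragma omp parallel for reduction(+ : count)
        for (size_t i = 0; i < n; i++)
            count += (l_quantity[i] > threshold);
        return count;
    }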

  11. Process and memory affinity
     Example: numactl -C 0-11,12-23,24-35 -m 0,1,2 "Database Server"
     Socket 0: cores 0-11 (hyper-threaded 48-59)
     Socket 1: cores 12-23 (hyper-threaded 60-71)
     Socket 2: cores 24-35 (hyper-threaded 72-83)
     Socket 3: cores 36-47 (hyper-threaded 84-95)
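
For illustration only, the affinity that numactl -C/-m establishes from the outside can also be set from inside a process with libnuma (header <numa.h>, link with -lnuma); this sketch, an assumption rather than anything the slides describe, binds the calling thread to socket 0 and allocates a buffer from socket 0's memory:

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return EXIT_FAILURE;
        }
        /* Run this thread on the CPUs of node 0 (cores 0-11 / 48-59 here). */
        numa_run_on_node(0);

        /* Allocate 1 GB backed by node 0's memory, like numactl -m 0. */
        size_t size = 1UL << 30;
        char *buf = numa_alloc_onnode(size, 0);
        if (buf == NULL) { perror("numa_alloc_onnode"); return EXIT_FAILURE; }

        buf[0] = 1; /* first touch actually faults the page in on node 0 */
        numa_free(buf, size);
        return EXIT_SUCCESS;
    }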

  12. NUMA_Obliv: micro-experiments on Q6

  13. Local vs remote memory access
     [Three charts: local and remote memory accesses in millions (0-80) against the number of threads (12-96), for the settings PMA=yes/BCC=yes, PMA=no/BCC=yes, and PMA=no/BCC=no.]
     Process and memory affinity = PMA
     Buffer cache cleared = BCC (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches)

  14. Execution time (robustness)
     [Three charts: execution time in milliseconds (0-350) against the number of threads (12-96). PMA=yes/BCC=yes is the most robust, PMA=no/BCC=yes less robust, and PMA=no/BCC=no the least robust.]
     Process and memory affinity = PMA
     Buffer cache cleared = BCC (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches)

  15. Increase in threads = more remote accesses?

  16. Distribution of mapped pages
     [Stacked bar chart: proportion of mapped pages per socket (sockets 0-3) against the number of threads (12-48), read from /proc/<process id>/numa_maps.]
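
For reference, the per-socket page placement behind this chart can be read at runtime from that file; a minimal C sketch that dumps the caller's own numa_maps (each line carries fields like N0=1024 N1=512, the page counts per node):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/self/numa_maps", "r");
        if (f == NULL) { perror("fopen"); return 1; }

        char line[4096];
        /* One line per mapping; the N<node>=<pages> fields give the
         * per-socket page counts plotted on this slide. */
        while (fgets(line, sizeof line, f) != NULL)
            fputs(line, stdout);

        fclose(f);
        return 0;
    }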

  17. # CPU migrations
     [Bar chart: number of CPU migrations (0-200) against the number of threads (12-96).]

  18. Why are remote accesses bad?
     [Bar chart: execution time in milliseconds (0-160) for NUMA_Obliv and NUMA_Distr on the modified TPC-H Q6.]
     NUMA_Obliv: 69 million (M) local accesses, 136 M remote accesses
     NUMA_Distr: 196 M local accesses, 9 M remote accesses

  19. NUMA_Distr to minimize remote accesses?

  20. Comparison with Vectorwise
     [Bar chart: execution time in seconds (0-6) for TPC-H queries 4, 6, 15, and 19 under MonetDB NUMA_Shard, MonetDB NUMA_Distr, Vector_Def, and Vector_Distr.]
     Vectorwise has no NUMA awareness and also uses a dedicated buffer manager.

  21. Comparison with Hyper
     [Bar chart: execution time in seconds (0-3.5) for TPC-H queries 2, 4, 6, 9, 12, 14, 15, and 19 under MonetDB NUMA_Distr and Hyper, annotated in red with speed-up numbers such as 1.15, 2.3, 5.7, and 2.5.]
     The red numbers indicate the speed-up of Hyper over the MonetDB NUMA_Distr plans. Hyper generates NUMA-aware, LLVM JIT-compiled, fused operator pipeline plans.

  22. Conclusion
     ● NUMA obliviousness fares reasonably well compared to NUMA awareness.
     ● Process and memory affinity helps NUMA-oblivious plans perform robustly.
     ● A simple distributed shared-nothing database configuration can compete with state-of-the-art databases.

  23. Thank you
