NUMA obliviousness through memory mapping
Mrunal Gawade, Martin Kersten
CWI, Amsterdam
DaMoN 2015 (1st June 2015), Melbourne, Australia
NUMA architecture
Intel Xeon E5-4657L v2 @2.40GHz
Memory mapping
● What is it? The operating system maps disk files into the process address space, e.g. executable file mapping.
● How is it done? Via the system calls mmap() and munmap().
● Relevance for the database world? In-memory columnar storage: disk files are mapped to memory (a minimal sketch follows this list).
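For illustration only, a minimal C sketch of file-backed mapping, assuming a hypothetical column file name and scan; this is not MonetDB's actual storage code:

```c
/* Sketch (assumption, not MonetDB's code): map a column file
 * read-only into memory, the way a memory-mapped columnar store
 * makes a disk-resident column addressable as an array. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "lineitem_quantity.col"; /* hypothetical column file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* MAP_SHARED: pages stay backed by the file; the OS faults them
     * in on demand and, by the default first-touch policy, places
     * each page on the NUMA node of the thread that touches it. */
    int *col = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (col == MAP_FAILED) { perror("mmap"); return 1; }

    /* Scan the column as if it were an in-memory array. */
    long count = 0;
    size_t n = st.st_size / sizeof(int);
    for (size_t i = 0; i < n; i++)
        if (col[i] > 24) count++;
    printf("selected %ld of %zu\n", count, n);

    munmap(col, st.st_size);
    close(fd);
    return 0;
}
```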
Motivation
Memory-mapped columnar storage and its NUMA effects in analytical workloads
TPC-H Q1 … (4 sockets, 100GB, MonetDB)
[Chart: execution time (sec, 0-35) vs. sockets on which memory is allocated, varied between sockets 0-3]
numactl -N 0,1 -m <varied between sockets 0-3> "Database server process"
Contributions
● NUMA-oblivious (shared-everything) execution is relatively good compared to NUMA-aware (shared-nothing) execution (using an SQL workload).
● Insights into the effect of memory mapping on NUMA obliviousness (using micro-benchmarks).
● A distributed database system using multiple sockets (shared-nothing) reduces remote memory accesses.
NUMA oblivious vs NUMA aware plans
● NUMA_Obliv (shared-everything): default parallel plans in MonetDB; "Lineitem" and "Orders" tables sliced.
● NUMA_Shard (variation of NUMA_Obliv): shard-aware plans in MonetDB; "Lineitem" and "Orders" tables sharded in 4 pieces (on orderkey) and sliced.
● NUMA_Distr (shared-nothing): socket-aware plans in MonetDB; only the "Lineitem" table is sharded in 4 pieces (on orderkey) and sliced; dimension tables replicated.
System configuration
● Intel Xeon E5-4657L v2 @2.40GHz, 4 sockets, 12 cores per socket (96 threads in total with Hyper-Threading)
● Caches: L1 = 32KB, L2 = 256KB, shared L3 = 30MB
● 1TB four-channel DDR3 memory (256GB per socket)
● OS: Fedora 20
● Data set: TPC-H 100GB
● Tools: numactl, Intel PCM, Linux perf
● MonetDB: open-source system with memory-mapped columnar storage
TPC-H performance
[Chart: execution time (sec, 0-6) of NUMA_Obliv, NUMA_Shard, and NUMA_Distr plans on TPC-H queries 4, 6, 15, and 19]
NUMA_Shard is a variation of NUMA_Obliv with a sharded & partitioned "orders" table.
Micro-experiments on modified Q6
Why Q6?
● select count(*) from lineitem where l_quantity > 24000000;
● Selection on the "lineitem" table
● Easily parallelizable
● NUMA effects are easy to analyze (read-only query)
Process and memory affinity
Example: numactl -C 0-11,12-23,24-35 -m 0,1,2 "Database Server"

Socket       0      1      2      3
cores        0-11   12-23  24-35  36-47
cores (HT)   48-59  60-71  72-83  84-95

A libnuma sketch of the same binding follows.
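As a rough illustration, a minimal C sketch using libnuma (an assumption; the paper drives affinity through the numactl command line, not this code) that binds the calling thread to one socket and allocates its buffer from that socket's local memory:

```c
/* Hedged sketch: programmatic equivalent of "numactl -N 0 -m 0".
 * Build with: gcc affinity.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    int node = 0; /* socket 0 */

    /* Run this thread only on the cores of the chosen socket. */
    numa_run_on_node(node);

    /* Allocate the working buffer from that socket's local memory,
     * so accesses during the scan stay local. */
    size_t size = 64 * 1024 * 1024;
    int *buf = numa_alloc_onnode(size, node);
    if (!buf) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }

    /* First touch: fault the pages in while bound to the node. */
    for (size_t i = 0; i < size / sizeof(int); i++)
        buf[i] = (int)i;

    printf("buffer of %zu bytes placed on node %d\n", size, node);
    numa_free(buf, size);
    return 0;
}
```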
NUMA_Obliv: micro-experiments on Q6
Local vs remote memory access
[Three charts: memory accesses (millions, 0-80) vs. number of threads (12-96), each showing local vs. remote memory accesses, for the settings: PMA=yes, BCC=yes; PMA=no, BCC=yes; PMA=no, BCC=no]
PMA = process and memory affinity
BCC = buffer cache cleared (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches)
Execution time (robustness)
[Three charts: execution time (milliseconds, 0-350) vs. number of threads (12-96) for: PMA=yes, BCC=yes (most robust); PMA=no, BCC=yes (less robust); PMA=no, BCC=no (least robust)]
PMA = process and memory affinity
BCC = buffer cache cleared (echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches)
Increase in threads = more remote accesses?
Distribution of mapped pages
[Stacked chart: proportion of mapped pages (%) on sockets 0-3 vs. number of threads (12-48)]
Source: /proc/<process id>/numa_maps (a short sketch of reading it follows)
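A minimal C sketch (an assumption, not the paper's tooling) of how this per-socket page placement can be read: each line of /proc/self/numa_maps carries N<node>=<pages> fields giving the number of pages of a mapping resident on each node.

```c
/* Dump /proc/self/numa_maps; the N0=..., N1=..., ... fields show
 * how many pages of each mapping sit on each socket. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/self/numa_maps", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[4096];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout); /* e.g. "... N0=123 N1=456 ..." */

    fclose(f);
    return 0;
}
```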
# CPU migrations
[Chart: number of CPU migrations (0-200) vs. number of threads (12-96)]
Why are remote accesses bad?
[Chart: execution time (milliseconds, 0-160) of NUMA_Obliv vs. NUMA_Distr on modified TPC-H Q6]

             #Local accesses    #Remote accesses
NUMA_Obliv   69 million (M)     136 M
NUMA_Distr   196 M              9 M
NUMA_Distr to minimize remote accesses?
Comparison with Vectorwise
[Chart: execution time (sec, 0-6) of MonetDB NUMA_Shard, MonetDB NUMA_Distr, Vector_Def, and Vector_Distr on TPC-H queries 4, 6, 15, and 19]
Vectorwise has no NUMA awareness and uses a dedicated buffer manager.
Comparison with Hyper
[Chart: execution time (sec, 0-3.5) of MonetDB NUMA_Distr vs. Hyper on TPC-H queries 4, 6, 9, 12, 14, 15, and 19]
The red numbers in the chart (e.g. 1.15, 2.3, 5.7, 2.5) indicate the speed-up of Hyper over MonetDB NUMA_Distr plans. Hyper generates NUMA-aware, LLVM JIT-compiled, fused operator pipeline plans.
Conclusion
● NUMA obliviousness fares reasonably well compared to NUMA awareness.
● Process and memory affinity helps NUMA-oblivious plans perform robustly.
● A simple distributed shared-nothing database configuration can compete with state-of-the-art database systems.
Thank you