An Analysis of SMP Memory Allocators: MapReduce on Large Shared-Memory Systems
Robert Döbbelin, Thorsten Schütt, Alexander Reinefeld
Zuse Institute Berlin
September 10, 2012
1 / 11
SGI Altix UltraViolet (UV) 1000
[Diagram: one blade with two Intel Xeon X7560 sockets (8 cores each), 32 GB DDR3 RAM per socket, QPI between the sockets, and a HUB with NUMAlink5 connections to the other blades]
- 32 blades in one rack
- 2 × 8 cores per blade
- 64 GB memory per blade
- QPI for memory on the same blade
- inter-blade communication via NUMAlink5
2 / 11
Memory allocation
First-touch policy
When a process requests memory from the OS:
- the thread gets an (unmapped) virtual address
- a page fault occurs on the first touch
- the OS allocates the physical page on the NUMA node on which the accessing thread is running
Once a virtual address is mapped, this mapping persists until the page is released to the OS.
3 / 11
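A minimal sketch of how the first-touch policy is typically exploited (not from the slides; the OpenMP loop, array name, and size are illustrative assumptions): initializing data with the same thread layout that later accesses it makes each thread fault in "its" pages, so they end up on that thread's NUMA node.

    // Sketch: first-touch NUMA placement via parallel initialization (assumed example).
    #include <omp.h>
    #include <cstdlib>
    #include <cstdio>

    int main() {
        const size_t n = 1UL << 26;   // 64 Mi doubles, ~512 MiB (illustrative size)
        double *a = static_cast<double *>(malloc(n * sizeof(double)));

        // malloc only reserves virtual addresses; no physical pages exist yet.
        // The parallel initialization loop causes each thread to first-touch
        // its chunk, so the OS places those pages on that thread's NUMA node.
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; ++i)
            a[i] = 0.0;

        // Later passes with the same static schedule hit mostly local memory.
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; ++i)
            a[i] += 1.0;

        printf("%f\n", a[0]);
        free(a);
        return 0;
    }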
Memory allocation
Successive malloc / free operations
[Diagram: a process with Threads A-D on top of the user-level allocator and the OS]
- malloc: Thread A gets a virtual page and touches it
- free: the page may be released to the allocator's cache
- malloc: Thread D gets this page
Thread D got remote memory from the allocator!
4 / 11
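A minimal sketch of this scenario (not from the slides): Thread A first-touches a buffer and frees it; Thread D then allocates and may receive the already-mapped pages, which stay on A's node. Whether this reproduces depends on the allocator and its caching; the use of numa_run_on_node, move_pages, and the buffer size are illustrative assumptions (requires libnuma, e.g. g++ -O2 -pthread remote.cpp -lnuma).

    // Sketch: an allocator cache can hand pages first-touched on node 0 to a
    // thread running on node 1 (assumed example; outcome is allocator-dependent).
    #include <numa.h>
    #include <numaif.h>
    #include <thread>
    #include <cstring>
    #include <cstdio>
    #include <cstdlib>

    static int node_of(void *p) {
        int status = -1;
        // With nodes == NULL, move_pages() only reports the node of each page.
        move_pages(0, 1, &p, nullptr, &status, 0);
        return status;
    }

    int main() {
        if (numa_available() < 0 || numa_max_node() < 1) {
            fprintf(stderr, "need at least two NUMA nodes\n");
            return 1;
        }
        const size_t sz = 64 << 10;  // small enough to stay on the heap (assumption)
        void *first = nullptr;

        std::thread a([&] {
            numa_run_on_node(0);             // "Thread A" on node 0
            first = malloc(sz);
            memset(first, 1, sz);            // first touch: pages land on node 0
            printf("A touched buffer on node %d\n", node_of(first));
            free(first);                     // block may go to the allocator cache
        });
        a.join();

        std::thread d([&] {
            numa_run_on_node(1);             // "Thread D" on node 1
            void *p = malloc(sz);            // may receive the cached, mapped block
            memset(p, 2, sz);                // no page fault if already mapped
            printf("D got %s buffer on node %d\n",
                   p == first ? "the same" : "a different", node_of(p));
            free(p);
        });
        d.join();
        return 0;
    }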
MapReduce
MapReduce workflow
[Diagram: map tasks feeding an optional combine step, a shuffle step, and reduce tasks]
MapReduce stages:
- map: apply the map-function to the input
- combine (optional)
- shuffle: merge partitions
- reduce: apply the reduce-function to all kv-pairs with the same key
- size of buffers unknown a priori
- iterative MapReduce: the output of one MR step is the input for the next
5 / 11
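A minimal, single-node, word-count style sketch of these stages (not the authors' MR-Search code; all names and types are illustrative assumptions):

    // Sketch: the map / shuffle / reduce stages from the slide (assumed example).
    #include <map>
    #include <string>
    #include <sstream>
    #include <utility>
    #include <vector>
    #include <iostream>

    using KV = std::pair<std::string, long>;

    // map: apply the map-function to the input, emitting kv-pairs
    static std::vector<KV> map_stage(const std::string &line) {
        std::vector<KV> out;
        std::istringstream in(line);
        for (std::string w; in >> w; ) out.emplace_back(w, 1);
        return out;
    }

    // shuffle: merge partitions so that all values of one key come together
    static std::map<std::string, std::vector<long>>
    shuffle_stage(const std::vector<std::vector<KV>> &partitions) {
        std::map<std::string, std::vector<long>> groups;
        for (const auto &part : partitions)
            for (const auto &kv : part) groups[kv.first].push_back(kv.second);
        return groups;
    }

    // reduce: apply the reduce-function to all kv-pairs with the same key
    static long reduce_stage(const std::vector<long> &vals) {
        long sum = 0;
        for (long v : vals) sum += v;
        return sum;
    }

    int main() {
        std::vector<std::string> input = {"a b a", "b c"};
        std::vector<std::vector<KV>> partitions;
        for (const auto &line : input) partitions.push_back(map_stage(line));
        for (const auto &g : shuffle_stage(partitions))
            std::cout << g.first << " " << reduce_stage(g.second) << "\n";
        return 0;
    }

The sizes of the intermediate vectors and shuffled groups are not known a priori, so each iteration triggers many concurrent allocations; this is where the choice of allocator matters.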
MapReduce
How to speed things up?
- Memory allocators for SMPs (tbbmalloc): fast concurrent allocations
- Memory reuse (reuse): reuse buffers for subsequent MapReduce iterations
- Memory preallocation (prealloc): allocate the needed amount of memory for each buffer up front
6 / 11
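A minimal sketch of the reuse and prealloc ideas (not the authors' code; the struct and function names are illustrative assumptions): each thread keeps its intermediate buffer across MapReduce iterations instead of freeing it, and optionally reserves the expected size before the first iteration, so the underlying pages stay mapped on the thread's own node.

    // Sketch: per-thread buffers kept across MR iterations (assumed example).
    #include <vector>
    #include <cstddef>

    struct ThreadBuffers {
        std::vector<char> partition;   // intermediate kv-data of this thread

        // reuse: clear() keeps the capacity (and the mapped, node-local pages)
        void reset() { partition.clear(); }

        // prealloc: reserve the expected amount once, before the first iteration
        void preallocate(std::size_t bytes) { partition.reserve(bytes); }
    };

    void iterate(ThreadBuffers &buf /*, input of this MR step ... */) {
        buf.reset();                   // do NOT free: pages stay on the local node
        // ... map / shuffle / reduce fill buf.partition ...
    }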
Evaluation
[Plot: relative speedup (0-8) vs. number of threads (7, 15, 31, 61, 127) for glibc, reuse, tbbmalloc, tbb_pool, prealloc]
MR-Search with various allocators. Speedup is relative to glibc.
Significant speedup if more than one blade is used.
7 / 11
Evaluation
[Plot: sent data (GByte, left axis) and time (s, right axis) per allocator: glibc, reuse, tbbmalloc, tbb_pool, prealloc]
NUMA traffic and runtime with various allocators (127 threads).
Traffic on NUMAlink traced with Performance Co-Pilot.
TBB does not prevent remote memory.
8 / 11
Evaluation
[Plot: speedup vs. number of cores (up to 500) for perfect speedup, tbbmalloc, prealloc (MR only), reuse, prealloc, glibc]
Scalability with various allocators.
9 / 11
Evaluation
[Plot: speedup vs. number of cores (up to 500) for perfect speedup, MPI on a cluster, MPI on the UV, and OpenMP on the UV with prealloc (MR only)]
Comparing scalability: OpenMP vs. explicit message passing.
10 / 11
Summary
It is not easy to write scalable code for large SMPs:
- large variability of memory access costs on large SMPs
- allocators for SMPs help to increase scalability
- but they do not prevent remote memory
- the programmer needs to keep track of memory location (if possible)
11 / 11