Oversubscription on Multicore Processors
Costin Iancu, Steven Hofmeyr, Filip Blagojević, Yili Zheng
Lawrence Berkeley National Laboratory
Parallel & Distributed Processing (IPDPS), 2010
1 / 11
Motivation
- Increasingly parallel and asymmetric hardware (architecture + performance)
- Existing runtimes in competitive environments
- Partitioning vs. sharing on real hardware
2 / 11
Oversubscription
Pros:
+ Compensate for data and control dependencies
+ Decrease resource contention
+ Improve CPU utilization
Cons:
− Overhead for migration, context switching, and lost hardware state (negligible)
− Slower synchronization due to increased contention
3 / 11
Setup
- MPI (MPICH2), UPC, OpenMP
- Synchronization: poll + yield (sketch below)
- Linux 2.6.27, 2.6.28, 2.6.30
- Intel compiler with -O3
- NPB without load imbalances (separate paper)

Processor                      Clock    Cores       L1 data/instr  L2 cache      L3 cache     Memory/core    NUMA
Tigerton (Intel Xeon E7310)    1.6 GHz  16 (4x4)    32K/32K        4M / 2 cores  none         2 GB           no
Barcelona (AMD Opteron 8350)   2 GHz    16 (4x4)    64K/64K        512K / core   2M / socket  4 GB           socket
Nehalem (Intel Xeon E5530)     2.4 GHz  16 (2x4x2)  32K/32K        256K / core   8M / socket  1.5 GB / core  socket
4 / 11
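The poll + yield synchronization noted above is what makes oversubscription viable: a waiting thread spins briefly and then gives its core to another runnable thread instead of burning its whole timeslice. A minimal sketch of such a wait loop in C11, assuming a shared atomic completion flag and Linux sched_yield(); the actual UPC, MPI, and OpenMP runtime primitives are more elaborate and are not shown in the slides.

```c
/* Minimal poll + yield wait (a sketch, not the runtimes' implementation):
 * spin on a completion flag for a bounded number of iterations, then call
 * sched_yield() so other oversubscribed threads on the same core can run. */
#include <sched.h>
#include <stdatomic.h>

#define POLL_ITERS 1000   /* hypothetical polling budget before yielding */

static void wait_for_flag(atomic_int *flag)
{
    for (;;) {
        for (int i = 0; i < POLL_ITERS; i++)
            if (atomic_load_explicit(flag, memory_order_acquire))
                return;
        sched_yield();    /* hand the core over instead of busy-waiting */
    }
}
```

Pure busy-waiting would instead keep every waiter on-CPU, which is the "slower synchronization due to increased contention" drawback listed on slide 3.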
Benchmark Characteristics
[Figure: Barrier performance on AMD Barcelona. Barrier time (microseconds) for UPC, OpenMP, and MPI at 1, 2, and 4 threads per core, for thread counts of 1 to 16.]
[Figure: UPC NPB 2.4 barrier statistics, 16 threads. Inter-barrier time (ms) for cg, ep, ft, is, mg, sp, and bt at classes A, B, and C.]
5 / 11
UPC — UMA vs. NUMA
[Figure: UPC on Tigerton. Performance relative to 1/core for CFS, POSIX yield, and pinned runs at 2, 4, and 8 threads per core (ep, ft, is, sp, mg, cg; classes A, B, C).]
- sched_yield: default vs. POSIX
- Pinning affects variance (120% vs. 10%) and memory affinity (pinning sketch below)
[Figure: UPC on Barcelona. Performance relative to 1/core for the same configurations and benchmarks.]
- Small overall effect (±2% avg)
- EP: computationally intensive
- FT, IS: improvement up to 46%
- SP, MG: problem size ↔ granularity
- CG: degradation up to 44%
6 / 11
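The PIN configuration in these plots binds threads to cores rather than leaving placement to CFS. A minimal sketch of one way to do this on Linux, assuming glibc's pthread_setaffinity_np and round-robin placement of the oversubscribed threads; the UPC runtime's actual pinning mechanism is not shown in the slides.

```c
/* Sketch: pin each of the oversubscribed threads round-robin to a core,
 * as in the "PIN" runs. Error handling is minimal. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        perror("pthread_setaffinity_np");
}

static void *worker(void *arg)
{
    long id = (long)arg;                          /* thread index */
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);  /* online cores */
    pin_to_core((int)(id % ncores));              /* wraps for 2/core, 4/core, ... */
    /* ... benchmark work ... */
    return NULL;
}
```

The PSX yield runs rely on sched_yield() keeping its POSIX semantics under CFS; on the 2.6.2x kernels that behavior could be toggled (for example via the kernel.sched_compat_yield sysctl), which is presumably why the default and POSIX variants are compared separately.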
Balance
[Figure 5: Balance, UPC on Tigerton. Y-axis: improvement over 1/core; x-axis: 2, 4, and 8 threads per core for ep, ft, is, sp, mg, cg at classes A, B, C.]
Figure 5. Changes in balance on UMA, reported as the ratio between the lowest and highest user time across all cores, compared to the 1/core setting.
7 / 11
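The balance metric in the caption can be written down directly. This is a sketch of the definition (lowest over highest per-core user time), not the authors' measurement code; gathering the per-core user times (for example from /proc/stat) is omitted.

```c
/* Balance as defined in Figure 5: ratio of the lowest to the highest
 * per-core user time. Values near 1.0 mean the cores are evenly loaded. */
#include <stddef.h>

static double balance(const double *user_time, size_t ncores)
{
    double lo = user_time[0], hi = user_time[0];
    for (size_t i = 1; i < ncores; i++) {
        if (user_time[i] < lo) lo = user_time[i];
        if (user_time[i] > hi) hi = user_time[i];
    }
    return hi > 0.0 ? lo / hi : 1.0;
}

/* The figure then reports the change of this ratio relative to the
 * 1/core run for each oversubscription level. */
```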
Cache Miss Rate (LLC / L2)
[Figure 6: Cache miss rate, UPC on Tigerton. Y-axis: improvement over 1/core; x-axis: 2, 4, and 8 threads per core for ep, ft, is, sp, mg, cg at classes A, B, C.]
Figure 6. Changes in the total number of cache misses per 1000 instructions, across all cores, compared to 1/core. The EP miss rate is very low.
8 / 11
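For reference, a sketch of how the Figure 6 metric (cache misses per 1000 instructions) could be collected with the Linux perf_event interface. That interface postdates the 2.6.27-2.6.30 kernels used in the study, so this illustrates the metric rather than the paper's actual tooling; the chosen events and the measured region are assumptions.

```c
/* Count cache misses and retired instructions around a region of interest
 * and report misses per 1000 instructions (the Figure 6 metric). */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int open_counter(unsigned type, unsigned long long config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = type;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    /* pid = 0, cpu = -1: count this process on whatever CPU it runs on */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int miss_fd = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);
    int inst_fd = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
    if (miss_fd < 0 || inst_fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    ioctl(miss_fd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(inst_fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... benchmark region of interest ... */

    long long misses = 0, insts = 0;
    read(miss_fd, &misses, sizeof(misses));
    read(inst_fd, &insts, sizeof(insts));
    printf("misses per 1000 instructions: %.3f\n",
           insts ? 1000.0 * (double)misses / (double)insts : 0.0);
    return 0;
}
```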
MPI and OpenMP
[Figure: MPI on Tigerton. Performance relative to 1/core for CFS, POSIX yield, and pinned runs at 2 and 4 threads per core (ep, ft, is, sp, mg, cg; classes A, B, C).]
- Overall decrease by 10%
- Caused by barrier overhead (cf. modified UPC)
[Figure: OpenMP on Nehalem. Performance relative to 1/core for CFS, POSIX yield, and pinned runs at 2, 4, and 8 threads per core (ep, ft, is, sp, mg, cg; classes A, B, C, S).]
- Slight degradation overall
- Best performance with OMP_STATIC (configuration sketch below)
- KMP_BLOCKTIME = 0: improvement up to 10% for fine-grained benchmarks
- KMP_BLOCKTIME = ∞: best overall performance
9 / 11
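The OpenMP settings on this slide map to standard environment variables plus Intel's KMP_BLOCKTIME. A minimal sketch assuming the Intel OpenMP runtime; the benchmark body is replaced by a toy loop, and the thread count of 32 (2 per core on the 16-core machines) is only illustrative.

```c
/* Example invocations (shown as comments; values are illustrative):
 *   OMP_NUM_THREADS=32 KMP_BLOCKTIME=0        ./bench   # release the core quickly
 *   OMP_NUM_THREADS=32 KMP_BLOCKTIME=infinite ./bench   # keep idle threads spinning
 */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    /* Static scheduling corresponds to the "OMP_STATIC" configuration above. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += (double)i;
    printf("threads=%d sum=%.0f\n", omp_get_max_threads(), sum);
    return 0;
}
```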
Competitive Environments
- Sharing (best effort) vs. partitioning (isolated on sockets; sketch below)
- One thread per core:
  - Overall 33% / 23% improvement with sharing for UPC / OpenMP on Barcelona (CMP), but no difference on Nehalem (SMT)
  - Better for applications with differing behavior
- Oversubscription:
  - Improves the benefits of sharing for CMP
  - Changes the relative order of performance for UPC, MPI, and OpenMP
- Imbalanced sharing is possible
10 / 11
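A minimal sketch of the partitioned configuration: the process (and the threads it later spawns) is confined to one socket with sched_setaffinity, while the shared configuration simply leaves the default affinity mask untouched. The core numbers below assume socket 0 owns cores 0-3, which is machine-specific.

```c
/* Restrict the calling process to a fixed set of cores (one socket).
 * Threads created afterwards inherit this affinity mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int partition_to_cores(const int *cores, int ncores)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < ncores; i++)
        CPU_SET(cores[i], &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = this process */
}

int main(void)
{
    int socket0[] = { 0, 1, 2, 3 };                   /* assumed core layout */
    if (partition_to_cores(socket0, 4) != 0)
        perror("sched_setaffinity");
    /* ... launch the runtime / benchmark from here ... */
    return 0;
}
```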
Conclusion
“Intuitively, oversubscription increases diversity in the system and decreases the potential for resource conflicts.”
“All of our results and analysis indicate that the best predictor of application behavior when oversubscribing is the average inter-barrier interval. Applications with barriers executed every few ms are affected, while coarser grained applications are oblivious or their performance improves.”
“We expect the benefits of oversubscription to be even more pronounced for irregular applications that suffer from load imbalance.”
11 / 11