Design Principles for End-to-End Multicore Schedulers
Paul Barham †, Simon Peter ⋆, Adrian Schüpbach ⋆, Rebecca Isaacs †, Tim Harris †, Andrew Baumann ⋆, Timothy Roscoe ⋆
† Microsoft Research
⋆ Systems Group, ETH Zurich
HotPar’10
© Systems Group | Department of Computer Science | ETH Zürich
Context: Barrelfish
Multikernel operating system
◮ Developed at ETHZ and Microsoft Research
◮ Scalable research OS on heterogeneous multicore hardware
◮ Operating system principles and structure
◮ Programming models and language runtime systems
◮ Other scalable OS approaches are similar
    ◮ Tessellation, Corey, ROS, fos, ...
◮ Ideas in this talk are more widely applicable
Today’s talk topic
OS scheduler architecture for today’s (and tomorrow’s) multicore machines
◮ General-purpose setting:
    ◮ Dynamic workload mix
    ◮ Multiple parallel apps
    ◮ Interactive parallel apps
Why this is a problem
A simple example
◮ Run 2 OpenMP applications concurrently
◮ On 16-core AMD Shanghai system
◮ Intel OpenMP library
◮ Linux OS
Why this is a problem
Example: 2x OpenMP on 16-core Linux
◮ One app is CPU-Bound:
    #pragma omp parallel
    for(;;)
        iterations[omp_get_thread_num()]++;
◮ The other is synchronization-intensive (e.g. BARRIER):
    #pragma omp parallel
    for(;;) {
        #pragma omp barrier
        iterations[omp_get_thread_num()]++;
    }
Why this is a problem
Example: 2x OpenMP on 16-core Linux
◮ Run for x in [2 .. 16]:
    ◮ OMP_NUM_THREADS=x ./BARRIER &
    ◮ OMP_NUM_THREADS=8 ./cpu_bound &
    ◮ sleep 20
    ◮ killall BARRIER cpu_bound
◮ Plot average iterations/thread/s over 20s
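The plotted metric can be sketched in a few lines of C. This is a minimal illustration, not the authors' measurement code: the function names are hypothetical, and the baseline used for normalization is an assumption, since the slides do not say how the "relative rate of progress" is normalized.

```c
#include <assert.h>
#include <stddef.h>

/* Average iterations per thread per second for one run.
 * iterations[i] is the final counter value of thread i (hypothetical input). */
double avg_rate_per_thread(const unsigned long *iterations,
                           size_t nthreads, double seconds)
{
    assert(nthreads > 0 && seconds > 0.0);
    double total = 0.0;
    for (size_t i = 0; i < nthreads; i++)
        total += (double)iterations[i];
    return total / (double)nthreads / seconds;
}

/* Rate of progress relative to a baseline rate.
 * Assumption: the baseline is the rate measured in some reference run. */
double relative_progress(double rate, double baseline_rate)
{
    assert(baseline_rate > 0.0);
    return rate / baseline_rate;
}
```

For each value of x, the per-thread counters of both applications would be fed through `avg_rate_per_thread` with `seconds = 20`, then divided by the chosen baseline.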
Why this is a problem
Example: 2x OpenMP on 16-core Linux
[Figure: relative rate of progress (y-axis, 0 to 1.2) against number of BARRIER threads (x-axis, 2 to 16), with one curve each for CPU-Bound and BARRIER]
◮ Until 8 BARRIER threads: space-partitioning
    ◮ CPU-Bound stays at 1 (same thread allocation)
    ◮ BARRIER degrades (due to increasing cost)
◮ From 9 threads (threads > cores): time-multiplexing
    ◮ CPU-Bound degrades linearly
    ◮ BARRIER drops sharply (only makes progress when all threads run concurrently)
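The two regimes can be illustrated with a deliberately crude model. Everything here is an assumption made for illustration: the scheduler is taken to be perfectly fair, the BARRIER threads are taken to be scheduled independently of each other, and the function names are hypothetical. The model ignores the barrier cost that already degrades BARRIER below 9 threads, and it does not attempt to reproduce the measured curves.

```c
#include <assert.h>

/* Fraction of time each thread runs under an idealized fair scheduler. */
double cpu_share(int cores, int total_threads)
{
    assert(cores > 0 && total_threads > 0);
    if (total_threads <= cores)
        return 1.0;                               /* space-partitioning */
    return (double)cores / (double)total_threads; /* time-multiplexing */
}

/* Probability that all barrier threads are running at the same instant,
 * under the toy assumption that each one runs independently with
 * probability cpu_share(). */
double all_running_probability(int cores, int total_threads,
                               int barrier_threads)
{
    assert(barrier_threads >= 0);
    double share = cpu_share(cores, total_threads);
    double p = 1.0;
    for (int i = 0; i < barrier_threads; i++)
        p *= share;
    return p;
}
```

With 8 CPU-Bound threads and 16 BARRIER threads on 16 cores, each thread's share is 16/24, while the chance of finding all 16 BARRIER threads running together is about 0.0015 in this model. That gap is the intuition behind a gradual decline for one application and a sharp drop for the other.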
Why this is a problem
Example: 2x OpenMP on 16-core Linux
◮ Gang scheduling or smart core allocation would help
◮ Gang scheduling:
    ◮ OS unaware of apps’ requirements
    ◮ The run-time system could have known
    ◮ E.g. via annotations or compiler
◮ Smart core allocation:
    ◮ OS knows general system state
    ◮ Run-time system chooses number of threads
◮ Information and mechanisms in the wrong place
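One way to picture information flowing in both directions is the sketch below. It is hypothetical throughout: neither struct corresponds to an existing OS or OpenMP interface, and the field and function names are invented for illustration. The OS side exposes coarse system state, the runtime uses it to pick a thread count, and the runtime passes a hint back saying whether its threads need to run together.

```c
#include <assert.h>

/* Hypothetical snapshot of system state an OS could expose to a runtime. */
struct sys_state {
    int total_cores;
    int busy_cores;   /* cores already allocated to other applications */
};

/* Hypothetical hint a runtime could pass down to the OS scheduler. */
struct app_hint {
    int threads;      /* number of threads the runtime will create */
    int needs_gang;   /* nonzero: threads synchronize often (e.g. barriers) */
};

/* Smart core allocation: never create more threads than there are free cores. */
int choose_thread_count(const struct sys_state *s, int requested)
{
    assert(s->total_cores >= s->busy_cores && requested > 0);
    int free_cores = s->total_cores - s->busy_cores;
    if (free_cores < 1)
        free_cores = 1;   /* always allow at least one thread */
    return requested < free_cores ? requested : free_cores;
}

/* Combine both directions: read OS state, return a hint for the scheduler. */
struct app_hint make_hint(const struct sys_state *s, int requested,
                          int sync_intensive)
{
    struct app_hint h;
    h.threads = choose_thread_count(s, requested);
    h.needs_gang = sync_intensive != 0;
    return h;
}
```

In the example above, a BARRIER run that asks for 16 threads while the CPU-Bound application occupies 8 of 16 cores would be capped at 8 threads and flagged as a gang-scheduling candidate.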
Why this is a problem
Example: 2x OpenMP on 16-core Linux
[Figure: relative rate of progress against number of BARRIER threads for CPU-Bound and BARRIER, shown with error bars]
◮ Huge error bars (min/max over 20 runs)
◮ Random placement of threads to cores
Why this is a problem
16-core AMD Shanghai system
[Diagram: 16 cores in four groups of four, each group sharing an L3 cache, with the groups connected by HT links]
◮ Same-die L3 access twice as fast as cross-die
◮ OpenMP run-time does not know about this machine
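A runtime that did know the machine could place communicating threads on the same die. The sketch below assumes a layout of four dies with four cores each and assumes that core IDs are numbered contiguously per die; real systems often number cores differently, so a real implementation would query the topology instead of hard-coding it. The function names are hypothetical.

```c
#include <assert.h>

/* Assumed layout: 4 dies x 4 cores, core IDs contiguous within a die. */
enum { CORES_PER_DIE = 4, NUM_DIES = 4 };

/* Die that a given core ID belongs to under the assumed numbering. */
int die_of(int core)
{
    return core / CORES_PER_DIE;
}

/* Compact placement: fill one die completely before using the next.
 * Writes the chosen core ID for each thread into cores[0..nthreads-1]
 * and returns the number of dies the placement spans. */
int place_compact(int nthreads, int *cores)
{
    assert(nthreads > 0 && nthreads <= CORES_PER_DIE * NUM_DIES);
    for (int i = 0; i < nthreads; i++)
        cores[i] = i;
    return die_of(nthreads - 1) + 1;
}
```

With this policy, two threads always share a die and therefore an L3 cache, whereas a random placement puts them on different dies in many runs.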
Why this is a problem
Example: 2x OpenMP on 16-core Linux
[Figure: relative rate of progress against number of BARRIER threads for CPU-Bound and BARRIER, with the 2-thread case highlighted]
◮ 2 threads case: performance difference of 0.4
Why this is a problem
System diversity
[Diagrams: block diagrams of the three processors listed below]
AMD Opteron (Magny-Cours)
◮ On-chip interconnect
Sun Niagara T2
◮ Flat, fast cache hierarchy
Intel Nehalem (Beckton)
◮ On-die ring network
Manual tuning increasingly difficult
◮ Architectures change too quickly
◮ Offline auto-tuning (e.g. ATLAS) limited
Online adaptation
◮ Online adaptation remains viable
◮ Easier with contemporary runtime systems
    ◮ OpenMP, Grand Central Dispatch, ConcRT, MPI, ...
    ◮ Synchronization patterns are more explicit
◮ But needs information at the right places
The end-to-end approach
◮ The system stack:

    Component               Related work
    Hardware                Heterogeneous, ...
    OS scheduler            CAMP, HASS, ...
    Runtime systems         OpenMP, MPI, ConcRT, McRT, ...
    Compilers               Auto-parallel., ...
    Programming paradigms   MapReduce, ICC, ...
    Applications            annotations, ...

◮ Involve all components, top to bottom
◮ Need to cut through classical OS abstractions
◮ Here we focus on OS / runtime system integration
Design Principles