The Barrelfish operating system for CMPs: research issues Tim Harris Based on slides by Andrew Baumann and Rebecca Isaacs. Joint work with colleagues at MSR Cambridge and ETH Zurich.
The Barrelfish project • Collaboration between ETH Zurich and MSRC: Andrew Baumann, Paul Barham, Richard Black, Tim Harris, Orion Hodson, Rebecca Isaacs, Simon Peter, Jan Rellermeyer, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania, Pierre-Evariste Dagand, Ankush Gupta, Raffaele Sandrini, Dario Simone, Animesh Trivedi
Introduction Hardware and workloads Multikernel design principles Communication costs Starting a domain
Do we need a new OS? Sun SPARC Enterprise M9000 server: M9000-64, up to 64 CPUs (256 cores), 180cm x 167.4cm x 126cm, 1880kg
Do we need a new OS? SGI Origin 3000 Up to 512 processors Up to 1TB memory
Do we need a new OS? • How might the design of a CMP differ from these existing systems? • How might the workloads for a CMP differ from those of existing multi-processor machines?
The clichéd single-threaded performance graph: historical single-thread performance (log of sequential perf vs. year). Past gains came from improved clock rates and transistors spent extracting ILP; transistor counts are still growing, but are now delivered as additional cores and accelerators. The work that would have used this "lost" sequential performance must now be written to use those cores and accelerators.
Interactive performance: timelines showing user input events and the corresponding output over time.
CC-NUMA architecture: nodes of CPUs + RAM (with a directory) attached to an interconnect. Adding more CPUs brings more of most other things. Locality property: a node only goes to the interconnect for real I/O or sharing.
Machine architecture (CMP): cores share on-chip caches (e.g. pairs of cores per L2). More cores brings more cycles, but not necessarily proportionately more cache, nor more off-chip bandwidth to RAM or total RAM capacity.
Machine diversity: AMD 4-core
...Sun Niagara-2
...Sun Rock
IEEE International Solid-State Circuits Conference 2010. J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, T. Mattson (Intel: Hillsboro OR, Bangalore, Braunschweig, Santa Clara CA, DuPont WA). "A 567mm² processor on 45nm CMOS integrates 48 IA-32 cores and 4 DDR3 channels in a 6x4 2D-mesh network. Cores communicate through message passing using 384KB of on-die shared memory. Fine-grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. As performance scales, the processor dissipates between 25W and 125W."
Introduction Hardware and workloads Multikernel design principles Communication costs Starting a domain
The multikernel model: one OS node per core (x86, x64, ARM, GPU) above the hardware interconnect; each OS node holds its own state replica; nodes communicate by async messages; applications run across the OS nodes.
Barrelfish: a multikernel OS • A new OS architecture for scalable multicore systems • Approach: structure the OS as a distributed system • Design principles: – Make inter-core communication explicit – Make OS structure hardware-neutral – View state as replicated
#1 Explicit inter-core communication • All communication is via messages • Decouples system structure from the inter-core communication mechanism • Communication patterns are explicitly expressed • Better match for future hardware – naturally supports heterogeneous cores and non-coherent interconnects (e.g. PCIe), hardware with cheap explicit message passing, and hardware without cache coherence (e.g. Intel 80-core) • Allows split-phase operations
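To make "split-phase" concrete, here is a minimal sketch in plain C11 (pthreads and a one-slot mailbox stand in for two cores and a channel; the names and types are illustrative, not Barrelfish's API): the sender issues a request, is free to do other work, and runs a continuation when the reply is polled in.

  /* Split-phase operation: a minimal sketch, not the Barrelfish API.
     A one-slot mailbox per direction stands in for an inter-core channel;
     the sender posts a request, keeps working, and a continuation runs
     when the reply is polled in. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  struct mailbox {
      atomic_int full;      /* 1 = a message is waiting in the slot */
      int payload;
  };

  static struct mailbox request, reply;

  static void mb_send(struct mailbox *m, int v) {
      while (atomic_load(&m->full)) ;        /* wait for the slot to drain */
      m->payload = v;
      atomic_store(&m->full, 1);
  }

  static int mb_try_recv(struct mailbox *m, int *v) {
      if (!atomic_load(&m->full)) return 0;
      *v = m->payload;
      atomic_store(&m->full, 0);
      return 1;
  }

  static void *remote_core(void *arg) {      /* services the request, then replies */
      int v;
      while (!mb_try_recv(&request, &v)) ;
      mb_send(&reply, v * 2);
      return NULL;
  }

  static void on_reply(int result) {         /* continuation for the second phase */
      printf("reply arrived: %d\n", result);
  }

  int main(void) {
      pthread_t t;
      pthread_create(&t, NULL, remote_core, NULL);

      mb_send(&request, 21);                 /* phase 1: issue the request */
      /* ...the sender can do unrelated work here instead of blocking... */

      int r;
      while (!mb_try_recv(&reply, &r)) ;     /* phase 2: poll for the reply */
      on_reply(r);

      pthread_join(t, NULL);
      return 0;
  }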
Communication latency
Message passing vs shared memory • Shared memory (move the data to the operation): – Each core updates the same memory locations – Cache-coherence migrates modified cache lines
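A sketch of this shared-memory pattern with plain C11 atomics (nothing Barrelfish-specific): every thread applies its update directly to the same shared counter, so the cache line holding it migrates between cores on every operation.

  /* Shared-memory pattern: every core updates the same location directly.
     The atomic RMW pulls the counter's cache line to the updating core
     in exclusive state, so the line bounces between cores. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  #define NCORES 4
  #define NOPS   100000

  static atomic_long counter;                   /* the shared, contended state */

  static void *worker(void *arg) {
      for (int i = 0; i < NOPS; i++)
          atomic_fetch_add(&counter, 1);
      return NULL;
  }

  int main(void) {
      pthread_t t[NCORES];
      for (int i = 0; i < NCORES; i++) pthread_create(&t[i], NULL, worker, NULL);
      for (int i = 0; i < NCORES; i++) pthread_join(t[i], NULL);
      printf("counter = %ld\n", atomic_load(&counter));
      return 0;
  }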
Shared memory scaling & latency
Message passing • Message passing (move operation to the data): – A single server core updates the memory locations – Each client core sends RPCs to the server
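The same update restructured as message passing (again a sketch with illustrative one-slot channels, not Barrelfish's transport): only the server core touches the state; clients post operations to it and the server applies them locally.

  /* Message-passing pattern: one server core owns the state; clients send it
     operations over per-client one-slot channels. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  #define NCLIENTS 3
  #define NOPS     10000

  struct channel { atomic_int full; long arg; };

  static struct channel chan[NCLIENTS];
  static long state;                            /* only the server core touches this */
  static atomic_int clients_done;

  static void *client(void *arg) {
      struct channel *c = arg;
      for (int i = 0; i < NOPS; i++) {
          while (atomic_load(&c->full)) ;       /* wait for the server to drain the slot */
          c->arg = 1;                           /* the operation: "add 1" */
          atomic_store(&c->full, 1);
      }
      atomic_fetch_add(&clients_done, 1);
      return NULL;
  }

  static void *server(void *arg) {
      for (;;) {
          int finished = (atomic_load(&clients_done) == NCLIENTS);
          int handled = 0;
          for (int i = 0; i < NCLIENTS; i++) {
              if (atomic_load(&chan[i].full)) {
                  state += chan[i].arg;         /* apply the operation locally */
                  atomic_store(&chan[i].full, 0);
                  handled = 1;
              }
          }
          if (finished && !handled) break;      /* clients finished and nothing left to drain */
      }
      return NULL;
  }

  int main(void) {
      pthread_t s, c[NCLIENTS];
      pthread_create(&s, NULL, server, NULL);
      for (int i = 0; i < NCLIENTS; i++) pthread_create(&c[i], NULL, client, &chan[i]);
      for (int i = 0; i < NCLIENTS; i++) pthread_join(c[i], NULL);
      pthread_join(s, NULL);
      printf("state = %ld (expected %d)\n", state, NCLIENTS * NOPS);
      return 0;
  }

Compared with the shared-memory version, the state's cache line stays resident at the server; the clients pay only for the messaging, which is the trade-off the following graphs illustrate.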
Message passing
#2 Hardware-neutral structure • Separate OS structure from hardware • Only hardware-specific parts: – Message transports (highly optimised / specialised) – CPU / device drivers • Adaptability to changing performance characteristics – Late-bind protocol and message transport implementations
#3 Replicate common state • Potentially-shared state accessed as if it were a local replica – Scheduler queues, process control blocks, etc. – Required by message-passing model • Naturally supports domains that do not share memory • Naturally supports changes to the set of running cores – Hotplug, power management
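As a sketch of what "accessed as if it were a local replica" can look like (types and names are illustrative, not Barrelfish's data structures): reads are plain local accesses to the core's own copy, and changes arrive as update messages that the local OS node applies itself.

  /* State as replicas: each OS node keeps its own copy of a conceptually
     shared table. Reads are local; updates arrive as messages and are
     applied by the owning node. Message delivery itself is not shown. */
  #include <stdbool.h>
  #include <stdint.h>

  #define MAX_CORES 64

  struct core_info { bool online; uint32_t load; };

  struct replica {                         /* one per OS node; only its core touches it */
      struct core_info cores[MAX_CORES];
      uint64_t version;                    /* last update applied */
  };

  struct update_msg { uint32_t core_id; struct core_info info; uint64_t version; };

  /* Local read path: no locks, no cache-line traffic with other cores. */
  static bool core_is_online(const struct replica *r, uint32_t id) {
      return r->cores[id].online;
  }

  /* Run by the local OS node when an update message is delivered to it. */
  static void apply_update(struct replica *r, const struct update_msg *m) {
      r->cores[m->core_id] = m->info;
      if (m->version > r->version)
          r->version = m->version;
  }

  int main(void) {
      struct replica local = { .version = 0 };
      struct update_msg m = { .core_id = 2, .info = { .online = true, .load = 7 }, .version = 1 };
      apply_update(&local, &m);
      return core_is_online(&local, 2) ? 0 : 1;
  }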
Replication vs sharing as the default • Replicas used as an optimisation in other systems • In a multikernel, sharing is a local optimisation – Shared (locked) replica on closely-coupled cores – Only when faster, as decided at runtime • Basic model remains split-phase
Introduction Hardware and workloads Multikernel design principles Communication costs Starting a domain
Applications running on Barrelfish • Slide viewer (but not today...) • Webserver (www.barrelfish.org) • Virtual machine monitor (runs unmodified Linux) • Parallel benchmarks: – SPLASH-2 – OpenMP • SQLite • ECLiPSe (constraint engine) • more...
1-way URPC message costs

  System           Case         Cycles   Msgs / kcycle
  2*4-core Intel   Shared       180      11.97
                   Non-shared   570      3.78
  2*2-core AMD     Same die     450      3.42
                   1 hop        532      3.19
  4*4-core AMD     Shared       448      3.57
                   1 hop        545      3.53
                   2 hop        659      3.19
  8*4-core AMD     Shared       538      2.77
                   1 hop        613      2.79
                   2 hop        682      2.71

• Two hyper-transport requests on AMD
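For intuition about these numbers: a URPC-style channel can be built from a cache-line-sized message slot that the receiver polls, so delivering a message costs roughly the coherence traffic needed to move one line (plus an ack line for flow control). The sketch below is illustrative only, not Barrelfish's actual transport.

  /* A cache-line-granularity, polled, point-to-point channel in the spirit
     of URPC. Payload and sequence number share one 64-byte line, so a
     message moves one line from sender to receiver; a separate ack line
     provides flow control. */
  #include <stdatomic.h>
  #include <stdint.h>
  #include <string.h>

  struct urpc_chan {
      _Alignas(64) struct {
          uint64_t payload[7];                  /* 56 bytes of payload */
          atomic_uint_fast64_t seq;             /* written last: publishes the line */
      } msg;
      _Alignas(64) atomic_uint_fast64_t ack;    /* receiver -> sender, separate line */
  };

  static void urpc_send(struct urpc_chan *c, const uint64_t payload[7]) {
      uint64_t next = atomic_load_explicit(&c->msg.seq, memory_order_relaxed) + 1;
      while (atomic_load_explicit(&c->ack, memory_order_acquire) != next - 1)
          ;                                     /* wait until the previous message is consumed */
      memcpy(c->msg.payload, payload, sizeof c->msg.payload);
      atomic_store_explicit(&c->msg.seq, next, memory_order_release);
  }

  static void urpc_recv(struct urpc_chan *c, uint64_t payload[7]) {
      uint64_t want = atomic_load_explicit(&c->ack, memory_order_relaxed) + 1;
      while (atomic_load_explicit(&c->msg.seq, memory_order_acquire) != want)
          ;                                     /* poll: one coherence miss when the line arrives */
      memcpy(payload, c->msg.payload, sizeof c->msg.payload);
      atomic_store_explicit(&c->ack, want, memory_order_release);
  }

  int main(void) {
      static struct urpc_chan chan;             /* normally mapped by both cores */
      uint64_t out[7] = {1, 2, 3, 4, 5, 6, 7}, in[7];
      urpc_send(&chan, out);                    /* on the sending core */
      urpc_recv(&chan, in);                     /* on the receiving core */
      return in[0] == 1 ? 0 : 1;
  }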
Local vs remote messaging

           Cycles   Msgs / kcycle   I-cache lines used   D-cache lines used
  URPC     450      3.42            9                    8
  L4 IPC   424      2.36            25                   13

• URPC to a remote core compares favourably with IPC • No context switch: TLB unaffected • Lower cache impact • Higher throughput for pipelined messages
Communication perf: IP loopback

                                       Barrelfish   Linux
  Throughput (Mbit/s)                  2154         1823
  D-cache misses per packet            21           77
  Source->Sink HT bytes per packet     1868         2628
  Sink->Source HT bytes per packet     752          2200
  Source->Sink HT link utilization     8%           11%
  Sink->Source HT link utilization     3%           9%

• 2*2-core AMD system, 1000-byte packets – Linux: copy in / out of shared kernel buffers – Barrelfish: point-to-point URPC channel
Case study: TLB shoot-down • Send a message to every core with a mapping • Wait for acks • Linux/Windows: – Send IPI – Spin on shared ack count • Barrelfish: – Request to local monitor domain – 1-phase commit to remote cores – Plug in different communication mechanisms
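A hedged sketch of the message-based shoot-down (pthreads and per-core request/ack words stand in for cores, channels and the monitor; the names and the flush stub are illustrative, not Barrelfish's monitor protocol): the initiator unicasts an invalidate request to every core, each core flushes its own TLB and acknowledges, and the initiator waits for all acks.

  /* TLB shoot-down as a one-phase commit over messages: the initiating core
     sends an invalidate request to every other core and waits for one ack
     per core. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  #define NCORES 4

  struct core_chan {
      atomic_int req;               /* initiator -> core: shoot-down generation */
      atomic_int ack;               /* core -> initiator: last generation flushed */
  };

  static struct core_chan chan[NCORES];
  static atomic_int stop;

  static void flush_local_tlb(void) { /* e.g. invlpg / CR3 reload on x86 */ }

  static void *core_loop(void *arg) {           /* per-core handler */
      struct core_chan *c = arg;
      while (!atomic_load(&stop)) {
          int gen = atomic_load(&c->req);
          if (gen != atomic_load(&c->ack)) {
              flush_local_tlb();                /* invalidate the stale mapping */
              atomic_store(&c->ack, gen);       /* acknowledge completion */
          }
      }
      return NULL;
  }

  static void tlb_shootdown(int gen) {
      for (int i = 0; i < NCORES; i++)          /* n*unicast variant: one request per core */
          atomic_store(&chan[i].req, gen);
      for (int i = 0; i < NCORES; i++)          /* collect all acks before returning */
          while (atomic_load(&chan[i].ack) != gen)
              ;
  }

  int main(void) {
      pthread_t t[NCORES];
      for (int i = 0; i < NCORES; i++) pthread_create(&t[i], NULL, core_loop, &chan[i]);
      tlb_shootdown(1);                         /* after unmapping a page */
      printf("all cores acknowledged\n");
      atomic_store(&stop, 1);
      for (int i = 0; i < NCORES; i++) pthread_join(t[i], NULL);
      return 0;
  }

The broadcast, multicast and NUMA-aware variants on the following slides change only how this request fan-out is organised.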
TLB shoot-down: n*unicast — one point-to-point cache-line channel per remote core (the initiator writes each channel; each remote core reads its own).
TLB shoot-down: 1*broadcast — all remote cores poll a single shared channel written once by the initiator.
Messaging costs
TLB shoot-down: multicast — one message per package; cores within the same package share delivery via the shared L3.
TLB shoot-down: NUMA-aware multicast — as multicast, but packages reached over more hyper-transport hops are messaged first; cores within the same package again share delivery via the shared L3.
Messaging costs
End-to-end comparative latency
2-PC pipelining
Introduction Hardware and workloads Multikernel design principles Communication costs Starting a domain
Terminology • Domain – protection domain / address space ("process") • Dispatcher – one per domain per core – scheduled by the local CPU driver via an upcall, which then typically runs a core-local user-level thread scheduler • Domain spanning – start instances of a domain on multiple cores (cf. starting affinitized threads)
Programming example: domain spanning

  for i = 1 .. num_cores-1:
      create a new dispatcher on core i
  while num_dispatchers < num_cores-1:
      wait for the next message and handle it

  dispatcher_create_callback:
      num_dispatchers++
Domain spanning: baseline (timeline of monitor, bzero, spantest.exe, name service and memory server activity; legend: monitor working / blocked / polling) • Centralized: poor scalability, but correct • 1021 messages, 487 allocation RPCs • 50 million cycles (40ms)
Domain spanning: v2 (timeline as before) • Per-core memory servers • Better memset(!) • Was 50M cycles, now 9M
Domain spanning: v3 (timeline as before) • Monitors use the per-core memory server • Move zeroing off the critical path • Was 9M cycles, now 4M
Domain spanning: v4 (timeline as before) • Change the API: create domains on all cores at once • 76 messages • Was 4M cycles, now 2.5M
Introduction Hardware and workloads Multikernel design principles Communication costs Starting a domain