The Hebrew University of Jerusalem
A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING
Amnon Barak, Hebrew University Jerusalem (HUJI)
Hermann Härtig, TU Dresden, Operating Systems Group (TUDOS)
Wolfgang Nagel, TU Dresden, Center for Information Services and HPC (ZIH)
Alexander Reinefeld, Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)
CARSTEN WEINHOLD, TU DRESDEN
TRADITIONAL HPC
The ideal world assumption:
■ Identical, predictable, and reliable nodes
■ Fast and reliable interconnect
■ Balanced applications
■ Isolated partitions of fixed size
A Microkernel-Based OS for Exascale Computing
TRADITIONAL HPC
[Figure: execution timeline; axis: time]
■ Fixed-size chunks of work
■ One thread per core
TRADITIONAL HPC
Systems software: optimize communication latency
■ Message passing uses polling
■ Batch scheduler for start / stop
■ Separate servers for I/O
■ Small OS on each node
■ No OS on critical path
LOAD
Application: CP2K
[Figure: heat map of computation time (fraction, 0.1 to 1) by process ID (1 to 512) over timesteps]
Computation/communication ratio of CP2K on 512 cores
LOAD
Application: COSMO-SPECS+FD4
[Figure: heat map of computation time (fraction, 0.1 to 1) by process ID (1 to 128) over timesteps (up to 180)]
Computation/communication ratio of COSMO-SPECS+FD4 on 128 cores
REALITY CHECK
Application: COSMO-SPECS+FD4
[Figure: unbalanced compute times of ranks per time step]
[Figure: hand-balanced compute times of ranks per time step]
Computation/communication ratio of COSMO-SPECS+FD4 on 128 cores
REALITY CHECK
Application: PRIME
[Figure: unbalanced compute times of ranks per time step]
Now think of:
• Composite applications
• In-situ visualization, etc.
FFMK
FFMK: A Fast and Fault-Tolerant Microkernel-Based Operating System for Exascale Computing
German Priority Programme 1648 “Software for Exascale Computing”
CHALLENGES
■ Dynamic applications & platforms
■ Increased fault rates
■ Power / Dark silicon
■ Heterogeneity (cores, memory, …)
[Figure: grid of nodes, each labeled "FFMK-OS"]
NODE ARCHITECTURE
[Diagram labels: Application (×3), Monitoring, Platform Management, Service OS, Runtime Support, Runtime, Linux (drivers, etc.), Light-weight Kernel (L4), Compute cores, Service cores]
3 ABSTRACTIONS
[Diagram labels: Threads (×2), Address space (×2), Light-weight Kernel (L4)]
MESSAGE PASSING
[Diagram labels: App (×2), Device Interrupt, File System, Network, I/O, Light-weight Kernel (L4)]
BLOCK, POLL, IRET
[Figure: bar chart, axis 0 µs to 6 µs]
Block (Linux): 5.3 µs
Block (L4): 3.2 µs
Polling: 0.1 µs
• Intel Core i7 3770S @ 3100 MHz
• No Hyperthreading, no Turboboost
• No dynamic power management
• Same socket
• 64-bit Linux 3.11.6 (cpuidle.off=0, intel_idle.max_cstate=0)
• 64-bit L4/Fiasco.OC
Wake from interrupt on L4: 900 cycles, 0.3 µs (best case, on Intel Core i7-4770 CPU @ 3.40GHz)
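The wake-from-interrupt figure can be cross-checked by converting cycles to time at the stated clock rate. This is only a sketch of the arithmetic; the helper name `cycles_to_microseconds` is hypothetical and not part of any FFMK or L4 tooling.

```python
def cycles_to_microseconds(cycles, clock_hz):
    """Convert a cycle count to microseconds at a fixed clock frequency."""
    return cycles / clock_hz * 1e6

# 900 cycles on the 3.40 GHz Core i7-4770 quoted on the slide.
wake_us = cycles_to_microseconds(900, 3.40e9)
print(f"Wake from interrupt: {wake_us:.2f} µs")
```

The result rounds to the 0.3 µs quoted on the slide, assuming the core runs at its nominal frequency.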
LINUX ON L4
[Diagram labels: Linux App, Linux OS, File, Network, I/O, Light-weight Kernel (L4)]
HYBRID SYSTEM
[Diagram axes: critical / uncritical, simple / complex]
■ Real-time
■ Security: small Trusted Computing Base
■ Resilience: small Reliable Computing Base
HYBRID SYSTEM
[Diagram axes: critical / uncritical, simple / complex]
[Diagram labels: Service OS, File, Network, I/O, Light-weight Kernel (L4)]
HYBRID SYSTEM
[Diagram labels: MPI App, Service Proxy, MPI Lib, Infiniband, Service OS, File, Network, I/O, Light-weight Kernel (L4)]
DRIVER REUSE
[Diagram labels: Msg Buffer 1 (×2), Msg Buffer 2 (×2), L4 App, Proxy App, libibverbs, User-space Driver, Linux Kernel, /dev/ib0, I/O (×2), IB Core, Kernel Driver, Light-weight Kernel (L4)]
NODE ARCHITECTURE
[Diagram labels: Application (×5), Service OS, Checkpoint, MPI Proxies, MPI Library, Runtime (×2), Infiniband (×2), Chkpt., Linux Kernel, Light-weight Kernel (L4), Compute cores, Service cores]
FAILURE RATES
MTBF for Component Failures in an HPC System

Failed Component    # of Nodes Affected    MTBF
PFS, core switch    1408                   65.10 days
Rack                32                     86.90 days
Edge switch         16                     17.37 days
PSU                 4                      28.94 days
Compute nodes       1                      15.8 hours

Kento Sato et al., “Design and Modeling of a Non-blocking Checkpointing System”, SC’12
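To get a feel for how often any failure hits the system, the per-class MTBF values can be combined. The sketch below assumes that each value is a system-wide MTBF for its failure class and that the classes fail independently at constant rates, so their rates simply add; neither assumption is stated on the slide.

```python
# MTBF per failure class from the slide, converted to hours.
mtbf_hours = {
    "PFS, core switch": 65.10 * 24,
    "Rack": 86.90 * 24,
    "Edge switch": 17.37 * 24,
    "PSU": 28.94 * 24,
    "Compute nodes": 15.8,
}

# Assumption: independent failure classes with constant rates, so rates add.
total_rate = sum(1.0 / hours for hours in mtbf_hours.values())
combined_mtbf_hours = 1.0 / total_rate
print(f"Combined MTBF: {combined_mtbf_hours:.1f} hours")
```

Under these assumptions the combined MTBF lands slightly below the 15.8-hour compute-node figure, because single-node failures dominate the total rate.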
XTREEMFS
■ In-memory XtreemFS volume for application-level checkpointing
■ RAID-5 erasure coding: recovery with 1 failed OSD
■ Demonstrator running BQCD code on a Cray XC30
[Diagram labels: OSD (×4), …, DIR, MRC, Application (BQCD)]
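Why one failed OSD is recoverable can be illustrated with plain XOR parity, the principle behind RAID-5. This is a toy sketch, not XtreemFS code: the block contents and the `xor_blocks` helper are made up for illustration, and real RAID-5 also rotates the parity block across devices.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Three hypothetical checkpoint blocks, one per data OSD.
data = [b"chkpt-A1", b"chkpt-B2", b"chkpt-C3"]
parity = xor_blocks(data)  # stored on a fourth OSD

# The OSD holding data[1] fails: XOR the survivors with the parity block.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered == data[1])
```

Losing two blocks at once cannot be repaired this way, which matches the slide's "recovery with 1 failed OSD".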
BANDWIDTH
[Figure: bandwidth in TB/s (log scale, 0.1 to 10000) over number of nodes (1K to 96K); series: memcpy 64ppn, CRUISE 64ppn, CRUISE 32ppn, CRUISE 16ppn, ramdisk 16ppn]
(b) Sequoia Cluster (IBM Blue Gene/Q)
Raghunath Rajachandrasekar et al., “A 1 PB/s File System to Checkpoint Three Million MPI Tasks”, HPDC’13
NODE ARCHITECTURE
[Diagram labels: Application (×5), Platform Management, Service OS, Checkpoint, MPI Proxies, MPI Library, Runtime (×2), Infiniband (×2), Chkpt., Linux Kernel, Light-weight Kernel (L4), Compute cores, Service cores]
OVERDECOMPOSITION
[Figure: work per processing unit over time, ending at a barrier]
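The effect of overdecomposition on the time to reach the barrier can be sketched with a small scheduling model. Everything here is an assumption for illustration: the per-rank compute times are invented, each rank's work is assumed to split evenly into `factor` chunks, and chunks are assumed to be freely placeable on any core using greedy longest-first scheduling.

```python
import heapq

def makespan(chunks, cores):
    """Time until all cores reach the barrier, placing longest chunks first."""
    loads = [0.0] * cores
    heapq.heapify(loads)
    for cost in sorted(chunks, reverse=True):
        lightest = heapq.heappop(loads)  # core with the least work so far
        heapq.heappush(loads, lightest + cost)
    return max(loads)

def split(work, factor):
    """Overdecompose: cut each rank's work into `factor` equal chunks."""
    return [w / factor for w in work for _ in range(factor)]

work = [8.0, 2.0, 2.0, 4.0]  # hypothetical unbalanced compute times per rank
cores = 4

baseline = makespan(work, cores)                 # one chunk per core
overdecomposed = makespan(split(work, 4), cores)  # 4x more chunks than cores
print(baseline, overdecomposed)
```

With one chunk per core the slowest rank alone determines the barrier time; with more, smaller chunks the scheduler can fill the idle cores. The model ignores communication and context-switch costs, which the measured runs on later slides include.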
OVERDECOMPOSITION
COSMO-SPECS+FD4 (unbalanced, HT)
[Figure: bar chart of run time (0 s to 1000 s) for Baseline, 2x, 4x, 8x, 16x]
Oversubscribed runs: 32 to 256 MPI ranks on 4 quad-core nodes (w/ polling)
BUSY WAITING
Busy waiting = Computation
OVERDECOMPOSITION
COSMO-SPECS+FD4 (unbalanced, HT)
[Figure: bar chart of run time (0 s to 1000 s) for Baseline, 2x, 4x, 8x, 16x; annotation: Polling (busy waiting)]
Oversubscribed runs: 32 to 256 MPI ranks on 4 quad-core nodes (w/ polling)
OVERDECOMPOSITION
[Figure: bar chart of run time (0 s to 300 s) for Baseline, 2x, 4x, 8x, 16x; series: COSMO-SPECS+FD4 (balanced, no HT), COSMO-SPECS+FD4 (unbalanced, no HT)]
Oversubscribed runs: 16 to 256 MPI ranks on 4 quad-core nodes (w/o polling)
OVERDECOMPOSITION
CP2K (unbalanced, no HT)
[Figure: bar chart of run time (0 s to 350 s) for Baseline, 2x, 4x, 8x, 16x]
Oversubscribed runs: 16 to 256 MPI ranks on 4 quad-core nodes (w/o polling)
OVERDECOMPOSITION
[Figure: work per processing unit over time, ending at a barrier]
With MPI:
• Do not: busy wait (except very shortly)
• Do: Block in kernel
• Needs: fast unblocking of threads, when message comes in
• We build: shortcut from IB driver into MPI threads (no Linux!)
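The "busy wait very shortly, then block" rule from the list above can be sketched as a spin-then-block wait. This is a generic illustration using Python threads, not the FFMK or MPI implementation; `wait_for_message` and the `spin_seconds` value are hypothetical.

```python
import threading
import time

def wait_for_message(arrived, spin_seconds=0.0001):
    """Poll briefly for low latency, then block so the core is freed."""
    deadline = time.perf_counter() + spin_seconds
    while time.perf_counter() < deadline:
        if arrived.is_set():
            return "spun"      # message came in during the short polling phase
    arrived.wait()             # sleep until the sender signals the event
    return "blocked"

# Case 1: the message is already there, so the short spin is enough.
fast = threading.Event()
fast.set()
fast_result = wait_for_message(fast)

# Case 2: the message arrives late, so the waiter gives up its core and blocks.
slow = threading.Event()
threading.Timer(0.2, slow.set).start()
slow_result = wait_for_message(slow)
print(fast_result, slow_result)
```

While a waiter is blocked, an oversubscribed core can run another rank instead of burning cycles; the cost is the wake-up latency, which is why the slide asks for fast unblocking when a message comes in.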