A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING


  1. The Hebrew University of Jerusalem
 A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING
 Amnon Barak, Hebrew University Jerusalem (HUJI)
 Hermann Härtig, TU Dresden, Operating Systems Group (TUDOS)
 Wolfgang Nagel, TU Dresden, Center for Information Services and HPC (ZIH)
 Alexander Reinefeld, Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)
 CARSTEN WEINHOLD, TU DRESDEN

  2. TRADITIONAL HPC
 The ideal world assumption:
 ■ Identical, predictable, and reliable nodes
 ■ Fast and reliable interconnect
 ■ Balanced applications
 ■ Isolated partitions of fixed size

  3. TRADITIONAL HPC
 [Figure labels: Time, Fixed-size chunks of work, One thread per core]

  4. TRADITIONAL HPC
 Systems software: optimize communication latency
 Message passing uses polling
 Batch scheduler for start / stop
 Separate servers for I/O
 Small OS on each node
 No OS on critical path

  5. LOAD
 Application: CP2K
 [Figure: computation time (fraction) per process ID (1 to 512) over timesteps]
 Computation/communication ratio of CP2K on 512 cores

  6. LOAD
 Application: COSMO-SPECS+FD4
 [Figure: computation time (fraction) per process ID (1 to 128) over timesteps]
 Computation/communication ratio of COSMO-SPECS+FD4 on 128 cores

  7. REALITY CHECK
 Application: COSMO-SPECS+FD4
 [Figure panel: Unbalanced compute times of ranks per time step]
 [Figure panel: Hand-balanced compute times of ranks per time step]
 Computation/communication ratio of COSMO-SPECS+FD4 on 128 cores

  8. REALITY CHECK
 Application: PRIME
 [Figure: Unbalanced compute times of ranks per time step]
 Now think of:
 • Composite applications
 • In-situ visualization, etc.

  9. FFMK
 FFMK: A Fast and Fault-Tolerant Microkernel-Based Operating System for Exascale Computing
 German Priority Programme 1648 “Software for Exascale Computing”

  10. CHALLENGES
 Dynamic applications & platforms
 Increased fault rates
 Power / Dark silicon
 Heterogeneity (cores, memory, …)
 [Figure: twelve boxes, each labeled FFMK-OS]

  11. NODE ARCHITECTURE
 [Figure labels: Application (three times), Monitoring, Platform Management, Service OS, Runtime Support, Runtime, Linux (drivers, etc.), Light-weight Kernel (L4), Compute cores, Service cores]

  12. 3 ABSTRACTIONS
 [Figure labels: Threads (twice), Address space (twice), Light-weight Kernel (L4)]

  13. MESSAGE PASSING
 [Figure labels: App (twice), Device, Interrupt, File System, Network, I/O, Light-weight Kernel (L4)]

  14. BLOCK, POLL, IRET
 [Bar chart, axis from 0 µs to 6 µs]
 Block (Linux): 5.3 µs
 Block (L4): 3.2 µs
 Polling: 0.1 µs
 • Intel Core i7 3770S @ 3100 MHz
 • No Hyperthreading, no Turboboost
 • No dynamic power management
 • Same socket
 • 64-bit Linux 3.11.6 (cpuidle.off=0, intel_idle.max_cstate=0)
 • 64-bit L4/Fiasco.OC
 Wake from interrupt on L4: 900 cycles, 0.3 µs (best case, on Intel Core i7-4770 CPU @ 3.40GHz)
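
The cycle count and the time quoted for the L4 interrupt wake-up can be cross-checked with a small conversion. This is only a sketch: the helper names `cycles_to_us` and `us_to_cycles` are made up for illustration, and the numbers are the ones printed on the slide.

```python
# Convert between CPU cycles and microseconds at a fixed clock frequency.
# Helper names are illustrative; the input figures come from the slide.

def cycles_to_us(cycles: float, freq_ghz: float) -> float:
    # 1 GHz corresponds to 1000 cycles per microsecond.
    return cycles / (freq_ghz * 1000.0)

def us_to_cycles(us: float, freq_ghz: float) -> float:
    return us * freq_ghz * 1000.0

# Wake from interrupt on L4: 900 cycles on a 3.40 GHz Core i7-4770.
wake_us = cycles_to_us(900, 3.40)
print(f"L4 interrupt wake-up: {wake_us:.2f} us")

# Blocking latencies from the bar chart, expressed in cycles at 3.1 GHz.
latency_us = {"Block (Linux)": 5.3, "Block (L4)": 3.2, "Polling": 0.1}
for name, us in latency_us.items():
    print(f"{name}: about {round(us_to_cycles(us, 3.1))} cycles")
```

The result rounds to the 0.3 µs stated on the slide.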

  15. LINUX ON L4
 [Figure labels: Linux App, Linux OS, File, Network, I/O, Light-weight Kernel (L4)]

  16. HYBRID SYSTEM
 [Figure labels: critical, uncritical, simple, complex]
 Real-time
 Security: small Trusted Computing Base
 Resilience: small Reliable Computing Base

  17. HYBRID SYSTEM
 [Figure labels: critical, uncritical, simple, complex, Service OS, File, Network, I/O, Light-weight Kernel (L4)]

  18. HYBRID SYSTEM
 [Figure labels: MPI App, Service, Proxy, MPI Lib, Infiniband, Service OS, File, Network, I/O, Light-weight Kernel (L4)]

  19. DRIVER REUSE
 [Figure labels: Msg Buffer 1 (twice), Msg Buffer 2 (twice), L4 App, Proxy App, libibverbs, User-space Driver, Linux Kernel, /dev/ib0, I/O (twice), IB Core, Kernel Driver, Light-weight Kernel (L4)]

  20. NODE ARCHITECTURE
 [Figure labels: Application (five times), Service OS, Checkpoint, MPI Proxies, MPI Library, Runtime (twice), Infiniband (twice), Chkpt., Linux Kernel, Light-weight Kernel (L4), Compute cores, Service cores]

  21. FAILURE RATES
 MTBF for Component Failures in an HPC System

 Failed Component     # of Nodes Affected     MTBF
 PFS, core switch     1408                    65.10 days
 Rack                 32                      86.90 days
 Edge switch          16                      17.37 days
 PSU                  4                       28.94 days
 Compute nodes        1                       15.8 hours

 Kento Sato et al., “Design and Modeling of a Non-blocking Checkpointing System”, SC’12
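
To get a feel for what these numbers imply together, the per-class MTBF values can be folded into one system-level figure. The sketch below assumes the failure classes are independent and have constant rates, so the rates simply add. That assumption is not stated on the slide, and `combined_mtbf_hours` is an illustrative helper name.

```python
HOURS_PER_DAY = 24.0

# (failed component, nodes affected, MTBF in hours), as listed on the slide.
failures = [
    ("PFS, core switch", 1408, 65.10 * HOURS_PER_DAY),
    ("Rack", 32, 86.90 * HOURS_PER_DAY),
    ("Edge switch", 16, 17.37 * HOURS_PER_DAY),
    ("PSU", 4, 28.94 * HOURS_PER_DAY),
    ("Compute nodes", 1, 15.8),
]

def combined_mtbf_hours(entries):
    # Assumption: independent failure classes with constant rates,
    # so the combined failure rate is the sum of the individual rates.
    total_rate = sum(1.0 / mtbf for _, _, mtbf in entries)
    return 1.0 / total_rate

system_mtbf = combined_mtbf_hours(failures)
print(f"Time between failures of any kind: {system_mtbf:.1f} hours")
```

Under this assumption the combined value stays close to the compute-node MTBF, because single-node failures dominate the total rate.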

  22. XTREEMFS
 In-memory XtreemFS volume for application-level checkpointing
 RAID-5 erasure coding: recovery with 1 failed OSD
 Demonstrator running BQCD code on a Cray XC30
 [Figure labels: OSD (four times), …, DIR, MRC, Application (BQCD)]
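
The idea behind recovering from one failed OSD can be illustrated with plain XOR parity. This is a minimal sketch of the general technique, not XtreemFS code: the function names are hypothetical, and parity is kept in a fixed last block for brevity, whereas RAID-5 rotates it across devices.

```python
from functools import reduce

def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def encode(data: bytes, k: int):
    # Split data into k equal blocks (zero-padded) and append one parity block.
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(k)]
    return blocks + [xor_blocks(blocks)]

def recover(stripe, lost: int):
    # Rebuild the block at index `lost` by XOR-ing all surviving blocks.
    survivors = [block for i, block in enumerate(stripe) if i != lost]
    return xor_blocks(survivors)

checkpoint = b"lattice state at step 42"   # stand-in for checkpoint data
stripe = encode(checkpoint, k=4)           # 4 data blocks + 1 parity block
lost = 2                                   # pretend this OSD failed
rebuilt = recover(stripe, lost)
print(rebuilt == stripe[lost])
```

Any single block, data or parity, can be rebuilt this way. A second simultaneous loss cannot, which matches the "1 failed OSD" limit on the slide.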

  23. BANDWIDTH
 [Figure: bandwidth in TB/s (log scale, 0.1 to 10000) over number of nodes (1K to 96K); series: memcpy 64ppn, CRUISE 64ppn, CRUISE 32ppn, CRUISE 16ppn, ramdisk 16ppn; panel (b) Sequoia Cluster (IBM Blue Gene/Q)]
 Raghunath Rajachandrasekar et al., “A 1 PB/s File System to Checkpoint Three Million MPI Tasks”, HPDC’13

  24. NODE ARCHITECTURE
 [Figure labels: Application (five times), Platform Management, Service OS, Checkpoint, MPI Proxies, MPI Library, Runtime (twice), Infiniband (twice), Chkpt., Linux Kernel, Light-weight Kernel (L4), Compute cores, Service cores]

  25-28. OVERDECOMPOSITION
 [Figure, shown in four steps across slides 25 to 28; labels: processing units, time, Barrier]
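
The effect the figures appear to illustrate can be reproduced with a toy model: unbalanced work is split into more, smaller pieces than there are processing units, and the pieces are spread across the units before the barrier. The sketch below is an idealization. The greedy longest-first placement and the function names are assumptions for illustration, not the scheduling policy used in the talk.

```python
import heapq

def time_to_barrier(chunks, units):
    # Greedy longest-first placement onto the least-loaded unit.
    # Returns the time at which the last unit reaches the barrier.
    loads = [0.0] * units
    heapq.heapify(loads)
    for chunk in sorted(chunks, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + chunk)
    return max(loads)

def overdecompose(work, factor):
    # Split every work item into `factor` equal pieces.
    return [w / factor for w in work for _ in range(factor)]

work = [8.0, 2.0, 2.0, 2.0]   # one overloaded rank, three light ones
units = 4
for factor in (1, 2, 4, 8):
    t = time_to_barrier(overdecompose(work, factor), units)
    print(f"{factor}x: barrier reached at t = {t}")
```

In this model the barrier time can never drop below total work divided by the number of units, so further splitting stops helping once that bound is reached.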

  29. OVERDECOMPOSITION
 COSMO-SPECS+FD4 (unbalanced, HT)
 [Bar chart: run time (0 s to 1000 s) for Baseline, 2x, 4x, 8x, 16x]
 Oversubscribed runs: 32 to 256 MPI ranks on 4 quad-core nodes (w/ polling)

  30. BUSY WAITING
 Busy waiting = Computation

  31. OVERDECOMPOSITION
 COSMO-SPECS+FD4 (unbalanced, HT)
 Polling (busy waiting)
 [Bar chart: run time (0 s to 1000 s) for Baseline, 2x, 4x, 8x, 16x]
 Oversubscribed runs: 32 to 256 MPI ranks on 4 quad-core nodes (w/ polling)

  32. OVERDECOMPOSITION
 COSMO-SPECS+FD4 (balanced, no HT)
 COSMO-SPECS+FD4 (unbalanced, no HT)
 [Bar chart: run time (0 s to 300 s) for Baseline, 2x, 4x, 8x, 16x]
 Oversubscribed runs: 16 to 256 MPI ranks on 4 quad-core nodes (w/o polling)

  33. OVERDECOMPOSITION
 CP2K (unbalanced, no HT)
 [Bar chart: run time (0 s to 350 s) for Baseline, 2x, 4x, 8x, 16x]
 Oversubscribed runs: 16 to 256 MPI ranks on 4 quad-core nodes (w/o polling)

  34. OVERDECOMPOSITION
 [Figure labels: processing units, time, Barrier]
 With MPI:
 • Do not: busy wait (except very shortly)
 • Do: Block in kernel
 • Needs: fast unblocking of threads, when message comes in
 • We build: shortcut from IB driver into MPI threads (no Linux!)
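
The "busy wait only very shortly, then block" rule can be sketched as a two-phase wait. This is a user-level analogy in Python, not the FFMK mechanism: a `threading.Event` stands in for the incoming message, and the function name and the 50 µs spin budget are illustrative assumptions.

```python
import threading
import time

def wait_for_message(arrived: threading.Event, spin_s: float = 50e-6) -> str:
    # Phase 1: poll for a very short time, in case the message is
    # already there or about to arrive.
    deadline = time.perf_counter() + spin_s
    while True:
        if arrived.is_set():
            return "polled"
        if time.perf_counter() >= deadline:
            break
    # Phase 2: stop burning the core and block until the sender signals.
    arrived.wait()
    return "blocked"

arrived = threading.Event()
sender = threading.Timer(0.05, arrived.set)  # message "arrives" after 50 ms
sender.start()
how = wait_for_message(arrived)
sender.join()
print(f"receiver was woken via: {how}")
```

While the receiver is blocked, its core is free to run another oversubscribed rank. The cost is the unblocking latency, which is why the slide asks for fast unblocking of threads when a message comes in.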
