
System Software for Armv8-A with SVE

System Software for Armv8-A with SVE. Yutaka Ishikawa, Leader of FLAGSHIP2020 Project, RIKEN Center for Computational Science. 9:00–9:25, 14th of January, 2019. Open Source HPC Collaboration on Arm Architecture, Linaro workshop, Guangzhou, China.


  1. System Software for Armv8-A with SVE
     Yutaka Ishikawa, Leader of FLAGSHIP2020 Project, RIKEN Center for Computational Science
     9:00–9:25, 14th of January, 2019
     Open Source HPC Collaboration on Arm Architecture, Linaro workshop, Guangzhou, China

  2. Background: Flagship2020
     • Missions
       - Building the Japanese national flagship supercomputer, Post-K, and
       - Developing a wide range of HPC applications, running on Post-K, in order to solve social and scientific issues in Japan
     • Project organization
       - Post-K computer development
         - RIKEN AICS is in charge of development
         - Fujitsu is the vendor partner
         - International collaborations: DOE, CEA, JLESC (NCSA, ANL, UTK, JSC, BSC, INRIA, RIKEN)
       - Applications
         - The government selected 9 social & scientific priority issues and 4 exploratory issues, together with their R&D organizations
     2019/1/14, RIKEN Center for Computational Science

  3. Background: Flagship2020, Target Applications
     Program: brief description
     ① GENESIS: MD for proteins
     ② Genomon: Genome processing (genome alignment)
     ③ GAMERA: Earthquake simulator (FEM in unstructured & structured grid)
     ④ NICAM+LETK: Weather prediction system using big data (structured grid stencil & ensemble Kalman filter)
     ⑤ NTChem: Molecular electronic (structure calculation)
     ⑥ FFB: Large Eddy Simulation (unstructured grid)
     ⑦ RSDFT: An ab-initio program (density functional theory)
     ⑧ Adventure: Computational Mechanics System for Large Scale Analysis and Design (unstructured grid)
     ⑨ CCS-QCD: Lattice QCD simulation (structured grid Monte Carlo)

  4. Background: Post-K CPU A64FX (Courtesy of FUJITSU LIMITED)
     • Architecture: Armv8.2-A SVE (512 bit SIMD)
     • Core: 48 cores for compute and 2/4 for OS activities
       - DP: 2.7+ TF, SP: 5.4+ TF, HP: 10.8 TF
     • Cache
       - L1D: 64 KiB, 4 way, 230 GB/s (load), 115 GB/s (store)
       - L2: 8 MiB, 16 way, 115 GB/s (load), 57 GB/s (store)
     • Memory: HBM2 32 GiB, 1024 GB/s
     • Interconnect: TofuD (28 Gbps x 2 lane x 10 port)
     • I/O: PCIe Gen3 x 16 lane
     • Technology: 7nm FinFET
     • Performance
       - Stream triad: 830+ GB/s
       - Dgemm: 2.5+ TF (90+% efficiency)
     Diagram legend: CMG: CPU Memory Group; NOC: Network On Chip
     ref. Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018.
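The headline figures on the A64FX slide can be cross-checked with a little arithmetic. The sketch below is illustrative only: it treats the slide's "+" lower bounds as exact values, the function name `derived_metrics` is made up for this example, and the TofuD figure is the raw signalling rate summed over all ports, ignoring encoding and protocol overhead.

```python
# Figures copied from the slide; "+" lower bounds are treated as exact here.
PEAK_DP_TFLOPS = 2.7        # double-precision peak
DGEMM_TFLOPS = 2.5          # measured Dgemm
HBM2_GB_PER_S = 1024.0      # memory bandwidth
STREAM_TRIAD_GB_PER_S = 830.0
SVE_VECTOR_BITS = 512
TOFUD_GBIT_PER_LANE = 28
TOFUD_LANES_PER_PORT = 2
TOFUD_PORTS = 10


def derived_metrics():
    """Return a dict of ratios derived from the slide's numbers (hypothetical helper)."""
    tofud_raw_gbit = TOFUD_GBIT_PER_LANE * TOFUD_LANES_PER_PORT * TOFUD_PORTS
    return {
        # Fraction of DP peak reached by Dgemm; the slide quotes "90+%".
        "dgemm_efficiency": DGEMM_TFLOPS / PEAK_DP_TFLOPS,
        # Fraction of HBM2 bandwidth reached by Stream triad.
        "stream_efficiency": STREAM_TRIAD_GB_PER_S / HBM2_GB_PER_S,
        # Memory bytes available per DP floating-point operation at peak.
        "bytes_per_dp_flop": (HBM2_GB_PER_S * 1e9) / (PEAK_DP_TFLOPS * 1e12),
        # Elements per 512-bit SVE vector for 64-, 32- and 16-bit types.
        "lanes": {bits: SVE_VECTOR_BITS // bits for bits in (64, 32, 16)},
        # Raw aggregate link rate over all ports, in Gbit/s and GB/s.
        "tofud_raw_gbit_per_s": tofud_raw_gbit,
        "tofud_raw_gbyte_per_s": tofud_raw_gbit / 8,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value}")
```

The lane counts (8, 16 and 32 elements per vector) line up with the 1:2:4 ratio of the DP, SP and HP peak figures on the slide.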

  5. Background: An Overview of Post-K Hardware
     • Compute Node, Compute + I/O Node connected by 6D mesh/torus interconnect
     • 3-level hierarchical storage system
       - 1st layer
         - Cache for global file system
         - Temporary file systems
           - Local file system for compute node
           - Shared file system for a job
       - 2nd layer
         - Lustre-based global file system
       - 3rd layer
         - Storage for archive

  6. An Overview of System Software Stack
     • Ease of use is one of our KPIs (Key Performance Indicators)
     • Providing a wide range of applications/tools/libraries/compilers
     Stack diagram, top to bottom:
     • Linux distribution eco-system
     • Fortran, C/C++, OpenMP, Java, …; Math libraries; Tuning and Debugging Tools; Batch Job System; Hierarchical File System; Parallel File System
     • Parallel Programming Environments: XMP, FDPS, …
     • Communication: MPI, Low Level Communication
     • File I/O: Application-oriented File I/O, File I/O for Hierarchical Storage (LLIO)
     • Process/Thread: PIP
     • Multi-Kernel System: Linux and light-weight kernel (McKernel)
     • Armv8 + SVE

  7. Post-K Programming Environment
     • Programming Languages and Compilers provided by Fujitsu
       - Fortran2008 & Fortran2018 subset
       - C11 & GNU and Clang extensions
       - C++14 & C++17 subset and GNU and Clang extensions
       - OpenMP 4.5 & OpenMP 5.0 subset
       - Java
       - GCC, LLVM, and Arm compiler will also be available
     • Parallel Programming Language & Domain Specific Library provided by RIKEN
       - XcalableMP
       - FDPS (Framework for Developing Particle Simulator)
     • Process/Thread Library provided by RIKEN
       - PiP (Process in Process)
     • Script Languages provided by Linux distributor
       - E.g., Python+NumPy, SciPy
     • Communication Libraries
       - MPI 3.1 & MPI4.0 subset
         - Open MPI base (Fujitsu), MPICH (RIKEN)
       - Low-level Communication Libraries
         - uTofu (Fujitsu), LLC (RIKEN)
     • File I/O Libraries provided by RIKEN
       - pnetCDF, DTF, FTAR
     • Math Libraries
       - BLAS, LAPACK, ScaLAPACK, SSL II (Fujitsu)
       - EigenEXA, Batched BLAS (RIKEN)
     • Programming Tools provided by Fujitsu
       - Profiler, Debugger, GUI
     Note on the slide: "Scalable" is also running on Oakforest-PACS, which is operated by the University of Tsukuba and the University of Tokyo.

  8. Open Source Management Tools
     • EasyBuild
       - Used at CEA
       - RIKEN is evaluating it. As an example, CAFFE, a deep learning tool, is ported to an Arm machine using EasyBuild
       - CAFFE consists of several open source packages: boost, blas, cmake, gflags, google (glog, googletest, snappy, leveldb, protobuf), lmdb, opencv
     • Spack
       - Used in the ECP project
       - RIKEN is also evaluating Spack.

  9. IHK/McKernel developed at RIKEN
     • Partition resources (CPU cores, memory)
       - Full Linux kernel on some cores
         - System daemons and in-situ non-HPC applications
         - Device drivers
       - Light-weight kernel (LWK), McKernel, on other cores
         - HPC applications
     • IHK: Linux kernel module (Interface for Heterogeneous Kernels)
       - Allows dynamic partitioning of node resources: CPU cores, physical memory, …
       - Enables management of LWKs (assign resources, load, boot, destroy, etc.)
       - Provides inter-kernel communication, messaging and notification
     • McKernel: Light-weight kernel
       - Is designed for HPC: noiseless, simple
       - Implements only performance-sensitive system calls, e.g., process and memory management; the rest are offloaded to Linux
     • Executes the same binary as Linux without any recompilation
     • IHK/McKernel runs on
       - Intel Xeon and Xeon Phi
       - Fujitsu FX10 and FX100 (Experiments)
     Diagram: cores, memory and interrupts are split into two partitions. The Linux partition runs system daemons and in-situ non-HPC applications on top of the complex Linux API (glibc, /sys/, /proc/), TCP stack, VFS, memory management, file system drivers, device drivers and a general scheduler. The McKernel partition runs HPC applications on a thin LWK with very simple memory management and process/thread management.
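The division of labour between the two kernels on the IHK/McKernel slide can be illustrated with a toy dispatcher. This is a minimal sketch, not McKernel code: the class names, the `partition_cores` helper and the contents of `LOCAL_SYSCALLS` are hypothetical, since the slide only says that process and memory management calls are implemented locally and everything else is offloaded to Linux.

```python
from dataclasses import dataclass, field

# Hypothetical set of calls served by the LWK itself. The slide names only
# the categories "process and memory management", not individual calls.
LOCAL_SYSCALLS = frozenset({"mmap", "munmap", "brk", "clone", "exit"})


def partition_cores(core_ids, linux_count):
    """Split a node's cores into a Linux partition and an LWK partition."""
    if not 0 < linux_count < len(core_ids):
        raise ValueError("both partitions need at least one core")
    return list(core_ids[:linux_count]), list(core_ids[linux_count:])


@dataclass
class LinuxProxy:
    """Stands in for the Linux side that serves offloaded system calls."""
    served: list = field(default_factory=list)

    def handle(self, name, *args):
        self.served.append(name)
        return f"linux:{name}"


@dataclass
class LightWeightKernel:
    """Toy LWK: handles performance-sensitive calls, offloads the rest."""
    linux: LinuxProxy
    handled_locally: list = field(default_factory=list)

    def syscall(self, name, *args):
        if name in LOCAL_SYSCALLS:
            self.handled_locally.append(name)
            return f"lwk:{name}"
        # Anything else goes to Linux (inter-kernel communication in the real system).
        return self.linux.handle(name, *args)


if __name__ == "__main__":
    linux_cores, lwk_cores = partition_cores(list(range(52)), 4)
    lwk = LightWeightKernel(LinuxProxy())
    print(len(linux_cores), len(lwk_cores))
    print(lwk.syscall("mmap"), lwk.syscall("open"))
```

Because the application sees the same system call interface either way, a design like this is what lets the same Linux binary run unmodified; only the path a call takes differs.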

  10. How to deploy IHK/McKernel
     • Linux kernel with IHK kernel module is resident
       - Daemons for the job scheduler etc. run on Linux
     • McKernel is dynamically reloaded (rebooted) by IHK for each application
     • No hardware reboot
     Diagram:
     • App A, requiring LWK-without-scheduler, is invoked, then finishes
     • App B, requiring LWK-with-scheduler, is invoked, then finishes
     • App C, using full Linux capability, is invoked, then finishes
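The deployment sequence on this slide can be sketched as a small state model: Linux stays resident, an LWK variant is booted and destroyed around each application that asks for one, and the hardware boot count never changes. The `Node` class and its fields are hypothetical names for this illustration and do not reflect the real IHK interface.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Toy model of a node with resident Linux and a per-application LWK."""
    hardware_boots: int = 1          # the node is powered on once
    lwk: Optional[str] = None        # currently booted LWK variant, if any
    log: list = field(default_factory=list)

    def run(self, app, lwk_variant=None):
        # Boot the requested LWK variant; no hardware reboot is involved.
        if lwk_variant is not None:
            self.lwk = lwk_variant
            self.log.append(f"boot {lwk_variant}")
        self.log.append(f"run {app} on {self.lwk or 'Linux'}")
        # Tear the LWK down again so the next application starts clean.
        if self.lwk is not None:
            self.log.append(f"destroy {self.lwk}")
            self.lwk = None


if __name__ == "__main__":
    node = Node()
    node.run("App A", "LWK-without-scheduler")
    node.run("App B", "LWK-with-scheduler")
    node.run("App C")  # uses full Linux capability
    print("\n".join(node.log))
```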

  11. miniFE (CORAL benchmark suite)
     • Conjugate gradient, strong scaling
     • Up to 3.5X improvement (Linux falls over..)
     • Oakforest-PACS supercomputer, 25 PF in peak, at JCAHPC organized by U. of Tsukuba and U. of Tokyo
     • Results using the same binary
     Balazs Gerofi, Rolf Riesen, Robert W. Wisniewski and Yutaka Ishikawa: “Toward Full Specialization of the HPC System Software Stack: Reconciling Application Containers and Lightweight Multi-kernels”, International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2017

  12. Support of Software Development/Porting for Post-K
     Contribution to Arm HPC (Armv8-A SVE) Ecosystem
     Timeline (CY2017 to CY2021, with a NOW marker):
     • Project phases: Design and Implementation; Manufacturing, Installation, and Tuning; Operation
     • Specification: Armv8-A + SVE overview; detailed hardware info.
     • Optimization Guidebook: publishing incrementally
     • Performance Evaluation Environment: RIKEN performance estimation tool using FX100; RIKEN Simulator
     • Early Access Program
     Milestones:
     • CY2018 Q2: Optimization guidebook is incrementally published
     • CY2020 Q2: Early access program starts
     • CY2021 Q1/Q2: General operation starts

  13. Concluding Remarks
     https://postk-web.r-ccs.riken.jp/faq.html

  14. BACKUP

  15. MPI Communication implemented using Tofu2 and TofuD
     • Tofu2 and TofuD offloading mechanism
       - Send commands (PUT, GET, NOP) are posted to a command queue, and the Tofu network interface processes the posted commands.
     • Tofu2 has two packet processing modes: Normal Mode and Session Mode. In Session Mode, a special register called the Scheduling Pointer plays an important role.
     • Scheduling Pointer: Commands enqueued in the command queue are processed until reaching the entry pointed to by the Scheduling Pointer. The Scheduling Pointer is updated by a packet sent by a remote node.
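The Session Mode behaviour described on this slide can be modelled in a few lines. This is a simplified simulation under stated assumptions, not the Tofu hardware interface: the class `SessionModeQueue` and its method names are invented for illustration, the entry the Scheduling Pointer points to is assumed not to be processed itself (the slide does not say whether the bound is inclusive), and the pointer is assumed to move only forward.

```python
VALID_OPS = ("PUT", "GET", "NOP")


class SessionModeQueue:
    """Toy command queue gated by a remotely updated scheduling pointer."""

    def __init__(self):
        self.commands = []           # posted commands, in posting order
        self.head = 0                # index of the next command to process
        self.scheduling_pointer = 0  # processing stops before this index

    def post(self, op, payload=None):
        """Host side: enqueue a command; nothing is executed yet."""
        if op not in VALID_OPS:
            raise ValueError(f"unknown command {op!r}")
        self.commands.append((op, payload))

    def on_remote_packet(self, new_pointer):
        """A packet from a remote node advances the scheduling pointer."""
        # Assumption: the pointer never moves backwards.
        self.scheduling_pointer = max(self.scheduling_pointer, new_pointer)

    def process(self):
        """Network interface side: run commands up to the pointer."""
        limit = min(self.scheduling_pointer, len(self.commands))
        done = self.commands[self.head:limit]
        self.head = max(self.head, limit)
        return done


if __name__ == "__main__":
    q = SessionModeQueue()
    q.post("PUT", "block-0")
    q.post("GET", "block-1")
    q.post("NOP")
    print(q.process())      # nothing is released yet
    q.on_remote_packet(2)   # remote node releases the first two commands
    print(q.process())
```

The point of the model is that the host can post a whole chain of commands up front, and progress is then driven by packets arriving from the remote node rather than by further host intervention.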
