Towards Exascale Computing
Yutaka Ishikawa
University of Tokyo / RIKEN AICS
Outline of This Talk
• Activities at the University of Tokyo and RIKEN AICS
  – Many-core based PC cluster
  – System software stack
  – Prototype system
• Rethinking how to use the MPI library on state-of-the-art supercomputers
  – Do MPI_Isend/MPI_Recv really help with overlapping communication and computation?
Yutaka Ishikawa @ University of Tokyo / RIKEN AICS
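The overlap question in the last bullet can be made concrete with a small sketch. The Python fragment below (a hypothetical illustration: `fake_isend` and the threads stand in for MPI's nonblocking transfer and progress, and the sleep durations are made up) shows the pattern MPI_Isend is meant to enable: start the transfer, compute independently, then wait for completion.

```python
import threading
import time

def fake_isend(data, done):
    """Stand-in for MPI_Isend (hypothetical): 'transfers' data in the background."""
    def transfer():
        time.sleep(0.2)            # pretend network transfer time
        done.set()
    threading.Thread(target=transfer, daemon=True).start()

def compute():
    time.sleep(0.2)                # independent computation

start = time.time()
done = threading.Event()
fake_isend([1, 2, 3], done)        # like MPI_Isend: returns immediately
compute()                          # computation overlapped with the transfer
done.wait()                        # like MPI_Wait
elapsed = time.time() - start
# With true overlap, elapsed is about max(0.2, 0.2), not 0.2 + 0.2
print(f"elapsed: {elapsed:.2f}s")
```

Whether a real MPI implementation actually makes progress on the transfer while the application computes is exactly the question the talk raises; without asynchronous progress, the transfer may only advance inside MPI_Wait.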
Post T2K Todai: Roadmap (2005 to 2020)
• T2K Todai: 140 TFlops (HA8000 cluster), Hongo Campus
• Hitachi SR11000, then Hitachi SR16000/M1: 56 nodes (32 cores/node), 54.9 TFlops, 11,200 GB memory
• Fujitsu PRIMEHPC FX10: 4,800 nodes (16 cores/node), 1.13 PFlops, 150 TB memory, Kashiwa Campus
• "K" Computer: 10 PFlops supercomputer
• Post T2K: 40 to 100 PFlops (R&D)
• Exa-scale: 100+ PFlops
Variations of Many-core Based Machines
• Many-core board connected via PCI-Express (e.g., Intel Knights Ferry, Knights Corner)
• Many-core chip connected to the system bus (not existing so far)
• Many-core inside the CPU die (cf. Intel Sandy Bridge with integrated GPU)
• Many-core only (not existing so far)
(Diagram: host board with host CPU, IOH, and memory; many-core board with its own memory, attached via PCI-Express.)
Source: http://pc.watch.impress.co.jp/docs/column/kaigai/20100412_360173.html
Post T2K System Image: Requirements
• The requirements of both large data analysis and number-crunching applications must be satisfied:
  – I/O performance
  – Floating-point performance
  – Parallel performance
• Many-core units: number crunching
• Host CPU units: controlling the many-core units, processing data-analysis code, handling file-system operations
(Diagram: each node combines a host CPU unit, many-core units, and an SSD; nodes are connected by an interconnect to the network for nodes and the storage area network.)
Post T2K System Image: Execution Image (1)
• Co-execution of two types of jobs within a partition:
  – Many-cores: number-crunching application
  – Host CPUs: I/O-intensive application; the host CPU is also used for file I/O and memory swap
Post T2K System Image: Execution Image (2)
• One job execution within a partition:
  – Many-cores: computation and communication
  – Host CPUs: memory share/swap, communication, and I/O
Post T2K System Image: Execution Image (3)
• One job execution within a partition:
  – Many-cores: computation and communication
  – Host CPUs: memory share/swap, communication, and I/O
• Typical execution pattern:

  do {
    for (...) {
      for (...) {
        for (...) {
          /* Computation */
          /* Due to the limited memory in the many-core units,
             data is swapped to memory in the host CPU */
        }
      }
    }
    /* Now most data is located in host memory */
    /* Data exchange among remote nodes */
  } while (...);
Post T2K System Software Stack
• Two configurations: non-bootable many-core (micro kernel on the many-core, Linux kernel on the host, connected via PCI-Express and an InfiniBand network card) and bootable many-core (micro kernel only).
• Layers, from bottom to top: AAL (host and many-core sides) with device drivers; IKCL; SMSL; micro kernel / Linux kernel; glibc and glibc for many-core; basic communication library; P2P communication library; parallel file system library; MPI and next-generation communication libraries.
• AAL (Accelerator Abstraction Layer)
  – Provides a low-level accelerator interface
  – Enhances portability of the micro kernel
• IKCL (Inter-Kernel Communication Layer)
  – Provides general-purpose communication and data-transfer mechanisms
• SMSL (System Service Layer)
  – Provides basic system services on top of the communication layer
• Design criteria:
  – Cache-aware system software stack
  – Scalability
  – Minimum overhead of the communication facility
  – Portability
Post T2K System Software Stack (cont.)
• Because many-cores have small memory caches and limited memory bandwidth, the cache footprint during both user and system program execution should be minimized.
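The layering described on the previous slide can be illustrated with a toy sketch. The real AAL/IKCL/SMSL interfaces are not shown in this talk, so every class and method below is hypothetical; the sketch only shows the shape of the stack: a raw transfer layer at the bottom, general-purpose messaging above it, and a system service (here, a file-open request forwarded to the host) on top.

```python
class AAL:
    """Hypothetical Accelerator Abstraction Layer: raw byte transfer."""
    def __init__(self):
        self.channel = []                  # stand-in for a PCI-Express channel
    def raw_write(self, payload: bytes):
        self.channel.append(payload)
    def raw_read(self) -> bytes:
        return self.channel.pop(0)

class IKCL:
    """Hypothetical Inter-Kernel Communication Layer: messages over AAL."""
    def __init__(self, aal: AAL):
        self.aal = aal
    def send(self, msg: str):
        self.aal.raw_write(msg.encode())
    def recv(self) -> str:
        return self.aal.raw_read().decode()

class SMSL:
    """Hypothetical System Service Layer: basic services over IKCL,
    e.g. forwarding a file-system request from a many-core to the host."""
    def __init__(self, ikcl: IKCL):
        self.ikcl = ikcl
    def request_open(self, path: str) -> str:
        self.ikcl.send(f"open {path}")
        return self.ikcl.recv()            # the host's reply would arrive here

aal = AAL()
smsl = SMSL(IKCL(aal))
smsl.ikcl.send("host reply: fd=3")         # pre-load a fake host reply
reply = smsl.request_open("/data/x")
print(reply)                               # → host reply: fd=3
```

Keeping each layer this thin is one way to pursue the stated design criteria: a small, portable micro kernel needs only AAL underneath it, and higher services reuse the one communication path.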