
Scale-out Computing Model on Massive Core System: From HPC to Fabric-Based SoC



  1. Scale-out Computing Model on Massive Core System: From HPC to Fabric-Based SoC
     Dr. Fu Li (li@qcftech.com)
     Quantum Cloud Future (Beijing) Technologies Co., Ltd.

  2. Cook Book
     1. What is Massive Core System (MCS)?
        1.1. HPC system
        1.2. GPU system
        1.3. Microslides: Fabric-based SoC (new)
     2. Why is scale-out computing important in MCS?
     3. How to make MCS faster?
        3.1. MPI and OpenMP in HPC
        3.2. Memory coalescing and cudaDMA in GPU computing
     4. QCF's scale-out computing model for Microslides
        4.1. The hardware (Socionext)
        4.2. The architecture
        4.3. The result (ARM vs x86 vs GPU)

  3. Introduction to Quantum Cloud
     Topics: Quantum Theory and Spectroscopy; Content-Centric Networking; GPU switch; Molecular Dynamics; Fast Fourier Transform; Doppler; ASIC; Boba; FPGA; CUDA; Cloud Storage; HPC; Statistical Mechanics; PacketShader; MPI, OpenMP
     With a background in quantum calculation, we 1) perform large-scale molecular dynamics simulations on HPC clusters using Amber and Gromacs, and 2) optimize Fourier transforms and matrix operations on multicore systems.

  4. Introduction to Quantum Cloud
     Then we found that the GPU is a great tool for both molecular dynamics and matrix operations.

  5. Introduction to Quantum Cloud
     Later we found similar systems with massive numbers of CPU cores.

  6. Introduction to Quantum Cloud
     Today we will show some practical examples of our scale-out algorithm on these systems.

  7. System and Cores: Communication Matters
     [Figure: number of cores (1 to 100,000) versus system power consumption (1 W to 1M W), both on logarithmic axes. General-purpose systems plotted: PC, Server, Blade Server, Super Computer. Source: QCF & SOCIONEXT]

  8. System and Cores: Communication Matters
     [Figure: the same chart with special-purpose systems added: GPU, GPU Cluster.]

  9. System and Cores: Communication Matters
     [Figure: the same chart with ARM systems added: ARM SoC, Traditional ARM Server.]

  10. System and Cores: Communication Matters
      [Figure: the same chart with Microslides systems added: Microslides of ARM CPU, Microslides of ARM SoC.]

  11. System and Cores: Communication Matters
      [Figure: the same chart annotated with the years 2006, 2012, 2018 and with three connection types: cluster connection, inter CPU connection, intra CPU connection.]

  12. Data Communication Between Systems Is an Obstacle
      [Diagram: the layer stack of a system, shown for two systems: Networking, Memory, Sockets Bus, Cache L2/L3, Intra CPU Fabric, Cache L1, cores. Layers are labeled as either I/O or Cache/Storage.]
      Hierarchical structure is critical for the Von Neumann architecture.

  13. Data Communication Between Systems Is an Obstacle
      [Diagram: the same layer stack, without the caption.]

  14. Data Communication Between Systems Is an Obstacle
      [Diagram: the same layer stack annotated with three levels of parallelism: algorithm-level, OS-level, instruction-level.]

  15. Data Communication Between Systems Is an Obstacle
      [Diagram: the same layer stack and parallelism levels, annotated with techniques in the order shown: batch, share-nothing, stateless computing; big RAM; avoid context switching; TLB, cache-conscious; big.LITTLE; GPU, FPGA; fast cache, cache prefetch; vector processing, SIMD/AVX.]

  16. Data Communication Between Systems Is an Obstacle
      [Diagram: the same annotated layer stack as the previous slide.]
      Consolidation will be the next-wave innovation for chip design and system optimization:
      • IO consolidation: networking, bus, fabric
      • storage consolidation: memory, cache, networking buffer

  17. Parallel and Scaling

  18. Fabric-Based ARM SoC
      • PCIe Fabric for networking
      • 768 cores
      • c2c 10 Gbps, 36 microsecond latency
      • 1 TB DDR4 RAM
      • 700 watts TDP per chassis

      System     watt/core
      ARM SoC    1
      x86        16 ~ 25
      GPU        0.3 ~ 0.5

      From SOCIONEXT
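
      The ARM SoC watt/core figure can be cross-checked from the chassis numbers on this slide. The sketch below is only that arithmetic; the function name `watts_per_core` is an illustrative choice, and it assumes all 768 cores share the 700 W chassis TDP.

```python
# Chassis figures quoted on the slide.
CHASSIS_TDP_WATTS = 700
CHASSIS_CORES = 768


def watts_per_core(total_watts: float, cores: int) -> float:
    """Average power budget per core for a chassis (hypothetical helper)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return total_watts / cores


arm_soc = watts_per_core(CHASSIS_TDP_WATTS, CHASSIS_CORES)
print(f"ARM SoC chassis: {arm_soc:.2f} W/core")
```

      The result is a little under 1 W per core, which agrees with the rounded value of 1 in the table.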

  19. Cluster Management Tools

                PBS                     openstack      kubernetes      mesos
      basic     batch process           kvm            container       container/noncontainer
      scenario  scientific calculation  private cloud  application CI  Datacenter OS

      pro (as listed on the slide): fast; very fast; very secure; fast; compatible with process and container; very flexible; very stable; secure; normally with MPI; system-level isolation; production ready; production ready; can be secure
      cons (as listed on the slide): high overhead; container app; no isolation; complexity; slow; not flexible enough

  20. Computing Architecture: Share-Nothing + Message Queue Architecture
      Stateless IO
      Use an "individual" (dedicated) core to do IO for the host to increase throughput.
      [Diagram: a host with several cores.]
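
      A minimal sketch of the share-nothing plus message-queue idea, assuming Python threads stand in for cores. One thread plays the dedicated IO core and only moves messages; the compute workers keep no shared state and see nothing but the messages they receive. All names here (`io_core`, `compute_core`, `run`) are hypothetical and are not taken from the QCF implementation.

```python
import queue
import threading

STOP = object()  # sentinel telling a compute worker to exit


def io_core(requests, work_q, n_workers):
    """Dedicated IO role: only moves messages, does no computation."""
    for req in requests:
        work_q.put(req)
    for _ in range(n_workers):
        work_q.put(STOP)


def compute_core(work_q, result_q):
    """Stateless worker: each result depends only on the message received."""
    while True:
        msg = work_q.get()
        if msg is STOP:
            break
        req_id, payload = msg
        result_q.put((req_id, sum(payload)))


def run(requests, n_workers=3):
    work_q = queue.Queue()
    result_q = queue.Queue()
    workers = [
        threading.Thread(target=compute_core, args=(work_q, result_q))
        for _ in range(n_workers)
    ]
    io = threading.Thread(target=io_core, args=(requests, work_q, n_workers))
    for t in workers:
        t.start()
    io.start()
    io.join()
    for t in workers:
        t.join()
    # All producers have finished, so draining the queue here is safe.
    results = {}
    while not result_q.empty():
        req_id, value = result_q.get()
        results[req_id] = value
    return results


requests = [(i, [i, i + 1, i + 2]) for i in range(6)]
results = run(requests)
print(sorted(results.items()))
```

      Because workers communicate only through the two queues, adding workers does not require any locking in the compute path; the queues are the only point of contact.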

  21. Example: PacketShader on GPU

  22. Example: Rendering on Arm
      Render@Baremetal: 3x improvement under concurrency
      [Chart: Intel vs ARM across the scenes buggy, fishy cat, bmps, teeglasFX, splash poked; y-axis 0 to 4.]
      Render@Container: 1.8x improvement with multiple concurrent instances
      [Chart: Baremetal vs 1 container, 2 containers, 4 containers on the bmw27 and classroom benchmarks; y-axis 0 to 2.]

  23. Example: Rendering on Arm
      [Chart: Intel vs ARM SoC for three measures: performance, scaled 1, scaled 2; y-axis 0 to 30.]
      scaled 1: performance scaled by frequency and core number
      scaled 2: performance scaled by frequency, core number, and watts
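
      The slide does not give the scaling formula. One plausible reading, shown here purely as an assumption, is that "scaled 1" divides raw performance by clock frequency times core count, and "scaled 2" divides that again by power draw. The `Run` type, the function names, and every number below are hypothetical placeholders, not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    raw_perf: float   # raw benchmark score (arbitrary units)
    freq_ghz: float   # clock frequency in GHz
    cores: int        # number of cores used
    watts: float      # power draw in watts


def scaled_1(r: Run) -> float:
    """Assumed 'scaled 1': performance per GHz per core."""
    return r.raw_perf / (r.freq_ghz * r.cores)


def scaled_2(r: Run) -> float:
    """Assumed 'scaled 2': performance per GHz per core per watt."""
    return scaled_1(r) / r.watts


# Placeholder inputs for illustration only.
runs = [
    Run("system A", raw_perf=120.0, freq_ghz=3.0, cores=8, watts=100.0),
    Run("system B", raw_perf=90.0, freq_ghz=1.5, cores=24, watts=50.0),
]

for r in runs:
    print(f"{r.name}: scaled 1 = {scaled_1(r):.3f}, scaled 2 = {scaled_2(r):.4f}")
```

      Normalizing this way lets systems with very different clock rates, core counts, and power budgets be compared on one axis, which appears to be the purpose of the two scaled bars.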

  24. Example: AI on Arm
      Caffe@Container: ARM vs Intel vs GPU (scaled)
      [Chart: Intel, ARM, and GPU 1070 on CIFAR 10 - 1, CIFAR 10 - 2, CIFAR 10 - 3; y-axis 0 to 1.6.]

  25. Example: AI on Arm SoC
      [Chart, Training: Intel vs SoC for caffe, scaled caffe, darknet, scaled darknet; y-axis 0 to 16.]
      [Chart, Inference: Intel vs SoC for caffe, scaled caffe, darknet, scaled darknet; y-axis 0 to 9.]

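  26. Quantum Cloud Future (Beijing) Information Technology Co., Ltd. (hereafter Quantum Cloud) is a vertical-industry cloud computing company focused mainly on the film and television industry. Quantum Cloud concentrates on moving the film and television industry to the cloud. It works with internationally known film and television companies and visual-effects studios to provide customers in the industry with one-stop solutions covering production software, graphics workstations, high-performance storage, and rendering services.
      THANKS
      Email: info@lzyco.com
      Website: www.lzyco.com
      Phone: 010-53518265
      Address: Room 2101, Office Tower A, Sanlitun SOHO, 8 Gongti North Road, Chaoyang District, Beijing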
