TSUBAME3 and ABCI: Supercomputer Architectures for HPC and AI / BD Convergence
Satoshi Matsuoka
Professor, GSIC, Tokyo Institute of Technology / Director, AIST-Tokyo Tech. Big Data Open Innovation Lab / Fellow, Artificial Intelligence Research Center, AIST, Japan / Vis. Researcher, Advanced Institute for Computational Science, Riken
GTC2017 Presentation, 2017/05/09
Tremendous Recent Rise in Interest by the Japanese Government on Big Data, DL, AI, and IoT
• Three national centers on Big Data and AI launched by three competing Ministries for FY 2016 (Apr 2015-)
  – METI – AIRC (Artificial Intelligence Research Center): AIST (AIST internal budget + > $200 million FY 2017), April 2015
    • Broad AI/BD/IoT, industry focus
  – MEXT – AIP (Artificial Intelligence Platform): Riken and other institutions (~$50 mil), April 2016 (Vice Minister Tsuchiya@MEXT announcing AIP establishment)
    • Narrowly focused on DNN; a separate Post-K related AI funding as well
  – MOST – Universal Communication Lab: NICT ($50~55 mil)
    • Brain-related AI
• $1 billion commitment on inter-ministry AI research over 10 years
2015- AI Research Center (AIRC), AIST – Now > 400 FTEs
Director: Jun-ichi Tsujii; Core Center of AI for Industry-Academia Co-operation
• Effective cycles among research and deployment of AI: deployment of AI in real businesses and society – big sciences, security, manufacturing, health care, network services, industrial robots, bio-medical sciences, retailing, elderly care, material sciences, communication, automobile – with institutions, start-ups, and companies
• Technology transfer and joint research via standard tasks, standard data, a common AI platform, common modules, common data/models, and planning/business teams for starting enterprises
• AI research framework: NLP, NLU, text mining; behavior prediction, mining & modeling; planning, recommendation, control; image recognition, 3D object recognition, …
• Data-Knowledge integration AI (ontology, Bayesian nets, logic & probabilistic modeling) and Brain-Inspired AI (models of hippocampus, cerebral cortex, basal ganglia)
• Matsuoka: appointment as "Designated" Fellow since July 2017
Joint Lab established Feb. 2017 to pursue BD/AI joint research using large-scale HPC BD/AI infrastructure (Director: Satoshi Matsuoka)
• National Institute of Advanced Industrial Science and Technology (AIST), under the Ministry of Economy, Trade and Industry (METI) – Artificial Intelligence Research Center (AIRC): basic research in Big Data / AI – natural language processing, robotics, security
• Tokyo Institute of Technology / GSIC: TSUBAME 3.0/2.5 Big Data / AI resources; acceleration of AI / Big Data systems research; JST Big Data CREST, JST AI CREST, etc.
• Joint departments: research on AI / Big Data application areas; industrial collaboration in Big Data / AI applications – data, algorithms, and methodologies; proposals with other Big Data / AI research organizations and industry
• ABCI: AI Bridging Cloud Infrastructure
Characteristics of Big Data and AI Computing
Opposite ends of the HPC computing spectrum (HPC simulation apps can also be categorized likewise):
• As BD / AI: Dense LA – DNN inference, training, generation
  – As HPC task: dense matrices, reduced precision; dense and well-organized networks; acceleration, scaling
• As BD / AI: graph analytics (e.g. social networks); sort, hash (e.g. DB, log analysis); symbolic processing (traditional AI)
  – As HPC task: integer ops & sparse matrices; sparse and random data, low locality; acceleration, scaling
• Both ends need data movement, large memory, and data acceleration
⇒ Supercomputers adapted to AI/BD
(Big Data) BYTES capabilities, in bandwidth and capacity, are unilaterally important but often missing from modern HPC machines in their pursuit of FLOPS…
• Need BOTH bandwidth and capacity (BYTES) in an HPC-BD/AI machine
• Obvious for left-hand sparse, bandwidth-dominated apps
• But also for right-hand DNN: strong scaling to large networks and datasets, in particular for future 3D dataset analysis such as CT-scans, seismic simulation vs. analysis…; a proper architecture is needed to support large memory capacity, and network latency and BW are important
• Our measurement of the breakdown of one iteration of CaffeNet training on TSUBAME-KFC/DL (mini-batch size of 256): computation on GPUs occupies only 3.9% of the iteration time
(Figure sources: http://www.dgi.com/images/cvmain_overview/CV4DOverview_Model_001.jpg, https://www.spineuniverse.com/image-library/anterior-3d-ct-scan-progressive-kyphoscoliosis)
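The 3.9% figure above can be read through Amdahl's law: if GPU kernels are only 3.9% of an iteration, accelerating FLOPS alone barely helps, and data movement (BYTES) dominates. A minimal sketch of that arithmetic (the function name and the 10x scenario are illustrative, not from the slides):

```python
# Amdahl's-law reading of the CaffeNet measurement: GPU computation
# is 3.9% of one training iteration; the rest is data movement/communication.
compute_frac = 0.039

def iteration_speedup(gpu_speedup):
    # Only the GPU-compute fraction shrinks; data movement is untouched.
    return 1.0 / ((1 - compute_frac) + compute_frac / gpu_speedup)

print(round(iteration_speedup(10.0), 3))          # ~1.036x from 10x faster GPUs alone
print(round(iteration_speedup(float("inf")), 3))  # upper bound ~1.041x
```

Even infinitely faster GPUs would give at most ~1.04x per iteration, which is the slide's argument for a BYTES-centric design.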
The Current Status of AI & Big Data in Japan
We need the triad of advanced algorithms / infrastructure / data, but we lack the cutting-edge infrastructure dedicated to AI & Big Data (c.f. HPC)
• Algorithms & SW: AI venture startups; big companies' ML / Big Data R&D (also science); AIST-AIRC RWBC Open Innovation Lab (OIL, Director: Matsuoka); Riken-AIP; NICT-UCRI; AI/BD centers & labs in national labs & universities – all seeking innovative application of AI & data; over $1B govt. AI investment over 10 years
• Data: use of massive-scale data, now largely wasted – petabytes of drive-recording video, FA & robots, web access and merchandise, massive "Big" data in IoT, communication, location & other data
• Infrastructure: massive rise in computing requirements (1 AI-PF/person?); in HPC, "Big Data" dedicated SCs dominate and training is racing to exascale, but the cloud continues to be insufficient for cutting-edge research ⇒ AI & Data dedicated SCs
2017Q2 TSUBAME3.0: Leading Machine Towards Exa & Big Data
1. "Everybody's Supercomputer" – high performance (12~24 DP Petaflops, 125~325 TB/s memory, 55~185 Tbit/s NW), innovative high cost/performance packaging & design, in a mere 180m²…
2. "Extreme Green" – ~10 GFlops/W power-efficient architecture, system-wide power control, advanced cooling, future energy reservoir load leveling & energy recovery
3. "Big Data Convergence" – BYTES-centric architecture, extreme high BW & capacity, deep memory hierarchy, extreme I/O acceleration, Big Data SW stack for machine learning, graph processing, …
4. "Cloud SC" – dynamic deployment, container-based node co-location & dynamic configuration, resource elasticity, assimilation of public clouds…
5. "Transparency" – full monitoring & user visibility of machine & job state, accountability via reproducibility
Lineage: 2006 TSUBAME1.0 (80 Teraflops, #1 Asia, #7 World, "Everybody's Supercomputer") → 2010 TSUBAME2.0 (2.4 Petaflops, #4 World, "Greenest Production SC", large-scale simulation, 2011 ACM Gordon Bell Prize) → 2013 TSUBAME2.5 upgrade (5.7 PF DFP / 17.1 PF SFP, 20% power reduction) and TSUBAME-KFC (#1 Green500) → 2017 TSUBAME3.0+2.5 (~18 PF DFP, 4~5 PB/s mem BW, 10 GFlops/W power efficiency, Big Data & Cloud convergence, Big Data analytics, industrial apps)
TSUBAME-KFC/DL: TSUBAME3 Prototype [ICPADS2014]
Oil immersive cooling + hot water cooling + high-density packaging + fine-grained power monitoring and control; upgraded to /DL Oct. 2015
• High-temperature cooling: oil loop 35~45℃ ⇒ water loop 25~35℃ ⇒ to ambient air via cooling tower (water 25~35℃) (c.f. TSUBAME2: 7~17℃)
• Single-rack high-density oil immersion: 168 NVIDIA K80 GPUs + Xeon, 413+ TFlops (DFP), 1.5 PFlops (SFP), ~60KW/rack
• Container facility: 20-foot container (16m²), fully unmanned operation
• World #1 Green500, Nov. 2013 / Jun. 2014
Overview of TSUBAME3.0
BYTES-centric architecture, scalability to all 2160 GPUs, all nodes, the entire memory hierarchy; full operations Aug. 2017
• Full-bisection-bandwidth Intel Omni-Path interconnect, 4 ports/node: 432 Terabits/s bidirectional, ~2x the BW of the entire Internet backbone traffic
• DDN storage (Lustre FS 15.9PB + Home 45TB)
• 540 compute nodes: SGI ICE XA + new blade, Intel Xeon CPU x2 + NVIDIA Pascal GPU x4 (NVLink), 256GB memory, 2TB Intel NVMe SSD
• 47.2 AI-Petaflops, 12.1 Petaflops
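The 432 Tbit/s figure follows directly from the node and port counts on this slide. A back-of-envelope check, assuming the standard 100 Gbit/s Omni-Path link rate per port per direction (the per-port rate is an assumption consistent with the 800 Gbps/node figure on the next slide):

```python
# Sanity check of TSUBAME3.0's full-bisection interconnect bandwidth.
nodes = 540
ports_per_node = 4      # Intel Omni-Path ports per compute node
gbps_per_port = 100     # assumed Omni-Path link rate, Gbit/s per direction

injection = nodes * ports_per_node * gbps_per_port  # unidirectional, Gbit/s
bidirectional = 2 * injection

print(bidirectional / 1000)  # Terabits/s -> 432.0, matching the slide
```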
TSUBAME3: A Massively BYTES-Centric Architecture for Converged BD/AI and HPC
• Intra-node GPU via NVLink: 20~40GB/s
• Terabit-class network per node: 800Gbps (400+400), full bisection
• Inter-node GPU via Omni-Path: 12.5GB/s, fully switched
• HBM2: 64GB, 2.5TB/s
• DDR4: 256GB, 150GB/s
• Intel Optane (planned): 1.5TB, 12GB/s
• NVMe Flash: 2TB, 3GB/s; 16GB/s PCIe, fully switched
• Any "Big" data in the system can be moved anywhere via RDMA at 12.5 GBytes/s minimum, also with stream processing
• Scalable to all 2160 GPUs, not just 8
• ~4 Terabytes/node of hierarchical memory for Big Data / AI (c.f. K computer: 16GB/node)
• Over 2 Petabytes in TSUBAME3, which can be moved at 54 Terabytes/s, or 1.7 Zettabytes / year
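The aggregate figures on this slide can be sanity-checked from the per-node numbers. A hedged sketch (the per-node capacity breakdown summed below is my reading of the hierarchy listed above, not an official decomposition; the 54 TB/s aggregate bandwidth is taken from the slide as given):

```python
# Sanity check of TSUBAME3's aggregate hierarchical-memory figures.
nodes = 540
per_node_tb = 2.0 + 1.5 + 0.256 + 0.064  # NVMe + Optane + DDR4 + HBM2, TB

total_pb = nodes * per_node_tb / 1000
print(total_pb)  # ~2.06 PB -> "Over 2 Petabytes"

seconds_per_year = 365 * 24 * 3600
zb_per_year = 54e12 * seconds_per_year / 1e21  # 54 TB/s sustained for a year
print(round(zb_per_year, 2))  # 1.7 -> "1.7 Zettabytes / year"
```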