Extreme-Scale HPC Network Analysis using Discrete-Event Simulation
Noah Wolfe 1, Misbah Mubarak 2, Nikhil Jain 3, Jens Domke 4, Abhinav Bhatele 3, Christopher D. Carothers 1, Robert B. Ross 2
1 Rensselaer Polytechnic Institute, 2 Argonne National Laboratory, 3 Lawrence Livermore National Laboratory, 4 Technische Universität Dresden, Dresden, Germany
Outline
• Introduction / Motivation
• Background: ROSS, CODES, DUMPI
• Slim Fly Network Model Design
• Verification
• Network visualization
• Large-scale configuration
• Single-job application trace performance
• PDES performance
• Summary
Motivation
Fig. Intel Xeon Phi MIC, IBM Power9 CPU, NVIDIA Volta GPU [1][2][3]
Image sources:
[1] http://images.anandtech.com/doci/8727/Processors.jpg
[2] http://images.fastcompany.com/upload/Aubrey_Isle_die.jpg
[3] http://www.anandtech.com/show/9151/intel-xeon-phi-cray-energy-supercomputers
Network Design Problem
• Variables
  • Topology
  • Planes/Rails
  • Technology (link speed, switch radix)
  • Job Allocation
  • Routing
  • Communication Patterns
• Questions
  • Bandwidth?
  • Latency?
  • Job Interference?
• Answer: Simulation!
Image source: http://farm7.static.flickr.com/6114/6322893212_22f00600a1_b.jpg
Approach
• DUMPI — DOE Application Traces: AMG, Crystal Router, MultiGrid
• CODES — MPI Simulation Layer + Network Models: Slim Fly, Fat Tree
• ROSS — PDES Framework: Sequential, Conservative, Optimistic
Background — ROSS
• Rensselaer Optimistic Simulation System (ROSS)
• Provides the parallel discrete-event simulation platform for CODES simulations
• Supports optimistic event scheduling using reverse computation (a conceptual sketch follows)
• Demonstrated super-linear speedup (97x speedup for 60x more hardware) and is capable of processing 500 billion events per second with over 250 million LPs on 120 racks of the Sequoia supercomputer at LLNL
Fig. Event rate (events/second) vs. number of Blue Gene/Q racks: actual Sequoia performance vs. linear performance (2 racks as base)
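ROSS is a C library; the snippet below is not the ROSS API but a minimal, self-contained illustration of the reverse-computation idea it relies on: the forward event handler records in the event exactly what it changed, and the reverse handler undoes it when an optimistic rollback occurs. All names (lp_state, event, the threshold of 4) are hypothetical.

```c
#include <stdio.h>

/* Hypothetical LP state and event types (not the actual ROSS API). */
typedef struct { long packets_routed; int queue_len; } lp_state;
typedef struct { int queued; /* set by the forward handler so it can be undone */ } event;

/* Forward handler: apply the event and remember what was done. */
void handle_event(lp_state *s, event *e) {
    s->packets_routed++;
    e->queued = (s->queue_len < 4);   /* record the decision taken */
    if (e->queued) s->queue_len++;
}

/* Reverse handler: undo the forward handler exactly, using the record. */
void handle_event_rc(lp_state *s, event *e) {
    if (e->queued) s->queue_len--;
    s->packets_routed--;
}

int main(void) {
    lp_state s = {0, 0};
    event e = {0};
    handle_event(&s, &e);      /* speculative (optimistic) execution */
    handle_event_rc(&s, &e);   /* rollback restores the original state */
    printf("packets=%ld queue=%d\n", s.packets_routed, s.queue_len); /* prints 0 0 */
    return 0;
}
```

Because rollbacks only require replaying such reverse handlers rather than saving full state, optimistic scheduling can scale to the event rates shown on this slide.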
Background — CODES
• CO-Design of multi-layer Exascale Storage systems (CODES)
• Framework for exploring design of HPC interconnects, storage systems, and workloads
• High-fidelity packet-level models of HPC interconnect topologies
• Synthetic or application-trace network and I/O workloads
*Source: "Quantifying I/O and communication traffic interference on burst-buffer enabled dragonfly networks," (submitted Cluster 17)
Background — DUMPI
• DUMPI — The MPI Trace Library
• Provides libraries to collect and read traces of MPI applications
• DOE Design Forward Traces
  • Variety of communication patterns and intensities of applications at scale
  • AMG
  • Crystal Router
  • Multigrid
Fig. Distribution of MPI communication for AMG trace (Source: nersc.gov)
HPC Systems
• Example: Summit Supercomputer (IBM Power9 CPU + NVIDIA Volta GPU)
  • Compute Nodes: ~4,600
  • CPU Cores: ~220,000
  • Routers: ~500
  • Cables: ~12,000
HPC Components: Routers/Switches, Compute Nodes, MPI Processes
Discrete-Event Simulation Components
• LPs: Logical Processes
• PEs: Processing Elements
• Events: Time-stamped elements of work
Discrete-Event Mapping
• Rensselaer Optimistic Simulation System (ROSS)
• Logical Processes (LPs):
  • MPI processes (virtual)
  • Compute nodes
  • Network switches
• Processing Elements (PEs):
  • MPI process (physical)
• Events:
  • Network messages
• Event Scheduling:
  • Sequential
  • Conservative
  • Optimistic
Fig. LP types (network switch, compute node, virtual MPI process) mapped onto PEs (physical MPI processes); a minimal sketch of such a mapping follows.
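A hypothetical sketch of the mapping described above: every simulated entity (switch, compute node, virtual MPI process) gets a global LP id, and LP ids are placed round-robin onto the physical MPI ranks (PEs). The counts and the round-robin policy are illustrative assumptions, not the CODES mapping code.

```c
#include <stdio.h>

enum lp_type { LP_SWITCH, LP_COMPUTE_NODE, LP_MPI_PROC };

/* Hypothetical system: a fixed number of switches, several compute nodes
 * per switch, and one virtual MPI process per compute node. */
#define NUM_SWITCHES   338
#define NODES_PER_SW   9
#define NUM_NODES      (NUM_SWITCHES * NODES_PER_SW)
#define NUM_LPS        (NUM_SWITCHES + 2 * NUM_NODES)  /* switches + nodes + MPI procs */
#define NUM_PES        64                              /* physical MPI ranks */

/* Partition the global LP id space by entity type. */
enum lp_type lp_type_of(long gid) {
    if (gid < NUM_SWITCHES) return LP_SWITCH;
    if (gid < NUM_SWITCHES + NUM_NODES) return LP_COMPUTE_NODE;
    return LP_MPI_PROC;
}

/* Round-robin placement of LPs onto PEs. */
int pe_of(long gid) { return (int)(gid % NUM_PES); }

int main(void) {
    long gid = 1000;
    printf("total LPs=%d, LP %ld: type=%d pe=%d\n",
           NUM_LPS, gid, lp_type_of(gid), pe_of(gid));
    return 0;
}
```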
Discrete-Event Implementation
Slim Fly Network Model
Slim Fly — Design
• Description:
  • Built on MMS graphs
  • Max network diameter of 2
  • Uses high-radix routers
  • Complex layout and connectivity
• Network Parameters (see the parameter sketch below):
  • q: number of routers per group and number of global connections per router
  • p: number of terminal connections per router, p = floor(k'/2)
  • k: router radix
  • k': router network radix
Fig. Slim Fly with q=5
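The parameters above all follow from q. A small sketch, assuming the standard MMS-graph relations used by Slim Fly (q = 4w + delta with delta in {-1, 0, +1}, network radix k' = (3q - delta)/2, 2q^2 routers); the printed values reproduce the q=13 verification configuration and the q=37 Aurora-scale configuration on later slides.

```c
#include <stdio.h>

/* Derive Slim Fly (MMS graph) sizes from q. */
void slimfly_params(int q) {
    int delta   = (q % 4 == 1) ? 1 : (q % 4 == 3 ? -1 : 0);
    int kprime  = (3 * q - delta) / 2;   /* network (router-to-router) radix k' */
    int p       = kprime / 2;            /* p = floor(k'/2) terminals per router */
    int routers = 2 * q * q;             /* two subgraphs of q*q routers each   */
    int k       = kprime + p;            /* total router radix                  */
    printf("q=%d: k'=%d p=%d k=%d routers=%d nodes=%d\n",
           q, kprime, p, k, routers, routers * p);
}

int main(void) {
    slimfly_params(13);  /* 338 routers, 3,042 nodes (verification config)   */
    slimfly_params(37);  /* 2,738 routers, ~74K nodes (Aurora-scale config)  */
    return 0;
}
```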
Slim Fly — Network Model
Routing:
• Minimal routing: max 2 hops
• Non-minimal routing: messages routed minimally to a random intermediate router and then minimally to the destination
• Adaptive routing: chooses minimal or non-minimal based on congestion at the source router (see the sketch below)
Synthetic Workloads:
• Uniform Random (UR)
• Worst-Case (WC)
Additional Components:
• Credit-based flow control
• Virtual Channels (VCs) to avoid routing deadlocks
  • Minimal routing: 2 VCs per port
  • Non-minimal routing: 4 VCs per port
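A simplified sketch of the adaptive decision described above: compare queue occupancy of the minimal output port against the non-minimal one at the source router, weighted by hop count. This is only an illustration of the idea (UGAL-style selection); the thresholds and weighting are assumptions, not the model's actual logic.

```c
#include <stdbool.h>
#include <stdio.h>

/* Decide between minimal and non-minimal paths at the source router.
 * Occupancies are in flits/credits; weights reflect path length. */
bool use_minimal_path(int min_queue_occupancy, int nonmin_queue_occupancy) {
    const int MIN_HOPS    = 2;   /* Slim Fly minimal paths take at most 2 hops      */
    const int NONMIN_HOPS = 4;   /* via a random intermediate router: up to 4 hops  */
    return min_queue_occupancy * MIN_HOPS <= nonmin_queue_occupancy * NONMIN_HOPS;
}

int main(void) {
    printf("lightly loaded: %s\n", use_minimal_path(10, 12) ? "minimal" : "non-minimal");
    printf("congested:      %s\n", use_minimal_path(90, 12) ? "minimal" : "non-minimal");
    return 0;
}
```

The two VC counts on this slide follow from the same hop counts: each hop of a non-minimal path needs its own virtual channel to break routing cycles.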
Slim Fly — Verification
• Setup: comparison with published Slim Fly network results by Kathareios et al. [1]
• Slim Fly Configuration:
  • 3,042 nodes
  • 338 routers
  • q=13
  • k=28
• Simulation Configuration:
  • link bandwidth: 12.5 GB/s
  • link latency: 50 ns
  • buffer space: 100 KB
  • router delay: 100 ns
  • simulated time: 220 us
Figs. Verification results for Minimal, Non-Minimal, and Adaptive routing
[1] G. Kathareios, C. Minkenberg, B. Prisacari, G. Rodriguez, and T. Hoefler. Cost-Effective Diameter-Two Topologies: Analysis and Evaluation. IEEE/ACM SC15, Nov. 2015.
Slim Fly — Network Visualization
• Uniform random traffic with minimal routing
Fig. Virtual channel occupancy for all router ports in the network
Slim Fly — Network Visualization
Send/Receive Performance:
• Uniform random traffic
• Minimal routing
• 100% bandwidth injection load
Fig. Number of sends and receives sampled over the simulation
Slim Fly — Large-Scale
• 74K Node (Aurora) System:
  • 2,738 routers
  • q=37, k=82, p=floor(k'/2)=27
• 1M Node System:
  • 53,178 routers
  • q=163, k=264, k'=244, ideal_p=122, actual_p=19
• Simulation Configuration:
  • link bandwidth: 12.5 GB/s
  • link latency: 50 ns
  • buffer space: 100 KB
  • router delay: 100 ns
  • simulated time: 200 us
Application Traces
• Crystal Router:
  • Description: mini-app for the highly scalable Nek5000 spectral element code developed at ANL [3].
  • Communication Pattern: large synchronized messages following an n-dimensional hypercube (many-to-many)
  • Communication Time: 68.5% of runtime
  • Trace size: 1,000 MPI processes
• Multigrid:
  • Description: implements a single production cycle of the linear solver used in BoxLib [1], an adaptive mesh refinement code.
  • Communication Pattern: bursty periods of small messages along diagonals (many-to-many)
  • Communication Time: 5% of runtime
  • Trace size: 10,648 MPI processes
[1] Department of Energy, "AMR BoxLib." [Online]. Available: https://ccse.lbl.gov/BoxLib/
[2] Co-design at Lawrence Livermore National Laboratory, "Algebraic Multigrid Solver (AMG)." (Accessed: Apr. 19, 2015). [Online]. Available: https://codesign.llnl.gov/amg2013.php
[3] J. Shin et al., "Speeding up Nek5000 with autotuning and specialization," in Proceedings of the 24th ACM International Conference on Supercomputing. ACM, 2010.
Crystal Router (CR) Simulation:
• Workload: 1,000 MPI ranks
• End time: 290 us
Multigrid (MG) Simulation:
• Workload: 10,648 MPI ranks
• End time: 290 us
Application Traces Summary

Application | MPI Ranks | Virtual End Time (ns) | Recvs | Data Received | Waits | Wait Alls | Avg Msg Size
CR          | 1,000     | 750,866               | 263K  | 724 MB        | 263K  | 263K      | 2,890 B
MG          | 10,648    | 44,798,942            | 2.6M  | 18.1 GB       | 0     | 0         | 7,480 B

• Summary:
  • CR: small quantity of medium-sized messages; synchronization after each message transfer
  • MG: large quantity of large messages; no synchronization
Slim Fly — Crystal Router
Figs. (a) Simulation End Time, (b) Packet Hops, (c) Packet Latency, (d) Network Congestion
Slim Fly — Multigrid
Figs. (a) Simulation End Time, (b) Packet Hops, (c) Packet Latency, (d) Network Congestion
Slim Fly — PDES Scaling
• 74K Node System: 43M events/s, 543M events processed
• 1M Node System: 36M events/s, 7B events processed
• Optimistic: ideal scaling, >95% efficiency
• Conservative: little to no scaling
Slim Fly — PDES Analysis
• Optimistic: distribution of simulation time scales linearly; uniform distribution of work among PEs
• Conservative: distribution of simulation time scales linearly; uniform distribution of work among PEs
Slim Fly — PDES Analysis
• Memory Consumption: physical amount of system memory required to initialize the model and run the simulation
• Time Slowdown: a measure of how much slower the simulation is compared to the real-world experiment being modeled (see the sketch below)
Figs. (a) Memory Consumption, (b) Time Slowdown
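Both metrics can be computed directly from a run. A tiny sketch, assuming slowdown is wall-clock run time divided by simulated virtual time (the definition above); all numbers below are illustrative placeholders, not measured results.

```c
#include <stdio.h>

int main(void) {
    /* Illustrative values only, not results from the experiments above. */
    double wall_clock_s   = 1800.0;   /* time to run the simulation          */
    double virtual_time_s = 200e-6;   /* simulated time, e.g. 200 us         */
    double mem_per_pe_gb  = 2.0;      /* resident memory per PE              */
    int    num_pes        = 64;

    printf("time slowdown : %.2e x\n", wall_clock_s / virtual_time_s);
    printf("total memory  : %.1f GB\n", mem_per_pe_gb * num_pes);
    return 0;
}
```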
Summary
• Slim Fly network model: a new parallel discrete-event Slim Fly network model capable of providing insight into network behavior at scale
• Verification: verified the accuracy of the Slim Fly model against published results
• Network performance analysis: performed detailed analysis of the Slim Fly model in response to single-job executions of application communication traces, showing preferred routing algorithms
• PDES analysis: conducted strong scaling and discrete-event simulation analysis showing the efficiency and scalability of the network model under both conservative and optimistic event scheduling
• Overall: utilizing the discrete-event simulation approach for large-scale HPC system simulations results in an effective tool for analysis and co-design