Extreme-Scale HPC Network Analysis using Discrete-Event Simulation
Noah Wolfe 1, Misbah Mubarak 2, Nikhil Jain 3, Jens Domke 4, Abhinav Bhatele 3, Christopher D. Carothers 1, Robert B. Ross 2
1 Rensselaer Polytechnic Institute, 2 Argonne National Laboratory, 3 Lawrence Livermore National Laboratory, 4 Technische Universität Dresden, Dresden, Germany
Outline
• Introduction / Motivation
• Background: ROSS, CODES, DUMPI
• Slim Fly Network Model Design
• Verification
• Network visualization
• Large-scale configuration
• Single-job application trace performance
• PDES performance
• Summary
Motivation
Fig. Intel Xeon Phi MIC, IBM Power9 CPU, NVIDIA Volta GPU [1][2][3]
Image sources:
[1] http://images.anandtech.com/doci/8727/Processors.jpg
[2] http://images.fastcompany.com/upload/Aubrey_Isle_die.jpg
[3] http://www.anandtech.com/show/9151/intel-xeon-phi-cray-energy-supercomputers
Network Design Problem
• Variables
  • Topology
  • Planes/Rails
  • Technology (link speed, switch radix)
  • Job Allocation
  • Routing
  • Communication Patterns
• Questions
  • Bandwidth?
  • Latency?
  • Job Interference?
• Answer: Simulation!
Image source: http://farm7.static.flickr.com/6114/6322893212_22f00600a1_b.jpg
Approach
• DUMPI — DOE Application Traces: AMG, Crystal Router, MultiGrid
• CODES — MPI Simulation Layer + Network Models: Slim Fly, Fat Tree
• ROSS — PDES Framework: Sequential, Conservative, Optimistic
Background — ROSS
• Rensselaer Optimistic Simulation System (ROSS)
• Provides the parallel discrete-event simulation platform for CODES simulations
• Supports optimistic event scheduling using reverse computation (a conceptual sketch follows)
• Demonstrated super-linear speedup (97x speedup for 60x more hardware) and is capable of processing 500 billion events per second with over 250 million LPs on 120 racks of the Sequoia supercomputer at LLNL
Fig. Event rate (events/second) vs. number of Blue Gene/Q racks: actual Sequoia performance vs. linear performance (2 racks as base)
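ROSS is a C library; the snippet below is not the ROSS API but a minimal, self-contained illustration of the reverse-computation idea it relies on: the forward event handler records in the event exactly what it changed, and the reverse handler undoes it when an optimistic rollback occurs. All names (lp_state, event, the threshold of 4) are hypothetical.

```c
#include <stdio.h>

/* Hypothetical LP state and event types (not the actual ROSS API). */
typedef struct { long packets_routed; int queue_len; } lp_state;
typedef struct { int queued; /* set by the forward handler so it can be undone */ } event;

/* Forward handler: apply the event and remember what was done. */
void handle_event(lp_state *s, event *e) {
    s->packets_routed++;
    e->queued = (s->queue_len < 4);   /* record the decision taken */
    if (e->queued) s->queue_len++;
}

/* Reverse handler: undo the forward handler exactly, using the record. */
void handle_event_rc(lp_state *s, event *e) {
    if (e->queued) s->queue_len--;
    s->packets_routed--;
}

int main(void) {
    lp_state s = {0, 0};
    event e = {0};
    handle_event(&s, &e);      /* speculative (optimistic) execution */
    handle_event_rc(&s, &e);   /* rollback restores the original state */
    printf("packets=%ld queue=%d\n", s.packets_routed, s.queue_len); /* prints 0 0 */
    return 0;
}
```

Because rollbacks only require replaying such reverse handlers rather than saving full state, optimistic scheduling can scale to the event rates shown on this slide.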
Background — CODES
• CO-Design of multi-layer Exascale Storage systems (CODES)
• Framework for exploring design of HPC interconnects, storage systems, and workloads
• High-fidelity packet-level models of HPC interconnect topologies
• Synthetic or application-trace network and I/O workloads
*Source: "Quantifying I/O and communication traffic interference on burst-buffer enabled dragonfly networks," (submitted Cluster 17)
Background — DUMPI
• DUMPI — The MPI Trace Library
• Provides libraries to collect and read traces of MPI applications
• DOE Design Forward Traces
  • Variety of communication patterns and intensities of applications at scale
  • AMG
  • Crystal Router
  • Multigrid
Fig. Distribution of MPI communication for AMG trace (Source: nersc.gov)
HPC Systems
• Example: Summit Supercomputer (IBM Power9 CPU + NVIDIA Volta GPU)
  • Compute Nodes: ~4,600
  • CPU Cores: ~220,000
  • Routers: ~500
  • Cables: ~12,000
HPC Components: Routers/Switches, Compute Nodes, MPI Processes
Discrete-Event Simulation Components
• LPs: Logical Processes
• PEs: Processing Elements
• Events: Time-stamped elements of work
Discrete-Event Mapping
• Rensselaer Optimistic Simulation System (ROSS)
• Logical Processes (LPs):
  • MPI processes (virtual)
  • Compute nodes
  • Network switches
• Processing Elements (PEs):
  • MPI process (physical)
• Events:
  • Network messages
• Event Scheduling:
  • Sequential
  • Conservative
  • Optimistic
Fig. LP types (network switch, compute node, virtual MPI process) mapped onto PEs (physical MPI processes); a minimal sketch of such a mapping follows.
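A hypothetical sketch of the mapping described above: every simulated entity (switch, compute node, virtual MPI process) gets a global LP id, and LP ids are placed round-robin onto the physical MPI ranks (PEs). The counts and the round-robin policy are illustrative assumptions, not the CODES mapping code.

```c
#include <stdio.h>

enum lp_type { LP_SWITCH, LP_COMPUTE_NODE, LP_MPI_PROC };

/* Hypothetical system: a fixed number of switches, several compute nodes
 * per switch, and one virtual MPI process per compute node. */
#define NUM_SWITCHES   338
#define NODES_PER_SW   9
#define NUM_NODES      (NUM_SWITCHES * NODES_PER_SW)
#define NUM_LPS        (NUM_SWITCHES + 2 * NUM_NODES)  /* switches + nodes + MPI procs */
#define NUM_PES        64                              /* physical MPI ranks */

/* Partition the global LP id space by entity type. */
enum lp_type lp_type_of(long gid) {
    if (gid < NUM_SWITCHES) return LP_SWITCH;
    if (gid < NUM_SWITCHES + NUM_NODES) return LP_COMPUTE_NODE;
    return LP_MPI_PROC;
}

/* Round-robin placement of LPs onto PEs. */
int pe_of(long gid) { return (int)(gid % NUM_PES); }

int main(void) {
    long gid = 1000;
    printf("total LPs=%d, LP %ld: type=%d pe=%d\n",
           NUM_LPS, gid, lp_type_of(gid), pe_of(gid));
    return 0;
}
```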
Discrete-Event Implementation
Slim Fly Network Model
Slim Fly — Design
• Description:
  • Built on MMS graphs
  • Max network diameter of 2
  • Uses high-radix routers
  • Complex layout and connectivity
• Network Parameters (see the parameter sketch below):
  • q: number of routers per group and number of global connections per router
  • p: number of terminal connections per router, p = floor(k'/2)
  • k: router radix
  • k': router network radix
Fig. Slim Fly with q=5
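The parameters above all follow from q. A small sketch, assuming the standard MMS-graph relations used by Slim Fly (q = 4w + delta with delta in {-1, 0, +1}, network radix k' = (3q - delta)/2, 2q^2 routers); the printed values reproduce the q=13 verification configuration and the q=37 Aurora-scale configuration on later slides.

```c
#include <stdio.h>

/* Derive Slim Fly (MMS graph) sizes from q. */
void slimfly_params(int q) {
    int delta   = (q % 4 == 1) ? 1 : (q % 4 == 3 ? -1 : 0);
    int kprime  = (3 * q - delta) / 2;   /* network (router-to-router) radix k' */
    int p       = kprime / 2;            /* p = floor(k'/2) terminals per router */
    int routers = 2 * q * q;             /* two subgraphs of q*q routers each   */
    int k       = kprime + p;            /* total router radix                  */
    printf("q=%d: k'=%d p=%d k=%d routers=%d nodes=%d\n",
           q, kprime, p, k, routers, routers * p);
}

int main(void) {
    slimfly_params(13);  /* 338 routers, 3,042 nodes (verification config)   */
    slimfly_params(37);  /* 2,738 routers, ~74K nodes (Aurora-scale config)  */
    return 0;
}
```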
Slim Fly — Network Model
Routing:
• Minimal routing: max 2 hops
• Non-minimal routing: messages routed minimally to a random intermediate router and then minimally to the destination
• Adaptive routing: chooses minimal or non-minimal based on congestion at the source router (see the sketch below)
Synthetic Workloads:
• Uniform Random (UR)
• Worst-Case (WC)
Additional Components:
• Credit-based flow control
• Virtual Channels (VCs) to avoid routing deadlocks
  • Minimal routing: 2 VCs per port
  • Non-minimal routing: 4 VCs per port
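A simplified sketch of the adaptive decision described above: compare queue occupancy of the minimal output port against the non-minimal one at the source router, weighted by hop count. This is only an illustration of the idea (UGAL-style selection); the thresholds and weighting are assumptions, not the model's actual logic.

```c
#include <stdbool.h>
#include <stdio.h>

/* Decide between minimal and non-minimal paths at the source router.
 * Occupancies are in flits/credits; weights reflect path length. */
bool use_minimal_path(int min_queue_occupancy, int nonmin_queue_occupancy) {
    const int MIN_HOPS    = 2;   /* Slim Fly minimal paths take at most 2 hops      */
    const int NONMIN_HOPS = 4;   /* via a random intermediate router: up to 4 hops  */
    return min_queue_occupancy * MIN_HOPS <= nonmin_queue_occupancy * NONMIN_HOPS;
}

int main(void) {
    printf("lightly loaded: %s\n", use_minimal_path(10, 12) ? "minimal" : "non-minimal");
    printf("congested:      %s\n", use_minimal_path(90, 12) ? "minimal" : "non-minimal");
    return 0;
}
```

The two VC counts on this slide follow from the same hop counts: each hop of a non-minimal path needs its own virtual channel to break routing cycles.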
Slim Fly — Verification
• Setup: comparison with published Slim Fly network results by Kathareios et al. [1]
• Slim Fly Configuration:
  • 3,042 nodes
  • 338 routers
  • q=13
  • k=28
• Simulation Configuration:
  • link bandwidth: 12.5 GB/s
  • link latency: 50 ns
  • buffer space: 100 KB
  • router delay: 100 ns
  • simulated time: 220 us
Figs. Verification results for Minimal, Non-Minimal, and Adaptive routing
[1] G. Kathareios, C. Minkenberg, B. Prisacari, G. Rodriguez, and T. Hoefler. Cost-Effective Diameter-Two Topologies: Analysis and Evaluation. IEEE/ACM SC15, Nov. 2015.
Slim Fly — Network Visualization
• Uniform random traffic with minimal routing
Fig. Virtual channel occupancy for all router ports in the network
Slim Fly — Network Visualization
Send/Receive Performance:
• Uniform random traffic
• Minimal routing
• 100% bandwidth injection load
Fig. Number of sends and receives sampled over the simulation
Slim Fly — Large-Scale
• 74K Node (Aurora) System:
  • 2,738 routers
  • q=37, k=82, p=floor(k'/2)=27
• 1M Node System:
  • 53,178 routers
  • q=163, k=264, k'=244, ideal_p=122, actual_p=19
• Simulation Configuration:
  • link bandwidth: 12.5 GB/s
  • link latency: 50 ns
  • buffer space: 100 KB
  • router delay: 100 ns
  • simulated time: 200 us
Application Traces
• Crystal Router:
  • Description: mini-app for the highly scalable Nek5000 spectral element code developed at ANL [3].
  • Communication Pattern: large synchronized messages following an n-dimensional hypercube (many-to-many)
  • Communication Time: 68.5% of runtime
  • Trace size: 1,000 MPI processes
• Multigrid:
  • Description: implements a single production cycle of the linear solver used in BoxLib [1], an adaptive mesh refinement code.
  • Communication Pattern: bursty periods of small messages along diagonals (many-to-many)
  • Communication Time: 5% of runtime
  • Trace size: 10,648 MPI processes
[1] Department of Energy, "AMR BoxLib." [Online]. Available: https://ccse.lbl.gov/BoxLib/
[2] Co-design at Lawrence Livermore National Laboratory, "Algebraic Multigrid Solver (AMG)." (Accessed: Apr. 19, 2015). [Online]. Available: https://codesign.llnl.gov/amg2013.php
[3] J. Shin et al., "Speeding up Nek5000 with autotuning and specialization," in Proceedings of the 24th ACM International Conference on Supercomputing. ACM, 2010.
Crystal Router (CR) Simulation:
• Workload: 1,000 MPI ranks
• End time: 290 us
Multigrid (MG) Simulation:
• Workload: 10,648 MPI ranks
• End time: 290 us
Application Traces Summary

Application | MPI Ranks | Virtual End Time (ns) | Recvs | Data Received | Waits | Wait Alls | Avg Msg Size
CR          | 1,000     | 750,866               | 263K  | 724 MB        | 263K  | 263K      | 2,890 B
MG          | 10,648    | 44,798,942            | 2.6M  | 18.1 GB       | 0     | 0         | 7,480 B

• Summary:
  • CR: small quantity of medium-sized messages; synchronization after each message transfer
  • MG: large quantity of large messages; no synchronization
Slim Fly — Crystal Router
Figs. (a) Simulation End Time, (b) Packet Hops, (c) Packet Latency, (d) Network Congestion
Slim Fly — Multigrid
Figs. (a) Simulation End Time, (b) Packet Hops, (c) Packet Latency, (d) Network Congestion
Slim Fly — PDES Scaling
• 74K Node System: 43M events/s, 543M events processed
• 1M Node System: 36M events/s, 7B events processed
• Optimistic: ideal scaling, >95% efficiency
• Conservative: little to no scaling
Slim Fly — PDES Analysis
• Optimistic: distribution of simulation time scales linearly; uniform distribution of work among PEs
• Conservative: distribution of simulation time scales linearly; uniform distribution of work among PEs
Slim Fly — PDES Analysis
• Memory Consumption: physical amount of system memory required to initialize the model and run the simulation
• Time Slowdown: a measure of how much slower the simulation is compared to the real-world experiment being modeled (see the sketch below)
Figs. (a) Memory Consumption, (b) Time Slowdown
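Both metrics can be computed directly from a run. A tiny sketch, assuming slowdown is wall-clock run time divided by simulated virtual time (the definition above); all numbers below are illustrative placeholders, not measured results.

```c
#include <stdio.h>

int main(void) {
    /* Illustrative values only, not results from the experiments above. */
    double wall_clock_s   = 1800.0;   /* time to run the simulation          */
    double virtual_time_s = 200e-6;   /* simulated time, e.g. 200 us         */
    double mem_per_pe_gb  = 2.0;      /* resident memory per PE              */
    int    num_pes        = 64;

    printf("time slowdown : %.2e x\n", wall_clock_s / virtual_time_s);
    printf("total memory  : %.1f GB\n", mem_per_pe_gb * num_pes);
    return 0;
}
```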
Summary
• Slim Fly network model: a new parallel discrete-event Slim Fly network model capable of providing insight into network behavior at scale
• Verification: verified the accuracy of the Slim Fly model against published results
• Network performance analysis: performed detailed analysis of the Slim Fly model in response to single-job executions of application communication traces, showing preferred routing algorithms
• PDES analysis: conducted strong scaling and discrete-event simulation analysis showing the efficiency and scalability of the network model under both conservative and optimistic event scheduling
• Overall: utilizing the discrete-event simulation approach for large-scale HPC system simulations results in an effective tool for analysis and co-design