Understanding and Optimizing Communication Performance on HPC Networks

  1. Understanding and Optimizing Communication Performance on HPC Networks
  Contributors: Nikhil Jain, Abhinav Bhatele, Todd Gamblin, Xiang Ni, Michael Robson, Bilge Acun, Laxmikant Kale
  University of Illinois at Urbana-Champaign
  http://charm.cs.illinois.edu/~nikhil/

  2. Communication in HPC
  • A necessity, but can be viewed as an overhead
  • Can consume half the execution time
  [Chart: time spent in communication (%) vs. core count (up to 70,000 cores) for OpenAtom, PF3D, NAMD, EpiSimdemics, MILC, and ClothSim]

  3. Communication in HPC
  Complex interplay of several components: hardware, configurable network properties, interaction patterns, algorithms…
  As a user, limited control over environment and interference
  As an admin, how to best use the system while keeping users happy

  4. Communication in HPC
  Complex interplay of several components: hardware, configurable network properties, interaction patterns, algorithms…
  As a user, limited control over environment and interference
  As an admin, how to best use the system while keeping users happy
  [Figure: diverse apps (MILC, OpenAtom) running on many systems (Dragonfly, Torus)]

  5. Topology-Aware Mapping
  • Profile applications for their communication graphs and map them
  • Extremely important for torus-based systems; ongoing work on other topologies

  6. Topology-Aware Mapping
  • Profile applications for their communication graphs and map them
  • Extremely important for torus-based systems; ongoing work on other topologies
  • Use case: OpenAtom
  [Charts: OpenAtom time per step (s) vs. number of nodes (each node is 64 threads), default vs. topology-aware, 256 to 1024 nodes; and scaling for MOF on Vulcan, Min-Def/Min-Topo and BOMD-Def/BOMD-Topo, 256 to 2048 nodes]
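To make the mapping idea concrete, here is a minimal sketch (hypothetical code, not OpenAtom's actual mapper) that scores a rank-to-node mapping by the communication volume it moves across torus hops; topology-aware mappings win by lowering this score.

```python
# Illustrative scoring of a mapping: total communication volume
# weighted by torus hop distance. Lower is better.

def torus_hops(a, b, dims):
    """Shortest hop count between coordinates a and b on a wraparound torus."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def weighted_hops(comm_graph, mapping, dims):
    """comm_graph: {(src, dst): bytes}; mapping: rank -> torus coordinate."""
    return sum(vol * torus_hops(mapping[s], mapping[d], dims)
               for (s, d), vol in comm_graph.items())

# Toy instance: a 4-rank ring on a 4x4x4 torus, scattered vs. compact placement.
dims = (4, 4, 4)
comm = {(0, 1): 100, (1, 2): 100, (2, 3): 100, (3, 0): 100}
default = {0: (0, 0, 0), 1: (3, 3, 3), 2: (0, 3, 0), 3: (3, 0, 3)}
topo = {0: (0, 0, 0), 1: (1, 0, 0), 2: (2, 0, 0), 3: (3, 0, 0)}
print(weighted_hops(comm, default, dims),  # scattered: more link hops
      weighted_hops(comm, topo, dims))     # compact: fewer link hops
```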

  7. Rubik: a Python-based tool to create maps
  [Figure: map() takes the application communication structure and the 3D torus network, and produces application ranks mapped onto the 3D torus]

  8. Rubik: a Python-based tool to create maps
  [Figure: map() takes the application communication structure and the 3D torus network, and produces application ranks mapped onto the 3D torus]
  [Charts: time spent in MPI calls on 4,096 nodes under different mappings. MILC (Wait, Alltoall, Allreduce, Send, Isend, Barrier, Irecv, Recv): Default, RR, Node, Tile1-Tile4. pF3D: Default, RR, Tile1-Tile4, Tilt]
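The Tile mappings in the charts above come from a blocked traversal of the torus. Below is a minimal sketch of that idea, assuming a made-up `tile_map` helper and hypothetical dimensions rather than Rubik's real API:

```python
# Blocked ("tiled") rank-to-torus mapping sketch; not Rubik's actual interface.
import itertools

def tile_map(torus_dims, tile_dims):
    """List torus coordinates tile by tile, so consecutive ranks land in
    compact sub-blocks instead of long row-major stripes."""
    assert all(d % t == 0 for d, t in zip(torus_dims, tile_dims))
    coords = []
    for origin in itertools.product(*(range(0, d, t)
                                      for d, t in zip(torus_dims, tile_dims))):
        for offset in itertools.product(*(range(t) for t in tile_dims)):
            coords.append(tuple(o + f for o, f in zip(origin, offset)))
    return coords  # rank i is placed on torus node coords[i]

mapping = tile_map(torus_dims=(4, 4, 4), tile_dims=(2, 2, 2))
print(mapping[:8])  # the first 8 ranks fill one compact 2x2x2 tile
```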

  9. Understanding Networks

  10. Understanding Networks
  • What determines communication performance?
  • How can we predict it?
  • Quantification of metrics

  11. Understanding Networks
  • What determines communication performance?
  • How can we predict it?
  • Quantification of metrics
  • What is the relation between performance and the entities quantified above?
    • Linear, higher-order polynomial, or indeterminate
  • Is statistical data related to performance?

  12. Understanding Networks
  • What determines communication performance?
  • How can we predict it?
  • Quantification of metrics
  • What is the relation between performance and the entities quantified above?
    • Linear, higher-order polynomial, or indeterminate
  • Is statistical data related to performance?
  • Method 1: supervised learning
    • More on this in Abhinav's talk

  13. Method 2: Packet-level Simulation

  14. Method 2: Packet-level Simulation
  • Detailed study of what-if scenarios
  • Comparison of similar systems

  15. Method 2: Packet-level Simulation
  • Detailed study of what-if scenarios
  • Comparison of similar systems
  • BigSim was among the earliest accurate packet-level HPC network simulators (circa 2004)
  • Reviving the emulation and simulation capabilities of BigSim
  • BigSim + CODES + ROSS = TraceR
    • More on this in Bilge's talk

  16. Method 3: Modeling via Damselfly
  Intermediate methods are sufficient to answer certain types of questions:
  Q1: What is the best combination of routing strategies and job placement policies for single jobs?
  Q2: What is the best combination for parallel job workloads?
  Q3: Should the routing policy be job-specific or system-wide?

  17. Dragonfly Topology
  Level 1: dense connectivity among routers to form groups
  [Figure: group-level router connectivity in IBM PERCS and Cray Aries/XC30]

  18. Dragonfly Topology
  Level 2: dense connectivity among groups, which act as virtual routers
  [Figure: inter-group connectivity in IBM PERCS and Cray Aries/XC30]
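For concreteness, here is a toy construction of the two levels. The sizes and the round-robin choice of global-link endpoints are assumptions for illustration, not the actual PERCS or Aries wiring:

```python
# Sketch of dragonfly connectivity: routers within a group are fully
# connected (level 1), and each pair of groups is joined by one global
# link whose endpoints are spread round-robin over a group's routers
# (level 2), so every group behaves like one big virtual router.
import itertools

def dragonfly_links(num_groups, routers_per_group):
    links = set()
    def rid(g, r):  # global router id
        return g * routers_per_group + r
    # Level 1: all-to-all inside each group.
    for g in range(num_groups):
        for a, b in itertools.combinations(range(routers_per_group), 2):
            links.add((rid(g, a), rid(g, b)))
    # Level 2: one link per group pair, endpoint routers chosen round-robin.
    for i, (g, h) in enumerate(itertools.combinations(range(num_groups), 2)):
        links.add((rid(g, i % routers_per_group),
                   rid(h, i % routers_per_group)))
    return links

links = dragonfly_links(num_groups=5, routers_per_group=4)
print(len(links))  # 5 * C(4,2) intra-group + C(5,2) global links = 40
```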

  19. What needs to be evaluated?
  Job Placement: Random Nodes (RDN), Random Routers (RDR), Random Chassis (RDC), Random Group (RDG), Round Robin Nodes (RRN), Round Robin Routers (RRR)
  Routing: Static Direct (SD), Static Indirect (SI), Adaptive Direct (AD), Adaptive Indirect (AI), Adaptive Hybrid (AH), Job-specific (JS)
  Comm Kernel: Unstructured Mesh, 2D Stencil, 4D Stencil, Many-to-many, Spread, Parallel Workloads (4)
  Total cases ~360 for 8.8 million cores with 92,160 routers
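As a rough sanity check of the case count, the cross product of the table's columns can be enumerated directly; it lands near the quoted ~360, with the exact figure depending on which combinations are valid, which the slide does not spell out:

```python
# Back-of-the-envelope enumeration of the evaluation space.
import itertools

placements = ["RDN", "RDR", "RDC", "RDG", "RRN", "RRR"]
routings = ["SD", "SI", "AD", "AI", "AH", "JS"]
kernels = ["Unstructured Mesh", "2D Stencil", "4D Stencil",
           "Many-to-many", "Spread"]
workloads = 4  # parallel job workloads

single_job = list(itertools.product(placements, routings, kernels))
total = len(single_job) + len(placements) * len(routings) * workloads
print(total)  # 6*6*5 + 6*6*4 = 324, close to the quoted ~360
```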

  20. Model for link utilization
  • Input to the model:
    1. Network graph of Dragonfly routers
    2. Application communication graph for a communication step
    3. Job placement
    4. Routing strategy
  • Output: the steady-state traffic distribution on all network links, which is representative of the network throughput
  • Implemented as a scalable parallel MPI program executed on Blue Gene/Q; maximum runtime of 2 hours on 8,192 cores for a prediction at 8.8 million cores
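One plausible encoding of the model's four inputs, with names that are my own rather than Damselfly's actual data structures, shown on a three-router toy:

```python
# Hypothetical input encoding for a Damselfly-style link-utilization model.
network_links = {(0, 1), (1, 2), (0, 2)}   # 1. Dragonfly router graph
comm_graph = {(0, 1): 8.0, (0, 2): 8.0}    # 2. GB moved per comm step
placement = {0: 0, 1: 1, 2: 2}             # 3. rank -> router
routing = "SD"                             # 4. e.g. Static Direct
# The output maps each link to its steady-state traffic; under direct
# routing here it would be {(0, 1): 8.0, (0, 2): 8.0, (1, 2): 0.0}.
print(len(comm_graph), "messages over", len(network_links), "links")
```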

  21. Initialize two copies of the network graph N:
  • N_Alloc: stores total and per-message allocated bandwidth (= 0)
  • N_Remain: stores bandwidth available for allocation (= capacity)

  22. Initialize two copies of the network graph N:
  • N_Alloc: stores total and per-message allocated bandwidth (= 0)
  • N_Remain: stores bandwidth available for allocation (= capacity)
  Iterative solve for computing the representative state N_Alloc:
  • while a message is allocated additional bandwidth:
    • for each message m, obtain the list of paths P(m)
  [Figure: a source S and destination D in the network; every link starts with 10 GB/s]

  23. Initialize two copies of the network graph N:
  • N_Alloc: stores total and per-message allocated bandwidth (= 0)
  • N_Remain: stores bandwidth available for allocation (= capacity)
  Iterative solve for computing the representative state N_Alloc:
  • while a message is allocated additional bandwidth:
    • for each message m, obtain the list of paths P(m)
  [Figure: source S and destination D with three candidate paths P1, P2, P3; every link starts with 10 GB/s]

  24. Initialize two copies of the network graph N:
  • N_Alloc: stores total and per-message allocated bandwidth (= 0)
  • N_Remain: stores bandwidth available for allocation (= capacity)
  Iterative solve for computing the representative state N_Alloc:
  • while a message is allocated additional bandwidth:
    • for each message m, obtain the list of paths P(m)
    • using P(m) of all messages, find the request count for each link
  [Figure: source S and destination D with paths P1, P2, P3; every link starts with 10 GB/s]

  25. Initialize two copies of the network graph N:
  • N_Alloc: stores total and per-message allocated bandwidth (= 0)
  • N_Remain: stores bandwidth available for allocation (= capacity)
  Iterative solve for computing the representative state N_Alloc:
  • while a message is allocated additional bandwidth:
    • for each message m, obtain the list of paths P(m)
    • using P(m) of all messages, find the request count for each link
  [Figure: source S and destination D with paths P1, P2, P3; links are labeled with their request counts (1, 2, or 3)]
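Putting slides 21-25 together, here is a hedged sketch of the iterative solve. The initialization, the stopping rule, the paths P(m), and the per-link request counts come straight from the slides; the fair-share step (splitting a link's remaining bandwidth evenly among requesting paths and granting each path its bottleneck share) is my assumption about the part of the loop the slides do not show.

```python
# Sketch of the iterative bandwidth-allocation solve (slides 21-25).
CAPACITY = 10.0   # GB/s per link, as on the slides
EPS = 1e-6        # stop once no message gains meaningful bandwidth

def solve(links, paths_of_message):
    """links: iterable of link ids; paths_of_message: {m: [path, ...]},
    each path a list of link ids. Returns link -> allocated bandwidth."""
    n_alloc = {l: 0.0 for l in links}         # N_Alloc (= 0)
    n_remain = {l: CAPACITY for l in links}   # N_Remain (= capacity)
    progressed = True
    while progressed:  # "while a message is allocated additional bandwidth"
        # Using P(m) of all messages, find the request count for each link.
        requests = {l: 0 for l in links}
        for paths in paths_of_message.values():
            for path in paths:
                for l in path:
                    requests[l] += 1
        progressed = False
        for paths in paths_of_message.values():
            for path in paths:
                # Assumed fair share: a path's grant is its smallest
                # per-link share of the remaining bandwidth.
                grant = min(n_remain[l] / requests[l] for l in path)
                if grant > EPS:
                    progressed = True
                    for l in path:
                        n_alloc[l] += grant
                        n_remain[l] -= grant
    return n_alloc

# Toy instance: one message from S to D with a direct and an indirect
# path (hypothetical link names, standing in for the figure's P1-P3).
links = ["sa", "ab", "bd", "sd"]
paths = {"m": [["sd"], ["sa", "ab", "bd"]]}
print(solve(links, paths))
```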
