Breaking Band: A Breakdown of High-performance Communication

Slide 1: Breaking Band: A Breakdown of High-performance Communication
Rohit Zambre,* Megan Grodowitz,⌃ Aparna Chandramowlishwaran,* Pavel Shamis⌃
*University of California, Irvine
⌃Arm Research

Slides 2-4:
[Figure: microprocessor trend data. Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]
[Figure: Evolution of the memory capacity per core in the Top500 list (Peter Kogge. Pim & memory: The need for a revolution in architecture.)]
▸ Strong scaling is the way forward.
▸ Small messages at the limits of strong scaling.

Slides 5-8:
[Figure: Latency breakdown. Legend: Network, I/O, CPU. Values as extracted: 27.60%, 37.20%, 35.20%. X-axis: nanoseconds, 0 to 1000.]
[Figure: Injection overhead breakdown. Legend: Misc, Post_prog, Post. Values as extracted: 1.19%, 22.57%, 76.22%. X-axis: nanoseconds, 0 to 200.]
▸ How much does a component contribute?
▸ If we optimize component X by Y%, by how much will communication performance improve?

Slides 9-11: CONTRIBUTIONS OF THE PAPER
▸ A detailed breakdown of communication performance of small messages.
▸ Analytical models to explain injection overhead and latency.
▸ Models accurate to within 5% of the observed performance.
▸ Detailed measurement methodology to produce the breakdown on any other system configuration.
▸ What-if analysis for a set of optimizations.
▸ First work of its kind on an Arm-based server.

Slide 12: OUTLINE
▸ Introduction
▸ Experimental setup & Measurement methodology
▸ Injection overhead: Modeling and breakdown
▸ Latency: Modeling and breakdown
▸ Simulated optimizations

Slide 13: INTERNODE COMMUNICATION COMPONENTS IN HPC
▸ CPU: High-level Communication Protocols (HLP), for example MPICH + UCP; Low-level Communication Protocols (LLP), for example UCT.
▸ I/O: I/O subsystem, for example Root Complex + PCI Express.
▸ Network: NIC and switch, for example Mellanox InfiniBand.

Slide 14: EXPERIMENTAL SETUP
[Diagram components: Node 1 and Node 2 (TX2-based servers), each with a Mellanox ConnectX-4 NIC; a Lecroy PCIe analyzer; a Mellanox InfiniBand network (switch + wire).]
▸ Software: MPICH CH4 + UCX; Hardware: Arm TX2 + PCIe + Mellanox IB
▸ CPU timer registers to measure CPU time.
▸ PCIe analyzer to measure time in other components through traces.

Slides 15-16: EXPERIMENTAL SETUP (WHAT IT ACTUALLY LOOKED LIKE)
[Photo of the lab setup. Labels: State-of-the-art cooling, PCIe trace viewer, PCIe analyzer, ConnectX-4, Node 1.]

Slide 17: USING CPU TIMERS
Timer start
<code of interest>
Timer end
Time for code of interest = Timer end - Timer start - Timer overhead
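The subtraction on this slide can be sketched in a few lines. This is only an illustration: the talk reads CPU timer registers on the Arm server, whereas the sketch below substitutes Python's `time.perf_counter_ns`, and the helper names `timer_overhead_ns` and `time_region_ns` are hypothetical. Taking the minimum of many empty start/end pairs as the timer overhead is an assumption, not something the slide specifies.

```python
import time


def timer_overhead_ns(samples: int = 10_000) -> int:
    """Estimate the cost of one start/end timer pair with nothing in between.

    Assumption: the smallest observed delta is used as the overhead estimate.
    """
    best = None
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        delta = end - start
        if best is None or delta < best:
            best = delta
    return best


def time_region_ns(fn, overhead_ns: int) -> int:
    """Time for code of interest = timer end - timer start - timer overhead."""
    start = time.perf_counter_ns()
    fn()
    end = time.perf_counter_ns()
    # Clamp at zero so timer jitter never yields a negative duration.
    return max(end - start - overhead_ns, 0)


overhead = timer_overhead_ns()
elapsed = time_region_ns(lambda: sum(range(1000)), overhead)
```

For regions as short as a few hundred nanoseconds, the overhead term is a noticeable share of the raw delta, which is presumably why the slide subtracts it explicitly.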

Slide 18: USING CPU TIMERS
MPI layer: MPI_Isend
UCP layer: ucp_tag_send_nb
UCT layer: uct_ep_am_short
▸ Measured time in different components using deltas.
▸ Carefully isolated callbacks/functions between layers (details in paper).
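One plausible reading of "using deltas" is that the time spent in a layer is the inclusive time of its entry function minus the inclusive time of the function it calls in the layer below. The sketch below assumes that reading; the function name `exclusive_times` and the sample numbers are hypothetical and are not measurements from the talk.

```python
def exclusive_times(inclusive_ns):
    """Split nested inclusive timings into per-layer exclusive time.

    inclusive_ns: list of (layer, inclusive_time_ns) pairs ordered from the
    outermost call (e.g. MPI) to the innermost call (e.g. UCT).
    """
    result = {}
    for i, (layer, total) in enumerate(inclusive_ns):
        # The next entry is the callee in the layer below; the innermost layer has none.
        inner = inclusive_ns[i + 1][1] if i + 1 < len(inclusive_ns) else 0
        result[layer] = total - inner
    return result


# Hypothetical inclusive timings in nanoseconds, for illustration only.
sample = [("MPI", 100), ("UCP", 60), ("UCT", 25)]
per_layer = exclusive_times(sample)
print(per_layer)  # {'MPI': 40, 'UCP': 35, 'UCT': 25}
```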

Slide 19: USING PCIE ANALYZER
Time of event = Timestamp of packet after event - Timestamp of packet before event
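As a small sketch of the timestamp rule above: given two packets captured around an event, the event time is the difference of their timestamps. The `Packet` record, its field names, and the sample values are hypothetical; a real analyzer trace export will have its own format.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    kind: str            # packet label, e.g. "MWr" or "ACK" (chosen for illustration)
    timestamp_ns: float  # capture time reported by the analyzer


def event_time_ns(before: Packet, after: Packet) -> float:
    """Time of event = timestamp of packet after event - timestamp of packet before event."""
    if after.timestamp_ns < before.timestamp_ns:
        raise ValueError("packets are out of order")
    return after.timestamp_ns - before.timestamp_ns


# Hypothetical timestamps, not values from the talk.
delta = event_time_ns(Packet("MWr", 1000.0), Packet("ACK", 1250.0))
print(delta)  # 250.0
```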

Slide 20: USING PCIE ANALYZER: NIC WRITING COMPLETION
[Diagram: NIC, analyzer, and Root Complex (RC) on the PCIe link. Labels: MWr TLP, ACK DLLP, 2 ✕ PCIe wire.]

Slide 22: INJECTION OVERHEAD

Slides 23-32: INJECTION OVERHEAD: BACKGROUND
[Diagram: sender node with CPU, Root Complex (RC), NIC, and MEM. Labels in order of appearance: Programmed IO (Post) from the CPU; MWr (64B) over the PCIe wire; Transmit at the NIC; ACK; Write completion (Completion DMA-write, MWr (64B)) to MEM; Progress at the CPU.]
▸ Overhead observed by RC = (b ✕ Post + b ✕ Progress + tot_Misc) / b = CPU_time = Post + Progress + Misc
▸ Overhead observed by NIC = Overhead observed by RC
(1) Credit-based flow control
(2) Multiple outstanding PCIe transactions

Slide 33: INJECTION OVERHEAD
Injection overhead = CPU_time = Post + Progress + Misc (CPU timer registers)
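The per-message model from the injection overhead slides can be written out as a short sketch. It assumes that `b` is the number of messages covered by one measurement and that `tot_Misc` is the miscellaneous time accumulated over all of them, so that Misc = tot_Misc / b; the slides do not define either symbol explicitly. The function name and the sample values are hypothetical.

```python
def injection_overhead_ns(post_ns: float, progress_ns: float,
                          tot_misc_ns: float, b: int) -> float:
    """Injection overhead = (b * Post + b * Progress + tot_Misc) / b.

    Algebraically this equals Post + Progress + Misc with Misc = tot_Misc / b.
    """
    if b <= 0:
        raise ValueError("b must be positive")
    return (b * post_ns + b * progress_ns + tot_misc_ns) / b


# Hypothetical inputs in nanoseconds, for illustration only.
overhead_ns = injection_overhead_ns(post_ns=100.0, progress_ns=30.0,
                                    tot_misc_ns=64.0, b=32)
print(overhead_ns)  # 132.0
```

Because the per-message Post and Progress terms do not depend on `b`, a larger `b` only shrinks the Misc share of the result.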

Slide 34: INJECTION OVERHEAD BREAKDOWN
[Figure: Injection overhead breakdown in percent (axis 0 to 100): Misc 1.20%, Progress 22.58%, Post 76.23%.]
▸ Post is the performance bottleneck.
▸ Progress is the semantic bottleneck.
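The question posed earlier in the talk, "if we optimize component X by Y%, by how much will communication performance improve?", can be sketched for injection overhead with the percentages on this slide. The sketch assumes the additive model Injection overhead = Post + Progress + Misc and that the other components stay unchanged; the function name `what_if` is hypothetical, and this is not the what-if analysis from the paper itself.

```python
# Percent shares of injection overhead as shown on the slide.
BREAKDOWN = {"Post": 76.23, "Progress": 22.58, "Misc": 1.20}


def what_if(breakdown: dict, component: str, reduction_pct: float) -> float:
    """Percent reduction in the total when one component shrinks by reduction_pct.

    Assumes the total is the plain sum of the components.
    """
    total = sum(breakdown.values())
    return breakdown[component] / total * reduction_pct


for name in BREAKDOWN:
    print(f"20% faster {name}: total improves by {what_if(BREAKDOWN, name, 20.0):.2f}%")
```

Under this model, the same relative speedup is worth far more when applied to Post than to Progress or Misc, which matches the slide's reading of Post as the performance bottleneck.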
