Breaking Band: A Breakdown of High-performance Communication
Rohit Zambre,* Megan Grodowitz,⌃ Aparna Chandramowlishwaran,* Pavel Shamis⌃
*University of California, Irvine
⌃Arm Research
[Figure: Microprocessor trend data. Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]
[Figure: Evolution of the memory capacity per core in the Top500 list. Source: Peter Kogge. Pim & memory: The need for a revolution in architecture.]
▸ Strong scaling is the way forward.
▸ Small messages at the limits of strong scaling.
[Figure: Latency breakdown. Legend: Network, I/O, CPU. Bar segments: 27.60%, 37.20%, 35.20%. Axis: nanoseconds (0 to 1000).]
[Figure: Injection overhead breakdown. Legend: Misc, Post_prog, Post. Bar segments: 1.19%, 22.57%, 76.22%. Axis: nanoseconds (0 to 200).]
▸ How much does a component contribute?
▸ If we optimize component X by Y%, by how much will communication performance improve?
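One simple way to approach the second question is a proportional estimate: if a component accounts for a given share of the total time and its own time is cut by Y%, the total drops by share ✕ Y. The sketch below assumes the components run back to back with no overlap. That is an illustrative assumption, not necessarily the model used in the talk. The function name `whatif_reduction` and the 20% cut are hypothetical; the shares are the injection-overhead percentages shown on the slide.

```python
def whatif_reduction(share_percent: float, component_cut_percent: float) -> float:
    """Percent reduction in total time when one component shrinks.

    Assumes components are serialized with no overlap (illustrative
    assumption), so the total shrinks by share * cut.
    """
    return share_percent * component_cut_percent / 100.0


# Injection-overhead shares from the slide (percent of total).
shares = {"Misc": 1.19, "Post_prog": 22.57, "Post": 76.22}

# Hypothetical optimization: cut each component's own time by 20%.
for name, share in shares.items():
    gain = whatif_reduction(share, 20.0)
    print(f"{name}: total injection overhead drops by {gain:.2f}%")
```

Under this assumption, the same relative improvement pays off most when applied to the largest component.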
CONTRIBUTIONS OF THE PAPER
▸ A detailed breakdown of communication performance of small messages.
▸ Analytical models to explain the injection overhead and latency.
▸ Models are accurate to within 5% of the observed performance.
▸ Detailed measurement methodology to produce the breakdown on any other system configuration.
▸ What-if analysis for a set of optimizations.
▸ First work of its kind on an Arm-based server.
OUTLINE
▸ Introduction
▸ Experimental setup & Measurement methodology
▸ Injection overhead: Modeling and breakdown
▸ Latency: Modeling and breakdown
▸ Simulated optimizations
INTERNODE COMMUNICATION COMPONENTS IN HPC
CPU
▸ High-level Communication Protocols (HLP). Example: MPICH + UCP
▸ Low-level Communication Protocols (LLP). Example: UCT
I/O
▸ I/O subsystem. Example: Root Complex + PCI Express
Network
▸ NIC and Switch. Example: Mellanox InfiniBand
EXPERIMENTAL SETUP
[Diagram: Node 1 and Node 2 (TX2-based servers), each with a Mellanox ConnectX-4 NIC, connected by a Mellanox InfiniBand network (switch + wire); a Lecroy PCIe analyzer is part of the setup.]
▸ Software: MPICH CH4 + UCX; Hardware: Arm TX2 + PCIe + Mellanox IB
▸ CPU timer registers to measure CPU time.
▸ PCIe analyzer to measure time in other components through traces.
EXPERIMENTAL SETUP (WHAT IT ACTUALLY LOOKED LIKE)
[Photo labels: State-of-the-art cooling, PCIe trace viewer, PCIe analyzer, ConnectX-4, Node 1]
USING CPU TIMERS
Timer start
<code> <of> <interest>
Timer end
Time for code of interest = Timer end - Timer start - Timer overhead
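The timer formula above can be sketched as follows. The talk reads CPU timer registers on the Arm server; this sketch substitutes Python's `time.perf_counter_ns` as a portable stand-in, and the helper names `timer_overhead_ns` and `time_region_ns` are hypothetical.

```python
import time


def timer_overhead_ns(samples: int = 10000) -> int:
    """Estimate the cost of a back-to-back timer start/end pair.

    Takes the minimum over many samples to reduce noise from
    interrupts and scheduling.
    """
    best = None
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        delta = end - start
        if best is None or delta < best:
            best = delta
    return best


def time_region_ns(code_of_interest, overhead_ns: int) -> int:
    """Timer end - Timer start - Timer overhead for one call."""
    start = time.perf_counter_ns()
    code_of_interest()
    end = time.perf_counter_ns()
    return (end - start) - overhead_ns


overhead = timer_overhead_ns()
elapsed = time_region_ns(lambda: sum(range(200000)), overhead)
print(f"timer overhead: {overhead} ns, code of interest: {elapsed} ns")
```

Subtracting the overhead matters most when the code of interest takes only tens or hundreds of nanoseconds, as in the breakdowns shown in this talk.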
USING CPU TIMERS
[Diagram: call stack by layer: MPI_Isend (MPI), ucp_tag_send_nb (UCP), uct_ep_am_short (UCT)]
▸ Measured time in different components using deltas.
▸ Carefully isolated callbacks/functions between layers (details in paper).
USING PCIE ANALYZER
Time of event = Timestamp of packet after event - Timestamp of packet before event
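As a minimal sketch of the formula above, the duration of an event is the difference between two packet timestamps in the analyzer trace. The `TracePacket` type, the `event_time_ns` helper, and the timestamp values are hypothetical and made up for illustration; they are not measurements from the talk or the analyzer's export format.

```python
from dataclasses import dataclass


@dataclass
class TracePacket:
    """One packet as it might appear in an exported trace (hypothetical layout)."""
    label: str
    timestamp_ns: float


def event_time_ns(before: TracePacket, after: TracePacket) -> float:
    """Timestamp of packet after event - timestamp of packet before event."""
    return after.timestamp_ns - before.timestamp_ns


# Made-up timestamps, for illustration only.
packet_before = TracePacket("MWr (64B)", 1000.0)
packet_after = TracePacket("ACK", 1150.0)

print(event_time_ns(packet_before, packet_after))
```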
USING PCIE ANALYZER: NIC WRITING COMPLETION
[Diagram labels: NIC, Analyzer, Root Complex (RC); MWr (TLP), ACK (DLLP); 2 ✕ PCIe wire]
INJECTION OVERHEAD
INJECTION OVERHEAD: BACKGROUND
[Diagram: sender side, showing the CPU, memory (MEM), Root Complex (RC), and NIC, with the RC and NIC connected by the PCIe wire.]
Steps shown, in build order:
▸ Post: the CPU posts the message using Programmed IO.
▸ An MWr (64B) travels over the PCIe wire to the NIC.
▸ Transmit: the NIC transmits the message; an ACK follows.
▸ Write completion: the NIC issues an MWr (64B).
▸ Completion DMA-write: the completion is written to MEM.
▸ Progress: the CPU progresses the operation.
INJECTION OVERHEAD
[Diagram: same sender-side flow as in the background slide: Post, MWr (64B), Transmit, ACK, Write completion, Completion DMA-write, Progress.]
▸ Overhead observed by RC
(b ✕ Post + b ✕ Progress + tot_Misc) / b = CPU_time = Post + Progress + Misc
▸ Overhead observed by NIC = Overhead observed by RC
(1) Credit-based flow control
(2) Multiple outstanding PCIe transactions
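The RC-side expression can be checked with a small sketch. It assumes that b is the number of messages over which the overhead is averaged and that Misc = tot_Misc / b; both are interpretations of the slide, not statements from it. The function name and the input values are hypothetical.

```python
def per_message_overhead(post_ns: float, progress_ns: float,
                         tot_misc_ns: float, b: int) -> float:
    """(b * Post + b * Progress + tot_Misc) / b, as written on the slide.

    b is assumed to be the number of messages being averaged over.
    """
    if b <= 0:
        raise ValueError("b must be a positive number of messages")
    total_ns = b * post_ns + b * progress_ns + tot_misc_ns
    return total_ns / b


# Made-up values, for illustration only.
post, progress, tot_misc, b = 10.0, 3.0, 8.0, 4
cpu_time = per_message_overhead(post, progress, tot_misc, b)

# Same result as Post + Progress + Misc, with Misc = tot_Misc / b.
misc = tot_misc / b
print(cpu_time, post + progress + misc)
```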
INJECTION OVERHEAD
Injection overhead = CPU_time = Post + Progress + Misc
▸ Measured using CPU timer registers.
[Figure: Injection overhead breakdown. Misc: 1.20%, Progress: 22.58%, Post: 76.23%. Axis: percent (0 to 100).]
▸ Post is the performance bottleneck.
▸ Progress is the semantic bottleneck.