
Breaking Band: A Breakdown of High-performance Communication

Rohit Zambre,* Megan Grodowitz,⌃ Aparna Chandramowlishwaran,* Pavel Shamis⌃ (*University of California, Irvine; ⌃Arm Research)


  1. Breaking Band: A Breakdown of High-performance Communication. Rohit Zambre,* Megan Grodowitz,⌃ Aparna Chandramowlishwaran,* Pavel Shamis⌃. *University of California, Irvine; ⌃Arm Research

  4. [Figures: microprocessor trend data, https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/; evolution of the memory capacity per core in the Top500 list (Peter Kogge. Pim & memory: The need for a revolution in architecture.)] ▸ Strong scaling is the way forward. ▸ Small messages at the limits of strong scaling.

  8. [Figure: two stacked-bar breakdowns, values in legend order. Latency (axis 0 to 1000 nanoseconds): Network 27.60%, I/O 37.20%, CPU 35.20%. Injection overhead (axis 0 to 200 nanoseconds): Misc 1.19%, Post_prog 22.57%, Post 76.22%.] ▸ How much does a component contribute? ▸ If we optimize component X by Y%, by how much will communication performance improve?
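
The second question on this slide can be approximated with a simple serial model: if the components add up without overlapping, cutting component X by Y% shrinks the total by X's share times Y%. The sketch below is an illustration under that assumption, not the what-if analysis from the paper; the function name `remaining_fraction` is hypothetical, and the percentages are read off the charts in legend order.

```python
def remaining_fraction(breakdown, component, reduction):
    """Fraction of the original total left after shrinking one component.

    breakdown: dict mapping component name to its share of the total.
    reduction: fractional cut applied to that component (0.2 means 20%).
    Assumes components are serial and do not overlap (an assumption made here).
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    total = sum(breakdown.values())
    saved = breakdown[component] * reduction
    return (total - saved) / total


# Shares as read from the slide's charts, in legend order (assumed mapping).
latency = {"Network": 0.2760, "I/O": 0.3720, "CPU": 0.3520}
injection = {"Misc": 0.0119, "Post_prog": 0.2257, "Post": 0.7622}

if __name__ == "__main__":
    for name, shares, component in (("latency", latency, "CPU"),
                                    ("injection overhead", injection, "Post")):
        left = remaining_fraction(shares, component, 0.20)
        print(f"20% faster {component}: {name} drops to {left:.1%} of original")
```

Under this model the payoff of an optimization is bounded by the share of the component it targets, which is why the breakdown has to come first.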

  11. CONTRIBUTIONS OF THE PAPER ▸ A detailed breakdown of communication performance of small messages. ▸ Analytical models to explain injection overhead and latency, accurate to within 5% of the observed performance. ▸ Detailed measurement methodology to produce the breakdown on any other system configuration. ▸ What-if analysis for a set of optimizations. ▸ First work of its kind on an Arm-based server.

  12. OUTLINE ▸ Introduction ▸ Experimental setup & Measurement methodology ▸ Injection overhead: Modeling and breakdown ▸ Latency: Modeling and breakdown ▸ Simulated optimizations

  13. INTERNODE COMMUNICATION COMPONENTS IN HPC [Diagram: stack of components with examples. High-level Communication Protocols (HLP): MPICH + UCP. Low-level Communication Protocols (LLP): UCT. I/O subsystem: Root Complex + PCI Express. NIC and Switch: Mellanox InfiniBand. Side labels group the stack into CPU, I/O, and Network.]

  14. EXPERIMENTAL SETUP [Diagram boxes: Node 1 (TX2-based Server), Lecroy PCIe Analyzer, Mellanox ConnectX-4 NIC, Mellanox InfiniBand Network (Switch + Wire), Mellanox ConnectX-4 NIC, Node 2 (TX2-based Server).] ▸ Software: MPICH CH4 + UCX; Hardware: Arm TX2 + PCIe + Mellanox IB ▸ CPU timer registers to measure CPU time. ▸ PCIe analyzer to measure time in other components through traces.

  15. EXPERIMENTAL SETUP (WHAT IT ACTUALLY LOOKED LIKE) [Photo labels: PCIe trace viewer, state-of-the-art cooling, PCIe analyzer, ConnectX-4, Node 1.]

  17. USING CPU TIMERS Sequence: Timer start, code of interest, Timer end. Time for code of interest = Timer end - Timer start - Timer overhead

  18. USING CPU TIMERS Call chain: MPI_Isend (MPI), ucp_tag_send_nb (UCP), uct_ep_am_short (UCT). ▸ Measured time in different components using deltas. ▸ Carefully isolated callbacks/functions between layers (details in paper).
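
The timer formula and the per-layer deltas on these two slides can be sketched as follows. This is a minimal illustration, assuming Python's `time.perf_counter_ns` as a stand-in for the CPU timer registers used in the talk. The helper names and the sample numbers are hypothetical, and reading "deltas" as a layer's inclusive time minus the inclusive time of the next layer down is an assumption.

```python
import time


def timer_overhead_ns(samples=1000):
    """Estimate the cost of a timer read as the smallest gap between two back-to-back reads."""
    best = None
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        gap = end - start
        if best is None or gap < best:
            best = gap
    return best


def net_time_ns(start_ns, end_ns, overhead_ns):
    """Time for code of interest = Timer end - Timer start - Timer overhead."""
    return end_ns - start_ns - overhead_ns


def per_layer_ns(inclusive):
    """Split inclusive timings into per-layer time using deltas.

    inclusive: list of (layer, inclusive_ns), ordered from outermost to innermost call.
    """
    result = {}
    for i, (layer, total) in enumerate(inclusive):
        # The innermost layer has nothing below it, so its delta is its full time.
        inner = inclusive[i + 1][1] if i + 1 < len(inclusive) else 0
        result[layer] = total - inner
    return result


if __name__ == "__main__":
    overhead = timer_overhead_ns()
    start = time.perf_counter_ns()
    sum(range(1000))  # code of interest
    end = time.perf_counter_ns()
    print("net time (ns):", net_time_ns(start, end, overhead))

    # Hypothetical inclusive timings for the call chain on the slide.
    chain = [("MPI", 200), ("UCP", 150), ("UCT", 90)]
    print(per_layer_ns(chain))
```

Taking the minimum gap for the timer overhead is one common choice; a mean or median over many reads would serve equally well for a sketch like this.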

  19. USING PCIE ANALYZER Time of event = Timestamp of packet after event - Timestamp of packet before event
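
Applied to a captured trace, this rule is a subtraction between neighboring packets. The sketch below assumes the trace has already been exported as (timestamp, label) pairs sorted by time. The function name, the labels, and the timestamps are hypothetical and do not reflect any analyzer's export format.

```python
def event_times(trace):
    """Return (packet_before, packet_after, elapsed) for each pair of consecutive packets.

    trace: list of (timestamp_ns, label) tuples sorted by timestamp.
    """
    return [
        (before_label, after_label, after_ts - before_ts)
        for (before_ts, before_label), (after_ts, after_label) in zip(trace, trace[1:])
    ]


# Made-up trace: timestamps in nanoseconds, labels for illustration only.
trace = [(1000, "MWr"), (1180, "ACK"), (1900, "MWr")]

if __name__ == "__main__":
    for before, after, elapsed in event_times(trace):
        print(f"{before} -> {after}: {elapsed} ns")
```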

  20. USING PCIE ANALYZER: NIC WRITING COMPLETION [Diagram: NIC, Analyzer, and Root Complex (RC). Labels: TLP MWr, DLLP ACK, 2 ✕ PCIe wire.]

  22. INJECTION OVERHEAD

  32. INJECTION OVERHEAD: BACKGROUND [Diagram, built up over slides 23 to 32: the sender's CPU, Root Complex (RC), NIC, and MEM, with the PCIe wire between RC and NIC. Labels, in order of appearance: Post (Programmed IO, MWr (64B)); Transmit; ACK; Write completion (Completion DMA-write, MWr (64B)); Progress.] ▸ Overhead observed by RC = (b ✕ Post + b ✕ Progress + tot_Misc) / b = CPU_time = Post + Progress + Misc ▸ Overhead observed by NIC = Overhead observed by RC: (1) Credit-based flow control (2) Multiple outstanding PCIe transactions

  33. INJECTION OVERHEAD Injection overhead = CPU_time = Post + Progress + Misc, measured with CPU timer registers.
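
The injection overhead model reduces to a few lines of arithmetic. In the sketch below, `b` is read as the number of messages over which `tot_Misc` is accumulated, which is an assumption because the slides do not define it. The function names and the nanosecond values are hypothetical, chosen only to exercise the formula.

```python
def injection_overhead_ns(post_ns, progress_ns, tot_misc_ns, b):
    """(b * Post + b * Progress + tot_Misc) / b, i.e. Post + Progress + Misc per message."""
    if b <= 0:
        raise ValueError("b must be positive")
    return (b * post_ns + b * progress_ns + tot_misc_ns) / b


def shares(post_ns, progress_ns, misc_ns):
    """Percentage contribution of each term to the per-message overhead."""
    total = post_ns + progress_ns + misc_ns
    terms = (("Post", post_ns), ("Progress", progress_ns), ("Misc", misc_ns))
    return {name: 100.0 * value / total for name, value in terms}


if __name__ == "__main__":
    # Hypothetical inputs, not measurements from the talk.
    post, progress, tot_misc, b = 150.0, 45.0, 24.0, 8
    print(injection_overhead_ns(post, progress, tot_misc, b))
    print(shares(post, progress, tot_misc / b))
```

Dividing by `b` is what turns the accumulated `tot_Misc` into the per-message `Misc` term on the right-hand side of the slide's equation.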

  34. [Figure: injection overhead breakdown, axis 0 to 100 percent: Misc 1.20%, Progress 22.58%, Post 76.23%.] Post is performance bottleneck. Progress is semantic bottleneck.
