Breaking Band: A Breakdown of High-performance Communication
Rohit Zambre,* Megan Grodowitz,⌃ Aparna Chandramowlishwaran,* Pavel Shamis⌃
*University of California, Irvine
⌃Arm Research
[Figure: 42 years of microprocessor trend data. Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]
[Figure: Evolution of the memory capacity per core in the Top500 list. Source: Peter Kogge, "PIM & memory: The need for a revolution in architecture."]
▸ Strong scaling is the way forward.
▸ Small messages at the limits of strong scaling.
[Figure: latency breakdown (nanoseconds): Network 27.60%, I/O 37.20%, CPU 35.20%]
[Figure: injection overhead breakdown (nanoseconds): Post 76.22%, Post_prog 22.57%, Misc 1.19%]
▸ How much does a component contribute?
▸ If we optimize component X by Y%, by how much will communication performance improve?
CONTRIBUTIONS OF THE PAPER
▸ A detailed breakdown of communication performance of small messages.
▸ Analytical models to explain the injection overhead and latency.
▸ Models accurate to within 5% of the observed performance.
▸ Detailed measurement methodology to produce the breakdown on any other system configuration.
▸ What-if analysis for a set of optimizations.
▸ First work of its kind on an Arm-based server.
OUTLINE
▸ Introduction
▸ Experimental setup & Measurement methodology
▸ Injection overhead: Modeling and breakdown
▸ Latency: Modeling and breakdown
▸ Simulated optimizations
INTERNODE COMMUNICATION COMPONENTS IN HPC
Component: Example
▸ High-level Communication Protocols (HLP): MPICH + UCP
▸ Low-level Communication Protocols (LLP): UCT
▸ I/O subsystem: Root Complex + PCI Express
▸ NIC, Switch: Mellanox InfiniBand
Categories: CPU, I/O, Network
EXPERIMENTAL SETUP
▸ Software: MPICH CH4 + UCX; Hardware: Arm TX2 + PCIe + Mellanox IB
▸ CPU timer registers to measure CPU time.
▸ PCIe analyzer to measure time in other components through traces.
[Diagram labels: Node 1 (TX2-based server, Mellanox ConnectX-4 NIC); LeCroy PCIe analyzer; Node 2 (TX2-based server, Mellanox ConnectX-4 NIC); Mellanox InfiniBand network (switch + wire)]
EXPERIMENTAL SETUP (WHAT IT ACTUALLY LOOKED LIKE)
[Photo labels: PCIe analyzer, PCIe trace viewer, ConnectX-4, Node 1, state-of-the-art cooling]
USING CPU TIMERS
Timer start
<code of interest>
Timer end
Time for code of interest = Timer end - Timer start - Timer overhead
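The timer formula above can be sketched in a few lines. This is a minimal illustration, not the talk's measurement code: the talk reads CPU timer registers, whereas this sketch substitutes Python's `time.perf_counter_ns`, and the helper names `timer_overhead_ns` and `time_code_ns` are hypothetical.

```python
import time


def timer_overhead_ns(samples=10_000):
    """Estimate the cost of a back-to-back timer read pair (minimum over samples)."""
    best = None
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        delta = end - start
        if best is None or delta < best:
            best = delta
    return best


def time_code_ns(fn, overhead_ns):
    """Time for code of interest = timer end - timer start - timer overhead."""
    start = time.perf_counter_ns()
    fn()
    end = time.perf_counter_ns()
    # Clamp at zero: the overhead estimate can exceed a very short measurement.
    return max(0, end - start - overhead_ns)


overhead = timer_overhead_ns()
elapsed = time_code_ns(lambda: sum(range(1000)), overhead)  # assumed workload
print(f"timer overhead: {overhead} ns, code of interest: {elapsed} ns")
```

Taking the minimum over many samples is one common way to estimate the timer's own cost; other estimators (mean, median) are equally defensible.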
Call stack: MPI_Isend (MPI) → ucp_tag_send_nb (UCP) → uct_ep_am_short (UCT)
▸ Measured time in different components using deltas.
▸ Carefully isolated callbacks/functions between layers (details in paper).
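One plausible reading of "using deltas" is that each layer is timed inclusively and the time spent in a layer alone is its inclusive time minus that of the layer it calls. The sketch below illustrates that reading only; the function name `layer_deltas` and all timings are hypothetical, not measurements from the talk.

```python
def layer_deltas(inclusive_ns):
    """Convert inclusive timings (top layer first) into per-layer exclusive times."""
    exclusive = {}
    for i, (layer, total) in enumerate(inclusive_ns):
        # The layer below is included in this layer's measurement, so subtract it.
        below = inclusive_ns[i + 1][1] if i + 1 < len(inclusive_ns) else 0
        exclusive[layer] = total - below
    return exclusive


# Hypothetical inclusive times in nanoseconds, measured around each call.
measured = [("MPI", 250), ("UCP", 200), ("UCT", 170)]
deltas = layer_deltas(measured)
print(deltas)
```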
USING PCIE ANALYZER
Time of event = Timestamp of packet after event - Timestamp of packet before event
Example: NIC writing completion
[Diagram labels: NIC, Analyzer, Root Complex (RC); TLP MWr; DLLP ACK; 2 ✕ PCIe wire]
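The trace formula can be illustrated with a tiny timestamp calculation. This is a sketch under assumptions: the trace is modeled as a list of `(timestamp_ns, packet_kind)` tuples, which is not the analyzer's real export format, and the timestamps are invented.

```python
def event_time_ns(trace, before, after):
    """Return the gap between the first `before` packet and the next `after` packet."""
    t_before = next(ts for ts, kind in trace if kind == before)
    t_after = next(ts for ts, kind in trace if kind == after and ts > t_before)
    return t_after - t_before


# Hypothetical trace: an MWr TLP followed by its ACK DLLP.
trace = [(1000, "MWr"), (1240, "ACK"), (1900, "MWr")]
gap = event_time_ns(trace, "MWr", "ACK")
print(f"time of event: {gap} ns")
```

If the gap corresponds to 2 ✕ PCIe wire, as labeled on the slide, halving it would give an estimate of the one-way PCIe wire time.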
INJECTION OVERHEAD: BACKGROUND
[Diagram labels, sender side: CPU, MEM, Root Complex (RC), PCIe wire, NIC]
Sequence shown on the sender: Programmed IO Post → MWr (64B) → Transmit → ACK → MWr (64B) write completion → Completion DMA-write → Progress
INJECTION OVERHEAD
▸ Overhead observed by RC
▸ Overhead observed by NIC
(b ✕ Post + b ✕ Progress + tot_Misc) / b = CPU_time = Post + Progress + Misc = Overhead observed by RC
(1) Credit-based flow control
(2) Multiple outstanding PCIe transactions
Injection overhead = CPU_time = Post + Progress + Misc
Measured with CPU timer registers.
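The injection model above amortizes the miscellaneous CPU time over a batch of b messages. A minimal sketch of that arithmetic follows; the function name `injection_overhead_ns` and every input value are hypothetical, chosen only to show the calculation.

```python
def injection_overhead_ns(b, post_ns, progress_ns, tot_misc_ns):
    """Per-message injection overhead: (b*Post + b*Progress + tot_Misc) / b."""
    total = b * post_ns + b * progress_ns + tot_misc_ns
    return total / b


# Hypothetical inputs: 1000 messages, per-message Post and Progress, total Misc.
per_message = injection_overhead_ns(b=1000, post_ns=150, progress_ns=45, tot_misc_ns=2500)
misc = 2500 / 1000  # Misc = tot_Misc / b
print(f"injection overhead: {per_message} ns (Post 150 + Progress 45 + Misc {misc})")
```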
[Figure: injection overhead breakdown (percent): Post 76.23%, Progress 22.58%, Misc 1.20%]
▸ Post is the performance bottleneck.
▸ Progress is the semantic bottleneck.
[Figure: Post and Progress by communication protocol (percent): Post: LLP 98.39%, HLP 1.61%; Progress: HLP 86.85%, LLP 13.15%]
▸ LLP is the bottleneck in Post.
▸ HLP is the bottleneck in Progress.
[Figure: LLP's Post breakdown (percent): MD setup 15.84%, Barrier for MD 9.88%, Barrier for DBC 12.01%, PIO copy 53.79%, Other 8.49%]
▸ PIO copy is the bottleneck in LLP's Post.
LATENCY OVERHEAD: MODELING
[Diagram: CPU (Post) → Root Complex (RC) → PCIe → NIC → Switch → NIC → PCIe → Root Complex (RC) → RC-to-MEM → MEM → CPU (Progress)]
Latency = Post + 2 ✕ PCIe + Network + RC-to-MEM + Progress
Network = Wire + Switch
Measurement sources: CPU timer registers for the CPU terms, system measurements for the rest.
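The latency model above is a plain sum, so it is easy to evaluate and to regroup into CPU, I/O, and Network shares. The sketch below uses invented component times; only the formula comes from the slide, and grouping I/O as 2 ✕ PCIe + RC-to-MEM is an assumption based on the breakdown categories.

```python
def latency_ns(post, pcie, wire, switch, rc_to_mem, progress):
    """Latency = Post + 2*PCIe + Network + RC-to-MEM + Progress, Network = Wire + Switch."""
    network = wire + switch
    return post + 2 * pcie + network + rc_to_mem + progress


# Hypothetical component times in nanoseconds.
parts = dict(post=200, pcie=140, wire=270, switch=110, rc_to_mem=240, progress=150)
total = latency_ns(**parts)

# Regroup into the three categories (assumed grouping, see lead-in).
cpu = parts["post"] + parts["progress"]
io = 2 * parts["pcie"] + parts["rc_to_mem"]
network = parts["wire"] + parts["switch"]
for name, value in (("CPU", cpu), ("I/O", io), ("Network", network)):
    print(f"{name}: {value} ns ({100 * value / total:.2f}%)")
```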
[Figure: end-to-end latency breakdown (percent): CPU 35.20% (HLP 51.45%, LLP 48.55%); I/O 37.20% (PCIe 53.30%, RC-to-MEM 46.70%); Network 27.60% (Wire 71.79%, Switch 28.21%)]
[Figure: on-node breakdown (percent): Initiator 33.80% (CPU 59.50%, I/O 40.50%); Target 66.20% (I/O 56.93%, CPU 43.07%); Target I/O: RC-to-MEM 63.67%, PCIe 36.33%]
▸ On-node time dominates latency.
▸ Wire dominates network time.
▸ Target dominates on-node time.
▸ CPU is the majority on the initiator due to PIO.
▸ I/O is the majority on the target due to RC-to-MEM.
SIMULATED OPTIMIZATIONS
▸ If we optimize component X by Y%, what is the corresponding speedup in latency and injection overhead?
[Figure: speedup vs. overhead reduction (10% to 90%)]
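A what-if question of this form can be approximated with an Amdahl-style calculation. The sketch below is an assumption, not the paper's method: it treats a component as a fixed fraction of total time and reports both the reduction in total time and the ratio of old to new time. The talk's plotted speedups may use a different definition, so these numbers are not expected to reproduce the slides. The 35.20% input is the CPU share of latency from the breakdown; the function names are hypothetical.

```python
def time_reduction(fraction, reduction):
    """Fraction of total time saved when a component shrinks by `reduction`."""
    return fraction * reduction


def speedup_ratio(fraction, reduction):
    """Old time divided by new time (Amdahl-style)."""
    return 1.0 / (1.0 - fraction * reduction)


cpu_share = 0.3520  # CPU share of end-to-end latency, from the breakdown
for percent in (10, 30, 50, 70, 90):
    y = percent / 100
    saved = time_reduction(cpu_share, y)
    ratio = speedup_ratio(cpu_share, y)
    print(f"CPU reduced by {percent}%: total time down {100 * saved:.2f}%, ratio {ratio:.3f}x")
```

A sweep like this makes the upper bound visible: even a 100% reduction of a component can save no more than that component's share of the total.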
[Figure: latency speedup vs. overhead reduction (10% to 90%) for Integrated NIC, PCIe, and RC-to-MEM]
▸ An integrated NIC would eliminate most of I/O.
▸ Would make the CPU more available.
▸ Likelihood: likely to become commonplace.
▸ A modest 50% reduction can speed up latency by 15%.
▸ Microarchitectural improvements for writes to device memory would be most impactful.
▸ Likelihood: likely, since there seems to be room for improvement.
▸ Reducing PIO to 15 ns (an 84% reduction) can speed up injection by 25%.
[Figure: injection speedup vs. overhead reduction (10% to 90%) for HLP, LLP, LLP_post, PIO, HLP_tx_prog, HLP_post, LLP_tx_prog]
▸ HLP progress improvements would be closest to upper bounds.
▸ Likelihood: overhead reductions of more than 20% are not likely.
▸ Less than 5% latency speedup.
▸ 6.44% injection speedup.
[Figure: injection speedup vs. overhead reduction (10% to 90%) for HLP, LLP, LLP_post, PIO, HLP_tx_prog, HLP_post, LLP_tx_prog]
[Figure: latency speedup vs. overhead reduction (10% to 90%) for HLP, LLP, HLP_rx_prog, LLP_post, PIO, HLP_post, LLP_prog]
▸ Likelihood: further overhead reductions unlikely.
▸ Wire latencies expected to increase.
▸ Gen-Z switch overheads yet to be demonstrated.
[Figure: latency speedup vs. overhead reduction (10% to 90%) for Wire and Switch]
SUMMARY
▸ Our models explain observed performance within a 5% margin of error.
▸ The breakdown explains where, why, and how much time is spent, providing key insights.
▸ The breakdown would help researchers guide their optimization efforts.
Special thanks to Giri Chukkapalli and Ham Prince from Marvell Technology Group, Yossi Itigin from Mellanox Technologies, and Pavan Balaji from Argonne National Laboratory.
USING PCIE ANALYZER
Example: NIC receiving ACK from target NIC
[Diagram labels: NIC, Analyzer; MWr(64), Transmit, Network ✕ 2, ACK, MWr(64)]
BREAKDOWN OF THE HIGHER LEVEL
▸ Measured initiation components using deltas.
High-level Communication Protocols (HLP), on the CPU: HLP_post, HLP_progress
Call stack: MPI_Isend (MPI) → ucp_tag_send_nb (UCP) → uct_ep_am_short
RELEVANT RESEARCH
▸ Most prior research tackles one component and shows its effect on overall performance.
▸ Papadopoulou et al. and Raffenetti et al. show instruction breakdowns of UCX and MPICH, respectively.
▸ Ajima et al. show a breakdown of RDMA-write latency on post-K using simulation waveforms.