Breaking Band: A Breakdown of High-performance Communication - PowerPoint PPT Presentation


Breaking Band: A Breakdown of High-performance Communication. Rohit Zambre,* Megan Grodowitz,⌃ Aparna Chandramowlishwaran,* Pavel Shamis⌃ (*University of California, Irvine; ⌃Arm Research)


slide-1
SLIDE 1

Breaking Band: A Breakdown of High-performance Communication

Rohit Zambre,* Megan Grodowitz,⌃ Aparna Chandramowlishwaran,* Pavel Shamis⌃

*University of California, Irvine
⌃Arm Research

slide-2
SLIDE 2

[Figure: microprocessor trend data. Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]

slide-3
SLIDE 3

[Figure: microprocessor trend data. Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]

[Figure: Evolution of the memory capacity per core in the Top500 list. Source: Peter Kogge. Pim & memory: The need for a revolution in architecture.]

slide-4
SLIDE 4

[Figure: microprocessor trend data. Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]

[Figure: Evolution of the memory capacity per core in the Top500 list. Source: Peter Kogge. Pim & memory: The need for a revolution in architecture.]

▸ Strong scaling is the way forward.
▸ Small messages at the limits of strong scaling.

slide-5
SLIDE 5

[Chart: Latency breakdown (nanoseconds)] Network 27.60%, I/O 37.20%, CPU 35.20%

[Chart: Injection overhead breakdown (nanoseconds)] Misc 1.19%, Post_prog 22.57%, Post 76.22%

slide-6
SLIDE 6

[Chart: Latency breakdown (nanoseconds)] Network 27.60%, I/O 37.20%, CPU 35.20%

[Chart: Injection overhead breakdown (nanoseconds)] Misc 1.19%, Post_prog 22.57%, Post 76.22%

▸ How much does a component contribute?

slide-7
SLIDE 7

[Chart: Latency breakdown (nanoseconds)] Network 27.60%, I/O 37.20%, CPU 35.20%

[Chart: Injection overhead breakdown (nanoseconds)] Misc 1.19%, Post_prog 22.57%, Post 76.22%

▸ How much does a component contribute?

slide-8
SLIDE 8

[Chart: Latency breakdown (nanoseconds)] Network 27.60%, I/O 37.20%, CPU 35.20%

[Chart: Injection overhead breakdown (nanoseconds)] Misc 1.19%, Post_prog 22.57%, Post 76.22%

▸ How much does a component contribute?
▸ If we optimize component X by Y%, by how much will communication performance improve?

slide-9
SLIDE 9

BREAKING BAND: A BREAKDOWN OF HIGH-PERFORMANCE COMMUNICATION

CONTRIBUTIONS OF THE PAPER

▸ A detailed breakdown of communication performance of small messages.
▸ Analytical models to explain the injection overhead and latency.
▸ Models are accurate to within 5% of the observed performance.

slide-10
SLIDE 10

CONTRIBUTIONS OF THE PAPER

▸ A detailed breakdown of communication performance of small messages.
▸ Analytical models to explain the injection overhead and latency.
▸ Models are accurate to within 5% of the observed performance.
▸ Detailed measurement methodology to produce the breakdown on any other system configuration.

slide-11
SLIDE 11

CONTRIBUTIONS OF THE PAPER

▸ A detailed breakdown of communication performance of small messages.
▸ Analytical models to explain the injection overhead and latency.
▸ Models are accurate to within 5% of the observed performance.
▸ Detailed measurement methodology to produce the breakdown on any other system configuration.
▸ What-if analysis for a set of optimizations.
▸ First work of its kind on an Arm-based server.

slide-12
SLIDE 12

OUTLINE

▸ Introduction
▸ Experimental setup & Measurement methodology
▸ Injection overhead: Modeling and breakdown
▸ Latency: Modeling and breakdown
▸ Simulated optimizations

slide-13
SLIDE 13

INTERNODE COMMUNICATION COMPONENTS IN HPC

Components, with examples:
▸ High-level Communication Protocols (HLP): MPICH + UCP
▸ Low-level Communication Protocols (LLP): UCT
▸ I/O subsystem: Root Complex + PCI Express
▸ NIC, Switch: Mellanox InfiniBand

Categories: CPU, I/O, Network

slide-14
SLIDE 14

EXPERIMENTAL SETUP

▸ Software: MPICH CH4 + UCX; Hardware: Arm TX2 + PCIe + Mellanox IB
▸ CPU timer registers to measure CPU time.
▸ PCIe analyzer to measure time in other components through traces.

[Diagram: Node 1 (TX2-based server, Mellanox ConnectX-4 NIC, LeCroy PCIe analyzer) and Node 2 (TX2-based server, Mellanox ConnectX-4 NIC), connected through a Mellanox InfiniBand network (switch + wire)]

slide-15
SLIDE 15

EXPERIMENTAL SETUP (WHAT IT ACTUALLY LOOKED LIKE)

[Photo labels: PCIe analyzer, PCIe trace viewer, ConnectX-4, Node 1, State-of-the-art cooling]

slide-16
SLIDE 16

EXPERIMENTAL SETUP (WHAT IT ACTUALLY LOOKED LIKE)

[Photo labels: PCIe analyzer, PCIe trace viewer, ConnectX-4, State-of-the-art cooling, Node 1]

slide-17
SLIDE 17

USING CPU TIMERS

Timer start
<code of interest>
Timer end

Time for code of interest = Timer end - Timer start - Timer overhead
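
The subtraction above can be sketched in a few lines. This is a minimal illustration and not the authors' harness: Python's `time.perf_counter_ns` stands in for the CPU timer registers, and the helper names `estimate_timer_overhead_ns`, `corrected_time_ns`, and `time_code_ns` are hypothetical.

```python
import time


def estimate_timer_overhead_ns(samples: int = 10_000) -> float:
    """Average cost of two back-to-back timer reads with nothing in between."""
    total = 0
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        total += end - start
    return total / samples


def corrected_time_ns(timer_start: int, timer_end: int, timer_overhead: float) -> float:
    """Time for code of interest = Timer end - Timer start - Timer overhead."""
    return timer_end - timer_start - timer_overhead


def time_code_ns(code_of_interest, timer_overhead: float) -> float:
    """Run the code of interest once and return its overhead-corrected time."""
    start = time.perf_counter_ns()
    code_of_interest()
    end = time.perf_counter_ns()
    return corrected_time_ns(start, end, timer_overhead)
```

A caller would estimate the overhead once, then pass it to `time_code_ns` for each measured region. For very short regions the corrected value can be noisy, so averaging over many runs is a sensible precaution.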

slide-18
SLIDE 18

USING CPU TIMERS

MPI: MPI_Isend
UCP: ucp_tag_send_nb
UCT: uct_ep_am_short

▸ Measured time in different components using deltas.
▸ Carefully isolated callbacks/functions between layers (details in paper).
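
One way to read "using deltas" is that the time spent in a layer itself is the time measured around its entry point minus the time measured around the next layer down. The sketch below assumes that reading. The function name `exclusive_times` and the timings are made up for illustration and are not measurements from the talk.

```python
def exclusive_times(inclusive):
    """Convert inclusive per-layer times (outermost layer first) into the
    time spent in each layer alone, by subtracting the next inner layer."""
    exclusive = {}
    for i, (layer, total_ns) in enumerate(inclusive):
        inner_ns = inclusive[i + 1][1] if i + 1 < len(inclusive) else 0
        exclusive[layer] = total_ns - inner_ns
    return exclusive


# Hypothetical inclusive timings in nanoseconds, outermost call first:
# MPI_Isend (MPI) calls ucp_tag_send_nb (UCP), which calls uct_ep_am_short (UCT).
example = [("MPI", 300), ("UCP", 220), ("UCT", 150)]
per_layer = exclusive_times(example)
# per_layer == {"MPI": 80, "UCP": 70, "UCT": 150}
```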

slide-19
SLIDE 19

USING PCIE ANALYZER

Time of event = Timestamp of packet after event - Timestamp of packet before event
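
The timestamp difference can be sketched over an exported trace. The trace layout (a list of `(timestamp_ns, packet_type)` tuples), the function name `time_of_event_ns`, and the sample values are assumptions for illustration. They do not reflect the analyzer's actual export format.

```python
def time_of_event_ns(trace, before_type, after_type):
    """Return timestamp(after) - timestamp(before) for the first packet of
    `before_type` and the first later packet of `after_type`."""
    before_ts = None
    for timestamp_ns, packet_type in trace:
        if before_ts is None:
            if packet_type == before_type:
                before_ts = timestamp_ns
        elif packet_type == after_type:
            return timestamp_ns - before_ts
    raise ValueError("no matching packet pair found in trace")


# Hypothetical trace: a memory write (MWr) followed by its ACK, then another MWr.
trace = [(1000, "MWr"), (1140, "ACK"), (1900, "MWr")]
mwr_to_ack = time_of_event_ns(trace, "MWr", "ACK")
# mwr_to_ack == 140
```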

slide-20
SLIDE 20

USING PCIE ANALYZER

NIC WRITING COMPLETION

[Diagram: NIC, Analyzer, and Root Complex (RC) on the PCIe link; labels: TLP MWr, DLLP ACK, 2 ✕ PCIe wire]

slide-21
SLIDE 21

OUTLINE

▸ Introduction
▸ Experimental setup & Measurement methodology
▸ Injection overhead: Modeling and breakdown
▸ Latency: Modeling and breakdown
▸ Simulated optimizations

slide-22
SLIDE 22

INJECTION OVERHEAD

slide-23
SLIDE 23

INJECTION OVERHEAD: BACKGROUND

[Diagram (Sender): CPU, MEM, Root Complex (RC), NIC. Steps shown: Post (Programmed IO)]

slide-24
SLIDE 24

INJECTION OVERHEAD: BACKGROUND

[Diagram (Sender): CPU, MEM, Root Complex (RC), PCIe wire, NIC. Steps shown: Post (Programmed IO), MWr (64B)]

slide-25
SLIDE 25

INJECTION OVERHEAD: BACKGROUND

[Diagram (Sender): CPU, MEM, Root Complex (RC), PCIe wire, NIC. Steps shown: Post (Programmed IO), MWr (64B), Transmit, ACK]

slide-26
SLIDE 26

INJECTION OVERHEAD: BACKGROUND

[Diagram (Sender): CPU, MEM, Root Complex (RC), PCIe wire, NIC. Steps shown: Post (Programmed IO), MWr (64B), Transmit, ACK, Write completion (MWr, 64B)]

slide-27
SLIDE 27

INJECTION OVERHEAD: BACKGROUND

[Diagram (Sender): CPU, MEM, Root Complex (RC), PCIe wire, NIC. Steps shown: Post (Programmed IO), MWr (64B), Transmit, ACK, Write completion (MWr, 64B), Completion DMA-write]

slide-28
SLIDE 28

INJECTION OVERHEAD: BACKGROUND

[Diagram (Sender): CPU, MEM, Root Complex (RC), PCIe wire, NIC. Steps shown: Post (Programmed IO), MWr (64B), Transmit, ACK, Write completion (MWr, 64B), Completion DMA-write, Progress]

slide-29
SLIDE 29

INJECTION OVERHEAD

▸ Overhead observed by RC
▸ Overhead observed by NIC

[Diagram (Sender): CPU, MEM, Root Complex (RC), PCIe wire, NIC. Steps shown: Post (Programmed IO), MWr (64B), Transmit, ACK, Write completion (MWr, 64B), Completion DMA-write, Progress]

slide-30
SLIDE 30

INJECTION OVERHEAD

▸ Overhead observed by RC
▸ Overhead observed by NIC

(b ✕ Post + b ✕ Progress + tot_Misc) / b = CPU_time = Post + Progress + Misc

[Diagram (Sender): CPU, MEM, Root Complex (RC), PCIe wire, NIC. Steps shown: Post (Programmed IO), MWr (64B), Transmit, ACK, Write completion (MWr, 64B), Completion DMA-write, Progress]

slide-31
SLIDE 31

INJECTION OVERHEAD

▸ Overhead observed by RC
▸ Overhead observed by NIC

(b ✕ Post + b ✕ Progress + tot_Misc) / b = CPU_time = Post + Progress + Misc

(1) Credit-based flow control
(2) Multiple outstanding PCIe transactions

[Diagram (Sender): CPU, MEM, Root Complex (RC), PCIe wire, NIC. Steps shown: Post (Programmed IO), MWr (64B), Transmit, ACK, Write completion (MWr, 64B), Completion DMA-write, Progress]

slide-32
SLIDE 32

INJECTION OVERHEAD

▸ Overhead observed by RC
▸ Overhead observed by NIC

(b ✕ Post + b ✕ Progress + tot_Misc) / b = CPU_time = Post + Progress + Misc = Overhead observed by RC

(1) Credit-based flow control
(2) Multiple outstanding PCIe transactions

[Diagram (Sender): CPU, MEM, Root Complex (RC), PCIe wire, NIC. Steps shown: Post (Programmed IO), MWr (64B), Transmit, ACK, Write completion (MWr, 64B), Completion DMA-write, Progress]

slide-33
SLIDE 33

INJECTION OVERHEAD

Injection overhead = CPU_time = Post + Progress + Misc

(measured with CPU timer registers)
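
The injection model can be written as a small function. This is a sketch under the assumption that `tot_Misc` is the miscellaneous CPU time for a whole batch of `b` messages, so that `Misc = tot_Misc / b`. The function names and sample values are hypothetical.

```python
def injection_overhead(post, progress, tot_misc, b):
    """Per-message injection overhead for a batch of b messages:
    (b * Post + b * Progress + tot_Misc) / b = Post + Progress + Misc."""
    if b <= 0:
        raise ValueError("b must be a positive number of messages")
    return (b * post + b * progress + tot_misc) / b


def component_shares(post, progress, misc):
    """Fraction of the per-message injection overhead spent in each component."""
    total = post + progress + misc
    return {"Post": post / total, "Progress": progress / total, "Misc": misc / total}


# Hypothetical per-message times in nanoseconds (not values from the talk).
per_message = injection_overhead(post=10.0, progress=3.0, tot_misc=20.0, b=10)
# per_message == 15.0, i.e. Post (10) + Progress (3) + Misc (20 / 10 = 2)
```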

slide-34
SLIDE 34

[Chart: Injection overhead (percent)] Misc 1.20%, Progress 22.58%, Post 76.23%

▸ Post is performance bottleneck
▸ Progress is semantic bottleneck

slide-35
SLIDE 35

[Chart: Injection overhead (percent)] Misc 1.20%, Progress 22.58%, Post 76.23%

▸ Post is performance bottleneck
▸ Progress is semantic bottleneck

[Chart: Progress and Post by communication protocol (percent)]
Progress: LLP 1.61%, HLP 98.39%
Post: LLP 86.85%, HLP 13.15%

▸ LLP is bottleneck in Post
▸ HLP is bottleneck in Progress

slide-36
SLIDE 36

[Chart: Injection overhead (percent)] Misc 1.20%, Progress 22.58%, Post 76.23%

▸ Post is performance bottleneck
▸ Progress is semantic bottleneck

[Chart: Progress and Post by communication protocol (percent)]
Progress: LLP 1.61%, HLP 98.39%
Post: LLP 86.85%, HLP 13.15%

▸ LLP is bottleneck in Post
▸ HLP is bottleneck in Progress

[Chart: LLP’s Post (percent)] MD setup 15.84%, Barrier for MD 9.88%, Barrier for DBC 12.01%, PIO copy 53.79%, Other 8.49%

▸ PIO copy bottleneck in LLP’s Post
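
Because the three charts nest, the share of one leaf component in the total injection overhead is the product of the percentages along its path. The sketch below assumes that nesting (PIO copy as a share of LLP's Post, LLP as a share of Post, Post as a share of injection overhead). It also assumes the 86.85% figure is LLP's share of Post, which the flattened chart text does not state outright. The names are hypothetical.

```python
def nested_share(*fractions):
    """Multiply the shares along a path through nested breakdowns."""
    share = 1.0
    for fraction in fractions:
        share *= fraction
    return share


POST_IN_INJECTION = 0.7623      # Post as a share of injection overhead
LLP_IN_POST = 0.8685            # assumed: LLP as a share of Post
PIO_COPY_IN_LLP_POST = 0.5379   # PIO copy as a share of LLP's Post

pio_copy_in_injection = nested_share(
    POST_IN_INJECTION, LLP_IN_POST, PIO_COPY_IN_LLP_POST
)
# Under these assumptions PIO copy is roughly 36% of the injection overhead.
```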

slide-37
SLIDE 37

OUTLINE

▸ Introduction
▸ Experimental setup & Measurement methodology
▸ Injection overhead: Modeling and breakdown
▸ Latency: Modeling and breakdown
▸ Simulated optimizations

slide-38
SLIDE 38

LATENCY OVERHEAD: MODELING

[Diagram: CPU, Root Complex (RC), PCIe, NIC, Switch, NIC, PCIe, Root Complex (RC), MEM, CPU; stages labelled Post, PCIe, Network, PCIe, RC-to-MEM, Progress]

Latency = Post + 2 (PCIe) + Network + RC-to-MEM + Progress
Network = Wire + Switch

Measurement sources: CPU timer registers (twice), system measurements
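
The latency model translates directly into code. This is a minimal sketch: the function name `latency_ns` and the sample component times are hypothetical and are not measurements from the talk.

```python
def latency_ns(post, pcie, wire, switch, rc_to_mem, progress):
    """Latency = Post + 2 (PCIe) + Network + RC-to-MEM + Progress,
    with Network = Wire + Switch. All arguments are in nanoseconds."""
    network = wire + switch
    return post + 2 * pcie + network + rc_to_mem + progress


# Hypothetical component times in nanoseconds.
total = latency_ns(post=100, pcie=50, wire=70, switch=30, rc_to_mem=80, progress=60)
# total == 440: 100 + 2 * 50 + (70 + 30) + 80 + 60
```

PCIe appears twice because the message crosses a PCIe link on the initiator and again on the target.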

slide-39
SLIDE 39

[Chart: End-to-end latency (percent)]
End-to-end latency: Network 27.60%, I/O 37.20%, CPU 35.20%
CPU: LLP 48.55%, HLP 51.45%
I/O: RC-to-MEM 46.70%, PCIe 53.30%
Network: Wire 71.79%, Switch 28.21%

[Chart: On-node time (percent)]
On-node: Target 66.20%, Initiator 33.80%
Initiator: I/O 40.50%, CPU 59.50%
Target: I/O 56.93%, CPU 43.07%
Target I/O: RC-to-MEM 63.67%, PCIe 36.33%

▸ On-node time dominates latency
▸ Wire dominates off-node time
▸ Target dominates on-node time
▸ CPU is the majority on initiator due to PIO
▸ I/O is the majority on target due to RC-to-MEM

slide-40
SLIDE 40

OUTLINE

▸ Introduction
▸ Experimental setup & Measurement methodology
▸ Injection overhead: Modeling and breakdown
▸ Latency: Modeling and breakdown
▸ Simulated optimizations

slide-41
SLIDE 41

SIMULATED OPTIMIZATIONS

▸ If we optimize component X by Y%, what is the corresponding speedup in latency and injection overhead?

[Chart: Speedup (0% to 60%) versus overhead reduction (10% to 90%)]
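
A what-if question of this form can be sketched once a component's share of the total is known. The sketch assumes that "speedup" means the fraction of the original time saved (share times reduction). The talk may define it differently. It also assumes HLP accounts for 13.15% of Post and 98.39% of Progress. All names are hypothetical.

```python
def what_if_speedup(component_share, reduction):
    """Fraction of total time saved when a component that accounts for
    `component_share` of the total is reduced by `reduction` (both 0..1)."""
    if not (0.0 <= component_share <= 1.0 and 0.0 <= reduction <= 1.0):
        raise ValueError("share and reduction must lie between 0 and 1")
    return component_share * reduction


# HLP share of injection overhead: its part of Post plus its part of Progress
# (assumed attribution of the chart percentages).
hlp_share_of_injection = 0.7623 * 0.1315 + 0.2258 * 0.9839

# HLP share of latency: CPU is 35.20% of latency and HLP is 51.45% of CPU.
hlp_share_of_latency = 0.3520 * 0.5145

injection_gain = what_if_speedup(hlp_share_of_injection, 0.20)
latency_gain = what_if_speedup(hlp_share_of_latency, 0.20)
# Under these assumptions a 20% HLP reduction saves about 6.4% of the
# injection overhead and about 3.6% of the latency.
```

Under these assumptions the results are close to the figures quoted on the HLP software improvements slide (6.44% injection speedup, less than 5% latency speedup), which suggests, but does not confirm, that the talk uses a similar definition.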

slide-42
SLIDE 42

NIC INTEGRATED ON CHIP

[Chart: Latency speedup (0% to 35%) versus overhead reduction (10% to 90%); series: Integrated NIC, PCIe, RC-to-MEM]

▸ Would eliminate most of I/O.
▸ Would make the CPU more available.
▸ Likelihood: Likely to become commonplace.
▸ A modest 50% reduction can speed up latency by 15%.

slide-43
SLIDE 43


FASTER LLP

▸ Microarchitectural improvements for writes to device memory are the most impactful.
▸ Likelihood: Likely, since there seems to be room for improvement.
▸ PIO reduction to 15 ns (84% reduction) can speed up injection by 25%.

[Chart: Injection speedup (0% to 60%) versus overhead reduction (10% to 90%); series: HLP, LLP, LLP_post, PIO, HLP_tx_prog, HLP_post, LLP_tx_prog]

slide-44
SLIDE 44

HLP SOFTWARE IMPROVEMENTS

▸ HLP progress improvements would be closest to upper bounds.
▸ Likelihood: Overhead reductions not likely more than 20%.
▸ Less than 5% latency speedup
▸ 6.44% injection speedup

[Chart: Injection speedup (0% to 60%) versus overhead reduction (10% to 90%); series: HLP, LLP, LLP_post, PIO, HLP_tx_prog, HLP_post, LLP_tx_prog]

[Chart: Latency speedup (0% to 35%) versus overhead reduction (10% to 90%); series: HLP, LLP, HLP_rx_prog, LLP_post, PIO, HLP_post, LLP_prog]

slide-45
SLIDE 45

NETWORK IMPROVEMENTS

▸ Likelihood: Further overhead reductions unlikely.
▸ Wire latencies expected to increase.
▸ Gen-Z switch overheads yet to be demonstrated.

[Chart: Latency speedup (0% to 35%) versus overhead reduction (10% to 90%); series: Wire, Switch]

slide-46
SLIDE 46

SUMMARY

▸ Our models explain observed performance within a 5% margin of error.
▸ Breakdown explains where, why, and how much time is spent, providing key insights.
▸ Breakdown would help researchers guide their optimization efforts.

“To measure is to know.” — Lord Kelvin

Special thanks to Giri Chukkapalli and Ham Prince from Marvell Technology Group, Yossi Itigin from Mellanox Technologies, and Pavan Balaji from Argonne National Laboratory

slide-47
SLIDE 47


slide-48
SLIDE 48

USING PCIE ANALYZER

NIC RECEIVING ACK FROM TARGET NIC

[Diagram: NIC and Analyzer; labels: MWr(64), Transmit, Network ✕ 2, ACK, MWr(64)]

slide-49
SLIDE 49

BREAKDOWN OF THE HIGHER LEVEL

▸ Measured initiation components using deltas.

High-level Communication Protocols (HLP), CPU: HLP_post, HLP_progress

MPI: MPI_Isend
UCP: ucp_tag_send_nb
Boundary with the next layer: uct_ep_am_short

slide-50
SLIDE 50

RELEVANT RESEARCH

▸ Most prior research tackles one component and shows its effect on overall performance.
▸ Papadopoulou et al. and Raffenetti et al. show instruction breakdowns of UCX and MPICH, respectively.
▸ Ajima et al. show a breakdown of RDMA-write latency on post-K using simulation waveforms.