Basic Communication Operations - PowerPoint PPT Presentation

SLIDE 1

Basic Communication Operations

  • Possible variants
    – # of nodes involved
      • Point-to-point vs collective operation
    – Routing scheme
      • Store-and-Forward (S&F), Cut-Through (CT) and Packet Routing
  • Usually point-to-point is implemented in hardware, collective in software
  • Many collective operations have a dual operation
    – the dual can be performed by reversing the direction and sequence of messages in the original operation

SLIDE 2
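Point-to-point

  • Store-and-forward => $t_{comm} \approx t_s + l\,m\,t_w$ ($l$ = number of links traversed; worst case listed below)
    – ring
      • $l = \lfloor p/2 \rfloor$
      • $t_{comm} = t_s + \lfloor p/2 \rfloor\, m\,t_w$
    – mesh
      • $l = 2\lfloor \sqrt{p}/2 \rfloor$
      • $t_{comm} = t_s + 2\lfloor \sqrt{p}/2 \rfloor\, m\,t_w$
    – hypercube
      • $l = \log p$
      • $t_{comm} = t_s + m\,t_w \log p$
  • Cut-through (or Packet) => $t_{comm} = t_s + l\,t_h + m\,t_w$
    – Small messages: CT ≈ S&F ≈ $t_s + l\,t_h$
    – Large messages: CT ≈ $t_s + m\,t_w$ (no dependence on $l$)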
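The two cost models are easy to compare numerically. The following is a minimal sketch, not part of the slides: the function names and the sample parameter values are illustrative assumptions, the mesh case assumes p is a perfect square, and the hypercube case assumes p is a power of two.

```python
from math import isqrt

def sf_time(ts, tw, m, l):
    """Store-and-forward model from the slide: ts + l*m*tw."""
    return ts + l * m * tw

def ct_time(ts, tw, th, m, l):
    """Cut-through model from the slide: ts + l*th + m*tw."""
    return ts + l * th + m * tw

def max_links(topology, p):
    """Worst-case number of links l for the three topologies listed."""
    if topology == "ring":
        return p // 2
    if topology == "mesh":          # assumes p is a perfect square
        return 2 * (isqrt(p) // 2)
    if topology == "hypercube":     # assumes p is a power of two
        return p.bit_length() - 1
    raise ValueError(f"unknown topology: {topology}")

# Hypothetical parameters (arbitrary time units), chosen only for illustration.
ts, tw, th, m, p = 50, 1, 2, 1000, 16
for topo in ("ring", "mesh", "hypercube"):
    l = max_links(topo, p)
    print(topo, l, sf_time(ts, tw, m, l), ct_time(ts, tw, th, m, l))
```

With a large message such as m = 1000, the S&F time grows with l while the CT time stays close to ts + m*tw, which is the "no dependence on l" observation above.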

SLIDE 3

One-to-all broadcast

  • A.k.a. single-node broadcast
    – message of size m on the source processor
    – at the end of the operation, the message is replicated on all other processors
  • Dual operation: single-node accumulation (a.k.a. reduce operation)
    – initially every processor has a message of size m
    – at the end, the combination of all messages is on a single destination processor
    – combination is through an associative operation (sum, product, max, min)

SLIDE 4

Broadcast over mesh: example

  • Multiplication of 4 x 4 matrix with a 4 x 1 vector
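The slide's figure did not survive extraction, so the sketch below assumes one common decomposition: processor (i, j) of the mesh holds A[i][j], element x[j] is broadcast down column j, and the products are then summed along each row (a single-node accumulation). The function name and the data layout are assumptions made for illustration.

```python
def mesh_matvec(A, x):
    """Sequential sketch of y = A x on an n x n mesh of processors."""
    n = len(A)
    # Phase 1: one-to-all broadcast of x[j] along column j, so that
    # processor (i, j) holds its own copy of x[j].
    x_local = [[x[j] for j in range(n)] for i in range(n)]
    # Local work: each processor multiplies its matrix element by its copy.
    partial = [[A[i][j] * x_local[i][j] for j in range(n)] for i in range(n)]
    # Phase 2: single-node accumulation (sum) along each row i.
    return [sum(partial[i]) for i in range(n)]

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 0, 2, 1]
print(mesh_matvec(A, x))
```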
SLIDE 5

Broadcast on ring (S&F)

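[Figure: S&F broadcast on an 8-node ring; transfers are labelled with the step in which they occur: 1, 2, 2, 3, 3, 4, 4]

  • Number of steps: $\lceil p/2 \rceil$
  • Latency of each communication step: $t_s + m\,t_w$
  • Total duration: $T_{one\_to\_all} = (t_s + m\,t_w)\lceil p/2 \rceil$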
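The step count can be checked with a small simulation. This sketch assumes that each node sends at most one message per step (which matches the step labels 1, 2, 2, 3, 3, 4, 4 in the figure) and that the source sends clockwise first; the function name is hypothetical.

```python
from math import ceil

def ring_broadcast_schedule(p, source=0):
    """Return a list of steps; each step is a list of (sender, receiver) pairs."""
    informed = {source}
    cw = ccw = source                 # current frontier in each direction
    schedule = []
    while len(informed) < p:
        sends = []
        nxt = (cw + 1) % p
        if nxt not in informed:
            sends.append((cw, nxt))
            informed.add(nxt)
            cw = nxt
        prv = (ccw - 1) % p
        # In step 1 the source is busy sending clockwise, so the
        # counter-clockwise direction only starts in step 2.
        if schedule and prv not in informed:
            sends.append((ccw, prv))
            informed.add(prv)
            ccw = prv
        schedule.append(sends)
    return schedule

schedule = ring_broadcast_schedule(8)
print([len(step) for step in schedule])   # transfers per step
print(len(schedule) == ceil(8 / 2))
```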
SLIDE 6

Broadcast on mesh (S&F)

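[Figure: broadcast on a 16-node mesh; transfers are labelled with the step in which they occur: 1, 2, 2, 3 (x4), 4 (x8)]

  • Row/column broadcast time:
    – $(t_s + m\,t_w)\lceil \sqrt{p}/2 \rceil$
  • Total duration:
    – $T_{one\_to\_all} = 2(t_s + m\,t_w)\lceil \sqrt{p}/2 \rceil$
  • 3D mesh
    – $T_{one\_to\_all} = 3(t_s + m\,t_w)\lceil p^{1/3}/2 \rceil$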

SLIDE 7

Broadcast on hypercube (S&F)

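[Figure: broadcast on an 8-node hypercube; transfers are labelled with the step in which they occur: 1, 2, 2, 3, 3, 3, 3]

  • Total duration: $T_{one\_to\_all} = (t_s + m\,t_w) \log p$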
SLIDE 8

Broadcast on hypercube: algorithm

    Procedure ONE_TO_ALL_BC(d, my_id, X)
    begin
        mask := 2^d - 1                      /* Set all bits of mask to 1 */
        for i := d - 1 downto 0 do           /* Outer loop */
            mask := mask XOR 2^i             /* Set bit i of mask to 0 */
            if (my_id AND mask) = 0 then     /* the lower i bits of my_id are 0 */
                if (my_id AND 2^i) = 0 then
                    msg_destination := my_id XOR 2^i
                    send X to msg_destination
                else
                    msg_source := my_id XOR 2^i
                    receive X from msg_source
                endif
            endif
        endfor
    end ONE_TO_ALL_BC

  • Only nodes whose lowest i bits are 0 participate in communication in iteration i
  • If bit i of my_id is 0, I am a sender
  • Otherwise I am a receiver
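The procedure can be traced with a sequential simulation. This is a sketch under the same assumption as the pseudocode, namely that node 0 is the source; the loop over `my_id` stands in for the nodes running in parallel, and the function name is hypothetical.

```python
def one_to_all_bc(d, value):
    """Simulate ONE_TO_ALL_BC on a 2**d-node hypercube with source node 0."""
    p = 1 << d
    data = [None] * p
    data[0] = value
    steps = []                          # (sender, receiver) pairs per iteration
    mask = p - 1                        # all d bits set
    for i in range(d - 1, -1, -1):
        mask ^= 1 << i                  # clear bit i of mask
        sends = [(my_id, my_id ^ (1 << i))
                 for my_id in range(p)
                 if my_id & mask == 0 and my_id & (1 << i) == 0]
        for src, dest in sends:
            data[dest] = data[src]      # receiver stores the message
        steps.append(sends)
    return data, steps

data, steps = one_to_all_bc(3, "X")
for number, sends in enumerate(steps, start=1):
    print(number, sends)
```

For d = 3 the iterations carry 1, 2 and 4 transfers, so all 8 nodes hold the message after log p = 3 steps.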
SLIDE 9

Dual of Broadcast: single-node Accumulation

The accumulation reverses ONE_TO_ALL_BC (Slide 8): the loop runs from i = 0 up to d - 1, and senders and receivers swap roles.

    Procedure SINGLE_NODE_ACC(d, my_id, m, X, sum)
    begin
        for j := 0 to m - 1 do sum[j] := X[j]
        mask := 0
        for i := 0 to d - 1 do
            /* select nodes whose lower i bits are 0 */
            if (my_id AND mask) = 0 then
                if (my_id AND 2^i) ≠ 0 then
                    msg_destination := my_id XOR 2^i
                    send sum to msg_destination
                else
                    msg_source := my_id XOR 2^i
                    receive X from msg_source
                    for j := 0 to m - 1 do
                        sum[j] := sum[j] + X[j]
                endif
            endif
            mask := mask XOR 2^i             /* Set bit i of mask to 1 */
        endfor
    end SINGLE_NODE_ACC
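A sequential simulation of SINGLE_NODE_ACC makes the reversed message pattern visible. This is a sketch, assuming node 0 is the destination and addition is the combining operation, as in the pseudocode; the function name is hypothetical.

```python
def single_node_acc(d, X):
    """X[k] is the length-m vector held by node k; the sum ends up on node 0."""
    p = 1 << d
    sums = [list(v) for v in X]
    mask = 0
    for i in range(d):
        bit = 1 << i
        # Collect this step's messages first so every send uses pre-step values.
        msgs = [(my_id ^ bit, list(sums[my_id]))
                for my_id in range(p)
                if my_id & mask == 0 and my_id & bit != 0]
        for dest, payload in msgs:
            sums[dest] = [a + b for a, b in zip(sums[dest], payload)]
        mask ^= bit                     # set bit i of mask
    return sums[0]

print(single_node_acc(2, [[1], [2], [3], [4]]))
```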

SLIDE 10

Broadcast on ring (CT)

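[Figure: CT broadcast on an 8-node ring; transfers are labelled with the step in which they occur: 1, 2, 2, 3, 3, 3, 3]

  • Latency of communication at step i: $t_s + m\,t_w + t_h\,p/2^i$
  • Total duration:
    – $T_{one\_to\_all} = \sum_{i=1}^{\log p} (t_s + m\,t_w + t_h\,p/2^i) = t_s \log p + m\,t_w \log p + t_h (p - 1)$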
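The $t_h$ term in the total comes from a geometric series. A short derivation, assuming p is a power of two:

```latex
\sum_{i=1}^{\log p} t_h \frac{p}{2^i}
  = t_h\, p \sum_{i=1}^{\log p} 2^{-i}
  = t_h\, p \left(1 - \frac{1}{p}\right)
  = t_h (p - 1)
```

The $t_s$ and $m\,t_w$ terms do not depend on i, so each is simply multiplied by the $\log p$ steps.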

SLIDE 11

Broadcast on mesh (CT)

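[Figure: CT broadcast on a 16-node mesh; transfers are labelled with the step in which they occur: 1, 2, 2, 3 (x4), 4 (x8)]

  • Row/column broadcast time:
    – $(t_s + m\,t_w)\log\sqrt{p} + t_h(\sqrt{p} - 1)$
  • Total duration:
    – $(t_s + m\,t_w)\log p + 2\,t_h(\sqrt{p} - 1)$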
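The total is twice the row/column time, since one row broadcast is followed by one column broadcast of the same cost. Written out, using $\log\sqrt{p} = \tfrac{1}{2}\log p$:

```latex
2\left[(t_s + m\,t_w)\log\sqrt{p} + t_h(\sqrt{p} - 1)\right]
  = (t_s + m\,t_w)\cdot 2 \cdot \tfrac{1}{2}\log p + 2\,t_h(\sqrt{p} - 1)
  = (t_s + m\,t_w)\log p + 2\,t_h(\sqrt{p} - 1)
```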
SLIDE 12

Broadcast on binary tree (CT)

  • Hypercube algorithm
    – different paths traverse different numbers of switches
  • Total duration:
    – $T_{one\_to\_all} = (t_s + m\,t_w + t_h(\log p + 1)) \log p$

SLIDE 13

All-to-All Broadcast

  • A.k.a. multinode broadcast
    – message of size m on each processor
    – at the end of the operation, the messages are replicated on all processors
  • Dual operation: multinode accumulation (a.k.a. personalized reduction operation)
    – each processor is the destination of a single-node accumulation
    – combination is through an associative operation (sum, product, max, min)

SLIDE 14

A2A Broadcast on Ring (S&F)

[Figure: successive steps of all-to-all broadcast on an 8-node ring; messages (0) to (7) are passed from neighbour to neighbour, and so forth until every node holds all of them]

  • Number of steps: $p - 1$
  • Latency of each communication step: $t_s + m\,t_w$
  • Total duration: $T_{all\_to\_all} = (t_s + m\,t_w)(p - 1)$
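The p - 1 step count can be reproduced with a short simulation. This sketch assumes the usual scheme in which every node sends to its right neighbour and, in each later step, forwards the message it has just received; the function name is hypothetical.

```python
def all_to_all_ring(p):
    """Node i starts with message i; returns (final holdings, steps needed)."""
    have = [{i} for i in range(p)]
    outgoing = list(range(p))           # message each node sends next
    steps = 0
    while any(len(h) < p for h in have):
        # Node i receives what its left neighbour (i - 1) % p sent.
        incoming = [outgoing[(i - 1) % p] for i in range(p)]
        for i in range(p):
            have[i].add(incoming[i])
        outgoing = incoming             # forward what was just received
        steps += 1
    return have, steps

have, steps = all_to_all_ring(8)
print(steps, sorted(have[0]))
```

Every message has size m in every step, which is why the per-step latency stays at ts + m*tw.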
SLIDE 15

A2A Broadcast on mesh (S&F)

  • Row broadcast time: $(t_s + m\,t_w)(\sqrt{p} - 1)$
  • Column broadcast time: $(t_s + \sqrt{p}\,m\,t_w)(\sqrt{p} - 1)$
  • Total duration: $T_{all\_to\_all} = 2t_s(\sqrt{p} - 1) + m\,t_w(p - 1)$

[Figure: Phase 1 (row broadcast) and Phase 2 (column broadcast)]
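The total is the sum of the two phase times. In the column phase each message has size $\sqrt{p}\,m$, because every node already holds the $\sqrt{p}$ messages of its row:

```latex
(t_s + m\,t_w)(\sqrt{p} - 1) + (t_s + \sqrt{p}\,m\,t_w)(\sqrt{p} - 1)
  = 2 t_s(\sqrt{p} - 1) + m\,t_w(\sqrt{p} - 1)(1 + \sqrt{p})
  = 2 t_s(\sqrt{p} - 1) + m\,t_w(p - 1)
```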

SLIDE 16

A2A Broadcast on hypercube (S&F)

  • Duration of step i: $t_s + m\,t_w\,2^{i-1}$
  • Total duration:
    – $T_{all\_to\_all} = \sum_{i=1}^{\log p} (t_s + m\,t_w\,2^{i-1}) = t_s \log p + m\,t_w (p - 1)$
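The doubling of the message size can be seen in a small simulation. This sketch assumes the standard pairwise-exchange scheme, in which each node swaps everything it has collected so far with its partner across one dimension per step; the function name and the choice to process dimensions from lowest to highest are assumptions.

```python
def all_to_all_hypercube(d, m=1):
    """Node k starts with message k of size m; returns (holdings, sizes sent per step)."""
    p = 1 << d
    have = [[k] for k in range(p)]
    sizes = []
    for i in range(d):
        bit = 1 << i
        snapshot = [list(h) for h in have]      # state before the exchange
        for node in range(p):
            # Swap the whole accumulated list with the partner across dimension i.
            have[node] = snapshot[node] + snapshot[node ^ bit]
        sizes.append(m * len(snapshot[0]))      # m * 2**i with 0-based i
    return have, sizes

have, sizes = all_to_all_hypercube(3, m=10)
print(sizes, sum(sizes))
```

With the slide's 1-based step index, the size sent in step i is m * 2^(i-1), and the sizes add up to m(p - 1).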