Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping
Georgios Goumas, Aristidis Sotiropoulos and Nectarios Koziris
National Technical University of Athens, Greece
Department of Electrical and Computer Engineering
Division of Computer Science, Computing Systems Lab
www.cslab.ece.ntua.gr
nkoziris@cslab.ece.ntua.gr
IPDPS 2001, San Francisco
Overview
Goal: minimize the overall execution time of nested loops on message-passing multiprocessor architectures.
How? Loop tiling for parallelism + overlapping the otherwise interleaved communication and pure computation sub-phases.
Is it possible? The s/w communication layer and the hardware should assist.
THE OVERALL SCHEDULE IS LIKE A PIPELINED DATAPATH!
What is tiling (supernode transformation)?
• A loop transformation
• Partitions the iteration space Jⁿ into n-dimensional parallelepiped areas formed by n families of hyperplanes
• Each tile (supernode) contains many iteration points within its boundary area
• A tile is defined by a square matrix H; each row vector hi is perpendicular to a family of hyperplanes
• Dually, a tile is defined by the n column vectors pi that form its sides: P = [p1 … pn], and it holds that P = H⁻¹
Multilevel Tiling: tiling at all levels of the memory hierarchy!
• to increase reuse of register files
• to increase reuse of cache lines (tiling for locality)
• to increase locality in virtual memory
• and at the upper level: tiling to exploit parallelism!
Why use tiling for parallelism?
• Increases the grain of computation, reducing synchronization points (atomic tile execution)
• Reduces overall communication cost (increases intra-processor communication)
TRY TO FULLY UTILIZE ALL PROCESSORS (CPUs!)
Tiling Transformation
Tiles are atomic, identical, bounded, and sweep the whole index space.
r: Zⁿ → Z²ⁿ,  r(j) = ( ⌊Hj⌋, j − H⁻¹⌊Hj⌋ )
• ⌊Hj⌋ identifies the coordinates of the tile that j is mapped to
• j − H⁻¹⌊Hj⌋ gives the coordinates of j within that tile, relative to the tile origin
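As a concrete illustration, here is a minimal C sketch of r(j), assuming the rectangular 2×2 tiling of the next slide (H = diag(1/2, 1/2), P = H⁻¹ = diag(2, 2)); the function and variable names are ours, not from the paper.

    #include <math.h>
    #include <stdio.h>

    #define N 2  /* dimension of the iteration space */

    /* r(j) = ( floor(H*j), j - Hinv*floor(H*j) ) */
    void tile_map(const double H[N][N], const double Hinv[N][N],
                  const int j[N], int tile[N], int offset[N])
    {
        /* tile coordinates: floor(H * j) */
        for (int i = 0; i < N; i++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += H[i][k] * j[k];
            tile[i] = (int)floor(s);
        }
        /* offset of j within its tile: j - Hinv * tile */
        for (int i = 0; i < N; i++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += Hinv[i][k] * tile[k];
            offset[i] = j[i] - (int)llround(s);
        }
    }

    int main(void)
    {
        double H[N][N]    = {{0.5, 0.0}, {0.0, 0.5}};  /* rows h1, h2 */
        double Hinv[N][N] = {{2.0, 0.0}, {0.0, 2.0}};  /* columns p1, p2 */
        int j[N] = {3, 2}, tile[N], offset[N];

        tile_map(H, Hinv, j, tile, offset);
        /* prints: j=(3,2) -> tile (1,1), offset (1,0) */
        printf("j=(%d,%d) -> tile (%d,%d), offset (%d,%d)\n",
               j[0], j[1], tile[0], tile[1], offset[0], offset[1]);
        return 0;
    }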
Example: a simple 2-D tiling

for j1 = 0 to 5
  for j2 = 0 to 5
    a(j1, j2) = a(j1-1, j2) + a(j1-1, j2-1);

J² = { (j1, j2) | 0 ≤ j1, j2 ≤ 5 }

H = [ 1/2  0  ]        P = [ 2  0 ]
    [ 0   1/2 ]            [ 0  2 ]

(rows of H: h1, h2; columns of P: p1, p2)
Example (cont.)
J² = { (j1, j2) | 0 ≤ j1, j2 ≤ 5 }
[Figure: the 6×6 index space partitioned into 2×2 tiles, labeled (0,0) through (2,2)]
Tile space: JS = { jS | jS = ⌊Hj⌋, j ∈ J² }
e.g. for j = (3, 2): jS = ⌊Hj⌋ = (1, 1)
Tile origin space: TOS(JS, H) = { j ∈ Zⁿ | j = H⁻¹·jS, jS = (j1S, j2S) ∈ JS }
Another example: a non-rectangular tiling

P = [ 3  2 ]        H = P⁻¹ = 1/10 [  4  −2 ]
    [ 1  4 ]                       [ −1   3 ]

[Figure: the index space covered by skewed tiles with sides p1, p2; tiles labeled (0,0), (0,1), (1,−1), (1,1), (2,−1), (2,0)]
Tile Computation and Communication Cost
• The number of iteration points contained in a supernode jS expresses the tile computation cost.
• The tile communication cost is proportional to the number of iteration points that need to send data to neighboring tiles.
Minimise  V_comm(H) = (1/det(H)) · Σ(i=1..n) Σ(k=1..n) Σ(j=1..m) h(i,k)·d(k,j)    (2)
Subject to  V_comp(H) = 1/det(H) = v  and  HD ≥ 0    (1)

Example:

D = [ 3  1 ]    P1 = [ 4  0 ]    P2 = [ 6  2 ]
    [ 1  2 ]         [ 0  5 ]         [ 2  4 ]

V_comp,1 = V_comp,2 = 20, but V_comm,1 = 27 and V_comm,2 = 19: equal tile volume, different communication cost.
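As a sketch of the legality constraint (1), the following C fragment checks HD ≥ 0 for a candidate tiling: every dependence must have non-negative coordinates in tile space, or tiles could not execute atomically in lexicographic order. The matrices are the D and P2 as reconstructed above, and the helper name is ours.

    #include <stdio.h>

    #define N 2  /* space dimension */
    #define M 2  /* number of dependence vectors (columns of D) */

    /* Legality check: every entry of H*D must be non-negative, so that no
     * dependence crosses a tile boundary "backwards". */
    int tiling_is_legal(const double H[N][N], const int D[N][M])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += H[i][k] * D[k][j];
                if (s < 0.0)
                    return 0;
            }
        return 1;
    }

    int main(void)
    {
        int D[N][M] = {{3, 1}, {1, 2}};
        /* H = inverse of P2 = (1/20) * [[4, -2], [-2, 6]] */
        double H[N][N] = {{ 4.0/20.0, -2.0/20.0},
                          {-2.0/20.0,  6.0/20.0}};

        printf("HD >= 0 holds: %s\n", tiling_is_legal(H, D) ? "yes" : "no");
        return 0;
    }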
Objectives when Tiling for Parallelism
Most methods, given a fixed tile computation volume, try to minimize the communication needs: re-shaping tiles = reducing communication.
But what about the iteration space size and its boundaries?
Our objective is to minimize the overall execution time, thus we need efficient scheduling.
Scheduling of Tiles
If HD ≥ 0, tiles are atomic and preserve the lexicographic execution ordering.
How can we schedule tiles to exploit parallelism? Use methods similar to scheduling loop iterations!
• Time scheduling: LINEAR TIME SCHEDULING of TILES
• Space scheduling: CHAINS OF TILES mapped TO THE SAME PROCESSOR
Linear Schedule
t(j) = ⌊(Π·j + t0) / disp(Π)⌋,  where t0 = −min{ Π·i | i ∈ Jⁿ }
The same applied to tiles: t(jS) = ⌊(Π·jS + t0) / disp(Π)⌋
Which Π is optimal?
For the non-overlapping schedule: Π = [1 1 1 … 1]
For coarse-grain tiles, all iteration dependencies are contained within the tile area.
Why coarse grain? VERY FAST PROCESSORS vs. COMMUNICATION LATENCY: the communication-to-computation ratio should be meaningful.
The supernode dependence set then contains only unitary dependencies; in other words, every tile communicates only with its neighbors, one in each dimension.
For these unitary inter-tile dependence vectors, the optimal Π is [1 1 1 … 1].
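In code, the non-overlapping linear schedule with Π = [1 1 … 1] collapses to a sum of tile coordinates; a tiny C sketch, assuming for simplicity a tile space starting at the origin so that t0 = 0:

    /* Linear schedule with Pi = [1 1 ... 1]: all tiles on the hyperplane
     * j1S + j2S + ... + jnS = t run concurrently at time step t
     * (disp(Pi) = 1, and t0 = 0 for a tile space starting at the origin). */
    int linear_time(const int jS[], int n)
    {
        int t = 0;
        for (int i = 0; i < n; i++)
            t += jS[i];
        return t;
    }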
The total number P(g) of time hyperplanes depends on the tile grain g:
[Figure: the same index space tiled two ways]

H = [ 1/2  0  ]   tile grain g = |det(H⁻¹)| = 4
    [ 0   1/2 ]

H′ = [ 1/3  0  ]  tile grain g′ = |det(H′⁻¹)| = 9
     [ 0   1/3 ]

Coarser tiles mean fewer tiles, hence fewer time hyperplanes.
Each tile execution phase involves two sub-phases: (a) compute, and (b) communicate results to the others.
How many such phases? P(g), the number of time hyperplanes.
Total execution time: T = P(g) · (T_comp + T_comm), where:
• T_comp = g·t_c: the overall computation time for all g iterations within a tile
• T_comm: the communication cost for sending data to neighboring tiles, T_comm = T_startup + T_transmit
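A back-of-the-envelope C sketch comparing this blocking estimate T = P(g)·(T_comp + T_comm) with the idealized overlapping estimate T ≈ P(g)·max(T_comp, T_comm) discussed later; all numeric parameters below are invented for illustration, not measurements from the paper.

    #include <stdio.h>

    int main(void)
    {
        /* hypothetical machine parameters */
        double t_c        = 0.1e-6;  /* cost of one iteration (s)       */
        double t_startup  = 50e-6;   /* per-message startup latency (s) */
        double t_transmit = 10e-6;   /* per-message transmission (s)    */

        int g = 1000;                /* tile grain (iterations per tile) */
        int P = 64;                  /* number of time hyperplanes P(g)  */

        double T_comp = g * t_c;
        double T_comm = t_startup + t_transmit;

        double T_block = P * (T_comp + T_comm);                    /* blocking    */
        double T_over  = P * (T_comp > T_comm ? T_comp : T_comm);  /* overlapping */

        printf("blocking: %.0f us, overlapping: %.0f us\n",
               T_block * 1e6, T_over * 1e6);
        return 0;
    }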
Mapping along the maximal dimension j1S:
[Figure: a 5×3 tile space; each chain of tiles along j1S is assigned to one processor P1, P2, …]
The final tile is executed at time instance t = 5 + 2×3 + 1 = 12.
The optimal linear schedule is given by Π = [1 2]:
for a tile jS = (j1S, j2S),  t = j1S + 2·j2S
Unit Execution Time / Unit Communication Time (UET-UCT) GRIDS
GRIDS are task graphs with unitary dependencies ONLY!
Assume each supernode is a task: the overlapping tile schedule is then a UET-UCT GRID scheduling problem!
The optimal time schedule for a tile jS = (j1S, j2S, …, jnS) is:
t = jkS + 2·Σ(i=1..n, i≠k) jiS,  where k is the "largest" dimension
We map all tiles along dimension k to the same processor.
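A one-function C sketch of this optimal UET-UCT schedule; k is the dimension whose chain of tiles is mapped to a single processor:

    /* Overlapping (UET-UCT) schedule: t(jS) = 2 * sum_{i != k} jiS + jkS.
     * Moving along dimension k stays on the same processor, so it costs
     * one step; moving across processors costs two (communicate, then
     * compute). */
    int overlap_time(const int jS[], int n, int k)
    {
        int t = 0;
        for (int i = 0; i < n; i++)
            t += (i == k) ? jS[i] : 2 * jS[i];
        return t;
    }

For the 2-D example above with k the maximal dimension, the final tile (5, 3) gives overlap_time = 5 + 2×3 = 11, i.e. the 12th time instance, matching the previous slide.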
Mapping along the non-maximal dimension j2S:
[Figure: the same tile space, with chains of tiles along j2S assigned to processors P1, P2, …]
The final tile is now executed at time instance t = 2×5 + 3 + 1 = 14.
The linear schedule is now given by Π = [2 1]: WORSE than before.
Overlapping schedule
[Figure: the tile space with the two sub-phases of each tile pipelined across processors P1, P2]
Two sub-phases per tile: communication + computation.
Communication for a tile is performed in one time step; its computation in the next.
Blocking (non-overlapping) case:
[Figure: the same tile space scheduled without overlap]
Communication + computation take place within each single time step.
Non-overlapping case
Each time step contains a triplet of receive, compute, send primitives; or, equivalently: compute, then communicate.
There exist times when every processor is only sending or receiving: BAD processor utilization!
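A hedged MPI sketch of one such blocking time step in C; neighbor ranks, tags, buffer sizes, and the compute_tile stub are placeholders, not the authors' actual implementation.

    #include <mpi.h>

    void compute_tile(const double *in, double *out);  /* placeholder tile body */

    /* One blocking step: receive boundary data, compute the whole tile,
     * then send results forward. The CPU idles inside both MPI calls. */
    void blocking_step(double *halo_in, double *halo_out, int count,
                       int prev, int next, MPI_Comm comm)
    {
        MPI_Status st;

        /* prev/next may be MPI_PROC_NULL at the boundary of the tile space */
        MPI_Recv(halo_in, count, MPI_DOUBLE, prev, 0, comm, &st);
        compute_tile(halo_in, halo_out);
        MPI_Send(halo_out, count, MPI_DOUBLE, next, 0, comm);
    }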
Various levels of computation-to-communication overlapping:
[Figure: timing diagrams showing successively deeper overlap between the send/receive and compute phases of a time step]
Overlapping case
Each time step is (ideally) either a compute or a send+receive primitive.
Every processor computes its tile at step k, while receiving the data it will use at step k+1 and sending the data it produced at step k−1.
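The same step with non-blocking MPI, as a minimal sketch of the overlapping schedule; whether the transfers actually proceed in parallel with compute_tile depends on the communication layer and NIC support, which is exactly the issue the next slides raise. Names are again placeholders.

    #include <mpi.h>

    void compute_tile(const double *in, double *out);  /* placeholder tile body */

    /* One overlapped step: post the receive for step k+1 and the send of
     * step k-1's results, compute tile k, then wait for both transfers. */
    void overlapped_step(double *recv_buf, double *send_buf, int count,
                         int prev, int next, MPI_Comm comm,
                         const double *cur_in, double *cur_out)
    {
        MPI_Request req[2];

        MPI_Irecv(recv_buf, count, MPI_DOUBLE, prev, 0, comm, &req[0]);
        MPI_Isend(send_buf, count, MPI_DOUBLE, next, 0, comm, &req[1]);

        compute_tile(cur_in, cur_out);  /* computation overlaps the transfers */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }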
In-depth analysis of a time step
However, there exist unavoidable startup latencies.
[Figure: breakdown of an overlapped time step into computation-side components A1, A2, A3 and communication-side components B1, B2, B3, B4]
Thus the overall time is T = P′(g) · max(A1 + A2 + A3, B1 + B2 + B3 + B4).
Communication Layer Internals
• Buffering + copying from user space to kernel space
• Sending through a syscall + transmitting through the medium
• Startup latency is unavoidable (at the moment!)
• But what about writing to the NIC and transmitting? (At least that is not the process's job but the kernel's; it steals CPU cycles anyway!)