Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures
Uday Bondhugula, Indian Institute of Science
Supercomputing 2013, Nov 16–22, 2013, Denver, Colorado
1. Introduction
2. Distributed-memory code generation
   - The problem, challenges, and past efforts
   - Our approach (Pluto distmem)
3. Experimental Evaluation
4. Conclusions
Distributed-memory compilation

- Manual parallelization for distributed memory is extremely hard, even for affine loop nests.

Objective
- Automatically generate MPI code from sequential C affine loop nests.
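For concreteness, the kind of sequential C input meant here is an affine loop nest: loop bounds and array subscripts are affine functions of the enclosing loop iterators and program parameters. Below is a minimal, hypothetical example of such an input (not taken from the talk); the goal stated above is to emit equivalent MPI code for nests like this automatically.

/* A hypothetical affine loop nest of the kind taken as input: all loop
 * bounds and array subscripts are affine functions of the surrounding
 * iterators (i, j) and the parameter N. */
void seidel_like(int N, double A[N][N])
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            A[i][j] = (A[i - 1][j] + A[i][j - 1] +
                       A[i][j] + A[i][j + 1] + A[i + 1][j]) / 5.0;
}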
Distributed-memory compilation – why again?

- A large amount of literature already exists, going back through the early 1990s. Past works had limited success.
1. Still no automatic tool has been available.
2. However, we now have new polyhedral libraries, transformation frameworks, code generators, and tools.
3. The same techniques are needed to compile for CPU-GPU heterogeneous multicores.
4. They can be integrated with emerging runtimes.
5. Make a fresh attempt to solve this problem.
Why do we need communication?

- Communication during parallelization is a result of data dependences.
- No data dependences ⇒ (almost) no communication.
- A parallel loop implies that no dependences are satisfied by it.
- Communication is due to dependences that are satisfied outside the parallel loop but have non-zero components on it.
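The two C loop nests below (illustrative examples, not from the slides) make the first two bullets concrete; the third bullet is illustrated after the figure on the next slide.

/* Illustrative only. */
#define N 1024
double A[N][N], B[N][N];

void independent(void)
{
    /* No data dependences between iterations: both i and j can be
     * partitioned across processes with (almost) no communication. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = 2.0 * B[i][j];
}

void j_not_parallel(void)
{
    /* The dependence (0,1) -- iteration (i,j) reads A[i][j-1] -- is satisfied
     * by the j loop itself, so j is not parallel; the i loop carries no
     * dependence and can be distributed without communication. */
    for (int i = 0; i < N; i++)
        for (int j = 1; j < N; j++)
            A[i][j] = A[i][j - 1] + B[i][j];
}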
Dependences and Communication

[Figure: 2-D iteration space over (i, j) with dependence edges between iteration points. Caption: Inner parallel loop, j: hyperplane (0,1).]

- The inner loop can be executed in parallel, with communication for each iteration of the outer sequential loop.
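A loop nest matching the figure's dependence pattern (a hypothetical example, not taken from the talk) is a 1-D stencil swept by an outer loop: the outer i loop is sequential, the inner j loop is parallel along hyperplane (0,1), and the dependences (1,-1), (1,0), (1,1) are satisfied by i but have non-zero components along j, so processes owning blocks of j must exchange boundary elements after every i iteration.

#define N 1024
double A[N + 1][N + 2];   /* row i-1 feeds row i; one extra column on each side */

void sweep(void)
{
    for (int i = 1; i <= N; i++)           /* sequential */
        /* Dependences (1,-1), (1,0), (1,1): each j reads three values
         * produced in the previous i iteration.  With the j loop block-
         * distributed, every process needs the boundary elements of its
         * left and right neighbours' blocks before starting iteration i:
         * communication once per outer iteration, as the slide states. */
        for (int j = 1; j <= N; j++)       /* parallel */
            A[i][j] = (A[i - 1][j - 1] + A[i - 1][j] + A[i - 1][j + 1]) / 3.0;
}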
A polyhedral optimizer – various phases

1. Extracting a polyhedral representation (from sequential C)
2. Dependence analysis
3. Transformation and parallelization
4. Code generation (getting back out of the polyhedral representation)

[Figure: example C loop nests (a two-deep i/j nest and a three-deep i/j/k nest) with statements S1–S5, their iteration domains (0 <= i <= N-1, 0 <= j <= N-1, 0 <= k <= N-1), affine hyperplanes φ(x) = k, φ(x) >= k, φ(x) <= k, and the corresponding low-level generated code.]
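To make phase 1 concrete, here is a sketch of what the extractor records for nests like those in the figure. The loop structure and bounds follow the figure; the statement bodies are made up for illustration and the notation is the standard polyhedral one, not necessarily the exact formulation on the slide.

/* Sketch of phase 1 (polyhedral extraction); statement bodies are hypothetical. */
void example(int N, double A[N][N], double B[N][N])
{
    /* Statement S1: iteration domain { (i,j) | 0 <= i <= N-1, 0 <= j <= N-1 } */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = B[i][j];                    /* S1 (hypothetical body) */

    /* Statement S2: iteration domain { (i,j,k) | 0 <= i,j,k <= N-1 }.
     * Phase 3 searches for affine hyperplanes phi(i,j,k) = c1*i + c2*j + c3*k
     * such that phi(target) - phi(source) >= 0 for every dependence (validity);
     * a hyperplane for which this difference is 0 on all dependences it must
     * preserve can be run in parallel. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                A[i][j] += B[i][k] * B[k][j];     /* S2 (hypothetical body) */
}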
Distributed-memory parallelization

Involves a number of sub-problems:
1. Finding the right computation partitioning
2. Data distribution and data allocation (weak scaling)
3. Determining communication sets, given the above
4. Packing and unpacking data
5. Determining communication partners, given the above
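The MPI skeleton below (a hypothetical structure, not the code Pluto distmem actually emits) shows how these sub-problems surface for the earlier stencil sweep: the parallel j loop is block-distributed (sub-problems 1–2), and after each outer iteration every rank's boundary values (communication set, sub-problem 3) are exchanged with its left and right neighbours (communication partners, sub-problem 5). Packing/unpacking (sub-problem 4) is trivial here because each boundary is a single element; for multi-dimensional sets the values would be copied through a buffer.

/* Hypothetical MPI skeleton for the 1-D stencil sweep shown earlier. */
#include <mpi.h>
#include <stdlib.h>

#define N 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Computation partitioning and data allocation: each rank owns a
     * contiguous block of j plus one halo cell on each side. */
    int blk = (N + nprocs - 1) / nprocs;
    int lo = rank * blk + 1;
    int hi = (rank + 1) * blk < N ? (rank + 1) * blk : N;
    double *prev = calloc(blk + 2, sizeof *prev);   /* previous i plane + halos */
    double *cur  = calloc(blk + 2, sizeof *cur);

    for (int i = 1; i <= N; i++) {                  /* outer sequential loop */
        /* Communication sets: the block-boundary elements of prev.
         * Communication partners: ranks rank-1 and rank+1. */
        MPI_Request req[4];
        int nreq = 0;
        if (rank > 0) {
            MPI_Isend(&prev[1], 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &req[nreq++]);
            MPI_Irecv(&prev[0], 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &req[nreq++]);
        }
        if (rank < nprocs - 1) {
            MPI_Isend(&prev[hi - lo + 1], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &req[nreq++]);
            MPI_Irecv(&prev[hi - lo + 2], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &req[nreq++]);
        }
        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

        for (int j = lo; j <= hi; j++) {            /* local part of the parallel loop */
            int lj = j - lo + 1;
            cur[lj] = (prev[lj - 1] + prev[lj] + prev[lj + 1]) / 3.0;
        }
        double *tmp = prev; prev = cur; cur = tmp;  /* advance to the next i plane */
    }

    free(prev);
    free(cur);
    MPI_Finalize();
    return 0;
}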
Distributed-memory code generation
- The problem, challenges, and past efforts
- Our approach (Pluto distmem)
The problem, challenges, and past efforts

Distributed-memory code generation: what to send? Whom to send to?

Difficulties
- For non-uniform dependences, it is not known how far dependences traverse.
- The number of iterations (or tiles) is not known at compile time.
- The number of processors may not be known at compile time (portability).
- Virtual-to-physical processor approach: are you sending to two virtual processors that are the same physical processor?
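A hedged illustration of the last difficulty (the mapping and helper names are hypothetical): when virtual processors, e.g. tile columns, outnumber physical MPI ranks and are mapped cyclically, two distinct virtual processors can land on the same rank, so generated code must detect self-sends and avoid counting the same physical partner twice, and it can only do so at runtime because the number of ranks is unknown at compile time.

/* Hypothetical virtual-to-physical processor mapping. */

/* Cyclic mapping: virtual processor v -> physical MPI rank. */
static int phys(int v, int nprocs) { return v % nprocs; }

/* Does a message from virtual sender vs to virtual receiver vr need an
 * actual MPI transfer, or is it a local copy because both virtual
 * processors map to the same physical rank? */
static int needs_mpi_transfer(int vs, int vr, int nprocs)
{
    return phys(vs, nprocs) != phys(vr, nprocs);
}

/* Example: with 4 ranks, virtual processors 1 and 5 are both rank 1, so a
 * "send" from 1 to 5 must become a local buffer copy, and two virtual
 * receivers on one rank must not be treated as two incoming messages. */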