Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures
Tal Ben-Nun, Johannes de Fine Licht, Alexandros-Nikolaos Ziogas, Timo Schneider, Torsten Hoefler
SPCL, ETH Zürich (spcl.inf.ethz.ch, @spcl_eth)
This project has received funding from the European Research Council (ERC) under grant agreement "DAPP" (PI: T. Hoefler).
Motivation
[Slide courtesy of NVIDIA.]
[Figure: the role of the Computational Scientist. Source: US DoE.]
[Figure: separation of concerns between the Domain Scientist and the Performance Engineer.]
Optimization Techniques
▪ Multi-core CPU
  ▪ Tiling for complex cache hierarchies
  ▪ Register optimizations
  ▪ Vectorization
▪ Many-core GPU
  ▪ Coalesced memory access
  ▪ Warp divergence minimization, register tiling
  ▪ Task fusion
▪ FPGA
  ▪ Maximize resource utilization (logic units, DSPs)
  ▪ Streaming optimizations, pipelining
  ▪ Explicit buffering (FIFO) and wiring
DaCe Overview
[Figure: the Domain Scientist formulates the problem (e.g., a PDE such as ∂v/∂t − α∇²v = 0) in Python, DSLs, TensorFlow, MATLAB, …; the compiler lowers it to the data-centric intermediate representation (SDFG); the Performance Engineer reshapes the dataflow through graph transformations; the generated CPU/GPU binaries and FPGA modules run on the target system, and the runtime feeds performance results back.]
Dataflow Programming in DaCe
[Figure: data containers x and y connect through memlets to a tasklet computing z = x² + sin y; memlets describe all data movement, tasklets the computation.]
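For concreteness, such a tasklet with explicit memlets can be written in DaCe's Python frontend roughly as follows. This is a minimal sketch using the `with dace.tasklet` form from current DaCe; the slide's formula is partly garbled, so z = x² + sin(y) is an assumption:

    import math
    import dace

    @dace.program
    def example(x: dace.float64[1], y: dace.float64[1], z: dace.float64[1]):
        with dace.tasklet:
            in_x << x[0]   # memlet: read x
            in_y << y[0]   # memlet: read y
            out >> z[0]    # memlet: write z
            out = in_x * in_x + math.sin(in_y)

The << / >> memlet annotations are what decouples data movement from computation: the tasklet body touches only its local connectors.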
Parallel Dataflow Programming
[Figure (build-up): array A fans out to N independent tasklets, one per element A[0], A[1], …, A[N-1], each writing B[i] into array B. The same computation is then folded into a map scope [i=0:N]: a single parametric tasklet reads A[i] and writes B[i], with the map entry carrying A[0:N] in and the map exit collecting B[0:N].]
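In code, the mapped version can be sketched like this, using the explicit map syntax that appears later in the talk (the squaring tasklet body is an assumption for illustration):

    import dace

    N = dace.symbol('N')

    @dace.program
    def square(A: dace.float64[N], B: dace.float64[N]):
        @dace.map
        def compute(i: _[0:N]):
            a << A[i]    # one element per map iteration
            b >> B[i]
            b = a * a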
Stateful Parallel Dataflow Programming
[Figure: the computation is split across two states. In state s0, a map [i=0:N] reads A[i] and writes B[i]; in state s1, a second map [i=0:N] reads C[i] and writes A[i]. The state edge from s0 to s1 makes the write-after-read dependency on A explicit.]
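Consecutive statements that touch the same container are what give rise to such states. A sketch in the implicit (numpy) syntax, with made-up tasklet bodies:

    import dace

    N = dace.symbol('N')

    @dace.program
    def two_states(A: dace.float64[N], B: dace.float64[N], C: dace.float64[N]):
        B[:] = A * A    # lowered into state s0
        A[:] = C + 1.0  # lowered into state s1; the s0 -> s1 edge enforces
                        # the write-after-read dependency on A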
Example: 2D Stencil
[Figure: a three-state SDFG. State s0 (t=0) initializes B over [y=0:H, x=0:W] with an empty input memlet (∅). State s1 performs a Jacobi step: a map [y=0:H, x=0:W] reads A[y,x-1], A[y,x+1], A[y-1,x], A[y+1,x] and writes B[y,x], followed by the symmetric step from B back into A. The loop edge iterates with t++ while t < T; once t ≥ T, control exits the loop.]
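A corresponding program could be sketched as follows. Assumptions: a 4-point, 0.25-weighted Jacobi update and symbolic sizes H, W, T; the slide shows neither the tasklet body nor the boundary handling:

    import dace

    H, W, T = dace.symbol('H'), dace.symbol('W'), dace.symbol('T')

    @dace.program
    def stencil2d(A: dace.float64[H, W], B: dace.float64[H, W]):
        for t in range(T):
            # first half of state s1: Jacobi step A -> B (interior only)
            B[1:H-1, 1:W-1] = 0.25 * (A[1:H-1, 0:W-2] + A[1:H-1, 2:W]
                                      + A[0:H-2, 1:W-1] + A[2:H, 1:W-1])
            # second half of state s1: Jacobi step B -> A
            A[1:H-1, 1:W-1] = 0.25 * (B[1:H-1, 0:W-2] + B[1:H-1, 2:W]
                                      + B[0:H-2, 1:W-1] + B[2:H, 1:W-1])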
Meet the Nodes
▪ State – state machine element
▪ Tasklet – fine-grained computational block
▪ Array – N-dimensional data container
▪ Map (entry/exit) – parametric graph abstraction for parallelism
▪ Stream – streaming data container
▪ Consume (entry/exit) – dynamic mapping of computations on streams
▪ Conflict Resolution – defines behavior during conflicting writes
Meet the Nodes: Example
[Figure: in state s0, a map [i=0:N] reads A[i] into a Filter tasklet; accepted elements are pushed into stream S, and the element count is accumulated into Bsize through a conflict-resolution memlet Bsize(+). A consume scope drains S, and the stream's contents end up in B[0:N].]
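This is essentially the predicate-based filter example from the DaCe paper; a sketch adapted to the slide's names (S, B, Bsize), assuming dace.define_stream and the paper's dynamic-memlet notation, where -1 denotes a dynamic volume:

    import dace

    N = dace.symbol('N')

    @dace.program
    def filter(A: dace.float32[N], B: dace.float32[N],
               Bsize: dace.uint32[1], ratio: dace.float32):
        S = dace.define_stream(dace.float32, N)

        @dace.map
        def f(i: _[0:N]):
            a << A[i]
            r << ratio
            out >> S(-1)                          # dynamic-volume push into the stream
            sz >> Bsize(-1, lambda x, y: x + y)   # conflict resolution: sum
            if a > r:
                out = a
                sz = 1

        S >> B    # the stream's contents are flushed into B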
Hierarchical Parallelism and Heterogeneity
▪ Maps have schedules, arrays have storage locations
[Figure: an outer CPU-scheduled map [i=0:N:TN] partitions A[0:N] into tiles A[i:i+TN]; an inner Core-scheduled map [ti=0:TN] feeds each element A[i+ti] into the tasklet out = in_A * in_A.]
Generated code for the CPU schedule (a transient tA[0:TN] sits between the two map scopes):

    // ...
    #pragma omp parallel for
    for (int i = 0; i < N; i += TN) {
        vec<double, 4> tA[TN];
        Global2Stack_1D<double, 4, 1>(&A[i], min(N - i, TN), tA);
        for (int ti = 0; ti < TN; ti += 1) {
            vec<double, 4> in_A = tA[ti];
            auto out = (in_A * in_A);
            tC[ti] = out;
        }
    }
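Such a tiled hierarchy is not written by hand; it is produced by a graph transformation. A sketch using today's DaCe API, applied to the square program from above (class and option names are taken from the current library and are an assumption relative to the talk):

    from dace.transformation.dataflow import MapTiling

    sdfg = square.to_sdfg()
    sdfg.apply_transformations(MapTiling, options={'tile_sizes': (64,)})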
Hierarchical Parallelism and Heterogeneity: GPU Mapping
[Figure: the same dataflow retargeted to a GPU. A is copied into the global-memory array gA; a GPU_Device-scheduled map [i=0:N:TN] and a GPU_Block-scheduled map [ti=0:TN] replace the CPU scopes, and the transient tA[0:TN] is placed in shared memory.]

    __global__ void multiplication_1(...) {
        int i = blockIdx.x * TN;
        int ti = threadIdx.y + 0;
        if (i + ti >= N) return;
        __shared__ vec<double, 2> tA[TN];
        GlobalToShared1D<double, 2, TN, 1, 1, false>(gA, tA);
        vec<double, 2> in_A = tA[ti];
        auto out = (in_A * in_A);
        tC[ti] = out;
    }
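Retargeting the SDFG from the CPU example is likewise a transformation rather than a rewrite. A sketch (the method name comes from the current DaCe API, an assumption relative to the talk):

    sdfg.apply_gpu_transformations()   # move arrays to GPU global memory,
                                       # reschedule maps as GPU_Device/GPU_Block
    compiled = sdfg.compile()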
Mapping to Reconfigurable Hardware
▪ Module generation with HDL and HLS
  ▪ Xilinx SDAccel
  ▪ Intel FPGA (experimental)
▪ Parallelism
  ▪ Exploiting temporal locality: pipelines
  ▪ Exploiting spatial locality: vectorization, replication
▪ Replication enables parametric systolic array generation
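In the same spirit, an SDFG can be wrapped for FPGA code generation; a sketch using a transformation from the current DaCe library (an assumption relative to the talk):

    from dace.transformation.interstate import FPGATransformSDFG

    sdfg.apply_transformations(FPGATransformSDFG)   # offload the program to the FPGA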
Data-Centric Parallel Programming for Python
▪ Programs are integrated within existing codes
  ▪ In Python: integrated functions in existing code
  ▪ In MATLAB: separate .m files
  ▪ In TensorFlow: takes an existing graph
▪ In Python: implicit and explicit dataflow (both shown below)
  ▪ Implicit: numpy syntax
  ▪ Explicit: enforces decoupling of memory access from computation
▪ Output is compatible with existing programs
  ▪ C-compatible SO/DLL file with an autogenerated include file

    @dace.program
    def program_numpy(A, B):
        B[:] = np.transpose(A)

    @dace.program
    def program_explicit(A, B):
        @dace.map
        def transpose(i: _[0:N], j: _[0:M]):
            a << A[i, j]
            b >> B[j, i]
            b = a
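Calling such a program from Python might then look as follows (a sketch; the first call compiles the SDFG, and symbol values such as N and M are deduced from the array shapes):

    import numpy as np

    A = np.random.rand(24, 32)
    B = np.zeros((32, 24))
    program_numpy(A, B)            # compile once, then run
    assert np.allclose(B, A.T)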
Matrix Multiplication SDFG
[Figure, state s0: a map [i=0:M, j=0:N, k=0:K] reads A[i,k] and B[k,j]; the multiplication tasklet writes tmp[i,j,k] into the transient array tmp[0:M,0:N,0:K]; a Reduce node [axis: 2, sum] then produces C[0:M,0:N].]

    @dace.program
    def gemm(A: dace.float64[M, K], B: dace.float64[K, N],
             C: dace.float64[M, N]):
        # Transient variable
        tmp = np.ndarray([M, N, K], dtype=A.dtype)

        @dace.map
        def multiplication(i: _[0:M], j: _[0:N], k: _[0:K]):
            in_A << A[i, k]
            in_B << B[k, j]
            out >> tmp[i, j, k]
            out = in_A * in_B

        dace.reduce(lambda a, b: a + b, tmp, C, axis=2)
Matrix Multiplication SDFG (after MapReduceFusion)
[Figure, state s0: the same map now writes directly to C through a conflict-resolution memlet C(+)[i,j]; the transient tmp and the Reduce node are gone. The Python source (the gemm program above) is unchanged.]
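A quick sanity check of gemm against numpy (illustrative; M, N, K are deduced from the array shapes):

    import numpy as np

    A = np.random.rand(64, 32)
    B = np.random.rand(32, 48)
    C = np.zeros((64, 48))
    gemm(A, B, C)
    assert np.allclose(C, A @ B)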
MapReduceFusion Transformation
[Figure: pattern matching on the subgraph "map tasklet → transient array $A → $REDUCE → $B": the tasklet's output memlet $A[$ar] is redirected to write $B[$br] with the reduction operator as its conflict resolution, and the transient $A and the Reduce node are deleted.]
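MapReduceFusion ships with DaCe; applying it to the gemm SDFG programmatically is a one-liner (call sketch based on the current API):

    from dace.transformation.dataflow import MapReduceFusion

    sdfg = gemm.to_sdfg()
    sdfg.apply_transformations(MapReduceFusion)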
Programming Model Challenges
[Figure, left — indirect memory access (sparse matrix-vector multiplication): a map [j=prow[0]:prow[1]] reads col[j] and val[j], resolves the indirect read x[col[j]] via the dynamic memlet x(1)[:] through init/indirection/update steps, and accumulates into b[i] with a sum conflict resolution over b[:].]
[Figure, right — nested state machines: inside a map over [y=0:H, x=0:W], each element of image runs a data-dependent loop (i = 0; iterate while y³ + z³ < 5 with i = i + 1) before the multiply tasklet writes image[y,x].]
DIODE (or: Data-centric Integrated Optimization Development Environment)
[Screenshot: the IDE shows panels for the Source Code, the (malleable) SDFG, available Transformations, SDFG Properties, the Generated Code, and the Transformation History.]
Performance
[Animation: starting from the naïve SDFG, transformations are applied one at a time — MapReduceFusion, LoopReorder, BlockTiling, RegisterTiling, LocalStorage, PromoteTransient — each step improving performance.]
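In DIODE these steps are applied interactively and recorded in the transformation history; the same sequence can be replayed from a script. A sketch (the talk's transformation names do not all map one-to-one onto classes in today's DaCe, so the list below is illustrative):

    from dace.transformation.dataflow import (MapReduceFusion, MapTiling,
                                              InLocalStorage)

    sdfg = gemm.to_sdfg()
    # e.g., fuse the map with the reduction, tile for cache blocking,
    # then cache inputs in local storage:
    sdfg.apply_transformations([MapReduceFusion, MapTiling, InLocalStorage])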
Performance
[Plot: DaCe vs. Intel MKL and OpenBLAS — initially a 25% difference from MKL; with tuning, DaCe reaches 98.6% of MKL performance.]
Evaluation Setup
▪ Hardware: Intel Xeon E5-2650 v4 (CPU), NVIDIA Tesla P100 (GPU), Xilinx VU9P (FPGA)
▪ SDFG compared against:
  ▪ General compilers: GCC 8, Clang 6, icc 18, NVCC 9.2, SDAccel
  ▪ Polyhedral optimizers: Polly 6, Pluto 0.11.4, PPCG 0.8
  ▪ Frameworks & libraries: HPX, Halide, Intel MKL, CUBLAS, CUSPARSE, CUTLASS, CUB