
Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures



  1. Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures. Tal Ben-Nun, Johannes de Fine Licht, Alexandros-Nikolaos Ziogas, Timo Schneider, Torsten Hoefler (SPCL, ETH Zürich). This project has received funding from the European Research Council (ERC) under grant agreement "DAPP" (PI: T. Hoefler).

  2. Motivation (slide courtesy of NVIDIA).

  3. Source: US DoE computational scientist.

  4. The two roles: Domain Scientist and Performance Engineer.

  5. Optimization Techniques
     ▪ Multi-core CPU: tiling for complex cache hierarchies; register optimizations; vectorization
     ▪ Many-core GPU: coalesced memory access; warp divergence minimization, register tiling; task fusion
     ▪ FPGA: maximize resource utilization (logic units, DSPs); streaming optimizations, pipelining; explicit buffering (FIFO) and wiring

  6. DaCe Overview (system diagram): the domain scientist expresses the problem formulation in a scientific frontend (Python, DSLs, TensorFlow, MATLAB, ...); it is lowered to the data-centric intermediate representation (SDFG); the performance engineer applies graph transformations to produce a transformed dataflow, guided by hardware information; the compiler emits CPU and GPU binaries and FPGA modules, and the runtime reports performance results back.

  7. Dataflow Programming in DaCe: a tasklet computing z = y² + sin(y), connected to data containers by memlets that carry the data in and out.
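     A minimal sketch of how such a tasklet could be written in DaCe's explicit Python frontend. This is illustrative rather than taken from the slides: it assumes the with dace.tasklet: syntax and that math.sin is allowed inside tasklet code.
       import math
       import dace

       @dace.program
       def compute_z(y: dace.float64[1], z: dace.float64[1]):
           with dace.tasklet:
               yin << y[0]                       # input memlet
               zout >> z[0]                      # output memlet
               zout = yin * yin + math.sin(yin)  # tasklet body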

  8. Parallel Dataflow Programming: conceptually, N independent tasklets, one per element, each reading A[i] and writing B[i] for i = 0, ..., N-1.

  9. Parallel Dataflow Programming: a map scope [i=0:N] collapses the N replicated tasklets into a single parametric subgraph; the map entry takes A[0:N] and feeds A[i] to the tasklet, and the map exit collects the B[i] results into B[0:N].
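     The same pattern in the explicit Python frontend, using the @dace.map syntax that appears later in the talk (slides 21 and 22); the names and the squaring operation are illustrative:
       import dace

       N = dace.symbol('N')

       @dace.program
       def elementwise(A: dace.float64[N], B: dace.float64[N]):
           @dace.map
           def square(i: _[0:N]):
               a << A[i]      # memlet into the map scope
               b >> B[i]      # memlet out of the map scope
               b = a * a      # tasklet, replicated for every i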

  10. Stateful Parallel Dataflow Programming: two mapped computations, one reading A[0:N] and writing B[0:N], another reading C[0:N] and writing back into A[0:N].

  11. Stateful Parallel Dataflow Programming: the two mapped computations are placed in separate states, s0 and s1, connected by a state transition edge that enforces their ordering.
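     A hypothetical program shape that would yield such a two-state SDFG. The assumption here (not stated on the slide) is that the frontend places the second assignment in a new state because it overwrites A, which the first state reads:
       import dace

       N = dace.symbol('N')

       @dace.program
       def two_states(A: dace.float64[N], B: dace.float64[N], C: dace.float64[N]):
           B[:] = A + 1.0   # state s0: map over A, writing B
           A[:] = C * 2.0   # state s1: map over C, overwriting A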

  12. Example: 2D Stencil (Jacobi iteration). State s0 initializes B and sets t = 0. While t < T, the program alternates between two states: one maps [y=0:H, x=0:W] over A, reading the four neighbors A[y,x-1], A[y,x+1], A[y-1,x], A[y+1,x] and writing B[y,x]; the other performs the same update from B back into A, with t incremented on the transition. The loop exits when t ≥ T.
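     A rough sketch of what such a stencil could look like in the explicit frontend. It simplifies the slide's SDFG under labeled assumptions: boundaries are skipped by mapping only over the interior, and instead of alternating the roles of A and B, the result is copied back each iteration.
       import dace

       H = dace.symbol('H')
       W = dace.symbol('W')
       T = dace.symbol('T')

       @dace.program
       def jacobi(A: dace.float64[H, W], B: dace.float64[H, W]):
           for t in range(T):
               @dace.map
               def stencil(y: _[1:H-1], x: _[1:W-1]):
                   n << A[y-1, x]
                   s << A[y+1, x]
                   w << A[y, x-1]
                   e << A[y, x+1]
                   out >> B[y, x]
                   out = 0.25 * (n + s + w + e)
               A[1:H-1, 1:W-1] = B[1:H-1, 1:W-1]   # copy back instead of swapping buffers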

  13. Meet the Nodes
     ▪ State: state machine element
     ▪ Tasklet: fine-grained computational block
     ▪ Array: N-dimensional data container
     ▪ Map (entry/exit): parametric graph abstraction for parallelism
     ▪ Stream: streaming data container
     ▪ Consume (entry/exit): dynamic mapping of computations on streams
     ▪ Conflict Resolution: defines behavior during conflicting writes

  14. Meet the Nodes (example): the node types combined in one filtering SDFG. In state s0, array A[0:N] enters a map [i=0:N] that feeds A[i] to a Filter tasklet; accepted elements are pushed into a stream S, which is drained into the output array B[0:N], while the number of accepted elements is accumulated into Bsize with a sum conflict resolution, Bsize(+).
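     A small sketch of the conflict-resolution mechanism alone (streams and consume scopes are omitted). The write target is annotated with a resolution function, so concurrent writes from different map iterations are combined instead of racing. The counting example and all names are illustrative, and it assumes conditional expressions are allowed in tasklet code:
       import dace

       N = dace.symbol('N')

       @dace.program
       def count_positive(A: dace.float64[N], count: dace.uint32[1]):
           @dace.map
           def filter_count(i: _[0:N]):
               a << A[i]
               # conflict-resolved write: concurrent contributions are summed
               out >> count(1, lambda x, y: x + y)[0]
               out = 1 if a > 0 else 0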

  15. Hierarchical Parallelism and Heterogeneity
     ▪ Maps have schedules, arrays have storage locations (a sketch of setting these follows below)
     Example SDFG: A[0:N] enters an outer map [i=0:N:TN] with a CPU (multicore) schedule, which passes the tile A[i:i+TN] to an inner map [ti=0:TN] with a per-core schedule, whose tasklet computes out = in_A * in_A on A[i+ti].
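     Schedules and storage are properties on the SDFG nodes. A hedged sketch of setting a map schedule through the SDFG builder API; it assumes the add_mapped_tasklet convenience call and the string-based Memlet constructor, and builds only a single (non-hierarchical) map for brevity:
       import dace

       N = dace.symbol('N')

       sdfg = dace.SDFG('square')
       sdfg.add_array('A', [N], dace.float64)
       sdfg.add_array('C', [N], dace.float64)
       state = sdfg.add_state('s0')

       # The map's schedule targets the multicore CPU, so the code generator
       # emits an OpenMP parallel loop for this scope.
       state.add_mapped_tasklet(
           'square', dict(i='0:N'),
           inputs={'in_A': dace.Memlet('A[i]')},
           outputs={'out': dace.Memlet('C[i]')},
           code='out = in_A * in_A',
           schedule=dace.ScheduleType.CPU_Multicore,
           external_edges=True)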

  16. Hierarchical Parallelism and Heterogeneity
     ▪ Maps have schedules, arrays have storage locations
     The tiled SDFG (A → outer map [i=0:N:TN], CPU schedule → transient tA[0:TN] → inner map [ti=0:TN], core schedule → tasklet) generates the following CPU code:
       // ...
       #pragma omp parallel for
       for (int i = 0; i < N; i += TN) {
           vec<double, 4> tA[TN];
           Global2Stack_1D<double, 4, 1>(&A[i], min(N - i, TN), tA);
           for (int ti = 0; ti < TN; ti += 1) {
               vec<double, 4> in_A = tA[ti];
               auto out = (in_A * in_A);
               tC[ti] = out;
           }
       }

  17. Hierarchical Parallelism and Heterogeneity: the same SDFG with the transient tile made explicit: A[0:N] → outer map [i=0:N:TN] (CPU schedule) → transient tA[0:TN] holding A[i:i+TN] → inner map [ti=0:TN] (core schedule) → tasklet out = in_A * in_A.

  18. Hierarchical Parallelism and Heterogeneity: the same SDFG mapped to a GPU. A is copied to the global-memory array gA, the outer map [i=0:N:TN] receives the GPU device schedule, the tile tA is placed in shared memory, and the inner map [ti=0:TN] receives the GPU block schedule. Generated CUDA code:
       __global__ void multiplication_1(...) {
           int i = blockIdx.x * TN;
           int ti = threadIdx.y + 0;
           if (i + ti >= N) return;
           __shared__ vec<double, 2> tA[TN];
           GlobalToShared1D<double, 2, TN, 1, 1, false>(gA, tA);
           vec<double, 2> in_A = tA[ti];
           auto out = (in_A * in_A);
           tC[ti] = out;
       }

  19. Hardware Mapping: Load/Store Architectures
     ▪ Recursive code generation (C++, CUDA); control flow via construct detection and gotos
     ▪ Parallelism: multi-core CPU via OpenMP, atomics, and threads; GPU via CUDA kernels and streams; connected components run concurrently
     ▪ Memory and interaction with accelerators: array-array edges create intra-/inter-device copies

  20. Mapping to Reconfigurable Hardware
     ▪ Module generation with HDL and HLS: Xilinx SDAccel; Intel FPGA (experimental)
     ▪ Parallelism: exploiting temporal locality with pipelines; exploiting spatial locality with vectorization and replication
     ▪ Replication enables parametric systolic array generation

  21. Data-centric Parallel Programming for Python
     ▪ Programs are integrated within existing codes: in Python, as decorated functions in existing code; in MATLAB, as separate .m files; in TensorFlow, by taking an existing graph
     ▪ In Python, dataflow can be implicit or explicit. Implicit (numpy syntax):
       @dace.program
       def program_numpy(A, B):
           B[:] = np.transpose(A)
     Explicit (enforces decoupling of memory access from computation):
       @dace.program
       def program_explicit(A, B):
           @dace.map
           def transpose(i: _[0:N], j: _[0:M]):
               a << A[i,j]
               b >> B[j,i]
               b = a
     ▪ Output is compatible with existing programs: a C-compatible SO/DLL file with an autogenerated include file (a usage sketch follows below)
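     A hedged sketch of how such a program is then used from regular Python. It assumes that calling the decorated function JIT-compiles and runs the SDFG, and that .compile() produces a reusable compiled program; argument names and sizes are illustrative:
       import numpy as np
       import dace

       N = dace.symbol('N')
       M = dace.symbol('M')

       @dace.program
       def program_numpy(A: dace.float64[N, M], B: dace.float64[M, N]):
           B[:] = np.transpose(A)

       A = np.random.rand(64, 32)
       B = np.zeros((32, 64))
       program_numpy(A, B)                  # compiles the SDFG and runs it in-process
       # compiled = program_numpy.compile() # or build once and reuse the shared library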

  22. Matrix Multiplication SDFG
       @dace.program
       def gemm(A: dace.float64[M, K], B: dace.float64[K, N], C: dace.float64[M, N]):
           # Transient variable
           tmp = np.ndarray([M, N, K], dtype=A.dtype)

           @dace.map
           def multiplication(i: _[0:M], j: _[0:N], k: _[0:K]):
               in_A << A[i,k]
               in_B << B[k,j]
               out >> tmp[i,j,k]
               out = in_A * in_B

           dace.reduce(lambda a, b: a + b, tmp, C, axis=2)
     Resulting SDFG (state s0): A[0:M,0:K] and B[0:K,0:N] feed a map [i=0:M, j=0:N, k=0:K] whose tasklet writes tmp[i,j,k]; a Reduce node [axis: 2, sum] then collapses tmp[0:M,0:N,0:K] into C[0:M,0:N].

  23. Matrix Multiplication SDFG (after transformation): the same gemm program, but in the SDFG the map's tasklet now writes directly into C with a sum conflict resolution, C(+)[i,j], so the transient tmp and the separate Reduce node disappear.
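     For completeness, a small hedged usage check of the gemm program against numpy; sizes are arbitrary, and the symbols M, K, N are assumed to be inferred from the argument shapes:
       import numpy as np

       # gemm as defined on slide 22
       A = np.random.rand(64, 32)
       B = np.random.rand(32, 48)
       C = np.zeros((64, 48))
       gemm(A, B, C)
       assert np.allclose(C, A @ B)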

  24. MapReduceFusion Transformation: a graph-rewriting rule over pattern variables ($A, $B, $ar, $br, $REDUCE). The pattern matches a tasklet (my_tasklet) writing a transient array $A that feeds a $REDUCE node into $B; the rewrite removes the transient and the reduce node and makes the tasklet write directly to $B[$br], using the reduction operator as the write's conflict resolution.
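     A sketch of how such a transformation might be applied programmatically; it assumes the to_sdfg() and apply_transformations() entry points and that MapReduceFusion lives in dace.transformation.dataflow:
       from dace.transformation.dataflow import MapReduceFusion

       sdfg = gemm.to_sdfg()                                # SDFG of the slide-22 program
       applied = sdfg.apply_transformations(MapReduceFusion)
       print(f'MapReduceFusion applied {applied} time(s)')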

  25. Programming Model Challenges
     ▪ Indirect memory access (sketched below): e.g., a sparse matrix-vector product, where an inner map [j=prow[0]:prow[1]] reads col[j] and val[j], uses col[j] to index into x through a dynamic memlet x(1)[:], and accumulates into b[i] with a sum conflict resolution
     ▪ Nested state machines: e.g., a map over an image [y=0:H, x=0:W] whose body is itself a small state machine (init, a data-dependent iteration with condition y³ + z³ < 5 and i = i + 1, and an update writing image[y,x])
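     A hedged sketch of the indirect-access pattern in the explicit frontend, loosely following the slide's sparse matrix-vector structure. The nested @dace.mapscope/@dace.map decorators, the dynamic-volume memlet x(1)[:], and all names here are assumptions for illustration:
       import dace

       H = dace.symbol('H')
       W = dace.symbol('W')
       nnz = dace.symbol('nnz')

       @dace.program
       def spmv(prow: dace.uint32[H + 1], col: dace.uint32[nnz],
                val: dace.float64[nnz], x: dace.float64[W], b: dace.float64[H]):
           @dace.mapscope
           def rows(i: _[0:H]):
               @dace.map
               def row_product(j: _[prow[i]:prow[i + 1]]):
                   v << val[j]
                   c << col[j]
                   xin << x(1)[:]    # dynamic (single-access) memlet over all of x: the indirection
                   out >> b(1, lambda p, q: p + q)[i]   # sum conflict resolution
                   out = v * xin[c]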

  26. DIODE: the Data-centric Integrated Optimization Development Environment.

  27. DIODE: the IDE shows, side by side, the (malleable) source code, the SDFG, the available transformations, SDFG properties, the generated code, and the transformation history.

  28.–33. Performance: starting from the naïve SDFG, transformations are applied one at a time, each slide adding one step to the stack and showing the resulting performance: MapReduceFusion, then LoopReorder, then BlockTiling, then RegisterTiling, then LocalStorage.
