

  1. Lift: a Data-Parallel Language for High-Performance Parallel Pattern Code Generation
     Christophe Dubach, SCALEW, Cambridge, 14th July 2016
     Michel Steuwer (Postdoc), Thibaut Lutz (former Postdoc, now at Nvidia), Toomas Remmelg (PhD student), ...

  2. Big Data → Big Computers. Big Computers → Accelerators: GPU, FPGA, CPU/GPU.

  3. Top 500 with parallel accelerators

  4. Top 500 with parallel accelerators: increasing

  5. Top 500 with parallel accelerators: new ones appearing regularly

  6. Top 500 with parallel accelerators: difficult to program, difficult to achieve high performance, and a moving target.

  7. Optimising for accelerators is hard. Example: parallel array sum on a GPU.

  8. Tree-based parallel array sum: [5 2 4 1 8 17 1 4] → [7 5 25 5] → [12 30] → [42]
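The tree on this slide can be sketched sequentially; each loop iteration below models one parallel round (a hypothetical Python illustration, not part of the slides):

```python
def tree_sum(xs):
    """Tree-based array sum: each round adds adjacent pairs, halving the
    array until one value remains (assumes len(xs) is a power of two)."""
    while len(xs) > 1:
        # one parallel round: element i holds the sum of pair (2i, 2i+1)
        xs = [xs[2 * i] + xs[2 * i + 1] for i in range(len(xs) // 2)]
    return xs[0]

# the slide's example: [5, 2, 4, 1, 8, 17, 1, 4]
# rounds: [7, 5, 25, 5] -> [12, 30] -> [42]
```

On a GPU each round would run in parallel, one thread per pair, which is exactly where the memory-access and thread-mapping choices of the next slides come in.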

  9. Memory accesses (figure: three global-memory access patterns for the 8-element array sum, with thread ids shown per access — Coalesced: good for GPU; Compact: good for caches; Naive: bad for caches)

  10. Thread mapping (figure: three thread-to-work mappings for the 8-element array sum — Fine, Coarse, and Mix — with thread ids shown per addition)

  11. Basic Implementation vs. Fully Optimised Implementation (Nvidia)

      Basic implementation:

        kernel void reduce(global float* g_idata, global float* g_odata,
                           unsigned int n, local float* l_data) {
          unsigned int tid = get_local_id(0);
          unsigned int i = get_global_id(0);
          l_data[tid] = (i < n) ? g_idata[i] : 0;
          barrier(CLK_LOCAL_MEM_FENCE);
          for (unsigned int s = 1; s < get_local_size(0); s *= 2) {
            if ((tid % (2*s)) == 0) {
              l_data[tid] += l_data[tid + s];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
          }
          if (tid == 0)
            g_odata[get_group_id(0)] = l_data[0];
        }

      Fully optimised implementation (Nvidia):

        kernel void reduce(global float* g_idata, global float* g_odata,
                           unsigned int n, local volatile float* l_data) {
          unsigned int tid = get_local_id(0);
          unsigned int i = get_group_id(0) * (get_local_size(0)*2) + get_local_id(0);
          unsigned int gridSize = WG_SIZE * get_num_groups(0);
          l_data[tid] = 0;
          while (i < n) {
            l_data[tid] += g_idata[i];
            if (i + WG_SIZE < n)
              l_data[tid] += g_idata[i+WG_SIZE];
            i += gridSize;
          }
          barrier(CLK_LOCAL_MEM_FENCE);
          if (WG_SIZE >= 256) {
            if (tid < 128) { l_data[tid] += l_data[tid+128]; }
            barrier(CLK_LOCAL_MEM_FENCE);
          }
          if (WG_SIZE >= 128) {
            if (tid <  64) { l_data[tid] += l_data[tid+ 64]; }
            barrier(CLK_LOCAL_MEM_FENCE);
          }
          if (tid < 32) {
            if (WG_SIZE >= 64) { l_data[tid] += l_data[tid+32]; }
            if (WG_SIZE >= 32) { l_data[tid] += l_data[tid+16]; }
            if (WG_SIZE >= 16) { l_data[tid] += l_data[tid+ 8]; }
            if (WG_SIZE >=  8) { l_data[tid] += l_data[tid+ 4]; }
            if (WG_SIZE >=  4) { l_data[tid] += l_data[tid+ 2]; }
            if (WG_SIZE >=  2) { l_data[tid] += l_data[tid+ 1]; }
          }
          if (tid == 0)
            g_odata[get_group_id(0)] = l_data[0];
        }

      • Optimising OpenCL kernels is hard: you need to understand the target hardware
      • Moving target: the hardware keeps changing

  12. 10× improvement for the optimised code (Nvidia GPU)

  13. Unfortunately, performance is not portable (AMD GPU, Intel CPU)

  14. How to achieve performance portability? State of the art: a hand-written implementation (maybe parametric) for each device! The Lift approach:
      - a language to express the parallel portion of programs
      - optimisations and implementation decisions expressed as rewrite rules

  15. Generating Performance Portable Code using Rewrite Rules

  16. High-level expression:

        def add3(int x) = x + 3
        def vectorAdd = map(add3)

      ↓ rewrite rules

      Low-level expression:

        def vectorAdd = join o map-workgroup(
                          join o map-local( vect-4(add3) ) o asVector-4
                        ) o split-1024

      ↓ code generation

      OpenCL kernel:

        int4 add3(int4 x) { return x + 3; }

        kernel void map_add(global int* in, global int* out, int len) {
          // division into workgroups by chunks of 1024
          for (int i = get_group_id(0); i < len/1024; i += get_num_groups(0)) {
            global int* grp_in  = in  + (i*1024);
            global int* grp_out = out + (i*1024);
            // division into threads by chunks of 4
            for (int j = get_local_id(0); j < 1024/4; j += get_local_size(0)) {
              global int* lcl_in  = grp_in  + (j*4);
              global int* lcl_out = grp_out + (j*4);
              // vectorisation with vector width of 4
              global int4* in_vec4  = (int4*) lcl_in;
              global int4* out_vec4 = (int4*) lcl_out;
              *out_vec4 = add3(*in_vec4);
            }
          }
        }

  17. (same as slide 16, with the high-level and low-level expressions labelled "Functional World")

  18. (same as slide 16, with the expressions labelled "Functional World" and the generated OpenCL kernel labelled "Imperative World")

  19. Functional Programming: focus on the what rather than the how.

      Imperative program:

        float sum(float* input, int length) {
          float accumulator = 0;
          for (int i = 0; i < length; i++)
            accumulator += input[i];
          return accumulator;
        }

      Functional program, using an algorithmic pattern (or skeleton):

        reduce(+, 0, input)
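The same contrast can be shown in Python (a sketch for illustration; the slide uses C and the reduce pattern):

```python
from functools import reduce
import operator

def sum_imperative(xs):
    # the "how": explicit loop and accumulator
    acc = 0.0
    for x in xs:
        acc += x
    return acc

def sum_functional(xs):
    # the "what": reduce(+, 0, input)
    return reduce(operator.add, xs, 0.0)
```

The functional form says nothing about execution order, which is precisely what leaves a compiler free to parallelise it.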

  20. Functional algorithmic primitives (figures omitted): map(f), zip, reduce(+, 0), split(n), join, iterate(f, n), reorder(σ)
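The less familiar primitives can be given executable semantics. The Python definitions below are an illustrative sketch (the names mirror the slide; the implementations are my assumptions about their intended meaning):

```python
def split(n, xs):
    # split(n): [a, b, c, d] -> [[a, b], [c, d]] for n = 2
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    # join: inverse of split, flattens one level of nesting
    return [x for chunk in xss for x in chunk]

def iterate(f, n, xs):
    # iterate(f, n): apply f to the array n times
    for _ in range(n):
        xs = f(xs)
    return xs

def reorder(sigma, xs):
    # reorder(σ): permute elements by the index function σ
    return [xs[sigma(i)] for i in range(len(xs))]
```

map, zip, and reduce behave as their Python built-in counterparts.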

  21. High-level Programs

        scal(a, vec)          = map(*a, vec)
        asum(vec)             = reduce(+, 0, map(abs, vec))
        dotProduct(x, y)      = reduce(+, 0, map(*, zip(x, y)))
        gemv(mat, x, y, a, b) = map(+, zip( map(scal(a) o dotProduct(x), mat),
                                            scal(b, y) ))
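Read with eager Python lists, the first three definitions compute the expected linear-algebra kernels (a hypothetical sketch; function composition `o` becomes ordinary nesting):

```python
from functools import reduce
import operator

def scal(a, vec):
    return [a * x for x in vec]                    # map(*a, vec)

def asum(vec):
    # sum of absolute values: reduce(+, 0, map(abs, vec))
    return reduce(operator.add, map(abs, vec), 0)

def dot_product(x, y):
    # reduce(+, 0, map(*, zip(x, y)))
    return reduce(operator.add, (a * b for a, b in zip(x, y)), 0)
```

gemv composes these same pieces: it scales each dot product of a matrix row with x by a, and adds the result pointwise to scal(b, y).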

  22. Case study: Matrix-multiplication

  23. Matrix multiplication expressed functionally. High-level functional expression:

        A × B = map(rowA →
                  map(colB → reduce(+, 0, map(×, zip(rowA, colB))),
                    transpose(B)),
                A)
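A direct, sequential Python transcription of this expression (a sketch for intuition; transpose(B) is realised with zip):

```python
from functools import reduce
import operator

def matmul(a, b):
    # map over rows of A; for each row, map over columns of B (transpose);
    # each entry is reduce(+, 0, map(*, zip(rowA, colB)))
    cols_b = [list(col) for col in zip(*b)]        # transpose(B)
    return [[reduce(operator.add, (x * y for x, y in zip(row_a, col_b)), 0)
             for col_b in cols_b]
            for row_a in a]
```

The nested maps and the reduce are exactly the pattern the rewrite rules on the following slides transform into efficient OpenCL.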

  24. How to explore the implementation space?

  25.–28. Algorithmic Rewrite Rules (an algebra of parallelism)
      • Provably correct rewrite rules
      • Express algorithmic implementation choices
      • Split-join rule, map-fusion rule, reduce rules, ...
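The first two rules can be checked concretely. The snippet below is a sketch of the equalities the rules assert (the definitions of split, join, and map are my assumptions, mirroring the primitives from slide 20):

```python
def split(n, xs):
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    return [x for chunk in xss for x in chunk]

def map_(f, xs):
    return [f(x) for x in xs]

f = lambda x: x + 3
g = lambda x: x * 2
xs = list(range(8))

# split-join rule: map(f) = join o map(map(f)) o split(n)
assert map_(f, xs) == join(map_(lambda chunk: map_(f, chunk), split(4, xs)))

# map-fusion rule: map(f) o map(g) = map(f o g)
assert map_(f, map_(g, xs)) == map_(lambda x: f(g(x)), xs)
```

The split-join rule is what introduces the workgroup/thread tiling seen in the generated kernel on slide 16; fusion removes intermediate arrays.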

  29. Matrix multiplication example. High-level functional expression:

        A × B = map(rowA →
                  map(colB → reduce(+, 0, map(×, zip(rowA, colB))),
                    transpose(B)),
                A)

  30.–32. OpenCL implementation with Register Blocking (figure builds: blocked kernel code, with blockFactor annotated)

  33.–41. Register Blocking as a series of rewrites (figure builds: starting point, then successive rewrite steps)

  42.–43. Register Blocking expressed functionally (figure builds: blocked expression, with blockFactor annotated)
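In plain terms, register blocking has each thread compute blockFactor output elements at once, so every loaded element of B is reused blockFactor times from registers. A sequential Python sketch of that access pattern (hypothetical, for intuition only; assumes blockFactor divides the row count):

```python
def matmul_register_blocked(a, b, block_factor):
    # each (emulated) thread computes block_factor rows of C for one column,
    # keeping a small accumulator block in "registers"
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block_factor):
        for j in range(m):
            acc = [0] * block_factor           # the register block
            for p in range(k):
                bpj = b[p][j]                  # loaded once, reused block_factor times
                for bi in range(block_factor):
                    acc[bi] += a[i0 + bi][p] * bpj
            for bi in range(block_factor):
                c[i0 + bi][j] = acc[bi]
    return c
```

The rewrites on the preceding slides derive exactly this structure from the high-level expression, by splitting the map over rows and hoisting the reuse.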

  44.–46. (figure builds: annotation steps 1, 2, 3 on the blocked expression)

  47. Job almost done! Now we need to “map” parallelism.

  48. Mapping Parallelism. OpenCL thread hierarchy → Lift map primitives:
        workgroups     → map-workgroup
        global threads → map-global
        local threads  → map-local
        single thread  → map-sequential
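The hierarchy can be emulated sequentially: map-workgroup distributes chunks, map-local distributes elements within a chunk. An illustrative Python sketch mirroring slide 16's vectorAdd structure, minus vectorisation (function names follow the slide; the emulation itself is an assumption):

```python
def split(n, xs):
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    return [x for chunk in xss for x in chunk]

def map_workgroup(f, chunks):
    # each workgroup handles one chunk (emulated sequentially here)
    return [f(chunk) for chunk in chunks]

def map_local(f, chunk):
    # each local thread handles one element of its workgroup's chunk
    return [f(x) for x in chunk]

add3 = lambda x: x + 3
xs = list(range(16))
result = join(map_workgroup(lambda c: map_local(add3, c), split(8, xs)))
```

Swapping which map primitive is used at each nesting level is precisely the coarse/fine thread-mapping choice from slide 10.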
