Orthogonal Abstractions for Scheduling and Storage Mappings. Dagstuhl, October 26, 2017. Michelle Mills Strout (University of Arizona)
Collaborators and Funding
• Ian Bertolacci (Ph.D. student at U Arizona)
• Catherine Olschanowsky (faculty at Boise State University)
• Eddie Davis (Ph.D. student at Boise State University)
• Mary Hall (University of Utah)
• Anand Venkat (Intel, Ph.D. in 2016)
• Brad Chamberlain and Ben Harshbarger (Cray)
• Stephen Guzik (Colorado State University)
• Xinfeng Gao (Colorado State University)
• Christopher Wilcox (Ph.D., CSU instructor)
• Andrew Stone (Ph.D., now at MathWorks)
• Christopher Krieger (Ph.D., now at University of Maryland)
The presented research was supported by a Department of Energy Early Career Grant DE-SC0003956, a National Science Foundation CAREER grant CCF-0746693, and a National Science Foundation grant CCF-1422725.
University of Arizona is Hiring
• Tenure-track faculty (send email to mstrout@cs.arizona.edu)
• Teaching faculty
• Graduate students
Example: Need to Schedule and Do Storage Mapping Across Operators
• Computation is modularized per operator
• To reduce synchronization costs, we need to aggregate across operators
• To improve data locality, we need to aggregate based on data reuse
• Approaches that do this need to schedule across function calls and loops
Schedule Across Loops for Asynchronous Parallelism
Break a computation that sweeps over a mesh/sparse matrix into chunks/sparse tiles.
[Figure: iteration space and the resulting full sparse tiled task graph]
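A minimal sketch of the chunking step (illustrative names such as buildChunkTaskGraph, not the actual sparse tiling inspector): rows of a CSR matrix are grouped into fixed-size chunks, and an edge is recorded whenever a nonzero couples two chunks, yielding a coarse task graph over which independent chunks can be scheduled asynchronously.

#include <algorithm>
#include <set>
#include <utility>
#include <vector>

// Sketch: partition a row sweep over a CSR matrix into chunks and record the
// inter-chunk dependences implied by the nonzero pattern, giving a coarse
// task graph whose independent chunks can run asynchronously.
std::set<std::pair<int, int>> buildChunkTaskGraph(
    const std::vector<int>& rowptr,  // CSR row pointers, size n+1
    const std::vector<int>& col,     // CSR column indices
    int chunkSize) {
  const int n = static_cast<int>(rowptr.size()) - 1;
  std::set<std::pair<int, int>> edges;  // (earlier chunk, later chunk)
  for (int i = 0; i < n; ++i) {
    const int myChunk = i / chunkSize;
    for (int j = rowptr[i]; j < rowptr[i + 1]; ++j) {
      const int otherChunk = col[j] / chunkSize;
      if (otherChunk != myChunk)  // a nonzero couples two different chunks
        edges.insert({std::min(myChunk, otherChunk),
                      std::max(myChunk, otherChunk)});
    }
  }
  return edges;  // chunks with no path between them may execute concurrently
}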
Scheduling Across Loops: Shift, Fuse, and Overlap Tile
• Tiles now include extra (redundant) flux computations
• All tiles can be executed in parallel
• Pros and cons:
  • This variant performs the best
  • Fusion provides good temporal locality
  • Tiling provides good spatial locality
  • Overlap enables all tiles to start in parallel
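As a rough 1-D illustration of the fuse-plus-overlap part of this schedule (a hand-written sketch with made-up names like fusedOverlapTiled, not the generated code from the study): a flux sweep and an update sweep are fused within each tile, and each tile redundantly recomputes the flux value at its edge so that all tiles can run without waiting on a neighbor.

#include <algorithm>
#include <vector>

// Sketch of fuse + overlap tiling on a 1-D mesh: each tile fuses the flux
// sweep and the update sweep, recomputing the flux it needs at its edge.
void fusedOverlapTiled(std::vector<double>& u, const std::vector<double>& coef,
                       int tileSize) {
  const int n = static_cast<int>(u.size());
  std::vector<double> unew(u);  // boundary values carried through unchanged
  // Every tile is independent, so this loop could be run in parallel.
  for (int lo = 1; lo < n - 1; lo += tileSize) {
    const int hi = std::min(lo + tileSize, n - 1);
    // Overlap: compute flux on [lo, hi] even though updates cover [lo, hi);
    // the flux at point hi is recomputed by the next tile too (redundant work).
    std::vector<double> flux(hi - lo + 1);
    for (int i = lo; i <= hi; ++i)
      flux[i - lo] = coef[i] * (u[i] - u[i - 1]);
    for (int i = lo; i < hi; ++i)   // update reads only this tile's flux values
      unew[i] = u[i] - (flux[i - lo + 1] - flux[i - lo]);
  }
  u.swap(unew);
}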
Results When Done By Hand WITH Storage Mapping (SC 2014)
Goal: programmable scheduling for data locality and parallelism.
A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers, Olschanowsky et al., SC 2014.
The Compiler Needs Help
• A general-purpose compiler does well with some optimizations, but ...
  • program analysis to determine code structure is hard
  • automatic parallelization and optimizations for improving memory-hierarchy use are hard
• We look at ways to
  • enable the programmer to specify the WHAT and the HOW separately (the programmer wants control)
  • still provide the compiler with enough information that it can provide performance portability
Domain-Specific Library: OP2, an Unstructured-Mesh DSL
• Exposes data access information
• Global (compiler flag) and local data layout specification

op_par_loop(adt_calc, "adt_calc", cells,
    op_arg_dat(p_x,   0, pcell, 2,      "double:soa", OP_READ),
    op_arg_dat(p_x,   1, pcell, 2,      "double:soa", OP_READ),
    op_arg_dat(p_x,   2, pcell, 2,      "double:soa", OP_READ),
    op_arg_dat(p_x,   3, pcell, 2,      "double:soa", OP_READ),
    op_arg_dat(p_q,  -1, OP_ID, PDIM*2, "double:soa", OP_READ),
    op_arg_dat(p_adt, -1, OP_ID, 1,     "double",     OP_WRITE));

Each argument names the data array, its data layout (e.g. "double:soa"), and its data access mode (OP_READ / OP_WRITE).
Reguly et al., "Acceleration of a Full-Scale Industrial CFD Application with OP2", TPDS 2016.
Task-Like Library: CnC (Concurrent Collections)
• Computation is specified as steps (tasks)
• Steps get and put tagged data
• Run-time scheduling and garbage-collection policies can be specified orthogonally
Schlimbach, F., Brodman, J.C., and Knobe, K., "Concurrent Collections on Distributed Memory Theory Put into Practice", PDP 2013.
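A conceptual sketch of the step/get/put pattern (plain C++ with made-up collections, not the actual CnC API): a step is invoked with a tag, gets the item that tag identifies, and puts a new tagged item; because steps only touch tagged collections, when and where they run is a separate, orthogonal policy decision.

#include <iostream>
#include <map>

// Illustrative only: tag-indexed item collections and a step keyed by a tag.
using Tag = int;
std::map<Tag, double> itemsIn;   // items produced by upstream steps
std::map<Tag, double> itemsOut;  // items this step produces

void step(Tag t) {
  double x = itemsIn.at(t);  // "get" the item with this tag
  itemsOut[t] = 2.0 * x;     // "put" a new item under the same tag
}

int main() {
  for (Tag t = 0; t < 4; ++t) itemsIn[t] = 1.5 * t;
  // The order of these invocations is a scheduling choice made outside the steps.
  for (Tag t = 3; t >= 0; --t) step(t);
  for (const auto& kv : itemsOut) std::cout << kv.first << " -> " << kv.second << "\n";
  return 0;
}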
Data-Parallel Library: Kokkos
• Can orthogonally specify multi-dimensional array layout:
  • row-major
  • column-major
  • Morton order
  • tiled
  • per-dimension padding
  • ...

View<double**, Tile<8,8>, Space> m("matrix", N, N);

H. Carter Edwards and Christian Trott, "Kokkos, Manycore Device Performance Portability for C++ HPC Applications", GPU Technology Conference, 2015.
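A small sketch of that orthogonality using the standard Kokkos layouts (LayoutRight and LayoutLeft; the tiled layout shown above is selected the same way): the kernel is written once against logical indices, and the View type alone decides how those indices map to memory.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1024;
    // Same logical 2-D array; only the layout template argument differs.
    Kokkos::View<double**, Kokkos::LayoutRight> rowMajor("A_row", N, N);
    Kokkos::View<double**, Kokkos::LayoutLeft>  colMajor("A_col", N, N);

    // The kernel is written once against logical indices (i, j);
    // the layout decides how (i, j) maps to memory.
    Kokkos::parallel_for("init", N, KOKKOS_LAMBDA(const int i) {
      for (int j = 0; j < N; ++j) {
        rowMajor(i, j) = i + j;
        colMajor(i, j) = i + j;
      }
    });
  }
  Kokkos::finalize();
  return 0;
}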
Pragma-Based Approach: Loop Chain Abstraction (HIPS 2013, WACCPD 2016)
The schedule clause below fuses all loops, tiles the resulting loop with 10x10 tiles, executes a wavefront over tiles, and runs serially within each tile.

#pragma omplc loopchain schedule(fuse, tile((10,10), wavefront, serial))
{
  // 2-dimensional loop nest, from 1 to N (inclusive)
  #pragma omplc domain(1:N, 1:N) ...
  for (int i = 1; i <= N; i += 1)
    for (int j = 1; j <= N; j += 1)
      { /* Stuff */ }

  // 2-dimensional loop nest, from 1 to N (inclusive)
  #pragma omplc domain(1:N, 1:N) ...
  for (int i = 1; i <= N; i += 1)
    for (int j = 1; j <= N; j += 1)
      { /* Things */ }
}

Identifying and Scheduling Loop Chains Using Directives, Bertolacci et al., WACCPD 2016.
Loop Chain Schedule Example (1/3): schedule( fuse )
[Figure: the original loop chain contains two loop nests; after loop fusion it is a single loop nest]
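Roughly, fuse turns two loop nests over the same domain into one nest that performs both statements per iteration (a hand-written illustration with placeholder arrays, not the code the tool emits; fusing directly like this assumes the second sweep reads only the value produced at the same (i, j)).

#include <vector>

// Hand-written illustration of what schedule(fuse) does to two loop nests
// over the same 1..N x 1..N domain.
void fusedVersion(int N) {
  std::vector<std::vector<double>> A(N + 2, std::vector<double>(N + 2, 0.0));
  std::vector<std::vector<double>> B(N + 2, std::vector<double>(N + 2, 1.0));
  std::vector<std::vector<double>> C(N + 2, std::vector<double>(N + 2, 0.0));

  // Before fusion this was two separate sweeps:
  //   sweep 1:  A[i][j] = B[i][j] * 2.0;
  //   sweep 2:  C[i][j] = A[i][j] + 1.0;
  // After schedule(fuse) both statements share one sweep, so A[i][j] is
  // still in cache when the second statement reads it.
  for (int i = 1; i <= N; ++i)
    for (int j = 1; j <= N; ++j) {
      A[i][j] = B[i][j] * 2.0;
      C[i][j] = A[i][j] + 1.0;
    }
}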
Loop Chain Schedule Example (2/3): schedule( fuse, tile( (2,2), parallel, serial ) )
[Figure: the original loop chain (two loop nests), after loop fusion (one loop nest), and after tiling (tiles executed in parallel, serial within each tile)]
Loop Chain Schedule Example (3/3): schedule( tile( (2,2), serial, serial ), fuse )
[Figure: the original loop chain (two loop nests), after tiling (each loop nest broken into tiles), and after fusing (corresponding tiles from the two nests combined)]
Compiler Scripting Approach: Inspector-Executor Transformations Composed by the Compiler (CGO 2014, PLDI 2015, SC 2016)
[Diagram: at compile time, scripting and autotuning frameworks drive the CHiLL / CUDA-CHiLL compiler; index arrays of the irregular computation are modeled as uninterpreted functions and made explicit functions; inspector transformations such as Inspector 1 (e.g. index-set splitting), Inspector 2 (e.g. compact-and-pad), through Inspector K are composed into a single composed inspector. At runtime, the composed inspector examines the index arrays and the executor (the transformed irregular computation) runs against the inspector's output through the inspector/executor API.]
CHiLL-I/E Example: Sparse Triangular Solve

Runtime irregular computation (indexes through the runtime index arrays idx and col):

for (i = 0; i < N; i++) {
  for (j = idx[i]; j < idx[i+1]; j++) {
    x[i] = x[i] - A[j] * x[col[j]];
  }
}

Dependences and scheduling expressed with uninterpreted / explicit functions:

Deps  = { [i] -> [i'] : i < i' and i = col(j) and idx(i') <= j < idx(i'+1) }
Sched = { [i,j] -> [l,i,j] : i in level_set(l) }    (l sequential, i parallel, j sequential)

Compile-time scripting generates the inspector (pseudocode):

CHiLLIE::func part_par(EF col, EF idx) {
  CHiLLIE::func level_set;
  // BFS traversal of Deps doing gets: ... idx(i) ... col(j) ...
  level_set() = part_par(<i loop>);
  // Place appropriate i's in each set: ... level_set(l).insert(i) ...
  return level_set;
}

Executor (transformed irregular computation):

for (l = 0; l < M; l++) {
  #pragma omp parallel for
  for (i in level_set(l)) {
    for (j = idx[i]; j < idx[i+1]; j++) {
      x[i] = x[i] - A[j] * x[col[j]];
    }
  }
}
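A stand-alone sketch of what such a level-set inspector computes at runtime (plain C++ written for illustration, not CHiLL output; it assumes a lower-triangular CSR matrix): the level of row i is one more than the largest level among the rows it reads, and all rows in the same level can be solved in parallel.

#include <algorithm>
#include <vector>

// Runtime inspector sketch: group the rows of a lower-triangular CSR matrix
// into level sets so that rows within one level can be solved in parallel.
std::vector<std::vector<int>> buildLevelSets(const std::vector<int>& idx,
                                             const std::vector<int>& col,
                                             int n) {
  std::vector<int> level(n, 0);
  int maxLevel = 0;
  for (int i = 0; i < n; ++i) {
    for (int j = idx[i]; j < idx[i + 1]; ++j)
      if (col[j] < i)  // row i reads x[col[j]], written when row col[j] was solved
        level[i] = std::max(level[i], level[col[j]] + 1);
    maxLevel = std::max(maxLevel, level[i]);
  }
  std::vector<std::vector<int>> levelSet(maxLevel + 1);
  for (int i = 0; i < n; ++i) levelSet[level[i]].push_back(i);
  return levelSet;  // the executor walks levels serially and rows in parallel
}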
Moving Forward: Making Schedules Available in Libraries (ICS 2015)
• Diamond slab tiling written in C
• Diamond slab tiling made available as a Chapel iterator

We want to transform our original schedule:

  for t in timeRange do
    forall (x,y) in spaceDomain do
      computation( t, x, y );

into a faster schedule:

  forall (t,x,y) in diamondTileIterator(…) do
    computation( t, x, y );

The diamond slab tiled C code being wrapped:

int Li=0, Ui=N, Lj=0, Uj=N;
for (int ts = 0; ts < T; ts += subset_s)
  // loops over bottom left, middle, top right
  for (int c0 = -2; c0 <= 0; c0 += 1)
    // loops horizontally?
    for (int c1 = 0; c1 <= (Uj+tau-3)/(tau-3); c1 += 1)
      // loops vertically?, but without skew
      for (int x = (-Ui-tau+2)/(tau-3); x <= 0; x += 1) {
        int c2 = x - c1;  // skew
        // loops for time steps within a slab (slices within slabs)
        for (int c3 = 1; c3 <= subset_s; c3 += 1)
          for (int c4 = max(max(max(-tau*c1 - tau*c2 + 2*c3 - (2*tau-2),
                                    -Uj - tau*c2 + c3 - (tau-2)),
                                tau*c0 - tau*c1 - tau*c2 - c3), Li);
               c4 <= min(min(min(tau*c0 - tau*c1 - tau*c2 - c3 + (tau-1),
                                 -tau*c1 - tau*c2 + 2*c3),
                             -Lj - tau*c2 + c3), Ui - 1);
               c4 += 1)
            for (int c5 = max(max(tau*c1 - c3, Lj), -tau*c2 + c3 - c4 - (tau-1));
                 c5 <= min(min(Uj - 1, -tau*c2 + c3 - c4), tau*c1 - c3 + (tau-1));
                 c5 += 1)
              computation(c3, c4, c5);
      }
Pragma-Based Approach: Lookup Table Optimization (JSP 2011, SCAM 2012, JSEP 2013)
• Source-to-source translation tool called Mesa, built on the ROSE compiler
• Up to 6.8x speedup on a set of 5 applications, including Small Angle X-ray Scattering
• Provides the user a Pareto-optimal set of LUT transformations to choose from
[Workflow: Original Code -> Performance Profiling & Scope Identification -> Expression Enumeration & Domain Profiling -> Error Analysis & Performance Modeling -> Construct & Solve Optimization Problem -> Code Generation & Integration -> Optimized Code -> Performance & Accuracy Evaluation]
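To illustrate the kind of transformation such a tool performs (a hand-written sketch, not Mesa's output; the domain, table size, and class name are illustrative): an expensive elementary function over a profiled input domain is replaced by a precomputed table plus an index computation, trading bounded accuracy loss for speed.

#include <cmath>
#include <vector>

// Sketch of a lookup-table (LUT) transformation: replace calls to exp(-x)
// for x in a profiled domain [lo, hi) with a precomputed, sampled table.
class ExpLut {
 public:
  ExpLut(float lo, float hi, int size) : lo_(lo), hi_(hi), table_(size) {
    for (int k = 0; k < size; ++k) {
      float x = lo_ + (hi_ - lo_) * k / size;  // sample point for slot k
      table_[k] = std::exp(-x);
    }
  }
  float operator()(float x) const {
    int size = static_cast<int>(table_.size());
    int k = static_cast<int>((x - lo_) / (hi_ - lo_) * size);
    if (k < 0) k = 0;
    if (k >= size) k = size - 1;  // clamp to the profiled domain
    return table_[k];             // error is bounded by the sampling step
  }
 private:
  float lo_, hi_;
  std::vector<float> table_;
};

// Usage: build once outside the hot loop, then replace std::exp(-x) with lut(x).
// ExpLut lut(0.0f, 10.0f, 4096);  float y = lut(3.7f);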
Fortran Library-Based Approach: GridWeaver for Semi-Regular Grids (ICS 2013)
Specify the stencil computation, grid, and parallel distribution orthogonally.
[Figure: example semi-regular grids, including tripole, geodesic, and cubed-sphere grids]
Summary: Potential Stack of Abstractions for Specifying Schedules and/or Storage Mappings
[Diagram: a stack from high level to low level, with columns for data/computation, storage mapping, and scheduling. High-level entries include DSLs such as OP2 and Kokkos, SoA or AoS data layouts, loop chains and loop chain schedule commands (map, reduce, ...), Chapel domains and Chapel iterators, and CHiLL scripts. Middle entries include polyhedral sets, maps, graphs, and trees, non-affine (SPF) mappings, tasks, CnC graphs, and tuner controls. Low-level entries include arrays, pointers, ASTs, and tree rewriting. Each level can be lowered to the one below it.]