A Parallel Numerical Solver Using Hierarchically Tiled Arrays
James Brodman, G. Carl Evans, Murat Manguoglu, Ahmed Sameh, María J. Garzarán, and David Padua
Outline
• Motivation
• Hierarchically Tiled Arrays (HTAs)
• SPIKE
• HTA-SPIKE
• Conclusions
Motivation
• Still much room for advancement in parallel programming
• Want to facilitate parallel programming but still provide control over performance
  – Raise the level of abstraction
    • Higher level data parallel operators and their associated aggregate data types
    • Write programs as a sequence of operators
  – Can easily reason about the resulting program
  – Portable across any platform that implements the desired operators
  – Control performance through:
    • Data Parallelism
    • Hierarchical Tiling
      – Who should tile?
Tiling and Compilers (Matrix Multiplication)
[Plot: MFLOPS vs. matrix size for Intel MKL, icc -O3 -xT, and icc -O3; MKL outperforms the compiler-generated code by roughly 20x.]
Clearly, the compiler could do better at tiling.
Hierarchically Tiled Array
• The Hierarchically Tiled Array, or HTA, is a data type based on higher level data parallel operators
  – Fortran 90 array operators
  – Hierarchical tiling
• Makes tiles first class objects
  – Recognizes importance of tiling to control:
    • Locality
    • Data distribution
  – Referenced explicitly, not implicitly generated by the compiler
  – Operators extended to function on tiles
• C++ library implementations:
  – Distributed-memory using MPI
  – Shared-memory using TBB
HTAs, Illustrated
[Figure: HTA tiling levels mapped onto a machine's memory hierarchy: cluster, node, multicore, L2, L1 cache, register.]
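To make the hierarchy concrete, here is a small self-contained toy in C++ (not the HTA library API; every name below is invented for illustration): an array is a grid of tiles and each tile is itself a grid of blocks, so the levels of tiling can mirror the levels of the memory hierarchy in the figure above.

#include <array>
#include <cstddef>
#include <vector>

// Leaf level: a dense block sized, say, to fit in L1 cache.
struct Block {
  std::array<double, 64 * 64> data{};
};

// One level of tiling: a rows x cols grid of sub-tiles of type T.
template <class T>
struct Tiled {
  std::size_t rows = 0, cols = 0;
  std::vector<T> tiles;                        // row-major grid of sub-tiles
  Tiled() = default;
  Tiled(std::size_t r, std::size_t c) : rows(r), cols(c), tiles(r * c) {}
  T& operator()(std::size_t i, std::size_t j) { return tiles[i * cols + j]; }
};

int main() {
  // Two-level structure: 4x4 outer tiles (e.g., one per node), each a
  // 2x2 grid of cache-sized blocks (e.g., one per core).
  Tiled<Tiled<Block>> A(4, 4);
  for (auto& t : A.tiles) t = Tiled<Block>(2, 2);
  A(1, 2)(0, 1).data[0] = 3.14;                // outer tile -> inner tile -> element
  return 0;
}

The actual HTA library adds what this toy lacks: the outer tiles can be distributed across nodes (MPI) or scheduled across threads (TBB), and the data parallel operators described next work on whole tiles.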
HTA Indexing
• A[0,3] – flattened access
• A(0,0) – tile access
• A(0,1)[0,1] – hybrid access
• A(1, 0:1) – simplified syntax
HTA Assignment
lhs = rhs;
lhs(:,:) = rhs(:,:);
lhs(0,:) = rhs(1,:);
lhs(0,:)[1,:] = rhs(1,:)[0,:];
Assignment across distributed tiles implies communication.
Higher Level HTA Operations
• Element-by-element operations
  – +, -, *, /, …
• Reductions
• Transpositions
  – Tile
  – Data
• MapReduce
• Matrix Multiplication
• FFT
• etc.
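As a rough stand-in for what programming with such operators feels like, the sketch below uses plain std::valarray (no tiling, no parallelism; the HTA operators extend the same style to tiled, distributed data): the computation is written as a sequence of whole-array operations and a reduction rather than explicit loops.

#include <cstdio>
#include <valarray>

int main() {
  std::valarray<double> a(1.0, 8), b(2.0, 8);   // value, then length
  std::valarray<double> c = 2.0 * a + b;        // element-by-element operations
  double s = c.sum();                           // a reduction
  std::printf("sum = %f\n", s);                 // prints 32.000000
  return 0;
}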
User-Defined Operations
• Programmers can create new complex parallel operators through the primitive hmap
  – Applies user-defined operators to each tile of the HTA
    • And corresponding tiles if multiple HTAs are involved
  – Operator applied in parallel across tiles
  – Assumes operation on tiles is independent
HTA a, b;
b = a.map( F() );
[Figure: F() applied to each tile of a to produce the corresponding tile of b.]
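A self-contained sketch of the semantics of hmap/map follows (this is not the HTA library's API; Tile, Scale, and hmap_toy are invented for illustration): a user-defined operator is applied to each tile, together with the corresponding tile of any additional HTA passed in, and because the per-tile applications are independent they can run in parallel.

#include <cstddef>
#include <vector>

using Tile = std::vector<double>;        // stand-in for one tile's data

struct Scale {                           // a user-defined operator, like F()
  double factor;
  void operator()(Tile& out, const Tile& in) const {
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = factor * in[i];
  }
};

// Toy "hmap": apply op to corresponding tiles of b and a. Each iteration is
// independent, so it could run on a different core (TBB) or node (MPI).
template <class Op>
void hmap_toy(std::vector<Tile>& b, const std::vector<Tile>& a, Op op) {
  for (std::size_t t = 0; t < a.size(); ++t) op(b[t], a[t]);
}

int main() {
  std::vector<Tile> a(4, Tile(16, 1.0)), b(4, Tile(16, 0.0));
  hmap_toy(b, a, Scale{2.0});            // corresponds to b = a.map( F() )
  return 0;
}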
SPIKE
• Parallel solver for banded linear systems of equations
• SPIKE family of algorithms originally by Prof. Ahmed Sameh
• http://software.intel.com/en-us/articles/intel-adaptive-spike-based-solver/
• SPIKE's blocks map naturally to tiles in HTA
SPIKE Algorithm
• Name derived from the matrix formed by multiplying the original block tri-diagonal system by the inverse of the diagonal blocks of A
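A sketch of the decomposition behind the name, in the conventional SPIKE notation (the symbols are assumed here, since the slides do not fix them): with A block tridiagonal, D its block diagonal, and B_j, C_j the off-diagonal coupling blocks,

A = D\,S, \qquad D = \mathrm{diag}(A_1,\dots,A_p), \qquad
S = D^{-1}A =
\begin{pmatrix}
I   & V_1    &        &        \\
W_2 & I      & V_2    &        \\
    & \ddots & \ddots & \ddots \\
    &        & W_p    & I
\end{pmatrix},

with the "spikes" V_j = A_j^{-1} \begin{pmatrix} 0 \\ B_j \end{pmatrix} and W_j = A_j^{-1} \begin{pmatrix} C_j \\ 0 \end{pmatrix}; these tall, thin columns are what give S its spiked appearance.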
SPIKE Algorithm
• Most SPIKE algorithms proceed as follows:
  1. Compute the spike matrix, S, in parallel
     • Compute LU factorization and solve instead of directly computing the inverse
     • LAPACK routines DGBTRF/DGBTRS (a sketch of this factor/solve pair follows below)
  2. Form a reduced system of much smaller size
  3. Solve the reduced system directly
  4. Retrieve the global solution in parallel
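A minimal sketch of step 1 for a single diagonal block, using the LAPACK routines named above (block size, bandwidths, and values are made up for illustration): factor the banded block once with DGBTRF, then reuse the factorization with DGBTRS instead of ever forming an inverse.

#include <cstdio>
#include <vector>

extern "C" {
void dgbtrf_(const int* m, const int* n, const int* kl, const int* ku,
             double* ab, const int* ldab, int* ipiv, int* info);
void dgbtrs_(const char* trans, const int* n, const int* kl, const int* ku,
             const int* nrhs, const double* ab, const int* ldab,
             const int* ipiv, double* b, const int* ldb, int* info);
}

int main() {
  const int n = 4, kl = 1, ku = 1;           // a tiny tridiagonal block
  const int ldab = 2 * kl + ku + 1;          // extra kl rows for fill-in
  std::vector<double> ab(ldab * n, 0.0);     // LAPACK band storage, column-major
  // A(i,j) lives at ab[kl + ku + i - j + j*ldab] (0-based i, j).
  for (int j = 0; j < n; ++j) {
    ab[kl + ku + j * ldab] = 4.0;                      // diagonal (diagonally dominant)
    if (j > 0)     ab[kl + ku - 1 + j * ldab] = -1.0;  // super-diagonal A(j-1, j)
    if (j < n - 1) ab[kl + ku + 1 + j * ldab] = -1.0;  // sub-diagonal   A(j+1, j)
  }
  std::vector<int> ipiv(n);
  int info = 0;
  dgbtrf_(&n, &n, &kl, &ku, ab.data(), &ldab, ipiv.data(), &info);  // factor once

  const int nrhs = 1;
  std::vector<double> rhs(n, 1.0);           // one right-hand side
  dgbtrs_("N", &n, &kl, &ku, &nrhs, ab.data(), &ldab, ipiv.data(),
          rhs.data(), &n, &info);            // solve with the stored factors
  for (int i = 0; i < n; ++i) std::printf("x[%d] = %f\n", i, rhs[i]);
  return 0;
}

In the TU code shown later, the factorize_lua / solve_lua operators presumably wrap calls like these, applied to every block in parallel via hmap.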
SPIKE Reduced System
• Extracts a smaller, independent subsystem from the original system
SPIKE TU and TA
• SPIKE has several algorithmic variants
• Recent focus has been on the variants referred to as "Truncated"
• Faster than the full SPIKE algorithm
  – Reduces computation and has greater parallelism (the per-interface reduced block is sketched below)
  – Only applicable to diagonally dominant systems, where the spikes decay away from the partition interfaces, so their far tips can safely be dropped
• Can wrap with an iterative method for non-diagonally dominant systems
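As a sketch (using the notation introduced earlier, with superscripts (t) and (b) for the top and bottom m rows of a partition, where m is the size of the coupling blocks B and C): in the truncated variants each partition interface contributes an independent 2m x 2m block to the reduced system,

\begin{pmatrix} I_m & V_j^{(b)} \\ W_{j+1}^{(t)} & I_m \end{pmatrix}
\begin{pmatrix} x_j^{(b)} \\ x_{j+1}^{(t)} \end{pmatrix}
=
\begin{pmatrix} g_j^{(b)} \\ g_{j+1}^{(t)} \end{pmatrix},

which is what the array assignments into REDUCED and greduced assemble in the implementation shown later.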
Truncated Savings
[Two figure slides illustrating the computation saved by the truncated variants.]
TU and TA
• The truncated variants come in several flavors; the two that we implemented are TU and TA
  – T – truncated
  – U – uses LU and UL factorizations
  – A – uses alternating LU and UL factorizations
• The difference is in how the work is distributed to the processors, not in what work is done.
TA Work Distribution
[Figure: how TA distributes the LU/UL factorization work across processors.]
TU Work Distribution
[Figure: how TU distributes the LU/UL factorization work across processors.]
HTA SPIKE
• Blocks of SPIKE map very well to HTA tiles
• Factorization and solving of tiles parallelized with hmap
• Reduced system formed using array assignments
[Figure: the original banded matrix system and its HTA representation as the tiled arrays LUA, ULA, BC, and r.]
TU Implementation
...
LUA.hmap(factorize_lua());   // factorize blocks of A
ULA.hmap(factorize_ula());
// calculate the spike tips W(t) and V(b) from Bs and Cs
BC.hmap(solve_bc(), LUA, ULA);
// update right hand side
g.hmap(solve_lua(), LUA);
// form the reduced system
REDUCED()[0:m-1, m:2*m-1] = BC(0:blks-2)[0:m-1, 0:m-1];
REDUCED()[m:2*m-1, 0:m-1] = BC(1:blks-1)[0:m-1, m:2*m-1];
// form the reduced system RHS
greduced()[0:m-1]   = g(0:blks-2)[blocksize-m:blocksize-1];
greduced()[m:2*m-1] = g(1:blks-1)[0:m-1];
// factorize the reduced system
REDUCED.hmap(factorize());
// solve the reduced system
greduced.hmap(solve(), REDUCED);
// update RHS as r = r - Bz - Cz
fv = r(0:num_blocks-2);
fr_half = greduced()[0:m-1];
B.map(dgemv(), fv, fr_half);
r(0:num_blocks-2) = fv;
fw = r(1:num_blocks-1);
fr_half = greduced()[m:2*m-1];
C.map(dgemv(), fw, fr_half);
r(1:num_blocks-1) = fw;
// solve the updated system
r.hmap(solve_lua(), LUA);
...
Portability
• The HTA programming model expresses programs as a sequence of higher level data parallel operators
  – Portable as long as the appropriate implementations of the operators exist for the target platform
  – For SPIKE: array assignments, tile indexing, and hmap
• The code on the last slide runs on both the MPI and TBB implementations of the HTA library
  – Simply change which header to include.
SPIKE TU Performance
[Plot: speedup vs. sequential MKL against number of CPUs for HTA SHM, HTA MPI, and SpikePACK; 32-core Intel Xeon 7555 (Nehalem) @ 1.86 GHz.]
Cluster TU Performance
[Plot: speedup vs. sequential MKL against number of CPUs for HTA MPI and SpikePACK; cluster of Intel Xeon X5550 (Nehalem) @ 2.66 GHz.]
SPIKE TA Performance
[Plot: speedup vs. sequential MKL against number of CPUs for HTA SHM, HTA MPI, and SpikePACK; 32-core Intel Xeon 7555 (Nehalem) @ 1.86 GHz.]
Cluster TA Performance
[Plot: speedup vs. sequential MKL against number of CPUs for HTA MPI and SpikePACK; cluster of Intel Xeon X5550 (Nehalem) @ 2.66 GHz.]
Conclusions
• The SPIKE algorithms are a good fit for the HTA programming model
  – First class tiles
  – Data parallel operators
  – Portable
• Clean, compact code
  – Our implementation takes ~50 lines of code
  – SpikePACK's Fortran+MPI? … a lot more.
  – A challenge – one of the most complex algorithms we've looked at using HTAs
• Good performance
  – Original goal was only to match performance of Fortran+MPI
Future Work
• Sparse tiling
  – Allow programmers to express the structure of the original system
• Support for heterogeneous architectures
  – Clusters of multicores, GPUs
• Look at other complex algorithms
• Linear algebra libraries and data layout
Questions?
Multithreaded MKL Performance
[Plot: speedup vs. number of processors (1–8) for multithreaded MKL; the speedup axis tops out at 2.5.]
Why is Raising the Level of Abstraction Good?
1) Allows programmers to focus on the algorithm rather than on the implementation
2) Data parallel programs resemble conventional, sequential programs
   – Parallelism is encapsulated
   – Parallelism is structured
3) Compact and readable