Sparse linear systems: results

[Figure: speedup with respect to double precision for the single-precision and mixed-precision solves ("Single prec. su", "Mixed prec. su"), matrices 1-10, on four platforms: Sun UltraSPARC-IIe, Intel Pentium III, AMD Opteron 246 and Intel Woodcrest.]
Sparse linear systems: memory

When using sparse direct solvers, the original matrix and the factors are stored separately.

      Matrix        N       NNZ     NNZ in LU
   1  G64          7000     82918     4889782
   2  Si10H16     17077    875923    32888553
   3  c-71        76638    859554    21077836
   4  cage11      39082    559722    60016690
   5  dawson5     51537   1010777     7547980
   6  nasasrb     54870   2677324    15682155
   7  poisson3Db  85623   2374949    74907857
   8  rma10       46835   2374001    13617819
   9  s3rmt3m1     5489    112505      847139
  10  wang4       26068    177196    11708654
Sparse linear systems: memory

In mixed-precision iterref the factors are stored in SP instead of DP.

[Figure: "Memory in MUMPS" — iterref relative memory consumption (mixed over double precision) for matrices 1-10; all bars lie between 0 and 1.]
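Why the bars sit between roughly 0.5 and 1 follows from a back-of-the-envelope estimate: demoting only the factors to single precision halves their storage, while the original matrix stays in double precision. A minimal sketch (our own illustration, ignoring the integer index storage that the real solver also needs), using three rows of the table above:

# Expected mixed/double memory ratio when only the LU factors are stored
# in single precision (4-byte reals) while A stays in double (8-byte).
nnz    = {1: 82918, 4: 559722, 7: 2374949}        # NNZ(A), from the table
nnz_lu = {1: 4889782, 4: 60016690, 7: 74907857}   # NNZ(LU), from the table
for m in nnz:
    ratio = (8 * nnz[m] + 4 * nnz_lu[m]) / (8 * nnz[m] + 8 * nnz_lu[m])
    print(f"matrix {m}: relative memory {ratio:.2f}")  # close to 0.5 when the factors dominate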
Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
Sparse, iterative solvers

Iterref and Richardson. The iterative refinement

  x_{k+1} = x_k + M (b − A x_k),   with M = (LU)^{-1} P,

is equivalent to the Richardson iteration on MAx = Mb, where M is called a left preconditioner.

We can go three steps further:
1. use right preconditioning: AMu = b, x = Mu;
2. replace M with one step of an iterative method on A in single precision;
3. use a Krylov method for the outer iteration in double precision.
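The basic scheme is easy to prototype. Below is a minimal SciPy sketch (the function name and tolerances are illustrative, not from the talk; A is assumed to be a SciPy sparse matrix): the expensive factorization is done once in single precision and the cheap refinement loop runs in double precision.

import numpy as np
from scipy.sparse.linalg import splu

def mixed_precision_iterref(A, b, tol=1e-14, maxiter=30):
    # Factorize once in single precision (the costly part)...
    lu = splu(A.astype(np.float32).tocsc())
    x = lu.solve(b.astype(np.float32)).astype(np.float64)
    # ...then iterate x_{k+1} = x_k + (LU)^{-1} P (b - A x_k) in double.
    for _ in range(maxiter):
        r = b - A @ x                       # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        x += lu.solve(r.astype(np.float32)).astype(np.float64)
    return x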
Sparse, iterative solvers

The result is a mixed-precision, inner-outer iteration of the type FGMRES-GMRES:

 1: for i = 0, 1, ... do
 2:   r = b − A x_i                                                        (ε_d)
 3:   β = h_{1,0} = ||r||_2                                                (ε_d)
 4:   check convergence and exit if done
 5:   for k = 1, ..., m_out do
 6:     v_k = r / h_{k,k−1}                                                (ε_d)
 7:     perform one cycle of GMRES_SP(m_in) in order to (approximately)
        solve A z_k = v_k (initial guess z_k = 0)                          (ε_s)
 8:     r = A z_k                                                          (ε_d)
 9:     for j = 1, ..., k do
10:       h_{j,k} = r^T v_j                                                (ε_d)
11:       r = r − h_{j,k} v_j                                              (ε_d)
12:     end for
13:     h_{k+1,k} = ||r||_2                                                (ε_d)
14:   end for
15:   define Z_k = [z_1, ..., z_k], H_k = {h_{i,j}}, 1 ≤ i ≤ k+1, 1 ≤ j ≤ k   (ε_d)
16:   find y_k, the vector of size k, that minimizes ||β e_1 − H_k y_k||_2    (ε_d)
17:   x_{i+1} = x_i + Z_k y_k                                              (ε_d)
18: end for
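For concreteness, here is a hedged NumPy/SciPy transcription of the listing (a sketch, not the authors' code: the names m_out and m_in mirror the slide, breakdown handling is simplified, and SciPy's restarted GMRES stands in for GMRES_SP(m_in)):

import numpy as np
from scipy.sparse.linalg import gmres

def fgmres_sp_inner(A, b, m_out=20, m_in=10, tol=1e-12, maxiter=50):
    A32 = A.astype(np.float32)            # single-precision copy for the inner cycles
    n = A.shape[0]
    x = np.zeros(n)
    for _ in range(maxiter):
        r = b - A @ x                                        # (eps_d)
        beta = np.linalg.norm(r)                             # (eps_d)
        if beta <= tol * np.linalg.norm(b):
            break
        V = np.zeros((n, m_out + 1)); V[:, 0] = r / beta
        Z = np.zeros((n, m_out))
        H = np.zeros((m_out + 1, m_out))
        for k in range(m_out):
            # one cycle of single-precision GMRES(m_in) for A z_k = v_k   (eps_s)
            z32, _ = gmres(A32, V[:, k].astype(np.float32),
                           restart=m_in, maxiter=1)
            Z[:, k] = z32
            w = A @ Z[:, k]                                  # (eps_d)
            for j in range(k + 1):                           # modified Gram-Schmidt
                H[j, k] = w @ V[:, j]
                w -= H[j, k] * V[:, j]
            H[k + 1, k] = np.linalg.norm(w)                  # (eps_d)
            if H[k + 1, k] == 0.0:                           # lucky breakdown (ignored here)
                break
            V[:, k + 1] = w / H[k + 1, k]
        e1 = np.zeros(m_out + 1); e1[0] = beta
        y = np.linalg.lstsq(H, e1, rcond=None)[0]            # min ||beta e1 - H y||_2
        x = x + Z @ y                                        # (eps_d)
    return x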
Sparse, iterative solvers: results

  n.  Matrix        N      NNZ     Sym  PD   κ
  1   SiO         33401  1317655   yes  no   O(10^3)
  2   Lin         25600  1766400   yes  no   O(10^5)
  3   c-71        76638   859554   yes  no   O(10)
  4   cage-11     39082   559722   no   no   O(1)
  5   raefsky3    21200  1488768   no   no   O(10)
  6   poisson3Db  85623  2374949   no   no   O(10^3)
Sparse, iterative solvers: results

[Figure: two plots of the speedup of the mixed-precision solver, SP-DP / DP-DP (left) and SP-DP / DP (right), for matrices 1-6 on Intel Woodcrest 3.0 GHz, AMD Opteron 246 2.0 GHz and IBM PowerPC 970 2.5 GHz.]
Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
Coding for multicores

Fine granularity
  • as the degree of parallelism increases, finer granularity is needed
  • data should be split into pieces that fit in the small local memories

Asynchronicity
  • a high degree of parallelism suffers from strict synchronizations
  • asynchronous, dynamic execution can be used to hide the latency of memory accesses

Data locality
  • the gap between CPU power and communication speed keeps widening; it is therefore essential to reduce data movement
How to achieve these properties?

We must move the parallelism out of the BLAS and up into the algorithms: from LAPACK running on top of multithreaded BLAS (MT-BLAS) to parallel algorithms where each thread runs sequential BLAS (ST-BLAS).

[Figure: the two software stacks side by side — LAPACK over MT-BLAS vs. parallel algorithms over per-thread ST-BLAS.]
Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
LAPACK algorithms

Panel reduction: a number of transformations are computed on a small portion of the matrix called the "panel". These are Level-2 BLAS operations.

Trailing submatrix update: the transformations computed on the panel are applied to the rest of the matrix, i.e., the "trailing submatrix". These are Level-3 BLAS operations.
LAPACK algorithms

Idle time can be due to sequential tasks and load imbalance.
Fine Granularity: parallel tiled algorithms

Parallel tiled algorithms: fine granularity (of data and operations) can be achieved by partitioning a matrix A of size m × n into tiles of size b × b:

        [ A_11  A_12  ...  A_1s ]
        [ A_21  A_22  ...  A_2s ]
  [A] = [  ...   ...  ...   ... ]
        [ A_r1  A_r2  ...  A_rs ]

Algorithms are rewritten in such a way that tiles are the operands of elementary operations.
Fine Granularity: parallel tiled algorithms

There are cases where the "tiling" can be done starting from the LAPACK algorithm.

  DPOTRF: A → L = chol(A)
  DTRSM:  L, A → A = A L^{-T}
  DSYRK:  B, A → A = A − B B^T

  for k = 1, b+1, 2b+1, ... do
    DPOTRF( A(k:k+b−1, k:k+b−1) )
    DTRSM( L(k:k+b−1, k:k+b−1), A(k+b:n, k:k+b−1) )
    DSYRK( L(k+b:n, k:k+b−1), A(k+b:n, k+b:n) )
  end for
Fine Granularity: parallel tiled algorithms

Written on the tiles of the partitioning above, the same Cholesky factorization becomes:

  for k = 1, 2, ..., s do
    DPOTRF( A_kk )
    for i = k+1, k+2, ..., s do
      DTRSM( L_kk, A_ik )
    end for
    for i = k+1, k+2, ..., s do
      for j = k+1, k+2, ..., s do
        DSYRK( L_ik, L_jk, A_ij )
      end for
    end for
  end for
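To make the structure concrete, here is a small NumPy/SciPy sketch of this tiled Cholesky (sequential loops; in a tiled library each per-tile call becomes a task). The function name and the divisibility assumption are ours, not the talk's:

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def tile_cholesky(A, b):
    # A: symmetric positive definite, n divisible by the tile size b.
    n = A.shape[0]; s = n // b
    L = A.copy()
    T = lambda i, j: np.s_[i*b:(i+1)*b, j*b:(j+1)*b]    # tile (i, j)
    for k in range(s):
        L[T(k, k)] = cholesky(L[T(k, k)], lower=True)   # DPOTRF
        for i in range(k+1, s):                          # DTRSM: A_ik <- A_ik L_kk^{-T}
            L[T(i, k)] = solve_triangular(L[T(k, k)], L[T(i, k)].T, lower=True).T
        for i in range(k+1, s):                          # DSYRK/DGEMM updates
            for j in range(k+1, i+1):                    # lower triangle only, by symmetry
                L[T(i, j)] -= L[T(i, k)] @ L[T(j, k)].T
    return np.tril(L)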
Fine Granularity: parallel tiled algorithms

Things get much more complicated for LU... (below, [X; Y] stacks X on top of Y)

  DGETRF: A_kk → L_kk, U_kk, P_kk = LU(A_kk)
  DGESSM: A_kj, L_kk, P_kk → U_kj = L_kk^{-1} P_kk A_kj
  DTSTRF: [U_kk; A_ik] → U_kk, L_ik, P_ik = LU([U_kk; A_ik])
  DSSSSM: [U_kj; A_ij], L_ik, P_ik → [U_kj; A_ij] = L_ik^{-1} P_ik [U_kj; A_ij]
Fine Granularity: parallel tiled algorithms

... and QR

  DGEQRT: A_kk → (V_kk, R_kk, T_kk) = QR(A_kk)
  DLARFB: A_kj, V_kk, T_kk → R_kj = (I − V_kk T_kk V_kk^T) A_kj
  DTSQRT: [R_kk; A_ik] → (V_ik, T_ik, R_kk) = QR([R_kk; A_ik])
  DSSRFB: [R_kj; A_ij], V_ik, T_ik → [R_kj; A_ij] = (I − V_ik T_ik V_ik^T) [R_kj; A_ij]
Fine Granularity: parallel tile algorithms

LU:
  for k = 1, 2, ..., min(p, q) do
    DGETRF( A_kk, L_kk, U_kk, P_kk )
    for j = k+1, k+2, ..., q do
      DGESSM( A_kj, L_kk, P_kk, U_kj )
    end for
    for i = k+1, k+2, ..., p do
      DTSTRF( U_kk, A_ik, P_ik )
      for j = k+1, k+2, ..., q do
        DSSSSM( U_kj, A_ij, L_ik, P_ik )
      end for
    end for
  end for

QR (same structure, with the kernels of the previous slide):
  for k = 1, 2, ..., min(p, q) do
    DGEQRT( A_kk, V_kk, R_kk, T_kk )
    for j = k+1, k+2, ..., q do
      DLARFB( A_kj, V_kk, T_kk, R_kj )
    end for
    for i = k+1, k+2, ..., p do
      DTSQRT( R_kk, A_ik, V_ik, T_ik )
      for j = k+1, k+2, ..., q do
        DSSRFB( R_kj, A_ij, V_ik, T_ik )
      end for
    end for
  end for
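A compact NumPy sketch of the tile QR loop may help. It uses explicit Q factors from np.linalg.qr in place of the compact (V, T) storage of DGEQRT/DTSQRT, so it is a functional model rather than the real kernels; the helper name and the divisibility assumption are ours:

import numpy as np

def tile_qr(A, b):
    # A: m x n with m, n divisible by the tile size b.
    m, n = A.shape; p, q = m // b, n // b
    R = A.astype(float).copy()
    T = lambda i, j: np.s_[i*b:(i+1)*b, j*b:(j+1)*b]
    for k in range(min(p, q)):
        Qk, R[T(k, k)] = np.linalg.qr(R[T(k, k)])        # DGEQRT (explicit Q)
        for j in range(k+1, q):
            R[T(k, j)] = Qk.T @ R[T(k, j)]               # DLARFB
        for i in range(k+1, p):
            # DTSQRT: QR of the stacked [R_kk; A_ik]
            Qi, Rs = np.linalg.qr(np.vstack([R[T(k, k)], R[T(i, k)]]),
                                  mode='complete')
            R[T(k, k)] = Rs[:b]; R[T(i, k)] = 0.0
            for j in range(k+1, q):
                # DSSRFB: apply the same transformation to [R_kj; A_ij]
                s = Qi.T @ np.vstack([R[T(k, j)], R[T(i, j)]])
                R[T(k, j)], R[T(i, j)] = s[:b], s[b:]
    return R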
Fine Granularity: parallel tile algorithms

Task: the execution of an elementary operation on one (or a few) tiles.

  • fine granularity: tiles can be arbitrarily small, as long as sequential BLAS routines are efficient on them
  • many tasks: tiles are typically of size ~100; for a matrix of size 10000 we have more than 300,000 tasks!
  • complex dependencies: how do we exploit them in order to maximize parallelism?

⇒ Data-Flow Parallelism
Data-flow parallel programming

Data-flow programming model [Silc et al. 97]: "An instruction is said to be executable when all the input operands that are necessary for its execution are available to it. The instruction for which this condition is satisfied is said to be fired. The effect of an instruction is the consumption of its input values and generation of output values... As a result, a dataflow program can be represented as a directed graph consisting of named nodes, which represent instructions, and arcs, which represent data dependencies among instructions."
Asynchronicity: dynamic DAG-driven execution

  • nodes are tasks that operate on tiles
  • edges are dependencies among nodes

Tasks can be scheduled asynchronously, in any order, as long as the dependencies are not violated:
  • dynamic execution flow
  • latency hiding
  • fewer synchronizations
  • no idle times
  • scheduling policies
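The scheduling idea fits in a few lines of code. Below is a minimal, illustrative data-flow executor (our own sketch, not the runtime used in the talk): a task is fired as soon as all the tasks it depends on have completed, with worker threads pulling from a ready queue.

import queue, threading

def dataflow_run(tasks, deps, nthreads=4):
    # tasks: {name: callable}; deps: {name: set of names it depends on}
    if not tasks:
        return
    lock = threading.Lock()
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    children = {}
    for t, ds in deps.items():
        for d in ds:
            children.setdefault(d, []).append(t)
    ready = queue.Queue()
    for t, r in remaining.items():
        if r == 0:
            ready.put(t)                      # tasks with no inputs fire immediately
    done = [0]
    def worker():
        while True:
            t = ready.get()
            if t is None:
                return
            tasks[t]()                        # fire the task
            with lock:
                done[0] += 1
                for c in children.get(t, []):
                    remaining[c] -= 1
                    if remaining[c] == 0:     # all inputs available: fire
                        ready.put(c)
                if done[0] == len(tasks):     # all done: release the workers
                    for _ in range(nthreads):
                        ready.put(None)
    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for th in threads: th.start()
    for th in threads: th.join()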
Asynchronicity: dynamic DAG-driven execution

Data-flow programming model [Silc et al. 97]: "Due to the above rule the model is asynchronous. It is also self-scheduling since instruction sequencing is constrained only by data dependencies."
Block Data Layout

Fine granularity raises the need for a novel data storage format in order to overcome the limitations of BLAS routines on small chunks of data.
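In block data layout each b × b tile is stored contiguously in memory, so a sequential BLAS call on a tile touches one contiguous chunk. A minimal NumPy illustration (the helper is ours):

import numpy as np

def to_block_layout(A, b):
    # Repack an n x n matrix so that each b x b tile is contiguous
    # (n assumed divisible by b); result[i, j] is tile A_{i+1, j+1}.
    n = A.shape[0]; s = n // b
    return np.ascontiguousarray(A.reshape(s, b, s, b).transpose(0, 2, 1, 3))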
Performance results

[Figure: "Cholesky — quad-socket, dual-core Opteron". Gflop/s vs. problem size (1000-10000) for PLASMA & ACML BLAS, ACML Cholesky, MKL Cholesky, and LAPACK & ACML BLAS; y-axis up to 70 Gflop/s.]
Performance results

[Figure: "QR — quad-socket, dual-core Opteron". Gflop/s vs. problem size (1000-10000) for PLASMA & ACML BLAS, ACML QR, MKL QR, and LAPACK & ACML BLAS; y-axis up to 70 Gflop/s.]
Performance results

[Figure: "LU — quad-socket, dual-core Opteron". Gflop/s vs. problem size (1000-10000) for PLASMA & ACML BLAS, ACML LU, MKL LU, and LAPACK & ACML BLAS; y-axis up to 70 Gflop/s.]
Stability of GETWP

Gaussian Elimination with Tiled Pairwise Pivoting:

  N = (L_{p,p} P_{p,p} ... L_{2,p} P_{2,p} ... L_{2,2} P_{2,2} L_{1,p} P_{1,p} ... L_{1,1} P_{1,1})^{-1}  →  A = NU

Three main differences occur between L from GEPP and N from GETWP:
1. N is the combination of permutations and elementary transformations, whose effect cannot be dissociated into two matrices L and P as it is for GEPP;
2. although we need n(n−1)/2 elements to store all the L_{i,j}'s, the matrix N is not unit lower triangular and in general has many more than n(n−1)/2 entries;
3. the absolute values of the off-diagonal entries of N can be greater than 1 and in practice they are notably larger.
Stability of GETWP

Backward error for the factorization and the solution:

  ||y − A x_wp||_∞ / (||A||_∞ ||x_wp||_∞)   vs.   ||y − A x_pp||_∞ / (||A||_∞ ||x_pp||_∞)
  ||A − N_wp U_wp||_∞ / ||A||_∞             vs.   ||PA − L_pp U_pp||_∞ / ||A||_∞

[Figure: backward errors of GETWP (wp) compared with those of GEPP (pp).]
Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
Multifrontal QR

1. Analysis: symbolic computations to reduce fill-in, compute the elimination tree, symbolic factorization, estimates, etc.
2. Factorization: compute the Q and R factors.
3. Solve: use Q and R to compute the solution of a problem (e.g., min_x ||Ax − b||_2).

[Figure: elimination tree with nodes 1-9.]
Multifrontal QR

2. Factorization: compute the Q and R factors.
  • a dense frontal matrix is associated to each node; every frontal matrix contains a number of fully assembled rows that deliver a piece of the global R factor
  • the tree is processed bottom-up and, at each node:
    1. Assembly: the contribution blocks from the children nodes are assembled into the frontal matrix
    2. Factorization: k eliminations are done through partial or full factorization of the frontal matrix

[Figure: elimination tree with nodes 1-9.]
The Multifrontal QR: front factorization

  • the frontal matrix is assumed to be sparse and permuted into a staircase structure
  • a complete Level-3 BLAS QR factorization of the frontal matrix is computed
  • the result is
    ◦ k rows of the global R factor
    ◦ a number of reflectors used to build the Q factor
    ◦ a contribution block to be assembled into the father node
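In dense-linear-algebra terms the front factorization reduces to the following sketch (plain dense QR via NumPy; the real code exploits the staircase and keeps the reflectors in compact form — the helper name is ours):

import numpy as np

def factorize_front(F, k):
    # F: m x n frontal matrix, k: number of eliminations at this node.
    m, n = F.shape
    Q, R = np.linalg.qr(F)        # reflectors live in Q (compact V, T in practice)
    r = min(m, n)
    return R[:k, :], R[k:r, k:]   # k rows of the global R, contribution block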
The Multifrontal QR: front assembly

Frontal matrix assembly is achieved in two steps:
1. contribution blocks are simply appended at the bottom of the father front
2. a row permutation must be computed to restore a staircase structure
Parallelism

As in the Cholesky, LU and LDL^T multifrontal methods, two levels of parallelism can be exploited:

Tree parallelism
  • fronts associated to nodes in different branches are independent and can thus be factorized in parallel

Front parallelism
  • if a front is big enough, it can be factorized in parallel
Parallelism: classical approach

The classical approach (Puglisi, Matstoms, Davis):
  • Tree parallelism:
    ◦ a front assembly + factorization corresponds to a task
    ◦ computational tasks are added to a task pool
    ◦ threads fetch tasks from the pool repeatedly until all the fronts are done
  • Node parallelism:
    ◦ multithreaded BLAS for the front factorization

What's wrong with this approach? A complete separation of the two levels of parallelism, which causes:
  • potentially strong load imbalance
  • heavy synchronizations due to the sequential nature of some operations (assembly)
  • sub-optimal exploitation of the concurrency in the multifrontal method
Parallelism: classical approach

Node parallelism grows towards the root; tree parallelism grows towards the leaves.
Parallelism: a new approach

A fine-grained, data-flow parallel approach:
  • fine granularity: tasks are not defined as operations on fronts but as operations on portions of fronts defined by a 1-D partitioning
  • data-flow parallelism: tasks are scheduled dynamically based on the dependencies between them

Both node and tree parallelism are handled the same way at any level of the tree.
Parallelism: a new approach

Fine granularity is achieved through a 1-D block partitioning of fronts and the definition of five elementary operations:

1. activate(front): the activation of a front corresponds to the full determination of its (staircase) structure and the allocation of the needed memory areas
2. panel(bcol): QR factorization (Level-2 BLAS) of a column
3. update(bcol): update of a column of the trailing submatrix wrt a panel
4. assemble(bcol): assembly of a column of the contribution block into the father
5. clean(front): clean up the front in order to release all the memory areas that are no longer needed
Parallelism: a new approach

  • a frontal matrix is 1-D partitioned into block-columns
  • panels are factorized as usual
  • updates can be applied to each column separately
  • Dependency rule #1: a panel can be factorized as soon as it is updated wrt the previous step (lookahead). Early panel factorization results in more updates in a "ready" state and, thus, more parallelism
  • Dependency rule #2: a column can be updated wrt a panel if it is up to date wrt all previous panels
Parallelism: a new approach

A look at the whole tree:
  • fronts must be activated: the structure is computed and memory is allocated
  • Dependency rule #3: a node can be activated only if all of its children are already active
  • Dependency rule #4: a column can be assembled into the father if it is up to date wrt all the preceding panels and the father is active
  • Dependency rule #5: a column can be treated as soon as it is fully assembled, regardless of the status of the rest of the tree