Triangular Systems Parallel Algorithms Wavefront Algorithms Parallel Numerical Algorithms Chapter 3 – Dense Linear Systems Section 3.3 – Triangular Linear Systems Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 1 / 42
Triangular Systems Parallel Algorithms Wavefront Algorithms Outline Triangular Systems 1 Parallel Algorithms 2 Wavefront Algorithms 3 Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 2 / 42
Triangular Systems Parallel Algorithms Wavefront Algorithms Triangular Matrices Matrix L is lower triangular if all entries above its main diagonal are zero, ℓ ij = 0 for i < j Matrix U is upper triangular if all entries below its main diagonal are zero, u ij = 0 for i > j Triangular matrices are important because triangular linear systems are easily solved by successive substitution Most direct methods for solving general linear systems first reduce matrix to triangular form and then solve resulting equivalent triangular system(s) Triangular systems are also frequently used as preconditioners in iterative methods for solving linear systems Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 3 / 42
Triangular Systems Parallel Algorithms Wavefront Algorithms Forward Substitution For lower triangular system Lx = b , solution can be obtained by forward substitution i − 1 � � � x i = b i − ℓ ij x j /ℓ ii , i = 1 , . . . , n j =1 for j = 1 to n x j = b j /ℓ jj { compute soln component } for i = j + 1 to n b i = b i − ℓ ij x j { update right-hand side } end end Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 4 / 42
Triangular Systems Parallel Algorithms Wavefront Algorithms Back Substitution For upper triangular system Ux = b , solution can be obtained by back substitution n � � � x i = b i − u ij x j /u ii , i = n, . . . , 1 j = i +1 for j = n to 1 x j = b j /u jj { compute soln component } for i = 1 to j − 1 b i = b i − u ij x j { update right-hand side } end end Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 5 / 42
Triangular Systems Parallel Algorithms Wavefront Algorithms Solving Triangular Systems Forward or back substitution requires about n 2 / 2 multiplications and similar number of additions, so serial exeuction time is T 1 = Θ( γn 2 ) We will consider only lower triangular systems, as analogous algorithms for upper triangular systems are similar Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 6 / 42
Triangular Systems Parallel Algorithms Wavefront Algorithms Loop Orderings for Forward Substitution for j = 1 to n for i = 1 to n x j = b j /ℓ jj for j = 1 to i − 1 for i = j + 1 to n b i = b i − ℓ ij x j b i = b i − ℓ ij x j end end x i = b i /ℓ ii end end right-looking left-looking immediate-update delayed-update data-driven demand-driven fan-out fan-in Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 7 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms Parallel Algorithm Partition For i = 2 , . . . , n , j = 1 , . . . , i − 1 , fine-grain task ( i, j ) stores ℓ ij and computes product ℓ ij x j For i = 1 , . . . , n , fine-grain task ( i, i ) stores ℓ ii and b i , collects sum t i = � i − 1 j =1 ℓ ij x j , and computes and stores x i = ( b i − t i ) /ℓ ii yielding 2-D triangular array of n ( n + 1) / 2 fine-grain tasks Communicate For j = 1 , . . . , n − 1 , task ( j, j ) broadcasts x j to tasks ( i, j ) , i = j + 1 , . . . , n For i = 2 , . . . , n , sum reduction of products ℓ ij x j across tasks ( i, j ) , j = 1 , . . . , i , with task ( i, i ) as root Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 8 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms Fine-Grain Tasks and Communication ℓ 11 b 1 x 1 ℓ 22 ℓ 21 b 2 x 2 ℓ 33 ℓ 31 ℓ 32 b 3 x 3 ℓ 44 ℓ 41 ℓ 42 ℓ 43 b 4 x 4 ℓ 55 ℓ 51 ℓ 52 ℓ 53 ℓ 54 b 5 x 5 ℓ 66 ℓ 61 ℓ 62 ℓ 63 ℓ 64 ℓ 65 b 6 x 6 Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 9 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms Fine-Grain Parallel Algorithm if i = j then t = 0 if i > 1 then recv sum reduction of t across tasks ( i, k ) , k = 1 , . . . , i end x i = ( b i − t ) /ℓ ii broadcast x i to tasks ( k, i ) , k = i + 1 , . . . , n else recv broadcast of x j from task ( j, j ) t = ℓ ij x j reduce t across tasks ( i, k ) , k = 1 , . . . , i end Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 10 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms Fine-Grain Algorithm If communication is suitably pipelined, then fine-grain algorithm can achieve Θ( n ) execution time, but uses Θ( n 2 ) tasks, so it is inefficient If there are multiple right-hand-side vectors b , then successive solutions can be pipelined to increase overall efficiency Agglomerating fine-grain tasks yields more reasonable number of tasks and improves ratio of computation to communication Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 11 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms Agglomeration Agglomerate With n × n array of fine-grain tasks, natural strategies are 2-D: combine k × k subarray of fine-grain tasks to form each coarse-grain task, yielding ( n/k ) 2 coarse-grain tasks 1-D column: combine n fine-grain tasks in each column into coarse-grain task, yielding n coarse-grain tasks 1-D row: combine n fine-grain tasks in each row into coarse-grain task, yielding n coarse-grain tasks Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 12 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms 2-D Agglomeration ℓ 11 b 1 x 1 ℓ 22 ℓ 21 b 2 x 2 ℓ 33 ℓ 31 ℓ 32 b 3 x 3 ℓ 44 ℓ 41 ℓ 42 ℓ 43 b 4 x 4 ℓ 55 ℓ 51 ℓ 52 ℓ 53 ℓ 54 b 5 x 5 ℓ 66 ℓ 61 ℓ 62 ℓ 63 ℓ 64 ℓ 65 b 6 x 6 Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 13 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms 1-D Column Agglomeration ℓ 11 b 1 x 1 ℓ 22 ℓ 21 b 2 x 2 ℓ 33 ℓ 31 ℓ 32 b 3 x 3 ℓ 44 ℓ 41 ℓ 42 ℓ 43 b 4 x 4 ℓ 55 ℓ 51 ℓ 52 ℓ 53 ℓ 54 b 5 x 5 ℓ 66 ℓ 61 ℓ 62 ℓ 63 ℓ 64 ℓ 65 b 6 x 6 Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 14 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms 1-D Row Agglomeration ℓ 11 b 1 x 1 ℓ 22 ℓ 21 b 2 x 2 ℓ 33 ℓ 31 ℓ 32 b 3 x 3 ℓ 44 ℓ 41 ℓ 42 ℓ 43 b 4 x 4 ℓ 55 ℓ 51 ℓ 52 ℓ 53 ℓ 54 b 5 x 5 ℓ 66 ℓ 61 ℓ 62 ℓ 63 ℓ 64 ℓ 65 b 6 x 6 Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 15 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms Mapping Map 2-D: assign ( n/k ) 2 /p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 2-D mesh 1-D: assign n/p coarse-grain tasks to each of p processors using any desired mapping, treating target network as 1-D mesh Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 16 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms 1-D Column Agglomeration, Block Mapping ℓ 11 b 1 x 1 ℓ 22 ℓ 21 b 2 x 2 ℓ 33 ℓ 31 ℓ 32 b 3 x 3 ℓ 44 ℓ 41 ℓ 42 ℓ 43 b 4 x 4 ℓ 55 ℓ 51 ℓ 52 ℓ 53 ℓ 54 b 5 x 5 ℓ 66 ℓ 61 ℓ 62 ℓ 63 ℓ 64 ℓ 65 b 6 x 6 Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 17 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms 1-D Column Agglomeration, Cyclic Mapping ℓ 11 b 1 x 1 ℓ 22 ℓ 21 b 2 x 2 ℓ 33 ℓ 31 ℓ 32 b 3 x 3 ℓ 44 ℓ 41 ℓ 42 ℓ 43 b 4 x 4 ℓ 55 ℓ 51 ℓ 54 ℓ 52 ℓ 53 b 5 x 5 ℓ 66 ℓ 61 ℓ 64 ℓ 62 ℓ 65 ℓ 63 b 6 x 6 Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 18 / 42
Triangular Systems Fine-Grain Algorithm Parallel Algorithms 2-D Algorithm Wavefront Algorithms 1-D Block-Cyclic Algorithm Execution Time With block-size b , 1D partitioning requires n/b broadcasts of b -items for row-agglomeration requires n/b reductions of b -items for column-agglomeration in both cases O ( b 2 ) work must be done to solve for b entries of x between each of the n/b collectives The overall execution time is � � α ( n/b ) log( p ) + βn + γ ( n 2 /p + nb ) T p ( n, b ) = Θ Selecting block-size b = n/p , parallel execution time is � � αp log( p ) + βn + γn 2 /p T p ( n, n/p ) = Θ Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 19 / 42
Recommend
More recommend