  1. Solution of dynamic solid deformation using hybrid parallelization with MPI and OpenMP MSc. Miguel Vargas-Félix ISUM 2012 1/24

  2. Problem description We want to solve large-scale dynamic problems with linear deformation, modeled with the finite element method. The strain–displacement relation and the constitutive relation are

$$\varepsilon = \begin{pmatrix} \frac{\partial}{\partial x} & 0 & 0 \\ 0 & \frac{\partial}{\partial y} & 0 \\ 0 & 0 & \frac{\partial}{\partial z} \\ \frac{\partial}{\partial y} & \frac{\partial}{\partial x} & 0 \\ 0 & \frac{\partial}{\partial z} & \frac{\partial}{\partial y} \\ \frac{\partial}{\partial z} & 0 & \frac{\partial}{\partial x} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix}, \qquad \sigma = D\left(\varepsilon - \varepsilon_0\right) + \sigma_0,$$

where u is the displacement vector, ε the strain and σ the stress; D is called the constitutive matrix. The solution is found using the finite element method with Galerkin weighted residuals. 2/24
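The slides do not reproduce the entries of D. As a point of reference (an assumption here, not taken from the presentation), for an isotropic linear-elastic material with Young's modulus E and Poisson's ratio ν, the constitutive matrix in this Voigt ordering is:

```latex
D = \frac{E}{(1+\nu)(1-2\nu)}
\begin{pmatrix}
1-\nu & \nu & \nu & 0 & 0 & 0 \\
\nu & 1-\nu & \nu & 0 & 0 & 0 \\
\nu & \nu & 1-\nu & 0 & 0 & 0 \\
0 & 0 & 0 & \tfrac{1-2\nu}{2} & 0 & 0 \\
0 & 0 & 0 & 0 & \tfrac{1-2\nu}{2} & 0 \\
0 & 0 & 0 & 0 & 0 & \tfrac{1-2\nu}{2}
\end{pmatrix}
```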

  3. Schur substructuring method This is a domain decomposition method without overlapping [Krui04].

[Figure: finite element domain Ω with boundary Γ (left), domain discretization (center), partitioning into subdomains i, j with interface Γd (right).]

We start with the system of equations resulting from a finite element problem,

$$K d = f, \quad (1)$$

where K is a symmetric positive definite matrix of size n × n. If we divide the geometry into p partitions, the idea is to split the workload so that each partition is handled by one computer in the cluster. 3/24

  4. Schur substructuring method

[Figure 1. Partitioning example, showing the blocks K_i^II, K_i^IB, K_i^BI and K^BB of the reordered matrix.]

We can reorder the variables of the system of equations so that it takes the following form:

$$\begin{pmatrix}
K_1^{II} & 0 & 0 & \cdots & 0 & K_1^{IB} \\
0 & K_2^{II} & 0 & & 0 & K_2^{IB} \\
0 & 0 & K_3^{II} & & 0 & K_3^{IB} \\
\vdots & & & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & K_p^{II} & K_p^{IB} \\
K_1^{BI} & K_2^{BI} & K_3^{BI} & \cdots & K_p^{BI} & K^{BB}
\end{pmatrix}
\begin{pmatrix} d_1^I \\ d_2^I \\ d_3^I \\ \vdots \\ d_p^I \\ d^B \end{pmatrix}
=
\begin{pmatrix} f_1^I \\ f_2^I \\ f_3^I \\ \vdots \\ f_p^I \\ f^B \end{pmatrix}. \quad (2)$$

The superscript II denotes entries that capture the relationship between nodes inside a partition; BB indicates entries that relate nodes on the boundary; IB and BI are used for entries whose values depend on both nodes on the boundary and nodes inside a partition. 4/24

  5. Schur substructuring method Thus, the system can be separated into p different systems,

$$\begin{pmatrix} K_i^{II} & K_i^{IB} \\ K_i^{BI} & K^{BB} \end{pmatrix}
\begin{pmatrix} d_i^I \\ d^B \end{pmatrix} =
\begin{pmatrix} f_i^I \\ f^B \end{pmatrix}, \quad i = 1 \ldots p.$$

For each partition i the vector of unknowns $d_i^I$ is

$$d_i^I = \left(K_i^{II}\right)^{-1} \left(f_i^I - K_i^{IB} d^B\right). \quad (3)$$

After applying block Gaussian elimination to (2), the reduced system of equations becomes

$$\left(K^{BB} - \sum_{i=1}^p K_i^{BI} \left(K_i^{II}\right)^{-1} K_i^{IB}\right) d^B = f^B - \sum_{i=1}^p K_i^{BI} \left(K_i^{II}\right)^{-1} f_i^I. \quad (4)$$

Once the vector $d^B$ is computed using (4), we can calculate the internal unknowns $d_i^I$ with (3). It is not necessary to calculate the inverses in (4). Let us define $\bar{K}_i^{BB} = K_i^{BI}\left(K_i^{II}\right)^{-1} K_i^{IB}$; to calculate it [Sori00], we proceed column by column using an extra vector t, solving for c = 1…n

$$K_i^{II}\, t = \left[K_i^{IB}\right]_c \quad (5)$$

(note that many $[K_i^{IB}]_c$ are null). Next we complete $\bar{K}_i^{BB}$ with

$$\left[\bar{K}_i^{BB}\right]_c = K_i^{BI}\, t.$$

5/24
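As an illustration of the column-by-column construction in (5), the following sketch builds $\bar{K}_i^{BB} = K_i^{BI}(K_i^{II})^{-1}K_i^{IB}$ with dense matrices and a plain Cholesky solve. It is a minimal example of the procedure, not the presentation's actual sparse MPI implementation; all names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Dense Cholesky A = L L^T (lower factor) of the SPD block K_II.
Mat cholesky(const Mat& A) {
  int n = (int)A.size();
  Mat L(n, Vec(n, 0.0));
  for (int j = 0; j < n; ++j) {
    double s = A[j][j];
    for (int k = 0; k < j; ++k) s -= L[j][k] * L[j][k];
    L[j][j] = std::sqrt(s);
    for (int i = j + 1; i < n; ++i) {
      double t = A[i][j];
      for (int k = 0; k < j; ++k) t -= L[i][k] * L[j][k];
      L[i][j] = t / L[j][j];
    }
  }
  return L;
}

// Solve (L L^T) x = b by forward then backward substitution.
Vec solve(const Mat& L, Vec b) {
  int n = (int)L.size();
  for (int i = 0; i < n; ++i) {       // forward: L y = b
    for (int k = 0; k < i; ++k) b[i] -= L[i][k] * b[k];
    b[i] /= L[i][i];
  }
  for (int i = n - 1; i >= 0; --i) {  // backward: L^T x = y
    for (int k = i + 1; k < n; ++k) b[i] -= L[k][i] * b[k];
    b[i] /= L[i][i];
  }
  return b;
}

// Build Kbb = K_BI * inv(K_II) * K_IB column by column: factor K_II
// once, then solve K_II t = [K_IB]_c and set column c of Kbb = K_BI t.
Mat schur_contribution(const Mat& K_II, const Mat& K_IB, const Mat& K_BI) {
  int nI = (int)K_II.size(), nB = (int)K_IB[0].size();
  Mat L = cholesky(K_II);
  Mat Kbb(nB, Vec(nB, 0.0));
  for (int c = 0; c < nB; ++c) {
    Vec col(nI);
    for (int r = 0; r < nI; ++r) col[r] = K_IB[r][c];
    Vec t = solve(L, col);
    for (int r = 0; r < nB; ++r) {
      double s = 0.0;
      for (int k = 0; k < nI; ++k) s += K_BI[r][k] * t[k];
      Kbb[r][c] = s;
    }
  }
  return Kbb;
}
```

The key point is that the factorization of $K_i^{II}$ is computed once and reused for every right-hand-side column, which is what makes the per-column solves in (5) cheap.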

  6. Schur substructuring method Now let us define $\bar{f}_i^B = K_i^{BI}\left(K_i^{II}\right)^{-1} f_i^I$; in this case only one system has to be solved,

$$K_i^{II}\, t = f_i^I, \quad (6)$$

and then $\bar{f}_i^B = K_i^{BI}\, t$.

Each $\bar{K}_i^{BB}$ and $\bar{f}_i^B$ holds the contribution of one partition to (4), which can be written as

$$\left(K^{BB} - \sum_{i=1}^p \bar{K}_i^{BB}\right) d^B = f^B - \sum_{i=1}^p \bar{f}_i^B; \quad (7)$$

once (7) is solved, we can calculate the inner results of each partition using (3).

Since $K_i^{II}$ is sparse and has to be solved many times in (5), an efficient way to proceed is to use a Cholesky factorization of $K_i^{II}$. To reduce memory usage and increase speed, a sparse Cholesky factorization has to be implemented; this method is explained below. In the case of (7), $K^{BB}$ is sparse, but the $\bar{K}_i^{BB}$ are not. To solve this system of equations a sparse version of conjugate gradient was implemented; the matrix $\left(K^{BB} - \sum_{i=1}^p \bar{K}_i^{BB}\right)$ is not assembled, but kept distributed. 6/24
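The last idea, running conjugate gradient without assembling the matrix, can be sketched as follows: the solver only needs a routine that applies the operator to a vector, so each process can apply its own $\bar{K}_i^{BB}$ and the results can be summed. This is a minimal serial sketch (names and structure are illustrative, not the presentation's MPI code):

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
// The operator is passed as a function; it may apply K_BB and each
// partition's contribution separately, without assembling their sum.
using Op = std::function<Vec(const Vec&)>;

static double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Plain (unpreconditioned) conjugate gradient for an SPD operator.
Vec conjugate_gradient(const Op& A, const Vec& b,
                       int max_iter = 1000, double tol = 1e-12) {
  Vec x(b.size(), 0.0), r = b, p = r;
  double rr = dot(r, r);
  for (int it = 0; it < max_iter && std::sqrt(rr) > tol; ++it) {
    Vec Ap = A(p);
    double alpha = rr / dot(p, Ap);       // step length
    for (size_t i = 0; i < x.size(); ++i) {
      x[i] += alpha * p[i];
      r[i] -= alpha * Ap[i];
    }
    double rr_new = dot(r, r);
    double beta = rr_new / rr;            // direction update
    for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
    rr = rr_new;
  }
  return x;
}
```

In the distributed setting, only the operator application and the dot products need communication (an `MPI_Allreduce` per inner product); the matrix itself never has to exist on one node.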

  7. Matrix storage An efficient method to store and operate on the matrices of this kind of problem is Compressed Row Storage (CRS) [Saad03, p. 362]. This method is suitable when we want to access the entries of each row of a matrix A sequentially.

For each row i of A we have two vectors: a vector $v_i^A$ that contains the non-zero values of the row, and a vector $j_i^A$ with their respective column indexes. For example, for a matrix A whose fourth row is (0, 9, 3, 0, 0, 1), that row is stored as $v_4^A = (9, 3, 1)$ and $j_4^A = (2, 3, 6)$.

[Figure: a 6-column example matrix A and its complete CRS representation, row by row.]

The size of row i will be denoted by $|v_i^A|$ or by $|j_i^A|$. Therefore the q-th non-zero value of row i of A is denoted by $(v_i^A)_q$ and the column index of this value by $(j_i^A)_q$, with $q = 1, \ldots, |v_i^A|$. 7/24

  8. Cholesky factorization for sparse matrices For full matrices the computational complexity of the Cholesky factorization $A = L L^T$ is $O(n^3)$. The entries of L are calculated with

$$L_{ij} = \frac{1}{L_{jj}} \left(A_{ij} - \sum_{k=1}^{j-1} L_{ik} L_{jk}\right), \quad \text{for } i > j,$$

$$L_{jj} = \sqrt{A_{jj} - \sum_{k=1}^{j-1} L_{jk}^2}.$$

We use four strategies to reduce time and memory usage when performing this factorization on sparse matrices:
1. Reorder the rows and columns of the matrix to reduce fill-in in L. This is equivalent to using a permutation matrix to reorder the system, $(P A P^T)(P x) = (P b)$.
2. Use symbolic Cholesky factorization to obtain the exact structure of the factor L (its non-zero entries).
3. Organize operations to improve cache usage.
4. Parallelize the factorization.
8/24
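The two formulas above translate directly into code. As a baseline before the four sparse strategies are applied, a dense column-by-column version looks like this (a sketch for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Dense Cholesky A = L L^T, filling the lower factor column by column
// with the two formulas from the slide.
Mat cholesky(const Mat& A) {
  int n = (int)A.size();
  Mat L(n, std::vector<double>(n, 0.0));
  for (int j = 0; j < n; ++j) {
    // L_jj = sqrt(A_jj - sum_{k<j} L_jk^2)
    double s = A[j][j];
    for (int k = 0; k < j; ++k) s -= L[j][k] * L[j][k];
    L[j][j] = std::sqrt(s);
    // L_ij = (A_ij - sum_{k<j} L_ik L_jk) / L_jj, for i > j
    for (int i = j + 1; i < n; ++i) {
      double t = A[i][j];
      for (int k = 0; k < j; ++k) t -= L[i][k] * L[j][k];
      L[i][j] = t / L[j][j];
    }
  }
  return L;
}
```

The three nested loops are where the $O(n^3)$ cost comes from; the sparse strategies below reduce both the loop ranges and the storage.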

  9. Cholesky factorization for sparse matrices Matrix reordering. We want to reorder the rows and columns of A in a way that reduces the number of non-zero entries of L; $\eta(L)$ denotes the number of non-zero entries of L.

[Figure: the stiffness matrix $A \in \mathbb{R}^{556 \times 556}$, with $\eta(A) = 1810$ (left), and its lower triangular factor L, with $\eta(L) = 8729$ (right).]

There are several heuristics, like the minimum degree algorithm [Geor81] or nested dissection methods [Kary99]. 9/24

  10. Cholesky factorization for sparse matrices By reordering we obtain a matrix A′ with $\eta(A') = 1810$ and a factorization L′ with $\eta(L') = 3215$. Both factorizations solve the same system of equations.

[Figure: the reordered matrix A′ (left) and its factor L′ (right).]

The factorization fill-in is reduced by a factor of $\eta(L')/\eta(L) = 3215/8729 = 0.368$. Determining a "good" reordering of a matrix A that minimizes the fill-in of L is an NP-complete problem [Yann81]. 10/24

  11. Cholesky factorization for sparse matrices Symbolic Cholesky factorization. The algorithm that determines which entries $L_{ij}$ are non-zero is called symbolic Cholesky factorization [Gall90]. Let us define, for all columns $j = 1 \ldots n$,

$$a_j \doteq \{k > j \mid A_{kj} \neq 0\}, \qquad l_j \doteq \{k > j \mid L_{kj} \neq 0\}.$$

The sets $r_j$ register the columns of L whose structure will affect column j of L.

  r_j ← ∅, j = 1…n
  for j ← 1…n
    l_j ← a_j
    for i ∈ r_j
      l_j ← l_j ∪ l_i ∖ {j}
    end_for
    p ← min{i ∈ l_j} if l_j ≠ ∅, otherwise j
    r_p ← r_p ∪ {j}
  end_for

[Figure: a 6×6 example showing the structures of A and L; for instance, $a_2 = \{3, 4\}$ yields $l_2 = \{3, 4, 6\}$.]

This algorithm is very efficient; its complexity in time and space is of order $O(\eta(L))$. 11/24
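The pseudocode above can be implemented almost literally with ordered sets (a sketch using 0-based indices; on the slide's 6×6 example structure it reproduces $l_2 = \{3, 4, 6\}$, which is $\{2, 3, 5\}$ when 0-based):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Symbolic Cholesky: given a[j] = {k > j : A_kj != 0}, compute the
// factor structure l[j] = {k > j : L_kj != 0}. r[j] collects the
// columns whose structure affects column j.
std::vector<std::set<int>> symbolic_cholesky(const std::vector<std::set<int>>& a) {
  int n = (int)a.size();
  std::vector<std::set<int>> l(n), r(n);
  for (int j = 0; j < n; ++j) {
    l[j] = a[j];                                // l_j <- a_j
    for (int i : r[j])                          // l_j <- l_j U (l_i \ {j})
      for (int k : l[i])
        if (k != j) l[j].insert(k);
    int p = l[j].empty() ? j : *l[j].begin();   // p <- min l_j, else j
    r[p].insert(j);                             // r_p <- r_p U {j}
  }
  return l;
}
```

Because `std::set` keeps its elements ordered, `*l[j].begin()` is exactly the minimum required by the algorithm.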

  12. Cholesky factorization for sparse matrices Parallelization of the factorization. The calculation of the non-zero entries $L_{ij}$ can be done in parallel if we fill L column by column [Heat91]. Let J(i) be the indexes of the non-zero values of row i of L. The formulae to calculate $L_{ij}$ are:

$$L_{ij} = \frac{1}{L_{jj}} \left(A_{ij} - \sum_{\substack{k \in J(i) \cap J(j) \\ k < j}} L_{ik} L_{jk}\right), \quad \text{for } i > j,$$

$$L_{jj} = \sqrt{A_{jj} - \sum_{\substack{k \in J(j) \\ k < j}} L_{jk}^2}.$$

[Figure: the entries of one column of L distributed among cores 1…N.]

The parallelization was made using OpenMP. 12/24
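The idea can be sketched on a dense factorization: once $L_{jj}$ is known, every $L_{ij}$ in column j depends only on already-finished columns, so the inner loop parallelizes with a single OpenMP pragma. This is an illustrative sketch, not the presentation's sparse implementation, and it runs correctly (serially) even when compiled without OpenMP:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Column-by-column Cholesky. Within column j, each entry L_ij (i > j)
// depends only on columns k < j and on L_jj, so the entries of one
// column can be computed concurrently, as in the slide's core diagram.
Mat cholesky_parallel(const Mat& A) {
  int n = (int)A.size();
  Mat L(n, std::vector<double>(n, 0.0));
  for (int j = 0; j < n; ++j) {
    double s = A[j][j];
    for (int k = 0; k < j; ++k) s -= L[j][k] * L[j][k];
    L[j][j] = std::sqrt(s);
    // Distribute the rows of column j among the cores.
    #pragma omp parallel for
    for (int i = j + 1; i < n; ++i) {
      double t = A[i][j];
      for (int k = 0; k < j; ++k) t -= L[i][k] * L[j][k];
      L[i][j] = t / L[j][j];
    }
  }
  return L;
}
```

Note that the columns themselves are still processed sequentially; only the work inside each column is shared, which is why the restricted index sets $J(i) \cap J(j)$ matter so much in the sparse case.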

  13. Cholesky factorization for sparse matrices How efficient is it? The next table shows results from solving a 2D Poisson equation problem, comparing Cholesky and conjugate gradient with Jacobi preconditioning (CGJ). Several discretizations were used.

  Equations   nnz(A)      nnz(L)       Cholesky [s]  CGJ [s]
  1,006       6,140       14,722       0.086         0.081
  3,110       20,112      62,363       0.137         0.103
  10,014      67,052      265,566      0.309         0.184
  31,615      215,807     1,059,714    1.008         0.454
  102,233     705,689     4,162,084    3.810         2.891
  312,248     2,168,286   14,697,188   15.819        19.165
  909,540     6,336,942   48,748,327   69.353        89.660
  3,105,275   21,681,667  188,982,798  409.365       543.110
  10,757,887  75,202,303  743,643,820  2,780.734     3,386.609

[Figure: log–log plots of solve time [s] (left) and memory usage [bytes] (right) versus number of equations, for Cholesky and CGJ.]

13/24
