
Comparison of different solution strategies for structure deformation using hybrid OpenMP-MPI methods



  1. Comparison of different solution strategies for structure deformation using hybrid OpenMP-MPI methods. Miguel Vargas, Salvador Botello.

  2. Description of the problem. This is a high performance / large scale application case study of the finite element method for solid mechanics. Our goal is to calculate linear deformation, strain and stress of solids discretized with large meshes (millions of elements) using parallel computing. The strain is related to the displacement field through the differential operator matrix

\varepsilon = \begin{pmatrix} \partial/\partial x & 0 & 0 \\ 0 & \partial/\partial y & 0 \\ 0 & 0 & \partial/\partial z \\ \partial/\partial y & \partial/\partial x & 0 \\ 0 & \partial/\partial z & \partial/\partial y \\ \partial/\partial z & 0 & \partial/\partial x \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix},

and the stress follows the constitutive relation

\sigma = D (\varepsilon - \varepsilon_0) + \sigma_0,

where u is the displacement vector, \varepsilon the strain and \sigma the stress; D is called the constitutive matrix. The solution is found using the finite element method with Galerkin weighted residuals.

  3. Description of the problem. By discretizing the domain and applying boundary conditions (imposed displacements or applied forces), we assemble a linear system of equations K u = f, where K is the stiffness matrix, u is the displacement vector and f is the force vector. If the problem is well defined, K is symmetric positive definite (SPD). This matrix is also sparse. Our goal now is to solve large SPD systems of equations using parallel computing on a Beowulf cluster: a group of multicore computers connected in a network (a master node and several slave nodes linked through a network switch, with the master node attached to an external network). We will combine shared and distributed memory techniques.

  4. “Divide et impera” with MPI. To distribute the workload of big problems we partition the domain into several subdomains.
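Slide 7 notes that the METIS library is used for the partitioning. As a minimal sketch of that step (not the authors' code), the following assumes the METIS 5 C API (METIS_PartGraphKway) and partitions a toy graph that stands in for the element connectivity graph of a mesh; the graph data and the number of parts are made up for illustration.

// Minimal sketch (not the authors' code): partition a small graph with METIS 5,
// assuming the METIS_PartGraphKway C API.
#include <metis.h>
#include <cstdio>
#include <vector>

int main()
{
    // Toy graph: a 2x3 grid of 6 vertices stored in CSR form (xadj/adjncy),
    // standing in for the dual graph of a finite element mesh.
    std::vector<idx_t> xadj   = {0, 2, 5, 7, 9, 12, 14};
    std::vector<idx_t> adjncy = {1, 3,  0, 2, 4,  1, 5,  0, 4,  1, 3, 5,  2, 4};

    idx_t nvtxs  = 6;  // number of vertices
    idx_t ncon   = 1;  // one balancing constraint
    idx_t nparts = 2;  // number of subdomains (e.g. one per MPI process)
    idx_t objval = 0;  // edge-cut returned by METIS
    std::vector<idx_t> part(nvtxs); // partition index assigned to each vertex

    int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                                     NULL, NULL, NULL, &nparts,
                                     NULL, NULL, NULL, &objval, part.data());
    if (status != METIS_OK)
    {
        std::printf("METIS failed\n");
        return 1;
    }
    for (idx_t i = 0; i < nvtxs; ++i)
    {
        std::printf("vertex %d -> partition %d\n", (int)i, (int)part[i]);
    }
    return 0;
}

In practice the graph would be built from the mesh connectivity, and each resulting part becomes the subdomain handled by one MPI process.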

  5. Domain decomposition (Schwarz alternating method). This is an iterative method to split a big problem into small problems. We have a domain Ω with boundary ∂Ω, a differential operator L with L x = y in Ω, and Dirichlet conditions x = b on ∂Ω. The domain is divided into two overlapping partitions Ω1 and Ω2, with Ω = Ω1 ∪ Ω2 and boundaries ∂Ω1 and ∂Ω2. Γ1 and Γ2 are artificial boundaries: Γ1 belongs to ∂Ω1, Γ2 belongs to ∂Ω2, and both lie inside Ω.

  6. Domain decomposition (Schwarz alternating method). The iteration is:

x_1^0, x_2^0: initial approximations
i ← 1
while ∥x_1^i − x_1^{i−1}∥ > ε or ∥x_2^i − x_2^{i−1}∥ > ε:
    solve L x_1^i = y in Ω1, with x_1^i = b on ∂Ω1 ∖ Γ1 and x_1^i = x_2^{i−1}|Γ1 on Γ1
    solve L x_2^i = y in Ω2, with x_2^i = b on ∂Ω2 ∖ Γ2 and x_2^i = x_1^{i−1}|Γ2 on Γ2
    i ← i + 1

Adding overlap improves convergence.
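To make the iteration concrete, here is a minimal, self-contained sketch (not the authors' FEM code) of the alternating Schwarz method for the 1D Poisson problem −u'' = f on (0,1) with u(0) = u(1) = 0, using two overlapping subdomains that are each solved directly with the Thomas algorithm; grid size, overlap and tolerance are made-up illustration values.

#include <vector>
#include <cstdio>
#include <cmath>
#include <algorithm>

// Direct solve of the k-by-k tridiagonal system arising from -u'' = f on a
// segment with grid spacing h and Dirichlet values 'left' and 'right' at the
// segment ends (Thomas algorithm). Returns the k interior values in y[1..k].
std::vector<double> SolveSegment(int k, double h, double left, double right, double f)
{
    std::vector<double> cp(k + 1), dp(k + 1), y(k + 1);
    // Right-hand side: h^2*f, plus the known boundary values moved to the RHS
    auto rhs = [&](int i) { return h*h*f + (i == 1 ? left : 0.0) + (i == k ? right : 0.0); };
    cp[1] = -0.5;
    dp[1] = rhs(1)/2.0;
    for (int i = 2; i <= k; ++i)
    {
        double m = 2.0 + cp[i - 1];     // pivot: b - a*cp = 2 - (-1)*cp
        cp[i] = -1.0/m;
        dp[i] = (rhs(i) + dp[i - 1])/m; // (d - a*dp)/m with a = -1
    }
    y[k] = dp[k];
    for (int i = k - 1; i >= 1; --i)
    {
        y[i] = dp[i] - cp[i]*y[i + 1];
    }
    return y;
}

int main()
{
    const int    n   = 99;          // interior grid points on (0,1)
    const double h   = 1.0/(n + 1);
    const double f   = 1.0;         // constant load; exact solution u(x) = x(1-x)/2
    const double eps = 1e-10;

    // Overlapping partitions: subdomain 1 covers nodes 1..m2,
    // subdomain 2 covers nodes m1..n, the overlap is m1..m2.
    const int m1 = 40, m2 = 60, mid = (m1 + m2)/2;

    std::vector<double> u(n + 2, 0.0), u_new(n + 2, 0.0); // u[0] = u[n+1] = 0

    for (int iter = 0; iter < 1000; ++iter)
    {
        // Both solves use artificial-boundary values from the previous global
        // iterate, so they could run concurrently (one per MPI process).
        std::vector<double> s1 = SolveSegment(m2, h, 0.0, u[m2 + 1], f);
        std::vector<double> s2 = SolveSegment(n - m1 + 1, h, u[m1 - 1], 0.0, f);

        // Combine: left part from subdomain 1, right part from subdomain 2
        for (int j = 1; j <= mid; ++j)     u_new[j] = s1[j];
        for (int j = mid + 1; j <= n; ++j) u_new[j] = s2[j - m1 + 1];

        double diff = 0.0;
        for (int j = 1; j <= n; ++j) diff = std::max(diff, std::fabs(u_new[j] - u[j]));
        u = u_new;
        if (diff < eps)
        {
            std::printf("converged after %d Schwarz iterations\n", iter + 1);
            break;
        }
    }
    std::printf("u(0.25) = %.6f (exact %.6f)\n", u[25], 0.25*0.75/2.0);
    return 0;
}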

  7. Domain decomposition (Schwarz alternating method): implementation using MPI. MPI (Message Passing Interface) is a set of routines and programs that makes it easy to implement and manage applications that execute several processes at the same time. It has routines to transmit data with great efficiency, and it can be used on Beowulf clusters, where each node has its own processors, memory and network interface (NIC), connected through a network switch.
• The idea is to assign an MPI process to handle and solve each partition.
• OpenMP is used in each MPI process to solve the system of equations of the partition.
• Values on artificial boundaries are interchanged using MPI message routines, as in the sketch below.
• Schwarz iterations continue until global convergence is achieved.
• For partitioning we used the METIS library.
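A minimal sketch of the boundary-value interchange step, assuming a chain of partitions where each MPI process trades a block of artificial-boundary values with its neighbours via MPI_Sendrecv; the buffer size and neighbour layout are illustrative, not taken from the presentation.

// Minimal sketch (not the authors' code) of one Schwarz data exchange:
// each MPI process sends the values its neighbour needs on the neighbour's
// artificial boundary and receives its own artificial-boundary values.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Illustrative setup: partitions form a chain, each exchanging a block of
    // boundary values with its left and right neighbours.
    const int boundary_size = 1000; // number of artificial-boundary nodes (made up)
    std::vector<double> send_left(boundary_size, rank),  recv_left(boundary_size);
    std::vector<double> send_right(boundary_size, rank), recv_right(boundary_size);

    int left  = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // Exchange with the right neighbour, then with the left neighbour.
    MPI_Sendrecv(send_right.data(), boundary_size, MPI_DOUBLE, right, 0,
                 recv_right.data(), boundary_size, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_left.data(), boundary_size, MPI_DOUBLE, left, 0,
                 recv_left.data(), boundary_size, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // recv_left/recv_right would now be copied into the fixed (Dirichlet) values
    // on this partition's artificial boundaries before the next local solve.
    std::printf("rank %d exchanged boundary values\n", rank);
    MPI_Finalize();
    return 0;
}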

  8. Domain decomposition (Schwarz alternating method). [Convergence plot: relative change d^i = ∥u_t^i − u_t^{i−1}∥ / ∥u_t^i∥ versus Schwarz iteration (0 to 78) for four partitions P1 to P4, on a logarithmic scale from 1E+1 down to 1E-7.]

  9. Solving SPD sparse systems of equations using shared memory. A simple but powerful way to program a multicore computer with shared memory is OpenMP. In modern computers the processor is a lot faster than the memory; between them a high speed memory, the cache, is used to improve data access (typically 32 KB of L1 cache per core and a 4 MB L2 cache shared by the cores of a processor). The most important issue to achieve high performance is to use the cache efficiently. Approximate access latencies: register ≤ 1 cycle, L1 cache ~3 cycles, L2 cache ~14 cycles, main memory ~240 cycles. Therefore:
• Work with continuous memory blocks.
• Access memory in sequence.
• Each core should work in an independent memory area.
Algorithms to solve our system of equations should take care of this; the sketch below illustrates the difference between sequential and strided access.
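As a small illustration of the "access memory in sequence" guideline (not from the presentation), the following compares summing the same array sequentially and with a large stride; on typical hardware the strided version is noticeably slower because almost every access touches a different cache line. The array size and stride are arbitrary illustration values.

// Small illustration (not from the presentation) of why sequential access is
// preferred: summing the same data sequentially vs. with a large stride.
#include <vector>
#include <cstdio>
#include <omp.h>

int main()
{
    const int n = 1 << 24;               // 16M doubles, far larger than the caches
    const int stride = 4096;             // large stride defeats cache lines and prefetching
    std::vector<double> a(n, 1.0);
    double sum1 = 0.0, sum2 = 0.0;

    double t0 = omp_get_wtime();
    for (int i = 0; i < n; ++i)          // sequential: consecutive cache lines
        sum1 += a[i];
    double t1 = omp_get_wtime();
    for (int j = 0; j < stride; ++j)     // strided: jumps of 'stride' elements
        for (int i = j; i < n; i += stride)
            sum2 += a[i];
    double t2 = omp_get_wtime();

    std::printf("sequential: %.3f s, strided: %.3f s (same sums: %g %g)\n",
                t1 - t0, t2 - t1, sum1, sum2);
    return 0;
}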

  10. Solving SPD sparse systems of equations using shared memory: parallel preconditioned conjugate gradient for sparse matrices. Preconditioning, M(A x − b) = 0, is used to improve CG convergence. Preconditioners must be sparse. We tested three different preconditioners:
• Jacobi: M^{-1} = diag(A).
• Incomplete Cholesky factorization: M^{-1} = G_l G_l^T, with G_l ≈ L.
• Factorized sparse approximate inverse: M = H_l^T H_l, with H_l ≈ L^{-1}.
The preconditioned conjugate gradient iteration is:

r_0 ← b − A x_0 (initial residual)
p_0 ← M r_0 (initial descent direction)
k ← 0
while ∥r_k∥ > ε:
    α_k ← (r_k^T M r_k) / (p_k^T A p_k)
    x_{k+1} ← x_k + α_k p_k
    r_{k+1} ← r_k − α_k A p_k
    β_k ← (r_{k+1}^T M r_{k+1}) / (r_k^T M r_k)
    p_{k+1} ← M r_{k+1} + β_k p_k
    k ← k + 1

Matrix-vector and dot products are parallelized using OpenMP. The compressed row storage method is used to store matrices: only non-zero entries and their indexes are stored, and the entries in each row are contiguous and sorted to improve cache access. Each processor core works with a group of rows to parallelize the operations (see the sketch after this slide).
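The code on the next slide accesses the matrix through A.count[i], A.index[i] and A.entry[i] with 1-based indexing. The sketch below shows one possible row-wise structure consistent with that access pattern, plus the application of the Jacobi preconditioner M = diag(A)^{-1}; the struct name, the function and the small test matrix are illustrative assumptions, not the authors' definitions.

// Sketch only: a row-wise sparse storage compatible with the access pattern in
// the CG code on the next slide (A.count[i], A.index[i], A.entry[i], 1-based),
// plus the application of the Jacobi preconditioner M = diag(A)^{-1}.
#include <omp.h>
#include <cstdio>

struct SparseMatrix
{
    int      n;       // number of rows (rows and entries indexed from 1)
    int*     count;   // count[i]: number of non-zero entries in row i
    int**    index;   // index[i][k]: column of the k-th non-zero of row i (sorted)
    double** entry;   // entry[i][k]: value of the k-th non-zero of row i
};

// q = M r with the Jacobi preconditioner: q_i = r_i / A_ii.
// Each core handles an independent block of rows, accessed sequentially.
void ApplyJacobi(const SparseMatrix& A, const double* r, double* q)
{
    #pragma omp parallel for default(shared) schedule(static)
    for (int i = 1; i <= A.n; ++i)
    {
        double diagonal = 1.0;       // fall-back; the diagonal of an SPD matrix is > 0
        for (int k = 1; k <= A.count[i]; ++k)
        {
            if (A.index[i][k] == i)  // find the diagonal entry of row i
            {
                diagonal = A.entry[i][k];
                break;
            }
        }
        q[i] = r[i]/diagonal;
    }
}

int main()
{
    // Tiny 2x2 SPD example: A = [4 1; 1 3], stored row-wise with 1-based indexing.
    int    count[] = {0, 2, 2};
    int    row1[]  = {0, 1, 2},       row2[] = {0, 1, 2};
    double val1[]  = {0.0, 4.0, 1.0}, val2[] = {0.0, 1.0, 3.0};
    int*    index[] = {nullptr, row1, row2};
    double* entry[] = {nullptr, val1, val2};
    SparseMatrix A = {2, count, index, entry};

    double r[] = {0.0, 8.0, 9.0}, q[3];
    ApplyJacobi(A, r, q);
    std::printf("q = (%g, %g)\n", q[1], q[2]); // expect (2, 3)
    return 0;
}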

  11. Solving SPD sparse systems of equations using shared memory. It looks like:

Vector<T> g(n); // Gradient
Vector<T> p(n); // Descent direction
Vector<T> w(n); // w = A*p

double gg = 0.0;
#pragma omp parallel for default(shared) reduction(+:gg) schedule(guided)
for (int i = 1; i <= n; ++i)
{
    int* __restrict A_index_i = A.index[i];
    double* __restrict A_entry_i = A.entry[i];
    double sum = 0.0;
    int k_max = A.count[i];
    for (register int k = 1; k <= k_max; ++k)
    {
        sum += A_entry_i[k]*x.entry[A_index_i[k]];
    }
    g.entry[i] = sum - b.entry[i]; // g = AX - b
    p.entry[i] = -g.entry[i];      // p = -g
    gg += g.entry[i]*g.entry[i];   // gg = g'*g
}

int step = 0;
while (step < max_steps)
{
    if (Norm(gg) <= tolerance) // Test termination condition
    {
        break;
    }
    double pw = 0.0;
    #pragma omp parallel for default(shared) reduction(+:pw) schedule(guided)
    for (int i = 1; i <= n; ++i)
    {
        int* __restrict A_index_i = A.index[i];
        double* __restrict A_entry_i = A.entry[i];
        double sum = 0.0;
        int k_max = A.count[i];
        for (register int k = 1; k <= k_max; ++k)
        {
            sum += A_entry_i[k]*p.entry[A_index_i[k]];
        }
        w.entry[i] = sum;            // w = AP
        pw += p.entry[i]*w.entry[i]; // pw = p'*w
    }
    double alpha = gg/pw; // alpha = (g'*g)/(p'*w)

    double gngn = 0.0;
    #pragma omp parallel for default(shared) reduction(+:gngn)
    for (int i = 1; i <= n; ++i)
    {
        x.entry[i] += alpha*p.entry[i]; // Xn = x + alpha*p
        g.entry[i] += alpha*w.entry[i]; // Gn = g + alpha*w
        gngn += g.entry[i]*g.entry[i];  // gngn = Gn'*Gn
    }
    double beta = gngn/gg; // beta = (Gn'*Gn)/(g'*g)

    #pragma omp parallel for default(shared)
    for (int i = 1; i <= n; ++i)
    {
        p.entry[i] = beta*p.entry[i] - g.entry[i]; // Pn = -g + beta*p
    }
    gg = gngn;
    ++step;
}

  12. Solving SPD sparse systems of equations using shared memory: parallel Cholesky factorization for sparse matrices. For the stiffness matrix K, the Cholesky factorization K = L L^T is expensive to store and calculate. The entries of L are

L_{ij} = \frac{1}{L_{jj}} \left( K_{ij} - \sum_{k=1}^{j-1} L_{ik} L_{jk} \right), \quad \text{for } i > j,

L_{jj} = \sqrt{ K_{jj} - \sum_{k=1}^{j-1} L_{jk}^2 }.

The factor fills in: in the example shown, nnz(K) = 1810 but nnz(L) = 8729; after reordering the matrix, nnz(K') = 1810 and nnz(L') = 3215.
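To make the two formulas concrete, here is a dense, sequential implementation (illustrative only; the presentation's implementation is sparse, parallel and preceded by symbolic factorization, as described on the next slide). The test matrix is a made-up small SPD example.

// Illustrative only: a dense Cholesky factorization implementing the two
// formulas above, column by column.
#include <vector>
#include <cmath>
#include <cstdio>

// Returns the lower triangular factor L (row-major, n x n) of an SPD matrix K.
std::vector<double> Cholesky(const std::vector<double>& K, int n)
{
    std::vector<double> L(n*n, 0.0);
    for (int j = 0; j < n; ++j)
    {
        // L_jj = sqrt(K_jj - sum_{k<j} L_jk^2)
        double sum = K[j*n + j];
        for (int k = 0; k < j; ++k)
            sum -= L[j*n + k]*L[j*n + k];
        L[j*n + j] = std::sqrt(sum);

        // L_ij = (K_ij - sum_{k<j} L_ik*L_jk)/L_jj, for i > j
        for (int i = j + 1; i < n; ++i)
        {
            double s = K[i*n + j];
            for (int k = 0; k < j; ++k)
                s -= L[i*n + k]*L[j*n + k];
            L[i*n + j] = s/L[j*n + j];
        }
    }
    return L;
}

int main()
{
    // Small SPD example: K = [4 2 2; 2 5 3; 2 3 6]
    std::vector<double> K = {4, 2, 2,  2, 5, 3,  2, 3, 6};
    std::vector<double> L = Cholesky(K, 3);
    std::printf("L = [%g 0 0; %g %g 0; %g %g %g]\n",
                L[0], L[3], L[4], L[6], L[7], L[8]); // expect [2 0 0; 1 2 0; 1 1 2]
    return 0;
}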

  13. Solving SPD sparse systems of equations using shared memory. We used several strategies to make the Cholesky factorization efficient:
• Matrix ordering to reduce factorization fill-in, using the minimum degree algorithm or the nested dissection algorithm (METIS library).
• Symbolic Cholesky factorization to determine non-zero entries before calculation.
• The factorization matrix is stored using compressed row storage to improve forward-substitution.
• L^T is also stored to improve the speed of back-substitution (a sketch of both triangular solves follows below).
• The fill of each column of L is calculated in parallel using OpenMP, distributing columns among the cores.
This algorithm is also used to generate the incomplete Cholesky preconditioner.
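A sketch of the two triangular solves mentioned above, using a simplified 0-based row structure (names and layout are assumptions, not the authors' code): storing L by rows lets forward-substitution sweep each row sequentially, and storing L^T by rows does the same for back-substitution.

// Sketch only (names are illustrative): triangular solves with L stored by
// rows and L^T also stored by rows, so both sweeps read memory sequentially.
// Row entries are sorted by column; in L the diagonal is the last entry of
// each row, in L^T it is the first.
#include <vector>
#include <cstdio>

struct SparseRow
{
    std::vector<int>    index; // column indexes of the non-zero entries (sorted)
    std::vector<double> entry; // values of the non-zero entries
};

// Forward substitution: solve L y = b, L lower triangular stored by rows.
void ForwardSubstitution(const std::vector<SparseRow>& L,
                         const std::vector<double>& b, std::vector<double>& y)
{
    int n = (int)L.size();
    for (int i = 0; i < n; ++i)
    {
        double sum = b[i];
        int last = (int)L[i].index.size() - 1;            // diagonal is the last entry
        for (int k = 0; k < last; ++k)
            sum -= L[i].entry[k]*y[L[i].index[k]];
        y[i] = sum/L[i].entry[last];
    }
}

// Back substitution: solve L^T x = y, with U = L^T upper triangular stored by rows.
void BackSubstitution(const std::vector<SparseRow>& U,
                      const std::vector<double>& y, std::vector<double>& x)
{
    int n = (int)U.size();
    for (int i = n - 1; i >= 0; --i)
    {
        double sum = y[i];
        for (int k = 1; k < (int)U[i].index.size(); ++k)  // diagonal is entry 0
            sum -= U[i].entry[k]*x[U[i].index[k]];
        x[i] = sum/U[i].entry[0];
    }
}

int main()
{
    // L = [2 0; 1 3], so K = L*L^T = [4 2; 2 10]; solve K x = b for b = (4, 11).
    std::vector<SparseRow> L  = { { {0},    {2.0} },
                                  { {0, 1}, {1.0, 3.0} } };
    std::vector<SparseRow> Lt = { { {0, 1}, {2.0, 1.0} },
                                  { {1},    {3.0} } };
    std::vector<double> b = {4.0, 11.0}, y(2), x(2);
    ForwardSubstitution(L, b, y);   // L y = b
    BackSubstitution(Lt, y, x);     // L^T x = y
    std::printf("x = (%g, %g)\n", x[0], x[1]); // expect (0.5, 1)
    return 0;
}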
