1. Parallelization of Scientific Applications (I): A Parallel Structured Flow Solver - URANUS
Russian-German School on High Performance Computer Systems, June 27th until July 6th, 2005, Novosibirsk
Day 4, 30th of June 2005
HLRS, University of Stuttgart

2. Outline
• Introduction
• Basics
• Boundary Handling
• Example: Finite Volume Flow Simulation on Structured Meshes
• Example: Finite Element Approach on an Unstructured Mesh

3. URANUS - Overview
• Calculation of 3D re-entry flows
  – High Mach number
  – High temperature
  – Chemistry
• Calculation of the heat flux on the surface
  – Fully catalytic and semi-catalytic surfaces

4. URANUS - Numerics
• Cell-centered finite-volume approach for the spatial discretization of the unsteady, compressible Navier-Stokes equations
• Time integration accomplished by the backward Euler scheme
• The implicit equation system is solved iteratively by Newton's method
• Two different limiters for second-order accuracy
• Jacobi line-relaxation method with sub-iterations to speed up convergence
• Special handling of the singularity in one-block C-meshes

5. Parallelization - Targets
• High application performance
• Calculation of full configurations in 3D with chemical reactions
  – Performance issue
  – Memory issue
• Calculation of complex topologies
• Use of really big MPPs
  – No loss in efficiency even when using 500 processors and more
• Use of large vector systems

6. Re-entry Simulation - X-38 (CRV)

7. Amdahl's Law
• Let's assume a problem of fixed size
• The serial fraction s
• The parallel fraction p
• Number of processors n
• Degree of parallelization: β = p / (s + p)
• Speedup:
  S(n) = 1 / ((1 - β) + β/n)  →  1 / (1 - β)   for n → ∞
Does parallelization pay off?
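
As a quick worked example (the parallel fraction β = 0.95 is an assumed value for illustration, not taken from the slides), Amdahl's law caps the speedup far below the processor count:

```latex
% Illustrative evaluation of Amdahl's law for beta = 0.95
% (the value 0.95 is an assumed example, not taken from the slides)
\begin{align*}
  S(n)  &= \frac{1}{(1-\beta) + \beta/n} \\
  S(64) &= \frac{1}{0.05 + 0.95/64} \approx 15.4 \\
  \lim_{n\to\infty} S(n) &= \frac{1}{1-\beta} = \frac{1}{0.05} = 20
\end{align*}
```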

8. Amdahl's Law for 1 - 64 processors
[Plot: speedup as a function of the degree of parallelization β, from 0.8 to 1.0, for 1 to 64 processors]

9. Problem with Amdahl's Law
• The conclusion is obviously correct
• But it relies on an important precondition:
  – The problem size is considered to be constant
• This is typically not true for simulations:
  – Computers are always too small
  – The problem size will grow whenever possible

10. Gustafson's Law
• Let's assume a problem of growing size
• Number of processors n
• Constant serial fraction s
• The parallel fraction scales: p(n) = p · n
• Linear speedup:
  S(n) = (s + p(n)) / (s + p(n)/n) = (s + p·n) / (s + p) = s + p·n   (with s + p = 1)
Parallelization may pay off
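
For comparison with the Amdahl example above (again assuming a 5% serial fraction; these values are not from the slides), the scaled speedup grows almost linearly with n:

```latex
% Gustafson (scaled) speedup for s = 0.05, p = 0.95 -- illustrative values
\begin{align*}
  S(n)  &= s + p \cdot n \\
  S(64) &= 0.05 + 0.95 \cdot 64 \approx 60.9
\end{align*}
```

Roughly 61 instead of the 15.4 obtained under Amdahl's fixed-size assumption, which is why letting the problem grow "may pay off".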

11. Parallelization of CFD Applications - Principles

12. A Problem (I)
Flow around a cylinder: numerical simulation using FV, FE or FD
Data structure: A(1:n,1:m)
Solve: (A+B+C)x = b
Movie: Lutz Tobiska, University of Magdeburg, http://david.math.uni-magdeburg.de/home/john/cylinder.html

13. Parallelization Strategies
Flow around a cylinder: numerical simulation using FV, FE or FD; data structure A(1:n,1:m); solve (A+B+C)x = b
• Work decomposition: split the loop do i=1,100 into the blocks i=1,25 / i=26,50 / i=51,75 / i=76,100 (scaling?)
• Data decomposition: split the array into A(1:20,1:50), A(1:20,51:100), A(1:20,101:150), A(1:20,151:200) (scales, but too much communication?)
• Domain decomposition (good chance)
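
A minimal sketch of how the loop range i = 1,100 could be split across MPI processes for the work-decomposition case; the even block partitioning and all names are illustrative assumptions, not code from the slides:

```fortran
program work_decomposition
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  integer :: n, chunk, rest, ilo, ihi, i
  real    :: local_sum

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Split the global iteration range 1..n into nearly equal blocks,
  ! e.g. i=1,25 / i=26,50 / i=51,75 / i=76,100 for n=100 and 4 processes.
  n     = 100
  chunk = n / nprocs
  rest  = mod(n, nprocs)
  ilo   = rank * chunk + min(rank, rest) + 1
  ihi   = ilo + chunk - 1
  if (rank < rest) ihi = ihi + 1

  ! Each process works only on its own block of iterations.
  local_sum = 0.0
  do i = ilo, ihi
     local_sum = local_sum + real(i)   ! placeholder for the real work
  end do

  print *, 'rank', rank, 'handles i =', ilo, '..', ihi
  call MPI_Finalize(ierr)
end program work_decomposition
```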

14. Parallelization Problems
• Decomposition (domain, data or work)
• Communication, e.g. for a finite-difference stencil that needs the values u(i-1) and u(i+1) from neighbouring cells:
  du/dx = (u(i+1) - u(i-1)) / dx

15. Concepts - Message Passing (II)
User-defined communication

16. How to Split the Domain in the Dimensions (I)
• 1-dimensional
• 2- (and 3-) dimensional

17. How to Split the Domain in the Dimensions (II)
• That depends on:
  – computational speed, i.e. the processor type: vector processor or cache-based
  – communication speed:
    • latency
    • bandwidth
    • topology
  – the number of subdomains needed
  – load distribution (is the effort equal for every mesh cell?)

18. Replication versus Communication (I)
• If we need a value from a neighbour, we basically have two options:
  – get the necessary value directly from the neighbour when needed
    → communication, additional synchronisation
  – recalculate the neighbour's value locally from values known there
    → additional calculation
• The selection depends on the application

19. Replication versus Communication (II)
• Normally, replicate the values
  – Consider how many calculations you can execute while only sending 1 bit from one process to another (6 µs at 1.0 Gflop/s → 6000 operations)
  – Sending 16 kByte (20x20x5 doubles) with 300 MB/s bandwidth → 53.3 µs → 53 300 operations
  – Very often blocks have to wait for their neighbours
  – But the extra work limits parallel efficiency
• Communication should only be used if one is quite sure that it is the best solution
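
The numbers above follow from a simple latency/bandwidth cost model; a sketch of the arithmetic, using exactly the rates quoted on the slide:

```latex
% Latency-only message: t_lat = 6 us at a flop rate of 1.0 Gflop/s
% 16 kByte message (20x20x5 doubles = 16 000 B) at BW = 300 MB/s
\begin{align*}
  N_{\mathrm{ops,latency}}  &= t_{\mathrm{lat}} \cdot R_{\mathrm{flop}}
      = 6\,\mu\mathrm{s} \cdot 10^{9}\,\mathrm{flop/s} = 6000 \\
  N_{\mathrm{ops,transfer}} &= \frac{m}{BW} \cdot R_{\mathrm{flop}}
      = \frac{16\,000\,\mathrm{B}}{300\cdot 10^{6}\,\mathrm{B/s}}
        \cdot 10^{9}\,\mathrm{flop/s} \approx 53\,300
\end{align*}
```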

20. 2-Dimensional Domain Decomposition with Two Halo Cells
• Mesh partitioning
• One subdomain for each process
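
A minimal sketch of how such halo cells could be filled by message passing; the two-cell halo width matches the slide, but the one-directional exchange, the array layout and the routine itself are illustrative assumptions rather than the actual URANUS code:

```fortran
subroutine exchange_halos(u, ni, nj, left, right, comm)
  ! Fill a two-cell-wide halo in the i-direction; the j-direction
  ! would be handled analogously with the lower/upper neighbours.
  use mpi
  implicit none
  integer, intent(in)    :: ni, nj, left, right, comm
  real,    intent(inout) :: u(-1:ni+2, -1:nj+2)   ! two halo layers on each side
  integer :: ierr, ncount
  integer :: status(MPI_STATUS_SIZE)

  ncount = 2 * (nj + 4)        ! two i-layers times the padded j-extent

  ! Send the two right-most interior layers to the right neighbour,
  ! receive the two layers of the left halo from the left neighbour.
  call MPI_Sendrecv(u(ni-1:ni, :), ncount, MPI_REAL, right, 0, &
                    u(-1:0,    :), ncount, MPI_REAL, left,  0, &
                    comm, status, ierr)

  ! Send the two left-most interior layers to the left neighbour,
  ! receive the two layers of the right halo from the right neighbour.
  call MPI_Sendrecv(u(1:2,       :), ncount, MPI_REAL, left,  1, &
                    u(ni+1:ni+2, :), ncount, MPI_REAL, right, 1, &
                    comm, status, ierr)

  ! At a physical boundary the neighbour rank would be MPI_PROC_NULL,
  ! and the corresponding send/receive simply does nothing.
end subroutine exchange_halos
```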

21. Back to URANUS

22. Analysis of the Sequential Program
• Written in FORTRAN77
• Uses structured meshes
• Parts of the program:
  – Preprocessing, reading data
  – Main loop:
    • Setup of the equation system
    • Preconditioning
    • Solving step
  – Postprocessing, writing data

23. Deciding on the Data Model
• We will use domain decomposition
  – The domain is split into subdomains
  – Large topologies and a large number of subdomains are needed → 3D domain decomposition
  – Each cell has at most 6 neighbours
• Now there are 2 boundary types:
  – Physical boundary: the subdomain boundary is a domain boundary
  – Inner boundary: the subdomain boundary borders another subdomain
    • Data exchange between subdomains is necessary → communication
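
One common way to obtain the (at most) 6 neighbours of a subdomain in a 3D decomposition is an MPI Cartesian communicator. The sketch below is a generic MPI pattern, not necessarily how URANUS determines its neighbours; at a physical boundary MPI_Cart_shift returns MPI_PROC_NULL, which corresponds to the "physical boundary" case above:

```fortran
program cart_neighbours
  use mpi
  implicit none
  integer :: ierr, nprocs, rank, comm3d, d
  integer :: dims(3), coords(3), nbr_lo(3), nbr_hi(3)
  logical :: periods(3)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Let MPI factor the process count into a 3-D grid of subdomains.
  dims    = 0
  periods = .false.          ! physical boundaries: no wrap-around
  call MPI_Dims_create(nprocs, 3, dims, ierr)
  call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, .true., comm3d, ierr)
  call MPI_Comm_rank(comm3d, rank, ierr)
  call MPI_Cart_coords(comm3d, rank, 3, coords, ierr)

  ! The (at most) 6 neighbours: one lower and one upper neighbour per direction.
  ! At a physical boundary MPI returns MPI_PROC_NULL, i.e. no communication.
  do d = 1, 3
     call MPI_Cart_shift(comm3d, d-1, 1, nbr_lo(d), nbr_hi(d), ierr)
  end do

  print *, 'rank', rank, 'coords', coords, 'neighbours', nbr_lo, nbr_hi
  call MPI_Finalize(ierr)
end program cart_neighbours
```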

24. Data Distribution
• The domain is split by a simple algorithm
  – Make the subdomains equally sized → only a minor load-balancing issue
• Each subdomain is calculated by its own process (on its own processor)
• The data need to be distributed before the calculation
  – One process reads all data
  – The data are then distributed to all other processes
    • Bottleneck
    • Sequential part
  – A parallel read would have been better, but MPI-I/O was not available at that time
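
A minimal sketch of the read-and-distribute pattern described above, assuming for simplicity a 1-D block split with equally sized blocks and an invented file name; the real code would distribute whole 3-D subdomain blocks:

```fortran
subroutine distribute_field(nglobal, nlocal, ulocal, comm)
  ! Rank 0 reads the whole field and hands every process its own block.
  ! Assumes nglobal = nlocal * nprocs; names and file are illustrative.
  use mpi
  implicit none
  integer, intent(in)  :: nglobal, nlocal, comm
  real,    intent(out) :: ulocal(nlocal)
  real, allocatable    :: uglobal(:)
  integer :: ierr, rank, nprocs, p
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)

  if (rank == 0) then
     allocate(uglobal(nglobal))
     open(10, file='flowfield.dat', form='unformatted', status='old')
     read(10) uglobal                      ! sequential part: one reader
     close(10)
     ulocal = uglobal(1:nlocal)            ! rank 0 keeps the first block
     do p = 1, nprocs-1                    ! bottleneck: all data pass through rank 0
        call MPI_Send(uglobal(p*nlocal+1), nlocal, MPI_REAL, p, 99, comm, ierr)
     end do
     deallocate(uglobal)
  else
     call MPI_Recv(ulocal, nlocal, MPI_REAL, 0, 99, comm, status, ierr)
  end if
end subroutine distribute_field
```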

25. Dynamic Data Structures
• Pure FORTRAN77 is too static
  – The number of processors can vary from run to run; the size of the arrays can vary even within the same case
  – → dynamic data structures are needed
• Two options:
  – use Fortran90 dynamic arrays
  – use all the local memory on a PE for one huge FORTRAN77 array and set up your own memory management
• The second method has a problem on SMPs and cc-NUMAs: we should only use as much memory as necessary

26. Introduction of the Dynamic Data Structure
• FORTRAN77:
  common /geo/ x(0:n1m,0:n2m,0:n3m), y(0:n1m,0:n2m,0:n3m), z ...
• Fortran90:
  – Direct usage of dynamic (allocatable) arrays in a common block is not possible
  – Use Fortran90 pointers instead:
    common /geo/ x, y, z ...
    real, pointer :: x(:,:,:)
    real, pointer :: y(:,:,:)
  – Allocation and deallocation in the main program are necessary
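
Put together, the technique on this slide could look roughly as follows; the array extents and the way the sizes are set are illustrative assumptions, not the actual URANUS code:

```fortran
program uranus_main_sketch
  implicit none
  ! Fortran90 pointers in the old common block, as on the slide; every
  ! routine that references /geo/ must repeat these declarations
  ! (typically via an include file).
  real, pointer :: x(:,:,:), y(:,:,:), z(:,:,:)
  common /geo/ x, y, z
  integer :: n1m, n2m, n3m

  ! The sizes would normally come from the mesh/input file of the local subdomain.
  n1m = 32; n2m = 32; n3m = 32

  ! Allocation in the main program, as noted on the slide.
  allocate(x(0:n1m,0:n2m,0:n3m))
  allocate(y(0:n1m,0:n2m,0:n3m))
  allocate(z(0:n1m,0:n2m,0:n3m))

  ! ... preprocessing, main loop, postprocessing ...

  deallocate(x, y, z)
end program uranus_main_sketch
```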

27. Hints for the (Dynamic) Data Structure
• Do not use global data structures
  – All data in a subroutine should be local
  – Data should be passed to the subroutine through its argument list
  – Better maintainability of the program
• This was done in a complete re-engineering of the URANUS code in a later step
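
For illustration only (routine name, arguments and the placeholder computation are invented), the argument-passing style recommended here looks like this:

```fortran
subroutine compute_cell_volumes(x, y, z, vol, n1, n2, n3)
  ! All data arrive through the argument list; the routine knows
  ! nothing about common blocks or other global state.
  implicit none
  integer, intent(in)  :: n1, n2, n3
  real,    intent(in)  :: x(0:n1,0:n2,0:n3), y(0:n1,0:n2,0:n3), z(0:n1,0:n2,0:n3)
  real,    intent(out) :: vol(n1,n2,n3)

  ! Placeholder: a real flow solver would evaluate the finite-volume
  ! cell metrics from the coordinates here.
  vol = 1.0
end subroutine compute_cell_volumes
```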

28. Main Loop (I)
• Setup of the equation system
  – Each cell has 6 neighbours
  – Needs data from the neighbouring cells
  – 2 halo cells at the inner boundaries
  – Special part for handling the physical boundaries
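
A structural sketch of one iteration of such a main loop, with empty stub routines standing in for the real solver steps; this structure is inferred from slides 22, 23 and 28 and is not the actual URANUS code:

```fortran
program main_loop_sketch
  implicit none
  integer :: iter, niter
  niter = 10
  do iter = 1, niter
     call exchange_halo_data()      ! fill the 2 halo cells at inner boundaries
     call apply_physical_bc()       ! special handling of physical boundaries
     call setup_equation_system()   ! needs data from the 6 neighbour cells
     call precondition_system()
     call solve_system()            ! e.g. Newton step with Jacobi line relaxation
  end do
contains
  subroutine exchange_halo_data()
  end subroutine exchange_halo_data
  subroutine apply_physical_bc()
  end subroutine apply_physical_bc
  subroutine setup_equation_system()
  end subroutine setup_equation_system
  subroutine precondition_system()
  end subroutine precondition_system
  subroutine solve_system()
  end subroutine solve_system
end program main_loop_sketch
```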

29. Numbering of the cells in the subdomains
