
Speeding Up Reactive Transport Code Using OpenMP
By Jared McLaughlin



  1. Speeding Up Reactive Transport Code Using OpenMP, by Jared McLaughlin
     OpenMP
     • A standard for parallelizing Fortran and C/C++ on shared-memory systems
     • Minimal changes to the sequential code are required
     • Incremental parallelization
     • Works with both OpenMP-compliant and ordinary compilers: directives start with the !$OMP sentinel, which a non-OpenMP compiler treats as a comment
     • No message passing between processors
     • Fine- and coarse-grained parallelism
       – Do loops
       – Sections
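     A minimal sketch (not taken from the slides; the program name and printed message are illustrative) of what the !$OMP sentinel buys: an OpenMP-compliant compiler parses the directives and runs the region on a team of threads, while an ordinary Fortran compiler treats the !$OMP and !$ lines as comments and the same source simply runs sequentially.

        program hello_omp
        !$ use omp_lib            ! conditional-compilation line: kept only when OpenMP is enabled
           implicit none
           integer :: tid
           tid = 0
        !$OMP PARALLEL PRIVATE(tid)
        !$ tid = omp_get_thread_num()   ! each thread records its own id
           print *, 'Hello from thread', tid
        !$OMP END PARALLEL
        end program hello_omp

     Compiled without OpenMP support, this prints a single line from thread 0; compiled with an OpenMP-enabled compiler, each thread in the team prints its own id.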

  2. Threads
     What is a thread? A team of threads is forked at the start of a parallel region and joined back together at the end of the region. The original thread that encounters the region is called the master; the newly forked threads are the workers.
     Parallel Region Construct
        !$OMP PARALLEL clause1 clause2 ...
           ... parallel code is placed here ...
        !$OMP END PARALLEL
     Optional clauses include
     • PRIVATE ( list )
     • SHARED ( list )
     • DEFAULT ( PRIVATE | SHARED | NONE )
     • FIRSTPRIVATE ( list )
     • REDUCTION ( operator : list )
     • IF ( scalar logical expression )
     • NUM_THREADS ( scalar integer expression )
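     The following is a small hedged sketch (the subroutine name fill_by_thread, the array x, and the manual index partition are all assumptions, not the author's code) showing data-scoping clauses on a bare parallel region: x is SHARED, each thread's id and loop bounds are PRIVATE, and every thread writes only its own slice of the array, so no race condition can occur.

        subroutine fill_by_thread(n, x)
        !$ use omp_lib
           implicit none
           integer, intent(in)  :: n
           real,    intent(out) :: x(n)
           integer :: tid, nth, i, lo, hi
           tid = 0
           nth = 1
        !$OMP PARALLEL DEFAULT(NONE) SHARED(n, x) PRIVATE(tid, nth, i, lo, hi)
        !$ tid = omp_get_thread_num()    ! this thread's id (0 .. nth-1)
        !$ nth = omp_get_num_threads()   ! size of the team
           lo = tid * n / nth + 1        ! first index owned by this thread
           hi = (tid + 1) * n / nth      ! last index owned by this thread
           do i = lo, hi
              x(i) = real(tid)
           end do
        !$OMP END PARALLEL
        end subroutine fill_by_thread

     Partitioning the loop by hand like this is exactly the bookkeeping that the work-sharing constructs on the next slide take care of automatically.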

  3. Work-Sharing Constructs
     !$OMP DO clause1 clause2 ...
        DO i = 1, N
           ... parallel code is placed here ...
        END DO
     !$OMP END DO end_clause
     Optional clauses include
     • PRIVATE ( list )
     • FIRSTPRIVATE ( list )
     • LASTPRIVATE ( list )
     • REDUCTION ( operator : list )
     • SCHEDULE ( type, chunk )

     !$OMP SECTIONS clause1 clause2 ...
     !$OMP SECTION
        ... parallel code is placed here ...
     !$OMP SECTION
        ... parallel code is placed here ...
     !$OMP END SECTIONS end_clause
     Optional clauses include
     • PRIVATE ( list )
     • FIRSTPRIVATE ( list )
     • LASTPRIVATE ( list )
     • REDUCTION ( operator : list )

     !$OMP SINGLE clause1 clause2 ...
        ...
     !$OMP END SINGLE end_clause
     Optional clauses include
     • PRIVATE ( list )
     • FIRSTPRIVATE ( list )

     !$OMP WORKSHARE
        ...
     !$OMP END WORKSHARE end_clause

     Clauses
     SHARED ( list ) – the same memory location of the variable is available to all threads, and the variable exists before and after the parallel region; must check that no race conditions occur.
     PRIVATE ( list ) – each thread has its own copy of the variable, which is considered local to that parallel construct; private variables have to be initialized inside the parallel region and are considered undefined outside of that region. Do-loop counters are always private.
     DEFAULT ( NONE | SHARED | PRIVATE ) – any unstated variables can be defaulted to shared or private; NONE says all variables must be declared in the SHARED or PRIVATE clauses.
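     As a hedged illustration of the DO construct (the routine and variable names are assumptions), the combined PARALLEL DO form below forks a team and splits the loop iterations among its threads; each iteration is independent, so no thread ever writes an element another thread reads.

        subroutine axpy(n, a, x, y)
           implicit none
           integer, intent(in)    :: n
           real,    intent(in)    :: a, x(n)
           real,    intent(inout) :: y(n)
           integer :: i
        !$OMP PARALLEL DO DEFAULT(NONE) SHARED(n, a, x, y) PRIVATE(i)
           do i = 1, n
              y(i) = a * x(i) + y(i)   ! independent iterations, safe to run concurrently
           end do
        !$OMP END PARALLEL DO
        end subroutine axpy

     The loop counter i would be private even without the clause, since do-loop counters are always private.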

  4. Clauses
     FIRSTPRIVATE ( list ) – gives each private copy the value of the original variable when entering the parallel region.
     LASTPRIVATE ( list ) – gives the exiting variable the value from the last loop iteration or the final section.
     REDUCTION ( operator : list ) – ensures that a shared variable location is written to by one thread at a time; each thread works on a private copy of the shared variable, and the copies are combined at the end of the parallel region. Operators include +, *, -, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, IOR, and IEOR.
     IF ( scalar logical expression ) – allows the parallel region to be run sequentially when the expression is false.
     NUM_THREADS ( scalar integer expression ) – declares the number of threads the region is forked into (optional; not required).
     SCHEDULE ( type, chunk ) – type can be STATIC, DYNAMIC, or GUIDED and helps determine the efficiency of the loop:
     • STATIC – divides the iterations among the threads once, at the beginning; if a chunk size is set, the last thread may receive a different number of iterations than the others. Offers the best performance when all iterations require the same computational time.
     • DYNAMIC – each thread is given a small amount of work (chunk iterations) and receives more when it finishes; if chunk is not specified, the default is one. This obviously increases overhead.
     • GUIDED – a combination of the two: large chunks are handed out first, and the chunk size then decreases roughly exponentially.
     NOWAIT – an end clause that keeps threads from waiting at the end of a work-sharing region, letting them continue on to the next one; without it, there is an implied barrier where all threads catch up and synchronize with each other.
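     A short sketch (assumed function and array names) combining two of these clauses: REDUCTION(+:mass) gives each thread a private running sum that is combined with + when the loop ends, and SCHEDULE(GUIDED) hands out large chunks of iterations first and progressively smaller ones, which helps when iteration costs are uneven.

        function total_mass(n, conc, volume) result(mass)
           implicit none
           integer, intent(in) :: n
           real,    intent(in) :: conc(n), volume(n)
           real    :: mass
           integer :: i
           mass = 0.0
        !$OMP PARALLEL DO DEFAULT(NONE) SHARED(n, conc, volume) PRIVATE(i) &
        !$OMP    REDUCTION(+:mass) SCHEDULE(GUIDED)
           do i = 1, n
              mass = mass + conc(i) * volume(i)
           end do
        !$OMP END PARALLEL DO
        end function total_mass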

  5. Reference: Van der Pas, R. (2005, June 1-4). An Introduction into OpenMP. Presented at the University of Oregon.
     REACTIVE TRANSPORT MODELING

  6. Performance Analysis
     Note: debug versus release mode
     Governing equation:
        \frac{\partial C}{\partial t} = -v \frac{\partial C}{\partial x} + D \frac{\partial^2 C}{\partial x^2} + SS + \text{reactions}
     Operator split:
        \frac{\partial C}{\partial t} = -v \frac{\partial C}{\partial x}, \qquad
        \frac{\partial C}{\partial t} = D \frac{\partial^2 C}{\partial x^2}, \qquad
        \frac{\partial C}{\partial t} = SS, \qquad
        \frac{\partial C}{\partial t} = \text{reactions}

  7. [Figure: 100 Species in 1D Column after 40 years]
     100 Species Problem Specifics
     • Parallelized in two places (a hedged sketch of both loops follows this slide):
       – Advection-Dispersion Equation, with a parallel Do-loop splitting the species iterations between threads
       – Reactions, with a parallel Do-loop splitting the node iterations between threads, the same way RT3D is done
     • Results presented are from Debug-mode runs

     Simulation time (yr)                  40
     Length (m)                            2000
     Velocity (m/yr)                       5
     Δx (m)                                1
     Δt (yr)                               0.1
     Dispersion coefficient Dx (m^2/yr)    50
     Courant number                        0.5
     Peclet number                         0.1
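     The presentation does not include the source itself, so the following is a hypothetical sketch of the two parallel Do-loops described above: the routine names (advect_disperse_species, react_at_node), the conc(node, species) array layout, and the argument lists are all assumptions. The scheduling clauses mirror the combination the timing slides below found fastest: static for advection-dispersion and guided for the reactions.

        subroutine operator_split_step(nnodes, nspecies, conc, v, disp, dt)
           implicit none
           integer, intent(in)    :: nnodes, nspecies
           real,    intent(inout) :: conc(nnodes, nspecies)
           real,    intent(in)    :: v, disp, dt
           integer :: is, inode

           ! (1) Advection-dispersion: species iterations split between threads
        !$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(is) SCHEDULE(STATIC)
           do is = 1, nspecies
              call advect_disperse_species(conc(:, is), nnodes, v, disp, dt)
           end do
        !$OMP END PARALLEL DO

           ! (2) Reactions: node iterations split between threads, as in RT3D
        !$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(inode) SCHEDULE(GUIDED)
           do inode = 1, nnodes
              call react_at_node(conc(inode, :), nspecies, dt)
           end do
        !$OMP END PARALLEL DO
        end subroutine operator_split_step

     Each species column can be transported independently, and each node's chemistry can be reacted independently, which is why both loops parallelize cleanly.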

  8. Timing - Static Scheduling
     Threads                     1            2           3           4
     Program run time            172.77765    88.43443    62.32897    49.229939
     Program speedup             -            1.953737    2.772028    3.5096052
     Program efficiency          -            0.976869    0.924009    0.8774013
     Reaction run time           134.536829   66.42403    44.81983    35.355174
     Reaction speedup            -            2.025424    3.001726    3.80529393
     Reaction efficiency         -            1.012712    1.000575    0.95132348
     Adv-Disp run time           35.341182    18.9151     14.41493    10.701726
     Adv-Disp speedup            -            1.868411    2.451707    3.3023815
     Adv-Disp efficiency         -            0.934206    0.817236    0.82559538
     Time spent in reactions     77.87%       75.11%      71.91%      71.82%
     Time spent in Adv-Disp      20.45%       21.39%      23.13%      21.74%
     Don't focus on the program speedup and efficiency; just look at the parallelized sections.

     Timing - Guided Scheduling
     Threads                     1            2           3           4
     Program run time            172.77765    93.36956    65.11877    51.98954
     Program speedup             -            1.850471    2.65327     3.32331561
     Program efficiency          -            0.925235    0.884423    0.8308289
     Reaction run time           134.536829   66.10992    44.28971    33.657039
     Reaction speedup            -            2.035048    3.037654    3.99728654
     Reaction efficiency         -            1.017524    1.012551    0.99932164
     Adv-Disp run time           35.341182    23.54525    17.70636    14.985179
     Adv-Disp speedup            -            1.50099     1.99596     2.35840907
     Adv-Disp efficiency         -            0.750495    0.66532     0.58960227
     Time spent in reactions     77.87%       70.80%      68.01%      64.74%
     Time spent in Adv-Disp      20.45%       25.22%      27.19%      28.82%

  9. Timing - Static Adv-Disp & Guided Reactions
     Threads                     1            2           3           4
     Program run time            172.77765    88.07855    61.75762    48.335757
     Program speedup             -            1.961631    2.797673    3.57453076
     Program efficiency          -            0.980816    0.932558    0.89363269
     Reaction run time           134.536829   66.12452    44.33819    34.313412
     Reaction speedup            -            2.034598    3.034333    3.92082341
     Reaction efficiency         -            1.017299    1.011444    0.98020585
     Adv-Disp run time           35.341182    18.84921    14.26557    10.735491
     Adv-Disp speedup            -            1.874943    2.477375    3.29199494
     Adv-Disp efficiency         -            0.937471    0.825792    0.82299873
     Time spent in reactions     77.87%       75.07%      71.79%      70.99%
     Time spent in Adv-Disp      20.45%       21.40%      23.10%      22.21%
     Note the superlinear speedup: the reaction section's efficiency is slightly above 1 at 2 and 3 threads.
     [Figure: 100 Species Runtimes]

  10. [Figures: 100 Species Speedup; Vinyl Chloride after 10000 days]

  11. RT3D Problem Specifics
      • A program called MT3D solves the advection, dispersion, and source/sink equations and calls the RT3D subroutines to solve the reaction equation.
      • The specific problem solved in this example was the sequential decay of PCE, TCE, DCE, and VC.
      • The continuous-source spill concentration of PCE was 1000 mg/L at the well.
      • The initial levels of all chemicals in the aquifer were 0.0 mg/L.
      • The site was 510 m x 310 m x 100 m, giving a 51 x 31 x 10 grid.
      • The reactions solved were as follows (a hedged Fortran sketch of these rate laws follows this slide):
        R_PCE = -k1*[PCE]
        R_TCE = k1*Y_TCE/PCE*[PCE] - k2*[TCE]
        R_DCE = k2*Y_DCE/TCE*[TCE] - k3*[DCE]
        R_VC  = k3*Y_VC/DCE*[DCE]  - k4*[VC]
        with k1 = 0.005 day^-1, k2 = 0.003 day^-1, k3 = 0.002 day^-1, k4 = 0.001 day^-1,
        Y_TCE/PCE = 0.7920, Y_DCE/TCE = 0.7377, Y_VC/DCE = 0.6445.
      • Results presented are from a Release-mode version.

      Timing - Loop Around Row Do Loop (Static Scheduling)
      Threads                     1           2           3           4
      Program run time            394.4463    284.5834    258.6008    240.5579
      Program speedup             -           1.386048    1.525309    1.639715
      Program efficiency          -           0.693024    0.508436    0.409929
      RT3D run time               229.1753    120.3458    94.81803    75.2866
      RT3D speedup                -           1.904307    2.417002    3.044039
      RT3D efficiency             -           0.952153    0.805667    0.76101
      Time spent in RT3D          58.10%      42.29%      36.67%      31.30%
      Don't focus on the program speedup and efficiency; just look at the parallelized section.
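      The slide gives the rate laws and constants but not the code, so the following is a minimal sketch with assumed names (decay_rates, the argument order, and per-cell evaluation are illustrative, not RT3D's actual interface) of how the four first-order decay rates could be evaluated for a single grid cell.

         subroutine decay_rates(pce, tce, dce, vc, r_pce, r_tce, r_dce, r_vc)
            implicit none
            real, intent(in)  :: pce, tce, dce, vc           ! concentrations (mg/L)
            real, intent(out) :: r_pce, r_tce, r_dce, r_vc   ! reaction rates (mg/L/day)
            ! first-order decay constants (1/day) and yield coefficients from the slide
            real, parameter :: k1 = 0.005, k2 = 0.003, k3 = 0.002, k4 = 0.001
            real, parameter :: y_tce_pce = 0.7920, y_dce_tce = 0.7377, y_vc_dce = 0.6445

            r_pce = -k1 * pce
            r_tce =  k1 * y_tce_pce * pce - k2 * tce
            r_dce =  k2 * y_dce_tce * tce - k3 * dce
            r_vc  =  k3 * y_vc_dce  * dce - k4 * vc
         end subroutine decay_rates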

  12. Timing - Loop Around Row Do Loop (Guided Scheduling)
      Threads                     1           2           3           4
      Program run time            394.4463    280.7615    247.1585    227.8098
      Program speedup             -           1.404916    1.595924    1.731472
      Program efficiency          -           0.702458    0.531975    0.432868
      RT3D run time               229.1753    117.2388    80.37176    60.98532
      RT3D speedup                -           1.954774    2.851441    3.757877
      RT3D efficiency             -           0.977387    0.95048     0.939469
      Time spent in RT3D          58.10%      41.76%      32.52%      26.77%
      [Figure: RT3D Decay Problem Runtimes]

  13. Conclusion
      • Clearly the capabilities of OpenMP are limited by the available computer architecture. Much more speedup is possible with hundreds of processors in a cluster system, possibly using Message Passing Interface routines, but OpenMP leaves the sequential code intact, is easy to implement, and achieves substantial speedup when a limited number of processors are available on a shared-memory system.
      • Options for future research include a hybrid MPI/OpenMP code utilizing the benefits of both standards.
      • OpenMP is available primarily in commercial compilers such as Intel Visual Fortran and the PGI compilers.
      • The Omni compiler might be a free option with OpenMP support: http://phase.hpcc.jp/Omni/ (I have not tried it, so I don't know whether it works.)
