REXI: breaking the time step constraint



  1. REXI: breaking the time step constraint David Acreman, Jemma Shipton, Colin Cotter and Beth Wingate

  2. Why REXI?
     • Trends in processor design are towards an increasing number of cores
     • Strong scaling of domain decomposition is limited
     • The timestep limits weak scaling
     • We need to find parallelism elsewhere
     Source: https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/

  3. Rational approximation of exponential integrator (REXI)
     • Apply n forward Euler time steps
     • Approximate the exponential by a summation of rational terms
     • $\alpha_k$ and $\beta_k$ are pre-computed complex numbers
     • Terms in the summation can be calculated in parallel
     Schreiber et al, 2017, Beyond spatial scalability limitations with a massively parallel method for linear oscillatory problems, International Journal of High Performance Computing Applications
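As a rough illustration of the structure on slide 3, the sketch below compares n forward Euler steps with a sum of independent shifted solves for a 2×2 oscillator. The poles and weights here come from a trapezoidal Cauchy contour quadrature, which is a swapped-in stand-in and not the pre-computed $\alpha_k$, $\beta_k$ of Schreiber et al; the matrix, contour radius and number of terms are assumptions chosen for the example.

```python
import numpy as np

def forward_euler(tau_L, u0, n):
    """Apply n forward Euler steps of size 1/n to du/ds = (tau*L) u on s in [0, 1]."""
    u = u0.astype(float)
    for _ in range(n):
        u = u + (tau_L @ u) / n
    return u

def rational_sum_exp(tau_L, u0, n_terms=48, radius=6.0):
    """Approximate exp(tau*L) u0 by a sum of independent shifted solves.

    Assumption: the poles z_k lie on a circle of the given radius, which must
    enclose the spectrum of tau*L (trapezoidal rule for the Cauchy integral).
    """
    identity = np.eye(tau_L.shape[0])
    theta = 2.0 * np.pi * np.arange(n_terms) / n_terms
    poles = radius * np.exp(1j * theta)
    weights = poles * np.exp(poles) / n_terms
    # Each term needs only its own pole and weight, so every solve below
    # could be handed to a separate group of processes.
    terms = [w * np.linalg.solve(z * identity - tau_L, u0)
             for z, w in zip(poles, weights)]
    # Poles come in conjugate pairs, so the imaginary parts cancel to round-off.
    return np.sum(terms, axis=0).real

theta_total = 2.0  # tau * omega for the oscillator (illustrative value)
tau_L = theta_total * np.array([[0.0, -1.0], [1.0, 0.0]])
u0 = np.array([1.0, 0.0])
exact = np.array([np.cos(theta_total), np.sin(theta_total)])

euler_error = np.linalg.norm(forward_euler(tau_L, u0, 1000) - exact)
rational_error = np.linalg.norm(rational_sum_exp(tau_L, u0) - exact)
print(euler_error, rational_error)
```

The forward Euler loop is inherently sequential, whereas the list comprehension in `rational_sum_exp` has no dependence between terms, which is the property REXI exploits.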

  4. Rational approximation of exponential integrator (REXI)
     • Approximate the exponential using Gaussian basis functions (M: no. of Gaussians, h: width of Gaussian)
     • Approximate the Gaussians as a sum of rational terms
     • $a_l$ and $\mu$ are pre-computed constants (Haut et al, 2015)
     • Terms in the sum over M can be calculated in parallel
     • $hM > |t\lambda_{\mathrm{MAX}}|$
     Haut et al, 2015, A high-order time-parallel scheme for solving wave propagation problems via the direct construction of an approximate time-evolution operator, IMA Journal of Numerical Analysis (2016) 36, 688–716
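The first step on slide 4 can be sketched numerically: a sum of 2M+1 shifted Gaussians of width h reproduces $e^{ix}$ only while $|x|$ stays well inside $hM$, which is where the condition $hM > |t\lambda_{\mathrm{MAX}}|$ comes from. The normalisation of the basis function and weights below is an assumption chosen so that the sum reproduces $e^{ix}$; it may differ from the constants in Haut et al, and the second step (rational approximation of each Gaussian) is not shown.

```python
import numpy as np

def gaussian_sum_exp(x, h, M):
    """Approximate exp(i*x) by 2M+1 Gaussians of width h centred at -m*h.

    Assumed normalisation (not taken from the slides):
    psi_h(x) = exp(-x**2 / (4*h**2)) / sqrt(4*pi),  b_m = exp(h**2) * exp(-1j*m*h).
    """
    x = np.asarray(x, dtype=float)
    m = np.arange(-M, M + 1)
    shifted = x[:, None] + m[None, :] * h
    basis = np.exp(-shifted**2 / (4.0 * h**2)) / np.sqrt(4.0 * np.pi)
    weights = np.exp(h**2) * np.exp(-1j * m * h)
    return basis @ weights

h, M = 0.2, 300                          # h*M = 60
inside = np.linspace(-45.0, 45.0, 181)   # well inside |x| < h*M
outside = np.array([80.0])               # beyond h*M

err_inside = np.max(np.abs(gaussian_sum_exp(inside, h, M) - np.exp(1j * inside)))
err_outside = np.abs(gaussian_sum_exp(outside, h, M) - np.exp(1j * outside))[0]
print(err_inside, err_outside)
```

Inside the covered interval the error is at round-off level, while beyond $hM$ no Gaussian reaches the evaluation point and the approximation collapses to zero.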

  5. REXI study
     • REXI results presented in Schreiber et al (2017) for benchmark problems applied to shallow water equations
     • We will also solve the shallow water equations but with some significant differences:
       • Finite difference or spectral → finite elements (Firedrake)
       • Regular unit square → icosahedral sphere in physical co-ordinates
       • Looking for speed up over conventional time stepping
     Schreiber et al, 2017, Beyond spatial scalability limitations with a massively parallel method for linear oscillatory problems, International Journal of High Performance Computing Applications

  6. Convergence tests
     • Initial conditions: polar wave
     • Run REXI with varying number of terms (M) with h=0.2 (width of Gaussian)
     • Check L2 error norm vs reference solution (implicit midpoint method with 25s time step)
     • Increase REXI time step (t) and determine the number of terms (M) required to achieve convergence
     • Expect: $hM > |t\lambda_{\mathrm{MAX}}|$
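The expectation on slide 6 can be turned into a small helper that returns the smallest M allowed by the inequality. This is a minimal sketch: the function name is hypothetical and the numbers in the usage line are illustrative values, not results from the slides.

```python
import math

def min_rexi_terms(h, t, lambda_max):
    """Smallest integer M satisfying the strict inequality h*M > |t*lambda_max|."""
    return math.floor(abs(t * lambda_max) / h) + 1

# Illustrative values only: doubling t roughly doubles the required M.
m_short = min_rexi_terms(0.5, 64.0, 0.25)
m_long = min_rexi_terms(0.5, 128.0, 0.25)
print(m_short, m_long)
```

For fixed h and $\lambda_{\mathrm{MAX}}$ the bound grows linearly with the REXI time step t.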

  7. h=0.2, refinement level=3
     [Figure: U L2 error norm vs number of REXI terms (M) for t=7 500s, 15 000s, 30 000s, 60 000s and 120 000s; annotated with $hM > |t\lambda_{\mathrm{MAX}}|$]

     t/ks   M     $\lambda_{\mathrm{MAX}}$
     7.5    64    0.0017
     15     112   0.0015
     30     224   0.0015
     60     432   0.0014
     120    864   0.0016

     • Increasing t requires larger M (linear) ✅
     • Increasing t increases error ✅

  8. hM is constrained but what about h on its own?
     [Figure: U L2 error norm vs number of REXI terms (M) at t=30 000s, refinement level=3, for h=0.1, 0.2, 0.4, 0.8, 1.6, 2.4 and 3.2]

     h     M     h×M
     0.2   224   44.8
     0.4   112   44.8
     0.8   64    51.2
     1.6   32    51.2

     $hM > |t\lambda_{\mathrm{MAX}}| \approx 45 \Rightarrow \lambda_{\mathrm{MAX}} \approx 0.0015$
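The arithmetic behind slide 8 can be checked directly from the (h, M) pairs it lists. The sketch below recomputes the products and the value of hM/t they imply at t=30 000s; reading hM/t as an estimate of $\lambda_{\mathrm{MAX}}$ assumes the listed M is the minimum needed for convergence.

```python
t = 30000.0  # REXI time step in seconds (slide 8)
rows = [(0.2, 224), (0.4, 112), (0.8, 64), (1.6, 32)]  # (h, M) pairs from slide 8

products = [h * M for h, M in rows]
implied_lambda = [p / t for p in products]  # hM/t, an estimate of lambda_MAX

for (h, M), p, lam in zip(rows, products, implied_lambda):
    print(f"h={h}, M={M}: hM={p:.1f}, hM/t={lam:.5f}")
```

The two smaller values of h give hM=44.8, matching the quoted bound of about 45, while the two larger values give a somewhat larger product.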

  9. Can we use h=1.6 with a larger t?
     [Figure: four panels, t=30 000s, t=60 000s, t=120 000s and t=240 000s; each shows U L2 error norm vs number of REXI terms (M) for h=0.1, 0.2, 0.4, 0.8 and 1.6]

  10. What about resolution ($\lambda_{\mathrm{MAX}}$)?
      [Figure: four panels, refinement level=2, 3, 4 and 5; each shows U L2 error norm vs number of REXI terms (M) for h=0.1, 0.2, 0.4, 0.8 and 1.6]

  11. Scaling tests
      • Measure time for a single REXI step using PyOP2 timed stage (average over three runs, no I/O in timed region)
      • h=0.2 and 1.6, minimum M for convergence, refinement level 3
      • Single node scaling on Archer: 24 cores per node (2x12)
      • Specify placement to ensure MPI processes are distributed evenly between sockets
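The measurement procedure on slide 11 (time one step, average over three runs, keep I/O outside the timed region) can be sketched with the standard library. This is a stand-in for illustration only: it uses `time.perf_counter` rather than the PyOP2 timed stage, and `dummy_step` is a hypothetical placeholder for a single REXI step.

```python
import time

def average_wallclock(step, repeats=3):
    """Return the mean wall-clock time in seconds of calling step() `repeats` times."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        step()  # only the step itself is inside the timed region
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

def dummy_step():
    """Hypothetical placeholder for a single REXI step."""
    sum(i * i for i in range(10000))

mean_seconds = average_wallclock(dummy_step)
print(mean_seconds)  # any output happens after timing has finished
```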

  12. h=0.2, refinement level=3
      [Figure: Model time / Wallclock time vs No. of processors (up to 24) for t=7500, M=64; t=15000, M=112; t=30000, M=224; t=60000, M=432; t=120000, M=864]
      Reference solution: 115 (1 proc) → 1300 (24 procs)

  13. h=1.6, refinement level=3
      [Figure: Model time / Wallclock time vs No. of processors (up to 24) for t=30000, M=32; t=60000, M=64; t=120000, M=112; t=240000, M=224]
      Reference solution: 115 (1 proc) → 1300 (24 procs)
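The reference-solution figures quoted on slides 12 and 13 (model time / wallclock time of 115 on 1 process and 1300 on 24 processes) imply a speed-up and parallel efficiency that can be computed as follows. The helper name is hypothetical, and treating the ratio of the two rates as the speed-up is the usual strong-scaling definition rather than something stated on the slides.

```python
def parallel_efficiency(rate_serial, rate_parallel, n_procs):
    """Return (speed-up, efficiency) from model-time/wallclock-time rates."""
    speedup = rate_parallel / rate_serial
    return speedup, speedup / n_procs

# Reference solution: 115 on 1 process, 1300 on 24 processes.
speedup, efficiency = parallel_efficiency(115.0, 1300.0, 24)
print(f"speed-up = {speedup:.1f}, efficiency = {efficiency:.2f}")
```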

  14. Future work
      • What value of h to use? Does this depend on the initial conditions (or other factors)?
      • How to trade off speed and accuracy?
      • For a given spatial resolution (affects $\lambda_{\mathrm{MAX}}$) and t:
        • Determine maximum h and minimum M for convergence ($hM > |t\lambda_{\mathrm{MAX}}|$)
        • Measure error vs reference solution and time to solution
      • Improve time to solution by reducing MPI overhead: examine in more detail with profiler (e.g. determine load balance)

  15. Build with Intel toolchain and run DG advection example under MPI profiler
      [Figure: MPI profiler timeline. Each line is an MPI process; annotations mark time in MPI_Bcast and communication between processes]
