From Stencils to Elliptic PDE Solvers

U. Rüde (FAU Erlangen, ulrich.ruede@fau.de)
joint work with
B. Gmeiner, H. Köstler, H. Stengel (FAU)
M. Huber, C. Waluga, L. John, B. Wohlmuth (TUM)
M. Mohr, J. Weismüller, P. Bunge (LMU)

Lehrstuhl für Simulation, FAU Erlangen-Nürnberg
Advanced Stencil-Code Engineering, 12. – 17. April 2015, Seminar 15161
Stencils and Elliptic Solvers - Ulrich Rüde
What are we up to?

Stencil code engineering?
- one step in designing efficient parallel algorithms:
  application ➞ model ➞ discretization ➞ solver ➞ simulation ➞ validation
- this week at Dagstuhl: an opportunity to build a bridge between CS and math,
  but also the danger of a Babylonian confusion of terms

In my talk I will briefly touch on 3 topics:
- How can we optimize stencil codes? Architecture awareness can bring large speedups.
- What algorithms should be considered? Jacobi iteration is not enough.
- Where do we stand? HHG: a stencil-based FE solver as an example.
Stencils

A stencil is a geometric pattern of weights applied to a grid function at each location in a structured grid.

Example: the mother of all PDEs (the Laplace operator in 2D, here in the Poisson equation)

$$
-\Delta u = f
\quad\longrightarrow\quad
\frac{1}{h^2}
\begin{bmatrix} & -1 & \\ -1 & 4 & -1 \\ & -1 & \end{bmatrix} u_h = f_h
$$

- structured grid in d dimensions: $G_h \subset h\mathbb{Z}^d$
- (scalar, real-valued) grid function on a structured grid: $u_h : G_h \to \mathbb{R}$, an element of a vector space
- constant stencils on a rectangular (cuboid) grid are related to sparse Toeplitz matrices via lexicographic linearization of the grid function
- Toeplitz matrix: a matrix with constant entries along each diagonal (here also banded)
- In a finite grid: what happens at the boundaries?
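The link between a constant stencil and its matrix can be sketched in a few lines. The following is a minimal illustration, not code from the talk: the function names `stencil_matrix` and `apply_stencil` are hypothetical, the stencil is left unscaled (no factor 1/h²), and values outside the n × n grid are assumed to be zero, which is one possible answer to the boundary question.

```python
def stencil_matrix(n):
    """Dense matrix of the unscaled 5-point stencil on an n x n grid.

    Unknowns are numbered lexicographically (row by row); grid values
    outside the n x n block are assumed to be zero.
    """
    A = [[0.0] * (n * n) for _ in range(n * n)]
    for i in range(n):
        for j in range(n):
            row = i * n + j
            A[row][row] = 4.0
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    A[row][ii * n + jj] = -1.0
    return A


def apply_stencil(u):
    """Apply the same stencil directly to a grid function (list of rows)."""
    n = len(u)

    def at(i, j):
        # Zero outside the grid (assumption made for this sketch).
        return u[i][j] if 0 <= i < n and 0 <= j < n else 0.0

    return [[4.0 * at(i, j) - at(i - 1, j) - at(i + 1, j)
             - at(i, j - 1) - at(i, j + 1)
             for j in range(n)] for i in range(n)]


n = 3
u = [[float(i * n + j + 1) for j in range(n)] for i in range(n)]
A = stencil_matrix(n)

# Lexicographic linearization of the grid function.
flat = [value for row in u for value in row]
matrix_result = [sum(a * x for a, x in zip(row, flat)) for row in A]
stencil_result = [value for row in apply_stencil(u) for value in row]
```

Both results agree entry by entry. Note that the first off-diagonal of `A` is interrupted wherever a grid row ends (for example `A[2][3]` is zero), so on a finite grid the matrix is block Toeplitz with Toeplitz blocks rather than Toeplitz in the strict sense.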
Applying a stencil

- A stencil can be applied to a grid function: a "sweep" results in a new grid function.
- This occurs in:
  - filters (signal processing)
  - linear iterative schemes: simultaneous (Jacobi) versus successive (Gauss-Seidel) update
- When the stencil has s entries and the grid has N points, the computational cost is 2sN flops (roughly one multiplication and one addition per stencil entry at each grid point).
- The stencil application (sweep) is often memory bound.
- To climb the memory wall, one may use spatial and temporal blocking (when stencils are applied repeatedly).
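The difference between the simultaneous and the successive update can be sketched as follows. This is a minimal illustration under stated assumptions: the 5-point stencil for −Δu = f on the unit square with zero Dirichlet boundary values, and hypothetical helper names (`jacobi_sweep`, `gauss_seidel_sweep`, `sweep_flops`, `residual_norm`) that do not come from the talk.

```python
def jacobi_sweep(u, f, h):
    """Simultaneous update: every new value uses only old values."""
    n = len(u)
    new = [row[:] for row in u]  # boundary values are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (h * h * f[i][j]
                                + u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1])
    return new


def gauss_seidel_sweep(u, f, h):
    """Successive update: values are overwritten in place and reused at once."""
    n = len(u)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            u[i][j] = 0.25 * (h * h * f[i][j]
                              + u[i - 1][j] + u[i + 1][j]
                              + u[i][j - 1] + u[i][j + 1])
    return u


def sweep_flops(s, num_points):
    """Cost model from the slide: 2 flops per stencil entry and grid point."""
    return 2 * s * num_points


def residual_norm(u, f, h):
    """Maximum norm of f_h - (1/h^2) * stencil(u_h) over the interior points."""
    n = len(u)
    return max(abs(f[i][j] - (4.0 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                              - u[i][j - 1] - u[i][j + 1]) / (h * h))
               for i in range(1, n - 1) for j in range(1, n - 1))


n = 9                      # grid points per direction, including the boundary
h = 1.0 / (n - 1)
f = [[1.0] * n for _ in range(n)]
u_jacobi = [[0.0] * n for _ in range(n)]
u_gs = [[0.0] * n for _ in range(n)]
for _ in range(20):
    u_jacobi = jacobi_sweep(u_jacobi, f, h)
    u_gs = gauss_seidel_sweep(u_gs, f, h)

res_jacobi = residual_norm(u_jacobi, f, h)
res_gs = residual_norm(u_gs, f, h)
```

The Jacobi sweep needs a second array, while Gauss-Seidel works in place, which changes both the memory traffic and the data dependencies that a blocked implementation has to respect. On this small model problem the Gauss-Seidel residual after 20 sweeps is smaller than the Jacobi residual.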
Stencil optimizations

https://www10.informatik.uni-erlangen.de/Research/Projects/DiME/index.html

- DFG project (1996-2008): Data Local Iterative Methods For The Efficient Solution of Partial Differential Equations
- DiME Pack software
- Focus on cache blocking techniques and their interplay with the CPU microarchitecture (pipelining, superscalarity, in-order/out-of-order execution, etc.)
- systematic performance analysis, monitoring tools, performance counters, ...
- 2D and 3D grids, mostly structured
- tiling, blocking, fusing, ...
- but also memory layout transformations: red-black, padding, ...
- Example: temporally skewed 2D blocking, especially beneficial for in-order architectures with a small and fast L1 cache
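As a rough illustration of the loop structure behind tiling, the sketch below traverses one Jacobi-type sweep tile by tile. It shows plain spatial blocking only, not the temporally skewed variant mentioned above, and it is not taken from DiME Pack; the function names and tile sizes are assumptions made for this example. In Python the traversal order has no measurable cache effect, so the point is the loop nest, which carries over to C or Fortran.

```python
def sweep_plain(u):
    """One Jacobi-type averaging sweep over the interior, row by row."""
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1])
    return new


def sweep_tiled(u, tile_i, tile_j):
    """The same sweep, traversed tile by tile (spatial blocking)."""
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]
    for i0 in range(1, n - 1, tile_i):              # loops over tiles
        for j0 in range(1, m - 1, tile_j):
            for i in range(i0, min(i0 + tile_i, n - 1)):      # loops inside a tile
                for j in range(j0, min(j0 + tile_j, m - 1)):
                    new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                        + u[i][j - 1] + u[i][j + 1])
    return new


n, m = 10, 13
u = [[float((3 * i + 7 * j) % 11) for j in range(m)] for i in range(n)]
plain = sweep_plain(u)
tiled = sweep_tiled(u, tile_i=4, tile_j=5)
```

Because a Jacobi sweep reads only old values, any traversal order gives the same result, so `plain` and `tiled` are identical. For Gauss-Seidel or for temporal blocking the tiles must additionally respect the data dependencies, which is where skewing comes in.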
ExaStencils

- DFG SPPExa, 2013-15
- Domain Specific Language (DSL) approach
- http://www.exastencils.org
- optimizing stencil codes by transformations in the context of multigrid algorithms
- several talks upcoming here at Dagstuhl
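Algorithms: Good stencil codes are hierarchical

Applied on an n × n grid with lexicographic ordering of the unknowns, the 5-point stencil corresponds to the matrix-vector product $A u_h$ with

$$
A = \begin{pmatrix}
 T & -I &        &    \\
-I &  T & \ddots &    \\
   & \ddots & \ddots & -I \\
   &    & -I     & T
\end{pmatrix},
\qquad
T = \begin{pmatrix}
 4 & -1 &        &    \\
-1 &  4 & \ddots &    \\
   & \ddots & \ddots & -1 \\
   &    & -1     & 4
\end{pmatrix},
\qquad
u_h = (u_{11}, \dots, u_{1n}, u_{21}, \dots, u_{2n}, \dots, u_{nn})^T
$$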
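One common reading of "hierarchical" is multigrid: the stencil sweep is used only as a smoother, and a hierarchy of coarser grids removes the smooth error components that Jacobi reduces very slowly. The sketch below is a minimal 1D illustration under stated assumptions, not the HHG implementation: it solves −u″ = f on (0, 1) with zero boundary values, uses weighted Jacobi (ω = 2/3), full-weighting restriction and linear interpolation, and all function names are hypothetical.

```python
def residual(u, f, h):
    """r = f - A u for the 1D stencil (1/h^2) [-1 2 -1]; boundary entries stay zero."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps (simultaneous update)."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * 0.5 * h * h * ri for ui, ri in zip(u, r)]
    return u


def restrict(r):
    """Full weighting: coarse value = (1/4, 1/2, 1/4) average of fine values."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for i in range(1, nc):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def prolong(ec):
    """Linear interpolation from the coarse grid to the next finer grid."""
    nc = len(ec) - 1
    e = [0.0] * (2 * nc + 1)
    for i in range(nc + 1):
        e[2 * i] = ec[i]
    for i in range(nc):
        e[2 * i + 1] = 0.5 * (ec[i] + ec[i + 1])
    return e


def v_cycle(u, f, h):
    """One V(2,2)-cycle; the recursion ends on the grid with one interior point."""
    n = len(u) - 1
    if n == 2:
        u = u[:]
        u[1] = 0.5 * h * h * f[1]   # exact solve: (2/h^2) u_1 = f_1, zero boundaries
        return u
    u = smooth(u, f, h, sweeps=2)
    rc = restrict(residual(u, f, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)
    u = [ui + ei for ui, ei in zip(u, prolong(ec))]
    return smooth(u, f, h, sweeps=2)


n = 64                      # number of intervals, a power of two
h = 1.0 / n
f = [1.0] * (n + 1)

u_mg = [0.0] * (n + 1)
for _ in range(5):
    u_mg = v_cycle(u_mg, f, h)

# For comparison: weighted Jacobi alone with the same number of fine-grid sweeps.
u_jac = smooth([0.0] * (n + 1), f, h, sweeps=20)

res_mg = max(abs(x) for x in residual(u_mg, f, h))
res_jac = max(abs(x) for x in residual(u_jac, f, h))
```

Five V-cycles and twenty plain Jacobi sweeps perform the same number of sweeps on the finest grid, yet the multigrid residual is orders of magnitude smaller, while the Jacobi residual has barely moved from its initial value. Every component of the cycle (smoother, residual, restriction, interpolation) is itself a stencil operation, so the optimizations discussed earlier apply on every level.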