Is 2.44 trillion unknowns the largest finite element system that can be solved today?
U. Rüde (LSS Erlangen, ruede@cs.fau.de)
Lehrstuhl für Informatik 10 (Systemsimulation), Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
Advances in Numerical Algorithms and High Performance Computing
University College London, April 14-15, 2014
Overview
Motivation
- How fast our computers are
- Some aspects of computer architecture
- Parallelism everywhere
Where we stand: scalable parallel multigrid
- Matrix-free multigrid FE solver
- Hierarchical Hybrid Grids (HHG)
Multiphysics applications with multigrid and beyond
- Geodynamics: Terra-Neo
What I will not talk about today:
- Electron beam melting
- Fully resolved 2- and 3-phase bubbly and particulate flows
- Electroosmotic flows
- LBM, granular systems, multibody dynamics, GPUs/accelerators, medical applications, image processing, real-time applications
Conclusions
High Performance Computer Systems (on the way to Exa-Flops)
Computer Architecture is Hierarchical
Core:
- vectorization (SSE, AVX), i.e. vectors of 2-8 floating point numbers must be treated in blocks
- cores may have their own cache memories; several cores may share second/third level caches
- access to local (cache) memory is fast
- pipelining, superscalar execution
- each core may need several threads to hide memory access latency
Node:
- several CPU chips (2) may be combined with local memory to become a node; several cores (8) are on a CPU chip
- within a node we can use „shared memory parallelism", e.g. OpenMP
- access to remote memory is slow; memory access bottlenecks may occur
- sometimes nodes are equipped with accelerators (i.e. graphics cards)
Cluster:
- thousands of nodes are connected by a fast network; different network topologies
- between nodes message passing must be used, e.g. MPI
- high latency, low bandwidth
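To make the three levels concrete, here is a minimal hybrid sketch, not taken from the talk: MPI for message passing between nodes, OpenMP for shared-memory parallelism within a node, and a simple inner loop that the compiler can map onto the SIMD units of each core. The dot product is only a stand-in computation.

```cpp
// Minimal sketch of the hybrid programming model the hierarchy implies:
// MPI between nodes, OpenMP threads within a node, SIMD-friendly inner loop.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                       // message passing across nodes
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    double local_sum = 0.0;

    // shared-memory parallelism within a node; the loop body can be vectorized
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < n; ++i)
        local_sum += x[i] * y[i];

    double global_sum = 0.0;                      // communication over the network
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("dot product = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}
```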
What will Computers Look Like in 2020?
Super Computer (Heroic Computing)
- Cost: 200 Million €
- Parallel threads: 10^8 - 10^9
- 10^18 Flops, Memory: 10^15 - 10^17 Byte (1-100 PByte)
- Power consumption: 20 MW
Departmental Server (Mainstream Computing for R&D)
- Cost: 200 000 €
- Parallel threads: 10^5 - 10^6
- 10^15 Flops, Memory: 10^12 - 10^14 Byte (1-100 TByte)
- Power consumption: 20 kW
(Mobile) Workstation (Computing for the Masses)
- ... scale down by another factor of 100
But remember: predictions are difficult ... especially those about the future.
What Supercomputers are Like Today

                 JUQUEEN              SuperMUC
System           IBM Blue Gene/Q      IBM System x iDataPlex
Processor        IBM PowerPC A2       Intel Xeon E5-2680 8C
SIMD             QPX (256 bit)        AVX (256 bit)
Peak             5 872.0 TFlop/s      3 185.1 TFlop/s
Clock            1.6 GHz              2.8 GHz
Nodes            28 672               9 216
Node peak        204.8 GFlop/s        358.4 GFlop/s
S/C/T per node   1/16/64              2/16/32
GFlops per Watt  2.54                 0.94

SuperMUC: 3 PFlop/s
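The node-peak row can be reproduced from the other table entries. A small check, not from the slides, assuming each core sustains 8 double-precision Flops per cycle (4-wide SIMD with two floating-point operations per cycle) on both architectures:

```cpp
// Rough check of the "Node peak" row: peak = sockets * cores/socket * clock * 8 Flops/cycle.
#include <cstdio>

int main() {
    struct Node { const char* name; int sockets, cores; double clock_ghz; };
    const Node nodes[] = {
        {"JUQUEEN (PowerPC A2, QPX)",    1, 16, 1.6},
        {"SuperMUC (Xeon E5-2680, AVX)", 2,  8, 2.8},
    };
    for (const Node& n : nodes) {
        const double peak = n.sockets * n.cores * n.clock_ghz * 8;  // GFlop/s per node
        std::printf("%s: %.1f GFlop/s per node\n", n.name, peak);
    }
    return 0;
}
```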
Let's try to quantify: what are the limiting resources?
- Floating point operations/sec (Flops)
- Memory capacity
- Communication / memory access, ...?
What is the capacity of a contemporary supercomputer?
- Flops? Memory? Memory and communication bandwidth?
What are the resource requirements (e.g. to solve Laplace or Stokes)?
- Flops/DoF? Memory/DoF?
... isn't it surprising that there are hardly any publications that quantify efficient computing in this form?
Estimating the cost complexity of FE solvers
Memory requirement for 10^6 unknowns:
- solution vector: 8 MByte, plus 3 auxiliary vectors: 32 MByte
- stiffness & mass matrix, assuming 15 nonzeros per row (linear tet elements): 240 MByte
- an O(10) factor can be saved by a matrix-free implementation
Assume an asymptotically optimal solver (multigrid for a scalar elliptic PDE):
- 100 Flops/unknown, at efficiency η = 0.1
A machine with 1 GFlops and 100 MByte should solve 3 × 10^6 unknowns in 3 seconds.
A machine with 1 PFlops and 100 TByte should solve 3 × 10^12 unknowns in 3 seconds.
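The estimate can be written out explicitly. A minimal sketch of this back-of-the-envelope model, assuming 100 Flops and 4 × 8 bytes of memory per unknown and the 10% sustained efficiency stated above:

```cpp
// Back-of-the-envelope model from the slide: memory-limited problem size and
// time-to-solution of a matrix-free multigrid FE solve.
#include <cstdio>

int main() {
    const double flops_per_dof = 100.0;    // asymptotically optimal multigrid
    const double bytes_per_dof = 4 * 8.0;  // solution + 3 auxiliary vectors
    const double efficiency    = 0.1;      // fraction of peak actually sustained

    struct Machine { const char* name; double peak_flops; double memory_bytes; };
    const Machine machines[] = {
        {"1 GFlops / 100 MByte", 1e9,  100e6},
        {"1 PFlops / 100 TByte", 1e15, 100e12},
    };

    for (const Machine& m : machines) {
        const double dofs    = m.memory_bytes / bytes_per_dof;  // memory-limited size
        const double seconds = dofs * flops_per_dof / (efficiency * m.peak_flops);
        std::printf("%s: %.1e unknowns in %.1f s\n", m.name, dofs, seconds);
    }
    return 0;
}
```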
What good are 10^12 finite elements?
Earth's oceans together hold ca. 1.3 × 10^9 km^3:
- we can resolve the volume of the planetary ocean globally with ca. 100 m resolution.
Earth's mantle has a volume of 0.91 × 10^12 km^3:
- we can resolve the volume of the mantle with ca. 1 km resolution.
The human circulatory system contains a volume of ca. 0.006 m^3:
- discretized with 10^15 finite elements, this gives a mesh size of 2 µm
- exa-scale: 10^3 operations per second and per volume element
- a red blood cell is ca. 7 µm in size; we have ca. 2.5 × 10^13 red blood cells
- with an exa-scale system we can spend 4 × 10^4 flops per second and per blood cell
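The resolutions quoted above follow from h = (V/N)^(1/3) for cube-shaped cells. A small sketch, not from the slides, that reproduces them using the volumes and element counts given above:

```cpp
// Mesh width h = (V / N)^(1/3) for the three examples on the slide.
#include <cstdio>
#include <cmath>

int main() {
    const double km3 = 1e9;   // m^3 per km^3

    struct Case { const char* name; double volume_m3; double elements; };
    const Case cases[] = {
        {"global ocean (10^12 elements)",        1.3e9 * km3,   1e12},
        {"Earth's mantle (10^12 elements)",      0.91e12 * km3, 1e12},
        {"circulatory system (10^15 elements)",  0.006,         1e15},
    };

    for (const Case& c : cases) {
        const double h = std::cbrt(c.volume_m3 / c.elements);  // mesh width in metres
        std::printf("%s: h = %.2e m\n", c.name, h);
    }
    return 0;
}
```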
Towards Scalable Algorithms and Data Structures
What are the problems?
Unprecedented levels of parallelism:
- maybe billions of cores/threads needed
Hybrid architectures:
- standard CPU vector units (SSE, AVX)
- accelerators (GPU, Intel Xeon Phi)
Memory wall:
- memory response is slow: latency
- memory transfer is limited: bandwidth
Power considerations dictate:
- limits to clock speed => multi-core
- limits to memory size (byte/flop)
- limits to address references per operation
- limits to resilience
Designing Algorithms for Large Scale Simulation Software
Would you want to propel a superjumbo
- with four strong jet engines (moderately parallel computing), or
- with 1,000,000 blow dryer fans (massively parallel multicore systems)?
The Energy Problem: a Thought Experiment
Assume 10^12 elements/nodes and that every entity must contribute to every other one (as is typical for an elliptic problem), equivalent to either
- multiplication with the inverse matrix (n × n), or
- the Coulomb interaction of the n particles.
This results in 10^24 data movements of 1 nanojoule each (optimistic).
Together: 10^15 Joule, i.e. 277 GWh, or 240 kilotons of TNT.
(The picture shows the Badger explosion of 1953 with 23 kilotons of TNT. Source: Wikipedia.)
Text from: Exascale Programming Challenges, Report of the 2011 Workshop on Exascale Programming Challenges, Marina del Rey, July 27-29, 2011
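The numbers of the thought experiment are easy to re-derive. A minimal check, not from the slides, assuming 1 nJ per data movement as above:

```cpp
// Sanity check of the thought experiment: 10^12 entities, all-to-all coupling,
// 1 nJ per data movement.
#include <cstdio>

int main() {
    const double entities       = 1e12;
    const double joule_per_move = 1e-9;                    // optimistic estimate
    const double moves          = entities * entities;     // dense n x n coupling
    const double joules         = moves * joule_per_move;  // 1e15 J

    const double gwh    = joules / 3.6e12;   // 1 GWh = 3.6e12 J
    const double kt_tnt = joules / 4.184e12; // 1 kiloton TNT = 4.184e12 J

    std::printf("%.0e data movements -> %.1e J = %.0f GWh = %.0f kt TNT\n",
                moves, joules, gwh, kt_tnt);
    return 0;
}
```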
Multigrid for FE on Peta-Scale Computers
Multigrid: V-Cycle
Goal: solve A_h u_h = f_h using a hierarchy of grids
- relax on the current grid
- compute the residual
- restrict the residual to the coarser grid
- solve the coarse problem by recursion
- interpolate the coarse-grid correction
- correct the fine-grid approximation
(repeated recursively over the grid hierarchy)
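For readers who want to see the cycle written out, below is a small self-contained sketch of a V-cycle for the 1D model problem -u'' = f with homogeneous Dirichlet boundaries, using damped Jacobi smoothing, full-weighting restriction, and linear interpolation. It only illustrates the recursive structure in the diagram; it is not the HHG solver.

```cpp
// Minimal V-cycle sketch for -u'' = f on (0,1), u(0) = u(1) = 0.
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// damped Jacobi relaxation for (2u_i - u_{i-1} - u_{i+1}) / h^2 = f_i
void smooth(Vec& u, const Vec& f, double h, int sweeps) {
    const double omega = 2.0 / 3.0;
    for (int s = 0; s < sweeps; ++s) {
        Vec old = u;
        for (std::size_t i = 1; i + 1 < u.size(); ++i)
            u[i] = (1 - omega) * old[i]
                 + omega * 0.5 * (h * h * f[i] + old[i - 1] + old[i + 1]);
    }
}

Vec residual(const Vec& u, const Vec& f, double h) {
    Vec r(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (2 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
    return r;
}

void v_cycle(Vec& u, const Vec& f, double h) {
    if (u.size() <= 3) {                    // coarsest grid: one interior unknown
        u[1] = 0.5 * h * h * f[1];          // direct solve
        return;
    }
    smooth(u, f, h, 2);                     // pre-smoothing
    Vec r = residual(u, f, h);

    const std::size_t nc = (u.size() - 1) / 2 + 1;
    Vec fc(nc, 0.0), ec(nc, 0.0);           // coarse rhs and coarse correction
    for (std::size_t i = 1; i + 1 < nc; ++i)              // full-weighting restriction
        fc[i] = 0.25 * (r[2 * i - 1] + 2 * r[2 * i] + r[2 * i + 1]);

    v_cycle(ec, fc, 2 * h);                 // solve the coarse error equation recursively

    for (std::size_t i = 1; i + 1 < nc; ++i) {            // linear interpolation of the correction
        u[2 * i]     += ec[i];
        u[2 * i - 1] += 0.5 * (ec[i - 1] + ec[i]);
    }
    u[u.size() - 2] += 0.5 * ec[nc - 2];    // last odd fine point (coarse boundary value is 0)

    smooth(u, f, h, 2);                     // post-smoothing
}

int main() {
    const int n = 256;                      // number of intervals (a power of two)
    const double h = 1.0 / n;
    const double pi = 3.14159265358979323846;
    Vec u(n + 1, 0.0), f(n + 1, 0.0);
    for (int i = 0; i <= n; ++i)
        f[i] = pi * pi * std::sin(pi * i * h);   // exact solution: sin(pi x)

    for (int k = 0; k < 10; ++k) {
        v_cycle(u, f, h);
        const Vec r = residual(u, f, h);
        double norm = 0.0;
        for (double ri : r) norm += ri * ri;
        std::printf("V-cycle %2d: ||r|| = %.3e\n", k + 1, std::sqrt(norm));
    }
    return 0;
}
```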
How fast can we make FE multigrid?
Parallelize „plain vanilla" multigrid for tetrahedral finite elements (figure: Bey's tetrahedral refinement):
- partition the domain
- parallelize all operations on all grids
- use clever data structures
- matrix-free implementation
Do not worry (so much) about coarse grids:
- idle processors?
- short messages?
- sequential dependency in the grid hierarchy?
Elliptic problems always require global communication. This cannot be accomplished by local relaxation, Krylov space acceleration, or domain decomposition without a coarse grid.
Hierarchical Hybrid Grids (HHG)
Joint work with Frank Hülsemann (now EDF, Paris), Ben Bergen (now Los Alamos), T. Gradl (Erlangen), B. Gmeiner (Erlangen)
HHG goal: ultimate parallel FE performance!
- unstructured coarse grid with regular refinement: structured substructures for efficiency
- superconvergence effects
- matrix-free implementation using the regular substructures:
  - constant stencil when the coefficients are constant
  - assembly on-the-fly for variable coefficients
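The matrix-free idea can be sketched as follows. This is not HHG code: a 7-point stencil on a structured box stands in for the 15-point stencils that arise on regularly refined tetrahedra. The point is that inside a regularly refined patch every interior row of the operator has the same (or cheaply recomputable) entries, so no global sparse matrix is ever stored.

```cpp
// Matrix-free operator application on the structured interior of a patch:
// the same 7 stencil weights are reused for every interior unknown.
#include <vector>

void apply_constant_stencil(const std::vector<double>& u, std::vector<double>& v,
                            int n, const double s[7]) {
    auto idx = [n](int i, int j, int k) { return (k * (n + 1) + j) * (n + 1) + i; };
    for (int k = 1; k < n; ++k)
        for (int j = 1; j < n; ++j)
            for (int i = 1; i < n; ++i)
                v[idx(i, j, k)] = s[0] * u[idx(i, j, k)]
                                + s[1] * u[idx(i - 1, j, k)] + s[2] * u[idx(i + 1, j, k)]
                                + s[3] * u[idx(i, j - 1, k)] + s[4] * u[idx(i, j + 1, k)]
                                + s[5] * u[idx(i, j, k - 1)] + s[6] * u[idx(i, j, k + 1)];
}

// With variable coefficients the weights would instead be assembled on the fly
// from the local coefficient values, so still no global matrix is stored.

int main() {
    const int n = 64;
    std::vector<double> u((n + 1) * (n + 1) * (n + 1), 1.0), v(u.size(), 0.0);
    const double laplace[7] = {6, -1, -1, -1, -1, -1, -1};   // scaled 3D Laplacian
    apply_constant_stencil(u, v, n, laplace);
    return 0;
}
```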
HHG refinement example: Input Grid
HHG refinement example: Refinement Level One
HHG refinement example: Refinement Level Two
HHG refinement example: Structured Interior
HHG refinement example: Edge Interior
Regular tetrahedral refinement
- structured refinement of tetrahedra
- use regular HHG patches for partitioning the domain (shown only in 2D for simplification)
- the HHG input mesh is quite large: on many cores, communication of ghost layers
- coarse grid with 132k elements, as assigned to the supercomputer; each tetrahedral element (≈ 132k) was assigned to one Jugen…
HHG Parallel Update Algorithm

for each vertex do
    apply operation to vertex
end for
update vertex primary dependencies

for each edge do
    copy from vertex interior
    apply operation to edge
    copy to vertex halo
end for
update edge primary dependencies

for each element do
    copy from edge/vertex interiors
    apply operation to element
    copy to edge/vertex halos
end for
update secondary dependencies
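The "copy to halo / update dependencies" steps map naturally onto nonblocking message passing when neighbouring primitives live on different processes. Below is a hedged sketch of such a halo exchange; the Neighbour structure and function name are illustrative and not part of HHG.

```cpp
// Sketch of a halo exchange: after operating on the interior of a primitive,
// its interface data is sent to neighbouring primitives' halos, and incoming
// halo data is received before the dependent primitives are updated.
#include <mpi.h>
#include <vector>

struct Neighbour {
    int rank;                        // owner of the adjacent primitive
    std::vector<double> send_halo;   // our interface data, copied from the interior
    std::vector<double> recv_halo;   // their interface data, copied into our halo
};

void exchange_halos(std::vector<Neighbour>& neighbours, int tag) {
    std::vector<MPI_Request> requests;
    requests.reserve(2 * neighbours.size());

    for (Neighbour& nb : neighbours) {                       // post all receives first
        requests.emplace_back();
        MPI_Irecv(nb.recv_halo.data(), static_cast<int>(nb.recv_halo.size()),
                  MPI_DOUBLE, nb.rank, tag, MPI_COMM_WORLD, &requests.back());
    }
    for (Neighbour& nb : neighbours) {                       // then send interface data
        requests.emplace_back();
        MPI_Isend(nb.send_halo.data(), static_cast<int>(nb.send_halo.size()),
                  MPI_DOUBLE, nb.rank, tag, MPI_COMM_WORLD, &requests.back());
    }
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(), MPI_STATUSES_IGNORE);
    // now "update ... dependencies" can proceed with consistent halo values
}
```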