Evaluation of Message Passing Communication Patterns in Finite Element Solution of Coupled Problems
Renato N. Elias, Jose J. Camata, Albino A. Aveleda, Alvaro L. G. A. Coutinho
High Performance Computing Center (NACAD), Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil
VECPAR'10, Berkeley, CA (USA), 2010
Summary: Motivation; EdgeCFD software features; Parallel models; MPI collective; MPI peer-to-peer; Threaded parallelism; Benchmark systems; Benchmark problem; Results; Conclusions.
Motivations: Petaflop computing poses new challenges and paradigms; commitment to continuous performance improvements in our software; understand some hardware and software issues; check the evolution of Intel Xeon processors. Context: the largest research system in South America, with 6,464 Nehalem cores (preliminary tests: 65 Tflops; final configuration: 7,200 cores). The main point is: what is happening? Why do modern multi-core systems not naturally give us better performance?
EdgeCFD Main Features. EdgeCFD is a parallel, general-purpose CFD solver: Finite Element Method; SUPG/PSPG formulation for incompressible flow; SUPG/YZβ formulation for advection-diffusion; edge-based data structures; hybrid parallelism (MPI, OpenMP, or both); u-p fully coupled flow solver; free-surface flows (VoF and Level Sets); adaptive time-step control; inexact Newton solver; dynamic deactivation; mesh reordering tailored to the target architecture.
EdgeCFD in action
Governing Equations
Incompressible Navier-Stokes equations:
\[
\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} + \frac{1}{\rho}\nabla p - \nabla\cdot(\nu\,\nabla\mathbf{u}) = \mathbf{f} \quad \text{in } \Omega\times[0,t_f] \tag{1}
\]
\[
\nabla\cdot\mathbf{u} = 0 \quad \text{in } \Omega\times[0,t_f] \tag{2}
\]
Advection-diffusion transport equation:
\[
\frac{\partial \phi}{\partial t} + \mathbf{u}\cdot\nabla\phi - \nabla\cdot(\mathbf{K}\,\nabla\phi) = 0 \quad \text{in } \Omega\times[0,t_f] \tag{3}
\]
Stabilized Finite Element Formulation (1/2)
SUPG/PSPG formulation:
\[
\begin{aligned}
&\int_{\Omega} \mathbf{w}^h \cdot \rho\left(\frac{\partial \mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h - \mathbf{f}\right) d\Omega
+ \int_{\Omega} \boldsymbol{\varepsilon}(\mathbf{w}^h) : \boldsymbol{\sigma}(p^h,\mathbf{u}^h)\, d\Omega
- \int_{\Gamma_h} \mathbf{w}^h \cdot \mathbf{h}\, d\Gamma
+ \int_{\Omega} q^h\, \nabla\cdot\mathbf{u}^h\, d\Omega \\
&+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \tau_{SUPG}\,\left(\mathbf{u}^h\cdot\nabla\mathbf{w}^h\right) \cdot
\left[\rho\left(\frac{\partial \mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h\right) - \nabla\cdot\boldsymbol{\sigma}(p^h,\mathbf{u}^h) - \rho\,\mathbf{f}\right] d\Omega^e \\
&+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \frac{\tau_{PSPG}}{\rho}\, \nabla q^h \cdot
\left[\rho\left(\frac{\partial \mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h\right) - \nabla\cdot\boldsymbol{\sigma}(p^h,\mathbf{u}^h) - \rho\,\mathbf{f}\right] d\Omega^e \\
&+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \tau_{LSIC}\, \nabla\cdot\mathbf{w}^h\, \rho\, \nabla\cdot\mathbf{u}^h\, d\Omega^e = 0
\end{aligned} \tag{4}
\]
Stabilized Finite Element Formulation (2/2)
SUPG/YZβ formulation:
\[
\begin{aligned}
&\int_{\Omega} w^h \left(\frac{\partial \phi^h}{\partial t} + \mathbf{u}^h\cdot\nabla\phi^h\right) d\Omega
+ \int_{\Omega} \nabla w^h \cdot \mathbf{K}\cdot\nabla\phi^h\, d\Omega
- \int_{\Gamma_h} w^h\, h^h\, d\Gamma \\
&+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \tau_{SUPG}\, \mathbf{u}^h\cdot\nabla w^h
\left(\frac{\partial \phi^h}{\partial t} + \mathbf{u}^h\cdot\nabla\phi^h - \nabla\cdot\mathbf{K}\cdot\nabla\phi^h\right) d\Omega \\
&+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \delta(\phi)\, \nabla w^h \cdot \nabla\phi^h\, d\Omega
= \int_{\Omega} w^h f\, d\Omega
\end{aligned} \tag{5}
\]
EdgeCFD Time Stepping and Main Loops
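To make the loop structure concrete, the following is a minimal sketch, assuming an outer loop of the kind the slide's flowchart suggests: a u-p fully coupled flow solve (inexact Newton), a transport solve, and an adaptive time-step update, as listed among EdgeCFD's features. Every function name, constant, and tolerance below is a hypothetical placeholder, not EdgeCFD's actual API.

```c
/*
 * Hypothetical sketch of the outer time-stepping structure (placeholders only,
 * not EdgeCFD code): coupled flow solve, transport solve, adaptive dt update.
 */
#include <stdio.h>

static double solve_coupled_flow(double dt) { return 1e-4 * dt; } /* returns a nonlinear residual */
static double solve_transport(double dt)    { return 1e-5 * dt; }

static double adapt_time_step(double dt, double res)
{
    /* grow the step when the nonlinear solves converge easily, shrink otherwise */
    return (res < 1e-3) ? dt * 1.2 : dt * 0.5;
}

int main(void)
{
    double t = 0.0, t_final = 1.0, dt = 1e-2;

    while (t < t_final) {
        double res_flow  = solve_coupled_flow(dt); /* inexact Newton + Krylov, edge-based matvecs */
        double res_trans = solve_transport(dt);    /* SUPG/YZbeta advection-diffusion step */

        t += dt;
        dt = adapt_time_step(dt, res_flow > res_trans ? res_flow : res_trans);
        printf("t = %g, dt = %g\n", t, dt);
    }
    return 0;
}
```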
Parallel Models in EdgeCFD (Summary): (1) MPI based on a collective communication pattern; (2) MPI based on a peer-to-peer (P2P) pattern; (3) OpenMP in hot loops (edge matrix assembly and matrix-vector products); (4) Hybrid = OpenMP combined with (1) or (2).
Parallel Models: MPI collective. 1. All shared equations are synchronized in a single collective operation; 2. Easy to implement; 3. Poor performance for massive parallelism; 4. Some (small) improvements can be made (...but there is no miracle...). A minimal sketch of this pattern is given below. (Figures: (a) original mesh; (b) redundant communication.)
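As a concrete illustration, here is a minimal C/MPI sketch of the collective pattern, assuming each rank accumulates its local edge contributions into a full-length buffer and a single MPI_Allreduce sums the shared equations; the equation count and the fake assembly loop are assumptions made for illustration only.

```c
/*
 * Minimal sketch of the collective synchronization pattern: every rank fills a
 * full-length local buffer with its contributions and one MPI_Allreduce sums
 * the shared (interface) equations across all ranks.
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int neq = 1000;                        /* hypothetical global equation count */
    double *local  = calloc(neq, sizeof(double));
    double *global = calloc(neq, sizeof(double));

    /* the local edge-based matrix-vector product would fill 'local' here;
     * equations owned by other ranks simply stay zero */
    for (int i = rank % 4; i < neq; i += 4)
        local[i] = 1.0;

    /* one collective call synchronizes all shared equations at once */
    MPI_Allreduce(local, global, neq, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    free(local);
    free(global);
    MPI_Finalize();
    return 0;
}
```

The appeal is simplicity; the cost is that the reduction touches every equation and involves every rank, which is what degrades performance at high core counts.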
Parallel Models: MPI peer-to-peer (P2P). 1. Tedious to implement: neighbouring relationships; clever scheduling of message exchanges; computation-communication overlap opportunities; 2. Very efficient (if correctly implemented, of course...); 3. More suitable for massive parallelism. A sketch of this pattern is given below. (Figures: (a) mesh partition; (b) communication graph.)
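The sketch below illustrates the peer-to-peer pattern in C/MPI under simplifying assumptions: a hypothetical ring-shaped neighbour graph stands in for the real partition graph, and the interior computation that would overlap the messages is only indicated by a comment.

```c
/*
 * Minimal sketch of the peer-to-peer exchange: each rank posts non-blocking
 * receives/sends only to its mesh neighbours, computes on interior equations
 * while messages are in flight, then completes the communication.
 * Neighbour lists and buffer sizes are placeholders.
 */
#include <mpi.h>
#include <stdlib.h>

#define MAX_NEIGH 8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* hypothetical neighbour graph: previous and next rank in a ring */
    int neigh[MAX_NEIGH], nneigh = 0;
    if (nprocs > 1) {
        neigh[nneigh++] = (rank + 1) % nprocs;
        neigh[nneigh++] = (rank - 1 + nprocs) % nprocs;
    }

    const int ninterf = 256;                     /* interface values per neighbour (assumed) */
    double *sendbuf = malloc(MAX_NEIGH * ninterf * sizeof(double));
    double *recvbuf = malloc(MAX_NEIGH * ninterf * sizeof(double));
    MPI_Request req[2 * MAX_NEIGH];

    for (int i = 0; i < nneigh * ninterf; i++)
        sendbuf[i] = (double)rank;

    /* post receives first, then sends, one pair per neighbour */
    for (int n = 0; n < nneigh; n++) {
        MPI_Irecv(&recvbuf[n * ninterf], ninterf, MPI_DOUBLE, neigh[n], 0,
                  MPI_COMM_WORLD, &req[n]);
        MPI_Isend(&sendbuf[n * ninterf], ninterf, MPI_DOUBLE, neigh[n], 0,
                  MPI_COMM_WORLD, &req[nneigh + n]);
    }

    /* ... overlap: compute on interior (non-shared) equations here ... */

    MPI_Waitall(2 * nneigh, req, MPI_STATUSES_IGNORE);
    /* ... accumulate received interface contributions into local equations ... */

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Since each rank exchanges data only with its mesh neighbours, the message volume scales with the partition interfaces rather than with the global problem size, which is why this pattern suits massive parallelism better.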
Parallel Models: Threaded parallelism. 1. Easy to implement; 2. Performance depends on compiler, hardware and implementation (hardware and compilers are getting better...); 3. Revived by many-core processors and GPU computing; 4. In EdgeCFD: employed only in the main kernels (e.g., the matrix-vector product), with mesh coloring used to remove memory dependences; a sketch is given below. (Figures: (a) original mesh; (b) colored mesh.)
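A minimal OpenMP sketch of the colored edge loop is given below; the tiny synthetic mesh (a ring of eight nodes) and the edge coefficients are made up for illustration, but the structure (a sequential loop over colors with a parallel loop over the edges of each color) is the point.

```c
/*
 * Minimal sketch of a colored edge loop: edges of the same color share no
 * nodes, so threads can accumulate nodal values without atomics or locks.
 * The mesh data here is synthetic, not EdgeCFD's edge-based structures.
 */
#include <stdio.h>

#define NNODES  8
#define NEDGES  8
#define NCOLORS 2

int main(void)
{
    /* edge -> (node a, node b); edges grouped so that within a color no node repeats */
    int edge_node[NEDGES][2] = {
        {0,1}, {2,3}, {4,5}, {6,7},   /* color 0 */
        {1,2}, {3,4}, {5,6}, {7,0}    /* color 1 */
    };
    int color_start[NCOLORS + 1] = {0, 4, 8};

    double coef[NEDGES], u[NNODES], r[NNODES] = {0.0};
    for (int e = 0; e < NEDGES; e++) coef[e] = 1.0;
    for (int i = 0; i < NNODES; i++) u[i] = (double)i;

    /* edge-based matrix-vector product: sequential over colors,
     * parallel over the edges of each color */
    for (int c = 0; c < NCOLORS; c++) {
        #pragma omp parallel for
        for (int e = color_start[c]; e < color_start[c + 1]; e++) {
            int a = edge_node[e][0], b = edge_node[e][1];
            r[a] += coef[e] * (u[a] - u[b]);
            r[b] += coef[e] * (u[b] - u[a]);
        }
    }

    for (int i = 0; i < NNODES; i++) printf("r[%d] = %g\n", i, r[i]);
    return 0;
}
```

Within a color no two edges share a node, so the nodal accumulations need no atomics or locks; the price is a sequential loop over colors and, typically, some loss of data locality.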
Putting It All Together: Hybrid Matrix-Vector Product (a structural sketch is given below).
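Combining the two previous sketches, a hybrid matrix-vector product can overlap the neighbour exchange with threaded work on interior edges and finish the interface equations once the messages arrive. The skeleton below shows only the ordering of the stages; every routine is a placeholder and none of it is taken from EdgeCFD.

```c
/*
 * Structural sketch of a hybrid (MPI P2P + OpenMP) matrix-vector product:
 * non-blocking exchanges with partition neighbours overlapped with threaded
 * work on interior edges, interface contributions folded in afterwards.
 * All routines are placeholders (assumed structure, not EdgeCFD's code).
 */
#include <mpi.h>

static void post_interface_exchange(MPI_Request *reqs, int *nreqs)
{
    (void)reqs;      /* would post MPI_Irecv / MPI_Isend to each neighbour */
    *nreqs = 0;
}

static void matvec_interior_edges_openmp(void)  { /* colored #pragma omp loops */ }
static void matvec_interface_edges_openmp(void) { /* colored #pragma omp loops */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Request reqs[16];
    int nreqs;

    post_interface_exchange(reqs, &nreqs);   /* start talking to the neighbours */
    matvec_interior_edges_openmp();          /* threads work while messages travel */
    MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
    matvec_interface_edges_openmp();         /* finish equations shared at partition boundaries */

    MPI_Finalize();
    return 0;
}
```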
Benchmark Systems: hardware description.
Benchmark Problem: Rayleigh-Bénard natural convection (Ra = 30,000, Pr = 0.71).
Assumptions: 1. all mesh entities were reordered to exploit memory locality [1]; 2. the same mesh ordering was used on all systems; 3. the same compiler (Intel) and compilation flags (-fast) were used.
Mesh sizes:
                        MSH1        MSH2
  Tetrahedra         178,605  39,140,625
  Nodes               39,688   7,969,752
  Edges              225,978  48,721,528
  Flow equations      94,512  31,024,728
  Transport equations 36,080   7,843,248
[1] Coutinho et al., IJNME 66:431-460, 2006.
Tests summary: parallel-model speedup per number of cores (intra-node); serial performance per processor; parallel performance per processor (intra-node); MPI process placement (inter-node); large-scale run (case study).
Parallel models: speedup per number of cores (intra-node), Clovertown vs. Nehalem. (Plots: SGI Altix-ICE (Clovertown); Nehalem server (Core i7).)
Results: CPU comparison, serial and intra-node performance. (Plots: (a) CPU, serial run; (b) 8 cores = 2 CPUs × 4 cores in one node.)
Results: process-placement effect (Cluster Dell/Harpertown); speedup considering the best process placement.
The Multicore Dilemma. Sandia and TACC statements: "...more chip cores can mean slower supercomputing..."; "...16 multicores perform barely as well as two for complex applications..."; "...process placement in multi-core processors has a strong influence on performance..."; "...more cores on a single chip don't necessarily mean faster clock speeds...". Supermarket analogy (by Sandia): if two clerks at the same checkout counter are processing your food instead of one, the checkout should go faster. The problem is, if each clerk doesn't have access to the groceries, he doesn't necessarily help the process; worse, the clerks may get in each other's way. Sources: Diamond, J. et al., "Multicore Optimization for Ranger", TACC, 2009; https://share.sandia.gov/news/resources/news_releases/more-chip-cores-can-mean-slower-supercomputing-sandia-simulation-shows/
Results: large-scale run on 64 cores (case study using MSH2): about 40M tetrahedra, 8M nodes, 50M edges, 31M flow equations, 8M transport equations. (Figures: time spent in 10 time steps; communication graph.) Tests performed on the Cluster Dell (Harpertown).
Conclusions. Older Intel Xeon processors suffer dramatically when large workloads are imposed on a single CPU; as a consequence, process placement has a strong influence on performance (...sadly, we should not fill up our nodes...). Nehalem (Core i7) has several improvements over its predecessors: a shared third-level cache (now Intel has a true quad-core...) and a fast interconnect channel between processors (well, it sounds like AMD HyperTransport...). The peer-to-peer MPI model is the best strategy to reach good parallel performance in EdgeCFD. OpenMP performance in EdgeCFD is still poor, but it is getting better (for the same implementation!). Many-core and GPU paradigms are bringing threaded parallelism back...
Special thanks to: Dell Brazil (Cluster Dell); Intel Brazil (Nehalem servers); the High Performance Computing Center (NACAD) (Altix-ICE and infrastructure); the Texas Advanced Computing Center (TACC) (Ranger); all of you who attended this presentation; and the Brazilian soccer team (for winning today).