On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code - PowerPoint PPT Presentation

  1. On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code. E. Calore, S. F. Schifano, R. Tripiccione. Presented by Enrico Calore, INFN Ferrara, Italy. 7th Workshop on UnConventional High Performance Computing, August 26, 2014, Porto, Portugal.

  2. Outline: (1) LBM at a glance, the D2Q37 model; (2) OpenCL; (3) Implementation details; (4) Results and conclusions. We address the issue of porting code across several computing architectures while preserving performance.

  3. The D2Q37 Lattice Boltzmann Model. The Lattice Boltzmann method (LBM) is a class of computational fluid dynamics (CFD) methods: it simulates a synthetic dynamics described by the discrete Boltzmann equation instead of the Navier-Stokes equations. A set of virtual particles, called populations, sits at the edges of a discrete regular grid and interacts through propagation and collision; after appropriate averaging, this reproduces the dynamics of fluids. D2Q37 is a 2D model with 37 velocity components (populations), suitable to study the behaviour of compressible gases and fluids, optionally in the presence of combustion effects (chemical reactions turning a cold mixture of reactants into a hot mixture of burnt products). It correctly treats the Navier-Stokes, heat-transport and perfect-gas (P = ρT) equations.
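
     For orientation, the generic LBM update can be written in the standard single-relaxation-time (BGK) form below; this is only a schematic reference, since the actual D2Q37 collision operator is considerably more elaborate (see slide 7):

     \[
       f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\; t + \Delta t) \;=\; f_i(\mathbf{x}, t) \;-\; \frac{\Delta t}{\tau} \left[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t) \right], \qquad i = 1, \dots, 37
     \]

     where the f_i are the populations, the c_i their discrete velocities, τ the relaxation time and f_i^eq the local equilibrium; the left-hand-side shift is the propagate step and the right-hand-side relaxation is the collide step.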

  4. Computational Scheme of LBM.

     foreach time-step
         foreach lattice-point
             propagate();
         endfor
         foreach lattice-point
             collide();
         endfor
     endfor

     Embarrassing parallelism: all sites can be processed in parallel, applying propagate and collide in sequence. Challenge: design an efficient implementation able to exploit a large fraction of the available peak performance.

  5. D2Q37: propagation scheme. Propagate performs accesses to neighbour cells at distance 1, 2 and 3, generating memory accesses with sparse addressing patterns.
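
     A minimal sketch of a per-site propagate with the SoA layout of slide 10; the direction offsets and population indices below are invented for illustration and are not the actual D2Q37 stencil (halo/boundary handling is also omitted):

     /* Illustrative only: gather a few populations from neighbours at distance 1, 2 and 3. */
     __kernel void propagate(__global const double *prv, __global double *nxt,
                             const int LX, const int LY)
     {
         int ix = get_global_id(0) / LY;    /* site coordinates recovered from the global id */
         int iy = get_global_id(0) % LY;
         int site   = ix * LY + iy;
         int NSITES = LX * LY;

         /* SoA: population p of all sites is stored contiguously starting at p * NSITES */
         nxt[0 * NSITES + site] = prv[0 * NSITES + (ix - 1) * LY + iy];        /* distance 1 */
         nxt[1 * NSITES + site] = prv[1 * NSITES + (ix - 2) * LY + (iy + 1)];  /* distance 2 */
         nxt[2 * NSITES + site] = prv[2 * NSITES + (ix - 3) * LY + iy];        /* distance 3 */
         /* ... and so on for all 37 populations */
     }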

  6. D2Q37: boundary conditions. After propagation, boundary conditions are enforced at the top and bottom edges of the lattice (a 2D lattice with periodic boundaries along the X direction). At the top and bottom the boundary conditions adjust some values at the sites y = 0 ... 2 and y = Ny − 3 ... Ny − 1, e.g. setting the vertical velocity to zero. At the left and right edges we apply periodic boundary conditions.
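
     A purely illustrative host-side sketch of this fix-up step; set_vertical_velocity_zero() is a hypothetical helper and the real D2Q37 boundary treatment is more involved:

     /* After propagate: adjust the three bottom and three top rows of the lattice. */
     void apply_boundary_conditions(double *lattice, int LX, int LY)
     {
         for (int iy = 0; iy < 3; iy++) {
             for (int ix = 0; ix < LX; ix++) {
                 set_vertical_velocity_zero(lattice, ix, iy, LX, LY);           /* rows y = 0..2       */
                 set_vertical_velocity_zero(lattice, ix, LY - 1 - iy, LX, LY);  /* rows y = LY-3..LY-1 */
             }
         }
         /* Along X no fix-up is needed: propagate reads its neighbours with indices
            taken modulo LX, e.g. x_left = (ix - 1 + LX) % LX, which gives periodicity. */
     }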

  7. D2Q37: collision. Collision is computed at each lattice cell after the boundary conditions have been applied. It is compute intensive: for the D2Q37 model it requires ≈ 7500 double-precision floating-point operations per site. It is completely local: the arithmetic operations only need the populations associated with the site. Propagate and collide are kept as separate kernels: after propagate but before collide we may need to perform collective operations (e.g. the divergence of the velocity field) if combustion effects are included.
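
     A schematic sketch of this locality (not the actual D2Q37 operator, whose ≈ 7500 floating-point operations per site are only hinted at by a comment): a collide work-item reads and writes the 37 populations of its own site and nothing else.

     __kernel void collide(__global double *f, const int NSITES)
     {
         int site = get_global_id(0);
         double pop[37];

         for (int p = 0; p < 37; p++)        /* read the 37 populations of this site only */
             pop[p] = f[p * NSITES + site];

         /* here the real kernel performs ~7500 double-precision operations on pop[]
            (moments, equilibrium, relaxation, ...); left as an identity placeholder */

         for (int p = 0; p < 37; p++)        /* write back: no neighbour site is touched */
             f[p * NSITES + site] = pop[p];
     }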

  8. Open Computing Language (OpenCL). A programming framework for heterogeneous architectures: CPU + accelerators. Computing model: ◮ host code plus one or more kernels running on the accelerators; ◮ kernels are executed by a set of work-items, each processing one item of the data set (data parallelism); ◮ work-items are grouped into work-groups, each executed by a compute unit and processing K work-items in parallel using vector instructions; ◮ e.g. on the Xeon Phi work-groups are mapped onto (virtual) cores, each processing up to 8 double-precision floating-point values. The memory model identifies a hierarchy of four spaces which differ in size and access time: private, local, global and constant memory. OpenCL aims to guarantee portability of both code and performance across several architectures; a minimal host-side sketch of this model is shown below.
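
     A minimal host-side sketch (assumed, not taken from the paper's code): select a device, create a context and a command queue, and build a kernel at run time. Error checking is omitted and the kernel source string src is assumed to be given, e.g. the saxpy kernel of the next slide.

     #include <CL/cl.h>

     /* Build a kernel named "saxpy" from the OpenCL C source string src
        on the first accelerator device (e.g. a Xeon Phi) that is found;
        use CL_DEVICE_TYPE_GPU instead to target a GPU.                  */
     cl_kernel build_saxpy(const char *src, cl_context *ctx_out, cl_command_queue *queue_out)
     {
         cl_int err;
         cl_platform_id platform;
         cl_device_id device;
         clGetPlatformIDs(1, &platform, NULL);
         clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

         cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
         cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

         /* kernels are compiled at run time for the selected device */
         cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
         clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

         *ctx_out = ctx;
         *queue_out = queue;
         return clCreateKernel(prog, "saxpy", &err);
     }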

  9. OCL saxpy kernel. A, B, C ∈ R^n, s ∈ R, C = s · A + B.

     __kernel void saxpy(__global double *A,
                         __global double *B,
                         __global double *C,
                         const double s)
     {
         // get global thread ID
         int id = get_global_id(0);
         C[id] = s * A[id] + B[id];
     }

     Each work-item executes the saxpy kernel and computes just one data item of the output array: first it obtains its unique global identifier id, then uses it to address the id-th data item of the arrays A, B and C.
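
     For completeness, a hypothetical host-side launch of this kernel, continuing the setup sketch of the previous slide (same #include); the work-group size of 64 is arbitrary and n is assumed to be a multiple of it.

     void run_saxpy(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                    const double *A, const double *B, double *C, double s, size_t n)
     {
         cl_int err;
         /* device buffers for the input and output arrays (n doubles each) */
         cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(double), (void *)A, &err);
         cl_mem dB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(double), (void *)B, &err);
         cl_mem dC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(double), NULL, &err);

         clSetKernelArg(kernel, 0, sizeof(cl_mem), &dA);
         clSetKernelArg(kernel, 1, sizeof(cl_mem), &dB);
         clSetKernelArg(kernel, 2, sizeof(cl_mem), &dC);
         clSetKernelArg(kernel, 3, sizeof(double), &s);

         size_t global = n;    /* one work-item per element  */
         size_t local  = 64;   /* work-items per work-group  */
         clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
         clEnqueueReadBuffer(queue, dC, CL_TRUE, 0, n * sizeof(double), C, 0, NULL, NULL);

         clReleaseMemObject(dA); clReleaseMemObject(dB); clReleaseMemObject(dC);
     }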

  10. Memory layout for LB: AoS vs SoA.

     // lattice stored as AoS:
     typedef struct {
         double p1;    // population 1
         double p2;    // population 2
         ...
         double p37;   // population 37
     } pop_t;
     pop_t lattice2D[SIZEX * SIZEY];

     AoS: corresponding populations of different sites are interleaved, causing strided memory accesses and leading to coalescing issues.

     // lattice stored as SoA:
     typedef struct {
         double p1[SIZEX * SIZEY];    // population 1 array
         double p2[SIZEX * SIZEY];    // population 2 array
         ...
         double p37[SIZEX * SIZEY];   // population 37 array
     } pop_t;
     pop_t lattice2D;

     SoA: corresponding populations of different sites are allocated at contiguous memory addresses, enabling coalescing of accesses and exploiting the full memory bandwidth.
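
     The difference shows up directly in how a given population of one site is addressed; a small comparison, assuming the two layouts above are instantiated as lattice2D_aos and lattice2D_soa:

     int site = ix * SIZEY + iy;          /* linear index of the lattice site */

     /* AoS: p2 of consecutive sites is 37 doubles apart -> strided, poorly coalesced accesses */
     double v_aos = lattice2D_aos[site].p2;

     /* SoA: p2 of consecutive sites is adjacent in memory -> coalesced accesses */
     double v_soa = lattice2D_soa.p2[site];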

  11. Grids layout. A one-dimensional array of NTHREADS work-items, each processing one lattice site. Example: a physical lattice of 11 × 16 cells with work-groups of size 1 × 4. The lattice height must be a multiple of the work-group size: L_y = α × N_wi with α ∈ ℕ, and the number of work-groups is N_wg = (L_y × L_x) / N_wi.
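
     On the host these quantities translate directly into the global and local NDRange sizes passed to clEnqueueNDRangeKernel; a minimal sketch, assuming Lx, Ly, queue and a kernel handle propagate_kernel are already defined and that N_wi divides Ly:

     size_t Nwi    = 32;                  /* work-items per work-group (example value) */
     size_t global = (size_t)Lx * Ly;     /* one work-item per lattice site            */
     size_t local  = Nwi;
     size_t Nwg    = global / local;      /* number of work-groups = (Ly * Lx) / Nwi   */

     clEnqueueNDRangeKernel(queue, propagate_kernel, 1, NULL, &global, &local, 0, NULL, NULL);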

  12. Hardware used: the Eurora prototype (Eurotech and Cineca). Hot-water cooling system; delivers 3,209 MFLOPS per watt of sustained performance, 1st in the Green500 list of June 2013. Computing nodes: 64. Processor types: Intel Xeon E5-2658 @ 2.10 GHz and Intel Xeon E5-2687W @ 3.10 GHz. Accelerator types: MIC (Intel Xeon Phi 5120D) and GPU (NVIDIA Tesla K20x).

  13. OpenCL benchmark of propagate (Xeon Phi). Performance of propagate as a function of the number of work-items N_wi per work-group and of the number of work-groups N_wg.

  14. OpenCL benchmark of collide (Xeon Phi). Performance of collide as a function of the number of work-items N_wi per work-group and of the number of work-groups N_wg.

  15. 2 x NVIDIA K20s GPU. [Figure: run time on 2 x GPU (NVIDIA K20s), CUDA vs OpenCL; msec per iteration for Propagate, BC and Collide.]

  16. 2 x Intel Xeon Phi MIC. [Figure: run time on 2 x MIC (Intel Xeon Phi), C vs OpenCL; msec per iteration for Propagate, BC and Collide.]

  17. Propagate. [Figure: run time of propagate on a 1920x2048 lattice, comparing C, optimized C, CUDA and OpenCL versions; msec per iteration on MIC, GPU, CPU2 and CPU3.]

  18. Collide. [Figure: run time of collide on a 1920x2048 lattice, comparing C, optimized C, CUDA and OpenCL versions; msec per iteration on MIC, GPU, CPU2 and CPU3.]

  19. Scalability on Eurora nodes. Weak-scaling regime, lattice size: 256 × 8192 × number of devices. Strong-scaling regime, lattice size: 1024 × 8192.

  20. Limitations to strong scalability.

  21. Conclusions. (1) We have presented an OpenCL implementation of a fluid-dynamics simulation based on Lattice Boltzmann methods. (2) Code portability: it has been successfully ported to and run on several computing architectures, including CPU, GPU and MIC systems. (3) Performance portability: results are at the same level as codes written using more "native" programming frameworks, such as CUDA or C. (4) The good news: these results make OpenCL a good framework to develop code that is easily portable across several architectures while preserving performance. (5) The bad news: not all vendors are today committed to supporting this standard, because it is considered a low-level approach.
