Improving 3D Lattice Boltzmann Method with asynchronous transfers on many-core processors
Minh Quan HO 1,3, Bernard TOURANCHEAU 1, Christian OBRECHT 2, Benoît DUPONT DE DINECHIN 3 and Julien HASCOET 3
1 LIG UMR 5217 - Grenoble Alps University - Grenoble, France
2 CETHIL UMR 5008 - INSA-Lyon - Villeurbanne, France
3 Kalray S.A. - Montbonnot, France
CCDSC - October 03-06, 2016
Overview
1 Introduction
2 Motivation
3 Kalray MPPA-256 architecture
4 Pipelined 3D LBM stencil
  - Domain decomposition and macro pipeline
  - Sub-domain addressing
  - Sub-domain size and Halo bandwidth
5 Results
6 Conclusions
Introduction - LBM theory
The Lattice Boltzmann Method operates on a regular Cartesian grid:
- constant mesh size δx and time step δt
- a node = { particle densities f_α, velocities ξ_α }
Nodes are linked by e.g. the D3Q19 stencil and updated by [He, 1997]:

f_α(x + δt ξ_α, t + δt) − f_α(x, t) = Ω(f_α(x, t))    (1)

[Figure: the D3Q19 stencil - 18 neighbor directions plus the rest node]
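As an illustration of the update rule (1), here is a minimal sketch (not OPAL code) of one collision step for a single node, assuming the common BGK relaxation operator Ω(f_α) = −(f_α − f_α^eq)/τ; the equilibrium densities `feq` and relaxation time `tau` are taken as given:

```c
#include <assert.h>

#define NQ 19  /* number of densities per node in the D3Q19 stencil */

/* BGK collision for one node: relax each density toward equilibrium.
 * The streaming step (moving f_a along xi_a) is done separately. */
void collide(float f[NQ], const float feq[NQ], float tau)
{
    for (int a = 0; a < NQ; a++)
        f[a] -= (f[a] - feq[a]) / tau;
}
```

With τ = 1, a single collision drives every density exactly to its equilibrium value, a handy sanity check.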
Introduction - Memory bound context
Given a 'square' fluid represented as a grid of L × L × L lattice nodes in D3Q19, evolving through T time steps, simulating the whole domain requires:
- moving 19 × 2 × L³ × T floating-point numbers
- at most 400 × L³ × T floating-point operations
Moving data is much slower than computing today.
The GPU is until now the best-suited architecture for LBM.

[Figure: MPPA roofline for the OPAL kernel (AI = 2.34 flops/byte) - raw peak 634 GFLOPS SP; the STREAM and peak DDR bandwidth ceilings bound attainable performance to roughly 100-200 GFLOPS SP]
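The slide's two bounds can be turned into a tiny traffic model. This is a back-of-the-envelope sketch using the slide's numbers (19 densities read and written per node, at most ~400 flops per node), not a measurement:

```c
#include <stdint.h>

/* Bytes moved per simulation: 19 densities, read + write, 4 bytes each. */
static uint64_t bytes_moved(uint64_t L, uint64_t T)
{
    return 19u * 2u * L * L * L * T * (uint64_t)sizeof(float);
}

/* Upper bound on floating-point operations per simulation. */
static uint64_t flops_upper_bound(uint64_t L, uint64_t T)
{
    return 400u * L * L * L * T;
}
```

The implied arithmetic-intensity upper bound is 400 / (19 · 2 · 4) ≈ 2.6 flops/byte, the same order as the 2.34 flops/byte plotted for the OPAL kernel on the roofline, confirming the kernel sits in the memory-bound region.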
Motivation
- Power-efficient NoC-based many-core processors are very promising for upcoming HPC challenges (e.g. Sunway, MPPA, PULP, STHORM ...).
- Good latency, but low memory bandwidth (DDR3).
- Lack of efficient programming models and optimization methods.
- High { computing | data } predictability, centered on fast local memory.
- This enables sophisticated optimizations based on software prefetching and streaming.
This motivates us to study a pipelined 3D LBM algorithm on many-core processors, using local memory and asynchronous communication.
Kalray MPPA-256 architecture
- 16 x 16-core Compute Clusters (CC)
- 2 x I/O clusters with quad-core CPUs, DDR3, Ethernet, PCIe
- Dual 2D-torus NoC, 24 GB/s per link @ 600 MHz
- Peak 634 GFLOPS SP for 25 W @ 600 MHz
- 2 MB multi-banked shared memory (SMEM) per CC, 77 GB/s bandwidth
- SMEM configurable as DDR L2 cache, or as explicit user buffers
- Asynchronous data transfers supported by DMA engines
- POSIX C/C++ programming or OpenCL offloading
Domain decomposition and macro pipeline
- We take the lid-driven cavity example from the OPAL solver [Obrecht, 2015], originally implemented in OpenCL.
- The Lx × Ly × Lz domain is decomposed into sub-domains of size Cx × Cy × Cz.
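The number of sub-domains follows directly from the decomposition above. The helpers below are hypothetical illustrations (not OPAL code), rounding up when the domain dimensions do not divide evenly:

```c
#include <assert.h>

/* Integer ceiling division. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* How many Cx x Cy x Cz sub-domains tile an Lx x Ly x Lz domain. */
static int nb_cubes(int Lx, int Ly, int Lz, int Cx, int Cy, int Cz)
{
    return ceil_div(Lx, Cx) * ceil_div(Ly, Cy) * ceil_div(Lz, Cz);
}
```

For example, a 128³ cavity cut into 32³ sub-domains gives 64 cubes, i.e. NB_CUBES_PER_CLUSTER = 4 on the 16 compute clusters, assuming an even static distribution.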
Domain decomposition and macro pipeline
Main idea:
- A sub-domain is copied into the CC's local memory by a 3D asynchronous copy function, async_copy_3D.
- Computation is carried out in local memory, then data are copied back to global memory (DDR).
Domain decomposition and macro pipeline
- Requires copying halo layers for each sub-domain.
- For a first-order stencil, the copied sub-domain S is at most (Cx + 2) × (Cy + 2) × (Cz + 2).
Domain decomposition and macro pipeline
16 compute clusters, each working on NB_CUBES_PER_CLUSTER sub-domains:

/* Prologue */
prefetch_cube(0);                    // non-blocking
/* Pipeline */
for i in 0 .. NB_CUBES_PER_CLUSTER-1
    if i+1 < NB_CUBES_PER_CLUSTER
        prefetch_cube(i+1);          // non-blocking
    wait_cube(i);
    compute_cube(i);
    put_cube(i);
done
/* Epilogue */
wait_cube(NB_CUBES_PER_CLUSTER-1);
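To make the schedule of this macro pipeline concrete, here is a toy host-side mock. On the MPPA the prefetch and put are asynchronous DMA transfers; here they are stubs that just log the operation order, showing the prefetch of cube i+1 issued before the compute of cube i:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NB 3                      /* NB_CUBES_PER_CLUSTER in this toy run */
static char log_buf[256];         /* records the schedule as "P0 W0 ..." */

static void op(const char *w, int i)
{
    char s[32];
    sprintf(s, "%s%d ", w, i);
    strcat(log_buf, s);
}
static void prefetch_cube(int i) { op("P", i); }  /* async DDR -> SMEM  */
static void wait_cube(int i)     { op("W", i); }  /* wait transfer done */
static void compute_cube(int i)  { op("C", i); }  /* LBM on local copy  */
static void put_cube(int i)      { op("U", i); }  /* async SMEM -> DDR  */

void pipeline(void)
{
    prefetch_cube(0);             /* prologue: start first transfer */
    for (int i = 0; i < NB; i++) {
        if (i + 1 < NB)
            prefetch_cube(i + 1); /* overlap next transfer...       */
        wait_cube(i);             /* ...with finishing this one     */
        compute_cube(i);
        put_cube(i);
    }
    wait_cube(NB - 1);            /* epilogue: drain last cube */
}
```

Running `pipeline()` logs `P0 P1 W0 C0 U0 P2 W1 C1 U1 W2 C2 U2 W2`, i.e. every compute except the last has a transfer in flight behind it.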
Sub-domain addressing
A: "Hey, don't touch my cube!"
B: "No, that's mine."
(Credit: 9gag)
Sub-domain addressing
[Figure: a 4 x 4 grid of blocks (iblockx, iblocky) numbered in Morton order - 0 1 4 5 / 2 3 6 7 / 8 9 12 13 / 10 11 14 15 - extended with '?' blocks when the grid is not square]
- Space-filling curves like Morton or Hilbert are fast.
- But what if (sub-)domains are not cubic?
- A curve that works for any configuration will be more complex (octree, recursion, trailing handling).
- We therefore address sub-domains in '3D' row-major style.
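The '3D' row-major addressing can be sketched as a pair of index conversions. This is an illustrative implementation, not OPAL's; `NBx`, `NBy` denote the number of blocks along x and y, and it works for any block-grid shape, unlike a plain Morton curve:

```c
#include <assert.h>

/* Flatten 3D block coordinates into a row-major block id. */
static int block_id(int ix, int iy, int iz, int NBx, int NBy)
{
    return (iz * NBy + iy) * NBx + ix;
}

/* Recover 3D block coordinates from a row-major block id. */
static void block_coords(int id, int NBx, int NBy,
                         int *ix, int *iy, int *iz)
{
    *ix = id % NBx;
    *iy = (id / NBx) % NBy;
    *iz = id / (NBx * NBy);
}
```

Both directions are a handful of integer operations, so each cluster can derive the DDR offsets of its cubes without any lookup table.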
Sub-domain size and Halo bandwidth
We call "halo bandwidth" the ratio of halo cells to the total number of copied cells:

g(Cx) = ((Cx + 2)³ − Cx³) / (Cx + 2)³

[Figure: halo bandwidth ratio g(Cx) versus cube size (Cx = Cy = Cz), decreasing from ≈1.0 at Cx = 2 toward 0 as the cube grows]
- Which size for sub-domains, given a limited local memory?
- E.g. double buffering: malloc(2 × (Cx + 2)³ × sizeof(float)) (Cx = Cy = Cz)
- Sub-domains should be cubic and as big as possible.
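The trade-off above can be checked numerically. The ratio follows the slide's formula; the footprint follows the slide's malloc expression (the real OPAL buffer layout, with 19 densities per node, may differ):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Halo bandwidth: fraction of copied cells that are halo, for a cubic
 * sub-domain of edge c with one halo layer per face. */
static double halo_ratio(int c)
{
    double total = (double)(c + 2) * (c + 2) * (c + 2);
    double inner = (double)c * c * c;
    return (total - inner) / total;
}

/* Double-buffering footprint in SMEM, per the slide's malloc expression. */
static size_t double_buffer_bytes(int c)
{
    return 2u * (size_t)(c + 2) * (c + 2) * (c + 2) * sizeof(float);
}
```

For example, halo_ratio(8) = 0.488 (almost half the copied traffic is halo) while halo_ratio(32) ≈ 0.166, which is exactly why sub-domains should be as big as the 2 MB SMEM allows.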
Results (1/2)
We compare original OPAL performance on Intel CPU, Intel Xeon Phi (MIC), NVIDIA GPU and Kalray MPPA-256 (all OpenCL), using the GPU-STREAM benchmark [Deakin, 2015] as the bandwidth reference.

[Figure: Original OPAL OpenCL (duration = 1000 steps, workgroup = 32x1x1) on Tesla C2070, Xeon E5-2667 v3, Xeon Phi 3100 and MPPA-256 Bostan, as a function of cavity size: (a) performance in MLUPS, (b) relative throughput to GPU-STREAM (%), (c) power efficiency (MLUPS/W)]