STELLA: A Domain Specific Language for Stencil Computations Carlos Osuna, Center for Climate Systems Modeling, ETH Zurich Tobias Gysi, Department of Computer Science, ETH Zurich Oliver Fuhrer, Federal Office of Meteorology and Climatology, Meteoswiss Zurich Mauro Bianco, Swiss National Supercomputing Centre, CSCS Lugano. Thomas C. Schulthess, Swiss National Supercomputing Centre, CSCS Lugano. Work funded by HP2C and PASC initiatives Carlos Osuna Dagstuhl Seminar, 14th April 2015 1
Outline Dynamical Core of COSMO STELLA: DSL using C++ template metaprogramming STELLA DSL elements Exploiting performance: Loop & kernel fusion, data locality Extending Parallelization models Carlos Osuna Dagstuhl Seminar – 14th April 2015 2
MOTIVATION: COSMO MODEL Dynamical core of COSMO solves the Navier Stokes equations using finite difference methods on structured grids. Stencils on 3D grids: 2-7 km resolution in the horizontal, 60/80 levels in the vertical Split-explicit time integrator with slow processes integrated using Runge-Kutta method. Collection of (Fortran) operators (advection, diffusion, fast waves, etc.) on prognostic variables ( ρ , u, v, w, T, qx, ...) Carlos Osuna Dagstuhl Seminar – 14th April 2015 3
COSMO tjme step Initialization Boundary conditions Physics Dynamics Data assimilation Halo-update Δt Diagnostics I/O Cleanup Carlos Osuna Dagstuhl Seminar– 14th April 2015 4
COSMO Dynamical Core Dynamical code: ~35 stencils per time step Initialization Vertical Diffusion (compute U, V, W) S t U a r Boundary conditions , t V c , o Vertical Diffusion W Physics m m (compute T) Dynamics W U a i , t Data assimilation …. V c , o W m Halo-update m Δt Time Integrator Diagnostics N small ∆ t (FastWaves RungeKutta) I/O …. Horizontal Diffusion Cleanup Carlos Osuna Dagstuhl Seminar – 14th April 2015 5
Algorithmic Motifs In COSMO Horizontal PDE operators explicitly solved -> compact stencils with horizontal data dependencies for k=1, ke { for i=1, ie-1 { for j=1, je-1 { lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) } } for i=1, ie-1 { for j=1, je-1 { difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1) } } } Carlos Osuna Dagstuhl Seminar– 14th April 2015 6
Algorithmic Motifs In COSMO Vertical PDE operators implicitly solved -> tridiagonal systems // forward for j=1, je for i=1, ie for k=2, ke { c(i,j,k) = 1.0 / b(i,j,k) - c(i,j,k-1) * a(i,j,k) ) d(i,j,k) = ( d(i,j,k) - d(i,j,k-1) * a(i,j,k) ) * c(i,j,k) } } } // backward substjtutjon for j=1, je { for i=1, ie { for k=2, ke { x(i,j,k) = d(i,j,k) - c(i,j,k) * x(i,j,k+1) } } } Carlos Osuna Dagstuhl Seminar – 14th April 2015 7
Model development U = U ( x, z,t ) U ( x, z ,t = 0 )= U 0 ( x, z ) ∂ U 4 U =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ for i=1, ie-1 for j=1, je-1 lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) for i=1, ie-1 for j=1, je-1 difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1) Carlos Osuna Dagstuhl Seminar – 14th April 2015 8
U = U ( x, z,t ) U ( x, z ,t = 0 )= U 0 ( x, z ) ∂ U 4 U =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ for i=1, ie-1 for j=1, je-1 lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) for i=1, ie-1 for j=1, je-1 difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1) Carlos Osuna Dagstuhl Seminar– 14th April 2015 9
U = U ( x, z,t ) U ( x, z ,t = 0 )= U 0 ( x, z ) ∂ U 4 U =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ ? Carlos Osuna Dagstuhl Seminar – 14th April 2015 10
2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ static void Lap(Context ctx) { ctx[lap::Center()] = ctx[u::At(iplus1)] + ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)] - 4*ctx[u::Center()]; } Carlos Osuna Dagstuhl Seminar – 14th April 2015 11
const int i = threadIdx.x; const int j = threadIdx.y; int i_h = 0; int j_h = 0; if(j < 2) { i_h = i; j_h = (j==0 ? -1 : blockDim.y); } else if(j < 4 && i <= blockDim.y) { i_h = (j==2 ? -1 : blockDim.x); j_h = i; } for(int k=0; k < kdim; ++k) { lap(i,j) = - 4.0 * U(i,j,k) + U(i+1,j,k) + U(i-1,j,k) + U(i,j+1,k) + U(i,j-1,k) if(i_h != 0 || j_h != 0) lap(i_h, j_h) = - 4.0 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) + U(i_h,j_h+1,k) + U(i_h,j_h-1,k) __syncthreads(); result(i,j) = -4*(i,j,k) - alpha(i,j,k)*( lap(i,j,k) - lap(i-1,j,k) + lap(i,j,k) - lap(i,j-1,k)) } Carlos Osuna Dagstuhl Seminar – 14th April 2015 12
STELLA DSL using C++ template metaprogramming Separate model and algorithm from hardware specifjc implementatjon and optjmizatjons. Single source code compiling multjple architectures, performance portable. Concise syntax / close to mathematjcal descriptjon of operators Same toolchain as the full model – (access to debugger/data fjelds, interoperability with other programming languages) Supports CPU & GPU backends Used to rewrite a full dynamical core (COSMO) Carlos Osuna Dagstuhl Seminar – 14th April 2015 13
How it works? Operators are encapsulated as type informatjon (functors) Types are composed/organized and manipulated template<typename Context> struct LapStage{ static void Do(Context ctx) { ctx[lap::Center()] = ctx[u::At(iplus1)] + ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)] - 4*ctx[T::Center()]; } }; Specifjc backends expand the code of operators into template kernels with loop structures Many optjmizatjons (memory layout, tjling, looping, data locality) depend on the backend Carlos Osuna Dagstuhl Seminar– 14th April 2015 14
Stencil Stages Stencil Stages describe the single operatjon applied to each grid point. Assumed parallel executjon in the IJ plane template<typename Tenv> struct Lap { STENCIL_STAGE(TEnv) STAGE_PARAMETER(u) STAGE_PARAMETER(lap) statjc void Do(Context ctx, FullDomain) { ctx[lap::Center()] = -4*ctx[u::Center()] + ctx[u::At(iplus1)] + ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)]; } } Carlos Osuna Dagstuhl Seminar – 14th April 2015 15
Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Instantiate data fields pack_parameters( (memory layout abstracted, Param<in, cIn, cDataField>( in_ ), backend dependent) Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Carlos Osuna Dagstuhl Seminar– 14th April 2015 16
Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Pack the parameters and associate pack_parameters( them to place holders used in stages Param<in, cIn, cDataField>( in_ ), Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Carlos Osuna Dagstuhl Seminar – 14th April 2015 17
Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Define temporary buffers used by the pack_parameters( stencil Param<in, cIn, cDataField>( in_ ), Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Carlos Osuna Dagstuhl Seminar – 14th April 2015 18
Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Compose multiple stencil stages. pack_parameters( ( Operations with data dependencies placed Param<in, cIn, cDataField>( in_ ), in different stages ) Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( Build Sweeps (sequential loops in K ) StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Dagstuhl Seminar – 14th April 2015 19 Carlos Osuna
Computation on the fly or buffering? 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∂ U ∇ 4 =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 Δ 4 T 2 U ∇ ∇ 2 T ∇ Carlos Osuna Dagstuhl Seminar– 14th April 2015 20
Recommend
More recommend