
Interoperability of Shared Memory Parallel Programming Models with Charm++

Jaemin Choi, University of Illinois Urbana-Champaign, May 2, 2019


  1. Interoperability of Shared Memory Parallel Programming Models with Charm++
     Jaemin Choi
     University of Illinois Urbana-Champaign
     May 2, 2019

  2. Overview
     1. Why Interoperate with Charm++?
     2. Compiling Libraries
     3. Creating Hybrid Programs
     4. Vector Addition Example
     5. Kokkos vs. RAJA
     6. Future Work

  3. Why Interoperate with Charm++?
     ▶ Kokkos (SNL) and RAJA (LLNL)
     ▶ 'Performance portability'
     ▶ Abstractions for parallel execution and data management
     ▶ Limited to shared memory parallelism by themselves
     ▶ Use MPI for distributed memory execution
     ▶ Charm++ is another option
     ▶ Support for a wide variety of architectures
     ▶ Load balancing

  4. Basic Interoperability
     ▶ Let Kokkos/RAJA handle shared memory parallelism
     ▶ OpenMP backend for CPU
     ▶ CUDA backend for GPU
     ▶ Use Charm++ for communication between processes (intra- & inter-node)

  5. Compilation: Kokkos
         mkdir build && cd build
         ../generate_makefile.bash --prefix=<absolute path to build> \
             --with-cuda=<path to CUDA toolkit> \
             --with-cuda-options=enable_lambda --with-openmp \
             --arch=<CPU arch>,<GPU arch> \
             --compiler=<path to included NVCC wrapper>
         make -j kokkoslib
         make install
     ▶ Assume GPUs are available
     ▶ OpenMP and CUDA backends
     ▶ Headers (build/include) and library file (build/lib) after install are all we need

  6. Compilation: RAJA
         mkdir build && mkdir install && cd build
         cmake -DENABLE_CUDA=On -DCMAKE_INSTALL_PREFIX=<path to RAJA install folder> ../
         make -j
         make install
     ▶ Assume GPUs are available
     ▶ OpenMP and CUDA backends
     ▶ Headers (install/include) and library file (install/lib) after install are all we need

  7. Creating a Kokkos/RAJA + Charm++ Hybrid Program
     ▶ Write Kokkos/RAJA code in a .cpp file
     ▶ Can be put in the same file as Charm++ if GPU is not used (if CUDA backend not built)
     ▶ Write Charm++ code in a separate .C file
     ▶ A nodegroup chare for each Kokkos/RAJA instance
     ▶ Compile Kokkos/RAJA code with NVCC
     ▶ Additional options needed (e.g. -fopenmp)
     ▶ Use NVCC wrapper with Kokkos
     ▶ Use charmc to compile Charm++ code and link
     ▶ Need to link Kokkos/RAJA library
     ▶ Examples (Hello World, vector addition) in examples/charm++/shared_runtimes/[kokkos,raja]

  8. Vector Addition Example: Kokkos
     Listing 1: vecadd.ci
         mainmodule vecadd {
           mainchare Main {
             ...
           }

           // Encapsulate a Kokkos instance / process
           nodegroup Process {
             entry Process();
             entry void run();
           }
         }

  9. Vector Addition Example: Kokkos
     Listing 2: vecadd_charm.C
         ...
         class Process : public CBase_Process {
         public:
           Process() {
             kokkosInit(); // Calls Kokkos::initialize() internally
           }

           void run() {
             // Execute vector addition
             // Uses OpenMP by default, uses CUDA if use_gpu
             vecadd(n, CkMyNode(), use_gpu);

             kokkosFinalize(); // Calls Kokkos::finalize() internally

             // Contribute to Main to end the program
             ...
           }
         };

  10. Vector Addition Example: Kokkos
     Listing 3: vecadd_kokkos.cpp
         #include <Kokkos_Core.hpp>
         ...
         // Views
         typedef Kokkos::View<double*, Kokkos::LayoutLeft,
                              Kokkos::CudaSpace> CudaView;
         typedef Kokkos::View<double*, Kokkos::LayoutRight,
                              Kokkos::CudaHostPinnedSpace> HostView;

         // Functors
         template <typename ViewType>
         struct Compute {
           ViewType a, b;

           Compute(const ViewType& d_a, const ViewType& d_b)
             : a(d_a), b(d_b) {}

           KOKKOS_INLINE_FUNCTION
           void operator()(const int& i) const {
             a(i) += b(i);
           }
         };
         ...
         void vecadd(const uint64_t n, int process, bool use_gpu) {
           HostView h_a("Host A", n);
           CudaView d_a("Device A", n);
           CudaView d_b("Device B", n);

           Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::Cuda>(0, n),
                                Compute<CudaView>(d_a, d_b));
           Kokkos::deep_copy(h_a, d_a);
         }
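The functor pattern in Listing 3 can be tried on the host without Kokkos installed. The following is a minimal sketch under stated assumptions: `SketchView`, `sketch_parallel_for`, and `sketch_vecadd` are hypothetical stand-ins invented for illustration, not Kokkos APIs, and the loop runs serially. Only the shape of `Compute` follows the slide.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a Kokkos view: a shallow handle to existing data,
// so copies of the functor alias the same storage and use operator() access.
struct SketchView {
  double* data;
  double& operator()(int i) const { return data[i]; }
};

// Same shape as the Compute functor in Listing 3 (without the Kokkos macro).
template <typename ViewType>
struct Compute {
  ViewType a, b;

  Compute(const ViewType& a_, const ViewType& b_) : a(a_), b(b_) {}

  // Adds element i of b into element i of a.
  void operator()(const int& i) const { a(i) += b(i); }
};

// Hypothetical serial stand-in for a parallel-for over the range [begin, end).
template <typename Functor>
void sketch_parallel_for(int begin, int end, const Functor& f) {
  for (int i = begin; i < end; ++i) f(i);
}

// Adds b into a element-wise and returns the updated copy of a.
std::vector<double> sketch_vecadd(std::vector<double> a, std::vector<double> b) {
  assert(a.size() == b.size());
  SketchView va{a.data()};
  SketchView vb{b.data()};
  sketch_parallel_for(0, static_cast<int>(a.size()),
                      Compute<SketchView>(va, vb));
  return a;
}
```

The functor holds its views by value, which works because a view is a shallow handle. That is why `operator()` can be `const` and still update the underlying elements.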

  11. Vector Addition Example: RAJA
     Listing 4: vecadd.ci
         mainmodule vecadd {
           mainchare Main {
             ...
           }

           // Encapsulate a RAJA instance / process
           nodegroup Process {
             entry Process();
             entry void run();
           }
         }

  12. Vector Addition Example: RAJA
     Listing 5: vecadd_charm.C
         ...
         class Process : public CBase_Process {
         public:
           Process() {
             // No initialization/cleanup needed
           }

           void run() {
             // Execute vector addition
             // Uses OpenMP by default, uses CUDA if use_gpu
             vecadd(n, CkMyNode(), use_gpu);

             // Contribute to Main to end the program
             ...
           }
         };

  13. Vector Addition Example: RAJA
     Listing 6: vecadd_raja.cpp
         void vecadd(const uint64_t n, int process, bool use_gpu) {
           double *h_a, *d_a, *d_b;
           cudaErrchk(cudaMallocHost((void**)&h_a, n * sizeof(double)));
           cudaErrchk(cudaMalloc((void**)&d_a, n * sizeof(double)));
           cudaErrchk(cudaMalloc((void**)&d_b, n * sizeof(double)));

           RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, n),
             [=] RAJA_DEVICE (int i) {
               d_a[i] += d_b[i];
             });

           cudaErrchk(cudaMemcpy(h_a, d_a, n * sizeof(double),
                                 cudaMemcpyDeviceToHost));
         }
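For comparison with the functor version, the lambda style of Listing 6 can also be sketched on the host. This assumes no RAJA or CUDA: `sketch_forall` and `sketch_vecadd_lambda` are hypothetical names, the loop is serial, and plain `new[]`/`delete[]` stand in for the explicit device allocation and copy-back steps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial stand-in for a forall over the half-open range [begin, end).
template <typename Body>
void sketch_forall(std::size_t begin, std::size_t end, Body body) {
  for (std::size_t i = begin; i < end; ++i) body(i);
}

// Host-only analogue of Listing 6: explicit allocation, a lambda kernel,
// an explicit copy back to the result, and explicit cleanup.
std::vector<double> sketch_vecadd_lambda(const std::vector<double>& a_in,
                                         const std::vector<double>& b_in) {
  assert(a_in.size() == b_in.size());
  const std::size_t n = a_in.size();

  // Explicit memory management: the caller owns these raw buffers.
  double* d_a = new double[n];
  double* d_b = new double[n];
  std::copy(a_in.begin(), a_in.end(), d_a);
  std::copy(b_in.begin(), b_in.end(), d_b);

  // The lambda captures the raw pointers by value, as [=] does in Listing 6.
  sketch_forall(0, n, [=](std::size_t i) { d_a[i] += d_b[i]; });

  // Copy the result out (the analogue of the device-to-host copy), then free.
  std::vector<double> h_a(d_a, d_a + n);
  delete[] d_a;
  delete[] d_b;
  return h_a;
}
```

Unlike the view-based version, nothing here tracks buffer lifetime, so each allocation needs a matching release.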

  14. Kokkos vs. RAJA
     ▶ Both allow C++ functors and lambdas for computation kernels
     ▶ Kokkos needs initialize and finalize calls
     ▶ Kokkos provides the View abstraction for memory management
     ▶ Explicit memory management in RAJA
     ▶ No performance difference in vector addition

  15. Future Work
     ▶ What if we want more than one Kokkos/RAJA instance per node?
     ▶ In NUMA environments, etc.
     ▶ Should be able to pin Charm++ processes to a set of cores
     ▶ A more involved integration with Charm++ scheduler
     ▶ Other shared memory parallel frameworks: StarPU, OmpSS
     ▶ Performance comparison with standardized set of benchmarks

  16. Thank You
