Kokkos: The C++ Performance Portability Programming Model

Christian Trott (crtrott@sandia.gov), Carter Edwards, D. Sunderland, N. Ellingwood, D. Ibanez, S. Hammond, S. Rajamanickam, K. Kim, M. Deveci, M. Hoemmen, G.
Center for Computing Research, Sandia National Laboratories, NM
SAND2017-4935 C

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525.

New Programming Models
§ HPC is at a Crossroads
  § Diversifying Hardware Architectures
  § More parallelism necessitates paradigm shift from MPI-only
§ Need for New Programming Models
  § Performance Portability: OpenMP 4.5, OpenACC, Kokkos, RAJA, SYCL, C++20?, …
  § Resilience and Load Balancing: Legion, HPX, UPC++, ...
  § Vendor decoupling drives external development

What is Kokkos? What is new? Why should you trust us?

Kokkos: Performance, Portability and Productivity
[Figure: applications and libraries (LAMMPS, Trilinos, Sierra, Albany) on top of Kokkos, on top of node architectures with HBM and DDR memory]
https://github.com/kokkos

Performance Portability through Abstraction
Separation of Concerns for Future Systems…

Kokkos: Data Structures
§ Memory Spaces (“Where”)
  - Multiple-Levels
  - Logical Space (think UVM vs explicit)
§ Memory Layouts (“How”)
  - Architecture dependent index-maps
  - Also needed for subviews
§ Memory Traits
  - Access Intent: Stream, Random, …
  - Access Behavior: Atomic
  - Enables special load paths: e.g. texture

Kokkos: Parallel Execution
§ Execution Spaces (“Where”)
  - N-Level
  - Support Heterogeneous Execution
§ Execution Patterns (“What”)
  - parallel_for/reduce/scan, task spawn
  - Enable nesting
§ Execution Policies (“How”)
  - Range, Team, Task-Dag
  - Dynamic / Static Scheduling
  - Support non-persistent scratch-pads
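The "architecture dependent index-maps" bullet can be made concrete with a small sketch. The code below is plain C++, not Kokkos: the names RowMajorMap, ColumnMajorMap and Array2D are hypothetical and exist only for this illustration. It shows how the same (i, j) access can be routed to different storage offsets by swapping the index map, which is the role Kokkos assigns to its memory layouts (LayoutRight for row-major, LayoutLeft for column-major).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for layout index maps; these are not Kokkos types.
struct RowMajorMap {  // last index is contiguous in memory
  std::size_t n0, n1;
  std::size_t operator()(std::size_t i, std::size_t j) const {
    assert(i < n0 && j < n1);
    return i * n1 + j;
  }
};

struct ColumnMajorMap {  // first index is contiguous in memory
  std::size_t n0, n1;
  std::size_t operator()(std::size_t i, std::size_t j) const {
    assert(i < n0 && j < n1);
    return i + j * n0;
  }
};

// A 2D array over caller-owned storage, parameterized on the index map.
// User code writes a(i, j) and never sees the offset computation.
template <class Map>
struct Array2D {
  double* data;
  Map map;
  double& operator()(std::size_t i, std::size_t j) const {
    return data[map(i, j)];
  }
};
```

With a 2x3 array, element (1, 0) sits at offset 3 under RowMajorMap and at offset 1 under ColumnMajorMap, while the calling code stays the same. Which layout performs better depends on how neighboring threads or loop iterations walk the indices, which is why the choice is tied to the architecture.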

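Capability Matrix

|             | Implementation Technique | Parallel Loops | Parallel Loops with Reduction | Tightly Nested Loops | Non-tightly Nested Loops | Task Parallelism | Data Allocations | Data Transfers | Advanced Data Abstractions |
|-------------|--------------------------|----------------|-------------------------------|----------------------|--------------------------|------------------|------------------|----------------|----------------------------|
| Kokkos      | C++ Abstraction          | X              | X                             | X                    | X                        | X                | X                | X              | X                          |
| OpenMP      | Directives               | X              | X                             | X                    | X                        | X                | X                | X              | -                          |
| OpenACC     | Directives               | X              | X                             | X                    | X                        | -                | X                | X              | -                          |
| CUDA        | Extension                | (X)            | -                             | (X)                  | X                        | -                | X                | X              | -                          |
| OpenCL      | Extension                | (X)            | -                             | (X)                  | X                        | -                | X                | X              | -                          |
| C++AMP      | Extension                | X              | -                             | X                    | -                        | -                | X                | X              | -                          |
| RAJA        | C++ Abstraction          | X              | X                             | X                    | (X)                      | -                | -                | -              | -                          |
| TBB         | C++ Abstraction          | X              | X                             | X                    | X                        | X                | X                | -              | -                          |
| C++17       | Language                 | X              | -                             | -                    | -                        | (X)              | X                | (X)            | (X)                        |
| Fortran2008 | Language                 | X              | -                             | -                    | -                        | -                | X                | (X)            | -                          |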

Example: Conjugate Gradient Solver
§ Simple Iterative Linear Solver
§ For example used in MiniFE
§ Uses only three math operations:
  § Vector addition (AXPBY)
  § Dot product (DOT)
  § Sparse Matrix Vector multiply (SPMV)
§ Data management with Kokkos Views:

    View<double*,HostSpace,MemoryTraits<Unmanaged> > h_x(x_in, nrows);
    View<double*> x("x",nrows);
    deep_copy(x,h_x);
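To show how the three operations listed above fit together, here is a minimal serial sketch of an unpreconditioned conjugate gradient loop in plain C++. It is not the MiniFE or Kokkos code: the CsrMatrix type, the cg_solve signature, and the stopping rule (residual 2-norm below a tolerance) are assumptions made for this illustration, and the matrix is assumed to be symmetric positive definite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal compressed sparse row (CSR) matrix for this sketch.
struct CsrMatrix {
  std::size_t nrows;
  std::vector<std::size_t> row_ptr;  // size nrows + 1
  std::vector<std::size_t> col_idx;  // column of each stored entry
  std::vector<double> values;        // value of each stored entry
};

// AXPBY: z = alpha*x + beta*y. z may alias x or y.
void axpby(std::vector<double>& z, double alpha, const std::vector<double>& x,
           double beta, const std::vector<double>& y) {
  assert(x.size() == z.size() && y.size() == z.size());
  for (std::size_t i = 0; i < z.size(); ++i) z[i] = alpha * x[i] + beta * y[i];
}

// DOT: returns the sum of x[i]*y[i].
double dot(const std::vector<double>& x, const std::vector<double>& y) {
  assert(x.size() == y.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) sum += x[i] * y[i];
  return sum;
}

// SPMV: y = A*x. y must not alias x.
void spmv(std::vector<double>& y, const CsrMatrix& A,
          const std::vector<double>& x) {
  assert(y.size() == A.nrows && A.row_ptr.size() == A.nrows + 1);
  for (std::size_t row = 0; row < A.nrows; ++row) {
    double sum = 0.0;
    for (std::size_t k = A.row_ptr[row]; k < A.row_ptr[row + 1]; ++k)
      sum += A.values[k] * x[A.col_idx[k]];
    y[row] = sum;
  }
}

// Solves A*x = b starting from the x passed in; returns the iteration count.
int cg_solve(const CsrMatrix& A, const std::vector<double>& b,
             std::vector<double>& x, double tolerance, int max_iterations) {
  std::vector<double> r(b.size()), p(b.size()), Ap(b.size());
  spmv(Ap, A, x);
  axpby(r, 1.0, b, -1.0, Ap);  // r = b - A*x
  p = r;
  double rr = dot(r, r);
  int iter = 0;
  while (iter < max_iterations && std::sqrt(rr) > tolerance) {
    spmv(Ap, A, p);
    const double alpha = rr / dot(p, Ap);
    axpby(x, 1.0, x, alpha, p);        // x = x + alpha*p
    axpby(r, 1.0, r, -alpha, Ap);      // r = r - alpha*A*p
    const double rr_new = dot(r, r);
    axpby(p, 1.0, r, rr_new / rr, p);  // p = r + (rr_new/rr)*p
    rr = rr_new;
    ++iter;
  }
  return iter;
}
```

Every line of the iteration is one of AXPBY, DOT or SPMV, so porting the solver to a new architecture reduces to porting those three kernels.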

CG Solve: The AXPBY
§ Simple data parallel loop: Kokkos::parallel_for
§ Easy to express in most programming models
§ Bandwidth bound
§ Serial Implementation:

    void axpby(int n, double* z, double alpha, const double* x,
               double beta, const double* y) {
      for(int i=0; i<n; i++)
        z[i] = alpha*x[i] + beta*y[i];
    }

§ Kokkos Implementation:

    void axpby(int n, View<double*> z, double alpha, View<const double*> x,
               double beta, View<const double*> y) {
      parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {
        z(i) = alpha*x(i) + beta*y(i);
      });
    }

§ Parts of the Kokkos call:
  § Parallel Pattern: for loop (parallel_for)
  § String Label: Profiling/Debugging ("AXpBY")
  § Execution Policy: do n iterations (n)
  § Iteration handle: integer index (const int& i)
  § Loop Body (z(i) = alpha*x(i) + beta*y(i);)
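The separation between pattern, label, policy, iteration handle and body can be sketched without Kokkos. The function toy_parallel_for below is a hypothetical serial stand-in written for this illustration only; it is not how Kokkos implements parallel_for. It only shows why handing the loop body to a dispatch function lets a backend decide how the iterations are executed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical serial stand-in for a parallel_for-style dispatch.
//   label: name that profiling or debugging tools could report
//   n:     iteration count, the simplest execution policy (indices 0..n-1)
//   body:  callable invoked once per index; iterations must be independent
template <class Body>
void toy_parallel_for(const std::string& label, std::size_t n,
                      const Body& body) {
  (void)label;  // unused here; a real backend could pass it to a profiler
  // This sketch runs the indices in order. A parallel backend would be free
  // to split them across threads or run them in any order.
  for (std::size_t i = 0; i < n; ++i) body(i);
}

// AXPBY written against the stand-in: z = alpha*x + beta*y.
void axpby(std::vector<double>& z, double alpha, const std::vector<double>& x,
           double beta, const std::vector<double>& y) {
  assert(x.size() == z.size() && y.size() == z.size());
  toy_parallel_for("AXpBY", z.size(), [&](std::size_t i) {
    z[i] = alpha * x[i] + beta * y[i];
  });
}
```

Because the caller only supplies the body and the iteration count, the same axpby source could be retargeted by swapping the dispatch function, which is the idea the slide attributes to Kokkos::parallel_for.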

CG Solve: The Dot Product
§ Simple data parallel loop with reduction: Kokkos::parallel_reduce
§ Nontrivial in CUDA due to lack of built-in reduction support
§ Bandwidth bound
§ Serial Implementation:

    double dot(int n, const double* x, const double* y) {
      double sum = 0.0;
      for(int i=0; i<n; i++)
        sum += x[i]*y[i];
      return sum;
    }

§ Kokkos Implementation:

    double dot(int n, View<const double*> x, View<const double*> y) {
      double x_dot_y = 0.0;
      parallel_reduce("Dot",n, KOKKOS_LAMBDA (const int& i,double& sum) {
        sum += x[i]*y[i];
      }, x_dot_y);
      return x_dot_y;
    }

§ Parts of the Kokkos call:
  § Parallel Pattern: loop with reduction (parallel_reduce)
  § Iteration Index + Thread-Local Reduction Variable (const int& i, double& sum)
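The role of the thread-local reduction variable can be illustrated with a plain C++ sketch. The function toy_parallel_reduce and its num_chunks parameter are hypothetical and written only for this illustration; this is not the Kokkos implementation. Each chunk of the index range accumulates into its own private partial sum, and the partial sums are combined once at the end, so no two chunks ever write to the same variable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial stand-in for a parallel_reduce-style sum.
// The body receives the iteration index and a reference to the partial sum
// owned by its chunk (the "thread-local" reduction variable).
template <class Body>
double toy_parallel_reduce(std::size_t n, std::size_t num_chunks,
                           const Body& body) {
  assert(num_chunks > 0);
  std::vector<double> partial(num_chunks, 0.0);
  const std::size_t chunk_size = (n + num_chunks - 1) / num_chunks;
  // In this sketch the chunks run one after another; since each chunk only
  // touches partial[c], they could run concurrently without synchronization.
  for (std::size_t c = 0; c < num_chunks; ++c) {
    const std::size_t begin = std::min(n, c * chunk_size);
    const std::size_t end = std::min(n, begin + chunk_size);
    for (std::size_t i = begin; i < end; ++i) body(i, partial[c]);
  }
  // Final combine step over the per-chunk partial sums.
  double result = 0.0;
  for (double p : partial) result += p;
  return result;
}

// Dot product written against the stand-in, using four chunks.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
  assert(x.size() == y.size());
  return toy_parallel_reduce(x.size(), 4, [&](std::size_t i, double& sum) {
    sum += x[i] * y[i];
  });
}
```

Note that the body never returns a value and never touches the final result; it only updates the partial sum it is handed. One consequence of combining partial sums is that floating-point additions happen in a different order than in the serial loop, so results can differ in the last bits.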
