Kokkos: Performance Portability and Productivity for C++ Applications

Performance Portability in Extreme Scale Computing: Metrics, Challenges, Solutions
Schloss Dagstuhl Seminar 17431, Wadern, Germany
October 23-27, 2017
SAND2017-11734 PE

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

[Figure: software stack. Applications & Libraries (LAMMPS, EMPIRE, Albany, SPARC, Drekar, Trilinos) sit on top of Kokkos*, which provides performance portability for C++ applications across Multi-Core, Many-Core, APU, and CPU+GPU architectures with HBM and DDR memory.]
* κόκκος, Greek: “granule” or “grain”; like grains of sand on a beach

Performance Portability and Productivity
§ Economics: optimize F_org(perf, port, prod)
§ Performance: execution time / energy to solution
  § On what platforms?
§ Portability: sustaining for multiple evolving architectures
  § Tool ecosystem: compilers, debuggers, analyzers, …
  § Interoperability with “architecture native” programming mechanism
  § Standard language using “as is” compilers; we use C++
§ Productivity: aggregate development time and resources
  § Skills ecosystem: ease-of-use, education & training, support, …
  § Incremental path to adoption by legacy code; we use C++
§ Kokkos’ 1+ζ economics
  § N codes on M architectures leading to N*(1+ζ*M) versions
  § O(ζ*N*M) architecture specialized components
  § Written in “architecture native” programming mechanism

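A worked instance of the 1+ζ count may help. The numbers below (N = 10, M = 4, ζ = 0.1) are illustrative assumptions, not figures from the talk, and the N·M baseline is an assumed reading of what "N codes on M architectures" would cost if every code were rewritten per architecture.

```latex
\begin{align*}
\text{one version per code and architecture:}\quad
  & N \cdot M = 10 \cdot 4 = 40 \\
\text{one portable version plus specialized components:}\quad
  & N\,(1 + \zeta M) = 10\,(1 + 0.1 \cdot 4) = 14
\end{align*}
```

Under these assumed values the specialized part, ζ·N·M = 4, is the only term that grows with the number of architectures.
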
Kokkos Programming Model
§ Foundation
  § Well-defined Patterns for Parallel Programming (e.g., 2004; Mattson et al.)
  § Well-defined multidimensional array semantics
§ Strategy
  § User exposes parallelizable grains of computations and data
  § Kokkos maps grains onto hardware according to patterns and policies
  § Integrated mapping of both computations and data leading to architecture-appropriate memory access pattern
§ Policies may introduce architecture-specific parameters
  § Without changing source code
  § N(p)+O(ζ*N*M) versions
  § Opportunity for auto-tuners to choose parameters?
§ Policy parameters have architecture-specific defaults
  § Work for simple / common use cases

Programming Model Abstractions
Unique to Kokkos

Parallel Pattern                          Polymorphic Multidimensional Array
(for, reduce, scan, task-dag, …)          (data structure pattern)

Parallel Execution Policy                 Array Element Access Policy
(Scheduling, Tiling, Thread Teams, …)     (Layout, Tiling, RandomAccess, …)

Execution Space                           Memory Space
(CPU, GPU, which cores, which GPU)        (CPU, GPU, which NUMA, which GPU)

* Extensibility throughout

Multidimensional Array w/ Polymorphic Layout
§ Classical (50 years!) data pattern for science & engineering codes
§ Computer languages hard-wire multidimensional array layout mapping
  § Problem: different architectures require different layouts for performance
    Ø Leads to architecture-specific versions of code to obtain performance
  § E.g., “Array of Structure” ↔ “Structure of Array” redesigns
[Figure labels: e.g., “row-major” for CPU caching; e.g., “column-major” for GPU coalescing]
§ Kokkos separates layout from user’s computational code
  § Choose layout for architecture-specific memory access pattern
    Ø Without modifying user’s computational code
  § Polymorphic layout via C++ template meta-programming (extensible)
    Ø e.g., Hierarchical Tiling layout (array of structure of array)
§ Bonus: easy/transparent use of special data access hardware
  § Atomic operations, GPU texture cache, ... (extensible)

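The row-major versus column-major distinction above can be made concrete with a minimal sketch in plain C++. The names `RowMajor`, `ColumnMajor`, and `row_step` are hypothetical stand-ins invented for this illustration; they are not Kokkos types.

```cpp
#include <cstddef>

// Hypothetical layout policies for a 2D array of extents n0 x n1.
// These are illustrative stand-ins, not Kokkos types.
struct RowMajor {
    // Rightmost index is contiguous in memory: offset = i * n1 + j.
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t /*n0*/, std::size_t n1) {
        return i * n1 + j;
    }
};

struct ColumnMajor {
    // Leftmost index is contiguous in memory: offset = i + j * n0.
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t n0, std::size_t /*n1*/) {
        return i + j * n0;
    }
};

// Distance in memory between elements (0, 0) and (1, 0) under a layout.
template <class Layout>
std::size_t row_step(std::size_t n0, std::size_t n1) {
    return Layout::offset(1, 0, n0, n1) - Layout::offset(0, 0, n0, n1);
}
```

With one thread per row i, a row step of 1 (column-major) means neighboring threads touch neighboring addresses, which is what GPU coalescing wants. A row step of n1 (row-major) leaves each thread walking its own contiguous row, which suits CPU caching.
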
Performance Impact of Layout
Kokkos Tutorial Exercise 04 (Layout), kernel: <y|Ax>, fixed size
KNL: Xeon Phi 68c; HSW: Dual Xeon Haswell 2x16c; Pascal60: Nvidia GPU
[Figure: bandwidth (GB/s) versus number of rows (N) for Left and Right layouts on HSW, KNL, and Pascal60; curves are annotated as coalesced, uncoalesced, cached, and uncached.]

Multidimensional Array, Kokkos’ C++ API
§ View< ArrayType , Policy… > a ;
  § ArrayType defines scalar type and array’s static/dynamic dimensions
  § Layout-mapping indexing operator: a(i,j,k,l) → memory location
  § Policy: memory space, layout, access intent, reference counting, …
§ Trivial to swap between array-of-struct (AoS) and struct-of-arrays (SoA)
  § This is Kokkos’ default behavior between CPU and GPU
§ Layout is specifiable (otherwise defaults)
  § Why? For compatibility with legacy code, algorithmic performance tuning, ...
§ Layout is customizable
  § E.g., hierarchical tiling (brick) layout (AoSoA)
  § Changing layout can be transparent to existing code
    – If written layout-agnostic
  § Layout-aware algorithm can extract tiles

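The idea of layout-agnostic code can be sketched without Kokkos itself. The following toy container mimics only the shape of the API described above: a layout chosen as a template parameter and an indexing operator that applies it. `View2DSketch`, `LayoutRightSketch`, `LayoutLeftSketch`, `y_dot_Ax`, and `demo` are hypothetical names for this illustration and do not reproduce the real `View` implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout policies (illustrative, not Kokkos types).
struct LayoutRightSketch {  // row-major
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t /*n0*/, std::size_t n1) {
        return i * n1 + j;
    }
};
struct LayoutLeftSketch {   // column-major
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t n0, std::size_t /*n1*/) {
        return i + j * n0;
    }
};

// Toy 2D array whose indexing operator delegates to the layout policy.
template <class T, class Layout>
class View2DSketch {
public:
    View2DSketch(std::size_t n0, std::size_t n1)
        : n0_(n0), n1_(n1), data_(n0 * n1) {}
    T& operator()(std::size_t i, std::size_t j) {
        return data_[Layout::offset(i, j, n0_, n1_)];
    }
    const T& operator()(std::size_t i, std::size_t j) const {
        return data_[Layout::offset(i, j, n0_, n1_)];
    }
    std::size_t extent0() const { return n0_; }
    std::size_t extent1() const { return n1_; }
private:
    std::size_t n0_, n1_;
    std::vector<T> data_;
};

// Layout-agnostic kernel <y|Ax>: it only ever uses A(i, j).
template <class Matrix>
double y_dot_Ax(const std::vector<double>& y, const Matrix& A,
                const std::vector<double>& x) {
    double result = 0.0;
    for (std::size_t i = 0; i < A.extent0(); ++i) {
        double row = 0.0;
        for (std::size_t j = 0; j < A.extent1(); ++j) row += A(i, j) * x[j];
        result += y[i] * row;
    }
    return result;
}

// Fills a 2x3 matrix with A(i, j) = i + j and evaluates the kernel
// with y and x set to all ones.
template <class Layout>
double demo() {
    View2DSketch<double, Layout> A(2, 3);
    for (std::size_t i = 0; i < 2; ++i)
        for (std::size_t j = 0; j < 3; ++j) A(i, j) = static_cast<double>(i + j);
    const std::vector<double> y(2, 1.0), x(3, 1.0);
    return y_dot_Ax(y, A, x);
}
```

Both `demo<LayoutRightSketch>()` and `demo<LayoutLeftSketch>()` give the same result because the kernel never computes an offset itself; only the memory access order differs.
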
Patterns, Policies, and C++ Lambdas
§ Pattern composed with policy drives the computational body

    for ( int i = 0 ; i < N ; ++i ) { /* body */ }
    // slide labels: pattern = for ; policy = ( int i = 0 ; i < N ; ++i ) ; body = { /* body */ }

    parallel_for ( N , [=]( int i ) { /* body */ } );      // body is a C++ lambda

    pattern( Policy<Params…>(params...) , body(args…) );   // args… derived from pattern and Policy

§ Data parallel patterns: for, reduce, scan
  § Transparently manage thread local values, inter-thread reductions, …
§ Data parallel policies: 1D range, nD range, thread teams, …
§ Data parallel policy parameters (extensible)
  § Static or dynamic work partitioning
  § nD loop collapse ordering and tiling

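The split into pattern, policy, and body can be illustrated with a serial stand-in written in standard C++. `ChunkedRange`, `for_each_index`, `reduce_sum`, and `scaled_sum` are hypothetical names for this sketch; a real backend would hand the chunks to threads, whereas this version runs them one after another.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical policy: a 1D index range split into fixed-size chunks
// (a stand-in for static work partitioning). chunk must be > 0.
struct ChunkedRange {
    std::size_t begin;
    std::size_t end;
    std::size_t chunk;
};

// "for" pattern: calls body(i) for every index in the range, chunk by chunk.
template <class Body>
void for_each_index(const ChunkedRange& policy, Body body) {
    for (std::size_t lo = policy.begin; lo < policy.end; lo += policy.chunk) {
        const std::size_t hi = std::min(lo + policy.chunk, policy.end);
        for (std::size_t i = lo; i < hi; ++i) body(i);
    }
}

// "reduce" pattern: each chunk accumulates into its own partial value,
// and the pattern (not the body) joins the partials.
template <class Body>
double reduce_sum(const ChunkedRange& policy, Body body) {
    double total = 0.0;
    for (std::size_t lo = policy.begin; lo < policy.end; lo += policy.chunk) {
        const std::size_t hi = std::min(lo + policy.chunk, policy.end);
        double partial = 0.0;
        for (std::size_t i = lo; i < hi; ++i) body(i, partial);
        total += partial;
    }
    return total;
}

// Example: scale a vector of ones, then sum it, using the same policy twice.
double scaled_sum(std::size_t n, double factor, std::size_t chunk) {
    std::vector<double> v(n, 1.0);
    const ChunkedRange policy{0, n, chunk};
    for_each_index(policy, [&](std::size_t i) { v[i] *= factor; });
    return reduce_sum(policy,
                      [&](std::size_t i, double& partial) { partial += v[i]; });
}
```

Changing `chunk` changes how the work is partitioned but not the lambdas, which is the property the policy parameters above are meant to provide.
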
Pattern and Policy (brand new!): Directed Acyclic Graph (DAG) of Tasks
§ Parallel Patterns: Task-DAG and Work-DAG
  § DAG: acyclic execute-after dependences
  § Task-DAG: Heterogeneous and dynamic collection of parallel computations
  § Work-DAG: Homogeneous and static collection of parallel computations
§ Task Scheduler Responsibilities
  § Choose and execute ready tasks
  § Update execute-after dependences
  § Manage tasks’ dynamic memory and lifecycle
  § GPU was a real challenge!

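The scheduler responsibilities listed above can be sketched as a small serial scheduler. This is an assumption-laden illustration, not the Kokkos scheduler: `TaskSchedulerSketch` and `demo_order` are hypothetical names, tasks run on one thread, and memory is managed by standard containers rather than a memory pool.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical serial scheduler for a dynamic task DAG (not a Kokkos type).
class TaskSchedulerSketch {
public:
    using TaskId = std::size_t;
    using Body = std::function<void(TaskSchedulerSketch&)>;

    // Registers a task that may execute only after every unfinished task
    // listed in `after` has finished. Tasks may call spawn() while running.
    TaskId spawn(Body body, const std::vector<TaskId>& after = {}) {
        const TaskId id = tasks_.size();
        tasks_.push_back(Task{std::move(body), 0, {}, false});
        for (TaskId dep : after) {
            if (!tasks_[dep].done) {
                tasks_[dep].dependents.push_back(id);
                ++tasks_[id].pending;
            }
        }
        if (tasks_[id].pending == 0) ready_.push(id);
        return id;
    }

    // Executes ready tasks until none remain; returns the execution order.
    std::vector<TaskId> run() {
        std::vector<TaskId> order;
        while (!ready_.empty()) {
            const TaskId id = ready_.front();  // choose a ready task
            ready_.pop();
            // Move the body out first: spawn() inside it may reallocate tasks_.
            Body body = std::move(tasks_[id].body);
            body(*this);                       // execute it
            tasks_[id].done = true;
            order.push_back(id);
            // Update execute-after dependences of the tasks waiting on this one.
            for (TaskId next : tasks_[id].dependents) {
                if (--tasks_[next].pending == 0) ready_.push(next);
            }
        }
        return order;
    }

private:
    struct Task {
        Body body;
        std::size_t pending;             // unfinished predecessors
        std::vector<TaskId> dependents;  // tasks that execute after this one
        bool done;
    };
    std::vector<Task> tasks_;
    std::queue<TaskId> ready_;
};

// Example: a root task dynamically spawns two children and a join task
// that executes after both children.
std::vector<std::size_t> demo_order() {
    TaskSchedulerSketch sched;
    sched.spawn([](TaskSchedulerSketch& s) {
        const auto a = s.spawn([](TaskSchedulerSketch&) {});
        const auto b = s.spawn([](TaskSchedulerSketch&) {});
        s.spawn([](TaskSchedulerSketch&) {}, {a, b});
    });
    return sched.run();
}
```

In `demo_order()` the root task runs first and the join task runs last; the GPU concerns named on the slide (scalability, memory pool) are not modeled here.
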
Pattern and Policy: Directed Acyclic Graph (DAG) of Tasks
§ Task-DAG (heterogeneous and dynamic)
  § Tasks spawn tasks of different functions; DAG is dynamic

        spawn( Policy(params…) , body );

  § GPU portability and performance: tasks cannot block or be interrupted

        respawn( this , params… ); // replaces “wait” semantics

  § Policy: single-thread/thread-team, dependences, priority
  § GPU challenges: scalable scheduler & memory pool, non-coherent L1 cache
§ Work-DAG (homogeneous and static)
  § DAG is declared up-front; single work function given integer work index
  § Similar to data-parallel with an execute-after index graph (CRS array)

        parallel_for( WorkGraphPolicy<Params...>( graph ), body( int index ) );

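The Work-DAG idea, one work function driven by an execute-after graph stored in CRS form, can be sketched serially as follows. `CrsGraphSketch`, `run_work_graph`, and `diamond_order` are hypothetical names, and the convention that row i lists the work items that execute after i is an assumption of this sketch, not a statement about how `WorkGraphPolicy` stores its graph.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CRS (compressed row storage) graph. Assumed convention:
// the work items that must execute after item i are
// entries[row_begin[i]] .. entries[row_begin[i + 1] - 1].
struct CrsGraphSketch {
    std::vector<std::size_t> row_begin;  // size = number of work items + 1
    std::vector<std::size_t> entries;    // concatenated successor lists
};

// Calls body(index) once per work item, never before its predecessors.
// Returns the number of items executed (all of them if the graph is acyclic).
template <class Body>
std::size_t run_work_graph(const CrsGraphSketch& graph, Body body) {
    const std::size_t n =
        graph.row_begin.empty() ? 0 : graph.row_begin.size() - 1;

    // Count the unfinished predecessors of every work item.
    std::vector<std::size_t> pending(n, 0);
    for (std::size_t successor : graph.entries) ++pending[successor];

    // Items with no predecessors are ready immediately.
    std::vector<std::size_t> ready;
    for (std::size_t i = 0; i < n; ++i)
        if (pending[i] == 0) ready.push_back(i);

    std::size_t executed = 0;
    while (!ready.empty()) {
        const std::size_t i = ready.back();
        ready.pop_back();
        body(i);
        ++executed;
        for (std::size_t k = graph.row_begin[i]; k < graph.row_begin[i + 1]; ++k) {
            const std::size_t next = graph.entries[k];
            if (--pending[next] == 0) ready.push_back(next);
        }
    }
    return executed;
}

// Example: a diamond graph, 0 -> {1, 2} -> 3, recording the execution order.
std::vector<std::size_t> diamond_order() {
    const CrsGraphSketch graph{{0, 2, 3, 4, 4}, {1, 2, 3, 3}};
    std::vector<std::size_t> order;
    run_work_graph(graph, [&](std::size_t index) { order.push_back(index); });
    return order;
}
```

In the diamond example item 0 always runs first and item 3 last, while the relative order of items 1 and 2 is a scheduling choice. A parallel scheduler could run those two concurrently.
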
Conclusion / Future
§ Performance Portability, for C++ Applications
  § Integrated mapping of applications’ computations and data
    Ø Other programming models fail to map data and limit performance portability
  § Future proofing via designed-in extensibility and ongoing R&D
  § github.com/kokkos/kokkos
§ Application Developer Productivity, for C++ Applications
  § C++ lambda for simple data parallel loop syntax
  § Reduce and Scan inter-thread complexity managed by Kokkos
  § Hierarchical parallelism using nested patterns can increase parallelism
  § Case Study: no harder than OpenMP, optimization is easier
§ Goal: Future ISO/C++ Standard subsumes Kokkos abstractions
  § Parallel algorithms (C++17) incremental step for data parallel pattern/policy
  § Next steps in progress: Executors and ExecutionContext
  § Polymorphic multidimensional array on track for C++20
  § Atomic operations on non-atomic types on track for C++20