Kokkos Task-DAG: Memory Management and Locality Challenges Conquered

PADAL Workshop
August 2-4, 2017
Chicago, IL

H. Carter Edwards

SAND2017-8173 C

Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.
[Figure: Kokkos* provides performance portability for C++ applications and libraries (LAMMPS, EMPIRE, Albany, SPARC, Drekar, Trilinos) across diverse node architectures: multi-core and many-core CPUs with DDR memory, APUs, and CPU+GPU systems with HBM and DDR memory.]

*κόκκος, Greek: "granule" or "grain"; like grains of sand on a beach
Dynamic Directed Acyclic Graph (DAG) of Tasks

§ Parallel pattern
  § Tasks: heterogeneous collection of parallel computations
  § DAG: tasks may have acyclic execute-after dependences
  § Dynamic: tasks are allocated by executing tasks and deallocated when complete
§ Task scheduler responsibilities (a minimal usage sketch follows)
  § Execute ready tasks
    § Choose from among ready tasks
    § Honor "execute after" dependences
  § Manage tasks' dynamic lifecycle
  § Manage tasks' dynamic memory
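A minimal sketch of driving this pattern from user code, using the Kokkos tasking interface (names such as TaskScheduler, TaskSingle, host_spawn, and wait follow the current Kokkos documentation and are approximate for the version shown in this talk; HelloTask is a made-up trivial task):

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

// Hypothetical trivial task: a C++ functor with a value_type and an
// operator() that receives a team-member handle and a result reference.
struct HelloTask {
  using value_type = int;

  template <class TeamMember>
  KOKKOS_INLINE_FUNCTION void operator()(TeamMember& /*member*/, int& result) {
    result = 42;  // a task's "return value" is delivered through its future
  }
};

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using Scheduler = Kokkos::TaskScheduler<Kokkos::DefaultExecutionSpace>;

    // The scheduler owns a memory pool from which task closures are allocated.
    Scheduler sched(typename Scheduler::memory_space{}, /*pool size*/ 1u << 20);

    // Spawn a root task from the host; it is immediately "ready" because it
    // was given no execute-after dependences.
    auto fut = Kokkos::host_spawn(Kokkos::TaskSingle(sched), HelloTask{});

    // Execute ready tasks (honoring dependences) until the DAG is drained.
    Kokkos::wait(sched);

    std::printf("result = %d\n", fut.get());
  }
  Kokkos::finalize();
  return 0;
}
```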
Motivating Use Cases

1. Multifrontal Cholesky factorization of a sparse matrix (lead: Kyungjoo Kim / SNL)
   § Frontal matrices require different sizes of workspace for sub-assembly
   § Hybrid task parallelism: tree-parallel across the assembly tree and matrix-parallel within supernodes
   § Dynamic task-DAG with memory constraints
   § The matrix computation is internally data parallel
   [Figure: sparse matrix nonzero pattern with supernodes (brown) and sub-assembly workspace (green), plus the associated assembly tree]

2. Triangle enumeration in social networks (highly irregular graphs) (lead: Michael Wolf / SNL)
   § Discover triangles within the graph and compute statistics on those triangles
   § Triangles are an intermediate result that does not need to be saved / stored
   Ø Challenge: memory "high water mark"
   [Figure: small example graph with a discovered triangle]
Hierarchical Parallelism

§ Shares functionality with hierarchical data parallelism
  § The same kernel (task) is executed on ...
    § OpenMP: league of teams of threads
    § CUDA: grid of blocks of threads
§ Inter-team parallelism (data or task)
  § Threads within a team execute concurrently
  § Data: each team executes the same computation
  Ø Task: each team executes a different task
§ Intra-team parallelism (data)
  § Nested parallel patterns: parallel_for, parallel_reduce, parallel_scan (see the sketch below)
§ Mapping teams onto hardware
  § CPU: team == hyperthreads sharing an L1 cache
  § GPU: team == warp, for a modest degree of intra-team data parallelism
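A sketch of the hierarchical data-parallel interface described above, using the standard Kokkos TeamPolicy / TeamThreadRange pattern (the matrix-vector product and the view names are an illustration, not from the talk):

```cpp
#include <Kokkos_Core.hpp>

// Team-based y = A*x: one team per row (inter-team parallelism), and the
// threads of each team reduce across that row (intra-team parallelism).
void team_matvec(Kokkos::View<const double**> A,
                 Kokkos::View<const double*> x,
                 Kokkos::View<double*> y) {
  using policy_type = Kokkos::TeamPolicy<>;
  using member_type = policy_type::member_type;

  const int nrows = A.extent(0);
  const int ncols = A.extent(1);

  Kokkos::parallel_for(
      "team_matvec", policy_type(nrows, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type& team) {
        const int i = team.league_rank();  // which team == which row

        double row_sum = 0.0;
        // Nested data-parallel reduce over the threads of this team.
        Kokkos::parallel_reduce(
            Kokkos::TeamThreadRange(team, ncols),
            [&](const int j, double& partial) { partial += A(i, j) * x(j); },
            row_sum);

        // One thread per team writes the result.
        Kokkos::single(Kokkos::PerTeam(team), [&]() { y(i) = row_sum; });
      });
}
```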
Anatomy and Life-cycle of a Task

§ Anatomy
  § Is a C++ closure (e.g., functor) of data + function
  § Is referenced by a Kokkos::future
  § Executes on a single thread or a thread team (team-task sketch below)
  § May only execute when its dependences are complete (DAG)
§ Life-cycle: constructing -> waiting -> executing -> complete
  § A serial task executes on a single thread
  § A task with internal data parallelism executes on a thread team
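A sketch of a task with internal data parallelism: when spawned on a team (Kokkos::TaskTeam rather than TaskSingle), the closure's operator() receives a team handle and can use the nested patterns from the previous slide. Names are approximate and DotTask is a made-up example:

```cpp
// Hypothetical team task: its operator() receives a team-member handle, so
// the task body can use intra-team parallel patterns (here a team reduce).
struct DotTask {
  using value_type = double;

  Kokkos::View<const double*> x;
  Kokkos::View<const double*> y;

  template <class TeamMember>
  KOKKOS_INLINE_FUNCTION void operator()(TeamMember& member, double& result) {
    double sum = 0.0;
    Kokkos::parallel_reduce(
        Kokkos::TeamThreadRange(member, x.extent(0)),
        [&](const int i, double& partial) { partial += x(i) * y(i); }, sum);
    // All team members see the reduced value; assigning it is sufficient.
    result = sum;
  }
};

// Spawning it on a thread team rather than a single thread (sched is a
// Kokkos::TaskScheduler as in the earlier sketch):
//   auto fut = Kokkos::host_spawn(Kokkos::TaskTeam(sched), DotTask{x, y});
```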
Dynamic Task DAG Challenges

§ A DAG of heterogeneous closures
  § Map execution to a single thread or a thread team
§ Scalable, low-latency scheduling
  § Scalable dynamically allocated / deallocated tasks
  § Scalable dynamically created and completed execute-after dependences
§ GPU idiosyncrasies
  Ø Non-blocking tasks; forced a beneficial reconceptualization!
    § Eliminate context-switching overhead: stack, registers, ...
  § Heterogeneous function pointers (CPU, GPU)
    § Creating GPU tasks on the host and within tasks executing on the GPU
  § Bounded memory pool and scalable allocation / deallocation
  § Non-coherent L1 caches
Managing a Non-blocking Task's Lifecycle

§ Create: allocate and construct
  § By the main process or within another task
  § Allocate from a memory pool
  § Construct internal data
  § Assign DAG dependences
§ Spawn: enqueue to the scheduler
  § Assign DAG dependences
  § Assign priority: high, regular, low
§ Respawn: re-enqueue to the scheduler
  § Replaces waiting or yielding
  § Assign new DAG dependences and/or priority
§ Reconceived wait-for-child-task pattern (shown in code with the Fibonacci sketch on the scheduler unit-test slide)
  Ø Create & spawn child task(s)
  Ø Reassign DAG dependence(s) to the new child task(s)
  Ø Respawn to execute again after the child task(s) complete

[State diagram: create -> constructing -> spawn -> waiting -> executing -> complete, with respawn looping from executing back to waiting]
Task Scheduler and Memory Pool

§ Memory pool (usage sketch below)
  § A large chunk of memory allocated in a Kokkos memory space
  § Allocate & deallocate small blocks of varying size within a parallel execution
  § Lock free, extremely low latency
  § Tuning: min-alloc-size <= max-alloc-size <= superblock-size <= total-size
§ Task scheduler
  § Uses the memory pool for tasks' memory
  § Ready queues (by priority) and waiting queues
    Ø Each queue is a simple linked list of tasks
    § A ready queue is the head of a linked list
    § Each task is the head of a linked list of its "execute after" tasks
  § Queue updates are limited to push/pop, implemented with atomic operations
  § "When all" is a non-executing task with a list of dependences
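A sketch of the memory pool interface described above. The Kokkos::MemoryPool constructor arguments (total size, min/max block size, superblock size) follow the tuning bullet; the exact argument names and order are approximate, and the kernel is a made-up illustration of allocating and deallocating inside a parallel execution:

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using device_type  = Kokkos::DefaultExecutionSpace;
    using memory_space = device_type::memory_space;

    // Tuning: min block <= max block <= superblock <= total size.
    const size_t total_size      = 10u << 20;  // ~10 MB
    const size_t min_block_size  = 32;
    const size_t max_block_size  = 128;
    const size_t superblock_size = 64u << 10;  // 64 KB

    Kokkos::MemoryPool<device_type> pool(memory_space{}, total_size,
                                         min_block_size, max_block_size,
                                         superblock_size);

    // Allocate and deallocate small blocks from inside a parallel kernel;
    // the pool is lock free, so concurrent calls are allowed.
    Kokkos::parallel_for(
        "pool_alloc_dealloc", Kokkos::RangePolicy<device_type>(0, 10000),
        KOKKOS_LAMBDA(const int i) {
          const size_t nbytes = 32 + 8 * (i % 13);  // 32..128 bytes
          void* p = pool.allocate(nbytes);          // nullptr if exhausted
          if (p != nullptr) {
            pool.deallocate(p, nbytes);             // size must match
          }
        });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```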
Memory Pool Performance

§ Test setup
  § 10 MB pool comprised of 153 x 64 KB superblocks; minimum block size 32 bytes
  § Allocation sizes range between 32 and 128 bytes; average 80 bytes
  § [1] Allocate to N% of the pool; [2] cyclically deallocate & allocate between N% and 2/3 N%
  § parallel_for: every index allocates, then cyclically deallocates & allocates
  § Measure allocate + deallocate operations / second (best of 10 trials)
  § Deallocate is much simpler and requires fewer operations than allocate
§ Test hardware: Pascal, Broadwell, Knights Landing
  § Fully subscribe cores
  § Every thread within every warp allocates & deallocates
§ For reference, an "apples to oranges" comparison
  § CUDA malloc / free on Pascal
  § jemalloc on Knights Landing
Memory Pool Performance (allocate + deallocate operations per second)

                           Fill 75%    Fill 95%     Cycle 75%   Cycle 95%
  blocks                   938,500     1,187,500
  Pascal                   79 M/s      74 M/s       287 M/s     244 M/s
  Broadwell                13 M/s      13 M/s       46 M/s      49 M/s
  Knights Landing          5.8 M/s     5.8 M/s      40 M/s      43 M/s

  Apples-to-oranges comparison:
  Pascal, CUDA malloc      3.5 M/s     2.9 M/s      15 M/s      12 M/s
  Knights Landing, jemalloc: 379 M/s and 4115 M/s
    (jemalloc uses thread-local caches and optimal blocking, NOT a fixed pool size)

§ Memory pools have finite size with a well-bounded scope
  § Algorithms' and data structures' memory pools do not pollute (fragment) each other's memory
Scheduler Unit Test Performance

§ Test setup: (silly) Fibonacci task-DAG algorithm (sketch below the table)
  § F(k) = F(k-1) + F(k-2)
  § If k >= 2, spawn F(k-1) and F(k-2), then respawn F(k) dependent on completion of when_all( { F(k-1), F(k-2) } )
  § F(k) cumulatively allocates/deallocates N tasks >> the "high water mark"
  § 1 MB pool comprised of 31 x 32 KB superblocks; minimum block size 32 bytes
  § Fully subscribe cores; a single-thread Fibonacci task consumes an entire GPU warp
    § Real algorithms' tasks have modest internal parallelism
  § Measure tasks / second; compare to raw allocate + deallocate performance

                           F(21)       F(23)       Alloc/Dealloc (for comparison)
  cumulative tasks         53,131      139,102
  Pascal                   1.2 M/s     1.3 M/s     144 M/s
  Broadwell                0.98 M/s    1.1 M/s     24 M/s
  Knights Landing          0.30 M/s    0.31 M/s    21 M/s
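A sketch of the Fibonacci test described above, illustrating the spawn-children-then-respawn-on-when_all pattern from the lifecycle slide. It follows the Kokkos tasking interface as documented today (TaskScheduler, BasicFuture, task_spawn, when_all, respawn); exact names and the scheduler constructor arguments are approximate:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

template <class Scheduler>
struct FibTask {
  using value_type  = long;
  using future_type = Kokkos::BasicFuture<long, Scheduler>;

  long n;
  future_type f1, f2;

  KOKKOS_INLINE_FUNCTION explicit FibTask(long n_) : n(n_) {}

  template <class TeamMember>
  KOKKOS_INLINE_FUNCTION void operator()(TeamMember& member, long& result) {
    auto& sched = member.scheduler();
    if (n < 2) {
      result = n;
    } else if (!f1.is_null() && !f2.is_null()) {
      // Second execution: the child tasks have completed.
      result = f1.get() + f2.get();
    } else {
      // First execution: spawn F(n-1) and F(n-2), then respawn this task
      // to run again once when_all({f1, f2}) completes.
      f1 = Kokkos::task_spawn(Kokkos::TaskSingle(sched), FibTask(n - 1));
      f2 = Kokkos::task_spawn(Kokkos::TaskSingle(sched), FibTask(n - 2));
      Kokkos::BasicFuture<void, Scheduler> deps[] = {f1, f2};
      Kokkos::respawn(this, sched.when_all(deps, 2));
    }
  }
};

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using Scheduler = Kokkos::TaskScheduler<Kokkos::DefaultExecutionSpace>;
    // Pool sized roughly like the test above: ~1 MB, 32-byte minimum blocks.
    Scheduler sched(typename Scheduler::memory_space{}, 1u << 20);

    auto result = Kokkos::host_spawn(Kokkos::TaskSingle(sched),
                                     FibTask<Scheduler>(21));
    Kokkos::wait(sched);
    std::printf("F(21) = %ld\n", result.get());
  }
  Kokkos::finalize();
  return 0;
}
```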
GPU Non-Coherent L1 Cache

§ Production and consumption of tasks
  § Create: allocate from the memory pool and construct the closure in that memory
  § Complete: destroy the closure and deallocate back to the memory pool
  § Task memory is re-used as the dynamic task-DAG executes
§ "Race" consequence of the non-coherent L1 cache:

[Diagram: SM0 [1] executes & completes task-A and [2] deallocates its block to the memory pool in global memory; SM1 [3] allocates that block for task-B and [4] constructs it, while SM0's L1 cache is left untouched by steps [3-4]; SM1 [5] pushes task-B to the queue and SM0 pops it; SM0 [6] then executes "task-??" through stale L1 data.]
GPU Non-Coherent L1 Cache: Conquered

§ Options:
  § Mark all user task code with the "volatile" qualifier to bypass the L1 cache (CUDA)
    § Extremely annoying to users: ugly and degrades performance
  Ø Manage memory motion through GPU shared memory (a.k.a. explicit L1)
    Ø Transparent to user code and retains L1 performance

[Diagram: as before, but SM1 [4.1] constructs task-B in its explicit (shared-memory) cache, [4.2] copies it to the memory-pool block in global memory, and [4.3] pushes it to the queue; SM0 [5.1] pops the queue, [5.2] copies the closure from global memory into its explicit cache, and [6] executes task-B correctly.]
Tacho's Sparse Cholesky Factorization

§ Multifrontal algorithm with a bounded memory constraint
  § Kokkos task DAG + Kokkos memory pool for shared scratch memory
  § If a task fails its allocation, it respawns to try again after other tasks deallocate (see the sketch below)
§ Test setup: scratch memory size = M * sparse-matrix supernode size
  § Compare to Intel's Pardiso; sparse matrix N=57k, NNZ=383k, 6662 supernodes

[Plots: factorizations/minute and peak memory (MB) versus thread count (0-80) on Haswell (2x16x2) and Knights Landing (1x68x4), comparing Pardiso against Tacho with M = 4, 8, and 16.]
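A sketch of the "failed allocation => respawn" idiom described above, not Tacho's actual code. It assumes a Kokkos::MemoryPool shared by the tasks and the respawn-with-no-new-dependence form of Kokkos::respawn (passing the scheduler instead of a predecessor future), both of which are approximations of the real interface:

```cpp
// Fragment of a hypothetical factorization task that holds a copy of a shared
// Kokkos::MemoryPool `pool` and its required scratch size `scratch_bytes`.
template <class TeamMember>
KOKKOS_INLINE_FUNCTION void operator()(TeamMember& member, int& /*result*/) {
  void* scratch = pool.allocate(scratch_bytes);

  if (scratch == nullptr) {
    // The bounded pool is currently exhausted: instead of blocking, re-enqueue
    // this task (low priority, no new dependence; approximate respawn form)
    // and try again after other tasks deallocate their scratch memory.
    Kokkos::respawn(this, member.scheduler(), Kokkos::TaskPriority::Low);
    return;
  }

  // ... factor this supernode's frontal matrix using `scratch` ...

  pool.deallocate(scratch, scratch_bytes);
}
```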