The Rocky Road To Tasking March 21, 2019 Ivo Kabadshow, Laura Morgenstern Jülich Supercomputing Centre Member of the Helmholtz Association
Slide 1: HPC ≠ HPC
Critical walltime differs by domain: on a scale from ns over µs, ms and s to min and h, High Frequency Trading, Deep Learning, MD, Astrophysics and Game Dev each sit at a different point; CPU cycle and network latency mark the lower end of the scale.
Requirements for MD: strong scalability, performance portability.
Slide 2: Our Motivation
Solving the Coulomb problem for Molecular Dynamics.
Task: compute all pairwise interactions of N particles.
N-body problem: O(N²) → O(N) with the FMM.
Why is that an issue?
- MD targets < 1 ms runtime per time step
- MD runs millions or billions of time steps
- not compute-bound, but synchronization-bound
- no libraries (like BLAS) to do the heavy lifting
We might have to look under the hood ... and get our hands dirty.
Slides 3–4: Parallelization Potential vs. Algorithmic Complexity
Classical approach: O(N²), low algorithmic complexity, easy to parallelize (lots of independent parallelism).
Fast Multipole Method (FMM): O(N), high algorithmic complexity, hard to parallelize (many dependent phases, varying amount of parallelism).
Slide 5: Coarse-Grained Parallelization
[Diagram: the phases P2M, M2M, M2L, L2L, L2P and P2P between Input and Output, separated by synchronization points.]
- Different amount of available loop-level parallelism within each phase
- Some phases contain sub-dependencies
- Synchronizations might be problematic
Slides 6–7: Dataflow – Fine-grained Dependencies
FMM algorithmic flow: multipole to multipole (M2M), shifting multipoles upwards through the tree levels d = 4 to d = 0.
[Figure: tree of boxes over depths d = 0…4 with per-box dependencies between the task types p2m, m2m, m2l, l2l and l2p.]
Slides 8–9: Dataflow – Fine-grained Dependencies
FMM algorithmic flow: multipole to local (M2L), translating remote multipoles into local Taylor moments.
[Figure: tree over depths d = 0…4 with M2L translations between well-separated boxes and the task types p2m, m2m, m2l, l2l and l2p.]
Slides 10–11: Dataflow – Fine-grained Dependencies
FMM algorithmic flow: local to local (L2L), shifting Taylor moments downwards from d = 0 to d = 4.
[Figure: tree over depths d = 0…4 with downward L2L dependencies between the task types p2m, m2m, m2l, l2l and l2p.]
Slide 12: CPU Tasking Framework
Components: Queue, Dispatcher, Scheduler, TaskFactory, ThreadingWrapper, LoadBalancer, ...; Thread and Core layers underneath.
Slide 13: CPU Tasking Framework
Task life-cycle per thread: a new task passes from the TaskFactory via the LoadBalancer into the Queues, from which the Dispatcher takes it for execution.
- Tasks can be prioritized by task type
- Only ready-to-execute tasks are stored in the queue
- Workstealing from other threads is possible
Slide 14: Tasking Without Workstealing
103 680 particles on 2× Intel Xeon E5-2680 v3 (2×12 cores).
[Plot: number of active threads (0–24) over runtime (0–0.8 s), broken down into the task types P2M, M2M, M2L, L2L, L2P and P2P.]
Slide 15: Tasking With Workstealing
103 680 particles on 2× Intel Xeon E5-2680 v3 (2×12 cores).
[Plot: number of active threads (0–24) over runtime (0–0.8 s), broken down into the task types P2M, M2M, M2L, L2L, L2P and P2P.]
Slide 16: GPU Tasking
Goal: provide the same features as CPU tasking:
- Static and dynamic load balancing
- Priority queues
- Ready-to-execute tasks
Slide 17: GPU Tasking
Uniform programming model for CPUs and GPUs.
Slide 18: Pitfalls – Performance Portability
Diverse GPU programming approaches: OpenCL, CUDA, SYCL.
Our requirements: a strong subset of C++11, portability between GPU vendors, tasking features, maturity.
(Intermediate) solution: use CUDA for reasons of performance, specific tasking features and maturity, and accept that the code is not portable out of the box.