
The Rocky Road To Tasking
March 21, 2019
Ivo Kabadshow, Laura Morgenstern
Jülich Supercomputing Centre, Member of the Helmholtz Association


  1. The Rocky Road To Tasking. March 21, 2019. Ivo Kabadshow, Laura Morgenstern, Jülich Supercomputing Centre, Member of the Helmholtz Association.

  2. Requirements for MD (Slide 1)
     - Strong scalability
     - Performance portability
     HPC ≠ HPC
     [Figure: critical walltime axis (ns, µs, ms, s, min, h) with the labels CPU Cycle, Network Latency, High Frequency Trading, Deep Learning, MD, Astrophysics and Game Dev.]
     March 21, 2019, Member of the Helmholtz Association


  7. Our Motivation (Slide 2)
     Solving the Coulomb problem for Molecular Dynamics
     - Task: compute all pairwise interactions of N particles
     - N-body problem: O(N²) → O(N) with FMM
     Why is that an issue?
     - MD targets < 1 ms runtime per time step
     - MD runs millions or billions of time steps
     - Not compute-bound, but synchronization-bound
     - No libraries (like BLAS) to do the heavy lifting
     We might have to look under the hood ... and get our hands dirty.
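
To make the O(N²) cost of the pairwise task concrete, here is a minimal sketch of direct Coulomb summation in C++. It is not code from the talk: the `Particle` type, the function names and the unit choice (prefactor 1) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle type for illustration; not taken from the slides.
struct Particle {
    double x, y, z;  // position
    double q;        // charge
};

// Direct summation of the Coulomb energy. Each of the N(N-1)/2 pairs is
// visited exactly once, so the cost grows as O(N^2).
// Units are chosen so that the physical prefactor is 1 (an assumption).
double coulomb_energy_direct(const std::vector<Particle>& p) {
    double energy = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            assert(r > 0.0 && "particles must not coincide");
            energy += p[i].q * p[j].q / r;
        }
    }
    return energy;
}

// Number of pair interactions the direct method evaluates for n particles.
std::size_t pair_count(std::size_t n) { return n * (n - 1) / 2; }
```

For the 103 680 particles used in the benchmark slides of this deck, `pair_count` gives about 5.4 × 10⁹ pairs per time step, which is hard to reconcile with a target of less than 1 ms per step and motivates the O(N) FMM.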

  8. Parallelization Potential (Slide 3)
     Classical approach, O(N²): lots of independent parallelism.
     [Figure: parallelization (easy to hard) against algorithmic complexity (high to low); the point "Classical" is annotated O(N²).]

  9. Parallelization Potential (Slide 4)
     Fast Multipole Method (FMM), O(N):
     - Many dependent phases
     - Varying amount of parallelism
     [Figure: parallelization (easy to hard) against algorithmic complexity (high to low); the points are "Classical", annotated O(N²), and "FMM", annotated O(N).]

  10. Coarse-Grained Parallelization (Slide 5)
      [Figure: phases P2M, M2M, M2L, L2L, L2P and P2P between input and output, with synchronization points.]
      - Different amount of available loop-level parallelism within each phase
      - Some phases contain sub-dependencies
      - Synchronizations might be problematic
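
Why synchronization points hurt when phases offer different amounts of loop-level parallelism can be illustrated with a small back-of-the-envelope model in C++. This is a sketch under strong assumptions that are not from the slides: every work item in a phase costs the same, and all threads wait at a barrier at the end of each phase. The names `PhaseWork`, `rounds` and `utilization` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one phase: the number of equally expensive,
// independent work items in its loop (an assumption made for illustration).
struct PhaseWork {
    const char* name;
    std::size_t items;
};

// Rounds needed when `threads` workers process `items` equal-cost items
// and all workers wait at a barrier at the end of the phase.
std::size_t rounds(std::size_t items, std::size_t threads) {
    return (items + threads - 1) / threads;  // ceil(items / threads)
}

// Fraction of the occupied thread time that is spent on useful work,
// summed over all phases.
double utilization(const std::vector<PhaseWork>& phases, std::size_t threads) {
    std::size_t useful = 0;
    std::size_t occupied = 0;
    for (const PhaseWork& p : phases) {
        useful += p.items;
        occupied += rounds(p.items, threads) * threads;
    }
    return occupied == 0 ? 1.0
                         : static_cast<double>(useful) / static_cast<double>(occupied);
}
```

In this model, a phase with 24 items followed by a phase with a single item on 24 threads yields a utilization of 25/48, roughly 52 %: during the second phase, 23 threads only wait for the barrier.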


  12. Dataflow: Fine-grained Dependencies (Slide 6)
      FMM algorithmic flow: multipole to multipole (M2M), shifting multipoles upwards.
      [Figure: tree over the levels d = 0 to 4.]

  13. Dataflow: Fine-grained Dependencies (Slide 7)
      FMM algorithmic flow: multipole to multipole (M2M), shifting multipoles upwards.
      [Figure: tree over the levels d = 0 to 4, with the dataflow labels p2m, l2p, m2m, m2l, l2l.]
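
The fine-grained dependency idea for the upward M2M pass can be sketched with per-node counters: a parent becomes ready as soon as its last child has finished, without a barrier between tree levels. The following C++ sketch is sequential and uses a binary tree in heap layout with a single scalar standing in for a multipole expansion; all of these choices, and the names `M2MTree` and `run_m2m`, are assumptions for illustration, not the authors' implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Complete binary tree stored as a heap array: node 0 is the root (d = 0),
// the children of node i are 2i+1 and 2i+2.
struct M2MTree {
    std::vector<double> multipole;  // one scalar stands in for a multipole expansion
    std::vector<int> pending;       // children that have not finished yet
};

// Shifts multipoles upwards. A parent is queued only after the last of its
// children has been processed, so readiness is tracked per node instead of
// per level. Returns the order in which the nodes were completed.
std::vector<std::size_t> run_m2m(M2MTree& tree, std::size_t depth) {
    const std::size_t nodes = (std::size_t{1} << (depth + 1)) - 1;
    const std::size_t first_leaf = (std::size_t{1} << depth) - 1;
    assert(tree.multipole.size() == nodes);

    tree.pending.assign(nodes, 2);
    std::queue<std::size_t> ready;  // holds only ready-to-execute nodes
    for (std::size_t leaf = first_leaf; leaf < nodes; ++leaf) {
        tree.pending[leaf] = 0;
        ready.push(leaf);
    }

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t node = ready.front();
        ready.pop();
        order.push_back(node);
        if (node == 0) break;  // the root has no parent
        const std::size_t parent = (node - 1) / 2;
        tree.multipole[parent] += tree.multipole[node];  // placeholder for the M2M operator
        if (--tree.pending[parent] == 0) ready.push(parent);
    }
    return order;
}
```

In a multithreaded version the counters would have to be atomic and the queue thread-safe; the sketch only shows how readiness follows from the data dependencies.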

  14. Dataflow: Fine-grained Dependencies (Slide 8)
      FMM algorithmic flow: multipole to local (M2L), translating remote multipoles into local Taylor moments.
      [Figure: tree over the levels d = 0 to 4.]

  15. Dataflow: Fine-grained Dependencies (Slide 9)
      FMM algorithmic flow: multipole to local (M2L), translating remote multipoles into local Taylor moments.
      [Figure: tree over the levels d = 0 to 4, with the dataflow labels p2m, l2p, m2m, m2l, l2l.]

  16. Dataflow: Fine-grained Dependencies (Slide 10)
      FMM algorithmic flow: local to local (L2L), shifting Taylor moments downwards.
      [Figure: tree over the levels d = 0 to 4.]

  17. Dataflow: Fine-grained Dependencies (Slide 11)
      FMM algorithmic flow: local to local (L2L), shifting Taylor moments downwards.
      [Figure: tree over the levels d = 0 to 4, with the dataflow labels p2m, l2p, m2m, m2l, l2l.]

  18. CPU Tasking Framework (Slide 12)
      [Figure: framework components Queue, Scheduler, Dispatcher, TaskFactory, LoadBalancer, ThreadingWrapper, Thread and Core.]


  22. CPU Tasking Framework: Task life-cycle per thread (Slide 13)
      [Figure with the labels: new task, TaskFactory, LoadBalancer, Queues, Dispatcher, task execution.]
      - Tasks can be prioritized by task type
      - Only ready-to-execute tasks are stored in the queue
      - Work stealing from other threads is possible
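
The three properties of the task life-cycle (priority by task type, queues that hold only ready tasks, stealing from other threads) can be sketched with one priority queue per worker. This C++ sketch is not the framework's API: `TaskType`, `Task`, `WorkerQueue` and `next_task` are hypothetical names, the priority order of the task types is an assumed example, and the thief simply takes the victim's highest-priority task, which is a simplification.

```cpp
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

// Assumed priority order for illustration: a larger value is served first.
enum class TaskType { P2P = 0, L2P = 1, L2L = 2, M2L = 3, M2M = 4, P2M = 5 };

struct Task {
    TaskType type;
    int id;
};

// Orders the priority queue by task type.
struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return static_cast<int>(a.type) < static_cast<int>(b.type);
    }
};

// Per-worker queue. Callers push a task only once it is ready to execute.
class WorkerQueue {
public:
    void push(const Task& t) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push(t);
    }
    bool try_pop(Task& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        out = tasks_.top();
        tasks_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::priority_queue<Task, std::vector<Task>, ByPriority> tasks_;
};

// Next task for worker `self`: its own queue first, then steal from the others.
bool next_task(std::vector<WorkerQueue>& queues, std::size_t self, Task& out) {
    if (queues[self].try_pop(out)) return true;
    for (std::size_t i = 1; i < queues.size(); ++i) {
        if (queues[(self + i) % queues.size()].try_pop(out)) return true;
    }
    return false;
}
```

With two workers, pushing a P2P task and an M2M task to worker 0 makes worker 0 receive the M2M task first, while an idle worker 1 obtains the remaining P2P task by stealing it.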


  27. Tasking Without Workstealing (Slide 14)
      103 680 particles on 2× Intel Xeon E5-2680 v3 (2×12 cores)
      [Plot: number of active threads (0 to 24) over runtime (0 to 0.8 s), broken down by P2M, M2M, M2L, L2L, L2P and P2P.]

  28. Tasking With Workstealing (Slide 15)
      103 680 particles on 2× Intel Xeon E5-2680 v3 (2×12 cores)
      [Plot: number of active threads (0 to 24) over runtime (0 to 0.8 s), broken down by P2M, M2M, M2L, L2L, L2P and P2P.]


  30. GPU Tasking (Slide 16)
      Goal: provide the same features as CPU tasking:
      - Static and dynamic load balancing
      - Priority queues
      - Ready-to-execute tasks

  31. GPU Tasking: Uniform Programming Model for CPUs and GPUs (Slide 17)


  36. Pitfalls: Performance Portability (Slide 18)
      Diverse GPU programming approaches:
      - OpenCL
      - CUDA
      - SYCL
      Our requirements:
      - Strong subset of C++11
      - Portability between GPU vendors
      - Tasking features
      - Maturity
      (Intermediate) solution: use CUDA for reasons of performance, specific tasking features and maturity. Take the loss of not being portable out of the box.
