The Rocky Road To Tasking March 21, 2019 Ivo Kabadshow, Laura Morgenstern Jülich Supercomputing Centre Member of the Helmholtz Association
Slide 1: HPC ≠ HPC
Critical walltime differs by domain: on a scale from ns over µs, ms and s to min and h, High Frequency Trading, Deep Learning, MD, Astrophysics and Game Dev each sit at a different point; CPU cycle and network latency mark the lower end of the scale.
Requirements for MD: strong scalability, performance portability.
Slide 2: Our Motivation
Solving the Coulomb problem for Molecular Dynamics.
Task: compute all pairwise interactions of N particles.
N-body problem: O(N²) → O(N) with the FMM.
Why is that an issue?
- MD targets < 1 ms runtime per time step
- MD runs millions or billions of time steps
- not compute-bound, but synchronization-bound
- no libraries (like BLAS) to do the heavy lifting
We might have to look under the hood ... and get our hands dirty.
Slides 3–4: Parallelization Potential vs. Algorithmic Complexity
Classical approach: O(N²), low algorithmic complexity, easy to parallelize (lots of independent parallelism).
Fast Multipole Method (FMM): O(N), high algorithmic complexity, hard to parallelize (many dependent phases, varying amount of parallelism).
Slide 5: Coarse-Grained Parallelization
[Diagram: the phases P2M, M2M, M2L, L2L, L2P and P2P between Input and Output, separated by synchronization points.]
- Different amount of available loop-level parallelism within each phase
- Some phases contain sub-dependencies
- Synchronizations might be problematic
Slides 6–7: Dataflow – Fine-grained Dependencies
FMM algorithmic flow: multipole to multipole (M2M), shifting multipoles upwards through the tree levels d = 4 to d = 0.
[Figure: tree of boxes over depths d = 0…4 with per-box dependencies between the task types p2m, m2m, m2l, l2l and l2p.]
Slides 8–9: Dataflow – Fine-grained Dependencies
FMM algorithmic flow: multipole to local (M2L), translating remote multipoles into local Taylor moments.
[Figure: tree over depths d = 0…4 with M2L translations between well-separated boxes and the task types p2m, m2m, m2l, l2l and l2p.]
Slides 10–11: Dataflow – Fine-grained Dependencies
FMM algorithmic flow: local to local (L2L), shifting Taylor moments downwards from d = 0 to d = 4.
[Figure: tree over depths d = 0…4 with downward L2L dependencies between the task types p2m, m2m, m2l, l2l and l2p.]
Slide 12: CPU Tasking Framework
Components: Queue, Dispatcher, Scheduler, TaskFactory, ThreadingWrapper, LoadBalancer, ...; Thread and Core layers underneath.
Slide 13: CPU Tasking Framework
Task life-cycle per thread: a new task passes from the TaskFactory via the LoadBalancer into the Queues, from which the Dispatcher takes it for execution.
- Tasks can be prioritized by task type
- Only ready-to-execute tasks are stored in the queue
- Workstealing from other threads is possible
Slide 14: Tasking Without Workstealing
103 680 particles on 2× Intel Xeon E5-2680 v3 (2×12 cores).
[Plot: number of active threads (0–24) over runtime (0–0.8 s), broken down into the task types P2M, M2M, M2L, L2L, L2P and P2P.]
Slide 15: Tasking With Workstealing
103 680 particles on 2× Intel Xeon E5-2680 v3 (2×12 cores).
[Plot: number of active threads (0–24) over runtime (0–0.8 s), broken down into the task types P2M, M2M, M2L, L2L, L2P and P2P.]
Slide 16: GPU Tasking
Goal: provide the same features as CPU tasking:
- Static and dynamic load balancing
- Priority queues
- Ready-to-execute tasks
Slide 17: GPU Tasking
Uniform programming model for CPUs and GPUs.
Slide 18: Pitfalls – Performance Portability
Diverse GPU programming approaches: OpenCL, CUDA, SYCL.
Our requirements: a strong subset of C++11, portability between GPU vendors, tasking features, maturity.
(Intermediate) solution: use CUDA for reasons of performance, specific tasking features and maturity, and accept that the code is not portable out of the box.