(Toward) Radiative transfer on AMR with GPUs

Dominique Aubert, Université de Strasbourg
Austin, TX, 14.12.12 (slides dated Thursday 13 December 2012)

A few words about GPUs
• Cache and control logic replaced by calculation units
• Large number of multiprocessors + scheduler
• Best suited to high load, independent calculations and non-random memory access
• x10 to x100 speed-up compared to CPU
• High-level interfaces: CUDA (C, Nvidia), OpenCL (Khronos)
• High-end GPUs: ~1.5-2 kEuros, 4-6 GB RAM
• GPU machines: Tianhe (Changsha, 7168 GPUs), Titan (Oak Ridge, 7000-18 000 GPUs by 2012), MareNostrum (???); in France: Titane (198 GPUs), Curie (268 GPUs)

Principle of GPU programming with CUDA
[Diagram labels: Host RAM, GPU RAM, Shared Memory, Blocks, Calculations, Data Transfer]
• If possible: independent & identical threads
• High arithmetic intensity = acceleration

1. Cosmological Radiative Transfer

Radiative Transfer equations: explicit solver
First 2 moments of the RT equations + variable Eddington tensor with M1 closure relation (Gonzales et al. 2008, Aubert & Teyssier 2008, Rosdahl & Blaizot 2012)
$$\frac{\partial U}{\partial t} + \frac{\partial F(U)}{\partial x} = S$$
Explicit discretization in time:
$$\frac{U^{p+1} - U^{p}}{\Delta t} + \frac{\partial F(U^{p})}{\partial x} = S$$
• Explicit: the CFL condition constrains the timestep, $c < \Delta x / \Delta t$, i.e. $\Delta t < \Delta x / c$ with c = 300 000 km/s
• 100 000 timesteps are required to cover the reionization (z~5); with GPUs it's ok
Aubert & Teyssier, 08,10
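
The explicit update above can be illustrated with a minimal 1D sketch. It is not the ATON implementation: the Global Lax-Friedrichs interface flux, the periodic boundaries, the absence of a source term and the normalized speed of light `C` are all assumptions made for the example, and the names `m1_chi` and `step` are hypothetical. In 1D the M1 closure reduces to a radiative pressure `chi * E`, with `chi` a function of the reduced flux `f = |F| / (c E)`.

```python
import math

C = 1.0  # speed of light in code units (assumption: normalized)


def m1_chi(E, F):
    """M1 Eddington factor chi(f), with reduced flux f = |F| / (c E)."""
    if E <= 0.0:
        return 1.0 / 3.0  # fall back to the isotropic value
    f = min(abs(F) / (C * E), 1.0)
    return (3.0 + 4.0 * f * f) / (5.0 + 2.0 * math.sqrt(4.0 - 3.0 * f * f))


def step(E, F, dx, dt):
    """One explicit step of the 1D moment equations, periodic boundaries.

    dE/dt + dF/dx = 0 and dF/dt + c^2 d(chi E)/dx = 0 (no source term).
    """
    n = len(E)
    # Flux of the second moment equation: c^2 * chi * E
    P = [C * C * m1_chi(E[i], F[i]) * E[i] for i in range(n)]
    flux_E = []
    flux_F = []
    for i in range(n):
        j = (i + 1) % n  # right-hand neighbour of interface i+1/2
        # Global Lax-Friedrichs flux with maximum wave speed c
        flux_E.append(0.5 * (F[i] + F[j]) - 0.5 * C * (E[j] - E[i]))
        flux_F.append(0.5 * (P[i] + P[j]) - 0.5 * C * (F[j] - F[i]))
    # Conservative update; index i-1 wraps around for i = 0
    new_E = [E[i] - dt / dx * (flux_E[i] - flux_E[i - 1]) for i in range(n)]
    new_F = [F[i] - dt / dx * (flux_F[i] - flux_F[i - 1]) for i in range(n)]
    return new_E, new_F
```

Calling `step` repeatedly with `dt` below `dx / C` respects the CFL condition. Because the update is written in conservative form, the total radiative energy `sum(E)` is preserved up to rounding on a periodic grid.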

Post-Processed Radiative Transfer with ATON
Inputs: gas density + sources
• UV+X rad. transport (radiative energy): conservative transport on a regular 3D grid; fixed & predictable number of operations; independent and contiguous calculations
• H chemistry and heating (ionisation state, temperature): subcycled physics; (almost) fixed number of operations; independent and high load

Performance: GPUs vs CPUs
[Figure: timings on 64^3, 128^3, 192^3 and 256^3 grids for CPU (Opteron 2.7 GHz) and GPU (8800 GTX); annotated x80]
Aubert & Teyssier, 08,10

Multi-GPU with boundary layers
Aubert & Teyssier, 08,10
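
As a rough illustration of the boundary-layer idea, the sketch below splits a periodic 1D grid into two subdomains, pads each with one ghost cell per side copied from its neighbour, and applies a 3-point stencil locally. The stencil, the function names and the in-memory copy are assumptions for the example; on a cluster the copy would be an exchange between GPUs or MPI processes.

```python
def smooth(u):
    """3-point stencil on the interior of u; first and last cells are ghosts."""
    return [0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
            for i in range(1, len(u) - 1)]


def step_decomposed(left, right):
    """Advance two subdomains of a periodic 1D grid by one stencil pass.

    Each subdomain first receives one boundary layer (ghost cell) per side
    from its neighbour, then updates its own cells independently.
    """
    left_padded = [right[-1]] + left + [right[0]]
    right_padded = [left[-1]] + right + [left[0]]
    return smooth(left_padded), smooth(right_padded)
```

Concatenating the two results gives the same values as applying the stencil to the undivided periodic grid, which is the property a boundary-layer exchange has to guarantee.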

Applications: TRASH Project (Transfert RAdiatif Sur Hydrodynamique, i.e. radiative transfer on hydrodynamics)
Gas and source distribution from the Mare Nostrum Hydro simulation:
• 1024x1024x1024 cells + 2 refinement levels
• Self-consistent stellar particles used as sources
cudATON on TITANE-CCRT:
• 1024^3 grid
• Cartesian domain decomposition 8x8x2 (128 GPUs - S1070 servers - Infiniband DDR)
• ~60 000 - 180 000 time steps
• dt ~10 000 yrs over 1 Gyr

cudATON on TITANE-CCRT:
• 1024^3 grid
• Cartesian domain decomposition 8x8x2 (128 GPUs - S1070 servers - Infiniband DDR)
• ~60 000 - 180 000 time steps
• dt ~10 000 yrs over 1 Gyr
Structure of the UV background at different resolutions and with different subgrid models
Aubert & Teyssier, ApJ, 2010
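
The figures quoted for this run can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the rank ordering (x varying fastest) is an assumption, not something stated in the slides.

```python
def local_grid(global_dims, proc_dims):
    """Cells per subdomain for a Cartesian decomposition (must divide evenly)."""
    for n, p in zip(global_dims, proc_dims):
        if n % p != 0:
            raise ValueError("grid size must be a multiple of the process count")
    return tuple(n // p for n, p in zip(global_dims, proc_dims))


def rank_to_coords(rank, proc_dims):
    """Map a linear rank to (i, j, k) subdomain coordinates.

    Assumed ordering: x varies fastest, then y, then z.
    """
    px, py, _ = proc_dims
    return (rank % px, (rank // px) % py, rank // (px * py))


def n_steps(total_years, dt_years):
    """Number of timesteps needed to cover total_years with a fixed dt."""
    return round(total_years / dt_years)
```

With an 8x8x2 decomposition of a 1024^3 grid, `local_grid((1024, 1024, 1024), (8, 8, 2))` gives 128x128x512 cells on each of the 8*8*2 = 128 GPUs, and `n_steps(1e9, 1e4)` gives 100 000 steps, which sits inside the quoted 60 000 - 180 000 range.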

Timings on Titane
Communication: ~10-15% of global time
[Figure: timings for 512^3 on 8 GPUs, 512^3 on 64 GPUs, 1024^3 on 64 GPUs and 1024^3 on 128 GPUs]
Aubert & Teyssier, ApJ, 2010

Small scale effects
100 Mpc/h - 1024^3 box; clumping C(delta) extracted from a 12.5 Mpc/h - 1024^3 box
[Figure panels: with subgrid clumping, without subgrid clumping]
Aubert & Teyssier, ApJ, 2010

[Figure panels: J21 vs nH, x vs nH]
Aubert & Teyssier, ApJ, 2010

Residual Neutral Fraction and J21
~100 runs @ 1024^3 resolution
Aubert & Teyssier, ApJ, 2010

Residual Neutral Fraction and J21
100 Mpc/h - 1024^3
Aubert & Teyssier, ApJ, 2010

Application: Local Group Reionisation (with P. Ocvirk)
CLUES zoom on the Local Group
Timing of the local reionisation?
Ocvirk et al. 2012a,b (submitted + in prep.)

Application: Merger Trees of HII regions during overlap (with J. Chardin)
Chardin, Aubert & Ocvirk, A&A, 2012

Large Volumes for 21cm forecast (with B. Semelin)
Grand Challenge Curie-CCRT: 256 GPUs, 2048x2048x2048, 60 000 timesteps, 15h
[Figure: ionized fraction at z~10; Curie, CCRT-CEA]

RAMSES-RT (with T. Stranex & R. Teyssier, Zurich)
• RAMSES (dynamics) & ATON (RT) are coupled
• UNIGRID version will be used on Titan for the INCITE project
[Figure courtesy T. Stranex]

3. Towards Multi-Fluid AMR

• N-Body: AMR+GPU+Multi ok, ~x10 w.r.t. CPU (Aubert et al. 2009)
• Radiative Transfer: GPU+Multi ok, x30-40 ? w.r.t. CPU (Aubert & Teyssier, 2008, 2010)
• Hydro: AMR+GPU ok, ~x15 w.r.t. CPU
EMMA Project: 3 fluids coupled on an AMR structure with hardware acceleration, e.g. with GPUs

Multi-GPU PM
• 1.2 billion particles (1024^3 real particles + 2x10^8 ghosts)
• 8 sec/timestep on 64 Teslas, with 25% spent in communications
• With sort optimisation we may expect 6 sec/timestep (communication ~40%)
• Asynchronous communications?

Under heavy development

Quartz
• Written in C+CUDA+MPI
• Parallel (space-filling curve + essential tree domain decomposition)
• AMR, with FTT data structure
• N-Body + Hydro only (for the moment)
• MG Poisson solver on GPU + MUSCL-Hancock Godunov hydro solver on GPU + data logistics on CPU
• Hopefully will become EMMA (ElectroMagnetism and Mechanics on AMR) for gravity + hydro + radiation
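
A space-filling curve decomposition can be sketched as follows. The slides do not say which curve Quartz uses, so the Morton (Z-order) key here is an assumption chosen for brevity, and `morton_key` and `split_by_key` are hypothetical names. Cells are sorted along the curve and the curve is cut into nearly equal chunks, one per process.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the bits of three cell indices into a single Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key


def split_by_key(cells, n_domains, bits=10):
    """Sort cells along the curve and cut it into n_domains chunks."""
    ordered = sorted(cells, key=lambda c: morton_key(c[0], c[1], c[2], bits))
    size, extra = divmod(len(ordered), n_domains)
    chunks, start = [], 0
    for d in range(n_domains):
        # The first `extra` domains take one more cell each
        stop = start + size + (1 if d < extra else 0)
        chunks.append(ordered[start:stop])
        start = stop
    return chunks
```

On a 4x4x4 grid split into 8 domains, each chunk is a compact 2x2x2 block, which is the locality property that makes such curves attractive for domain decomposition.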

Fully Threaded Tree (Khokhlov 1997) (aka «Pointer Party»)
[Diagram: pointer links (→) between tree elements and particles]
ART (Kravtsov et al. 1997), RAMSES (Teyssier 2001)

Fully THREADED Tree
• In a lot of cases, the tree is explored horizontally, level by level (with some +/-1 level interactions at boundaries)
• Even CIC can be considered level by level
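
A pointer-based tree that can be swept horizontally might look like the sketch below. It is a simplified illustration, not the Quartz data layout: the `Cell` and `Oct` classes, the per-level dictionary `levels` and the helper names are hypothetical, and the neighbour pointers are declared but left unset to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)
class Cell:
    value: float = 0.0
    child: Optional["Oct"] = None  # refined cells point to a child oct


@dataclass(eq=False, repr=False)
class Oct:
    level: int
    parent: Optional[Cell] = None  # the coarse cell this oct refines
    # Pointers to the 6 neighbouring cells at the parent level (unset here)
    neighbors: List[Optional[Cell]] = field(default_factory=lambda: [None] * 6)
    cells: List[Cell] = field(default_factory=lambda: [Cell() for _ in range(8)])


def refine(cell, parent_level, levels):
    """Attach a child oct to cell and register it in the per-level list."""
    oct_ = Oct(level=parent_level + 1, parent=cell)
    cell.child = oct_
    levels.setdefault(oct_.level, []).append(oct_)
    return oct_


def sweep_level(levels, level, func):
    """Apply func to every cell of one level (horizontal exploration)."""
    for oct_ in levels.get(level, []):
        for cell in oct_.cells:
            func(cell)
```

Keeping one list of octs per level means a solver can visit all cells of a level without walking down from the root, while the parent and child pointers still give access to the +/-1 levels at refinement boundaries.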

[Figure panels: multi-level grid, multi-level CIC density, potential (via relaxation)]
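
The two operations named in these panels can be sketched on a single level in 1D. Both functions are illustrative assumptions rather than the code's actual kernels: `cic_density` deposits particles on a periodic grid of cell-centred densities, and `relax_poisson` uses plain Gauss-Seidel sweeps with zero potential at both ends, whereas the slides mention a multigrid solver.

```python
import math


def cic_density(positions, n_cells, dx, mass=1.0):
    """Cloud-in-cell deposit onto a periodic 1D grid of cell-centred densities."""
    rho = [0.0] * n_cells
    for x in positions:
        s = x / dx - 0.5  # position in cell units, relative to cell centres
        i = math.floor(s)
        w = s - i  # fraction of the mass going to the right-hand cell
        rho[i % n_cells] += mass * (1.0 - w) / dx
        rho[(i + 1) % n_cells] += mass * w / dx
    return rho


def relax_poisson(rho, dx, n_sweeps):
    """Gauss-Seidel relaxation for d2(phi)/dx2 = rho, phi = 0 at both ends."""
    n = len(rho)
    phi = [0.0] * n
    for _ in range(n_sweeps):
        for i in range(1, n - 1):
            phi[i] = 0.5 * (phi[i - 1] + phi[i + 1] - dx * dx * rho[i])
    return phi
```

A particle sitting on the boundary between two cells is shared equally between them, and the deposit conserves total mass. For a constant source of -2 on the unit interval the relaxation converges to x(1 - x), which the 3-point discretization reproduces exactly at the nodes.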

«Vectorization»
[Diagram: AMR tree (CPU) → gather → flat vector (GPU) → scatter → AMR tree (CPU)]
Leads to a bottleneck. Patch-based AMR may be more appropriate (see e.g. Schive et al. 2009)

How do we vectorize?
[Figure: 12 cell values (7.6, -0.1, 12.1, 2.1, 8.1, 9.9, -1.8, 0.3, -1.2, 2.5, 1.2, 3.1) at addresses 1-12, shown again in a second ordering (1.2, 3.1, -1.2, 2.5, 8.1, 9.9, -0.1, 12.1, -1.8, 7.6, 2.1, 0.3)]
• Storing neighbor values (e.g. -1.2, 9.9, -1.8, 7.6): coalescent, but requires a large gather
• Storing neighbor addresses (e.g. 3, 6, 10, 11): ~non-coalescent, but no gather
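
The two options can be compared on a toy stencil. In this sketch the cell values are those shown on the slide, but the neighbour index lists `left` and `right` passed by the caller, the function names and the 3-point stencil are hypothetical; the point is only that both strategies compute the same result with different memory access patterns.

```python
def gather_neighbor_values(cells, left, right):
    """Strategy 1: copy each cell's neighbour values into flat arrays.

    The gather is large, but the kernel then reads contiguous arrays.
    """
    u = list(cells)
    u_left = [cells[j] for j in left]
    u_right = [cells[j] for j in right]
    return u, u_left, u_right


def stencil_from_values(u, u_left, u_right):
    """Kernel reading pre-gathered neighbour values (contiguous access)."""
    return [u_left[i] - 2.0 * u[i] + u_right[i] for i in range(len(u))]


def stencil_from_addresses(cells, left, right):
    """Strategy 2: keep only neighbour addresses; reads are indirect."""
    return [cells[left[i]] - 2.0 * cells[i] + cells[right[i]]
            for i in range(len(cells))]
```

The first strategy triples the data moved to the device for this stencil, while the second moves the values once plus two index arrays and pays for scattered reads inside the kernel.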

AMR issues with explicit formulation
[Diagram: hydro and radiation timesteps on level L+1 (fine) and level L (coarse)]
• Subcycling induces problematic inter-level interactions
• It forces the hydro to be synchronized with radiation
• E.g. Rosdahl & Blaizot reduce the speed of light by a factor of 10-100 and synchronize the hydro on a small radiation timestep
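
The cost of the timestep mismatch can be made concrete with a small sketch. The recursion below shows one possible subcycling order, in which each finer level takes two half steps per coarse step; the function names, the ordering and the 300 km/s gas velocity used in the usage note are assumptions for illustration, and CFL safety factors are ignored.

```python
import math


def advance(level, max_level, dt, log):
    """Advance one AMR level by dt; finer levels take two half steps first."""
    if level < max_level:
        advance(level + 1, max_level, dt / 2.0, log)
        advance(level + 1, max_level, dt / 2.0, log)
    log.append((level, dt))


def radiation_substeps(c_light, v_gas):
    """Radiation steps per hydro step if both obey dt < dx / speed."""
    return math.ceil(c_light / v_gas)
```

With c = 300 000 km/s and a hypothetical maximum gas velocity of 300 km/s, `radiation_substeps(300000.0, 300.0)` gives 1000 radiation steps per hydro step; reducing the speed of light by a factor of 100 brings this down to 10, which is why the reduced speed of light makes a synchronized hydro step affordable.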

Current Status
Without optimizations: ~x10-15 (double precision) compared to CPU for hydro. RT might kill it or increase it...
