Rolls-Royce Hydra on GPUs using OP2
I.Z. Reguly, G.R. Mudalige, M.B. Giles (University of Oxford)
C. Bertolli, A. Betts, P.H.J. Kelly (Imperial College London, IBM T.J. Watson)
David Radford (Rolls-Royce plc.)
The Challenge
• HPC is undergoing enormous change
  – New hardware architectures
  – New parallel programming abstractions and languages
• Flat (MPI) parallelism -> multiple levels of parallelism on heterogeneous systems (Titan, CORAL)
• Getting high performance means specialization for the hardware
• Code maintainability and longevity
• “Future proofing”
Domain Specific Languages
• Separate the abstract specification of computations from the parallel implementation
• High productivity for the domain scientist
• High productivity for the library developer
  – Can experiment and validate on small benchmarks; results immediately apply to large-scale scientific codes
• As hardware changes, the library adopts the latest and greatest features and optimizations
  – “User” code doesn’t change
Domain Specific Languages
• Lots of research has been done on DSLs
  – Most of them wither away and die…
• What are the obstacles to widespread adoption?
  – Critical mass
  – Usually applied to simple, toy problems
  – Little evidence that DSLs can be applied to industrial-scale applications
Unstructured Meshes
• For extremely complex cases, unstructured meshes are the only tool capable of delivering correct results
• Large, very complicated codebases
[Figures: vorticity isosurface from a large eddy simulation; ground vortex ingestion simulation of a compressor]
OP2 for Unstructured Grids
• Abstraction:
  – Sets, maps, data
  – Loops over sets, describing how each argument is accessed

res.h:
  void res(double *A, double *u, double *du) {
    (*du) += (*A) * (*u);
  }

Iterate over edges, calling “res” for each edge with the following arguments:

  op_par_loop(res, "res", edges,
              op_arg_dat(A,  -1, OP_ID, 1, "double", OP_READ),
              op_arg_dat(u,   0, col,   1, "double", OP_READ),
              op_arg_dat(du,  0, row,   1, "double", OP_INC));
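For context, the sketch below shows how the sets, maps and data used by this loop could be declared through OP2's C API; the array contents and sizes are illustrative placeholders, not taken from any real mesh.

  #include "op_seq.h"                /* OP2 header for the reference build   */

  int main(int argc, char **argv) {
    /* mesh arrays as read from a file or generated; sizes are placeholders  */
    int    nnodes = 4, nedges = 5;
    int    row_data[5] = {0, 1, 2, 3, 0};   /* one node index per edge       */
    int    col_data[5] = {1, 2, 3, 0, 2};
    double A_data[5]  = {0}, u_data[4] = {0}, du_data[4] = {0};

    op_init(argc, argv, 2);          /* initialise OP2 (and MPI if present)  */

    op_set nodes = op_decl_set(nnodes, "nodes");
    op_set edges = op_decl_set(nedges, "edges");

    op_map row = op_decl_map(edges, nodes, 1, row_data, "row");
    op_map col = op_decl_map(edges, nodes, 1, col_data, "col");

    op_dat A  = op_decl_dat(edges, 1, "double", A_data,  "A");
    op_dat u  = op_decl_dat(nodes, 1, "double", u_data,  "u");
    op_dat du = op_decl_dat(nodes, 1, "double", du_data, "du");

    /* op_par_loop calls, as shown above, would follow here */

    op_exit();
    return 0;
  }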
Rolls-Royce Hydra
Hydra is an unstructured-mesh production CFD application used at Rolls-Royce for simulating the turbomachinery of aircraft engines.
[Figures: full aircraft; internal blades; engine noise; turbines]
Rolls-Royce Hydra
• Used for the design of turbomachinery
  – Key CFD production code
  – Steady and unsteady flow
  – Reynolds-Averaged Navier-Stokes
• In development for >15 years
  – Fortran 77
  – 50k+ lines of source code
  – ~300 computational loops
• Written in OPlus – same notions of sets, maps, data and loops over sets
• Our goal is to evaluate the utility of OP2 when applied to Rolls-Royce Hydra
Conversion
• The original source code had to be converted to use the OP2 API, keeping the “science” intact
• Because Hydra was already based on OPlus, the conversion was not difficult
  – The computations did not change: they were only outlined into kernel routines and described using the parallel loop API
• From an application developer’s point of view, this is it – the rest is about the library
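To make “outlining” concrete, here is a hypothetical before/after sketch in C (Hydra itself is Fortran, and all names below are invented for illustration): the loop body becomes a standalone user kernel, and the hand-written loop over edges is replaced by a declarative op_par_loop call.

  /* Before: the flux update written inline over the edge list               */
  for (int e = 0; e < nedges; e++) {
    int n1 = enode[2*e], n2 = enode[2*e + 1];
    double f = 0.5 * (q[n1] + q[n2]) * w[e];
    res[n1] += f;
    res[n2] -= f;
  }

  /* After: the same arithmetic, outlined into a user kernel...              */
  void edge_flux(const double *w, const double *q1, const double *q2,
                 double *res1, double *res2) {
    double f = 0.5 * (*q1 + *q2) * (*w);
    *res1 += f;
    *res2 -= f;
  }

  /* ...and the loop itself described through the OP2 API
     (enode is a dimension-2 map from edges to nodes)                        */
  op_par_loop(edge_flux, "edge_flux", edges,
              op_arg_dat(w,   -1, OP_ID, 1, "double", OP_READ),
              op_arg_dat(q,    0, enode, 1, "double", OP_READ),
              op_arg_dat(q,    1, enode, 1, "double", OP_READ),
              op_arg_dat(res,  0, enode, 1, "double", OP_INC),
              op_arg_dat(res,  1, enode, 1, "double", OP_INC));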
Code generation
• OP2-Hydra can do pure MPI right away, but performance is poor due to the loss of optimizations (function pointers, outlined code, going through Fortran-to-C bindings)
• Code generation for MPI can recover these optimizations
• A Python script parses op_par_loop calls in the high-level files and replaces them with calls to generated code
  – Why not compilers?
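Roughly what the generated per-loop code does, sketched below in C (the real generated code is Fortran and uses OP2's internal halo-exchange machinery; the stub functions here are placeholders for that, and all names are invented): the user kernel is called directly so the compiler can inline it, and halo exchanges are issued around the loop.

  /* placeholders standing in for OP2's halo management, not real API calls  */
  static void exchange_halos(void)  { /* import halo copies of indirect data */ }
  static void mark_halos_dirty(void) { /* downstream loops must re-exchange  */ }

  static inline void edge_flux(const double *w, const double *q1,
                               const double *q2, double *res1, double *res2) {
    double f = 0.5 * (*q1 + *q2) * (*w);
    *res1 += f;  *res2 -= f;
  }

  /* generated wrapper for one indirect loop: a direct, inlinable call per edge */
  void edge_flux_mpi(int n_owned, int n_exec_halo, const int *enode,
                     const double *w, const double *q, double *res) {
    exchange_halos();
    for (int e = 0; e < n_owned + n_exec_halo; e++) {
      int n1 = enode[2*e], n2 = enode[2*e + 1];
      edge_flux(&w[e], &q[n1], &q[n2], &res[n1], &res[n2]);
    }
    mark_halos_dirty();
  }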
Baseline performance
OPlus and OP2 (pure MPI) match almost exactly – even instruction counts are within 5%.
Hardware: 2-socket Xeon E5-2640, 2x12 cores, 2.4 GHz
Basic optimizations in OP2
• Support for ParMETIS and PT-Scotch partitioning
• Partial halo exchanges for boundary loops
• Mesh renumbering to improve cache locality
Hardware: 2-socket Xeon E5-2640, 2x12 cores, 2.4 GHz
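A minimal sketch of what the renumbering step involves (illustrative only, not OP2's actual implementation): given a permutation that places neighbouring nodes close together in memory (e.g. from a bandwidth-reducing ordering), node-indexed data is reordered and every map that targets nodes is rewritten.

  #include <stdlib.h>

  /* perm[old] = new index of a node; applies the renumbering to a
     node-indexed data array and to an edge->node map. Illustrative only.    */
  void renumber_nodes(int nnodes, int nedges, const int *perm,
                      double *node_data, int *edge_to_node /* 2 per edge */) {
    /* reorder node-indexed data through a temporary buffer                  */
    double *tmp = (double *)malloc(nnodes * sizeof(double));
    for (int n = 0; n < nnodes; n++) tmp[perm[n]] = node_data[n];
    for (int n = 0; n < nnodes; n++) node_data[n] = tmp[n];
    free(tmp);

    /* rewrite map entries so edges still reference the same physical nodes  */
    for (int e = 0; e < 2 * nedges; e++)
      edge_to_node[e] = perm[edge_to_node[e]];
  }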
We can match and outperform the original under the same circumstances. That alone is great, but what else can OP2 do? Enable GPU execution, of course...
Heterogeneous execution
• Fine-grained parallelism with CUDA or OpenMP
• Code generation + pre-processing to support shared-memory parallelism via coloring
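The idea behind the coloring approach, sketched below in C with OpenMP (illustrative data layout, not OP2's actual execution plan): edges are grouped into colors such that no two edges of the same color increment the same node, so edges within a color can run in parallel without atomics, while colors are processed one after another.

  #include <omp.h>

  /* edges_by_color[color_offsets[c] .. color_offsets[c+1]) lists the edges
     of color c; no two edges of a color touch the same node, so the
     increments below never race.                                            */
  void colored_edge_loop(int ncolors, const int *color_offsets,
                         const int *edges_by_color, const int *enode,
                         const double *w, const double *q, double *res) {
    for (int c = 0; c < ncolors; c++) {            /* colors run serially    */
      #pragma omp parallel for
      for (int i = color_offsets[c]; i < color_offsets[c + 1]; i++) {
        int e  = edges_by_color[i];
        int n1 = enode[2*e], n2 = enode[2*e + 1];
        double f = 0.5 * (q[n1] + q[n2]) * w[e];
        res[n1] += f;                              /* safe: unique per color */
        res[n2] -= f;
      }
    }
  }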
Generating CUDA Fortran
• A Fortran module for each “kernel”
  – Sets up pointers and reductions on the host
  – A CUDA kernel where threads set up the arguments, call the user function and do the memory movement
• Slight modifications to the user kernel
  – Qualifiers, global constants
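The generated code is CUDA Fortran, but its structure is easier to show in CUDA C; a rough sketch (all names invented) of the pattern: one thread per edge, argument pointers computed from the map, increments staged in local temporaries and applied afterwards.

  /* user kernel: essentially unchanged apart from the device qualifier      */
  __device__ void edge_flux_gpu(const double *w, const double *q1,
                                const double *q2, double *res1, double *res2) {
    double f = 0.5 * (*q1 + *q2) * (*w);
    *res1 += f;  *res2 -= f;
  }

  /* generated wrapper: threads set up arguments, call the user function,
     then apply the staged increments (atomics here for brevity; OP2 uses
     coloring on Kepler-class GPUs, as on the previous slide)                */
  __global__ void edge_flux_kernel(int nedges, const int *enode,
                                   const double *w, const double *q,
                                   double *res) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nedges) return;
    int n1 = enode[2*e], n2 = enode[2*e + 1];
    double inc1 = 0.0, inc2 = 0.0;       /* OP_INC arguments start at zero   */
    edge_flux_gpu(&w[e], &q[n1], &q[n2], &inc1, &inc2);
    atomicAdd(&res[n1], inc1);
    atomicAdd(&res[n2], inc2);
  }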
Challenges
• Large number of computational kernels
  – Direct, indirect read, indirect increment
• Huge kernels
  – Datasets have up to 18 components (double precision values per set element)
  – Some kernels move up to 120 double precision values for each set element
• It’s all about bandwidth utilization and occupancy
GPU optimizations
• Through the code generator
  – Replace device constants (regexp)
  – Change to SoA access (regexp): var(m) -> var(nodes_stride*(m-1)+1), via the OP2_SOA(var, nodes_stride, m) macro
• Manually
  – Add intent(in) to variables to enable caching loads
• Auto-tuning
  – Block sizes, register counts
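What the SoA transformation changes, sketched in C below (the real rewrite is applied to the Fortran source by the generator, as shown above; the component count here is illustrative): with SoA indexing, consecutive threads reading the same component of consecutive set elements touch consecutive addresses, so the loads coalesce.

  #define NCOMP 6   /* illustrative: components per set element              */

  /* AoS: q[n*NCOMP + m] keeps all components of element n together          */
  static inline double aos_read(const double *q, int n, int m) {
    return q[n * NCOMP + m];
  }

  /* SoA: q[m*stride + n] keeps component m of all elements contiguous,
     mirroring the Fortran rewrite var(m) -> var(nodes_stride*(m-1)+1)       */
  static inline double soa_read(const double *q, int stride, int n, int m) {
    return q[m * stride + n];
  }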
GPU optimizations
[Chart: execution time (s) on one node. OPlus (MPI): 32.04; K20 (initial): 25.61; K20 (SoA): 15.21; K20 (block-size tuned): 13.64; K20 (best): 11.6; 2x K20 (best): 8.8; K40 (best): 7.4; K80 (best): 6.1]
Node: Xeon E5-1650 @ 3.2 GHz, 2x Tesla K20m cards; also 1x Tesla K40 @ 875 MHz and 1x Tesla K80 @ 875 MHz; PGI 14.7
Strong scaling
800K vertices, 2.5M edges. 1 HECToR node (32 cores) vs. 1 Jade node (2x K20 GPUs).
[Chart: runtime (seconds) vs. number of nodes (1 to 128); series: OPlus, OP2 MPI (PT-Scotch), OP2 MPI+CUDA (PT-Scotch)]
Linear scaling up to 16 nodes (512 cores)
Weak scaling
0.5M vertices per node.
[Chart: runtime (seconds) vs. number of nodes (1 to 16); series: OPlus, OP2 MPI (PT-Scotch), OP2 MPI+CUDA (PT-Scotch)]
A GPU node delivers about 2x the performance of a HECToR node
Hybrid CPU-GPU execution
• Using the CPU and the GPU at the same time
• Some processes use the CPU, some the GPU
• How to load balance? Some loops are faster on the GPU, some on the CPU
[Chart: run time (seconds) vs. partition size balance (0.5 to 4) for 1 GPU and 1 GPU + CPU, broken down per loop: accumedges, srcsa, ifluxedge, vfluxedge, edgecon]
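A sketch of the kind of static split the partition-size-balance parameter controls (illustrative; not the actual partitioning code in OP2-Hydra): if the GPU process is measured to be r times faster per element than a CPU process on the dominant loops, it is given r times as many elements. Because different loops have different CPU/GPU speed ratios, any single split is a compromise, which is why the chart sweeps the balance factor.

  /* Split n set elements between a CPU rank and a GPU rank given measured
     per-element throughputs (elements/second). Illustrative only.           */
  void split_elements(long n, double cpu_rate, double gpu_rate,
                      long *n_cpu, long *n_gpu) {
    double gpu_share = gpu_rate / (cpu_rate + gpu_rate);
    *n_gpu = (long)((double)n * gpu_share + 0.5);
    *n_cpu = n - *n_gpu;
  }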
Conclusions
• DSLs can be applied to industrial-scale codes
• The early version was slow: the cost of a high-level API
  – Had to understand these limitations and generate code to circumvent them
• Matching and improved performance on the same hardware
  – By using OP2, some improved techniques come for “free” (renumbering, better partitioning, better MPI, etc.)
• Enabled OpenMP, CUDA and hybrid CPU+GPU execution
  – On such complicated code the performance advantage is not huge – but the option is there!
• All of these optimizations apply with no (or very little) change to the user code
Thank you! Questions? istvan.reguly@oerc.ox.ac.uk
Acknowledgements: This research has been funded by the UK Technology Strategy Board and Rolls-Royce plc. through the Siloet project, the UK Engineering and Physical Sciences Research Council projects EP/I006079/1 and EP/I00677X/1 on “Multi-layered Abstractions for PDEs”, and the “Algorithms and Software for Emerging Architectures” (ASEArch) project EP/J010553/1. The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work.
Special thanks to: Brent Leback (PGI), Maxim Milakov (NVIDIA), Leigh Lapworth, Paolo Adami, Yoon Ho (Rolls-Royce), Endre László (Oxford), Graham Markall, Fabio Luporini, David Ham, Florian Rathgeber (Imperial College), Lawrence Mitchell (Edinburgh)