
Back from School: IN2P3 2016 Computing School « Heterogeneous Hardware Parallelism » - PowerPoint PPT Presentation



  1. Back from School: IN2P3 2016 Computing School « Heterogeneous Hardware Parallelism »
     Vincent Lafage, lafage@ipno.in2p3.fr
     S2I, Institut de Physique Nucléaire d’Orsay, Université Paris-Sud
     9 October 2016

  2. Thema
     « Discover the major forms of parallelism offered by modern hardware architectures. Explore how to take advantage of these by combining multiple software technologies, looking for a good balance of performance, code portability and durability. Most of the technologies presented here will be applied to a simplified example of particle collision simulation. »
     https://indico.in2p3.fr/event/13126/

  3. The code, à la Rosetta Code (F77, F90, Ada, C++, C & Go): e⁺ e⁻ → 3 γ
     https://bitbucket.org/bixente/3photons/src
     Different technologies implemented or simply mentioned:
     • OpenMP
     • C++11 & HPX
     • OpenCL (+ MPI)
     • Intel Threading Building Blocks TBB (C++)
     • DSL-Python
     • OpenCL for FPGA

  4. Numerical integration ⇒ Monte-Carlo method (no adaptive refinement)
     • generate an event with a weight:
       · pseudo-random number generator
       · translate to a random event with RAMBO (« A new Monte Carlo treatment of multiparticle phase space at high energies »)
     • check if it passes cuts,
     • compute the matrix element,
     • weight it,
     • sum it up… and start again.
     But you have two types of iterations:
     1. the ones that depend upon previous ones…
     2. … and others, which are rather delightfully parallel
     ⇒ Monte-Carlo is embarrassingly parallel (embarrassingly… as in embarrassment of riches)
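     To make the two kinds of iterations concrete: in a Monte-Carlo integration every event is generated and evaluated independently, and the only coupling between iterations is the final accumulation. The toy below is not the 3photons code, just a minimal sketch of that structure (here estimating π):

         // Minimal Monte-Carlo sketch: every iteration is independent,
         // the only coupling between iterations is the final sum (a reduction).
         #include <cstdint>
         #include <cstdio>
         #include <random>

         int main ()
         {
             const std::uint64_t nEvents = 10000000;
             std::mt19937_64 rng (42);                   // one stream; a parallel run needs one per worker
             std::uniform_real_distribution<double> u (0.0, 1.0);

             std::uint64_t accepted = 0;                 // events that "pass the cut"
             for (std::uint64_t ev = 0; ev < nEvents; ev++) {
                 const double x = u (rng), y = u (rng);  // generate an "event"
                 if (x * x + y * y < 1.0)                // apply the cut
                     accepted++;                         // weight = 1 here; sum it up
             }
             std::printf ("pi ~ %f\n", 4.0 * (double) accepted / (double) nEvents);
             return 0;
         }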

  5. Pseudo Random Numbers
     1. Linear congruential, Lehmer; see Knuth, « Seminumerical Algorithms », and « Random numbers fall mainly in the planes » (the RANDU problem)
     2. RANLUX, Lüscher, « A portable high-quality random number generator for lattice field theory simulations »
     3. xorshift, Marsaglia
     4. Mersenne Twister, Matsumoto-Nishimura
     5. Counter Based, Random123
        https://www.deshawresearch.com/resources_random123.html
        http://dx.doi.org/10.1145/2063384.2063405
     A good generator should:
     • satisfy rigorous statistical testing (BigCrush in TestU01),
     • vectorize and parallelize well (… > 2⁶⁴ independent streams),
     • have long periods (… > 2¹²⁸),
     • require little or no memory or state,
     • have excellent performance (a few cycles per random byte).
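     As an illustration of the "little state, few cycles per byte" requirement, here is Marsaglia's 64-bit xorshift with the classic 13/7/17 shift triple. A sketch only, not one of the generators recommended above for production use:

         // Marsaglia xorshift64: 8 bytes of state, three shifts and three XORs per draw.
         #include <cstdint>
         #include <cstdio>

         struct Xorshift64 {
             std::uint64_t state;                        // must be seeded non-zero
             std::uint64_t next ()
             {
                 std::uint64_t x = state;
                 x ^= x << 13;
                 x ^= x >> 7;
                 x ^= x << 17;
                 return state = x;
             }
             // Uniform double in [0, 1): keep 53 bits for the mantissa.
             double uniform () { return (next () >> 11) * (1.0 / 9007199254740992.0); }
         };

         int main ()
         {
             Xorshift64 rng { 0x123456789abcdefULL };
             for (int i = 0; i < 3; i++)
                 std::printf ("%f\n", rng.uniform ());
             return 0;
         }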

  6. Main loop
     //!> Start of integration loop
     startTime = getticks ();
     int ev, evSelected = 0;
     double evWeight;
     for (ev = 0; ev < run->nbrOfEvents; ev++) {
         // Reset matrix elements
         resetME2 (&ee3p);
         // Event generator
         evWeight = generateRambo (&rambo, outParticles, 3, run->ETotal);
         evWeight = evWeight * run->cstVolume;
         // Sort outgoing photons by energy
         sortPhotons (outParticles);
         // Spinor inner product, scalar product and
         // center-of-mass frame angles computation
         computeSpinorProducts (&ee3p.spinor, ee3p.momenta);
         // Intermediate computation
         computeScalarProducts (&ee3p);
         if (selectEvent (&pParameters, &ee3p)) {
             computeME2 (&ee3p, &pParameters, run->ETotal);
             updateStatistics (&statistics, &pParameters, &ee3p, evWeight);
             evSelected++;
         }
     }

  7. OpenMP'ified loop
     //!> Start of integration loop
     startTime = getticks ();
     int ev, evSelected = 0;
     double evWeight;
     #pragma omp for reduction (+:…)
     for (ev = 0; ev < run->nbrOfEvents; ev++) {
         // Reset matrix elements
         resetME2 (&ee3p);
         // Event generator
         evWeight = generateRambo (&rambo, outParticles, 3, run->ETotal);
         evWeight = evWeight * run->cstVolume;
         // Sort outgoing photons by energy
         sortPhotons (outParticles);
         // Spinor inner product, scalar product and
         // center-of-mass frame angles computation
         computeSpinorProducts (&ee3p.spinor, ee3p.momenta);
         // Intermediate computation
         computeScalarProducts (&ee3p);
         if (selectEvent (&pParameters, &ee3p)) {
             computeME2 (&ee3p, &pParameters, run->ETotal);
             updateStatistics (&statistics, &pParameters, &ee3p, evWeight);
             evSelected++;
         }
     }
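     The reduction list is elided on the slide; the self-contained toy below shows the same pattern with explicit accumulators (the variable names and the fake per-event work are placeholders, not the 3photons code). In a parallelised version of the loop above, per-iteration temporaries such as evWeight would also need to be private to each thread.

         // Minimal OpenMP reduction sketch (compile with -fopenmp).
         // Each iteration is independent; OpenMP gives every thread private copies
         // of the listed accumulators and combines them when the loop ends.
         #include <cstdio>

         int main ()
         {
             const int nEvents  = 1000000;
             long long selected = 0;
             double    sum      = 0.0;

             #pragma omp parallel for reduction (+:selected, sum)
             for (int ev = 0; ev < nEvents; ev++) {
                 const double weight = 1.0 / (1.0 + ev);  // stand-in for the per-event weight
                 if (ev % 2 == 0) {                       // stand-in for the event selection
                     sum += weight;
                     selected++;
                 }
             }
             std::printf ("selected = %lld, sum = %f\n", selected, sum);
             return 0;
         }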

  8. Moore’s Law: transition to many-core
     There is no escaping parallel computing any more, even on a laptop.
     The gain now comes from on-chip parallelism; it has been ~10 years since serial code just “got faster”.
     http://wccftech.com/moores-law-will-be-dead-2020-claim-experts-fab-process-cpug-pus-7-nm-5-nm/

  9. 1998-2014: the memory wall
     Year   Proc         GFlops   GHz     Cores    SIMD    GB/s
     1998   DEC α        0.750    0.375   1        -       0.6
     2014   Intel Xeon   500      2.6     2 × 14   AVX2    68
     ratio               × 1333   × 7     × 28     × 4/8   × 100

  10. Summary: over the last 15 years
      Supercomputer speedup:         × 1000
      Node speedup:                  × 1000
      Speedup from frequency:        × 10
      Speedup from SMP parallelism:  × 10
      Speedup from vectorization:    × 10
      Memory bandwidth increase:     × 100

  11. Moving data (Andreas Klöckner, « DSL to Manycore », PyOpenCL / Loo.py)
      Data is moved through wires. Wires behave like an RC circuit.
      Trade-off: longer response time (“latency”) vs. higher current (more power).
      Physics says: communication is slow, power-hungry, or both.

  12. If all you have is a hammer,… everything looks like a nail
      Prefer moving the work to the data over moving the data to the work; data should only travel
      • … as initial input to GPU memory,
      • … as final output (aggregate) back from GPU memory.
      Not only do we need to parallelize, we also need the arithmetic intensity of the problem to be high enough that we are not memory bound: do not be blinded by the number of cores in a GPU, nor by the benchmarks.
      Scalar products and convolutions are limited by memory, whereas matrix products, inversions and diagonalizations are ideally suited to vector architectures. To look at a problem through the eyes of a GPU coder, you need to identify matrix products.
      Theory simulations are likely to benefit from GPUs, as they usually rely on less data to transfer.
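      A back-of-the-envelope way to see why scalar products are memory bound while matrix products are not is to count flops per byte moved. The little program below uses textbook operation counts, assuming each matrix moves only once (not measurements from the 3photons code):

          // Rough arithmetic-intensity estimates, double precision, naive counts:
          //   dot product of length n : 2*n   flops for 16*n   bytes read        -> 0.125 flop/byte, for any n
          //   matrix product, n x n   : 2*n^3 flops for 24*n^2 bytes (A, B, C)   -> grows like n/12 flop/byte
          #include <cstdio>

          int main ()
          {
              const double n = 1000.0;
              const double dotIntensity    = 2.0 * n / (16.0 * n);
              const double matmulIntensity = 2.0 * n * n * n / (24.0 * n * n);
              std::printf ("dot product   : %.3f flop/byte\n", dotIntensity);
              std::printf ("matmul n=%.0f : %.1f flop/byte\n", n, matmulIntensity);
              return 0;
          }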

  13. To vectorize efficiently,…
      • find SIMD parallelism
      • keep data aligned and close to the cores (good space and time locality of data)
      • choose your weapon:
        • assembly language (+ antidepressant)
        • compiler vector intrinsics (+ aspirin)
        • a good compiler (… for trivial loops) + OpenMP 4+
        • proper libraries (e.g. Eigen, boost::simd,…)
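      Two of those weapons applied to the same trivial loop, y[i] += a * x[i]: letting the compiler (helped by OpenMP 4 "simd") vectorize it, and spelling it out with AVX intrinsics. A sketch that assumes an x86-64 CPU with AVX and flags such as -mavx -fopenmp; the function names are mine:

          // (a) trust the compiler / OpenMP 4 "simd" on a trivial loop,
          // (b) compiler vector intrinsics: 4 doubles per 256-bit AVX register.
          #include <immintrin.h>
          #include <cstdio>

          void axpy_simd (double a, const double *x, double *y, int n)
          {
              #pragma omp simd
              for (int i = 0; i < n; i++)
                  y[i] += a * x[i];
          }

          void axpy_avx (double a, const double *x, double *y, int n)
          {
              const __m256d va = _mm256_set1_pd (a);
              int i;
              for (i = 0; i + 4 <= n; i += 4) {
                  __m256d vy = _mm256_loadu_pd (y + i);
                  __m256d vx = _mm256_loadu_pd (x + i);
                  vy = _mm256_add_pd (vy, _mm256_mul_pd (va, vx));
                  _mm256_storeu_pd (y + i, vy);
              }
              for (; i < n; i++)                        // scalar tail
                  y[i] += a * x[i];
          }

          int main ()
          {
              alignas (32) double x[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
              alignas (32) double y[8] = { 0 };
              axpy_simd (2.0, x, y, 8);
              axpy_avx  (3.0, x, y, 8);
              std::printf ("y[7] = %f\n", y[7]);        // (2 + 3) * 8 = 40
              return 0;
          }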

  14. Choosing a SW technology
      Key points:
      • an efficient & portable application (technologies keep changing)
      • a standard, free (cheap) technology
      • easy development (parallelism is not an easy task)
      • development tools (debugger, performance analysis)
      The balance to strike: efficiency vs portability.
      http://libocca.org/talks/riceOG15.pdf
      Palaiseau, 23-27 May 2016, « École informatique IN2P3 2016 »

  15. The C++ Standard
      • C++11 introduced lower-level abstractions: std::thread, std::mutex, std::future, etc.
      • Fairly limited (low level); more is needed
      • C++ needs stronger support for higher-level parallelism
      • New standard C++17: parallel versions of STL algorithms (P0024R2)
      • Several proposals to the Standardization Committee are accepted or under consideration:
        • Technical Specification: Concurrency (N4577)
        • Other proposals: Coroutines (P0057R2), task blocks (N4411), executors (P0058R1)
      (from « Massively Parallel Task-Based Programming with HPX », Thomas Heller, 23.05.2016)
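      A minimal, self-contained illustration of those C++11 building blocks (std::async returning a std::future), before any HPX is involved; splitting the sum into two halves is just for the example:

          // C++11 task parallelism with the standard library alone:
          // std::async launches work on another thread, std::future collects the result.
          #include <cstdio>
          #include <future>
          #include <numeric>
          #include <vector>

          double partialSum (const std::vector<double> &v, std::size_t begin, std::size_t end)
          {
              return std::accumulate (v.begin () + begin, v.begin () + end, 0.0);
          }

          int main ()
          {
              std::vector<double> data (1000000, 1.0);
              const std::size_t half = data.size () / 2;

              // Launch the first half asynchronously, compute the second half here.
              std::future<double> first =
                  std::async (std::launch::async, partialSum, std::cref (data), std::size_t (0), half);
              const double second = partialSum (data, half, data.size ());

              std::printf ("sum = %f\n", first.get () + second);
              return 0;
          }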

  16. HPX 101 – API Overview
                                    Synchronous               Asynchronous                     Fire & Forget
                                    R f(p...) (returns R)     (returns future<R>)              (returns void)
      C++ Standard Library
        Functions (direct)          f(p...)                   async(f, p...)                   apply(f, p...)
        Functions (lazy)            bind(f, p...)(...)        async(bind(f, p...), ...)        apply(bind(f, p...), ...)
      HPX
        Actions (direct)            HPX_ACTION(f, a)          HPX_ACTION(f, a)                 HPX_ACTION(f, a)
                                    a()(id, p...)             async(a(), id, p...)             apply(a(), id, p...)
        Actions (lazy)              HPX_ACTION(f, a)          HPX_ACTION(f, a)                 HPX_ACTION(f, a)
                                    bind(a(), id, p...)(...)  async(bind(a(), id, p...), ...)  apply(bind(a(), id, p...), ...)
      In addition: dataflow(func, f1, f2);
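      A tiny sketch of the table's asynchronous column in plain HPX code. It assumes a reasonably recent HPX release; header names have moved between versions, so the includes below may need adjusting:

          // hpx::async returns an hpx::future; hpx::dataflow chains work on futures.
          #include <hpx/hpx_main.hpp>   // lets plain main() run inside the HPX runtime
          #include <hpx/future.hpp>
          #include <cstdio>
          #include <utility>

          int square (int x) { return x * x; }

          int main ()
          {
              // async(f, p...) -> future<R>
              hpx::future<int> fa = hpx::async (square, 3);
              hpx::future<int> fb = hpx::async (square, 4);

              // dataflow(func, f1, f2): runs func once both futures are ready;
              // func receives the (ready) futures themselves.
              hpx::future<int> sum = hpx::dataflow (
                  [] (hpx::future<int> a, hpx::future<int> b) { return a.get () + b.get (); },
                  std::move (fa), std::move (fb));

              std::printf ("3^2 + 4^2 = %d\n", sum.get ());   // 25
              return 0;
          }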
