
CHEP 2010: How to harness the performance potential of current Multi-Core CPUs and GPUs - PowerPoint PPT Presentation

CHEP 2010: How to harness the performance potential of current Multi-Core CPUs and GPUs. Sverre Jarp, CERN openlab, IT Dept., CERN. Taipei, Monday 18 October 2010.


  1. CHEP 2010: How to harness the performance potential of current Multi-Core CPUs and GPUs. Sverre Jarp, CERN openlab, IT Dept., CERN. Taipei, Monday 18 October 2010

  2. Contents
     - The hardware situation
     - Current software
     - Software prototypes
     - Some recommendations
     - Conclusions

  3. The hardware situation

  4. In the days of the Pentium
     - Life was really simple: basically two dimensions
       - The frequency of the pipeline
       - The number of boxes
     - The semiconductor industry increased the frequency
     - We acquired the right number of (single-socket) boxes
     (Diagram: Pipeline, Superscalar, Nodes, Sockets)

  5. Today: Seven dimensions of multiplicative performance
     - First three dimensions:
       - Pipelined execution units
       - Large superscalar design
       - Wide vector width (SIMD = Single Instruction Multiple Data)
     - Next dimension is a "pseudo" dimension:
       - Hardware multithreading
     - Last three dimensions:
       - Multiple cores
       - Multiple sockets
       - Multiple compute nodes
     (Diagram: Pipelining, Superscalar, Vector width, Multithreading, Multicore, Sockets, Nodes. A worked example of the multiplication follows below.)
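
A worked example of how these dimensions multiply, using illustrative hardware numbers (they are assumptions for this sketch, not figures taken from the slides):

```latex
% Illustrative node: 2 sockets, 6 cores per socket, 2 HW threads per core,
% 8-wide single-precision SIMD (256-bit AVX).
\text{logical CPU slots per node} = 2~\text{sockets} \times 6~\text{cores} \times 2~\text{HW threads} = 24
\qquad
\text{single-precision lanes in flight} = 24 \times 8~\text{SIMD lanes} = 192
```

Note that hardware multithreading is the "pseudo" dimension: it helps hide latency but does not add peak arithmetic throughput the way the other factors do.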

  6. Moore's law
     - We continue to double the number of transistors every other year
     - The consequences:
       - CPUs: single core, then multicore, then manycore
       - Vectors
       - Hardware threading
       - GPUs: huge number of FMA units
     - Today we commonly acquire chips with 1'000'000'000 transistors!
     (Graph adapted from Wikipedia)

  7. Real consequence of Moore's law
     - We are being "drowned" in transistors:
       - More (and more complex) execution units
       - Hundreds of new instructions
       - Longer SIMD vectors
       - Large number of cores
       - More hardware threading
     - In order to profit we need to "think parallel" (see the sketch below)
       - Data parallelism
       - Task parallelism
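
As a minimal illustration of the two kinds of parallelism (the function and variable names are invented for this sketch; it is not code from the talk):

```cpp
// Minimal sketch contrasting the two kinds of parallelism.
// Data parallelism: the same operation applied to many independent elements.
// Task parallelism: independent pieces of work running concurrently.
#include <cstdio>
#include <functional>
#include <future>
#include <vector>

double sum_of_squares(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x * x;   // same operation per element: data parallel
    return s;
}

int main() {
    std::vector<double> a(1000, 1.0), b(1000, 2.0);

    // Task parallelism: the two independent sums run as separate tasks.
    auto fa = std::async(std::launch::async, sum_of_squares, std::cref(a));
    auto fb = std::async(std::launch::async, sum_of_squares, std::cref(b));

    std::printf("total = %f\n", fa.get() + fb.get());
    return 0;
}
```

Data parallelism appears inside sum_of_squares (the same operation on every element, amenable to SIMD); task parallelism appears in running the two independent sums concurrently.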

  8. Four floating-point data flavours (256 bits)
     - Longer vectors: AVX (Advanced Vector eXtensions) is coming
       - As of next year, vectors will be 256 bits in length
       - Intel's "Sandy Bridge" first (others are coming, also from AMD)
     - Single precision: scalar single (SS, lane E0 only) and packed single (PS, 8 lanes E7..E0)
     - Double precision: scalar double (SD, lane E0 only) and packed double (PD, 4 lanes E3..E0)
     - Without vectors in our software, we will use 1/4 or 1/8 of the available execution width (illustrated below)
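
A minimal sketch of the packed-single flavour using the standard AVX intrinsics from <immintrin.h> (the function and array names are invented for this example; it assumes an AVX-capable compiler and CPU, e.g. building with -mavx):

```cpp
// Minimal AVX sketch: packed single (PS) adds 8 floats per instruction,
// whereas scalar single (SS) code processes one float at a time.
#include <immintrin.h>

void add_arrays(const float* a, const float* b, float* c, int n) {
    int i = 0;
    // Packed single: 8 x 32-bit lanes (E0..E7) per 256-bit register.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    // Scalar tail: uses only one lane of the execution width.
    for (; i < n; ++i) c[i] = a[i] + b[i];
}
```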

  9. The move to many-core systems
     - Examples of "CPU slots": Sockets * Cores * HW-threads
       - Basically what you observe in "cat /proc/cpuinfo"
     - Conservative:
       - Dual-socket AMD six-core (Istanbul): 2 * 6 * 1 = 12
       - Dual-socket Intel six-core (Westmere): 2 * 6 * 2 = 24
     - Aggressive:
       - Quad-socket AMD Magny-Cours (12-core): 4 * 12 * 1 = 48
       - Quad-socket Nehalem-EX "octo-core": 4 * 8 * 2 = 64
     - In the near future: hundreds of CPU slots!
       - Quad-socket Sun Niagara (T3) processors w/ 16 cores and 8 threads (each): 4 * 16 * 8 = 512
     - And, by the time new software is ready: thousands!
     (A query sketch follows below.)
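
A minimal sketch of querying the number of CPU slots the operating system exposes (assumes Linux/POSIX; this mirrors counting the processor entries in /proc/cpuinfo):

```cpp
// Minimal sketch: count the "CPU slots" (sockets x cores x HW threads) that the
// OS exposes as logical processors.
#include <cstdio>
#include <unistd.h>   // sysconf (POSIX)

int main() {
    long slots = sysconf(_SC_NPROCESSORS_ONLN);   // logical CPUs currently online
    std::printf("CPU slots visible to the OS: %ld\n", slots);
    return 0;
}
```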

  10. Accelerators (1): Intel MIC
      - Many Integrated Core architecture:
        - Announced at ISC10 (June 2010)
        - Based on the x86 architecture, 22 nm (in 2012?)
        - Many-core (> 50 cores) + 4-way multithreaded + 512-bit vector unit
        - Limited memory: a few gigabytes
      (Diagram: ring of in-order, 4-thread, SIMD-16 cores with L1 I$/D$, shared L2 cache, memory controllers, fixed-function, display, texture and system-interface logic)

  11. Accelerators (2): Nvidia Fermi GPU
      - Streaming Multiprocessor (SM) architecture
        - 32 "CUDA cores" per SM (512 total)
        - Peak single-precision floating-point performance (at 1.15 GHz): above 1 Tflop
        - Double precision: 50% of that
        - Dual thread scheduler
        - 64 KB of RAM for shared memory and L1 cache (configurable)
      - A few gigabytes of main memory
      - Lots of interest in the HEP on-line community
      (Diagram of one SM: instruction cache, two warp schedulers with dispatch units, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable shared memory/L1 cache, uniform cache. Adapted from Nvidia)

  12. Current software

  13. SW performance: A complicated story!
      - We start with a concrete, real-life problem to solve
        - For instance, simulate the passage of elementary particles through matter
      - We write programs in high-level languages
        - C++, JAVA, Python, etc.
      - A compiler (or an interpreter) transforms the high-level code to machine-level code
      - We link in external libraries
      - A sophisticated processor with a complex architecture and even more complex micro-architecture executes the code
      - In most cases, we have little clue as to the efficiency of this transformation process

  14. We need forward scalability
      - Not only should a program be written in such a way that it extracts maximum performance from today's hardware
      - On future processors, performance should scale automatically
        - In the worst case, one would have to recompile or relink
      - Additional CPU/GPU hardware, be it cores/threads or vectors, would automatically be put to good use
      - Scaling would be as expected:
        - If the number of cores (or the vector size) doubled, scaling would be close to 2x, but certainly not just a few percent
      - We cannot afford to "rewrite" our software for every hardware change! (See the sketch below.)
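
One minimal sketch of the idea, assuming OpenMP as the threading layer (the function and variable names are invented for this example): the source never hard-wires a core count, so the same code should spread the loop over however many cores the runtime finds, today or on a future machine, at worst after a recompile.

```cpp
// Minimal forward-scalability sketch using OpenMP (compile with e.g. -fopenmp).
// The code does not fix a thread or core count: the runtime picks up the number
// of available hardware threads, so additional cores on a future machine are
// used without rewriting the loop.
#include <vector>

void scale_energies(std::vector<double>& e, double factor) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(e.size()); ++i) {
        e[i] *= factor;   // independent iterations: safe to run in parallel
    }
}
```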

  15. Concurrency in HEP
      - We are "blessed" with lots of it:
        - Entire events
        - Particles, hits, tracks and vertices
        - Physics processes
        - I/O streams (ROOT trees, branches)
        - Buffer manipulations (also data compaction, etc.)
        - Fitting variables
        - Partial sums, partial histograms
        - and many others ...
      - Usable for both data and task parallelism! (See the partial-histogram sketch below.)
      - But, fine-grained parallelism is not well exposed in today's software frameworks
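
To make the "partial sums, partial histograms" item concrete, a minimal C++ sketch (the types and function names are invented for this illustration and are not taken from any HEP framework): each thread fills its own partial histogram over a slice of the data, and the partial results are merged at the end.

```cpp
// Minimal sketch of per-thread partial histograms merged at the end.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Histogram = std::vector<long>;   // simple fixed-binning histogram

void fill_partial(const std::vector<double>& data, std::size_t begin,
                  std::size_t end, double lo, double hi, Histogram& h) {
    const double width = (hi - lo) / h.size();
    for (std::size_t i = begin; i < end; ++i) {
        if (data[i] < lo || data[i] >= hi) continue;      // ignore under/overflow
        std::size_t bin = static_cast<std::size_t>((data[i] - lo) / width);
        ++h[std::min(bin, h.size() - 1)];                 // guard against rounding
    }
}

Histogram parallel_histogram(const std::vector<double>& data,
                             std::size_t nbins, double lo, double hi) {
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<Histogram> partial(nthreads, Histogram(nbins, 0));
    std::vector<std::thread> workers;

    const std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end   = std::min(data.size(), begin + chunk);
        workers.emplace_back(fill_partial, std::cref(data), begin, end,
                             lo, hi, std::ref(partial[t]));
    }
    for (auto& w : workers) w.join();

    Histogram total(nbins, 0);                            // merge partial results
    for (const auto& h : partial)
        for (std::size_t b = 0; b < nbins; ++b) total[b] += h[b];
    return total;
}
```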
