When Multicore Isn't Enough: Trends and the Future for Multi-Multicore Systems (PowerPoint presentation transcript)


  1. When Multicore Isn't Enough: Trends and the Future for Multi-Multicore Systems. Matt Reilly, Chief Engineer, SiCortex, Inc. Monday, September 22, 2008.

  2. The Computational Model. For a large set of interesting problems (N is the number of independent processes): T_sol = T_arith/N + T_mem/N + T_IO + f(N)·T_comm, or T_sol = max(T_arith/N, T_mem/N, T_IO, f(N)·T_comm). For many interesting tasks, single-chip performance is determined entirely by T_mem and memory bandwidth.
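
  The two forms of the model are easy to play with numerically. Below is a minimal C sketch that evaluates both the additive and the bottleneck (max) form as N grows; the component times and the log2(N) choice for f(N) are illustrative assumptions, not SiCortex data.

      /* Toy evaluation of the deck's solution-time model.
       * All component times are invented for illustration. */
      #include <stdio.h>
      #include <math.h>

      /* f(N): communication growth factor. log2(N) is a common
       * assumption for tree-structured collectives, not a claim
       * taken from the slides. */
      static double f(double n) { return log2(n); }

      int main(void) {
          const double t_arith = 1000.0; /* total arithmetic time, arbitrary units */
          const double t_mem   = 2000.0; /* total memory time */
          const double t_io    = 5.0;    /* I/O time, not divided by N */
          const double t_comm  = 2.0;    /* per-step communication time */

          for (int n = 1; n <= 4096; n *= 4) {
              double add = t_arith / n + t_mem / n + t_io + f(n) * t_comm;
              double max = fmax(fmax(t_arith / n, t_mem / n),
                                fmax(t_io, f(n) * t_comm));
              printf("N=%5d  additive=%8.2f  bottleneck=%8.2f\n", n, add, max);
          }
          return 0;
      }

  Both forms flatten once T_IO and f(N)·T_comm dominate, which is the deck's point: past that knee, adding processors stops helping.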

  3. Why Multicore? We don't get faster cores as often as we get more of them. Source: SPEC2000 FP reports, http://www.spec.org/cpu/results/cpu2000.html

  4. Compute Node Design: A Memory Game. T_arith is becoming irrelevant (because N is getting large). The design of the compute node is all about maximizing usable bandwidth between the compute elements and a large block of memory: multicore, GPGPU, or hybrid scalar/vector (e.g. Cell). The architecture choice drives the programming model, but all are otherwise interchangeable.

  5. FFT Kernel. If arithmetic is free, but pins are limited...

  6. Stencil (Convolution) Kernel. 0.4 bytes/FLOP? Then the processor spends half its time waiting.
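
  The half-time-waiting claim falls out of a simple balance argument. The sketch below is a roofline-style estimate with assumed machine numbers (they are not the SiCortex figures): if a kernel demands more bytes per flop than the machine can supply, the extra memory time shows up as stall time.

      /* Roofline-style stall estimate. Machine numbers are assumptions
       * chosen so that demand is exactly twice supply. */
      #include <stdio.h>

      int main(void) {
          const double peak_gflops = 6.0; /* assumed node peak, GF/s */
          const double mem_gbps    = 1.2; /* assumed sustained memory BW, GB/s */
          const double need        = 0.4; /* kernel demand, bytes per flop */

          double supply  = mem_gbps / peak_gflops; /* bytes/flop the machine feeds */
          double t_flops = 1.0 / peak_gflops;      /* seconds per Gflop of arithmetic */
          double t_mem   = need / mem_gbps;        /* seconds to move 1 Gflop's data */
          double wait    = (t_mem > t_flops) ? 1.0 - t_flops / t_mem : 0.0;

          printf("machine supplies %.2f B/flop, kernel wants %.2f B/flop\n",
                 supply, need);
          printf("fraction of time stalled on memory: %.0f%%\n", 100 * wait);
          return 0;
      }

  With demand at twice supply, the arithmetic units are busy only half the time, matching the slide's claim.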

  7. Alternatives.
     - Many fast cores on one die: requires commensurate memory ports, hence a high pin count; high pin count plus high processor count means a large die.
     - A few fast cores on one die: better T_arith : T_mem balance; smaller die.
     - A few moderate cores on one die: balances T_arith : T_mem : T_comm and leaves pins to spend on other features.

  8. Cubic Domain Decomposition. A simple 7x7x7 "Jax" stencil operator over a large volume (1K^3, single precision): 19 flops per point.
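
  For concreteness, here is what one pass of a stencil sweep looks like in C. This is a simplified 7-point nearest-neighbor stencil, not the deck's 7x7x7 operator, and the coefficients are placeholders; it only illustrates the memory-access pattern the decomposition argument is about.

      /* One Jacobi-style pass of a 7-point stencil over an n^3 volume.
       * Simplified stand-in for the deck's 7x7x7 operator; weights are
       * placeholders. Single precision, matching the slide. */
      #include <stdlib.h>

      #define IDX(i, j, k, n) (((size_t)(i) * (n) + (j)) * (n) + (k))

      static void sweep(const float *in, float *out, int n) {
          const float c0 = 0.5f, c1 = 1.0f / 12.0f; /* placeholder weights */
          for (int i = 1; i < n - 1; i++)
              for (int j = 1; j < n - 1; j++)
                  for (int k = 1; k < n - 1; k++)
                      out[IDX(i, j, k, n)] =
                          c0 * in[IDX(i, j, k, n)] +
                          c1 * (in[IDX(i - 1, j, k, n)] + in[IDX(i + 1, j, k, n)] +
                                in[IDX(i, j - 1, k, n)] + in[IDX(i, j + 1, k, n)] +
                                in[IDX(i, j, k - 1, n)] + in[IDX(i, j, k + 1, n)]);
      }

      int main(void) {
          int n = 64; /* small demo volume */
          float *a = calloc((size_t)n * n * n, sizeof *a);
          float *b = calloc((size_t)n * n * n, sizeof *b);
          a[IDX(n / 2, n / 2, n / 2, n)] = 1.0f; /* point source */
          sweep(a, b, n);
          free(a);
          free(b);
          return 0;
      }

  Each output point touches several inputs but does only a handful of flops, which is why bytes per flop, not peak arithmetic, sets the pace.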

  9. Cubic Domain Decomposition. Set a goal of completing a pass in 1 ms. Faster processors complete larger chunks of the total volume.

  10. Cubic Domain Decomposition. Factor in T_comm and we find that a 200 MB/s per-node link forces a chunk size of 50^3.
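
  The 50^3 figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a six-face halo exchange of depth 3 (what a 7x7x7 stencil reaches) in single precision; those assumptions are mine, but they land close to the slide's number.

      /* Halo-exchange budget check against the 1 ms per-pass goal.
       * Assumptions: 6-face exchange, halo depth 3 (a 7^3 stencil
       * reaches 3 points out), 4-byte values, 200 MB/s link. */
      #include <stdio.h>

      int main(void) {
          const double link_bw = 200e6; /* bytes/s per node link */
          const double budget  = 1e-3;  /* seconds per pass */
          const int depth = 3, bytes = 4;

          for (int c = 30; c <= 80; c += 10) {
              double halo = 6.0 * depth * (double)c * c * bytes; /* bytes moved */
              double t = halo / link_bw;
              printf("chunk %2d^3: halo %7.0f KB, exchange %5.2f ms %s\n",
                     c, halo / 1e3, t * 1e3, t <= budget ? "(fits)" : "(too slow)");
          }
          return 0;
      }

  Around 50^3 the exchange alone consumes the 1 ms budget, so the link, not the processor, caps the chunk size.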

  11. Cubic Domain Decomposition. If the goal is "time per step," computation speed may not matter. GPUs, FPGAs, and magic dust don't help.

  12. The Systems. A family: from production systems to personal development workstations.

  13. The SC5832. 5832 processors; 7.7 TB memory; more than 200 FC I/O channels; a single Linux system; 16 kW; cool and reliable.

  14. SiCortex in the Technical Computing Ecosystem. Affordable, easy to install, easy to maintain. Development platforms for high-processor-count applications, with a rich cluster/MPI development environment; systems from 72 to 5832 processors. Production platforms in target application areas: multidimensional FFT, large matrix problems, sorting/searching.

  15. The SiCortex Node Chip. A six-way Linux SMP on one chip, with two DDR-2 ports, PCI Express, a DMA message controller, and a fabric switch:
     - Six 64-bit MIPS CPUs, 500 MHz, 1 GF/s double precision each
     - 32+32 KB L1 I/D cache per CPU; 256 KB L2 cache with ECC
     - Two DDR-2 controllers driving 2 x 4 GB DDR-2 DIMMs
     - 8-lane PCI Express for external I/O
     - DMA engine, fabric switch, and SERDES links to/from other nodes, 1.6 GB/s per link
     - 1152-pin BGA, 170 sq mm, 90 nm

  16. The SiCortex Module. Compute: 162 GF/s. Memory bandwidth: 345 GB/s. Fabric bandwidth: 78 GB/s. I/O bandwidth: 7.5 GB/s. Power: 500 W. [figure: module photo with callouts for PCI Express I/O, memory, fabric interconnect, and everything else]

  17. The SiCortex System. 36 modules with midplane interconnect, I/O, cable management, fan tray, power distribution, and a system service processor.

  18. The Kautz Graph. Logarithmic diameter; reconfigures around failures; low contention; very fast collectives.
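
  For readers unfamiliar with Kautz graphs: vertices are strings over an alphabet in which consecutive symbols differ, and each vertex has an edge to every one-symbol left shift of itself. The sketch below enumerates the out-neighbors of a vertex in a degree-3 Kautz graph (4-symbol alphabet), the family the SiCortex fabric is drawn from; the label length and demo vertex are arbitrary choices here, not the machine's actual parameters.

      /* Out-neighbors in a degree-3 Kautz graph. A vertex is a string
       * s1..sL over {0,1,2,3} with s_i != s_{i+1}; its successors are
       * s2..sL,x for the three symbols x != sL. */
      #include <stdio.h>
      #include <string.h>

      #define L 5 /* label length: 4 * 3^(L-1) = 324 vertices (demo size) */

      static void neighbors(const int v[L], int out[3][L]) {
          int n = 0;
          for (int x = 0; x < 4; x++) {
              if (x == v[L - 1]) continue; /* consecutive symbols must differ */
              memmove(out[n], v + 1, (L - 1) * sizeof(int)); /* shift left */
              out[n][L - 1] = x;
              n++;
          }
      }

      int main(void) {
          int v[L] = {0, 1, 2, 3, 0}; /* arbitrary valid vertex */
          int nb[3][L];
          neighbors(v, nb);
          for (int i = 0; i < 3; i++) {
              for (int j = 0; j < L; j++) printf("%d", nb[i][j]);
              printf("\n");
          }
          return 0;
      }

  With degree 3, any of the N vertices is reachable in roughly log3(N) hops, which is where the slide's "logarithmic diameter" comes from, and the shift structure leaves alternate routes available when a link fails.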

  19. Thirty-Six-Node Kautz Graph. [figure: 36 nodes, numbered 0 through 35, arranged in a ring with Kautz edges] A pattern is developing.

  20. Integrated HPC Linux Environment.
     - Operating system: Linux kernel and utilities (2.6.18+); Lustre cluster file system
     - Development environment: GNU C and C++; PathScale C, C++, and Fortran; Gentoo Linux userland; math libraries; performance tools; debugger (TotalView); MPI libraries (MPICH2)
     - System management: scheduler (SLURM), partitioning, MPI monitoring, console, boot, diagnostics
     - Maintenance and support: factory-installed software, regular updates, open-source build environment

  21. Tuning Tools. Serial code (hpcex); communication (mpiex); I/O (ioex); system (oprofile); hardware counters (papiex); visualization (tau, vampir).

  22. Parallel File System. The Lustre parallel file system: open source and POSIX compliant. The native implementation uses the DMA engine primitives and scales up to hundreds of I/O nodes.

  23. FabriCache. A RAM-backed file system based on Lustre: all data is stored in Object Storage Server (OSS) RAM and presented as a file system, scalable to 972 OSS nodes. Similar to an SSD, but with higher bandwidth and lower latency, no external hardware required, and easier creation and removal of volumes. Useful for intermediate results, shared pools of data, and staging data to and from disk.

  24. MicroBenchmarks and Kernels.
     - MPI latency: 1.4 µs; MPI bandwidth: 1.5 GB/s
     - HPC Challenge work underway. SC5832, on 5772 CPUs:
       - DGEMM: 72% of peak
       - HPL: 3.6 TF (83% of DGEMM)
       - PTRANS: 210 GB/s
       - STREAM: 345 MB/s per CPU (1.9 TB/s aggregate)
       - FFT: 174 GF
       - RandomRing: 4 µs, 50 MB/s
       - RandomAccess: 0.74 GUPS (5.5 optimized)

  25. Zero-Contention Message Bandwidth? There is an interesting relationship between message size and bandwidth.
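
  The shape of that curve is what a simple latency/bandwidth model predicts. The sketch below plugs the deck's measured numbers (1.4 µs latency, 1.5 GB/s peak) into the standard T = latency + size/bandwidth model; treating the whole startup cost as one fixed latency is my simplification.

      /* Effective bandwidth vs message size under T = lat + size/bw,
       * using the deck's measured 1.4 us latency and 1.5 GB/s peak. */
      #include <stdio.h>

      int main(void) {
          const double lat = 1.4e-6, bw = 1.5e9;
          for (double s = 64; s <= 4 * 1024 * 1024; s *= 8) {
              double t = lat + s / bw;
              printf("%8.0f B: %6.3f GB/s effective\n", s, s / t / 1e9);
          }
          return 0;
      }

  Half of peak arrives at the classic half-power point, size = latency x bandwidth, about 2 KB here; below that, latency dominates.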

  26. Communication in "Real World" Conditions. Contention matters. (For more, see Abhinav Bhatele's work at http://charm.cs.uiuc.edu/.)

  27. What About Collectives? Dependence on vector size is predictable.

  28. What Can It Do? The machine shines on problems that require lots of communication between processes: TeraByte Sort, three-dimensional FFT, huge systems of equations.

  29. TeraByte Sort. Sort 10 billion 100-byte records (10-byte key). Leave out the I/O (so this isn't quite the Indy TeraSort benchmark). Use 5600 processors. Key T_comm attributes: time to exchange all 1 TB is about 4 s; time to copy each processor's sublist is about 1 s; a global AllReduce for a 256 KB vector is O(10 ms).
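
  A minimal MPI sample-sort sketch of those phases: local sort, splitter selection, bucket counting, and the all-to-all exchange. Integer keys stand in for the benchmark's 100-byte records, and the one-sample-per-rank splitter choice is a simplification, not the bucket-assignment scheme the deck tuned.

      /* Sample-sort sketch: local sort, splitters, bucket counts,
       * exchange. Int keys stand in for 100-byte records; no I/O,
       * as in the slide. Error handling omitted for brevity. */
      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>

      static int cmp(const void *a, const void *b) {
          int x = *(const int *)a, y = *(const int *)b;
          return (x > y) - (x < y);
      }

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int rank, np;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &np);

          int n = 1 << 16; /* toy size per rank */
          int *keys = malloc(n * sizeof *keys);
          srand(rank + 1);
          for (int i = 0; i < n; i++) keys[i] = rand();
          qsort(keys, n, sizeof *keys, cmp); /* phase 1: local sort */

          /* Phase 2: one splitter sample per rank (simplistic). */
          int *spl = malloc(np * sizeof *spl);
          MPI_Allgather(&keys[n / 2], 1, MPI_INT, spl, 1, MPI_INT,
                        MPI_COMM_WORLD);
          qsort(spl, np, sizeof *spl, cmp);

          /* Phase 3: count keys destined for each bucket (rank). */
          int *sc = calloc(np, sizeof *sc), *sd = calloc(np, sizeof *sd);
          int *rc = calloc(np, sizeof *rc), *rd = calloc(np, sizeof *rd);
          for (int i = 0, b = 0; i < n; i++) {
              while (b < np - 1 && keys[i] >= spl[b + 1]) b++;
              sc[b]++;
          }
          for (int p = 1; p < np; p++) sd[p] = sd[p - 1] + sc[p - 1];
          MPI_Alltoall(sc, 1, MPI_INT, rc, 1, MPI_INT, MPI_COMM_WORLD);
          for (int p = 1; p < np; p++) rd[p] = rd[p - 1] + rc[p - 1];
          int total = rd[np - 1] + rc[np - 1];

          /* Phase 4: the 1 TB exchange, in miniature. */
          int *mine = malloc(total * sizeof *mine);
          MPI_Alltoallv(keys, sc, sd, MPI_INT, mine, rc, rd, MPI_INT,
                        MPI_COMM_WORLD);
          qsort(mine, total, sizeof *mine, cmp); /* final local sort */

          printf("rank %d holds %d keys\n", rank, total);
          MPI_Finalize();
          return 0;
      }

  The exchange phase is the MPI_Alltoallv call, which is exactly the 1 TB, roughly 4-second step the slide's T_comm budget is built around.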

  30. Tuning... Improved QSort to the model target. Bucket assignment is still very slow; the exchange is still a little slow. We can do better...

  31. Three-Dimensional FFT. A 3D FFT of a billion-point volume using PFAFFT (prime factor algorithm FFT): complex-to-complex, single precision, 1040 x 1040 x 1040. Two target platforms: SC072 (72 processors) and SC1458 (1458 processors).
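
  The deck used PFAFFT, but the next slide notes FFTW3 now produces comparable results, so here is a minimal single-node FFTW3 version of the same operation: a single-precision complex-to-complex 3D transform. The problem is shrunk to fit one node; the full 1040^3 volume is the distributed case the slides measure.

      /* Single-precision complex 3D FFT with FFTW3
       * (compile with -lfftw3f -lm). A 64^3 stand-in for
       * the deck's distributed 1040^3 transform. */
      #include <fftw3.h>
      #include <stdlib.h>

      int main(void) {
          const int n = 64; /* 1040 in the deck's runs */
          size_t total = (size_t)n * n * n;
          fftwf_complex *buf = fftwf_malloc(total * sizeof *buf);

          for (size_t i = 0; i < total; i++) { /* arbitrary test signal */
              buf[i][0] = (float)rand() / RAND_MAX;
              buf[i][1] = 0.0f;
          }

          /* Plan and execute an in-place forward transform. */
          fftwf_plan p = fftwf_plan_dft_3d(n, n, n, buf, buf,
                                           FFTW_FORWARD, FFTW_ESTIMATE);
          fftwf_execute(p);

          fftwf_destroy_plan(p);
          fftwf_free(buf);
          return 0;
      }

  A distributed run decomposes the volume into pencils and interleaves 1D and 2D transforms with transposes, which is exactly the phase breakdown in the results slide that follows.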

  32. Results!
     - 65-processor 3D FFT: 1D FFT 2.91, 2D FFT 6.37, 3D transpose 1.96
     - 1040-processor 3D FFT: 1D FFT 0.18, 2D FFT 0.40, 3D transpose 0.25
     FFTW3 is now producing comparable results.

  33. Product Directions. Revere the model: T_sol = T_arith/N + T_mem/N + T_IO + f(N)·T_comm. The first generation emphasized T_comm and T_IO. The second generation is aimed at T_mem and T_arith, while taking advantage of technology improvements for T_comm and T_IO: more performance per watt, cubic foot, and dollar; a richer I/O infrastructure; "special purpose" configurations.

  34. Take-Away. SiCortex builds Linux clusters with purpose-built components, optimized for high-communication applications: high-processor-count computing.
