When Multicore Isn’t Enough: Trends and the Future for Multi-Multicore Systems
Matt Reilly, Chief Engineer, SiCortex, Inc.
The Computational Model
For a large set of interesting problems (N is the number of independent processes):
T_sol = T_arith/N + T_mem/N + T_IO + f(N)·T_comm
or
T_sol = MAX(T_arith/N, T_mem/N, T_IO, f(N)·T_comm)
For many interesting tasks, single-chip performance is determined entirely by T_mem and memory bandwidth.
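As a concrete illustration of how the additive form behaves, here is a tiny sketch; the per-term constants are invented placeholders (not measurements from the talk) and f(N) is assumed to be log2(N), a common shape for tree-based collectives.

```c
/* Evaluate T_sol = T_arith/N + T_mem/N + T_IO + f(N)*T_comm for growing N.
 * The constants below are placeholders, not measurements; f(N) is taken
 * as log2(N).  Build with: cc model.c -lm */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double t_arith = 1000.0;   /* total arithmetic time, seconds      */
    const double t_mem   = 2000.0;   /* total memory time, seconds          */
    const double t_io    = 5.0;      /* I/O time, not divided by N          */
    const double t_comm  = 0.5;      /* per-exchange communication time, s  */

    for (int n = 1; n <= 4096; n *= 4) {
        double f_n   = log2((double)n);               /* f(1) = 0: no comm */
        double t_sol = t_arith / n + t_mem / n + t_io + f_n * t_comm;
        printf("N=%5d  T_sol=%8.2f s\n", n, t_sol);
    }
    return 0;
}
```

As N grows, the divided terms shrink and the run time flattens out against T_IO and f(N)·T_comm, which is the point of the model.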
Why Multicore?
We don’t get faster cores as often as we get more of them.
(Source: SPEC2000 FP results, http://www.spec.org/cpu/results/cpu2000.html)
Compute Node Design: A Memory Game
T_arith is becoming irrelevant (because N is getting large).
The design of the compute node is all about maximizing usable bandwidth between the compute elements and a large block of memory:
• Multicore
• GPGPU
• Hybrid scalar/vector (e.g. Cell)
The architecture choice drives the programming model, but all are otherwise interchangeable.
FFT Kernel
If arithmetic is free, but pins are limited...
Stencil (Convolution) Kernel
0.4 bytes/FLOP? Then the processor spends half its time waiting.
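A back-of-the-envelope version of that claim, hedged: the kernel demand of 0.8 bytes/FLOP below is an illustrative assumption (the slide only gives the machine's 0.4 bytes/FLOP), chosen so the arithmetic works out to the stated factor of two.

```latex
% Roofline-style bound; the kernel demand of 0.8 bytes/FLOP is an assumption.
\[
  T_{\text{step}} = \max\!\left(\frac{F}{R_{\text{peak}}},\;
                                \frac{B}{BW_{\text{mem}}}\right),
  \qquad
  \frac{B}{F} = 0.8\ \mathrm{B/FLOP},
  \quad
  \frac{BW_{\text{mem}}}{R_{\text{peak}}} = 0.4\ \mathrm{B/FLOP}
\]
\[
  \Rightarrow\ \frac{B}{BW_{\text{mem}}}
   = \frac{0.8\,F}{0.4\,R_{\text{peak}}}
   = 2\,\frac{F}{R_{\text{peak}}}
  \quad\Longrightarrow\quad
  \text{the arithmetic units are busy only half the time.}
\]
```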
Alternatives
Many fast cores on one die:
• Requires commensurate memory ports: high pin count
• High pin count and high processor count: large die
A few fast cores on one die:
• Better balance of T_arith : T_mem
• Smaller die
A few moderate cores on one die:
• Balance T_arith : T_mem : T_comm
• Spend pins on other features
Cubic Domain Decomposition
Simple 7x7x7 “Jax” stencil operator over a large volume (1K³, single precision): 19 flops per point.
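The slides do not reproduce the "Jax" operator itself, so the sketch below substitutes a generic 3D 7-point update (a stand-in, not the 19-flop kernel measured in the talk) just to show the access pattern, one write and a handful of neighboring reads per point, that makes such sweeps memory-bound.

```c
/* Generic 3D 7-point stencil sweep over an n^3 single-precision volume.
 * A stand-in for the talk's 7x7x7 "Jax" operator: the point is the access
 * pattern, not the exact flop count. */
#include <stdio.h>
#include <stdlib.h>

#define IDX(i, j, k, n) ((size_t)(i) * (n) * (n) + (size_t)(j) * (n) + (k))

static void sweep(const float *in, float *out, int n, const float c[4]) {
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            for (int k = 1; k < n - 1; k++)
                out[IDX(i, j, k, n)] =
                    c[0] *  in[IDX(i, j, k, n)] +
                    c[1] * (in[IDX(i - 1, j, k, n)] + in[IDX(i + 1, j, k, n)]) +
                    c[2] * (in[IDX(i, j - 1, k, n)] + in[IDX(i, j + 1, k, n)]) +
                    c[3] * (in[IDX(i, j, k - 1, n)] + in[IDX(i, j, k + 1, n)]);
}

int main(void) {
    const int n = 256;                       /* small test volume */
    const float c[4] = {0.5f, 0.1f, 0.1f, 0.05f};
    float *a = calloc((size_t)n * n * n, sizeof(float));
    float *b = calloc((size_t)n * n * n, sizeof(float));
    a[IDX(n / 2, n / 2, n / 2, n)] = 1.0f;   /* point source */
    sweep(a, b, n, c);
    printf("center after one sweep: %g\n", b[IDX(n / 2, n / 2, n / 2, n)]);
    free(a);
    free(b);
    return 0;
}
```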
Cubic Domain Decomposition
Set a goal of completing a pass in 1 ms. Faster processors complete larger chunks of the total volume.
Cubic Domain Decomposition
Factor in T_comm and we find that a 200 MB/s per-node link forces a chunk size of 50³.
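One way to recover the 50³ figure is a halo-exchange budget. Hedged: the halo depth of 3 (three ghost layers per face for a 7-wide stencil) and the face-only exchange are assumptions, not numbers given in the talk.

```latex
% Halo bytes per step for a c^3 single-precision chunk with halo depth h
% (assumed h = 3, face exchanges only):
\[
  B_{\text{halo}} \approx 6\,c^{2} h \times 4\ \mathrm{bytes}
  \;=\; 6 \times 50^{2} \times 3 \times 4 \;=\; 180\,000\ \mathrm{bytes}
\]
\[
  T_{\text{exch}} \;=\; \frac{180\,000\ \mathrm{B}}{200\ \mathrm{MB/s}}
  \;\approx\; 0.9\ \mathrm{ms}
\]
% which just fits the 1 ms per-step budget; a noticeably larger chunk would not.
```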
Cubic Domain Decomposition
If the goal is “time per step,” computation speed may not matter. GPUs, FPGAs, and Magic Dust don’t help.
The Systems
A Family: From Production Systems to Personal Development Workstations
The SC5832
5832 processors
7.7 TB memory
> 200 FC I/O channels
Single Linux system
16 kW: cool and reliable
SiCortex in the Technical Computing Ecosystem
Affordable, easy to install, easy to maintain
Development platforms for high-processor-count applications:
• Rich cluster/MPI development environment
• Systems from 72 to 5832 processors
Production platforms in target application areas:
• Multidimensional FFT
• Large matrix
• Sorting/searching
The SiCortex Node Chip
Six 64-bit MIPS CPUs: 500 MHz, 1 GF/s double precision each
32+32 KB L1 I/D cache and 256 KB L2 cache (ECC) per CPU
1152-pin BGA, 170 sq mm, 90 nm
Two DDR-2 controllers: 2 x 4 GB DDR-2 DIMMs
8-lane PCI Express for external I/O
DMA engine and fabric switch (SERDES links, 1.6 GB/s per link) to/from other nodes
Six-way Linux SMP with 2 DDR ports, PCI Express, message controller, and fabric switch
The SiCortex Module
(Module diagram: memory, PCI Express I/O, fabric interconnect, everything else.)
Compute: 162 GF/s
Memory b/w: 345 GB/s
Fabric b/w: 78 GB/s
I/O b/w: 7.5 GB/s
Power: 500 watts
The SiCortex System
36 modules with midplane interconnect
I/O cable management
Fan tray
Power distribution and system service processor
The Kautz Graph
Logarithmic diameter
Reconfigure around failures
Low contention
Very fast collectives
Thirty-Six Node Kautz Graph
(Diagram: 36 nodes, numbered 0-35, arranged in a ring with Kautz-graph links.)
A pattern is developing.
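For the curious, the 36-node topology is the degree-3 Kautz graph on words of length 3 over a 4-symbol alphabet (no two adjacent symbols equal). The sketch below is an illustration only, not SiCortex's routing code; it enumerates the 36 vertices and their 108 directed links.

```c
/* Enumerate the vertices and out-edges of the degree-3 Kautz graph with
 * 36 vertices: length-3 words over {0,1,2,3} with no two adjacent symbols
 * equal (4*3*3 = 36 vertices, out-degree 3).  Illustrative only. */
#include <stdio.h>

static int is_vertex(int a, int b, int c) { return a != b && b != c; }

/* Map a word (a,b,c) to a dense index 0..35 by enumeration order. */
static int index_of(int a, int b, int c) {
    int idx = 0;
    for (int x = 0; x < 4; x++)
        for (int y = 0; y < 4; y++)
            for (int z = 0; z < 4; z++)
                if (is_vertex(x, y, z)) {
                    if (x == a && y == b && z == c) return idx;
                    idx++;
                }
    return -1;
}

int main(void) {
    int count = 0;
    for (int a = 0; a < 4; a++)
        for (int b = 0; b < 4; b++)
            for (int c = 0; c < 4; c++) {
                if (!is_vertex(a, b, c)) continue;
                int u = index_of(a, b, c);
                count++;
                /* Edges: (a,b,c) -> (b,c,d) for every d != c. */
                for (int d = 0; d < 4; d++) {
                    if (d == c) continue;
                    printf("%2d -> %2d\n", u, index_of(b, c, d));
                }
            }
    printf("vertices: %d\n", count);   /* prints 36 */
    return 0;
}
```

The same construction with longer words gives the 972-node graph used by the full SC5832 fabric.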
Integrated HPC Linux Environment
Operating System: Linux kernel and utilities (2.6.18+); Gentoo-based Linux; cluster file system (Lustre)
Development Environment: GNU C, C++; PathScale C, C++, Fortran; math libraries; performance tools; debugger (TotalView); MPI libraries (MPICH2)
System Management: scheduler (SLURM); partitioning; monitoring; console, boot, diagnostics
Maintenance and Support: factory-installed software; regular updates; open-source build environment
Tuning Tools
Serial code (hpcex)
Communication (mpiex)
I/O (ioex)
System (oprofile)
Hardware (papiex)
Visualization (TAU, Vampir)
Parallel File System
Lustre parallel file system: open source, POSIX compliant
Native implementation uses DMA engine primitives
Scalable up to hundreds of I/O nodes
FabriCache: RAM-Backed File System
Based on the Lustre file system: stores all data in Object Storage Server (OSS) RAM and presents it as a file system; scalable to 972 OSS nodes
Similar to an SSD, but...
• Higher bandwidth / lower latency
• No external hardware required
• Creating/removing volumes is easier
Useful for...
• Intermediate results
• Shared pools of data
• Staging data to/from disk
Microbenchmarks and Kernels
• MPI latency: 1.4 µs
• MPI bandwidth: 1.5 GB/s
• HPC Challenge work underway
• SC5832, on 5772 CPUs:
– DGEMM: 72%
– HPL: 3.6 TF (83% of DGEMM)
– PTRANS: 210 GB/s
– STREAM: 345 MB/s (1.9 TB/s aggregate)
– FFT: 174 GF
– RandomRing: 4 µs, 50 MB/s
– RandomAccess: 0.74 GUPS (5.5 optimized)
Zero-Contention Message Bandwidth?
Interesting relationship between message size and bandwidth.
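A minimal ping-pong sketch in the spirit of these measurements (not the actual benchmark code behind the numbers above): it prints one-way latency and bandwidth as the message size grows. Run it on at least two ranks.

```c
/* Minimal MPI ping-pong: one-way latency and bandwidth vs. message size
 * between ranks 0 and 1.  Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100;
    char *buf = malloc(1 << 22);                 /* up to 4 MB messages */

    for (int bytes = 8; bytes <= (1 << 22); bytes *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * iters);   /* one-way time */
        if (rank == 0)
            printf("%8d bytes  %8.2f us  %8.2f MB/s\n",
                   bytes, t * 1e6, bytes / t / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```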
Communication in “Real World” Conditions
Contention matters. (For more, see Abhinav Bhatele’s work at http://charm.cs.uiuc.edu/ .)
What About Collectives?
Dependence on vector size is predictable.
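A hedged sketch of how such a curve can be measured, timing MPI_Allreduce over a range of vector lengths; it is illustrative only and says nothing about the specific collective implementation on the fabric.

```c
/* Time MPI_Allreduce across a range of vector sizes.  Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 50;
    const int max_elems = 1 << 20;                 /* up to 1M doubles */
    double *in  = calloc(max_elems, sizeof(double));
    double *out = calloc(max_elems, sizeof(double));

    for (int n = 1; n <= max_elems; n *= 4) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / iters;
        if (rank == 0)
            printf("%8d doubles  %10.2f us\n", n, t * 1e6);
    }
    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}
```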
What Can It Do?
The machine shines on problems that require lots of communication between processes:
• TeraByte Sort
• Three-dimensional FFT
• Huge systems of equations
TeraByte Sort
Sort 10 billion 100-byte records (10-byte key).
Leave out the I/O (this isn’t quite the Indy TeraSort benchmark).
Use 5600 processors.
Key T_comm attributes:
• Time to exchange all 1 TB is about 4 sec, +/-
• Time to copy each processor’s sublist is about 1 sec, +/-
• Global AllReduce for a 256 KB vector is O(10 ms)
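A skeleton of the bucket-exchange structure these timings describe, hedged heavily: it sorts 64-bit integer keys assumed to be uniformly distributed, whereas the real benchmark moves 100-byte records with 10-byte keys and chooses splitters from the data.

```c
/* Bucket-exchange parallel sort skeleton: each rank buckets its local keys
 * by destination, exchanges buckets with MPI_Alltoallv, then sorts what it
 * received.  Simplified to uniform 64-bit keys; illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long u64;

static u64 rng_state;
static u64 xorshift64(void) {              /* cheap full-range 64-bit PRNG */
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

static int cmp_u64(const void *a, const void *b) {
    u64 x = *(const u64 *)a, y = *(const u64 *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n_local = 1 << 20;                    /* keys per rank */
    u64 *keys = malloc(n_local * sizeof *keys);
    rng_state = 0x9E3779B97F4A7C15ULL * (u64)(rank + 1);
    for (int i = 0; i < n_local; i++) keys[i] = xorshift64();

    /* Local sort, then count keys per destination rank, assuming keys are
     * uniform over the full 64-bit range. */
    qsort(keys, n_local, sizeof *keys, cmp_u64);
    u64 bucket_width = 0xFFFFFFFFFFFFFFFFULL / nprocs + 1;
    int *sendcounts = calloc(nprocs, sizeof(int));
    for (int i = 0; i < n_local; i++) sendcounts[keys[i] / bucket_width]++;

    int *recvcounts = malloc(nprocs * sizeof(int));
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);

    int *sdispls = calloc(nprocs, sizeof(int));
    int *rdispls = calloc(nprocs, sizeof(int));
    for (int p = 1; p < nprocs; p++) {
        sdispls[p] = sdispls[p - 1] + sendcounts[p - 1];
        rdispls[p] = rdispls[p - 1] + recvcounts[p - 1];
    }
    int n_recv = rdispls[nprocs - 1] + recvcounts[nprocs - 1];
    u64 *mine = malloc((size_t)n_recv * sizeof *mine);

    /* The big exchange: the ~4 second step on the full machine. */
    MPI_Alltoallv(keys, sendcounts, sdispls, MPI_UNSIGNED_LONG_LONG,
                  mine, recvcounts, rdispls, MPI_UNSIGNED_LONG_LONG,
                  MPI_COMM_WORLD);

    /* Received pieces are already sorted, so a k-way merge would beat this
     * final qsort; qsort keeps the sketch short. */
    qsort(mine, n_recv, sizeof *mine, cmp_u64);

    if (rank == 0) printf("rank 0 holds %d keys after the exchange\n", n_recv);
    free(keys); free(sendcounts); free(recvcounts);
    free(sdispls); free(rdispls); free(mine);
    MPI_Finalize();
    return 0;
}
```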
Tuning...
Improved qsort to the model target
Bucket assignment is still very slow
Exchange is still a little slow
We can do better...
Three-Dimensional FFT
3D FFT of a 1-billion-point volume: 1040 x 1040 x 1040, complex-to-complex, single precision
Use PFAFFT (prime factor algorithm FFT)
Two target platforms:
• SC072 -- 72 processors
• SC1458 -- 1458 processors
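For orientation, here is a minimal distributed 3D complex FFT using FFTW's MPI interface (FFTW 3.3+). This is an assumption-laden stand-in, not the PFAFFT code from the talk: it runs in double precision (the fftwf_ variants give single precision) and needs enough aggregate memory across ranks to hold the 1040³ volume.

```c
/* Distributed 3D complex-to-complex FFT with FFTW's MPI interface.
 * Illustrative sketch only; not the talk's PFAFFT implementation.
 * Build: mpicc fft3d.c -lfftw3_mpi -lfftw3 -lm */
#include <fftw3-mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    const ptrdiff_t N = 1040;              /* 1040^3 ~ 1.1 billion points */
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* FFTW slab-decomposes the volume: each rank owns local_n0 planes. */
    ptrdiff_t local_n0, local_0_start;
    ptrdiff_t alloc_local = fftw_mpi_local_size_3d(N, N, N, MPI_COMM_WORLD,
                                                   &local_n0, &local_0_start);
    fftw_complex *data = fftw_alloc_complex(alloc_local);

    fftw_plan plan = fftw_mpi_plan_dft_3d(N, N, N, data, data, MPI_COMM_WORLD,
                                          FFTW_FORWARD, FFTW_ESTIMATE);

    /* Fill the local slab with something (here: a constant). */
    for (ptrdiff_t i = 0; i < local_n0 * N * N; i++) {
        data[i][0] = 1.0;
        data[i][1] = 0.0;
    }

    double t0 = MPI_Wtime();
    fftw_execute(plan);            /* local 1D/2D FFTs plus global transpose */
    double t = MPI_Wtime() - t0;

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("3D FFT of %td^3 took %.3f s\n", N, t);

    fftw_destroy_plan(plan);
    fftw_free(data);
    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
}
```

The structure mirrors the slide's breakdown: per-slab 1D and 2D transforms plus a communication-heavy global transpose.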
Results!
65-processor 3D FFT: 1D FFT 2.91, 2D FFT 6.37, 3D transpose 1.96
1040-processor 3D FFT: 1D FFT 0.18, 2D FFT 0.40, 3D transpose 0.25
FFTW3 is now producing comparable results.
Product Directions
Revere the model: T_sol = T_arith/N + T_mem/N + T_IO + f(N)·T_comm
First generation emphasized T_comm and T_IO
Second generation aimed at T_mem and T_arith while taking advantage of technology improvements for T_comm and T_IO:
• More performance per watt / cubic foot / dollar
• Richer I/O infrastructure
• “Special purpose” configurations
Take-Away
SiCortex builds Linux clusters:
• With purpose-built components
• Optimized for high-communication applications
High Processor Count Computing