Theory of Multicore Algorithms
Jeremy Kepner and Nadya Bliss
MIT Lincoln Laboratory
HPEC 2008
This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
Outline
• Programming Challenge
  – Example Issues
  – Theoretical Approach
  – Integrated Process
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Multicore Programming Challenge
Past Programming Model: Von Neumann
• Great success of the Moore's Law era
  – Simple model: load, op, store
  – Many transistors devoted to delivering this model
• Moore's Law is ending
  – Need transistors for performance
Future Programming Model: ???
• Processor topology includes:
  – Registers, cache, local memory, remote memory, disk
• Cell has multiple programming models
Can we describe how an algorithm is supposed to behave on a hierarchical heterogeneous multicore processor?
Example Issues
  X, Y : N x N
  Y = X + 1
• Where is the data? How is it distributed? What is the initialization policy?
• Where is it running? Which binary to run?
• How does the data flow? What are the allowed message sizes?
• Can computations and communications be overlapped?
• A serial algorithm can run on a serial processor with relatively little specification
• A hierarchical heterogeneous multicore algorithm requires a lot more information
Theoretical Approach
[Diagram: Task1 with A : R^(N x P(N)) mapped to processors 0-3, connected by a conduit (Topic12) to Task2 with B : R^(N x P(N)) mapped to processors 4-7, shown as Replica 0 and Replica 1]
• Provide notation and diagrams that allow hierarchical heterogeneous multicore algorithms to be specified
Integrated Development Process
  X, Y : N x N          X, Y : P(N) x N       X, Y : P(P(N)) x N
  Y = X + 1             Y = X + 1             Y = X + 1
  Desktop               Cluster               Embedded Computer
1. Develop serial code
2. Parallelize code
3. Deploy code
4. Automatically parallelize code
• Should naturally support standard parallel embedded software development practices
Outline
• Parallel Design
  – Serial Program
  – Parallel Execution
  – Distributed Arrays
  – Redistribution
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Serial Program
  X, Y : N x N
  Y = X + 1
• Math is the language of algorithms
• Allows mathematical expressions to be written concisely
• Multi-dimensional arrays are fundamental to mathematics
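For reference, the serial program above is only a few lines of NumPy; the array size N below is an arbitrary value chosen for illustration:

    import numpy as np

    N = 8                  # arbitrary problem size, for illustration only
    X = np.zeros((N, N))   # X, Y : N x N
    Y = X + 1              # Y = X + 1 applies elementwise to the whole array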
Parallel Execution
  P_ID = 0, 1, ..., N_P-1
  X, Y : N x N
  Y = X + 1
• Run N_P copies of the same program
  – Single Program Multiple Data (SPMD)
• Each copy has a unique P_ID
• Every array is replicated on each copy of the program
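One way to realize this SPMD execution model is with MPI. The sketch below uses mpi4py (an assumption of this example, not something prescribed by the slides): every launched copy runs identical code, learns its own P_ID, and holds a full replica of X and Y:

    # Run with, e.g.: mpiexec -n 4 python spmd_replicated.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P_ID = comm.Get_rank()   # unique id of this copy of the program
    N_P  = comm.Get_size()   # number of copies launched

    N = 8
    X = np.zeros((N, N))     # every copy holds the full N x N array (replicated)
    Y = X + 1                # every copy performs the identical computation
    print("P_ID", P_ID, "of", N_P, "computed a replicated Y")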
Distributed Array Program
  P_ID = 0, 1, ..., N_P-1
  X, Y : P(N) x N
  Y = X + 1
• Use the P() notation to make a distributed array
• Tells the program which dimension to distribute the data over
• Each program implicitly operates on only its own data (owner computes rule)
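A hand-rolled sketch of what a distributed-array library might do for X, Y : P(N) x N, again assuming mpi4py as the substrate: the P() map is realized here as a block decomposition of rows, and each copy stores and updates only the rows it owns (owner computes):

    # Run with, e.g.: mpiexec -n 4 python distributed_rows.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P_ID, N_P = comm.Get_rank(), comm.Get_size()

    N = 8
    # P(N) x N: the first (row) dimension is split in blocks among the N_P processors
    my_rows = np.array_split(np.arange(N), N_P)[P_ID]   # global row indices owned here
    X_loc = np.zeros((len(my_rows), N))                 # only the owned rows are stored
    Y_loc = X_loc + 1                                   # owner computes: touch only local data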
Explicitly Local Program
  X, Y : P(N) x N
  Y.loc = X.loc + 1
• Use the .loc notation to explicitly retrieve the local part of a distributed array
• Operation is the same as the serial program, but with different data on each processor (recommended approach)
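One way to mimic the .loc accessor itself is to hide the local piece behind a property; the PArray class below is a toy stand-in invented for this sketch, not the library the slides describe:

    import numpy as np

    class PArray:
        """Toy row-distributed array: stores only this processor's block of rows."""
        def __init__(self, N, P_ID, N_P):
            self.rows = np.array_split(np.arange(N), N_P)[P_ID]
            self._local = np.zeros((len(self.rows), N))

        @property
        def loc(self):
            return self._local        # explicit handle to the local part

    # P_ID and N_P are hard-coded here; in SPMD use they would come from the runtime.
    X = PArray(N=8, P_ID=1, N_P=4)
    Y = PArray(N=8, P_ID=1, N_P=4)
    Y.loc[:] = X.loc + 1              # same arithmetic as the serial program, on local data only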
Parallel Data Maps
[Diagram: example maps from the array (math) to the computer (P_ID 0-3): P(N) x N, N x P(N), P(N) x P(N)]
• A map is a mapping of array indices to processors
• Can be block, cyclic, block-cyclic, or block with overlap
• Use the P() notation to set which dimension to split among processors
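The map types listed above can be written down as small index-to-processor functions; here is a sketch of the block and cyclic cases (block-cyclic and block-with-overlap follow the same pattern):

    def block_map(i, N, N_P):
        """Owner of global index i under a block map (contiguous chunks)."""
        chunk = -(-N // N_P)          # ceil(N / N_P)
        return i // chunk

    def cyclic_map(i, N, N_P):
        """Owner of global index i under a cyclic map (round robin)."""
        return i % N_P

    N, N_P = 8, 4
    print([block_map(i, N, N_P) for i in range(N)])    # [0, 0, 1, 1, 2, 2, 3, 3]
    print([cyclic_map(i, N, N_P) for i in range(N)])   # [0, 1, 2, 3, 0, 1, 2, 3]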
Redistribution of Data
  X : P(N) x N
  Y : N x P(N)
  Y = X + 1
[Diagram: data sent between processors P0-P3 when the row-distributed X is assigned to the column-distributed Y]
• Different distributed arrays can have different maps
• Assignment between arrays with the "=" operator causes data to be redistributed
• Underlying library determines all the messages to send
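A sketch of the messages the library has to generate when a row-distributed X is assigned to a column-distributed Y: each processor carves its rows into one piece per destination column block and exchanges the pieces. The use of mpi4py's alltoall is an assumption of this illustration, not the authors' implementation:

    # Run with, e.g.: mpiexec -n 4 python redistribute.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P_ID, N_P = comm.Get_rank(), comm.Get_size()

    N = 8
    row_blocks = np.array_split(np.arange(N), N_P)      # X : P(N) x N
    col_blocks = np.array_split(np.arange(N), N_P)      # Y : N x P(N)
    X_loc = np.full((len(row_blocks[P_ID]), N), float(P_ID))

    # Piece j of my rows (the columns owned by processor j) goes to processor j;
    # piece i arrives from processor i and carries its rows of my columns.
    pieces = [X_loc[:, cols] for cols in col_blocks]
    received = comm.alltoall(pieces)

    # Stacking the received row blocks in rank order rebuilds the global row order.
    Y_loc = np.vstack(received) + 1                      # Y = X + 1 after redistribution
    assert Y_loc.shape == (N, len(col_blocks[P_ID]))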
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
  – Serial
  – Parallel
  – Hierarchical
  – Cell
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Single Processor Kuck Diagram
  A : R^(N x N)
[Diagram: processor P_0 connected to memory M_0]
• Processors denoted by boxes
• Memory denoted by ovals
• Lines connect associated processors and memories
• Subscript denotes level in the memory hierarchy
Parallel Kuck Diagram
  A : R^(N x P(N))
[Diagram: four processor/memory pairs (P_0, M_0) connected by Net_0.5]
• Replicates serial processors
• Net denotes a network connecting memories at a level in the hierarchy (incremented by 0.5)
• Distributed array has a local piece on each memory
Hierarchical Kuck Diagram (2-Level Hierarchy)
The Kuck notation provides a clear way of describing a hardware architecture along with the memory and communication hierarchy.
[Diagram: two groups of P_0/M_0 pairs, each group connected by Net_0.5 and SMNet_1 to a shared memory SM_1; the two SM_1 groups connected by Net_1.5 and SMNet_2 to a top-level shared memory SM_2]
• Subscript indicates hierarchy level
• An x.5 subscript for Net indicates indirect memory access
Legend:
• P - processor
• Net - inter-processor network
• M - memory
• SM - shared memory
• SMNet - shared memory network
*High Performance Computing: Challenges for Future Systems, David Kuck, 1996
Cell Example
Kuck diagram for the Sony/Toshiba/IBM Cell processor
• P_PPE = PPE speed (GFLOPS)
• M_0,PPE = size of PPE cache (bytes)
• P_PPE - M_0,PPE = PPE to cache bandwidth (GB/sec)
• P_SPE = SPE speed (GFLOPS)
• M_0,SPE = size of SPE local store (bytes)
• P_SPE - M_0,SPE = SPE to local store bandwidth (GB/sec)
• Net_0.5 = SPE to SPE bandwidth (matrix encoding topology, GB/sec)
• MNet_1 = PPE/SPE to main memory bandwidth (GB/sec)
• M_1 = size of main memory (bytes)
[Diagram: PPE plus SPEs 0-7, each a P_0/M_0 pair, connected by Net_0.5 and through MNet_1 to main memory M_1]
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
  – Hierarchical Maps
  – Kuck Diagram
  – Explicitly Local Program
• Tasks and Conduits
• Summary
Hierarchical Arrays
[Diagram: global array divided among processors P_ID 0-3 into local arrays; each local array is further divided among that processor's local P_IDs 0-1]
• Hierarchical arrays allow algorithms to conform to hierarchical multicore processors
• Each processor in the outer set P controls another set of processors
• The array local to a processor in P is sub-divided among its local processors
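The two-level decomposition in the figure can be sketched directly: split the global index set among the outer processors, then split each outer piece again among that processor's local processors (the sizes below are arbitrary):

    import numpy as np

    N, N_outer, N_inner = 16, 2, 2                           # arbitrary sizes for illustration
    outer_pieces = np.array_split(np.arange(N), N_outer)     # one piece per outer processor
    for p, piece in enumerate(outer_pieces):
        inner_pieces = np.array_split(piece, N_inner)        # sub-divided among local processors
        for q, sub in enumerate(inner_pieces):
            print("outer P_ID", p, "local P_ID", q, "owns rows", sub.tolist())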
Hierarchical Array and Kuck Diagram
  A : R^(N x P(P(N)))
[Diagram: A.loc resides in each shared memory SM_1 (via SMnet_1); A.loc.loc resides in each local memory M_0; processors P_0 connected by net_0.5 within a group and net_1.5 between groups]
• Array is allocated across the SM_1 memories of the P processors
• Within each SM_1, responsibility for processing is divided among the local P processors
• The local P processors will move their portion to their local M_0
Explicitly Local Hierarchical Program
  X, Y : P(P(N)) x N
  Y.loc_p.loc_p' = X.loc_p.loc_p' + 1
• Extend the .loc notation (.loc.loc) to explicitly retrieve the local part of a local distributed array (assumes SPMD on P)
• Subscripts p and p' provide explicit access to a particular local piece (implicit otherwise)
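Continuing the toy decomposition above, a sketch of which elements Y.loc_p.loc_p' = X.loc_p.loc_p' + 1 touches for one particular (p, p') pair; the indices are chosen arbitrarily and in SPMD use would be implicit:

    import numpy as np

    N, N_outer, N_inner = 16, 2, 2
    X = np.zeros((N, N))
    Y = np.zeros((N, N))

    p, p_prime = 1, 0                                        # explicit subscripts p and p'
    outer = np.array_split(np.arange(N), N_outer)[p]         # rows of X.loc_p
    rows  = np.array_split(outer, N_inner)[p_prime]          # rows of X.loc_p.loc_p'

    Y[rows, :] = X[rows, :] + 1                              # only this local-local piece is updated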
Block Hierarchical Arrays
[Diagram: global array divided among P_ID 0-3 into local arrays; each local array divided among local P_IDs 0-1; each local piece is further broken into core blocks blk 0-3 with block size b=4, some in-core and some out-of-core]
• Memory constraints are common at the lowest level of the hierarchy
• Blocking at this level allows control of the size of data operated on by each P
Block Hierarchical Program
  X, Y : P(P_b(4)(N)) x N
  for i = 0, X.loc.loc.n_blk - 1
     Y.loc.loc.blk_i = X.loc.loc.blk_i + 1
• P_b(4) indicates each sub-array should be broken up into blocks of size 4
• .n_blk provides the number of blocks for looping over each block; this allows controlling the size of data at the lowest level
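A sketch of the block loop, assuming the lowest-level local piece has already been extracted as in the earlier sketches and that P_b(4) means blocks of 4 rows (shapes below are arbitrary):

    import numpy as np

    b = 4                                        # block size from P_b(4)
    X_locloc = np.zeros((8, 16))                 # this processor's lowest-level piece
    Y_locloc = np.zeros_like(X_locloc)

    n_rows = X_locloc.shape[0]
    blocks = np.array_split(np.arange(n_rows), -(-n_rows // b))   # ceil(n_rows / b) blocks
    n_blk = len(blocks)                          # plays the role of .n_blk
    for i in range(n_blk):                       # for i = 0 .. n_blk - 1
        rows = blocks[i]
        Y_locloc[rows, :] = X_locloc[rows, :] + 1    # Y.loc.loc.blk_i = X.loc.loc.blk_i + 1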
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
  – Basic Pipeline
  – Replicated Tasks
  – Replicated Pipelines
• Summary