Theory of Multicore Algorithms
Jeremy Kepner and Nadya Bliss
MIT Lincoln Laboratory
HPEC 2008
This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
Outline
• Programming Challenge
  – Example Issues
  – Theoretical Approach
  – Integrated Process
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Multicore Programming Challenge
Past Programming Model: Von Neumann
• Great success of the Moore's Law era
  – Simple model: load, op, store
  – Many transistors devoted to delivering this model
• Moore's Law is ending
  – Need transistors for performance
Future Programming Model: ???
• Processor topology includes:
  – Registers, cache, local memory, remote memory, disk
• Cell has multiple programming models
Can we describe how an algorithm is supposed to behave on a hierarchical heterogeneous multicore processor?
Example Issues
  X, Y : N x N
  Y = X + 1
• Where is the data? How is it distributed? What is the initialization policy?
• Where is it running? Which binary to run?
• How does the data flow? What are the allowed message sizes?
• Can computations and communications be overlapped?
• A serial algorithm can run on a serial processor with relatively little specification
• A hierarchical heterogeneous multicore algorithm requires a lot more information
Theoretical Approach
[Diagram: Task1 with A : R^(N x P(N)) mapped to processors 0-3, connected by a conduit (Topic12) to Task2 with B : R^(N x P(N)) mapped to processors 4-7, shown as Replica 0 and Replica 1]
• Provide notation and diagrams that allow hierarchical heterogeneous multicore algorithms to be specified
Integrated Development Process
  X, Y : N x N          X, Y : P(N) x N       X, Y : P(P(N)) x N
  Y = X + 1             Y = X + 1             Y = X + 1
  Desktop               Cluster               Embedded Computer
1. Develop serial code
2. Parallelize code
3. Deploy code
4. Automatically parallelize code
• Should naturally support standard parallel embedded software development practices
Outline
• Parallel Design
  – Serial Program
  – Parallel Execution
  – Distributed Arrays
  – Redistribution
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Serial Program
  X, Y : N x N
  Y = X + 1
• Math is the language of algorithms
• Allows mathematical expressions to be written concisely
• Multi-dimensional arrays are fundamental to mathematics
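For reference, the serial program above is only a few lines of NumPy; the array size N below is an arbitrary value chosen for illustration:

    import numpy as np

    N = 8                  # arbitrary problem size, for illustration only
    X = np.zeros((N, N))   # X, Y : N x N
    Y = X + 1              # Y = X + 1 applies elementwise to the whole array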
Parallel Execution
  P_ID = 0, 1, ..., N_P-1
  X, Y : N x N
  Y = X + 1
• Run N_P copies of the same program
  – Single Program Multiple Data (SPMD)
• Each copy has a unique P_ID
• Every array is replicated on each copy of the program
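One way to realize this SPMD execution model is with MPI. The sketch below uses mpi4py (an assumption of this example, not something prescribed by the slides): every launched copy runs identical code, learns its own P_ID, and holds a full replica of X and Y:

    # Run with, e.g.: mpiexec -n 4 python spmd_replicated.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P_ID = comm.Get_rank()   # unique id of this copy of the program
    N_P  = comm.Get_size()   # number of copies launched

    N = 8
    X = np.zeros((N, N))     # every copy holds the full N x N array (replicated)
    Y = X + 1                # every copy performs the identical computation
    print("P_ID", P_ID, "of", N_P, "computed a replicated Y")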
Distributed Array Program
  P_ID = 0, 1, ..., N_P-1
  X, Y : P(N) x N
  Y = X + 1
• Use the P() notation to make a distributed array
• Tells the program which dimension to distribute the data over
• Each program implicitly operates on only its own data (owner computes rule)
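A hand-rolled sketch of what a distributed-array library might do for X, Y : P(N) x N, again assuming mpi4py as the substrate: the P() map is realized here as a block decomposition of rows, and each copy stores and updates only the rows it owns (owner computes):

    # Run with, e.g.: mpiexec -n 4 python distributed_rows.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P_ID, N_P = comm.Get_rank(), comm.Get_size()

    N = 8
    # P(N) x N: the first (row) dimension is split in blocks among the N_P processors
    my_rows = np.array_split(np.arange(N), N_P)[P_ID]   # global row indices owned here
    X_loc = np.zeros((len(my_rows), N))                 # only the owned rows are stored
    Y_loc = X_loc + 1                                   # owner computes: touch only local data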
Explicitly Local Program
  X, Y : P(N) x N
  Y.loc = X.loc + 1
• Use the .loc notation to explicitly retrieve the local part of a distributed array
• Operation is the same as the serial program, but with different data on each processor (recommended approach)
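One way to mimic the .loc accessor itself is to hide the local piece behind a property; the PArray class below is a toy stand-in invented for this sketch, not the library the slides describe:

    import numpy as np

    class PArray:
        """Toy row-distributed array: stores only this processor's block of rows."""
        def __init__(self, N, P_ID, N_P):
            self.rows = np.array_split(np.arange(N), N_P)[P_ID]
            self._local = np.zeros((len(self.rows), N))

        @property
        def loc(self):
            return self._local        # explicit handle to the local part

    # P_ID and N_P are hard-coded here; in SPMD use they would come from the runtime.
    X = PArray(N=8, P_ID=1, N_P=4)
    Y = PArray(N=8, P_ID=1, N_P=4)
    Y.loc[:] = X.loc + 1              # same arithmetic as the serial program, on local data only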
Parallel Data Maps
[Diagram: example maps from the array (math) to the computer (P_ID 0-3): P(N) x N, N x P(N), P(N) x P(N)]
• A map is a mapping of array indices to processors
• Can be block, cyclic, block-cyclic, or block with overlap
• Use the P() notation to set which dimension to split among processors
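The map types listed above can be written down as small index-to-processor functions; here is a sketch of the block and cyclic cases (block-cyclic and block-with-overlap follow the same pattern):

    def block_map(i, N, N_P):
        """Owner of global index i under a block map (contiguous chunks)."""
        chunk = -(-N // N_P)          # ceil(N / N_P)
        return i // chunk

    def cyclic_map(i, N, N_P):
        """Owner of global index i under a cyclic map (round robin)."""
        return i % N_P

    N, N_P = 8, 4
    print([block_map(i, N, N_P) for i in range(N)])    # [0, 0, 1, 1, 2, 2, 3, 3]
    print([cyclic_map(i, N, N_P) for i in range(N)])   # [0, 1, 2, 3, 0, 1, 2, 3]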
Redistribution of Data
  X : P(N) x N
  Y : N x P(N)
  Y = X + 1
[Diagram: data sent between processors P0-P3 when the row-distributed X is assigned to the column-distributed Y]
• Different distributed arrays can have different maps
• Assignment between arrays with the "=" operator causes data to be redistributed
• Underlying library determines all the messages to send
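A sketch of the messages the library has to generate when a row-distributed X is assigned to a column-distributed Y: each processor carves its rows into one piece per destination column block and exchanges the pieces. The use of mpi4py's alltoall is an assumption of this illustration, not the authors' implementation:

    # Run with, e.g.: mpiexec -n 4 python redistribute.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P_ID, N_P = comm.Get_rank(), comm.Get_size()

    N = 8
    row_blocks = np.array_split(np.arange(N), N_P)      # X : P(N) x N
    col_blocks = np.array_split(np.arange(N), N_P)      # Y : N x P(N)
    X_loc = np.full((len(row_blocks[P_ID]), N), float(P_ID))

    # Piece j of my rows (the columns owned by processor j) goes to processor j;
    # piece i arrives from processor i and carries its rows of my columns.
    pieces = [X_loc[:, cols] for cols in col_blocks]
    received = comm.alltoall(pieces)

    # Stacking the received row blocks in rank order rebuilds the global row order.
    Y_loc = np.vstack(received) + 1                      # Y = X + 1 after redistribution
    assert Y_loc.shape == (N, len(col_blocks[P_ID]))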
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
  – Serial
  – Parallel
  – Hierarchical
  – Cell
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Single Processor Kuck Diagram
  A : R^(N x N)
[Diagram: processor P_0 connected to memory M_0]
• Processors denoted by boxes
• Memory denoted by ovals
• Lines connect associated processors and memories
• Subscript denotes level in the memory hierarchy
Parallel Kuck Diagram
  A : R^(N x P(N))
[Diagram: four processor/memory pairs (P_0, M_0) connected by Net_0.5]
• Replicates serial processors
• Net denotes a network connecting memories at a level in the hierarchy (incremented by 0.5)
• Distributed array has a local piece on each memory
Hierarchical Kuck Diagram (2-Level Hierarchy)
The Kuck notation provides a clear way of describing a hardware architecture along with the memory and communication hierarchy.
[Diagram: two groups of P_0/M_0 pairs, each group connected by Net_0.5 and SMNet_1 to a shared memory SM_1; the two SM_1 groups connected by Net_1.5 and SMNet_2 to a top-level shared memory SM_2]
• Subscript indicates hierarchy level
• An x.5 subscript for Net indicates indirect memory access
Legend:
• P - processor
• Net - inter-processor network
• M - memory
• SM - shared memory
• SMNet - shared memory network
*High Performance Computing: Challenges for Future Systems, David Kuck, 1996
Cell Example
Kuck diagram for the Sony/Toshiba/IBM Cell processor
• P_PPE = PPE speed (GFLOPS)
• M_0,PPE = size of PPE cache (bytes)
• P_PPE - M_0,PPE = PPE to cache bandwidth (GB/sec)
• P_SPE = SPE speed (GFLOPS)
• M_0,SPE = size of SPE local store (bytes)
• P_SPE - M_0,SPE = SPE to local store bandwidth (GB/sec)
• Net_0.5 = SPE to SPE bandwidth (matrix encoding topology, GB/sec)
• MNet_1 = PPE/SPE to main memory bandwidth (GB/sec)
• M_1 = size of main memory (bytes)
[Diagram: PPE plus SPEs 0-7, each a P_0/M_0 pair, connected by Net_0.5 and through MNet_1 to main memory M_1]
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
  – Hierarchical Maps
  – Kuck Diagram
  – Explicitly Local Program
• Tasks and Conduits
• Summary
Hierarchical Arrays
[Diagram: global array divided among processors P_ID 0-3 into local arrays; each local array is further divided among that processor's local P_IDs 0-1]
• Hierarchical arrays allow algorithms to conform to hierarchical multicore processors
• Each processor in the outer set P controls another set of processors
• The array local to a processor in P is sub-divided among its local processors
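The two-level decomposition in the figure can be sketched directly: split the global index set among the outer processors, then split each outer piece again among that processor's local processors (the sizes below are arbitrary):

    import numpy as np

    N, N_outer, N_inner = 16, 2, 2                           # arbitrary sizes for illustration
    outer_pieces = np.array_split(np.arange(N), N_outer)     # one piece per outer processor
    for p, piece in enumerate(outer_pieces):
        inner_pieces = np.array_split(piece, N_inner)        # sub-divided among local processors
        for q, sub in enumerate(inner_pieces):
            print("outer P_ID", p, "local P_ID", q, "owns rows", sub.tolist())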
Hierarchical Array and Kuck Diagram
  A : R^(N x P(P(N)))
[Diagram: A.loc resides in each shared memory SM_1 (via SMnet_1); A.loc.loc resides in each local memory M_0; processors P_0 connected by net_0.5 within a group and net_1.5 between groups]
• Array is allocated across the SM_1 memories of the P processors
• Within each SM_1, responsibility for processing is divided among the local P processors
• The local P processors will move their portion to their local M_0
Explicitly Local Hierarchical Program
  X, Y : P(P(N)) x N
  Y.loc_p.loc_p' = X.loc_p.loc_p' + 1
• Extend the .loc notation (.loc.loc) to explicitly retrieve the local part of a local distributed array (assumes SPMD on P)
• Subscripts p and p' provide explicit access to a particular local piece (implicit otherwise)
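Continuing the toy decomposition above, a sketch of which elements Y.loc_p.loc_p' = X.loc_p.loc_p' + 1 touches for one particular (p, p') pair; the indices are chosen arbitrarily and in SPMD use would be implicit:

    import numpy as np

    N, N_outer, N_inner = 16, 2, 2
    X = np.zeros((N, N))
    Y = np.zeros((N, N))

    p, p_prime = 1, 0                                        # explicit subscripts p and p'
    outer = np.array_split(np.arange(N), N_outer)[p]         # rows of X.loc_p
    rows  = np.array_split(outer, N_inner)[p_prime]          # rows of X.loc_p.loc_p'

    Y[rows, :] = X[rows, :] + 1                              # only this local-local piece is updated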
Block Hierarchical Arrays
[Diagram: global array divided among P_ID 0-3 into local arrays; each local array divided among local P_IDs 0-1; each local piece is further broken into core blocks blk 0-3 with block size b=4, some in-core and some out-of-core]
• Memory constraints are common at the lowest level of the hierarchy
• Blocking at this level allows control of the size of data operated on by each P
Block Hierarchical Program
  X, Y : P(P_b(4)(N)) x N
  for i = 0, X.loc.loc.n_blk - 1
     Y.loc.loc.blk_i = X.loc.loc.blk_i + 1
• P_b(4) indicates each sub-array should be broken up into blocks of size 4
• .n_blk provides the number of blocks for looping over each block; this allows controlling the size of data at the lowest level
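A sketch of the block loop, assuming the lowest-level local piece has already been extracted as in the earlier sketches and that P_b(4) means blocks of 4 rows (shapes below are arbitrary):

    import numpy as np

    b = 4                                        # block size from P_b(4)
    X_locloc = np.zeros((8, 16))                 # this processor's lowest-level piece
    Y_locloc = np.zeros_like(X_locloc)

    n_rows = X_locloc.shape[0]
    blocks = np.array_split(np.arange(n_rows), -(-n_rows // b))   # ceil(n_rows / b) blocks
    n_blk = len(blocks)                          # plays the role of .n_blk
    for i in range(n_blk):                       # for i = 0 .. n_blk - 1
        rows = blocks[i]
        Y_locloc[rows, :] = X_locloc[rows, :] + 1    # Y.loc.loc.blk_i = X.loc.loc.blk_i + 1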
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
  – Basic Pipeline
  – Replicated Tasks
  – Replicated Pipelines
• Summary