Executing a Program on the MIT Tagged Token Dataflow Architecture Arvind and Nikhil
Notes on the paper • This is a “Big A” Architecture paper • It’s a PL, an ISA, and an execution model • and a dash of hardware
Execution Models: Von Neumann • Von Neumann (CMP) • Program counter: centralized, sequential • Two serialization points: instruction fetch and memory access
Execution Model: Dataflow • Not a new idea [Dennis, ISCA’75] • Programs are dataflow graphs • Instructions fire when data arrives • Instructions act independently • All ready instructions can fire at once • Massive parallelism • (Figure: a + node receives the inputs 2 and 2, fires, and produces 4.)
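The firing rule can be sketched in a few lines of Python. This is only an illustration: the `Node` class, its `receive` method, and the port names `left` and `right` are hypothetical and do not come from the paper or from any dataflow ISA.

```python
import operator


class Node:
    """A two-input dataflow instruction that fires once both operands arrive."""

    def __init__(self, op):
        self.op = op
        self.operands = {}   # port name -> value received so far
        self.fired = False

    def receive(self, port, value):
        """Deliver a token to 'left' or 'right'; return the result if the node fires."""
        self.operands[port] = value
        if len(self.operands) == 2 and not self.fired:
            self.fired = True
            return self.op(self.operands["left"], self.operands["right"])
        return None  # still waiting for the other operand


add = Node(operator.add)
first = add.receive("left", 2)     # only one operand present: no firing
second = add.receive("right", 2)   # both operands present: the node fires
print(first, second)
```

No program counter appears anywhere: the arrival of the second token is what triggers execution.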
Von Neumann example
  Source:
    A[j + i*i] = i;
    b = A[i*j];
  Instructions:
    Mul   t1 ← i, j
    Mul   t2 ← i, i
    Add   t3 ← A, t1
    Add   t4 ← j, t2
    Add   t5 ← A, t4
    Store (t5) ← i
    Load  b ← (t3)
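Dataflow example • Source: A[j + i*i] = i; b = A[i*j]; • (Figure: dataflow graph in which the inputs i, A, and j feed two * nodes and three + nodes, ending in a Load that produces b and a Store.)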
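One way to see the parallelism in this example is to group its instructions into "waves" that can fire together. The Python sketch below assumes the temporaries t1 to t5 from the von Neumann listing; the dictionary layout and the `firing_waves` helper are hypothetical. It tracks data dependences only and does not order the Load against the Store.

```python
# Each entry: result name -> (operation label, input names).
# The names follow the von Neumann listing; the layout is an assumption.
graph = {
    "t1": ("mul", ["i", "j"]),
    "t2": ("mul", ["i", "i"]),
    "t3": ("add", ["A", "t1"]),
    "t4": ("add", ["j", "t2"]),
    "t5": ("add", ["A", "t4"]),
    "b": ("load", ["t3"]),
    "store": ("store", ["t5", "i"]),
}


def firing_waves(graph, initial):
    """Group instructions into waves; everything in one wave can fire at once."""
    available = set(initial)
    pending = dict(graph)
    waves = []
    while pending:
        ready = sorted(name for name, (_, inputs) in pending.items()
                       if all(x in available for x in inputs))
        if not ready:
            raise ValueError("graph is not well formed: no instruction can fire")
        waves.append(ready)
        available.update(ready)
        for name in ready:
            del pending[name]
    return waves


waves = firing_waves(graph, ["i", "j", "A"])
for step, wave in enumerate(waves, start=1):
    print(step, wave)
```

The seven instructions complete in four waves, compared with seven sequential steps under a program counter.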
Conditionals • Use a switch operator: no wasted work; natural correspondence to if-then; can build loops • Use a gated phi function (à la SSA): more parallelism (defer predicate computation); not suitable for loops; computing the predicate is tricky (but solved) • (Figure: a Switch node steered by predicate P with T and F outputs; a phi node with T and F inputs selected by P.)
Conditionals • Use a “steering” operator.
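The difference between the two styles of conditional can be sketched in Python. The function names are hypothetical, and `None` stands in for "no token on this arc", which is an assumption of the sketch rather than anything in the paper.

```python
def switch(predicate, token):
    """Steer the token to exactly one output: returns (true_out, false_out)."""
    return (token, None) if predicate else (None, token)


def gated_phi(predicate, true_value, false_value):
    """Select between two values that were both computed already."""
    return true_value if predicate else false_value


def abs_with_switch(x):
    # Only the chosen arm receives a token, so only that arm does any work.
    t, f = switch(x < 0, x)
    return -t if t is not None else f


def abs_with_phi(x):
    # Both arms are computed (possibly before the predicate is known);
    # the phi then discards one of the results.
    return gated_phi(x < 0, -x, x)


print(abs_with_switch(-3), abs_with_phi(-3))
```

The switch version wastes no work, while the phi version trades wasted work for the freedom to start both arms before the predicate arrives.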
Loops
Managing parallelism: Static dataflow • Exactly one input on each dataflow arc at one time • Finite state (~ the size of the dataflow graph) • Scheduling is easy • Parallelism limited by dataflow graph size (i.e., static instruction count) • No loop parallelism • (Figure: inputs A and B feeding a + node.)
Managing parallelism: Dynamic dataflow • Multiple inputs on an arc at one time • Parallelism is possible: pipeline iterations through the loop’s graph • Unbounded state • Circulation speed mismatch leads to mismatched inputs • Tags are required • (Figure: tagged tokens such as 1:A, 2:B, 3:A and 1:B, 2:A, 3:B queued on the arcs into + nodes, which produce the tagged sums 1:S, 2:S, 3:S.)
Dataflow tags • Tags distinguish between different dynamic instances of the same value • Tag management in TTDA: tags are the address of an activation record (aka stack frame); a dynamic instance of an “instruction block” has a tag; a central manager allocates/reclaims them
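Tag matching can be sketched as a small waiting store keyed by tag. In this Python illustration the `MatchingStore` class is hypothetical, and small integers stand in for tags (in TTDA they would be activation-record addresses). The sketch models a single two-input instruction, so the tag alone serves as the match key.

```python
from collections import defaultdict


class MatchingStore:
    """Pairs up operands for one two-input instruction by tag."""

    def __init__(self):
        self.waiting = defaultdict(dict)  # tag -> {port: value}

    def arrive(self, tag, port, value):
        """Store a token; return (tag, left, right) once its partner is present."""
        slots = self.waiting[tag]
        slots[port] = value
        if "left" in slots and "right" in slots:
            del self.waiting[tag]
            return (tag, slots["left"], slots["right"])
        return None


store = MatchingStore()
# Tokens from three loop iterations arrive in different orders on the two arcs.
arrivals = [(1, "left", 10), (2, "right", 200), (3, "left", 30),
            (1, "right", 100), (2, "left", 20), (3, "right", 300)]
sums = {}
for tag, port, value in arrivals:
    match = store.arrive(tag, port, value)
    if match is not None:
        t, a, b = match
        sums[t] = a + b
print(sums)
```

Without the tag, the first token on each arc (iteration 1 on the left, iteration 2 on the right) would be paired incorrectly.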
Dataflow Granularity • How big should the threads that “fire” be? • Fine-grain: in the limit, each instruction is a thread; maximum parallelism; lots of synchronization overhead; bounded number of inputs • Coarse-grain: potentially less parallelism (in practice?); less synchronization overhead and variable inputs • It’s hard to beat straight-line code on a pipelined machine: 5 stages == 5-way parallelism; pretty good for short threads
Challenges in Dataflow Execution • Building well-formed graphs: in von Neumann ISAs any sequence of instructions is valid; the rules for well-formed dataflow graphs are complex • Detecting completion: it is hard to tell when a fully distributed system is “finished” • Preventing tag explosion: k-loop bounding et al. • Executing “normal” languages
j will probably run ahead of s: tokens pile up! • But it might not: tokens out of order!
Id • Elegant • Determinate • Functional • Non-strict • Implicit parallelism • I-structures • Non-strictness is the least intuitive property: exposes enormous parallelism; leads to mind-bending code
I-structures • A sort of dataflow-enabled storage element • Simple rules: write/initialize once; a read from an uninitialized I-structure blocks; a read from an initialized I-structure returns; a write to an uninitialized I-structure unblocks reads; a write to an initialized I-structure is an error • Implementation is tricky: you need a queue for blocked reads
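The rules above can be sketched as a single write-once cell in Python. The `IStructureCell` class is hypothetical, and modeling a blocked read as a callback placed on a deferred queue is an assumption of this sketch, not a description of the TTDA hardware.

```python
class IStructureCell:
    """One write-once slot with a queue of deferred reads."""

    _EMPTY = object()  # sentinel so that None remains a legal stored value

    def __init__(self):
        self.value = self._EMPTY
        self.deferred = []  # consumers whose reads arrived before the write

    def read(self, consumer):
        """Deliver the value to consumer now, or defer until the write arrives."""
        if self.value is self._EMPTY:
            self.deferred.append(consumer)
        else:
            consumer(self.value)

    def write(self, value):
        """Fill the slot once and release every deferred read."""
        if self.value is not self._EMPTY:
            raise RuntimeError("I-structure slot written twice")
        self.value = value
        waiting, self.deferred = self.deferred, []
        for consumer in waiting:
            consumer(value)


cell = IStructureCell()
seen = []
cell.read(seen.append)   # early read: deferred, nothing delivered yet
early = list(seen)
cell.write(42)           # the write releases the deferred read
cell.read(seen.append)   # late read: returns immediately
print(early, seen)
```

A second `cell.write(...)` raises `RuntimeError`, matching the rule that writing an initialized I-structure is an error.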
In context • Id never really went anywhere • This paper is a good snapshot of late-’80s dataflow thinking • Eventually gives rise to OOO execution (à la HPS) • Excellent example of vertical co-design: they rethought the whole system; almost always impractical; often yields great ideas
Bits from your summaries • How do you execute normal languages? • How do you multitask? • How does function linking work? • Top-to-bottom design • Where’s the data? • Would I-structures be useful today?