Dataflow: The Road Less Complex
Steven Swanson, Andrew Schwerin, Ken Michelson, Mark Oskin
University of Washington
Sponsored by NSF and Intel
Things to keep you up at night (~2016)
• Opportunities
  • 8 billion transistors; 28 GHz
  • One DRAM chip will exhaust a 32-bit address space
  • 120 P4s OR 200,000 RISC-1s will fit on a die (chips are networks)
• Challenges
  • It will take 36 cycles to cross the die (chips are networks)
  • For reasonable yields, only 1 transistor in 24 billion may be broken, if one flaw breaks a chip (fault tolerance is required)
  • 7 years and 10,000 people (simpler designs and better tools are needed)
Outline
• Monolithic von Neumann processing
• WaveScalar
• Results
• Future work and conclusions
Monolithic Processing
• Von Neumann machines are simple.
• We know how to build them.
• 2016?
  • Communication
  • Fault tolerance
  • Complexity
  • Performance
Decentralized Processing
☺ Communication
☺ Fault tolerance
☺ Complexity
☺ Performance
The Problem with Von Neumann
• Fundamentally centralized.
  • Fetch is the key.
  • There is only one program counter.
• There is no parallelism in the model.
• The alternative is dataflow.
Dataflow has been done before...
• Dataflow is not new.
  • Operations fire when their data is available (see the firing-rule sketch below).
  • No program counter.
  • No false control dependencies.
  • Exposes massive parallelism.
• But...
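To make the firing rule concrete, here is a minimal dataflow interpreter sketch in Python. The `Node` class, the work-list loop, and the example graph are all invented for illustration; real dataflow machines fire ready instructions in parallel rather than scanning a work list.

```python
class Node:
    """A dataflow instruction: fires once all of its inputs have arrived."""
    def __init__(self, op, n_inputs):
        self.op = op
        self.n_inputs = n_inputs
        self.inputs = {}          # port index -> value
        self.consumers = []       # (node, port) pairs fed by our output

    def ready(self):
        return len(self.inputs) == self.n_inputs

def run(nodes, initial_tokens):
    # Seed the graph, then fire any node whose operands are all present.
    # There is no program counter: execution order is set by data arrival.
    for node, port, value in initial_tokens:
        node.inputs[port] = value
    work = [n for n in nodes if n.ready()]
    while work:
        node = work.pop()
        result = node.op(*(node.inputs[p] for p in sorted(node.inputs)))
        for consumer, port in node.consumers:
            consumer.inputs[port] = result
            if consumer.ready():
                work.append(consumer)
        node.inputs = {}          # consume the tokens
    return result

# Example: compute (a + b) * (a - b); note that + and - have no
# ordering between them, exposing the parallelism for free.
add = Node(lambda x, y: x + y, 2)
sub = Node(lambda x, y: x - y, 2)
mul = Node(lambda x, y: x * y, 2)
add.consumers = [(mul, 0)]
sub.consumers = [(mul, 1)]
print(run([add, sub, mul], [(add, 0, 7), (add, 1, 3),
                            (sub, 0, 7), (sub, 1, 3)]))  # 40
```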
...it had issues
• It never executed mainstream code.
  • Special languages
    • No mutable data structures
    • No aliasing
    • Functional
  • Strange memory semantics
• There are scalability concerns.
  • Large, slow token stores
The WaveScalar Model
• WaveScalar is memory-centric dataflow.
• Compared to von Neumann:
  • There is no fetch.
• Compared to traditional dataflow:
  • Memory ordering is a first-class citizen.
  • Normal memory semantics.
  • No I-structures or special languages.
  • We run SPEC.
What is a wave?
• A maximal loop-free section of the dataflow graph.
  • May contain branches and joins.
  • Waves are bigger than hyperblocks.
• Each dynamic wave has a wave number.
• Every value carries a wave number (see the tagging sketch below).
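A minimal sketch of wave-number tagging in Python; the `Token` type and the matching rule described in the comment are illustrative assumptions about how tagged dataflow values distinguish loop iterations, not the actual hardware encoding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    """A dataflow value tagged with the dynamic wave that produced it.

    Instructions combine only operands whose wave numbers match, so
    values from different loop iterations (different dynamic waves)
    can be in flight at once without being confused with each other.
    """
    wave: int     # dynamic wave number, bumped each time a wave re-executes
    value: int

# Two iterations of a loop body in flight simultaneously: the same
# static instruction's outputs, distinguished purely by the wave tag.
i0 = Token(wave=0, value=42)
i1 = Token(wave=1, value=43)
assert i0.wave != i1.wave
```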
Maintaining Memory Order
• Loads and stores can issue requests to memory in any order. Each request carries:
  • A wave number.
  • An operation sequence number.
  • Ordering information (predecessor and successor sequence numbers).
• The memory system reconstructs the correct order (a sketch follows this slide).
  • Wave number + sequence numbers provide a total order.
  • Use your favorite speculative memory system, or a store buffer.
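A minimal sketch of the reconstruction step, assuming a simple store buffer that releases requests to memory in (wave, sequence) order; the class and method names are invented for illustration. One real subtlety is elided: the scheme on the slide uses each request's predecessor/successor links to skip sequence numbers that will never arrive (e.g., operations on an untaken branch path), while this sketch assumes every sequence number is eventually issued.

```python
import heapq

class WaveOrderedBuffer:
    """Releases memory requests in program order, reconstructed from
    (wave, seq) tags even though requests arrive in any order."""
    def __init__(self):
        self.pending = []      # min-heap keyed on (wave, seq)
        self.next = (0, 0)     # next (wave, seq) we may release

    def arrive(self, wave, seq, op, last_in_wave=False):
        heapq.heappush(self.pending, (wave, seq, op, last_in_wave))
        released = []
        # Drain every request that is now contiguous with program order.
        while self.pending and self.pending[0][:2] == self.next:
            wave, seq, op, last = heapq.heappop(self.pending)
            released.append(op)             # safe to send to memory now
            self.next = (wave + 1, 0) if last else (wave, seq + 1)
        return released

buf = WaveOrderedBuffer()
print(buf.arrive(0, 1, "store A", last_in_wave=True))  # [] (seq 0 missing)
print(buf.arrive(0, 0, "load B"))   # ['load B', 'store A']; wave 1 is next
```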
WaveScalar benefits
• Exposes everything about the program:
  • Data dependencies
  • Memory order
• Instructions manipulate wave numbers.
• Multiple, parallel sequences of operations are possible:
  • Synchronization
  • Concurrency
  • Communication
The WaveCache
• The I-cache is the processor.
[Diagram: a grid of processing elements backed by the L2 cache]
WaveCache Processing Element
• Long-distance communication
  • Dynamic routing
  • Grid-based network
  • 1-2 cycles/domain (see the latency sketch below)
• Traditional cache coherence
• Normal memory hierarchy
• 16K instructions
[Diagram: a cluster of PE domains; each PE contains flow control, inputs, decode, an FU, config. logic, and outputs; each cluster holds a D$ + store buffer and connects to the L2 cache]
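To make the hop cost concrete, here is a back-of-the-envelope latency sketch, assuming simple dimension-ordered (Manhattan) routing over the 2D grid of domains and taking the slide's 1-2 cycles/domain as a fixed per-hop cost; the function and coordinate scheme are assumptions for illustration.

```python
def route_latency_cycles(src, dst, cycles_per_hop=2):
    """Estimate cycles for a message on the WaveCache grid network.

    Assumes dimension-ordered routing on a 2D grid of domains and a
    fixed cost per domain hop. src and dst are (row, col) coordinates.
    """
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])   # Manhattan distance
    return hops * cycles_per_hop

# Neighboring domains are cheap; corner-to-corner on a large grid is
# not, which is why placing dependent instructions close together matters.
print(route_latency_cycles((0, 0), (0, 1)))    # 2 cycles
print(route_latency_cycles((0, 0), (7, 7)))    # 28 cycles
```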
Current results
• Compiled SPEC/Mediabench
  • DEC cc compiler (-O4 -unroll 16)
• Binary translator/compiler
  • From Alpha AXP to WaveScalar
• Timing/execution-based simulation
• Results reported in Alpha instructions per cycle (AIPC)
Comparison architectures
• Superscalar
  • 16-wide, 16-ported cache, 1024-entry issue window, 1024 registers, gshare branch predictor
  • 15-stage pipeline
  • Perfect cache
• WaveCache
  • ~2000 processing elements
  • 16 elements/domain
  • Perfect cache
WaveScalar vs. Superscalar
• 2.8x faster
  • Not counting clock rate improvements.
[Plot: AIPC for WaveScalar vs. the superscalar on vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft]
Cache replacement
• Not all the instructions will fit.
• WaveCache miss: the destination instruction is not present.
  • Evict/load an instruction (flush/load queues).
  • Instructions volunteer for removal.
• Location is important.
  • Normal hashing won't work (see the placement sketch below).
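A toy sketch of why hashed placement fails here: hashing ignores operand traffic, while a locality-aware mapper keeps an instruction within a few domain hops of its producers. The functions, scoring rule, and coordinates are invented for illustration, not the actual WaveCache policy.

```python
def hash_place(instr_id, grid_size):
    """Naive hashing: scatters instructions, ignoring who talks to whom."""
    h = hash(instr_id)
    return (h % grid_size, (h // grid_size) % grid_size)

def locality_place(producer_coords, free_pes):
    """Pick the free PE minimizing total Manhattan distance to the
    instruction's producers, keeping operand traffic local."""
    def cost(pe):
        return sum(abs(pe[0] - p[0]) + abs(pe[1] - p[1])
                   for p in producer_coords)
    return min(free_pes, key=cost)

# An instruction fed by two producers mapped at (2, 2) and (2, 3):
free = [(0, 7), (2, 1), (5, 5)]
print(locality_place([(2, 2), (2, 3)], free))   # (2, 1): closest to both
```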
Cache size
• Thrashing is dangerous.
• Dynamic mapping is a big win.
[Plot: performance normalized to the full-size WaveCache vs. cache size (log scale, ~10 to 10,000 instructions) for vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft]
Speculation
• Speculation helps.
  • 2.4x on average with both combined.
• This is gravy!
[Plot: WaveScalar AIPC with a perfect branch predictor, perfect memory disambiguation, and both, for vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft]
Future work
• Hardware implementation
  • A la the Bathysphere
• Compiler issues
  • Memory parallelism
• More than von Neumann emulation
  • Vector
  • Streaming
  • WaveScalar is an ISA for writing architectures.
• Operating system issues
  • What is a context switch?
  • What is a system call?
Future work (cont.)
• Online placement optimization
  • Simulated annealing
• Defect tolerance
  • Hard and soft faults
• The WaveCache as a computer system
  • WaveScalar everything (graphics, I/O, CPU, keyboard, hard drive)
  • A uniform namespace for a computer
  • Adaptation at load time
Conclusion
• Decentralized computing will let you rest easy in 2016!
• WaveScalar and the WaveCache
  • Dataflow with normal memory!
  • Outperforms an OOO superscalar by 2.8x
  • Feasible now and in 2016
• Enormous opportunities for future research