Dataflow: The Road Less Complex




  1. Dataflow: The Road Less Complex
     Steven Swanson, Andrew Schwerin, Ken Michelson, Mark Oskin
     University of Washington. Sponsored by NSF and Intel.

  2. Things to keep you up at night (~2016)
     • Opportunities
        • 8 billion transistors; 28 GHz
        • One DRAM chip will exhaust a 32-bit address space
        • 120 P4s OR 200,000 RISC-1s will fit on a die: chips are networks
     • Challenges
        • It will take 36 cycles to cross the die
        • For reasonable yields, only 1 transistor in 24 billion may be broken, if one flaw breaks a chip: fault tolerance is required
        • 7 years and 10,000 people: simpler designs and better tools are needed

  3. Outline
     • Monolithic von Neumann processing
     • WaveScalar
     • Results
     • Future work and conclusions

  4. Monolithic Processing
     • Von Neumann is simple.
     • We know how to build them.
     • 2016?
        • Communication
        • Fault tolerance
        • Complexity
        • Performance

  5. Decentralized Processing
     ☺ Communication
     ☺ Fault tolerance
     ☺ Complexity
     ☺ Performance

  6. The Problem with von Neumann
     • Fundamentally centralized.
     • Fetch is the key.
        • There is only one program counter.
     • There is no parallelism in the model.
     • The alternative is dataflow.

  7. Dataflow has been done before…
     • Dataflow is not new.
     • Operations fire when data is available.
        • No program counter
        • No false control dependencies
     • Exposes massive parallelism.
     • But...
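The firing rule on this slide can be sketched as a toy dataflow interpreter (an illustration only, not the WaveScalar ISA; the `Node` class and token format are invented for the example): each node counts arriving operands and executes as soon as all of them are present, with no program counter ordering the work.

```python
# Toy dataflow interpreter: a node fires when all of its operands have
# arrived. The worklist order is arbitrary, which is the point: there is
# no program counter, only data availability.

class Node:
    def __init__(self, op, n_inputs, consumers):
        self.op = op                  # function computing the result
        self.n_inputs = n_inputs      # operand count needed to fire
        self.inputs = {}              # slot -> value received so far
        self.consumers = consumers    # list of (node, slot) to forward to

def run(nodes, initial_tokens):
    """Push tokens through the graph; fire nodes when operands are ready."""
    ready = list(initial_tokens)      # (node, slot, value) tokens in flight
    results = []
    while ready:
        node, slot, value = ready.pop()
        node.inputs[slot] = value
        if len(node.inputs) == node.n_inputs:          # the firing rule
            out = node.op(*(node.inputs[i] for i in range(node.n_inputs)))
            node.inputs = {}
            if node.consumers:
                for consumer, cslot in node.consumers:
                    ready.append((consumer, cslot, out))
            else:
                results.append(out)                    # graph output
    return results

# (a + b) * (a - b): the add and the subtract can fire in either order.
mul = Node(lambda x, y: x * y, 2, [])
add = Node(lambda x, y: x + y, 2, [(mul, 0)])
sub = Node(lambda x, y: x - y, 2, [(mul, 1)])
tokens = [(add, 0, 5), (add, 1, 3), (sub, 0, 5), (sub, 1, 3)]
print(run([add, sub, mul], tokens))   # [16]
```

Because the add and subtract have no dependence on each other, they expose exactly the parallelism the slide describes: a real machine could execute them simultaneously.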

  8. …it had issues
     • It never executed mainstream code.
        • Special languages
           • No mutable data structures
           • No aliasing
           • Functional
        • Strange memory semantics
     • There are scalability concerns.
        • Large, slow token stores

  9. The WaveScalar Model
     • WaveScalar is memory-centric dataflow.
     • Compared to von Neumann:
        • There is no fetch.
     • Compared to traditional dataflow:
        • Memory ordering is a first-class citizen.
        • Normal memory semantics.
        • No I-structures or special languages.
        • We run SPEC.

  10. What is a wave?
     • Maximal loop-free sections of the dataflow graph.
        • May contain branches and joins.
        • They are bigger than hyperblocks.
     • Each dynamic wave has a wave number.
     • Every value has a wave number.
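One way to picture wave numbers (a sketch with assumed semantics, not the actual ISA encoding; `Token`, `wave_advance`, and `add` are invented names): every value carries the wave number of the dynamic wave that produced it, so tokens from different loop iterations can never be confused, and crossing the loop back edge bumps the tag.

```python
# Sketch of wave numbering: values are tagged with the dynamic wave that
# produced them, and a wave-advance at the loop back edge increments the tag.
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    wave: int      # dynamic wave number of the producing wave
    value: int

def wave_advance(tok):
    """Crossing the loop back edge starts a new dynamic wave."""
    return Token(tok.wave + 1, tok.value)

def add(a, b):
    # Operands may only combine when their wave numbers agree, so tokens
    # from different loop iterations never match each other.
    assert a.wave == b.wave, "tokens from different waves must not combine"
    return Token(a.wave, a.value + b.value)

total = Token(0, 0)
for _ in range(3):                            # each iteration is a new wave
    total = add(total, Token(total.wave, 2))  # work inside the wave
    total = wave_advance(total)               # back edge bumps the tag
print(total)   # Token(wave=3, value=6)
```

The assertion inside `add` is the matching rule: within one dynamic wave the graph is loop-free, so a (wave number, destination) pair names each operand unambiguously.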

  11. Maintaining Memory Order
     • Loads and stores can issue requests to memory in any order, tagged with:
        • Wave number
        • Operation sequence number
        • Ordering information (predecessor and successor sequence numbers)
     • The memory system reconstructs the correct order.
        • Wave number + sequence number provide a total order.
     • Works with your favorite speculative memory system, or a store buffer.
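The reconstruction step can be sketched as a tiny store buffer (an illustration with simplifying assumptions: the sketch assumes densely numbered operations within a wave, whereas the slide's predecessor/successor links exist precisely to handle gaps introduced by branches): requests arrive out of order, and the buffer commits them in the total order given by (wave number, sequence number).

```python
# Sketch of wave-ordered memory: requests arrive in any order, each tagged
# (wave, seq); the buffer replays them to memory in program order.
import heapq

class WaveOrderedBuffer:
    def __init__(self):
        self.pending = []          # min-heap keyed on (wave, seq)
        self.next_key = (0, 0)     # next request allowed to touch memory
        self.log = []              # committed operations, in program order

    def arrive(self, wave, seq, op):
        heapq.heappush(self.pending, (wave, seq, op))
        # Drain every buffered request whose turn has now come.
        while self.pending and self.pending[0][:2] == self.next_key:
            w, s, ready_op = heapq.heappop(self.pending)
            self.log.append(ready_op)
            # Simplification: assume dense sequence numbers. The real
            # scheme uses successor pointers to find the next operation.
            self.next_key = (w, s + 1)

buf = WaveOrderedBuffer()
# Requests issue out of order; the buffer reconstructs program order.
for wave, seq, op in [(0, 1, "load A"), (0, 0, "store A"), (0, 2, "store B")]:
    buf.arrive(wave, seq, op)
print(buf.log)   # ['store A', 'load A', 'store B']
```

This is what makes normal memory semantics possible on a dataflow machine: the dataflow graph imposes no order on the two accesses to A, but the tags let the memory system recover the programmer's intended order anyway.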

  12. WaveScalar benefits
     • Expose everything about the program:
        • Data dependencies
        • Memory order
     • Instructions manipulate wave numbers.
     • Multiple, parallel sequences of operations are possible.
        • Synchronization
        • Concurrency
        • Communication

  13. The WaveCache
     • The I-cache is the processor.
     (Diagram: WaveCache grid backed by an L2 cache.)

  14. WaveCache
     • Long distance communication:
        • Dynamic routing
        • Grid-based network
        • 1-2 cycles/domain
     • Traditional cache coherence.
     • Normal memory hierarchy.
     • 16K instructions.
     (Diagram: a processing element (FU, decode, config logic, inputs/outputs, flow control) grouped into domains and clusters, with a D$ + store buffer and an L2 cache.)

  15. Current results
     • Compiled SPEC/Mediabench with the DEC cc compiler (-O4 -unroll 16).
     • Binary translator/compiler from Alpha AXP to WaveScalar.
     • Timing/execution-based simulation.
     • Results in Alpha instructions per cycle (AIPC).

  16. Comparison architectures
     • Superscalar
        • 16-wide, 16-ported cache, 1024-entry issue window, 1024 registers, gshare branch predictor
        • 15-stage pipeline
        • Perfect cache
     • WaveCache
        • ~2000 processing elements
        • 16 elements/domain
        • Perfect cache

  17. WaveScalar vs. Superscalar
     • 2.8x faster.
     • Not counting clock rate improvements.
     (Chart: AIPC for WaveScalar vs. superscalar on vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft.)

  18. Cache replacement
     • Not all the instructions will fit.
     • WaveCache miss: the destination instruction is not present.
        • Evict/load an instruction (flush/load queues).
        • Instructions volunteer for removal.
     • Location is important.
        • Normal hashing won’t work.
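The miss-and-replace behavior can be sketched as follows (an illustration only: the slide does not specify the eviction policy, so this sketch substitutes a longest-idle heuristic as a stand-in for "instructions volunteer for removal", and it ignores the placement constraints the last bullet warns about):

```python
# Sketch of WaveCache instruction replacement: the grid holds a bounded
# working set of instructions; a token destined for an unmapped instruction
# triggers a miss, and an idle instruction is evicted to make room.

class WaveCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = {}     # instruction -> last-used timestamp
        self.clock = 0
        self.misses = 0

    def dispatch(self, instr):
        self.clock += 1
        if instr not in self.resident:          # WaveCache miss
            self.misses += 1
            if len(self.resident) >= self.capacity:
                # Evict the longest-idle instruction: our stand-in for
                # "instructions volunteer for removal".
                victim = min(self.resident, key=self.resident.get)
                del self.resident[victim]
            # Load the missing instruction into the grid (not modeled).
        self.resident[instr] = self.clock

cache = WaveCache(capacity=2)
for instr in ["add", "mul", "add", "ld", "mul"]:
    cache.dispatch(instr)
print(cache.misses)   # 4
```

The toy trace shows why thrashing matters (next slide): with a working set larger than the cache, nearly every dispatch misses, which is why dynamic mapping rather than fixed hashing pays off.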

  19. Cache size
     • Thrashing is dangerous.
     • Dynamic mapping is a big win.
     (Chart: normalized performance vs. cache size, 10 to 10,000 on a log scale, for vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft.)

  20. Speculation
     • Speculation helps.
     • 2.4x average on WaveScalar with both combined.
     • This is gravy!!
     (Chart: AIPC per benchmark with perfect branch prediction, perfect memory disambiguation, and both.)

  21. Future work
     • Hardware implementation
        • À la the Bathysphere
     • Compiler issues
        • Memory parallelism
     • More than von Neumann emulation
        • Vector
        • Streaming
        • WaveScalar is an ISA for writing architectures.
     • Operating system issues
        • What is a context switch?
        • What is a system call?

  22. Future work
     • Online placement optimization
        • Simulated annealing
     • Defect tolerance
        • Hard and soft faults
     • WaveCache as a computer system
        • WaveScalar everything (graphics, I/O, CPU, keyboard, hard drive)
        • Uniform namespace for a computer
        • Adaptation at load time

  23. Conclusion
     • Decentralized computing will let you rest easy in 2016!
     • WaveScalar and the WaveCache
        • Dataflow with normal memory!!
        • Outperforms an OOO superscalar by 2.8x
        • Feasible now and in 2016
     • Enormous opportunities for future research
