Dataflow: The Road Less Complex
Steven Swanson, Andrew Schwerin, Ken Michelson, Mark Oskin
University of Washington
Sponsored by NSF and Intel
Things to keep you up at night (~2016)
• Opportunities
  • 8 billion transistors; 28 GHz
  • One DRAM chip will exhaust a 32-bit address space
  • 120 P4s OR 200,000 RISC-1s will fit on a die (chips are networks)
• Challenges
  • It will take 36 cycles to cross the die (chips are networks)
  • For reasonable yields, only 1 transistor in 24 billion may be broken, if one flaw breaks a chip (fault tolerance is required)
  • 7 years and 10,000 people (simpler designs and better tools are needed)
Outline
• Monolithic von Neumann processing
• WaveScalar
• Results
• Future work and conclusions
Monolithic Processing
• Von Neumann machines are simple.
• We know how to build them.
• 2016?
  • Communication
  • Fault tolerance
  • Complexity
  • Performance
Decentralized Processing
☺ Communication
☺ Fault tolerance
☺ Complexity
☺ Performance
The Problem with Von Neumann
• Fundamentally centralized.
  • Fetch is the key.
  • There is only one program counter.
• There is no parallelism in the model.
• The alternative is dataflow.
Dataflow has been done before...
• Dataflow is not new.
  • Operations fire when their data is available (see the firing-rule sketch below).
  • No program counter.
  • No false control dependencies.
  • Exposes massive parallelism.
• But...
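To make the firing rule concrete, here is a minimal dataflow interpreter sketch in Python. The `Node` class, the work-list loop, and the example graph are all invented for illustration; real dataflow machines fire ready instructions in parallel rather than scanning a work list.

```python
class Node:
    """A dataflow instruction: fires once all of its inputs have arrived."""
    def __init__(self, op, n_inputs):
        self.op = op
        self.n_inputs = n_inputs
        self.inputs = {}          # port index -> value
        self.consumers = []       # (node, port) pairs fed by our output

    def ready(self):
        return len(self.inputs) == self.n_inputs

def run(nodes, initial_tokens):
    # Seed the graph, then fire any node whose operands are all present.
    # There is no program counter: execution order is set by data arrival.
    for node, port, value in initial_tokens:
        node.inputs[port] = value
    work = [n for n in nodes if n.ready()]
    while work:
        node = work.pop()
        result = node.op(*(node.inputs[p] for p in sorted(node.inputs)))
        for consumer, port in node.consumers:
            consumer.inputs[port] = result
            if consumer.ready():
                work.append(consumer)
        node.inputs = {}          # consume the tokens
    return result

# Example: compute (a + b) * (a - b); note that + and - have no
# ordering between them, exposing the parallelism for free.
add = Node(lambda x, y: x + y, 2)
sub = Node(lambda x, y: x - y, 2)
mul = Node(lambda x, y: x * y, 2)
add.consumers = [(mul, 0)]
sub.consumers = [(mul, 1)]
print(run([add, sub, mul], [(add, 0, 7), (add, 1, 3),
                            (sub, 0, 7), (sub, 1, 3)]))  # 40
```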
...it had issues
• It never executed mainstream code.
  • Special languages
    • No mutable data structures
    • No aliasing
    • Functional
  • Strange memory semantics
• There are scalability concerns.
  • Large, slow token stores
The WaveScalar Model
• WaveScalar is memory-centric dataflow.
• Compared to von Neumann:
  • There is no fetch.
• Compared to traditional dataflow:
  • Memory ordering is a first-class citizen.
  • Normal memory semantics.
  • No I-structures or special languages.
  • We run SPEC.
What is a wave?
• A maximal loop-free section of the dataflow graph.
  • May contain branches and joins.
  • Waves are bigger than hyperblocks.
• Each dynamic wave has a wave number.
• Every value carries a wave number (see the tagging sketch below).
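A minimal sketch of wave-number tagging in Python; the `Token` type and the matching rule described in the comment are illustrative assumptions about how tagged dataflow values distinguish loop iterations, not the actual hardware encoding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    """A dataflow value tagged with the dynamic wave that produced it.

    Instructions combine only operands whose wave numbers match, so
    values from different loop iterations (different dynamic waves)
    can be in flight at once without being confused with each other.
    """
    wave: int     # dynamic wave number, bumped each time a wave re-executes
    value: int

# Two iterations of a loop body in flight simultaneously: the same
# static instruction's outputs, distinguished purely by the wave tag.
i0 = Token(wave=0, value=42)
i1 = Token(wave=1, value=43)
assert i0.wave != i1.wave
```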
Maintaining Memory Order
• Loads and stores can issue requests to memory in any order. Each request carries:
  • A wave number.
  • An operation sequence number.
  • Ordering information (predecessor and successor sequence numbers).
• The memory system reconstructs the correct order (a sketch follows this slide).
  • Wave number + sequence numbers provide a total order.
  • Use your favorite speculative memory system, or a store buffer.
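A minimal sketch of the reconstruction step, assuming a simple store buffer that releases requests to memory in (wave, sequence) order; the class and method names are invented for illustration. One real subtlety is elided: the scheme on the slide uses each request's predecessor/successor links to skip sequence numbers that will never arrive (e.g., operations on an untaken branch path), while this sketch assumes every sequence number is eventually issued.

```python
import heapq

class WaveOrderedBuffer:
    """Releases memory requests in program order, reconstructed from
    (wave, seq) tags even though requests arrive in any order."""
    def __init__(self):
        self.pending = []      # min-heap keyed on (wave, seq)
        self.next = (0, 0)     # next (wave, seq) we may release

    def arrive(self, wave, seq, op, last_in_wave=False):
        heapq.heappush(self.pending, (wave, seq, op, last_in_wave))
        released = []
        # Drain every request that is now contiguous with program order.
        while self.pending and self.pending[0][:2] == self.next:
            wave, seq, op, last = heapq.heappop(self.pending)
            released.append(op)             # safe to send to memory now
            self.next = (wave + 1, 0) if last else (wave, seq + 1)
        return released

buf = WaveOrderedBuffer()
print(buf.arrive(0, 1, "store A", last_in_wave=True))  # [] (seq 0 missing)
print(buf.arrive(0, 0, "load B"))   # ['load B', 'store A']; wave 1 is next
```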
WaveScalar benefits
• Exposes everything about the program:
  • Data dependencies
  • Memory order
• Instructions manipulate wave numbers.
• Multiple, parallel sequences of operations are possible:
  • Synchronization
  • Concurrency
  • Communication
The WaveCache
• The I-cache is the processor.
[Diagram: a grid of processing elements backed by the L2 cache]
WaveCache Processing Element
• Long-distance communication
  • Dynamic routing
  • Grid-based network
  • 1-2 cycles/domain (see the latency sketch below)
• Traditional cache coherence
• Normal memory hierarchy
• 16K instructions
[Diagram: a cluster of PE domains; each PE contains flow control, inputs, decode, an FU, config. logic, and outputs; each cluster holds a D$ + store buffer and connects to the L2 cache]
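To make the hop cost concrete, here is a back-of-the-envelope latency sketch, assuming simple dimension-ordered (Manhattan) routing over the 2D grid of domains and taking the slide's 1-2 cycles/domain as a fixed per-hop cost; the function and coordinate scheme are assumptions for illustration.

```python
def route_latency_cycles(src, dst, cycles_per_hop=2):
    """Estimate cycles for a message on the WaveCache grid network.

    Assumes dimension-ordered routing on a 2D grid of domains and a
    fixed cost per domain hop. src and dst are (row, col) coordinates.
    """
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])   # Manhattan distance
    return hops * cycles_per_hop

# Neighboring domains are cheap; corner-to-corner on a large grid is
# not, which is why placing dependent instructions close together matters.
print(route_latency_cycles((0, 0), (0, 1)))    # 2 cycles
print(route_latency_cycles((0, 0), (7, 7)))    # 28 cycles
```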
Current results
• Compiled SPEC/Mediabench
  • DEC cc compiler (-O4 -unroll 16)
• Binary translator/compiler
  • From Alpha AXP to WaveScalar
• Timing/execution-based simulation
• Results reported in Alpha instructions per cycle (AIPC)
Comparison architectures
• Superscalar
  • 16-wide, 16-ported cache, 1024-entry issue window, 1024 registers, gshare branch predictor
  • 15-stage pipeline
  • Perfect cache
• WaveCache
  • ~2000 processing elements
  • 16 elements/domain
  • Perfect cache
WaveScalar vs. Superscalar
• 2.8x faster
  • Not counting clock rate improvements.
[Plot: AIPC for WaveScalar vs. the superscalar on vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft]
Cache replacement
• Not all the instructions will fit.
• WaveCache miss: the destination instruction is not present.
  • Evict/load an instruction (flush/load queues).
  • Instructions volunteer for removal.
• Location is important.
  • Normal hashing won't work (see the placement sketch below).
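A toy sketch of why hashed placement fails here: hashing ignores operand traffic, while a locality-aware mapper keeps an instruction within a few domain hops of its producers. The functions, scoring rule, and coordinates are invented for illustration, not the actual WaveCache policy.

```python
def hash_place(instr_id, grid_size):
    """Naive hashing: scatters instructions, ignoring who talks to whom."""
    h = hash(instr_id)
    return (h % grid_size, (h // grid_size) % grid_size)

def locality_place(producer_coords, free_pes):
    """Pick the free PE minimizing total Manhattan distance to the
    instruction's producers, keeping operand traffic local."""
    def cost(pe):
        return sum(abs(pe[0] - p[0]) + abs(pe[1] - p[1])
                   for p in producer_coords)
    return min(free_pes, key=cost)

# An instruction fed by two producers mapped at (2, 2) and (2, 3):
free = [(0, 7), (2, 1), (5, 5)]
print(locality_place([(2, 2), (2, 3)], free))   # (2, 1): closest to both
```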
Cache size
• Thrashing is dangerous.
• Dynamic mapping is a big win.
[Plot: performance normalized to the full-size WaveCache vs. cache size (log scale, ~10 to 10,000 instructions) for vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft]
Speculation
• Speculation helps.
  • 2.4x on average with both combined.
• This is gravy!
[Plot: WaveScalar AIPC with a perfect branch predictor, perfect memory disambiguation, and both, for vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft]
Future work
• Hardware implementation
  • A la the Bathysphere
• Compiler issues
  • Memory parallelism
• More than von Neumann emulation
  • Vector
  • Streaming
  • WaveScalar is an ISA for writing architectures.
• Operating system issues
  • What is a context switch?
  • What is a system call?
Future work (cont.)
• Online placement optimization
  • Simulated annealing
• Defect tolerance
  • Hard and soft faults
• The WaveCache as a computer system
  • WaveScalar everything (graphics, I/O, CPU, keyboard, hard drive)
  • A uniform namespace for a computer
  • Adaptation at load time
Conclusion
• Decentralized computing will let you rest easy in 2016!
• WaveScalar and the WaveCache
  • Dataflow with normal memory!
  • Outperforms an OOO superscalar by 2.8x
  • Feasible now and in 2016
• Enormous opportunities for future research