WaveScalar (Winter 2006, CSE 548)
Dataflow machine:
• good at exploiting ILP
• dataflow parallelism + traditional coarser-grain parallelism
• cheap thread management
• memory ordering enforced through wave-ordered memory
WaveScalar Motivation
• increasing disparity between computation (fast transistors) & communication (long wires)
• increasing circuit complexity
• decreasing fabrication reliability
Monolithic von Neumann Processors
A phenomenal success today. But in 2016?
• Performance: centralized processing & control, e.g., operand broadcast networks
• Complexity: 40-75% of "design" time is design verification
• Defect tolerance: 1 flaw -> paperweight
WaveScalar Executive Summary
Distributed microarchitecture:
• hundreds of PEs
• dataflow execution – no centralized control
• short point-to-point communication
• organized hierarchically for fast communication between neighboring PEs
• defect tolerance – route around a bad PE
Low design complexity through simple, identical PEs:
• design one & stamp out thousands
Processing Element
[Figure: PE microarchitecture with distributed tag matching; 2 PEs form a pod]
Domain
[Figure: a domain, a group of pods of PEs]
Cluster
[Figure: a cluster, built from multiple domains]
Whole Chip
• can hold 32K instructions
• long-distance communication: dynamic routing on a grid-based network, 2-cycle hop per cluster
• normal memory hierarchy
• traditional directory-based cache coherence
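A minimal sketch of the latency implications of that grid network. The 2-cycle hop figure is from the slide; the dimension-order (Manhattan-distance) routing policy and the grid coordinates are assumptions for illustration.

```python
HOP_CYCLES = 2  # per-cluster hop latency (from the slide)

def route_latency(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Cycles to deliver an operand between clusters at grid coordinates
    src and dst, assuming minimal (Manhattan-distance) dynamic routing."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return hops * HOP_CYCLES

# Example: on a 4x4-cluster chip, corner to corner is 6 hops = 12 cycles,
# which is why the placer tries to keep communicating instructions close.
print(route_latency((0, 0), (3, 3)))  # -> 12
```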
WaveScalar Execution Model
Dataflow: place instructions in PEs to maximize data locality & instruction-level parallelism.
• instruction placement algorithm based on a performance model that captures the conflicting goals
• depth-first traversal of the dataflow graph makes chains of dependent instructions
• chains are broken into segments
• segments are snaked across the chip on demand
• K-loop bounding to prevent instruction "explosion"
Instructions communicate values directly (point-to-point). A sketch of the traversal appears below.
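A minimal sketch of the chain-building step, assuming the dataflow graph is a dict mapping each instruction to its consumers; SEGMENT_SIZE, the graph encoding, and the greedy visit order are assumptions — the real placer also weighs a performance model balancing locality against parallelism.

```python
SEGMENT_SIZE = 8  # assumed capacity of one PE's instruction store

def dfs_chains(graph: dict[str, list[str]]) -> list[list[str]]:
    """Return segments of dependent instructions from a depth-first walk.
    graph maps an instruction to the instructions consuming its result."""
    visited, segments, current = set(), [], []

    def visit(node: str) -> None:
        if node in visited:
            return
        visited.add(node)
        current.append(node)
        if len(current) == SEGMENT_SIZE:      # segment full: start a new one
            segments.append(current.copy())
            current.clear()
        for consumer in graph.get(node, []):  # stay on the dependent chain
            visit(consumer)

    roots = [n for n in graph if all(n not in succs for succs in graph.values())]
    for root in roots:
        visit(root)
    if current:
        segments.append(current)
    return segments

# Example: a and b feed c; c feeds d. Dependent ops end up in one segment.
g = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
print(dfs_chains(g))  # -> [['a', 'c', 'd', 'b']]
```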
WaveScalar Instruction Placement
[Figure: a dataflow graph's instructions laid out across the chip's PEs]
WaveScalar Example
Source: A[j + i*i] = i; b = A[i*j];
[Figure: the dataflow graph – inputs i, A, j feed two multiplies and two adds computing the store and load addresses, then a Store and a Load – shown executing step by step]
Global load-store ordering issue: the Store and Load may touch the same location (e.g., with i = j = 0 both address A[0]), yet pure dataflow imposes no order between them.
Wave-ordered Memory
• compiler annotates memory operations with <predecessor, sequence #, successor> triples; '?' marks a link it cannot know statically (e.g., across a branch)
• memory requests can be sent in any order
• hardware reconstructs the correct order
[Figure: an annotated control-flow fragment – Load <2,3,4>; Store <3,4,?>; then either Store <4,5,6> followed by Load <5,6,8>, or Load <4,7,8>; the paths rejoin at Store <?,8,9>]
Wave-ordering Example
[Figure: on the taken path, requests Load <2,3,4>, Store <3,4,?>, Load <4,7,8>, Store <?,8,9> reach the store buffer out of order; the annotations let the buffer replay them in sequence order]
Wave-ordered Memory
• waves are loop-free sections of the dataflow graph
• each dynamic wave has a wave number
• the wave number is incremented between waves
Ordering memory:
• across waves, by wave number
• within a wave, by sequence number
A sketch of the store buffer's reordering rule follows below.
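A minimal sketch of that rule in Python, assuming the <predecessor, sequence #, successor> annotations above and a single wave; the class name, the print-based "issue", and the buffering scheme are illustrative, not the hardware's actual interface.

```python
from typing import Optional

class WaveOrderedBuffer:
    """Replays memory requests of one wave in program order."""

    def __init__(self, first_seq: int) -> None:
        self.pending: dict[int, tuple] = {}   # seq# -> (pred, succ, op)
        self.next_seq: Optional[int] = first_seq
        self.last_issued: Optional[int] = None

    def arrive(self, pred, seq, succ, op) -> None:
        self.pending[seq] = (pred, succ, op)
        # A '?' successor is resolved by a later arrival that names the
        # last issued operation as its predecessor.
        if self.next_seq is None and pred == self.last_issued:
            self.next_seq = seq
        self.drain()

    def drain(self) -> None:
        # Issue operations to memory strictly in sequence-number order.
        while self.next_seq is not None and self.next_seq in self.pending:
            pred, succ, op = self.pending.pop(self.next_seq)
            print(f"issue {op} (seq {self.next_seq})")
            self.last_issued = self.next_seq
            if succ == "?":   # check for an already-buffered resolver
                succ = next((s for s, (p, _, _) in self.pending.items()
                             if p == self.last_issued), None)
            self.next_seq = succ

# The taken path of the slide's example, arriving out of order:
buf = WaveOrderedBuffer(first_seq=3)
buf.arrive("?", 8, 9, "Store")   # <?,8,9> must wait
buf.arrive(3, 4, "?", "Store")   # <3,4,?> must wait for seq 3
buf.arrive(2, 3, 4, "Load")      # issues seq 3, then seq 4
buf.arrive(4, 7, 8, "Load")      # resolves the '?'; issues seq 7, then 8
```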
WaveScalar Tag-matching
WaveScalar tag:
• thread identifier
• wave number
Token = tag & value, written <ThreadID:Wave#>.value
Example: tokens <2:5>.3 and <2:5>.6 enter a + instruction and produce <2:5>.9.
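A minimal sketch of the matching rule for a two-input instruction, assuming tokens are (tag, value) pairs; the table layout and port numbering are illustrative — the hardware distributes matching across the PEs.

```python
import operator

Tag = tuple[int, int]   # (thread id, wave number)

class MatchingTable:
    """Collects operand tokens per tag; fires when both inputs match."""

    def __init__(self, op) -> None:
        self.op = op
        self.waiting: dict[Tag, dict[int, int]] = {}

    def token(self, tag: Tag, port: int, value: int):
        """Deliver one operand; return an output token once the tag matches."""
        slots = self.waiting.setdefault(tag, {})
        slots[port] = value
        if len(slots) == 2:        # both operands carry the same tag: fire
            del self.waiting[tag]
            return (tag, self.op(slots[0], slots[1]))
        return None

add = MatchingTable(operator.add)
print(add.token((2, 5), 0, 3))   # None - still waiting for the other operand
print(add.token((2, 5), 1, 6))   # ((2, 5), 9), i.e., <2:5>.9 as on the slide
```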
Single-thread Performance
[Figure: performance per area (AIPC/mm²), WaveScalar (WS) vs. out-of-order (OOO), on ammp, art, equake, gzip, mcf, twolf, djpeg, mpeg2encode, rawdaudio, and the average; y-axis from 0 to 0.05]
Multithreading the WaveCache
Architectural support for WaveScalar threads:
• instructions to start & stop memory orderings, i.e., threads
• memory-free synchronization to allow exclusive access to data (TC)
• a fence instruction to make this thread's memory ops visible to others
These combine to build threads of multiple granularities (a sketch of the retagging idea follows below):
• coarse-grain threads: 25-168x over a single thread; 2-16x over CMP, 5-11x over SMT
• fine-grain, dataflow-style threads: 18-242x over a single thread
• combining the two in the same application: 1.6x or 7.9x -> 9x
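A hedged sketch of the fine-grain threading idea: in a tagged dataflow machine, "spawning" a thread amounts to stamping a fresh ThreadID onto the tokens that feed the child's code, so its tokens never match the parent's. The function names, the wave-0 start, and the token shape below are illustrative stand-ins, not WaveScalar's actual opcodes.

```python
def data_to_thread(token, new_tid: int):
    """Retag a (tag, value) token into a child thread's context."""
    _, value = token
    return ((new_tid, 0), value)       # child starts at wave 0 (assumed)

def thread_to_data(token, parent_tag):
    """Return a child's result token to the parent's tag context."""
    _, value = token
    return (parent_tag, value)

# Spawn: retag the argument so child tokens never match parent tokens.
arg = data_to_thread(((2, 5), 42), new_tid=3)    # -> ((3, 0), 42)
# ... child computes under thread id 3 ...
result = thread_to_data(((3, 0), 84), (2, 5))    # -> ((2, 5), 84)
```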
Creating & Terminating a Thread
[Figure: the instruction sequences that spawn and tear down a thread]

Thread Creation Overhead
[Figure: thread-creation overhead]

Performance of Coarse-grain Parallelism
[Figure: speedups for coarse-grain threads]

Performance of Fine-grain Parallelism
[Figure: speedups for fine-grain, dataflow-style threads]
Building the WaveCache
RTL-level implementation:
• some didn't believe it could be built in a normal-sized chip
• some didn't believe it could achieve a decent cycle time and load-use latencies
• Verilog & Synopsys CAD tools
Different WaveCaches for different applications:
• 1 cluster (low-cost, low-power, single-thread or embedded): 52 mm² in a 90 nm process, 3.5 AIPC on Splash-2
• 16 clusters (multiple threads, higher performance): 436 mm², 15 AIPC
Board-level FPGA implementation:
• OS & real-application simulations