WaveScalar (Winter 2006, CSE 548)
Dataflow machine:
• good at exploiting ILP
• dataflow parallelism + traditional coarser-grain parallelism
• cheap thread management
• memory ordering enforced through wave-ordered memory
WaveScalar Motivation
• increasing disparity between computation (fast transistors) & communication (long wires)
• increasing circuit complexity
• decreasing fabrication reliability
Monolithic von Neumann Processors
A phenomenal success today. But in 2016?
• Performance: centralized processing & control, e.g., operand broadcast networks
• Complexity: 40-75% of "design" time is design verification
• Defect tolerance: 1 flaw -> paperweight
WaveScalar Executive Summary
Distributed microarchitecture:
• hundreds of PEs
• dataflow execution – no centralized control
• short point-to-point communication
• organized hierarchically for fast communication between neighboring PEs
• defect tolerance – route around a bad PE
Low design complexity through simple, identical PEs:
• design one & stamp out thousands
Processing Element
[Figure: PE microarchitecture with distributed tag matching; 2 PEs form a pod]
Domain
[Figure: a domain, a group of pods of PEs]
Cluster
[Figure: a cluster, built from multiple domains]
Whole Chip
• can hold 32K instructions
• long-distance communication: dynamic routing on a grid-based network, 2-cycle hop per cluster
• normal memory hierarchy
• traditional directory-based cache coherence
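A minimal sketch of the latency implications of that grid network. The 2-cycle hop figure is from the slide; the dimension-order (Manhattan-distance) routing policy and the grid coordinates are assumptions for illustration.

```python
HOP_CYCLES = 2  # per-cluster hop latency (from the slide)

def route_latency(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Cycles to deliver an operand between clusters at grid coordinates
    src and dst, assuming minimal (Manhattan-distance) dynamic routing."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return hops * HOP_CYCLES

# Example: on a 4x4-cluster chip, corner to corner is 6 hops = 12 cycles,
# which is why the placer tries to keep communicating instructions close.
print(route_latency((0, 0), (3, 3)))  # -> 12
```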
WaveScalar Execution Model
Dataflow: place instructions in PEs to maximize data locality & instruction-level parallelism.
• instruction placement algorithm based on a performance model that captures the conflicting goals
• depth-first traversal of the dataflow graph makes chains of dependent instructions
• chains are broken into segments
• segments are snaked across the chip on demand
• K-loop bounding to prevent instruction "explosion"
Instructions communicate values directly (point-to-point). A sketch of the traversal appears below.
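A minimal sketch of the chain-building step, assuming the dataflow graph is a dict mapping each instruction to its consumers; SEGMENT_SIZE, the graph encoding, and the greedy visit order are assumptions — the real placer also weighs a performance model balancing locality against parallelism.

```python
SEGMENT_SIZE = 8  # assumed capacity of one PE's instruction store

def dfs_chains(graph: dict[str, list[str]]) -> list[list[str]]:
    """Return segments of dependent instructions from a depth-first walk.
    graph maps an instruction to the instructions consuming its result."""
    visited, segments, current = set(), [], []

    def visit(node: str) -> None:
        if node in visited:
            return
        visited.add(node)
        current.append(node)
        if len(current) == SEGMENT_SIZE:      # segment full: start a new one
            segments.append(current.copy())
            current.clear()
        for consumer in graph.get(node, []):  # stay on the dependent chain
            visit(consumer)

    roots = [n for n in graph if all(n not in succs for succs in graph.values())]
    for root in roots:
        visit(root)
    if current:
        segments.append(current)
    return segments

# Example: a and b feed c; c feeds d. Dependent ops end up in one segment.
g = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
print(dfs_chains(g))  # -> [['a', 'c', 'd', 'b']]
```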
WaveScalar Instruction Placement
[Figure: a dataflow graph's instructions laid out across the chip's PEs]
WaveScalar Example
Source: A[j + i*i] = i; b = A[i*j];
[Figure: the dataflow graph – inputs i, A, j feed two multiplies and two adds computing the store and load addresses, then a Store and a Load – shown executing step by step]
Global load-store ordering issue: the Store and Load may touch the same location (e.g., with i = j = 0 both address A[0]), yet pure dataflow imposes no order between them.
Wave-ordered Memory
• compiler annotates memory operations with <predecessor, sequence #, successor> triples; '?' marks a link it cannot know statically (e.g., across a branch)
• memory requests can be sent in any order
• hardware reconstructs the correct order
[Figure: an annotated control-flow fragment – Load <2,3,4>; Store <3,4,?>; then either Store <4,5,6> followed by Load <5,6,8>, or Load <4,7,8>; the paths rejoin at Store <?,8,9>]
Wave-ordering Example
[Figure: on the taken path, requests Load <2,3,4>, Store <3,4,?>, Load <4,7,8>, Store <?,8,9> reach the store buffer out of order; the annotations let the buffer replay them in sequence order]
Wave-ordered Memory
• waves are loop-free sections of the dataflow graph
• each dynamic wave has a wave number
• the wave number is incremented between waves
Ordering memory:
• across waves, by wave number
• within a wave, by sequence number
A sketch of the store buffer's reordering rule follows below.
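A minimal sketch of that rule in Python, assuming the <predecessor, sequence #, successor> annotations above and a single wave; the class name, the print-based "issue", and the buffering scheme are illustrative, not the hardware's actual interface.

```python
from typing import Optional

class WaveOrderedBuffer:
    """Replays memory requests of one wave in program order."""

    def __init__(self, first_seq: int) -> None:
        self.pending: dict[int, tuple] = {}   # seq# -> (pred, succ, op)
        self.next_seq: Optional[int] = first_seq
        self.last_issued: Optional[int] = None

    def arrive(self, pred, seq, succ, op) -> None:
        self.pending[seq] = (pred, succ, op)
        # A '?' successor is resolved by a later arrival that names the
        # last issued operation as its predecessor.
        if self.next_seq is None and pred == self.last_issued:
            self.next_seq = seq
        self.drain()

    def drain(self) -> None:
        # Issue operations to memory strictly in sequence-number order.
        while self.next_seq is not None and self.next_seq in self.pending:
            pred, succ, op = self.pending.pop(self.next_seq)
            print(f"issue {op} (seq {self.next_seq})")
            self.last_issued = self.next_seq
            if succ == "?":   # check for an already-buffered resolver
                succ = next((s for s, (p, _, _) in self.pending.items()
                             if p == self.last_issued), None)
            self.next_seq = succ

# The taken path of the slide's example, arriving out of order:
buf = WaveOrderedBuffer(first_seq=3)
buf.arrive("?", 8, 9, "Store")   # <?,8,9> must wait
buf.arrive(3, 4, "?", "Store")   # <3,4,?> must wait for seq 3
buf.arrive(2, 3, 4, "Load")      # issues seq 3, then seq 4
buf.arrive(4, 7, 8, "Load")      # resolves the '?'; issues seq 7, then 8
```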
WaveScalar Tag-matching
WaveScalar tag:
• thread identifier
• wave number
Token = tag & value, written <ThreadID:Wave#>.value
Example: tokens <2:5>.3 and <2:5>.6 enter a + instruction and produce <2:5>.9.
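A minimal sketch of the matching rule for a two-input instruction, assuming tokens are (tag, value) pairs; the table layout and port numbering are illustrative — the hardware distributes matching across the PEs.

```python
import operator

Tag = tuple[int, int]   # (thread id, wave number)

class MatchingTable:
    """Collects operand tokens per tag; fires when both inputs match."""

    def __init__(self, op) -> None:
        self.op = op
        self.waiting: dict[Tag, dict[int, int]] = {}

    def token(self, tag: Tag, port: int, value: int):
        """Deliver one operand; return an output token once the tag matches."""
        slots = self.waiting.setdefault(tag, {})
        slots[port] = value
        if len(slots) == 2:        # both operands carry the same tag: fire
            del self.waiting[tag]
            return (tag, self.op(slots[0], slots[1]))
        return None

add = MatchingTable(operator.add)
print(add.token((2, 5), 0, 3))   # None - still waiting for the other operand
print(add.token((2, 5), 1, 6))   # ((2, 5), 9), i.e., <2:5>.9 as on the slide
```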
Single-thread Performance
[Figure: performance per area (AIPC/mm²), WaveScalar (WS) vs. out-of-order (OOO), on ammp, art, equake, gzip, mcf, twolf, djpeg, mpeg2encode, rawdaudio, and the average; y-axis from 0 to 0.05]
Multithreading the WaveCache
Architectural support for WaveScalar threads:
• instructions to start & stop memory orderings, i.e., threads
• memory-free synchronization to allow exclusive access to data (TC)
• a fence instruction to make this thread's memory ops visible to others
These combine to build threads of multiple granularities (a sketch of the retagging idea follows below):
• coarse-grain threads: 25-168x over a single thread; 2-16x over CMP, 5-11x over SMT
• fine-grain, dataflow-style threads: 18-242x over a single thread
• combining the two in the same application: 1.6x or 7.9x -> 9x
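A hedged sketch of the fine-grain threading idea: in a tagged dataflow machine, "spawning" a thread amounts to stamping a fresh ThreadID onto the tokens that feed the child's code, so its tokens never match the parent's. The function names, the wave-0 start, and the token shape below are illustrative stand-ins, not WaveScalar's actual opcodes.

```python
def data_to_thread(token, new_tid: int):
    """Retag a (tag, value) token into a child thread's context."""
    _, value = token
    return ((new_tid, 0), value)       # child starts at wave 0 (assumed)

def thread_to_data(token, parent_tag):
    """Return a child's result token to the parent's tag context."""
    _, value = token
    return (parent_tag, value)

# Spawn: retag the argument so child tokens never match parent tokens.
arg = data_to_thread(((2, 5), 42), new_tid=3)    # -> ((3, 0), 42)
# ... child computes under thread id 3 ...
result = thread_to_data(((3, 0), 84), (2, 5))    # -> ((2, 5), 84)
```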
Creating & Terminating a Thread
[Figure: the instruction sequences that spawn and tear down a thread]

Thread Creation Overhead
[Figure: thread-creation overhead]

Performance of Coarse-grain Parallelism
[Figure: speedups for coarse-grain threads]

Performance of Fine-grain Parallelism
[Figure: speedups for fine-grain, dataflow-style threads]
Building the WaveCache
RTL-level implementation:
• some didn't believe it could be built in a normal-sized chip
• some didn't believe it could achieve a decent cycle time and load-use latencies
• Verilog & Synopsys CAD tools
Different WaveCaches for different applications:
• 1 cluster (low-cost, low-power, single-thread or embedded): 52 mm² in a 90 nm process, 3.5 AIPC on Splash-2
• 16 clusters (multiple threads, higher performance): 436 mm², 15 AIPC
Board-level FPGA implementation:
• OS & real-application simulations