A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms


  1. A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms. Jian Weng, Sihao Liu, Zhengrong Wang, Vidushi Dadu, Tony Nowatzki. University of California, Los Angeles. Feb 27th, 2020.

  2. Inductive Kernels
  [figure: a function F applied repeatedly to successively smaller triangular sub-matrices]
  • Inductive kernels: QR, SVD, Cholesky, Solver. Their inner-loop bounds depend on the outer induction variable.
  • Challenges:
    • Poor vectorization (see the sketch after this slide)
    • Too small to be multi-threaded
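  To make "poor vectorization" concrete, here is a small illustrative C program (my own, not from the talk; the problem size and vector width are arbitrary) that counts how many iterations of a shrinking inner loop land in full vector-width groups versus ragged remainders:

    #include <stdio.h>

    /* Illustrative only: for an inductive inner loop whose trip count shrinks
     * from n-1 down to 0, count how many iterations fall in full vector-width
     * groups vs. ragged remainders that need scalar or masked execution. */
    int main(void) {
        const int n = 16, vw = 4;           /* toy problem size and vector width */
        int full = 0, remainder = 0;
        for (int j = 0; j < n; ++j) {
            int trips = n - j - 1;          /* inner trip count at outer step j */
            full      += (trips / vw) * vw;
            remainder += trips % vw;
        }
        printf("full-vector iterations: %d, remainder iterations: %d (%.0f%% ragged)\n",
               full, remainder, 100.0 * remainder / (full + remainder));
        return 0;
    }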

  3. General-Purpose Processor Performance. [Bar chart: % of the ideal performance achieved by the DSP, CPU, and GPU on the inductive kernels; the y-axis spans 0% to 25%.]

  4. Our Goal: An Efficient, Flexible, and Specialized Architecture
  • Base design: a spatial architecture
  • Specialize for inductive idioms:
    • Hybridizing PEs, inductive control, and implicit vector padding
  • Scale up with multiple lanes
  • REVEL: Reconfigurable Vector Lanes
  • 3.4x speedup over existing spatial architectures and 17x over general-purpose processors
  • 2x the power and half the area compared with an ASIC collection
  [figure: two REVEL lanes, each with a control core, private scratchpad, sync buffers, and XFER unit, sharing a scratchpad]

  5. Outline
  • Background: "Dataflow" or "Systolic"? A tradeoff between cost and flexibility
  • Challenge 1: Synchronous Coordination
  • Challenge 2: Overwhelmed Processing Elements
  • Challenge 3: Overwhelmed Coordination
  • Inductive Access
  • Padding the Vectorization
  • REVEL: Reconfigurable Vector Lanes
  • Evaluation

  6. Spatial Architecture
  • Spatial architectures expose their compute resources and on-chip network to the programming interface.
  • Systolic: each PE is dedicated to one instruction, and the timing of data arrival is determined at compile time.
  • Tagged dataflow: multiple instructions share a PE and execution is dynamically scheduled, at roughly 5.8x the area and 4.2x the power of the systolic design.
  [figure: a dependence graph mapped onto a systolic array and onto a tagged-dataflow array]

  7. Spatial Architecture Performance. [Bar chart: % of the ideal performance for systolic and tagged-dataflow spatial architectures; the y-axis spans 0% to 80%.]

  8. Base Design: Systolic Architecture with Decoupled Data Access
  • Arithmetic operations are offloaded onto the spatial architecture.
  • Data accesses are decoupled from compute and coordinated by the control core.
  • Synchronization buffers serve as the interface between dynamically timed memory streams and the statically timed array (sketched below).
  [figure: control core, scratchpad memory, sync buffers, XFER unit, and a small systolic mesh]
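  A minimal C sketch (my own toy model, not REVEL hardware) of the decoupled access/execute idea: an "access" side pushes operands into a small synchronization buffer whenever there is room, and an "execute" side drains one operand per cycle when data is available, so neither side needs to know the other's timing.

    #include <stdio.h>

    #define BUF 4                       /* sync-buffer depth */

    int main(void) {
        double stream[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* data the access unit fetches */
        double fifo[BUF];
        int head = 0, tail = 0, count = 0;              /* sync-buffer state */
        int fetched = 0, executed = 0;
        double acc = 0.0;

        for (int cycle = 0; executed < 8; ++cycle) {
            /* access unit: issue a fetch only if the buffer has room */
            if (fetched < 8 && count < BUF) {
                fifo[tail] = stream[fetched++];
                tail = (tail + 1) % BUF;
                ++count;
            }
            /* execute unit: fire only when an operand is ready */
            if (count > 0) {
                acc += fifo[head];                      /* stand-in for a PE operation */
                head = (head + 1) % BUF;
                --count;
                ++executed;
            }
        }
        printf("sum = %g\n", acc);                      /* prints 36 */
        return 0;
    }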

  9. Challenge 1: Non-uniform Produce/Consume Rate
  The divide result x[j] is produced once but consumed n-j-1 times (produce|consume rate 1|(n-j-1)), and that rate changes with every outer iteration; the sketch below tallies it.

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; ++i)
        x[i] -= x[j] * a[j,i];
    }
  [figure: the divide fed through the sync buffers to a multiply-subtract pair on the mesh]
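  A tiny illustrative C program (mine, not from the talk) that prints the produce/consume counts of the divide result for each outer iteration, showing why no fixed forwarding schedule fits all iterations:

    #include <stdio.h>

    int main(void) {
        const int n = 6;                               /* toy problem size */
        for (int j = 0; j < n; ++j) {
            int produced = 1;                          /* one divide writes x[j]          */
            int consumed = n - j - 1;                  /* inner loop reuses x[j] n-j-1 times */
            printf("j=%d  produce|consume = %d|%d\n", j, produced, consumed);
        }
        return 0;
    }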

  10. Challenge 2: Overwhelmed Processing Elements
  A single kernel such as Cholesky needs many distinct operations (divide, square root, multiply, multiply-subtract), too many to give each one its own dedicated PE:

    for (k = 0; k < n; ++k) {
      inv = 1.0 / a[k,k];
      invsqrt = 1.0 / sqrt(a[k,k]);
      for (j = k; j < n; ++j)
        l[j,k] = a[k,j] * invsqrt;
      for (j = k+1; j < n; ++j)
        for (i = j; i < n; ++i)
          a[j,i] -= a[k,i] * a[k,j] * inv;
    }
  [figure: divide, sqrt, multiply, and multiply-subtract operations mapped onto the mesh]

  11. Challenge 3.1: Overwhelmed Coordination
  A rectangular slice, a[n:m, p:q], can be described in one command, but the triangular pattern needs a fresh, shorter slice (a[j+1:n], a[j+2:n], a[j+3:n], ..., a[n-2:n]) issued on every outer iteration, which overwhelms the coordination core:

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; i += 2) {
        x[i] -= x[j] * a[j,i];
      }
    }

  12. Challenge 3.2: Imperfect Loop Tiling
  Vectorizing the inner loop by two does not tile the shrinking trip count evenly, so the second element must be guarded:

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; i += 2) {
        x[i] -= x[j] * a[j,i];
        if (i + 1 < n) x[i+1] -= x[j] * a[j,i+1];
      }
    }
  Triangular slicing: a[j+1:n], a[j+2:n], a[j+3:n], ..., a[n-2:n]

  13. Outline
  • Spatial Architecture
  • REVEL: Reconfigurable Vector Lanes
    • Specialization 1: Hybridizing processing elements
    • Specialization 2: Coordinating non-uniform dependences
    • Specialization 3: Inductive access intrinsics
    • Specialization 4: Implicit vectorization predication
  • Scalability: Larger Spatial Array or Multiple Lanes?
  • Evaluation

  14. Specialization 1: Hybridizing Systolic and Dataflow
  Annotate each region of the Cholesky kernel with how often it executes, and map it accordingly: the rare O(1) and O(n) operations share the flexible dataflow PEs, while the O(n²) inner loop gets dedicated systolic PEs (tallied in the sketch below).

    for (k = 0; k < n; ++k) {
      inv = 1.0 / a[k,k];                      // O(1) per k
      invsqrt = 1.0 / sqrt(a[k,k]);            // O(1) per k
      for (j = k; j < n; ++j)
        l[j,k] = a[k,j] * invsqrt;             // O(n) per k
      for (j = k+1; j < n; ++j)
        for (i = j; i < n; ++i)
          a[j,i] -= a[k,i] * a[k,j] * inv;     // O(n²) per k
    }
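  An illustrative C tally (my own, with an arbitrary n, not from the talk) of how many times each region of the kernel above executes, which is why only the innermost region merits dedicated systolic PEs:

    #include <stdio.h>

    int main(void) {
        const int n = 64;
        long o1 = 0, on = 0, on2 = 0;
        for (int k = 0; k < n; ++k) {
            o1 += 2;                           /* divide and reciprocal sqrt: O(1) per k */
            on += n - k;                       /* column scaling: O(n) per k             */
            for (int j = k + 1; j < n; ++j)
                on2 += n - j;                  /* trailing update: O(n^2) per k          */
        }
        printf("O(1) ops: %ld, O(n) ops: %ld, O(n^2) ops: %ld\n", o1, on, on2);
        return 0;
    }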

  15. Specialization 2: Coordinating Non-uniform Dependences
  The 1|(n-j-1) produce/consume relationship between the divide and the multiply-subtract is made explicit, so the architecture can coordinate the dependence directly rather than through per-iteration control:

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; ++i)
        x[i] -= x[j] * a[j,i];
    }

  16. Specialization 3: Inductive Memory Access
  A single inductive stream intrinsic describes the whole triangular access pattern (a[j+1:n], a[j+2:n], ..., a[n-2:n]): the stream's start address and length are updated with the induction variable, so the control core issues one command instead of one per row (modeled in the sketch below).

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; i += 2) {
        x[i] -= x[j] * a[j,i];
        if (i + 1 < n) x[i+1] -= x[j] * a[j,i+1];
      }
    }
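  A minimal C sketch (my own modeling; the descriptor fields are hypothetical, not the REVEL ISA) of what an inductive address generator computes: one descriptor whose start offset and length are themselves updated after every row is enough to enumerate a triangular slice.

    #include <stdio.h>

    /* Hypothetical inductive stream descriptor: after each inner stream of
     * `length` elements, the start offset and the length are updated by
     * constant deltas, which is enough to describe a triangular slice. */
    struct inductive_stream {
        long start;        /* element offset of the current row's first access */
        long stride;       /* stride within a row                              */
        long length;       /* number of accesses in the current row            */
        long start_delta;  /* added to start after each row                    */
        long length_delta; /* added to length after each row                   */
        long rows;         /* number of rows (outer iterations)                */
    };

    int main(void) {
        const long n = 6;
        /* Describes a[j][j+1 .. n-1] for j = 0 .. n-2 in a row-major n x n array:
         * start moves down one row and right one column (n+1) each iteration,
         * and the row length shrinks by one. */
        struct inductive_stream s = { .start = 1, .stride = 1, .length = n - 1,
                                      .start_delta = n + 1, .length_delta = -1,
                                      .rows = n - 1 };
        for (long r = 0; r < s.rows; ++r) {
            for (long i = 0; i < s.length; ++i) {
                long off = s.start + i * s.stride;
                printf("offset %ld  (a[%ld][%ld])\n", off, off / n, off % n);
            }
            s.start += s.start_delta;
            s.length += s.length_delta;
        }
        return 0;
    }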

  17. Specialization 4: Implicit Vectorization Predication
  Instead of the explicit if (i+1 < n) guard, masks are generated from the striding pattern, so odd-length rows pad their last vector implicitly (see the sketch below):

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; i += 2) {
        x[i] -= x[j] * a[j,i];
        if (i + 1 < n) x[i+1] -= x[j] * a[j,i+1];
      }
    }
  Triangular slicing: a[j+1:n], a[j+2:n], a[j+3:n], ..., a[n-2:n]
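  A small C sketch (illustrative; the mask logic is my rendering of the idea, not REVEL's exact hardware) of deriving per-lane masks from the stream length, so the guard disappears from the code the programmer writes:

    #include <stdio.h>

    #define VW 2   /* vector width used in the slide's example */

    int main(void) {
        const int n = 7;
        double a[7][7], x[7];
        /* deterministic, nonzero values so the divides are well defined */
        for (int r = 0; r < n; ++r) {
            x[r] = r + 1.0;
            for (int c = 0; c < n; ++c) a[r][c] = 1.0 + r + c;
        }

        for (int j = 0; j < n; ++j) {
            x[j] = x[j] / a[j][j];
            for (int i = j + 1; i < n; i += VW) {
                for (int lane = 0; lane < VW; ++lane) {
                    int mask = (i + lane < n);   /* derived from the stream length, not hand-written */
                    if (mask)
                        x[i + lane] -= x[j] * a[j][i + lane];
                }
            }
        }
        printf("x[n-1] = %g\n", x[n - 1]);
        return 0;
    }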

  18. Scalability: A Larger Mesh or Multiple Lanes?
  [figure: one large shared mesh with a single scratchpad versus several smaller lanes, each with its own control, private scratchpad, sync buffers, and XFER unit]

  19. REVEL: Reconfigurable Vector Lanes
  • A centralized control core commands multiple lanes.
  • Predication-based broadcast: SIMT-like control commands carry a lane mask (e.g., 01010101), and each enabled lane receives its own slice of the stream (a[0, j:n], a[1, j:n], a[2, j:n], ...); sketched below.
  • Each lane executes independently.
  [figure: shared scratchpad and control core feeding Lane 0 and Lane 1, each with a private scratchpad, sync buffers, and XFER unit]
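  A toy C sketch (my own abstraction of the slide; the command fields are hypothetical, not the actual command format) of predication-based broadcast: the control core issues one command with a lane bitmask, and only the lanes whose bit is set act on it.

    #include <stdio.h>
    #include <stdint.h>

    #define LANES 8

    /* Hypothetical broadcast command: a lane bitmask plus the base row of the
     * matrix slice each enabled lane should stream. */
    struct command { uint8_t lane_mask; int base_row; };

    int main(void) {
        struct command cmd = { .lane_mask = 0x55 /* 01010101 */, .base_row = 0 };
        for (int lane = 0; lane < LANES; ++lane) {
            if (cmd.lane_mask & (1u << lane))        /* predication-based broadcast */
                printf("lane %d: stream a[%d, j:n]\n", lane, cmd.base_row + lane);
            else
                printf("lane %d: idle\n", lane);
        }
        return 0;
    }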

  20. Outline
  • Spatial Architecture
  • REVEL: Reconfigurable Vector Lanes
  • Evaluation
    • Methodology
    • Speedup

  21. Evaluation Methodology
  • Performance:
    • gem5 RISC-V in-order core integrated with a cycle-accurate spatial-architecture simulator
    • Extends the stream-dataflow ISA
  • Baselines (same peak performance):
    • Intel(R) Xeon(R) Silver 4116 @2.10GHz (Intel MKL)
    • TI6678 DSP @1.25GHz (TI DSPLIB)
    • NVIDIA Titan (cuBLAS)
  • Power/Area:
    • Spatial architecture implemented in Chisel, synthesized with Synopsys DC at 28nm @1.25GHz
    • SRAM power/area estimated with CACTI
  [chart: area breakdown: CGRA Net 0.04, Trig Net 0.24, FUs 0.48, SPAD 0.48, VP/SE 0.16, Control Core 0.56]

  22. Speedup (Batch-8). [Bar chart: speedup over the TI DSP on a log scale (0.01x to 100x) for cpu, gpu, systolic, tagged dataflow, and REVEL.]

  23. Speedup (Batch-1). [Bar chart: speedup over the TI DSP on a log scale (0.01x to 100x) for cpu, gpu, systolic, dataflow, and REVEL.]

  24. Conclusion
  • According to our results, REVEL's hybrid systolic-dataflow design is a promising next-generation digital signal processing architecture.
  • More broadly, our work demonstrates the importance of considering multiple execution models.
