A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms


  1. A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms. Jian Weng, Sihao Liu, Zhengrong Wang, Vidushi Dadu, Tony Nowatzki. University of California, Los Angeles. Feb 27th, 2020.

  2. Inductive Kernels
  [figure: a function F applied repeatedly to successively smaller triangular sub-matrices]
  • Inductive kernels: QR, SVD, Cholesky, Solver. Their inner-loop bounds depend on the outer induction variable.
  • Challenges:
    • Poor vectorization (see the sketch after this slide)
    • Too small to be multi-threaded
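  To make "poor vectorization" concrete, here is a small illustrative C program (my own, not from the talk; the problem size and vector width are arbitrary) that counts how many iterations of a shrinking inner loop land in full vector-width groups versus ragged remainders:

    #include <stdio.h>

    /* Illustrative only: for an inductive inner loop whose trip count shrinks
     * from n-1 down to 0, count how many iterations fall in full vector-width
     * groups vs. ragged remainders that need scalar or masked execution. */
    int main(void) {
        const int n = 16, vw = 4;           /* toy problem size and vector width */
        int full = 0, remainder = 0;
        for (int j = 0; j < n; ++j) {
            int trips = n - j - 1;          /* inner trip count at outer step j */
            full      += (trips / vw) * vw;
            remainder += trips % vw;
        }
        printf("full-vector iterations: %d, remainder iterations: %d (%.0f%% ragged)\n",
               full, remainder, 100.0 * remainder / (full + remainder));
        return 0;
    }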

  3. General-Purpose Processor Performance. [Bar chart: % of the ideal performance achieved by the DSP, CPU, and GPU on the inductive kernels; the y-axis spans 0% to 25%.]

  4. Our Goal: An Efficient, Flexible, and Specialized Architecture
  • Base design: a spatial architecture
  • Specialize for inductive idioms:
    • Hybridizing PEs, inductive control, and implicit vector padding
  • Scale up with multiple lanes
  • REVEL: Reconfigurable Vector Lanes
  • 3.4x speedup over existing spatial architectures and 17x over general-purpose processors
  • 2x the power and half the area compared with an ASIC collection
  [figure: two REVEL lanes, each with a control core, private scratchpad, sync buffers, and XFER unit, sharing a scratchpad]

  5. Outline
  • Background: "Dataflow" or "Systolic"? A tradeoff between cost and flexibility
  • Challenge 1: Synchronous Coordination
  • Challenge 2: Overwhelmed Processing Elements
  • Challenge 3: Overwhelmed Coordination
  • Inductive Access
  • Padding the Vectorization
  • REVEL: Reconfigurable Vector Lanes
  • Evaluation

  6. Spatial Architecture
  • Spatial architectures expose their compute resources and on-chip network to the programming interface.
  • Systolic: each PE is dedicated to one instruction, and the timing of data arrival is determined at compile time.
  • Tagged dataflow: multiple instructions share a PE and execution is dynamically scheduled, at roughly 5.8x the area and 4.2x the power of the systolic design.
  [figure: a dependence graph mapped onto a systolic array and onto a tagged-dataflow array]

  7. Spatial Architecture Performance. [Bar chart: % of the ideal performance for systolic and tagged-dataflow spatial architectures; the y-axis spans 0% to 80%.]

  8. Base Design: Systolic Architecture with Decoupled Data Access
  • Arithmetic operations are offloaded onto the spatial architecture.
  • Data accesses are decoupled from compute and coordinated by the control core.
  • Synchronization buffers serve as the interface between dynamically timed memory streams and the statically timed array (sketched below).
  [figure: control core, scratchpad memory, sync buffers, XFER unit, and a small systolic mesh]
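  A minimal C sketch (my own toy model, not REVEL hardware) of the decoupled access/execute idea: an "access" side pushes operands into a small synchronization buffer whenever there is room, and an "execute" side drains one operand per cycle when data is available, so neither side needs to know the other's timing.

    #include <stdio.h>

    #define BUF 4                       /* sync-buffer depth */

    int main(void) {
        double stream[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* data the access unit fetches */
        double fifo[BUF];
        int head = 0, tail = 0, count = 0;              /* sync-buffer state */
        int fetched = 0, executed = 0;
        double acc = 0.0;

        for (int cycle = 0; executed < 8; ++cycle) {
            /* access unit: issue a fetch only if the buffer has room */
            if (fetched < 8 && count < BUF) {
                fifo[tail] = stream[fetched++];
                tail = (tail + 1) % BUF;
                ++count;
            }
            /* execute unit: fire only when an operand is ready */
            if (count > 0) {
                acc += fifo[head];                      /* stand-in for a PE operation */
                head = (head + 1) % BUF;
                --count;
                ++executed;
            }
        }
        printf("sum = %g\n", acc);                      /* prints 36 */
        return 0;
    }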

  9. Challenge 1: Non-uniform Produce/Consume Rate
  The divide result x[j] is produced once but consumed n-j-1 times (produce|consume rate 1|(n-j-1)), and that rate changes with every outer iteration; the sketch below tallies it.

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; ++i)
        x[i] -= x[j] * a[j,i];
    }
  [figure: the divide fed through the sync buffers to a multiply-subtract pair on the mesh]
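  A tiny illustrative C program (mine, not from the talk) that prints the produce/consume counts of the divide result for each outer iteration, showing why no fixed forwarding schedule fits all iterations:

    #include <stdio.h>

    int main(void) {
        const int n = 6;                               /* toy problem size */
        for (int j = 0; j < n; ++j) {
            int produced = 1;                          /* one divide writes x[j]          */
            int consumed = n - j - 1;                  /* inner loop reuses x[j] n-j-1 times */
            printf("j=%d  produce|consume = %d|%d\n", j, produced, consumed);
        }
        return 0;
    }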

  10. Challenge 2: Overwhelmed Processing Elements
  A single kernel such as Cholesky needs many distinct operations (divide, square root, multiply, multiply-subtract), too many to give each one its own dedicated PE:

    for (k = 0; k < n; ++k) {
      inv = 1.0 / a[k,k];
      invsqrt = 1.0 / sqrt(a[k,k]);
      for (j = k; j < n; ++j)
        l[j,k] = a[k,j] * invsqrt;
      for (j = k+1; j < n; ++j)
        for (i = j; i < n; ++i)
          a[j,i] -= a[k,i] * a[k,j] * inv;
    }
  [figure: divide, sqrt, multiply, and multiply-subtract operations mapped onto the mesh]

  11. Challenge 3.1: Overwhelmed Coordination
  A rectangular slice, a[n:m, p:q], can be described in one command, but the triangular pattern needs a fresh, shorter slice (a[j+1:n], a[j+2:n], a[j+3:n], ..., a[n-2:n]) issued on every outer iteration, which overwhelms the coordination core:

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; i += 2) {
        x[i] -= x[j] * a[j,i];
      }
    }

  12. Challenge 3.2: Imperfect Loop Tiling
  Vectorizing the inner loop by two does not tile the shrinking trip count evenly, so the second element must be guarded:

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; i += 2) {
        x[i] -= x[j] * a[j,i];
        if (i + 1 < n) x[i+1] -= x[j] * a[j,i+1];
      }
    }
  Triangular slicing: a[j+1:n], a[j+2:n], a[j+3:n], ..., a[n-2:n]

  13. Outline
  • Spatial Architecture
  • REVEL: Reconfigurable Vector Lanes
    • Specialization 1: Hybridizing processing elements
    • Specialization 2: Coordinating non-uniform dependences
    • Specialization 3: Inductive access intrinsics
    • Specialization 4: Implicit vectorization predication
  • Scalability: Larger Spatial Array or Multiple Lanes?
  • Evaluation

  14. Specialization 1: Hybridizing Systolic and Dataflow
  Annotate each region of the Cholesky kernel with how often it executes, and map it accordingly: the rare O(1) and O(n) operations share the flexible dataflow PEs, while the O(n²) inner loop gets dedicated systolic PEs (tallied in the sketch below).

    for (k = 0; k < n; ++k) {
      inv = 1.0 / a[k,k];                      // O(1) per k
      invsqrt = 1.0 / sqrt(a[k,k]);            // O(1) per k
      for (j = k; j < n; ++j)
        l[j,k] = a[k,j] * invsqrt;             // O(n) per k
      for (j = k+1; j < n; ++j)
        for (i = j; i < n; ++i)
          a[j,i] -= a[k,i] * a[k,j] * inv;     // O(n²) per k
    }
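  An illustrative C tally (my own, with an arbitrary n, not from the talk) of how many times each region of the kernel above executes, which is why only the innermost region merits dedicated systolic PEs:

    #include <stdio.h>

    int main(void) {
        const int n = 64;
        long o1 = 0, on = 0, on2 = 0;
        for (int k = 0; k < n; ++k) {
            o1 += 2;                           /* divide and reciprocal sqrt: O(1) per k */
            on += n - k;                       /* column scaling: O(n) per k             */
            for (int j = k + 1; j < n; ++j)
                on2 += n - j;                  /* trailing update: O(n^2) per k          */
        }
        printf("O(1) ops: %ld, O(n) ops: %ld, O(n^2) ops: %ld\n", o1, on, on2);
        return 0;
    }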

  15. Specialization 2: Coordinating Non-uniform Dependences
  The 1|(n-j-1) produce/consume relationship between the divide and the multiply-subtract is made explicit, so the architecture can coordinate the dependence directly rather than through per-iteration control:

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; ++i)
        x[i] -= x[j] * a[j,i];
    }

  16. Specialization 3: Inductive Memory Access
  A single inductive stream intrinsic describes the whole triangular access pattern (a[j+1:n], a[j+2:n], ..., a[n-2:n]): the stream's start address and length are updated with the induction variable, so the control core issues one command instead of one per row (modeled in the sketch below).

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; i += 2) {
        x[i] -= x[j] * a[j,i];
        if (i + 1 < n) x[i+1] -= x[j] * a[j,i+1];
      }
    }
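  A minimal C sketch (my own modeling; the descriptor fields are hypothetical, not the REVEL ISA) of what an inductive address generator computes: one descriptor whose start offset and length are themselves updated after every row is enough to enumerate a triangular slice.

    #include <stdio.h>

    /* Hypothetical inductive stream descriptor: after each inner stream of
     * `length` elements, the start offset and the length are updated by
     * constant deltas, which is enough to describe a triangular slice. */
    struct inductive_stream {
        long start;        /* element offset of the current row's first access */
        long stride;       /* stride within a row                              */
        long length;       /* number of accesses in the current row            */
        long start_delta;  /* added to start after each row                    */
        long length_delta; /* added to length after each row                   */
        long rows;         /* number of rows (outer iterations)                */
    };

    int main(void) {
        const long n = 6;
        /* Describes a[j][j+1 .. n-1] for j = 0 .. n-2 in a row-major n x n array:
         * start moves down one row and right one column (n+1) each iteration,
         * and the row length shrinks by one. */
        struct inductive_stream s = { .start = 1, .stride = 1, .length = n - 1,
                                      .start_delta = n + 1, .length_delta = -1,
                                      .rows = n - 1 };
        for (long r = 0; r < s.rows; ++r) {
            for (long i = 0; i < s.length; ++i) {
                long off = s.start + i * s.stride;
                printf("offset %ld  (a[%ld][%ld])\n", off, off / n, off % n);
            }
            s.start += s.start_delta;
            s.length += s.length_delta;
        }
        return 0;
    }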

  17. Specialization 4: Implicit Vectorization Predication
  Instead of the explicit if (i+1 < n) guard, masks are generated from the striding pattern, so odd-length rows pad their last vector implicitly (see the sketch below):

    for (j = 0; j < n; ++j) {
      x[j] = x[j] / a[j,j];
      for (i = j + 1; i < n; i += 2) {
        x[i] -= x[j] * a[j,i];
        if (i + 1 < n) x[i+1] -= x[j] * a[j,i+1];
      }
    }
  Triangular slicing: a[j+1:n], a[j+2:n], a[j+3:n], ..., a[n-2:n]
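  A small C sketch (illustrative; the mask logic is my rendering of the idea, not REVEL's exact hardware) of deriving per-lane masks from the stream length, so the guard disappears from the code the programmer writes:

    #include <stdio.h>

    #define VW 2   /* vector width used in the slide's example */

    int main(void) {
        const int n = 7;
        double a[7][7], x[7];
        /* deterministic, nonzero values so the divides are well defined */
        for (int r = 0; r < n; ++r) {
            x[r] = r + 1.0;
            for (int c = 0; c < n; ++c) a[r][c] = 1.0 + r + c;
        }

        for (int j = 0; j < n; ++j) {
            x[j] = x[j] / a[j][j];
            for (int i = j + 1; i < n; i += VW) {
                for (int lane = 0; lane < VW; ++lane) {
                    int mask = (i + lane < n);   /* derived from the stream length, not hand-written */
                    if (mask)
                        x[i + lane] -= x[j] * a[j][i + lane];
                }
            }
        }
        printf("x[n-1] = %g\n", x[n - 1]);
        return 0;
    }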

  18. Scalability: A Larger Mesh or Multiple Lanes?
  [figure: one large shared mesh with a single scratchpad versus several smaller lanes, each with its own control, private scratchpad, sync buffers, and XFER unit]

  19. REVEL: Reconfigurable Vector Lanes
  • A centralized control core commands multiple lanes.
  • Predication-based broadcast: SIMT-like control commands carry a lane mask (e.g., 01010101), and each enabled lane receives its own slice of the stream (a[0, j:n], a[1, j:n], a[2, j:n], ...); sketched below.
  • Each lane executes independently.
  [figure: shared scratchpad and control core feeding Lane 0 and Lane 1, each with a private scratchpad, sync buffers, and XFER unit]
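  A toy C sketch (my own abstraction of the slide; the command fields are hypothetical, not the actual command format) of predication-based broadcast: the control core issues one command with a lane bitmask, and only the lanes whose bit is set act on it.

    #include <stdio.h>
    #include <stdint.h>

    #define LANES 8

    /* Hypothetical broadcast command: a lane bitmask plus the base row of the
     * matrix slice each enabled lane should stream. */
    struct command { uint8_t lane_mask; int base_row; };

    int main(void) {
        struct command cmd = { .lane_mask = 0x55 /* 01010101 */, .base_row = 0 };
        for (int lane = 0; lane < LANES; ++lane) {
            if (cmd.lane_mask & (1u << lane))        /* predication-based broadcast */
                printf("lane %d: stream a[%d, j:n]\n", lane, cmd.base_row + lane);
            else
                printf("lane %d: idle\n", lane);
        }
        return 0;
    }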

  20. Outline
  • Spatial Architecture
  • REVEL: Reconfigurable Vector Lanes
  • Evaluation
    • Methodology
    • Speedup

  21. Evaluation Methodology
  • Performance:
    • gem5 RISC-V in-order core integrated with a cycle-accurate spatial-architecture simulator
    • Extends the stream-dataflow ISA
  • Baselines (same peak performance):
    • Intel(R) Xeon(R) Silver 4116 @2.10GHz (Intel MKL)
    • TI6678 DSP @1.25GHz (TI DSPLIB)
    • NVIDIA Titan (cuBLAS)
  • Power/Area:
    • Spatial architecture implemented in Chisel, synthesized with Synopsys DC at 28nm @1.25GHz
    • SRAM power/area estimated with CACTI
  [chart: area breakdown: CGRA Net 0.04, Trig Net 0.24, FUs 0.48, SPAD 0.48, VP/SE 0.16, Control Core 0.56]

  22. Speedup (Batch-8). [Bar chart: speedup over the TI DSP on a log scale (0.01x to 100x) for cpu, gpu, systolic, tagged dataflow, and REVEL.]

  23. Speedup (Batch-1). [Bar chart: speedup over the TI DSP on a log scale (0.01x to 100x) for cpu, gpu, systolic, dataflow, and REVEL.]

  24. Conclusion
  • According to our results, REVEL's hybrid systolic-dataflow design is a promising next-generation digital signal processing architecture.
  • More broadly, our work demonstrates the importance of considering multiple execution models.
