Proposing a Fast and Scalable Systolic Array for Matrix Multiplication


  1. Proposing a Fast and Scalable Systolic Array for Matrix Multiplication. Bahar Asgari, Ramyad Hadidi, Hyesoon Kim

  2. Matrix Multiplication. Matrix multiplication is the key operation in many applications. Example: convolution in neural networks. [Figure: a convolution of a W × H × C input with K filters of size F × F × C, rewritten as the multiplication of a (W·H) × (F²·C) matrix by an (F²·C) × K matrix.] Systolic arrays suit matrix multiplication because it includes several similar operations (i.e., multiply and accumulate) and because they capture its high data reuse rate.
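
The shapes on this slide can be checked with a small sketch. The code below is an illustration, not the authors' implementation: it assumes stride 1, an odd filter size F, and zero padding so that the output has W·H positions, and the names `im2col` and `conv_as_matmul` are hypothetical helpers. As in most DNN frameworks, the filter is applied without flipping.

```python
def im2col(image, F):
    """Unroll image[y][x][c] into W*H rows of length F*F*C (stride 1, zero padding)."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    pad = F // 2  # assumes odd F so the output keeps W*H positions
    rows = []
    for y in range(H):
        for x in range(W):
            patch = []
            for dy in range(F):
                for dx in range(F):
                    yy, xx = y + dy - pad, x + dx - pad
                    inside = 0 <= yy < H and 0 <= xx < W
                    for c in range(C):
                        patch.append(image[yy][xx][c] if inside else 0)
            rows.append(patch)
    return rows


def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def conv_as_matmul(image, filters):
    """filters[k][dy][dx][c] -> (W*H) x K output computed as one matrix product."""
    F = len(filters[0])
    C = len(filters[0][0][0])
    cols = im2col(image, F)  # (W*H) x (F*F*C)
    # Flatten each filter in the same (dy, dx, c) order used by im2col.
    flat = [[f[dy][dx][c] for dy in range(F) for dx in range(F) for c in range(C)]
            for f in filters]  # K x (F*F*C)
    weights = [list(col) for col in zip(*flat)]  # (F*F*C) x K
    return matmul(cols, weights)
```

With a 3 × 3 filter whose only nonzero weight is a 1 at the center, the product returns the input pixels unchanged, which is a quick sanity check of the layout.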

  3. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. [Figure: array of MAC units with operands A and B; dimension labels n, m, p.]

  4. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 1: only processing. Time steps: $1$

  5. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 1: only processing. Time steps: $2$

  6. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 1: only processing. Time steps: $3$

  7. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 1: only processing. Time steps: $4$

  8. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 1: only processing. Time steps: $5$

  9. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 1: only processing. Time steps: $n + m$

  10. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 2: processing and offloading. Time steps: $n + m + 1$ (Phase 1: $n + m$)

  11. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 3: only offloading. Time steps: $n + m + p - 2 + 1$ (Phase 1: $n + m$; Phase 2: $p - 2$)

  12. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 3: only offloading. Time steps: $n + m + p - 2 + 2$ (Phase 1: $n + m$; Phase 2: $p - 2$)

  13. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 3: only offloading. Time steps: $n + m + p - 2 + n$ (Phase 1: $n + m$; Phase 2: $p - 2$)

  14. Systolic Arrays for Matrix Multiplication. Non-stationary: none of the operands are stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 3: only offloading. Time steps: $2n + m + p - 2$
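
The wavefront behavior walked through in slides 3 to 14 can be sketched in software. This is a simplified model under stated assumptions, not the presenters' hardware: it assumes an n × p grid of MAC units in which unit (i, j) sees the operand pair with inner index k at step i + j + k, a common textbook skew for arrays that keep partial sums in place. The function name `output_stationary_matmul` is hypothetical. The model covers only the processing part; it does not reproduce the slides' phase boundaries or the offloading.

```python
def output_stationary_matmul(A, B):
    """Simulate a skewed wavefront over an n x p grid of MAC units.

    Returns the product and the number of processing steps used.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    steps = n + m + p - 2  # last unit (n-1, p-1) finishes k = m-1 at step n+m+p-3
    for t in range(steps):
        for i in range(n):
            for j in range(p):
                k = t - i - j  # which operand pair reaches unit (i, j) at step t
                if 0 <= k < m:
                    C[i][j] += A[i][k] * B[k][j]  # one multiply-accumulate
    return C, steps
```

In this model the processing takes n + m + p - 2 steps, which is where the slides' Phase 3 (only offloading) begins; adding the n offloading steps shown on slide 13 gives the slides' total of 2n + m + p - 2.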

  15. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. [Figure: array of MAC units with operands A and B; dimension labels n, m, p.]

  16. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 1: only loading B. Time steps: $1$

  17. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 1: only loading B. Time steps: $m - 1$

  18. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 2: loading B and processing. Time steps: $m - 1 + 1$ (Phase 1: $m - 1$)

  19. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 3: only processing. Time steps: $m + 1$ (Phases 1 and 2: $m$)

  20. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 3: only processing. Time steps: $m + m - 1$ (Phases 1 and 2: $m$)

  21. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 4: processing and offloading. Time steps: $2m - 1 + 1$ (Phases 1 to 3: $2m - 1$)

  22. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 4: processing and offloading. Time steps: $2m - 1 + 2$ (Phases 1 to 3: $2m - 1$)

  23. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 4: processing and offloading. Time steps: $2m - 1 + 3$ (Phases 1 to 3: $2m - 1$)

  24. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 4: processing and offloading. Time steps: $2m - 1 + n + p - 2$ (Phases 1 to 3: $2m - 1$)

  25. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 5: only offloading. Time steps: $2m - 1 + n + p - 2 + 1$ (Phases 1 to 3: $2m - 1$; Phase 4: $n + p - 2$)

  26. Systolic Arrays for Matrix Multiplication. Stationary: one operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 5: only offloading. Time steps: $n + 2m + p - 2$
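
The step counters on slides 16 to 26 imply a length for each phase of the stationary design. The sketch below only tabulates those lengths and checks that they add up to the slides' total; the function name `stationary_phases` is a hypothetical helper, and the per-phase lengths are read off the counters rather than stated explicitly on the slides.

```python
def stationary_phases(n, m, p):
    """Per-phase step counts for the stationary array, as read from the slide counters."""
    phases = {
        "1: only loading B": m - 1,               # steps 1 .. m-1
        "2: loading B and processing": 1,         # step m
        "3: only processing": m - 1,              # steps m+1 .. 2m-1
        "4: processing and offloading": n + p - 2,
        "5: only offloading": 1,
    }
    return phases, sum(phases.values())


phases, total = stationary_phases(n=4, m=3, p=5)
for name, length in phases.items():
    print(f"Phase {name}: {length} step(s)")
print("Total:", total)  # equals n + 2m + p - 2
```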

  27. Key Challenge. The systolic arrays proposed by prior work are not scalable: their latency grows linearly with the size of the inputs, and latency is the key metric for single-batch inference. For $A_{n \times m} \times B_{m \times p} = C_{n \times p}$, the non-stationary design takes $2n + m + p - 2$ time steps and the stationary design takes $n + 2m + p - 2$ time steps.
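
As a worked example of the linear growth, assume square operands with $n = m = p = N$ (an assumption made here for illustration; the slides do not fix the sizes). Both formulas then reduce to the same expression:

$$
\begin{aligned}
T_{\text{non-stationary}} &= 2n + m + p - 2 = 4N - 2, \\
T_{\text{stationary}} &= n + 2m + p - 2 = 4N - 2.
\end{aligned}
$$

For $N = 64$ this gives $254$ time steps, and for $N = 128$ it gives $510$, so doubling the operand size roughly doubles the latency of either design.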

  28. Key Insight and Proposed Systolic Array. Matrix multiplication consists of multiplications and additions; adding m numbers can be done in log(m) steps. In an optimized implementation, latency therefore increases sublinearly with the input size. We propose a systolic array with a separate multiplier array and adder-tree array. Time steps: the $2m$ term of $n + 2m + p - 2$ becomes $m + \log(m)$. [Figure: multiplier array and adder-tree array with operands A and B; dimension labels n, m, p; legend: a multiplier, an adder tree.]
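
The claim that m numbers can be added in log(m) steps corresponds to a pairwise reduction tree. A minimal sketch, assuming a binary tree in which every level adds neighboring pairs in parallel (the function name `tree_sum` is hypothetical, and the level count is the ceiling of log2(m) when m is not a power of two):

```python
def tree_sum(values):
    """Sum values with a binary adder tree; return (total, number of tree levels)."""
    level = list(values)
    if not level:
        raise ValueError("tree_sum needs at least one value")
    depth = 0
    while len(level) > 1:
        # All additions within one level are independent, so hardware can do them at once.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd element passes through to the next level
        level = nxt
        depth += 1
    return level[0], depth
```

A chain of accumulating units needs one step per addend, while the tree needs only one step per level, which is the source of the log(m) term.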

  29. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. [Figure: multiplier array and adder-tree array with operands A and B; dimension labels n, m, p; legend: a multiplier, an adder tree.]

  30. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 1: only loading B. Time steps: $1$

  31. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 1: only loading B. Time steps: $m - 1$

  32. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 2: loading B and multiplication. Time steps: $m - 1 + 1$ (Phase 1: $m - 1$)

  33. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 3: multiplication and addition. Time steps: $m + 1$ (Phases 1 and 2: $m$)

  34. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 3: multiplication and addition. Time steps: $m + 2$ (Phases 1 and 2: $m$)

  35. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 3: multiplication and addition. Time steps: $m + 3$ (Phases 1 and 2: $m$)

  36. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 3: multiplication and addition. Time steps: $m + 4$ (Phases 1 and 2: $m$)

  37. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 4: only addition. Time steps: $m + n + p - 2 + 1$ (Phases 1 and 2: $m$; Phase 3: $n + p - 2$)

  38. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 4: only addition. Time steps: $m + n + p - 2 + 2$ (Phases 1 and 2: $m$; Phase 3: $n + p - 2$)

  39. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 4: only addition. Time steps: $m + n + p - 2 + 3$ (Phases 1 and 2: $m$; Phase 3: $n + p - 2$)

  40. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 4: only addition. Time steps: $m + n + p - 2 + \log(m)$ (Phases 1 and 2: $m$; Phase 3: $n + p - 2$)

  41. Our proposed systolic array. One operand (here, B) is stationary. $A_{n \times m} \times B_{m \times p} = C_{n \times p}$. Phase 4: only addition. Time steps: $n + m + \log(m) + p - 2$
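
To see how the proposed step count compares with the stationary design as the operands grow, the two closed-form expressions from the slides can be evaluated side by side. This is a step-count model only, not a prediction of measured speedup; it assumes a base-2 logarithm rounded up to a whole number of tree levels, which the slides do not specify, and the function names are hypothetical.

```python
import math


def stationary_steps(n, m, p):
    """Time steps of the stationary design: n + 2m + p - 2."""
    return n + 2 * m + p - 2


def proposed_steps(n, m, p):
    """Time steps of the proposed design: n + m + log(m) + p - 2.

    Assumption: log is base 2, rounded up to whole adder-tree levels.
    """
    return n + m + math.ceil(math.log2(m)) + p - 2


def step_ratio(n, m, p):
    """Stationary steps divided by proposed steps (values above 1 favor the proposal)."""
    return stationary_steps(n, m, p) / proposed_steps(n, m, p)


for size in (8, 64, 512):
    print(size, stationary_steps(size, size, size),
          proposed_steps(size, size, size),
          round(step_ratio(size, size, size), 3))
```

For square operands the ratio of step counts tends toward 4/3 as the size grows, because the m accumulation steps are replaced by a logarithmic number of adder-tree levels.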

  42. Implementation. Tools and devices: Zynq XC7Z020, Vivado HLS. Benchmark: DNNs (VGG16, VGGS, AlexNet, CifarNet, ResNet50). Metrics: latency, energy consumption.

  43. Results – Speedup and Energy Consumption. Our proposed systolic array is 1.99x faster than non-stationary while consuming 2.12x less energy, and 1.83x faster than stationary while consuming 2.27x less energy. [Figure: speedup over non-stationary for VGGS, AlexNet, CifarNet, VGG16, ResNet50, and GMEAN; series: stationary, non-stationary, our proposed systolic array; labeled values 1.99 and 1.83.]

  44. Conclusions. Systolic arrays have seen significant interest because their unique interconnections satisfy the data-reuse requirement of matrix multiplication. Although the systolic arrays in prior work offer high throughput, their latency is not optimized, and latency is the key factor for single-batch inference. To optimize latency, we propose a new systolic array consisting of separate multiplier and adder-tree arrays. It is faster than both prior proposals as the size of the operands grows.
