Refinements in Data Manipulation Method for Coarse Grained Reconfigurable Architectures

Refinements in Data Manipulation Method for Coarse Grained Reconfigurable Architectures. Takuya Kojima and Hideharu Amano, Keio University, Japan. 14th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC 2019).


  1. Refinements in Data Manipulation Method for Coarse Grained Reconfigurable Architectures
     Takuya Kojima and Hideharu Amano, Keio University, Japan
     14th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC 2019)

  2. Importance of Programmability and High Energy Efficiency
     - Forthcoming
       - IoT devices
       - Wearable computers
       - Edge computing
     - Challenges for these devices
       - Programmability: to satisfy various demands
       - High energy efficiency: to extend battery life

  3. CGRAs: Coarse-Grained Reconfigurable Architectures
     - CGRAs
       - Support word-level reconfiguration (in contrast to the bit-level reconfiguration of FPGAs)
       - Have many PEs (Processing Elements) in a 2D grid
       - Change the functionality of each ALU and the interconnection between PEs dynamically or statically

  4. Power-hungry Dynamic Reconfiguration
     - Dynamic reconfiguration
       - Changes the configuration cycle-by-cycle
       - Provides more flexibility
       - Causes large dynamic power consumption
     Figure: details of power consumption for a dynamic reconfiguration CGRA [1] (segments: Others, Computation, Clock Tree, Reconfiguration; values: 30%, 30%, 15%, 25%).
     [1] Ozaki, Nobuaki, et al. "Cool mega-arrays: Ultralow-power reconfigurable accelerator chips." IEEE Micro 31.6 (2011): 6-18.

  5. SF-CGRAs: Straight-Forward CGRAs
     - Key features of straight-forward CGRAs
       - Limited data flow direction
       - Pipelined PE array
       - Less frequent reconfiguration
       - High energy efficiency
     - Examples: PipeRench [2], XPP [3], EGRA [4], RSPA [5]
     Figure: SF-CGRA structure (data memory, permutation networks, rows of PEs, pipeline registers).
     [2] H. Schmit, et al., CICC 2002
     [3] M. Petrov, et al., FPL 2004
     [4] G. Ansaloni, et al., TVLSI 2011
     [5] Yoon, Jonghee W., et al., ASP-DAC 2008

  6. VPCMA: Variable Pipelined Cool Mega Array [2]
     - PE array consists of
       - 8 x 12 PEs
       - 7 pipeline registers
     - PE has
       - No register file
       - No clock tree
     - Pipeline register works in
       1. latch mode, or
       2. bypass mode
     - μ-Controller
       - Controls data transfer between the data memory and the PE array
     Figure: VPCMA block diagram (PE array with pipeline registers, μ-controller, data manipulator, data memory).
     [2] N. Ando, et al. "Variable pipeline structure for Coarse Grained Reconfigurable Array CMA." Field-Programmable Technology, 2016.

  7. Computation on the PE array
     - Fetch registers are connected to the inputs of the PE array
     - Gather registers are connected to the outputs of the PE array
     - The micro-controller
       - Writes data to the fetch registers
       - Reads results from the gather registers
     Figure: PE array with fetch registers at its inputs and gather registers at its outputs.

  9. Variable Pipeline Structure
     - No registers in each pipeline stage → pure combinational circuit
     - Clock tree only for activated pipeline registers
     - Variable pipeline structure depending on the application
     Figure: example in which the 8 PE rows (1st to 8th) are grouped into four pipeline stages (stage1 to stage4).
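
The grouping of PE rows into stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pipeline_stages` and the representation of the 7 pipeline registers as a list of booleans (True for latch mode, False for bypass mode) are hypothetical, and the two-rows-per-stage configuration below is an example rather than a reading of the slide's figure.

```python
def pipeline_stages(latch_modes):
    """Group PE rows into pipeline stages (hypothetical model).

    latch_modes[i] is True when the pipeline register between PE row i+1
    and PE row i+2 is in latch mode, and False when it is bypassed.
    """
    stages = []
    current = [1]  # PE rows are numbered from 1, as on the slide
    for i, latched in enumerate(latch_modes):
        next_row = i + 2
        if latched:
            # A latched register closes the current stage.
            stages.append(current)
            current = [next_row]
        else:
            # A bypassed register merges the next row into the same stage.
            current.append(next_row)
    stages.append(current)
    return stages


# Example configuration (assumed): latch every second register.
four_stage = pipeline_stages([False, True, False, True, False, True, False])
# All registers bypassed: the whole array is one combinational stage.
single_stage = pipeline_stages([False] * 7)
```

With 8 PE rows and 7 registers, latching every second register yields four stages of two rows each, while bypassing all registers leaves a single stage that spans the whole array.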

  10. Multi-cycle Execution on PE Array
      - Micro-controller
        - A custom tiny RISC processor controls the processing
        - "Fetch" op kicks off the execution
        - "Gather" op writes back the results
        - "Fetch" and "Gather" ops are fused into an instruction
        - "Delay" op specifies the delay time of the "Gather" execution
        - "Branch" op makes a loop
      Figure: timing diagram of repeated Fetch → stage1 → stage2 → stage3 → stage4 → Gather sequences, with Gather delayed 4 cycles; Delay and Branch instructions are also shown.

  11. Multi-cycle Execution on PE Array
      - Micro-controller
        - A custom tiny RISC processor controls the processing
        - "Fetch" op kicks off the execution
        - "Gather" op writes back the results
        - "Fetch" and "Gather" ops are fused into an instruction
        - "Delay" op specifies the delay time of the "Gather" execution
        - "Branch" op makes a loop
      Figure: timing diagram with Gather delayed 8 cycles; the timing is adjusted by inserting other instructions (NOPs in the figure).

  12. Data Manipulator of VPCMA
      - Data manipulator
        - Placed between Dmem and the PE array
        - Transfers any input data to any outputs
        - Loads at most 12 consecutive data from 12 memory banks
        - Increments the address automatically for the next fetch
      Figure (1st fetch): data memory (BANK0 to BANK11), data manipulator, and PE array, with the fetch address, next fetch address, shifted data, and fetch registers. Transfer table #0 has dst., src., and mask fields; entries as extracted: col0: 0, 1; col1: 1, 1; col2: 0, N/A; col3: 2, 1; col4: 3, 1; col5: 0, N/A; ...

  13. Data Manipulator of VPCMA
      - Data manipulator
        - Placed between Dmem and the PE array
        - Transfers any input data to any outputs
        - Loads at most 12 consecutive data from 12 memory banks
        - Increments the address automatically for the next fetch
      Figure (2nd fetch): same diagram as the 1st fetch, with the fetch address advanced to the next position; transfer table #0 is unchanged.
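
A minimal Python sketch of the fetch behaviour described above follows. Several details are assumptions made for illustration rather than facts from the slides: memory is modelled as a flat word-addressed list (bank interleaving is ignored), each table entry is read as a `(src, mask)` pair for one destination column, masked columns receive `None`, and the increment of 4 is an arbitrary example value.

```python
NUM_BANKS = 12  # the slides state 12 memory banks

# Hypothetical reading of transfer table #0: one (src, mask) pair per
# destination column; src is None where the column is masked out.
TABLE = [(0, 1), (1, 1), (None, 0), (2, 1), (3, 1), (None, 0)]


def fetch(memory, fetch_addr, table, increment):
    """Load a window of 12 consecutive words and route them to columns."""
    window = memory[fetch_addr:fetch_addr + NUM_BANKS]
    fetch_regs = []
    for src, mask in table:
        # A masked column gets no data; otherwise pick word `src` of the window.
        fetch_regs.append(window[src] if mask else None)
    # The address advances automatically for the next fetch.
    return fetch_regs, fetch_addr + increment


memory = list(range(100, 148))  # dummy data words
regs1, addr1 = fetch(memory, 0, TABLE, increment=4)
regs2, addr2 = fetch(memory, addr1, TABLE, increment=4)
```

In this model the second fetch reuses the same table but reads a window that starts 4 words later, which corresponds to the automatic address increment on the slide.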

  14. Ultra Low Power Consumption of CMA
      - Non-pipelined version of CMA [6]
        - Works with a lemon battery
        - Achieves 743 MOPS/mW (297 MOPS / 0.4 mW)
      - VPCMA
        - Keeps the same energy efficiency
        - Achieves 4x higher peak performance
      - Problem
        - Less flexibility because of saving too much energy
      [6] M. Koichiro, et al. "A 297MOPS/0.4mW ultra low power coarse-grained reconfigurable accelerator CMA-SOTB-2." 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig).
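
The efficiency figure quoted for the non-pipelined CMA follows directly from the two numbers in parentheses. The helper name `mops_per_mw` below is hypothetical and exists only to make the arithmetic explicit.

```python
def mops_per_mw(throughput_mops, power_mw):
    """Energy efficiency as throughput divided by power."""
    return throughput_mops / power_mw


# 297 MOPS at 0.4 mW, the operating point quoted on the slide.
efficiency = mops_per_mw(297, 0.4)
```

The quotient is about 742.5 MOPS/mW, which the slide reports as 743 MOPS/mW.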

  15. Limitation of data handling in VPCMA
      - The data manipulator cannot simultaneously access data that are more than 12 steps apart
        - → Needs data rearrangement
        - → Often incurs extra copies of data
      Figure: a loop example and its memory allocation in the bank memory; array a (A0, A1, ...) and array b (B0, B1, ...) are stored too far apart.

  16. Limitation of data handling in VPCMA
      - The data manipulator cannot simultaneously access data that are more than 12 steps apart
        - → Needs data rearrangement
        - → Often incurs extra copies of data
      Figure: the same loop example after rearrangement; copies of array b are placed next to each part of array a (A0 B0 A1 B1 / A16 B0 A17 B1 / A32 B0 A33 B1).
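
The limitation and its workaround can be illustrated with a small Python sketch. The function names, the exact window condition (addresses must span fewer than 12 words), and the interleaved layout are assumptions chosen to mirror the figure, not the actual memory allocation used by the authors.

```python
WINDOW = 12  # at most 12 consecutive data can be loaded per fetch


def fits_in_one_fetch(addresses, window=WINDOW):
    """True if all addresses fall inside one window of consecutive words."""
    return max(addresses) - min(addresses) < window


def build_layout(a, b, a_per_row):
    """Interleave chunks of `a` with a fresh copy of `b` (hypothetical layout).

    Assumes len(b) >= a_per_row so every element of a chunk has a partner.
    """
    layout = []
    for start in range(0, len(a), a_per_row):
        for k, a_val in enumerate(a[start:start + a_per_row]):
            layout.append(a_val)
            layout.append(b[k])  # another copy of b for every chunk of a
    return layout


# Separate arrays: a[0] at address 0 and b[0] at address 64 are too far apart.
separate_ok = fits_in_one_fetch([0, 64])

# Rearranged: each element of a sits next to a copy of an element of b.
a = ["A0", "A1", "A2", "A3", "A4", "A5"]
b = ["B0", "B1"]
layout = build_layout(a, b, a_per_row=2)
pos = layout.index("A4")
rearranged_ok = fits_in_one_fetch([pos, pos + 1])
```

The rearranged layout makes every pair reachable in a single fetch, but `B0` and `B1` are now stored once per chunk of `a`, which is the extra copying the slide points out.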

  17. Other limitations of VPCMA
      - Also, VPCMA
        1. Suffers from a lack of constant registers for the PE array
           - A PE row (12 PEs) shares two constant registers, or borrows from other rows via the interconnection
           - → Degrades mappability of complex kernels
        2. Depends on a host processor for overall control
           - The micro-controller basically controls data transfer and the loop counter
           - All other controls (e.g. reconfiguration) are carried out by the host processor, even if only a trivial change is needed

  18. Proposed architecture
      - A new architecture, VPCMA2
      - Relaxes the aforementioned limitations
        1. Improved bank access by a new data manipulator
        2. Refined connectivity of constant registers
           - The PE array has 16 constant registers (same as VPCMA)
           - All PEs can use any of the 16 registers
        3. Introduced an extended data bus for the micro-controller

  19. New Data Manipulator
      - An offset value for each bank is introduced
      - Relaxes the limitation of consecutive data access
      Figure (1st fetch): fetch addr. 0x0, increment 4; an adder per bank adds that bank's offset to the fetch address, so that elements of array a and array b are loaded in the same fetch (offsets of 0 and 5 are shown).

  20. New Data Manipulator
      - An offset value for each bank is introduced
      - Relaxes the limitation of consecutive data access
      Figure (2nd fetch): fetch addr. 0x4, increment 4; the per-bank offsets change (offsets of 1, 0, and 5 are shown) so that the shifted array a and array b are loaded in the same fetch.
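
The per-bank offset idea can be sketched as follows. This is a simplified model under stated assumptions: each bank is a list indexed by a row address, the common fetch address is treated as a row index, the rotation ("shifted data") step is omitted, and the assignment of the offsets 0 and 5 to particular banks is illustrative only.

```python
NUM_BANKS = 12


def fetch_with_offsets(banks, fetch_row, offsets):
    """Read one word per bank at (common fetch row + that bank's offset)."""
    return [banks[b][fetch_row + offsets[b]] for b in range(NUM_BANKS)]


# Dummy memory: every word is labelled with its bank and row.
banks = [[f"bank{b}_row{r}" for r in range(8)] for b in range(NUM_BANKS)]

# Hypothetical offsets: banks 4 to 7 read 5 rows further down than the rest,
# e.g. to reach a second array stored elsewhere in memory.
offsets = [0] * 4 + [5] * 4 + [0] * 4
words = fetch_with_offsets(banks, 0, offsets)
```

With all offsets set to 0 the model degenerates to the original behaviour of reading consecutive data, so the offsets act as a strict generalization of the earlier data manipulator.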

  21. Extended Data Bus
      - The micro-controller can handle any data in the other modules
      Figure: block diagram of the external host processor and external bus (22-bit address bus, 32-bit data bus) connected to the config. controller, instruction memory, DMAC, constant registers (16 registers), data memory, config. registers, PE array, and micro-controller; a general-purpose bus for the micro-controller (22-bit address bus, 32-bit data bus) reaches the other modules.

  22. Evaluation Setup
      - An implementation of VPCMA2
        - Using Renesas SOTB 65-nm technology
        - LSTP (Low Standby Power) version
        - Synthesized by Synopsys Design Compiler 2017
      - A real chip of VPCMA [7]
        - Fabricated in the same technology
        - LP (Low Power) version (75% slower than LSTP)
      Figure: chip photo of VPCMA [7] (dimension labels 3mm and 6mm; PE Array and TCI are marked).
      [7] T. Kojima, et al. "Real chip evaluation of a low power CGRA with optimized application mapping," 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies. ACM, 2018, p. 13.
