

  1. Automatic Generation of High Throughput Energy Efficient Streaming Architectures for Arbitrary Fixed Permutations. Ren Chen, Viktor K. Prasanna, Ming Hsieh Department of Electrical Engineering. Presented by: Ajitesh Srivastava, Department of Computer Science, University of Southern California. Ganges.usc.edu/wiki/TAPAS

  2. Outline • Introduction • Background and Related Work • High Throughput and Energy Efficient Design • Experimental Results • Conclusion and Future Work

  3. Permutation • A permutation can be represented as 𝑧 = 𝑄ₙ ∙ 𝑦 • 𝑛 is the size of vectors 𝑦 and 𝑧 • The 𝑛 × 𝑛 bit matrix 𝑄ₙ is called a permutation matrix (Figure: a 4-point permutation mapping 𝑦₀…𝑦₃ to 𝑧₀…𝑧₃)
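The matrix form above can be checked with a short script: row 𝑖 of 𝑄ₙ has a single 1 in column perm[i], so z[i] = y[perm[i]]. The 4-point cyclic shift below is a hypothetical example, not one taken from the slides.

```python
# z = Q_n * y, where Q_n is an n x n 0/1 permutation matrix:
# row i has its single 1 in column perm[i], so z[i] = y[perm[i]].

def apply_permutation_matrix(Q, y):
    """Multiply the 0/1 matrix Q by the vector y."""
    return [sum(q * v for q, v in zip(row, y)) for row in Q]

perm = [1, 2, 3, 0]                       # a 4-point cyclic shift
n = len(perm)
Q = [[1 if j == perm[i] else 0 for j in range(n)] for i in range(n)]

y = [10, 20, 30, 40]
z = apply_permutation_matrix(Q, y)        # -> [20, 30, 40, 10]
```

The matrix product and the index form `[y[i] for i in perm]` give the same vector, which is why hardware realizes the permutation with wiring and switches rather than an actual multiply.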

  4. Related Applications • Key algorithms: FFT, sorting, Viterbi decoding, etc. • Application examples: frequency-domain audio analysis, partial differential equations in images, image filtering, OFDM systems (Figure: a bitonic sorting network from input to output, built from permutation stages 𝑄₈, 𝑃₈,₄, 𝑃₈,₂, and 𝑃₄,₂)

  5. Data Permutation in Conventional Architectures • Permutation by wires: parallel architecture • Permutation by memory or registers: pipeline architecture, shared memory architecture (Figure: (a) a parallel architecture of processing elements, (b) a shared memory architecture with banks 1…r, (c) a pipeline architecture)

  6. Data Permutation in Streaming Architectures • Streaming architecture • High data parallelism • High design throughput • Simple control scheme • No requirement on data input/output order

  7. Problem Definition • Permute streaming data with a fixed data parallelism • Input/output: in a streaming manner and at a fixed rate • Data parallelism 𝑝: number of inputs processed each cycle per computation stage • Streaming permutation: permutation between adjacent computation stages • Processing elements: computation units for a given application (Figure: an FPGA with a memory interface to external memory; 𝑝-wide input and output streams pass through alternating processing-element stages and streaming permutations)
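A behavioural sketch of the problem statement (function names are hypothetical; the actual design is a hardware datapath, not software): 𝑝 elements arrive per cycle, an 𝑁-element frame is buffered over 𝑁/𝑝 cycles, and the permuted frame is emitted 𝑝 elements per cycle.

```python
# Behavioural model of a streaming fixed permutation with data
# parallelism p. Assumes the stream length is a multiple of N.

def stream_permute(stream, perm, p):
    N = len(perm)
    out_cycles = []
    for start in range(0, len(stream), N):
        frame = stream[start:start + N]            # N/p input cycles
        permuted = [frame[perm[i]] for i in range(N)]
        # Emit p elements per output cycle.
        out_cycles += [permuted[c:c + p] for c in range(0, N, p)]
    return out_cycles

perm = [0, 4, 1, 5, 2, 6, 3, 7]                    # a stride permutation, N = 8
cycles = stream_permute(list(range(8)), perm, p=4)
# -> [[0, 4, 1, 5], [2, 6, 3, 7]]
```

Because each output element may come from any input cycle of the frame, a pure wiring solution is impossible at 𝑝 < 𝑁; the permutation must be split across switches and memory, which is exactly what the proposed architecture does.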

  8. Outline • Introduction • Background and Related Work • High Throughput and Energy Efficient Design • Experimental Results • Conclusion and Future Work

  9. Related Work (JVSP ’07, T. Järvinen) • Stride permutation on array processors • Flexible data parallelism • Mathematical formulation

  10. Related Work (DAC ’12, M. Zuluaga and M. Püschel) • Domain-specific language based • Hardware generator for data permutations in sorting

  11. Proposed Design Approach • Drawbacks of the state-of-the-art: supports only specific permutation patterns; design scalability needs improvement; not memory efficient; no efficient control logic • We propose a mapping approach to obtain a streaming permutation architecture: utilizes the Benes network for building the datapath and generating the control logic; highly optimized with respect to memory efficiency and interconnection complexity; scalable with problem size 𝑁 and data parallelism 𝑝; supports processing continuous data streams; design automation tool

  12. Benes Network • Multistage network that can realize all 𝑛! permutations • Rearrangeably non-blocking
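The size of a Benes network follows from its recursive construction: for 𝑛 = 2ᵏ ports it has 2 log₂ 𝑛 − 1 columns of 𝑛/2 two-by-two switches. A quick sanity check of that count:

```python
import math

def benes_switch_count(n):
    """Number of 2x2 switches in an n-port Benes network (n a power of two):
    (2*log2(n) - 1) columns of n/2 switches each."""
    columns = 2 * int(math.log2(n)) - 1
    return columns * (n // 2)

# n = 8: 5 columns of 4 switches -> 20 switches.
count = benes_switch_count(8)
```

Twenty one-bit switch settings give 2²⁰ configurations, comfortably more than the 8! = 40320 permutations that must be realizable, consistent with the rearrangeably non-blocking property.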

  13. Outline • Introduction • Background and Related Work • High Throughput and Energy Efficient Design • Experimental Results • Conclusion and Future Work

  14. Architecture Overview • Parameterized architecture: problem size 𝑁, data parallelism 𝑝 • Memory based: 𝑝 independent memory blocks, each of size 𝑁/𝑝 • 𝑝-to-𝑝 connection network: 𝑝 log₂ 𝑝 2 × 2 switches, optimal compared with the state-of-the-art • Highly optimized control unit
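One way to picture the memory stage (a simplified, hypothetical mapping for illustration only; the paper derives the actual bank assignment from the target permutation): with 𝑝 banks of depth 𝑁/𝑝, element 𝑖 of an in-order frame can live at bank 𝑖 mod 𝑝, address ⌊𝑖/𝑝⌋, so every arriving 𝑝-wide cycle touches each bank exactly once.

```python
# Hypothetical round-robin bank mapping for p banks of depth N/p.

def bank_and_addr(i, p):
    """Map element index i of a frame to a (bank, address) pair."""
    return i % p, i // p

N, p = 16, 4
placement = [bank_and_addr(i, p) for i in range(N)]
# e.g. element 5 -> bank 1, address 1
```

Each cycle writes 𝑝 consecutive elements, one per bank, so no bank ever sees two accesses in the same cycle; conflict-free reads for an arbitrary permutation are what the derived (not round-robin) mapping in the paper guarantees.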

  15. Proposed Mapping Approach • Vertically fold the Benes network • Build a three-stage datapath • Divide-and-conquer based method • For a fixed data parallelism 𝑝 • Supports continuous data streams

  16. Automatic Generation of the Datapath (1) • GDP(𝑁, 𝑝): Generating Datapath • 𝑁: problem size, 𝑝: data parallelism • 𝐵: upper part of datapath, 𝐶: lower part of datapath
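The slide gives only the signature of GDP, so the following is a structural sketch under the assumption that GDP follows the Benes divide-and-conquer pattern: split the network into an upper part 𝐵 and a lower part 𝐶 of half the size and recurse until a subnetwork is no wider than the data parallelism 𝑝. All names are hypothetical, and the real generator emits RTL rather than nested dictionaries.

```python
# Structural sketch of a divide-and-conquer datapath generator.

def gdp(n, p):
    # Base case: subnetwork no wider than the stream; realize it
    # directly as a p-port switch network.
    if n <= p:
        return f"switch_network({n})"
    # Recursive case: input switch stage, two half-size subnetworks
    # (B = upper, C = lower), output switch stage.
    return {"input_stage": f"switches({p})",
            "B": gdp(n // 2, p),
            "C": gdp(n // 2, p),
            "output_stage": f"switches({p})"}

net = gdp(16, 4)   # two levels of recursion before the base case
```

The recursion depth is log₂(𝑁/𝑝), which is where the scalability with both 𝑁 and 𝑝 comes from: growing 𝑁 adds levels, while growing 𝑝 widens each stage.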

  17. Automatic Generation of the Datapath (2)

  18. Automatic Generation of Control Logic (1) • Configuration bits of a switch in its different states

  19. Automatic Generation of Control Logic (2) • Single stage routing • 𝑌: input data vector, 𝑍: permuted data vector • 𝜌: mapping from 𝑌 to 𝑍 • 𝑌′: output data vector of the input switches • 𝑍′: input data vector of the output switches
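A minimal model of what the configuration bits control (helper name hypothetical): each 2 × 2 switch either passes or swaps its two inputs under one bit, and single-stage routing amounts to choosing the input-column bits so that 𝑌′ lines up with the 𝑍′ required by 𝜌.

```python
# One column of 2x2 switches: bits[k] = 1 swaps pair (2k, 2k+1),
# bits[k] = 0 passes it through unchanged.

def switch_column(vec, bits):
    out = list(vec)
    for k, b in enumerate(bits):
        if b:
            out[2 * k], out[2 * k + 1] = out[2 * k + 1], out[2 * k]
    return out

Y = ['a', 'b', 'c', 'd']
Y_prime = switch_column(Y, [1, 0])   # -> ['b', 'a', 'c', 'd']
```

With 𝑝/2 bits per column and one column per stage, the whole control word for a stage is just the concatenation of these bits, which is why the generated control logic stays small.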

  20. Automatic Generation of Control Logic (3) • Multiple stage routing • 𝑌: input data vector, 𝑍: permuted data vector • 𝑝: data parallelism

  21. An Example

  22. Resource Consumption Summary

  23. Outline • Introduction • Background and Related Work • High Throughput and Energy Efficient Design • Experimental Results • Conclusion and Future Work

  24. Performance Metrics • Throughput: the number of bits permuted per second (Gbits/s); the product of the number of data elements permuted per second and the data width per element • Energy efficiency: the number of bits permuted per unit of energy consumed (Gbits/Joule); calculated as throughput divided by average power consumption
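These two definitions reduce to one-line formulas. The clock frequency, data width, and power figures below are made-up illustration values, not results from the paper.

```python
# Throughput = elements/sec * bits/element; energy efficiency =
# throughput / average power.

def throughput_gbits(elems_per_cycle, freq_mhz, data_width_bits):
    """Throughput in Gbits/s."""
    return elems_per_cycle * freq_mhz * 1e6 * data_width_bits / 1e9

def energy_eff_gbits_per_joule(throughput_gbits_s, avg_power_w):
    """Energy efficiency in Gbits/Joule."""
    return throughput_gbits_s / avg_power_w

t = throughput_gbits(8, 300, 16)         # 8 elems/cycle @ 300 MHz, 16-bit
e = energy_eff_gbits_per_joule(t, 2.0)   # at a hypothetical 2 W average
```

Note that energy efficiency improves both when throughput rises and when power drops, so the two metrics reported in the results are not independent.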

  25. Experimental Setup • Platform and tools: Xilinx Virtex-7 XC7VX980T, speed grade -2L; Xilinx Vivado 2014.2 and Vivado Power Analyzer • Input vectors for simulation: randomly generated with an average toggle rate of 25% (a pessimistic estimate) • Performance metrics: resource consumption, throughput, energy efficiency

  26. Experimental Results (1) • BRAM consumption of the proposed design (for various 𝑝) • Theoretical memory requirement: reduced by 50% and 75% • Amount of BRAM18: reduced by up to 50% compared with the state-of-the-art

  27. Experimental Results (2) • BRAM consumption of the proposed design (for various 𝑁) • Theoretical memory requirement: reduced by 50% and 75% • Amount of BRAM18: reduced by up to 50% compared with the state-of-the-art

  28. Experimental Results (3) • LUT consumption of the proposed design (for various 𝑝) • 22.1%~67.2% fewer LUTs compared with [1] • 59.1%~96.4% fewer LUTs compared with [6]

  29. Experimental Results (4) • LUT consumption of the proposed design (for various 𝑁) • 6.6%~65.4% fewer LUTs compared with [1] • 59.1%~96.4% fewer LUTs compared with [6]

  30. Experimental Results (5) • Throughput of the proposed design • Up to 78% throughput improvement compared with [1] • Up to 3.4x throughput compared with [6]

  31. Experimental Results (6) • Energy efficiency comparison • 2.1x~3.3x energy efficiency improvement compared with the state-of-the-art in [6]

  32. Conclusion and Future Work • Conclusion: a streaming data permutation architecture, scalable with data parallelism and problem size; efficient data permutation realization; a "programmable" data permutation engine; high throughput and resource efficient • Future work: a design framework for automatic application-specific energy efficiency and performance optimizations on FPGA

  33. Thanks! Questions? renchen@usc.edu (Ren Chen) Ganges.usc.edu/wiki/TAPAS
