Automatic Generation of High Throughput Energy Efficient Streaming Architectures for Arbitrary Fixed Permutations
Ren Chen, Viktor K. Prasanna, Ming Hsieh Department of Electrical Engineering
Presented by: Ajitesh Srivastava, Department of Computer Science
University of Southern California
Ganges.usc.edu/wiki/TAPAS
Outline
• Introduction
• Background and Related Work
• High Throughput and Energy Efficient Design
• Experimental Results
• Conclusion and Future Work
Permutation
• A permutation can be represented as y = P_n · x
• n is the size of the vectors x and y
• The n × n bit matrix P_n is called a permutation matrix
[Figure: inputs x_0, x_1, x_2, x_3 permuted to outputs y_0, y_1, y_2, y_3]
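The matrix form above can be sketched in a few lines of Python. This is an illustration only: the helper names and the example mapping are assumptions, not taken from the slides.

```python
def permutation_matrix(mapping):
    """Build an n x n 0/1 matrix whose row i has a single 1 in column mapping[i]."""
    n = len(mapping)
    assert sorted(mapping) == list(range(n)), "mapping must be a permutation of 0..n-1"
    return [[1 if col == mapping[row] else 0 for col in range(n)] for row in range(n)]


def apply_matrix(matrix, vector):
    """Plain matrix-vector product: output[i] = sum_j matrix[i][j] * vector[j]."""
    return [sum(bit * value for bit, value in zip(row, vector)) for row in matrix]


# Hypothetical example: output i takes the input at index mapping[i].
mapping = [2, 0, 3, 1]
perm = permutation_matrix(mapping)
inputs = [10, 20, 30, 40]
outputs = apply_matrix(perm, inputs)
print(outputs)
```

Every row and every column of such a matrix contains exactly one 1, which is what makes the product a pure reordering of the input vector.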
Related Applications
• Key algorithms: FFT, sorting, Viterbi decoding, etc.
[Figure panels: frequency-domain audio analysis; partial differential equations in images; bitonic sort (permutation stages Q8, P8,4, P8,2, P4,2 between input and output); image filtering; OFDM system]
Data Permutation in Conventional Architectures
• Parallel architecture
• Pipeline architecture
• Shared memory architecture
• Permutation is realized by wires, or by memory or registers
[Figure: (a) parallel architecture, (b) shared memory architecture with shared memory banks 1 to r, (c) pipeline architecture; each is built from processing elements between input and output]
Data Permutation in Streaming Architectures
• Streaming architecture
• High data parallelism
• High design throughput
• Simple control scheme
• No requirement on data input/output order
Problem Definition
• Permute streaming data with a fixed data parallelism
• Input/output: in a streaming manner and at a fixed rate
• Data parallelism p: number of inputs processed each cycle per computation stage
• Streaming permutation: permutation between adjacent computation stages
• Processing elements: computation units for a given application
[Figure: FPGA with stream input and stream output, processing elements alternating with streaming permutation stages, and a memory interface to external memory]
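A behavioral reference model may help make the streaming view concrete: a frame is consumed and produced as fixed-width groups, one group per cycle. This sketch only models the input/output behavior and says nothing about the hardware; the function names and the example mapping are assumptions.

```python
def to_stream(frame, parallelism):
    """Split a frame into per-cycle groups of `parallelism` elements."""
    assert len(frame) % parallelism == 0, "frame length must be a multiple of the parallelism"
    return [frame[i:i + parallelism] for i in range(0, len(frame), parallelism)]


def streaming_permute(frame, mapping, parallelism):
    """Reference behavior: output element i is frame[mapping[i]], emitted group by group."""
    assert sorted(mapping) == list(range(len(frame))), "mapping must be a permutation"
    permuted = [frame[src] for src in mapping]
    return to_stream(permuted, parallelism)


# Hypothetical example: 8 elements, 2 elements per cycle, so 4 cycles per frame.
frame = list(range(8))
mapping = [0, 4, 2, 6, 1, 5, 3, 7]
for cycle, group in enumerate(streaming_permute(frame, mapping, parallelism=2)):
    print(cycle, group)
```

Note that an output group can depend on inputs that arrive in later cycles, which is why a streaming design needs buffering between stages.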
Related Work (JVSP ’07, T. Jarvinen)
• For stride permutation on array processor
• Flexible data parallelism
• Mathematical formulation
Related Work (DAC ’12, M. Zuluaga and M. Püschel)
• Domain-specific language based
• Hardware generator for data permutations in sorting
Proposed Design Approach
• Drawbacks of the state of the art
  • Only supports specific permutation patterns
  • Design scalability needs to be improved
  • Not memory efficient
  • No efficient control logic
• We propose a mapping approach to obtain a streaming permutation architecture
  • Utilizes a Benes network for building the datapath and generating control logic
  • Highly optimized with respect to memory efficiency and interconnection complexity
  • Scalable with problem size N and data parallelism p
  • Supports processing continuous data streams
  • Design automation tool
Benes Network
• Multistage network to realize all n! permutations
• Rearrangeably non-blocking
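For a sense of scale, the size of a standard (unfolded) Benes network can be computed directly. The sketch below assumes the textbook construction for a power-of-two number of inputs, with 2·log2(n) − 1 stages of n/2 two-by-two switches each; it does not model the folded datapath proposed in this talk, and the function name is an assumption.

```python
import math


def benes_size(num_inputs):
    """Return (stages, switches) for a textbook Benes network on num_inputs inputs."""
    assert num_inputs >= 2 and num_inputs & (num_inputs - 1) == 0, "power of two required"
    log_n = num_inputs.bit_length() - 1          # exact log2 for powers of two
    stages = 2 * log_n - 1
    switches = stages * (num_inputs // 2)
    return stages, switches


for n in (2, 4, 8, 16):
    stages, switches = benes_size(n)
    # Each switch has two states, so there are 2**switches settings for n! permutations.
    print(n, stages, switches, 2 ** switches >= math.factorial(n))
```

Because the number of switch settings is at least n!, there is room for every permutation, which is consistent with the rearrangeably non-blocking property.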
Architecture Overview
• Parameterized architecture
  • Problem size N
  • Data parallelism p
• Memory based
  • p independent memory blocks
  • Each of size N/p
• p-to-p connection network
  • (p/2) log p 2 × 2 switches
  • Optimal compared with the state of the art
• Highly optimized control unit
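The memory parameters listed above follow from the two design inputs. A minimal sketch, assuming the problem size is a multiple of the data parallelism and that one group of elements enters per cycle; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchitectureParameters:
    problem_size: int   # number of elements in one permuted frame
    parallelism: int    # number of elements processed per cycle

    def __post_init__(self):
        if self.parallelism <= 0 or self.problem_size % self.parallelism != 0:
            raise ValueError("problem size must be a positive multiple of the parallelism")

    @property
    def memory_blocks(self):
        """One independent memory block per parallel lane."""
        return self.parallelism

    @property
    def words_per_block(self):
        """Each block holds problem_size / parallelism elements."""
        return self.problem_size // self.parallelism

    @property
    def cycles_per_frame(self):
        """Cycles needed to stream one frame in (assumes one group per cycle)."""
        return self.problem_size // self.parallelism


params = ArchitectureParameters(problem_size=1024, parallelism=8)
print(params.memory_blocks, params.words_per_block, params.cycles_per_frame)
```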
Proposed Mapping Approach
• Vertically fold the Benes network
• Build a three-stage datapath
• Divide-and-conquer based method
• For a fixed data parallelism p
• Support continuous data streams
Automatic Generation of the Datapath (1)
• GDP(N, p): Generating Datapath
• N: problem size, p: data parallelism
• A: upper part of datapath, B: lower part of datapath
Automatic Generation of the Datapath (2)
Automatic Generation of Control Logic (1)
• Configuration bits of a switch in different states
Automatic Generation of Control Logic (2)
• Single Stage Routing
• X: input data vector
• Y: permuted data vector
• π: mapping from X to Y
• X′: output data vector of input switches
• Y′: input data vector of output switches
Automatic Generation of Control Logic (3)
• Multiple Stage Routing
• X: input data vector
• Y: permuted data vector
• p: data parallelism
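The slides do not spell out the routing procedure, so the sketch below uses the classic looping algorithm for an unfolded Benes network as a stand-in: one pass fixes the input and output switches of a stage, then the two half-size subnetworks are routed recursively. It is not claimed to be the authors' control-logic generator, and it ignores the folding and the memory blocks. All function names and the data layout of the configuration are assumptions.

```python
def route_benes(perm):
    """Switch settings for a Benes network realizing out[o] = in[perm[o]].

    For 2 inputs the result is a bool (True = crossed). Otherwise it is a tuple
    (input_cross, upper_config, lower_config, output_cross).
    """
    n = len(perm)
    assert n >= 2 and n & (n - 1) == 0, "size must be a power of two"
    assert sorted(perm) == list(range(n)), "perm must be a permutation"
    if n == 2:
        return perm[0] == 1

    inverse = [0] * n                      # inverse[i] = output that reads input i
    for out, src in enumerate(perm):
        inverse[src] = out

    # subnet[i] = 0 if input i goes through the upper subnetwork, 1 if lower.
    subnet = [None] * n
    for start in range(n):
        i = start
        while subnet[i] is None:
            subnet[i] = 0
            subnet[i ^ 1] = 1              # sibling input must use the other subnetwork
            # The output fed by input i^1 uses the lower subnetwork, so its sibling
            # output must be fed through the upper one: follow that constraint.
            i = perm[inverse[i ^ 1] ^ 1]

    half = n // 2
    input_cross = [subnet[2 * s] == 1 for s in range(half)]
    output_cross = [subnet[perm[2 * j]] == 1 for j in range(half)]
    upper_perm, lower_perm = [0] * half, [0] * half
    for out, src in enumerate(perm):
        if subnet[src] == 0:
            upper_perm[out // 2] = src // 2
        else:
            lower_perm[out // 2] = src // 2
    return (input_cross, route_benes(upper_perm), route_benes(lower_perm), output_cross)


def simulate_benes(config, data):
    """Push data through the configured network and return the output vector."""
    n = len(data)
    if n == 2:
        return [data[1], data[0]] if config else list(data)
    input_cross, upper_config, lower_config, output_cross = config
    upper_in = [data[2 * s + (1 if input_cross[s] else 0)] for s in range(n // 2)]
    lower_in = [data[2 * s + (0 if input_cross[s] else 1)] for s in range(n // 2)]
    upper_out = simulate_benes(upper_config, upper_in)
    lower_out = simulate_benes(lower_config, lower_in)
    result = []
    for j in range(n // 2):
        pair = [upper_out[j], lower_out[j]]
        result.extend(reversed(pair) if output_cross[j] else pair)
    return result


# Hypothetical example: route an arbitrary 8-element permutation and check it.
perm = [3, 0, 6, 5, 1, 7, 2, 4]
data = list("abcdefgh")
routed = simulate_benes(route_benes(perm), data)
print(routed == [data[src] for src in perm])
```

The simulation is included so the computed switch settings can be checked against the intended mapping for any permutation of power-of-two size.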
An Example
Resource Consumption Summary
Performance Metrics
• Throughput
  • Defined as the number of bits permuted per second (Gbits/s)
  • Product of the number of data elements permuted per second and the data width per element
• Energy efficiency
  • Defined as the number of bits permuted per unit of energy consumed (Gbits/Joule)
  • Calculated as the throughput divided by the average power consumption
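Both metrics reduce to simple arithmetic. The sketch below uses made-up numbers purely to show the units; it assumes one group of elements is permuted per clock cycle, so elements per second equals data parallelism times clock frequency. The function names and the sample values are hypothetical, not measured results.

```python
def throughput_gbits_per_s(elements_per_second, data_width_bits):
    """Bits permuted per second, expressed in Gbits/s."""
    return elements_per_second * data_width_bits / 1e9


def energy_efficiency_gbits_per_joule(throughput_gbits, average_power_watts):
    """Throughput divided by average power: Gbits/s per W equals Gbits/Joule."""
    return throughput_gbits / average_power_watts


# Illustrative values only: 16 elements per cycle at 250 MHz, 32-bit elements, 2 W.
elements_per_second = 16 * 250e6
throughput = throughput_gbits_per_s(elements_per_second, data_width_bits=32)
efficiency = energy_efficiency_gbits_per_joule(throughput, average_power_watts=2.0)
print(throughput, efficiency)
```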
Experimental Setup
• Platform and tools
  • Xilinx Virtex-7 XC7VX980T, speed grade -2L
  • Xilinx Vivado 2014.2 and Vivado Power Analyzer
• Input vectors for simulation
  • Randomly generated with an average toggle rate of 25% (pessimistic estimation)
• Performance metrics
  • Resource consumption
  • Throughput
  • Energy efficiency
Experimental Results (1)
• BRAM consumption of the proposed design (for various p)
• Theoretical memory requirement: reduced by 50% and 75%
• Amount of BRAM18: reduced by up to 50% compared with the state of the art
Experimental Results (2)
• BRAM consumption of the proposed design (for various N)
• Theoretical memory requirement: reduced by 50% and 75%
• Amount of BRAM18: reduced by up to 50% compared with the state of the art
Experimental Results (3)
• LUT consumption of the proposed design (for various p)
• 22.1% to 67.2% fewer LUTs compared with [1]
• 59.1% to 96.4% fewer LUTs compared with [6]
Experimental Results (4)
• LUT consumption of the proposed design (for various N)
• 6.6% to 65.4% fewer LUTs compared with [1]
• 59.1% to 96.4% fewer LUTs compared with [6]
Experimental Results (5)
• Throughput performance of the proposed design
• Our designs achieve
  • Up to 78% throughput improvement compared with [1]
  • Up to 3.4x throughput compared with [6]
Experimental Results (6)
• Energy efficiency comparison
• 2.1x to 3.3x energy efficiency improvement compared with the state of the art in [6]
Conclusion and Future Work
• Conclusion
  • Streaming data permutation architecture
  • Scalable with data parallelism and problem size
  • Efficient data permutation realization
  • “Programmable” data permutation engine
  • High throughput and resource efficient
• Future work
  • Design framework for automatic application-specific energy efficiency and performance optimizations on FPGA
Thanks! Questions?
renchen@usc.edu (Ren Chen)
Ganges.usc.edu/wiki/TAPAS