Automatic Generation of High Throughput Energy Efficient Streaming - PowerPoint PPT Presentation

Automatic Generation of High Throughput Energy Efficient Streaming Architectures for Arbitrary Fixed Permutations
Ren Chen, Viktor K. Prasanna, Ming Hsieh Department of Electrical Engineering
Presented by: Ajitesh Srivastava, Department of Computer Science
University of Southern California
Ganges.usc.edu/wiki/TAPAS

Outline
• Introduction
• Background and Related Work
• High Throughput and Energy Efficient Design
• Experimental Results
• Conclusion and Future Work

Permutation
• A permutation can be represented as y = P_m ∙ x
• m is the size of the vectors x and y
• The m × m bit matrix P_m is called a permutation matrix
[Figure: inputs x_0, x_1, x_2, x_3 mapped to outputs y_0, y_1, y_2, y_3]
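As a small illustration of the matrix-vector form on this slide, the sketch below builds a permutation bit matrix and applies it to a four-element vector. The function names and the convention that `perm[i]` is the destination index of input element `i` are assumptions made for this example, not notation from the slides.

```python
from typing import List, Sequence


def permutation_matrix(perm: Sequence[int]) -> List[List[int]]:
    """Build the bit matrix with matrix[perm[i]][i] = 1 (assumed convention)."""
    n = len(perm)
    if sorted(perm) != list(range(n)):
        raise ValueError("perm must contain each index 0..n-1 exactly once")
    matrix = [[0] * n for _ in range(n)]
    for src, dst in enumerate(perm):
        matrix[dst][src] = 1
    return matrix


def apply_matrix(matrix: Sequence[Sequence[int]], vec: Sequence[int]) -> List[int]:
    """Plain matrix-vector product; every output row selects exactly one input."""
    return [sum(bit * value for bit, value in zip(row, vec)) for row in matrix]


perm = [2, 0, 3, 1]            # illustrative permutation on four elements
matrix = permutation_matrix(perm)
result = apply_matrix(matrix, [10, 20, 30, 40])
print(result)                  # [20, 40, 10, 30]
```

Every row and every column of the matrix holds a single 1, which is why the product only moves elements and never combines them.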

Related Applications
• Key algorithms: FFT, sorting, Viterbi decoding, etc.
• Application examples: frequency-domain audio analysis, partial differential equations in images, bitonic sort, image filtering, OFDM system
[Figure: dataflow from input to output built from permutation blocks Q_8, P_{8,4}, P_{8,2}, and P_{4,2}]

Data Permutation in Conventional Architectures
• Permutation by wires
• Permutation by memory or registers
• Parallel architecture
• Pipeline architecture
• Shared memory architecture
[Figure: (a) parallel architecture, (b) shared memory architecture with memory banks 1 to r, (c) pipeline architecture, each built from processing elements between input and output]

Data Permutation in Streaming Architectures
• Streaming architecture
  • High data parallelism
  • High design throughput
  • Simple control scheme
  • No requirement on data input/output order

Problem Definition
• Permute streaming data with a fixed data parallelism
• Input/output: in a streaming manner and at a fixed rate
• Data parallelism p: number of inputs processed each cycle per computation stage
• Streaming permutation: permutation between adjacent computation stages
• Processing elements: computation units for a given application
[Figure: FPGA with stream input and stream output, p-wide stages of processing elements separated by streaming permutations, and a memory interface to external memory]
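The streaming interface defined on this slide can be modelled in software as a generator that consumes one fixed-width word per cycle and emits permuted words at the same rate. The following is only a behavioural sketch: it buffers a complete frame before emitting anything, which is not the memory-efficient hardware scheme proposed in the talk. The name `stream_permute`, the parameter `width` (standing in for the data parallelism), and the bit-reversal test pattern are assumptions for this example.

```python
from typing import Iterable, Iterator, List, Sequence


def stream_permute(words: Iterable[Sequence[int]],
                   perm: Sequence[int],
                   width: int) -> Iterator[List[int]]:
    """Permute frames of len(perm) elements that arrive `width` at a time."""
    frame_size = len(perm)
    if frame_size % width != 0:
        raise ValueError("frame size must be a multiple of the word width")
    frame: List[int] = []
    for word in words:
        if len(word) != width:
            raise ValueError("every input word must carry `width` elements")
        frame.extend(word)
        if len(frame) == frame_size:
            # perm[src] is the destination index of element src (assumed).
            permuted = [0] * frame_size
            for src, value in enumerate(frame):
                permuted[perm[src]] = value
            # Emit the permuted frame again as fixed-width words.
            for start in range(0, frame_size, width):
                yield permuted[start:start + width]
            frame = []


bit_reversal = [0, 4, 2, 6, 1, 5, 3, 7]      # 3-bit index reversal
inputs = [[0, 1], [2, 3], [4, 5], [6, 7]]    # four cycles, two elements each
outputs = list(stream_permute(inputs, bit_reversal, width=2))
print(outputs)  # [[0, 4], [2, 6], [1, 5], [3, 7]]
```

Feeding several frames back to back produces the permuted frames back to back, which mirrors the continuous-stream requirement in the problem statement.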

Related Work (JVSP ’07, T. Jarvinen)
• For stride permutation on array processor
• Flexible data parallelism
• Mathematical formulation

Related Work (DAC ’12, M. Zuluaga and M. Püschel)
• Domain-specific language based
• Hardware generator for data permutations in sorting

Proposed Design Approach
• Drawbacks of the state of the art
  • Only supports specific permutation patterns
  • Design scalability needs to be improved
  • Not memory efficient
  • No efficient control logic
• We propose a mapping approach to obtain a streaming permutation architecture
  • Utilizes the Benes network for building the datapath and generating control logic
  • Highly optimized with respect to memory efficiency and interconnection complexity
  • Scalable with problem size N and data parallelism p
  • Supports processing continuous data streams
  • Design automation tool

Benes Network
• Multistage network to realize all n! permutations
• Rearrangeably non-blocking
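For a sense of scale, a Benes network on 2^k ports is conventionally built from 2k − 1 stages of 2 × 2 switches, with half as many switches per stage as there are ports. The sketch below computes those counts and checks the necessary condition that the switches offer at least as many settings as there are permutations. The function name `benes_size` is an assumption for this example, and the counts describe the unfolded textbook network rather than the folded datapath of the proposed design.

```python
import math


def benes_size(num_ports: int) -> tuple:
    """Return (stages, switches) of a Benes network on num_ports = 2**k ports."""
    if num_ports < 2 or num_ports & (num_ports - 1):
        raise ValueError("num_ports must be a power of two, at least 2")
    k = num_ports.bit_length() - 1          # k = log2(num_ports)
    stages = 2 * k - 1
    switches = stages * (num_ports // 2)    # num_ports/2 switches per stage
    return stages, switches


for ports in (2, 4, 8, 16):
    stages, switches = benes_size(ports)
    # Each 2x2 switch has two settings, so 2**switches configurations exist;
    # realizing every permutation needs at least ports! of them.
    assert 2 ** switches >= math.factorial(ports)
    print(ports, stages, switches)
```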

Architecture Overview
• Parameterized architecture
  • Problem size N
  • Data parallelism p
• Memory based
  • p independent memory blocks
  • Each of size N/p
• p-to-p connection network
  • (p/2) log p 2 × 2 switches
  • Optimal compared with the state of the art
• Highly optimized control unit

Proposed Mapping Approach
• Vertically fold the Benes network
• Build a three-stage datapath
• Divide-and-conquer based method
• For a fixed data parallelism p
• Support continuous data streams

Automatic Generation of the Datapath (1)
• GDP(N, p): Generating Datapath
• N: problem size, p: data parallelism
• A: upper part of datapath, B: lower part of datapath

Automatic Generation of the Datapath (2)

Automatic Generation of Control Logic (1)
• Configuration bits of switch in different states

Automatic Generation of Control Logic (2)
• Single-stage routing
  • X: input data vector
  • Y: permuted data vector
  • π: mapping from X to Y
  • X′: output data vector of input switches
  • Y′: input data vector of output switches

Automatic Generation of Control Logic (3)
• Multiple-stage routing
  • X: input data vector
  • Y: permuted data vector
  • p: data parallelism
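One standard way to derive switch settings for a Benes network stage by stage is the classical looping algorithm: fix the outer input and output switches, then recurse into the two half-size subnetworks. The slides do not state whether the tool uses exactly this procedure, so the sketch below should be read as a textbook illustration on an unfolded network. The encoding (0 = straight, 1 = cross), the convention that `perm[i]` is the destination of input `i`, the power-of-two size, and the names `route` and `apply` are assumptions for this example.

```python
import random
from typing import List, Sequence


def route(perm: Sequence[int]):
    """Return switch settings (0 = straight, 1 = cross; assumed encoding)."""
    n = len(perm)
    if n == 2:
        return perm[0]                      # one switch: cross iff 0 -> 1
    inv = [0] * n
    for src, dst in enumerate(perm):
        inv[dst] = src
    # Looping step: pick the upper (0) or lower (1) half-size subnetwork for
    # every input so that switch partners never share a subnetwork.
    sub: List = [None] * n
    for start in range(n):
        src = start
        while sub[src] is None:
            sub[src] = 0
            partner = inv[perm[src] ^ 1]    # shares an output switch with src
            sub[partner] = 1
            src = partner ^ 1               # shares an input switch with partner
    in_bits = [sub[2 * s] for s in range(n // 2)]
    out_bits = [sub[inv[2 * s]] for s in range(n // 2)]
    upper, lower = [0] * (n // 2), [0] * (n // 2)
    for src, dst in enumerate(perm):
        (lower if sub[src] else upper)[src // 2] = dst // 2
    return in_bits, route(upper), route(lower), out_bits


def apply(config, data: Sequence) -> list:
    """Push data through the switches described by config."""
    if len(data) == 2:
        return [data[1], data[0]] if config else list(data)
    in_bits, upper_cfg, lower_cfg, out_bits = config
    up, low = [], []
    for s, bit in enumerate(in_bits):
        pair = (data[2 * s], data[2 * s + 1])
        up.append(pair[bit])                # element sent to the upper half
        low.append(pair[1 - bit])           # element sent to the lower half
    up, low = apply(upper_cfg, up), apply(lower_cfg, low)
    out = []
    for s, bit in enumerate(out_bits):
        pair = (up[s], low[s])
        out += [pair[bit], pair[1 - bit]]
    return out


random.seed(0)
perm = list(range(8))
random.shuffle(perm)
routed = apply(route(perm), list("abcdefgh"))
assert all(routed[perm[i]] == "abcdefgh"[i] for i in range(8))
```

The first level of `route` corresponds to routing a single outer stage, and the recursive calls extend it to all stages. The final assertion checks that simulating the configured switches reproduces the requested permutation.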

An Example

Resource Consumption Summary

Outline
• Introduction
• Background and Related Work
• High Throughput and Energy Efficient Design
• Experimental Results
• Conclusion and Future Work

Performance metrics
• Throughput
  • Defined as the number of bits permuted per second (Gbits/s)
  • Product of the number of data elements permuted per second and the data width per element
• Energy efficiency
  • Defined as the number of bits permuted per unit energy consumption (Gbits/Joule)
  • Calculated as the throughput divided by the average power consumption
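The two definitions on this slide reduce to simple arithmetic, sketched below. The operating point (eight elements per cycle, a 250 MHz clock, 16-bit elements, 2 W average power) is purely hypothetical and is not a measurement from the talk. The function names, and the assumption that one full-width word is permuted every clock cycle, are likewise introduced only for this example.

```python
def throughput_gbits_per_s(elements_per_second: float, bits_per_element: int) -> float:
    """Bits permuted per second, expressed in Gbits/s."""
    return elements_per_second * bits_per_element / 1e9


def energy_efficiency_gbits_per_joule(throughput: float, avg_power_watts: float) -> float:
    """Throughput divided by average power; (Gbits/s) / W equals Gbits/Joule."""
    return throughput / avg_power_watts


# Hypothetical operating point, chosen only to exercise the formulas.
parallelism = 8            # elements per cycle
clock_hz = 250e6           # clock frequency
data_width = 16            # bits per element
avg_power_watts = 2.0      # average power

# Assumes one full-width word is permuted every clock cycle.
elements_per_second = parallelism * clock_hz
throughput = throughput_gbits_per_s(elements_per_second, data_width)
efficiency = energy_efficiency_gbits_per_joule(throughput, avg_power_watts)
print(throughput, efficiency)  # 32.0 16.0
```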

Experimental Setup
• Platform and tools
  • Xilinx Virtex-7 XC7VX980T, speed grade -2L
  • Xilinx Vivado 2014.2 and Vivado Power Analyzer
• Input vectors for simulation
  • Randomly generated with an average toggle rate of 25% (pessimistic estimation)
• Performance metrics
  • Resource consumption
  • Throughput
  • Energy efficiency

Experimental Results (1)
• BRAM consumption of the proposed design (for various p)
  • Theoretical memory requirement: reduced by 50% and 75%
  • Amount of BRAM18: reduced by up to 50% compared with the state of the art

Experimental Results (2)
• BRAM consumption of the proposed design (for various N)
  • Theoretical memory requirement: reduced by 50% and 75%
  • Amount of BRAM18: reduced by up to 50% compared with the state of the art

Experimental Results (3)
• LUT consumption of the proposed design (for various p)
  • 22.1% to 67.2% fewer LUTs compared with [1]
  • 59.1% to 96.4% fewer LUTs compared with [6]

Experimental Results (4)
• LUT consumption of the proposed design (for various N)
  • 6.6% to 65.4% fewer LUTs compared with [1]
  • 59.1% to 96.4% fewer LUTs compared with [6]

Experimental Results (5)
• Throughput performance of the proposed design
• Our designs achieve
  • Up to 78% throughput improvement compared with [1]
  • Up to 3.4x throughput compared with [6]

Experimental Results (6)
• Energy efficiency comparison
  • 2.1x to 3.3x energy efficiency improvement compared with the state of the art in [6]

Conclusion and Future Work
• Conclusion
  • Streaming data permutation architecture
  • Scalable with data parallelism and problem size
  • Efficient data permutation realization
  • “Programmable” data permutation engine
  • High throughput and resource efficient
• Future work
  • Design framework for automatic application-specific energy efficiency and performance optimizations on FPGA

Thanks! Questions?
renchen@usc.edu (Ren Chen)
Ganges.usc.edu/wiki/TAPAS
