Lecture 2 (I): Pipelining & Retiming

張錫嘉 Hsie-Chia Chang
E-mail: hcchang@mail.nctu.edu.tw
Fall 2006

Outline
• Pipelining of FIR Digital Filters
  – Data-Broadcast Structures
  – Fine-Grain Pipelining
• Parallel Processing
• Pipelining and Parallel Processing for Low Power
• Retiming
  – Definitions and Properties
  – Solving Systems of Inequalities
  – Retiming Techniques
    • Cutset Retiming & Pipelining
    • Retiming for Clock Period Minimization
    • Retiming for Register Minimization
Optimized Application-Specific Integrated Systems

Introduction
  – If a real-time application requires a faster input rate, the critical path can be reduced by either pipelining or parallel processing

Pipelining & Parallel Processing (1/2)
• Pipelining
  – Reduces the effective critical path by introducing pipelining latches along the critical datapath
  – Without any pipelining latches, the critical path can be reduced by
• Parallel processing
  – Increases the sampling rate by replicating hardware so that several inputs are processed in parallel and several outputs are produced at the same time
• These techniques apply to non-recursive computations
• [Figure labels: "continue sending"; T_sample = T_CLK (pipelining); T_sample ≠ T_CLK (parallel processing)]

Pipelining & Parallel Processing (2/2)
Example 2: [figure]

Pipelining of FIR Digital Filters
  – Critical path: $T_{critical} = T_M + T_A$ (multiplication time plus addition time)
  – [Table: Schedule of Events in the Pipelined FIR Filter]

Cutset Pipelining (1/2)
• The speed is limited by the longest path between
  – any two latches
  – an input and a latch
  – a latch and an output
  – the input and the output
• 2-level pipelined structure
  – The longest path can be reduced by suitably placing pipelining latches in the architecture
  – In this system, at any time, 2 consecutive outputs are computed in an interleaved manner
  – Drawbacks
    • Increase in the number of latches
    • Increase in system latency

Cutset Pipelining (2/2)
• Cutset
  – A set of edges of a graph such that removing these edges makes the graph disjoint
• Feed-forward cutset
  – We can arbitrarily place latches on a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm
  – [Figure: feed-forward cutset between subgraphs G1 and G2, with k delays (+kD) added on each cutset edge]
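
The feed-forward cutset claim can be checked numerically. The sketch below is an illustration, not the slide's figure: it assumes a 3-tap filter y(n) = a·x(n) + b·x(n−1) + c·x(n−2) and places one latch on each of two cutset edges (the partial sum and the x(n−2) line). The function names and the latch placement are assumptions made for this example. Under these assumptions each stage contains at most one multiplication and one addition, and the pipelined output equals the original output delayed by one sample.

```python
def fir_direct(x, a, b, c):
    """Direct-form 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    y = []
    x1 = x2 = 0  # delay-line contents x(n-1) and x(n-2)
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn
    return y


def fir_pipelined(x, a, b, c):
    """Same filter with one latch on each edge of an assumed feed-forward cutset."""
    y = []
    x1 = x2 = 0
    latch_sum = latch_x2 = 0  # pipelining latches, initially zero
    for xn in x:
        # Stage 2 uses the values latched during the previous cycle.
        y.append(latch_sum + c * latch_x2)
        # Stage 1 computes the partial sum and latches it together with x(n-2).
        latch_sum = a * xn + b * x1
        latch_x2 = x2
        x2, x1 = x1, xn
    return y


x = [1, 2, 3, 4, 5, 6]
y_ref = fir_direct(x, 2, 3, 5)
y_pipe = fir_pipelined(x, 2, 3, 5)
# Functionality is unchanged apart from one extra cycle of latency.
assert y_pipe[0] == 0 and y_pipe[1:] == y_ref[:-1]
```

The only visible difference is the added latency, which is the price of the shorter critical path.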

Example 3.2.1

Data-Broadcast Structures

Fine-Grain Pipelining

Parallel Processing
• Parallel processing is also referred to as block processing
  – Block size = number of inputs processed in a clock cycle
  – For a 3-tap FIR filter, the duplicated hardware can be described as:
      $y(n) = a x(n) + b x(n-1) + c x(n-2)$
      $y(3k) = a x(3k) + b x(3k-1) + c x(3k-2)$
      $y(3k+1) = a x(3k+1) + b x(3k) + c x(3k-1)$
      $y(3k+2) = a x(3k+2) + b x(3k+1) + c x(3k)$
  – [Figure label: block delay]
• In MIMO,
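
The three block equations can be compared against the serial filter directly. The following is a minimal sketch, assuming zero initial conditions (x(n) = 0 for n < 0) and an input length that is a multiple of the block size 3. The function names are chosen for this example only.

```python
def fir_serial(x, a, b, c):
    """Serial 3-tap FIR: one output per clock cycle."""
    def at(n):
        return x[n] if n >= 0 else 0  # zero initial conditions (assumption)
    return [a * at(n) + b * at(n - 1) + c * at(n - 2) for n in range(len(x))]


def fir_3_parallel(x, a, b, c):
    """3-parallel (block size 3) FIR: three outputs per clock cycle k."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size 3")

    def at(n):
        return x[n] if n >= 0 else 0

    y = []
    for k in range(len(x) // 3):
        # The three expressions below are evaluated by three hardware copies.
        y0 = a * at(3 * k) + b * at(3 * k - 1) + c * at(3 * k - 2)
        y1 = a * at(3 * k + 1) + b * at(3 * k) + c * at(3 * k - 1)
        y2 = a * at(3 * k + 2) + b * at(3 * k + 1) + c * at(3 * k)
        y.extend([y0, y1, y2])
    return y


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
assert fir_3_parallel(x, 2, 3, 5) == fir_serial(x, 2, 3, 5)
```

Each pass through the loop stands for one clock cycle, so the sample period is one third of the clock period in this model.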

Complete Parallel Processing Systems
  – A serial-to-parallel converter
  – A parallel-to-serial converter

Why Use Parallel Processing?
• Communication bounded
  – When the critical path is less than $T_{communication}$, the I/O bound dominates and the system is communication bounded.
  – Pipelining can be used only to the extent that the critical path is limited by the communication bound.
  – Once this bound is reached, pipelining can no longer increase the speed.

Combined Pipelining & Parallel Processing
  – After combining M-level pipelining and L-level parallel processing,

CMOS Power Consumption (1/2)
• $P_{total} = P_{dynamic} + P_{short-circuit} + P_{static}$
• Short-circuit power: current spikes
• Static power: leakage current

CMOS Power Consumption (2/2)
• Based on a simple approximation and first-order analysis
  – Propagation delay: $T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
    • $C_{charge}$: the capacitance to be charged or discharged in a single clock cycle (along the critical path)
    • $V_0$, $V_t$: the supply voltage and the threshold voltage
    • $k$: a function of technology parameters
  – Power consumption: $P = C_{total} V_0^2 f$
    • $C_{total}$: the total capacitance of the CMOS circuit
    • $f$: clock frequency of the circuit
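
The two first-order formulas translate directly into code. The sketch below evaluates them for made-up component values: the capacitances, voltages, frequency, and k are assumptions chosen only to show the trend and are not taken from the lecture. Halving the supply voltage cuts power to one quarter but lengthens the propagation delay.

```python
def propagation_delay(c_charge, v0, vt, k):
    """First-order delay model: T_pd = C_charge * V0 / (k * (V0 - Vt)^2)."""
    if v0 <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c_charge * v0 / (k * (v0 - vt) ** 2)


def dynamic_power(c_total, v0, f):
    """First-order power model: P = C_total * V0^2 * f."""
    return c_total * v0 ** 2 * f


# Hypothetical values, for illustration only.
C_CHARGE, C_TOTAL, K, VT, F = 1e-12, 1e-9, 1e-3, 0.6, 1e8

p_full = dynamic_power(C_TOTAL, 5.0, F)
p_half = dynamic_power(C_TOTAL, 2.5, F)
t_full = propagation_delay(C_CHARGE, 5.0, VT, K)
t_half = propagation_delay(C_CHARGE, 2.5, VT, K)

print(f"power ratio at half supply: {p_half / p_full:.2f}")
print(f"delay ratio at half supply: {t_half / t_full:.2f}")
```

This trade-off is why pipelining and parallel processing help: they create timing slack that can absorb the longer delay at a lower supply voltage.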

Low Power Design
• To reduce
  – Capacitances
    • Transistor/gate C
    • Load C
    • Interconnects
    • External
  – Activity
  – Frequency
  – Power supply
• Other issues
  – Off-chip connections have high capacitive load
  – System integration

Pipelining for Low Power (1/2)
• For an M-level pipelined architecture,
  – the critical path is reduced to 1/M, and the capacitance to be charged/discharged in a single cycle ($C_{charge}$) is also reduced to 1/M
• If the same clock speed is maintained ($f = 1/T_{pd}$),
  – only 1/M of the non-pipelined capacitance has to be charged or discharged per cycle, which suggests voltage reduction
  – Suppose the voltage can be reduced to $\beta V_0$; the power consumption becomes
      $P_{pipelined} = C_{total} (\beta V_0)^2 f = \beta^2 P_{non-pipelined}$

Pipelining for Low Power (2/2)
  – Propagation delay of the original architecture: $T_{non-pipelined} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
  – Propagation delay of the pipelined architecture: $T_{pipelined} = \frac{(C_{charge}/M) \beta V_0}{k (\beta V_0 - V_t)^2}$
  – Setting the two delays equal gives the following quadratic equation, which can be solved for $\beta$:
      $M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$
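
The quadratic in β can be solved numerically. The sketch below expands M(βV0 − Vt)² = β(V0 − Vt)² into standard form and keeps the root for which the reduced supply βV0 stays above the threshold voltage. The values M = 2, V0 = 5 V, and Vt = 0.6 V are assumptions chosen for illustration and do not come from the slides.

```python
import math


def solve_beta(m, v0, vt):
    """Return the usable root of m*(beta*v0 - vt)^2 = beta*(v0 - vt)^2."""
    # Expanded form: qa*beta^2 + qb*beta + qc = 0
    qa = m * v0 ** 2
    qb = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    qc = m * vt ** 2
    disc = math.sqrt(qb ** 2 - 4 * qa * qc)
    roots = [(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)]
    # Only a root with beta*v0 > vt keeps the transistors switching.
    return max(r for r in roots if r * v0 > vt)


# Hypothetical operating point, for illustration only.
M, V0, VT = 2, 5.0, 0.6
beta = solve_beta(M, V0, VT)

print(f"beta            = {beta:.4f}")
print(f"reduced supply  = {beta * V0:.3f} V")
print(f"power ratio     = {beta ** 2:.3f}")  # P_pipelined / P_non-pipelined
```

The smaller root is discarded because it would put the supply below the threshold voltage, where the delay model no longer applies.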

Example 3.4.1: Reduce Power by Pipelining
• Consider the following two FIR filters.
  [Figure: the original FIR filter and a pipelined version in which each multiplier is split into two parts, m1 and m2; input x(n), output y(n)]
  – What is the supply voltage of the pipelined architecture if the clock periods are identical?
  – What is the relative power consumption?

Solution

Parallel Processing for Low Power (1/2)
• For an L-parallel architecture,
  – the charging capacitance remains the same, but the total capacitance ($C_{total}$) is increased L times
• To maintain the same sample rate,
  – the clock period is increased to $L T_{pd}$ (clock frequency $f/L$), which means $C_{charge}$ has L times longer to be charged or discharged
  – the supply voltage can be reduced to $\beta V_0$; the power consumption becomes
      $P_{parallel} = (L C_{total}) (\beta V_0)^2 \frac{f}{L} = \beta^2 P_{non-parallel}$

Parallel Processing for Low Power (2/2)
  – Propagation delay of the original architecture: $T_{non-parallel} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
  – Propagation delay of the parallel architecture: $L \cdot T_{non-parallel} = \frac{C_{charge} \beta V_0}{k (\beta V_0 - V_t)^2}$
  – Combining these two expressions gives the following quadratic equation, which can be solved for $\beta$:
      $L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$
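
For reference, the quadratic can be put in standard form and solved in closed form. This is a worked expansion of the equation above, not an additional result from the lecture. Only the larger root is kept, on the assumption that the reduced supply must satisfy $\beta V_0 > V_t$.

```latex
\begin{align*}
L(\beta V_0 - V_t)^2 &= \beta (V_0 - V_t)^2 \\
L V_0^2\,\beta^2 - \left[\,2 L V_0 V_t + (V_0 - V_t)^2\,\right]\beta + L V_t^2 &= 0 \\
\beta &= \frac{2 L V_0 V_t + (V_0 - V_t)^2
        + (V_0 - V_t)\sqrt{4 L V_0 V_t + (V_0 - V_t)^2}}{2 L V_0^2}
\end{align*}
```

The square root follows from the discriminant $\left[2 L V_0 V_t + (V_0 - V_t)^2\right]^2 - 4 L^2 V_0^2 V_t^2 = (V_0 - V_t)^2\left[4 L V_0 V_t + (V_0 - V_t)^2\right]$.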

Example 3.4.2: Reduce Power by Parallel Processing
• Consider the following two FIR filters, with the critical paths shown as dashed lines.
  [Figure: the original filter (input x(n), output y(n)) and its 2-parallel version (inputs x(2k), x(2k+1); outputs y(2k), y(2k+1))]
  – What is the supply voltage of the parallel architecture?
  – What is the relative power consumption?

Solution

Example 3.4.3
• Area-efficient architecture

Summary
• In pipelining & parallel processing,
  – M-level pipelining,
  – L-level parallel processing,
  – Combining M-level pipelining & L-level parallel processing,
• For low power design,
  – Pipelining
  – Parallel processing
  – Combining pipelining and parallel processing
