A Parallel Decimal Multiplier Using Hybrid Binary Coded Decimal - - PowerPoint PPT Presentation


A Parallel Decimal Multiplier Using Hybrid Binary Coded Decimal (BCD) Codes Xiaoping Cui, Weiqiang Liu* and Wenwen Dong College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China


slide-1
SLIDE 1

A Parallel Decimal Multiplier Using Hybrid Binary Coded Decimal (BCD) Codes

Xiaoping Cui, Weiqiang Liu* and Wenwen Dong

College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China

Fabrizio Lombardi

Department of Electrical and Computer Engineering, Northeastern University, Boston, USA

slide-2
SLIDE 2

Outline

  • Motivation
  • Review of BCD Representations and Decimal Multiplier
  • The Proposed Partial Product Tree
  • Evaluation and Comparison
  • Conclusions

slide-3
SLIDE 3

Motivation

Why Is Decimal Arithmetic Needed?

  • Binary arithmetic introduces conversion and rounding errors.
  • Decimal arithmetic is in high demand in many applications (financial, commercial and so on) that cannot tolerate such errors.
  • A decimal specification has been added to the revised IEEE 754-2008 standard.
  • High-performance decimal arithmetic circuits are required.
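The first point can be seen in a few lines of software. The following is a minimal illustration in Python (not part of the original slides), using the standard `decimal` module as a stand-in for decimal hardware:

```python
from decimal import Decimal

# 0.1 and 0.2 have no exact binary floating-point representation,
# so their binary sum is not exactly 0.3.
binary_sum = 0.1 + 0.2
binary_exact = (binary_sum == 0.3)

# The same sum in decimal arithmetic is exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_exact = (decimal_sum == Decimal("0.3"))

print(binary_exact, decimal_exact)
```

The binary comparison fails while the decimal one succeeds, which is the kind of error financial and commercial applications cannot accept.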

slide-4
SLIDE 4

Introduction

Partial Product Generation

  • Signed-digit (SD) radix-10 recoding
  • Redundant BCD excess-3 (XS-3)
  • Overloaded decimal digit set (ODDS) code
  • Double BCD recoding

[Figure: PPG block. The multiplicand X (BCD, 4d bits) feeds the generation of the multiples 1X to 5X as XS-3 digits in [-3,12]. The multiplier Y (BCD, 4d bits) feeds the radix-10 recoder, whose 6-bit outputs Yb0 … Ybd-1 (6d bits in total) drive the selection of multiples (MUX-5). The outputs are the partial products PP[0] … PP[d], as ODDS digits in [0,15].]

Partial Product Compression

  • Decimal 3:2 CSA
  • Binary compressor

Final Decimal Adder

  • Parallel prefix/carry-select adder

[Figure: the (d+1):2 PPR tree reduces the ODDS partial products (digits in [0,15]) to two 8d-bit operands, A (BCD XS-6) and B (BCD-8421), which the BCD adder (2d digits) sums into the product P (BCD).]

[7] A. Vazquez, E. Antelo, and J. Bruguera, “Fast Radix-10 Multiplication Using Redundant BCD Codes”, IEEE Transactions on Computers, vol. 63, no. 8, pp. 1902–1914, Aug. 2014.

slide-5
SLIDE 5

Partial Product Generation

  • SD Radix-10 recoding scheme

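SD Radix-10 Recoding

Each BCD digit Yi = yi,3 yi,2 yi,1 yi,0 is recoded into a one-hot magnitude y5i y4i y3i y2i y1i and a sign ysi, depending on the sign ysi-1 of the previous digit:

                        ysi-1 = 0                  ysi-1 = 1
yi,3 yi,2 yi,1 yi,0     y5i y4i y3i y2i y1i  ysi   y5i y4i y3i y2i y1i  ysi
0000                    00000                0     00001                0
0001                    00001                0     00010                0
0010                    00010                0     00100                0
0011                    00100                0     01000                0
0100                    01000                0     10000                0
0101                    10000                1     01000                1
0110                    01000                1     00100                1
0111                    00100                1     00010                1
1000                    00010                1     00001                1
1001                    00001                1     00000                1

\[
Yb_i =
\begin{cases}
Y_i, & Y_i < 5 \;\&\&\; Y_{i-1} < 5 \\
Y_i + 1, & Y_i < 5 \;\&\&\; Y_{i-1} \ge 5 \\
-(10 - Y_i), & Y_i \ge 5 \;\&\&\; Y_{i-1} < 5 \\
-(10 - Y_i) + 1, & Y_i \ge 5 \;\&\&\; Y_{i-1} \ge 5
\end{cases}
\]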
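The recoding rule can be checked with a short behavioural model. This is a sketch in Python, not the circuit from the slides; the function names `sd_radix10_recode` and `sd_value` are hypothetical, and the extra most significant digit is an assumption made so that the final sign is not lost (it corresponds to the additional partial product PP[d]).

```python
def sd_radix10_recode(y_digits):
    """Recode BCD digits (least significant first) into signed digits in [-5, 5].

    Returns len(y_digits) + 1 digits; the extra top digit carries the last sign.
    """
    recoded = []
    prev_sign = 0  # ys_{i-1}; no transfer into the least significant digit
    for y in y_digits:
        if not 0 <= y <= 9:
            raise ValueError("BCD digit expected")
        sign = 1 if y >= 5 else 0          # ys_i depends only on Y_i
        recoded.append(y + prev_sign - 10 * sign)
        prev_sign = sign
    recoded.append(prev_sign)              # extra digit: 0 or 1
    return recoded


def sd_value(digits):
    """Value of a least-significant-first digit vector in radix 10."""
    return sum(d * 10**i for i, d in enumerate(digits))


# 459 is given as [9, 5, 4], least significant digit first.
example = sd_radix10_recode([9, 5, 4])
print(example, sd_value(example))
```

Because the sign of digit i depends only on Y_i, no carry propagates along the operand, so all digits can be recoded in parallel.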

slide-6
SLIDE 6

Partial Product Generation

Redundant BCD Codes

slide-7
SLIDE 7

XS-3 Recoding (Redundant ODDS [0, 15])

Partial Product Generation

[Figure: generation and selection of the multiples 1X to 5X in XS-3. Step 1, digit mappings: each digit Xi in [0,9] of X (BCD, 4d bits) is mapped to N*Xi+3 in [3, 9N+3] (for ×2, ×3, ×4 and ×5) and split into Di and Ti. Step 2, carry assimilation: the Di and Ti-1 terms are added to give the digits NXi in [-3,12], plus a carry-out. The multiples 5X, 4X, 3X and 2X have 4(d+1) bits and 1X has 4d bits. A MUX-5, controlled by the SD radix-10 one-digit encoding (Y5k Y4k Y3k Y2k Y1k, with signs Ysk and Ysk-1) of Yk (BCD), selects among 5Xi, 4Xi, 3Xi, 2Xi and 1Xi; the selected XS-3 digit in [-3,12] becomes an ODDS digit in [0,15].]

The XS-3 digits are converted to ODDS by adding a pre-computed correction term: f_c(16) = 10^{32} + 07407407407407417037037037037037.

Advantage of XS-3 codes: difficult multiples (such as 3X) can be obtained in a carry-free manner.
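The carry-free property can be illustrated with a small model of the two steps. This is a sketch in Python under an assumed digit mapping (offset by 3, then split into a tens transfer and a units digit); the actual mapping used in the slides and in [7] may differ, and `xs3_multiple` is a hypothetical name.

```python
def xs3_multiple(x_digits, n):
    """Digits of n*X (least significant first), each in [-3, 12].

    Step 1: n*x_i + 3 = 10*t_i + u_i, so n*x_i = 10*t_i + d_i with d_i = u_i - 3.
    Step 2: output digit i is d_i + t_{i-1}; no carry travels further than
    one position, so every digit can be computed in parallel.
    """
    if n not in (1, 2, 3, 4, 5):
        raise ValueError("multiple must be 1..5")
    out = []
    t_prev = 0
    for x in x_digits:
        t, u = divmod(n * x + 3, 10)   # u in [0, 9], t in [0, 4]
        d = u - 3                      # d in [-3, 6]
        out.append(d + t_prev)         # in [-3, 10], inside [-3, 12]
        t_prev = t
    out.append(t_prev)                 # extra most significant digit
    return out


def digits_value(digits):
    return sum(d * 10**i for i, d in enumerate(digits))


three_x = xs3_multiple([9, 9], 3)          # 3 * 99
stored_nibbles = [d + 3 for d in three_x]  # XS-3 storage: value + 3, in [0, 15]
print(three_x, digits_value(three_x), stored_nibbles)
```

Each output digit depends only on the digits x_i and x_{i-1}, which is what makes a multiple such as 3X obtainable without a carry-propagate addition.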

slide-8
SLIDE 8

Decimal PP Compression Using ODDS

[Figure: multiplier block diagram. The PPG block (radix-10 recoder, generation of the multiples 1X to 5X in XS-3, selection of multiples with MUX-5) produces PP[0] … PP[d]; the (d+1):2 PPR tree reduces these ODDS digits in [0,15] to A (BCD XS-6) and B (BCD-8421); the BCD adder (2d digits) produces P (BCD).]

Partial Product Compression

The (d+1):2 PP reduction (PPR):

(1) A regular binary CSA tree.
(2) A binary counter is used to count the carries generated between the digit columns in the binary CSA tree.
(3) The ODDS partial products in (1) and (2) are added by the binary CSA tree and the decimal digit 3:2 compressor.
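Why the carries must be counted can be shown with simple arithmetic. The sketch below is a behavioural model in Python, not the gate-level tree; the function names are hypothetical. It assumes each ODDS digit is a 4-bit value in [0,15] sitting in a decimal digit column, so a carry out of the 4-bit column is worth 16 = 10 + 6: one unit for the next decimal column plus a correction of 6 in the current one.

```python
def reduce_odds_column(odds_digits):
    """Binary sum of one digit column: (4-bit result, number of carries out)."""
    if any(not 0 <= d <= 15 for d in odds_digits):
        raise ValueError("ODDS digits must lie in [0, 15]")
    total = sum(odds_digits)
    return total % 16, total // 16


def column_value(low_nibble, carry_count):
    """Decimal value of the column after correction.

    Each carry contributes 6 to this column and 10 (one unit of the next
    decimal column) on top of the 4-bit result.
    """
    return low_nibble + 6 * carry_count + 10 * carry_count


column = [15] * 17                      # worst case for a 16x16-digit multiplier
low, carries = reduce_odds_column(column)
print(low, carries, column_value(low, carries))
```

The count of carries is therefore the quantity that the counter in step (2) has to deliver to the correction logic.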

slide-9
SLIDE 9

Decimal 3:2 CSA: Decimal PP Compression Based on BCD-4221/5211

[Figure: partial product (PP) compression with a decimal 3:2 CSA; signals a_{i,j}, b_{i,j}, c_{i,j}, s_{i,j} and h_{i,j}.]
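A behavioural sketch in Python may help to see why BCD-4221 suits carry-save addition. It assumes the usual BCD-4221/5211 arrangement: a bitwise 3:2 CSA on three BCD-4221 digits gives a sum digit S and a carry digit H with a + b + c = S + 2H, and the ×2 is done by recoding H to BCD-5211 and shifting left by one bit. The helper names are hypothetical, and the recoding is done here by value rather than by a gate-level recoder.

```python
W4221 = (4, 2, 2, 1)
W5211 = (5, 2, 1, 1)


def encode(value, weights):
    """Greedy encoding of a decimal digit 0..9 into a 4-bit weighted code."""
    if not 0 <= value <= 9:
        raise ValueError("decimal digit expected")
    bits = []
    for w in weights:
        bit = 1 if value >= w else 0
        bits.append(bit)
        value -= bit * w
    return tuple(bits)


def decode(bits, weights):
    return sum(b * w for b, w in zip(bits, weights))


def csa_3to2(a, b, c):
    """Bitwise full adders: a + b + c == S + 2*H, all read as BCD-4221."""
    s = tuple(x ^ y ^ z for x, y, z in zip(a, b, c))
    h = tuple((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
    return s, h


def double_4221(h_digits):
    """x2 on BCD-4221 digits (least significant first): recode to 5211, shift left.

    Shifting weights (5,2,1,1) left by one bit gives (10,4,2,2): the top bit
    moves into the weight-1 slot of the next digit, leaving a 4221 digit.
    """
    recoded = [encode(decode(h, W4221), W5211) for h in h_digits]
    doubled, carry_in = [], 0
    for b3, b2, b1, b0 in recoded:
        doubled.append((b2, b1, b0, carry_in))
        carry_in = b3
    doubled.append((0, 0, 0, carry_in))
    return doubled


def number(digits_4221):
    return sum(decode(d, W4221) * 10**i for i, d in enumerate(digits_4221))


a, b, c = (encode(v, W4221) for v in (7, 8, 9))
s, h = csa_3to2(a, b, c)
total = number([s]) + number(double_4221([h]))
print(total)
```

Every 4-bit pattern is a valid BCD-4221 digit (the maximum is 4 + 2 + 2 + 1 = 9), so the bitwise sum and carry never need a decimal correction.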

slide-10
SLIDE 10

Proposed Design

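[Figure: block diagram of the proposed multiplier: partial product generation, partial product compression (the new design of the PPR tree block, with outputs A and B) and the final decimal adder, which produces P.]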

slide-11
SLIDE 11

Proposed Design: A New PPR Tree

The proposed PPR (reduction) tree consists of:

  • A (d+1):2 binary CSA tree (ODDS to 4221)
  • A BCD-4221 sum correction block (decimal counter)
  • A decimal digit 3:2 compressor (BCD-4221)

slide-12
SLIDE 12

A PPR Tree for a 16×16-Digit Multiplier

[Figure: PPR tree for digit column i. A 17:2 binary PPR tree takes PPi[0] … PPi[16] together with the carries Ci-1[0] … Ci-1[13] from the previous column and emits the carries Ci[0] … Ci[13]. The BCD-4221 sum correction block uses an 8-bit counter and 3-bit counters (outputs ui,3, ui,2, ui,1, vi,1 and zi,1). A 6:2 decimal PPR tree with signals Ai, Bi, Hi (4221) and Si (4221) follows.]

The numbers of PP rows in the 1st, 2nd, 3rd and 4th stages are 17, 9, 6 and 4, respectively.

A 4-bit binary 4:2 compressor is used in the last compression stage.

slide-13
SLIDE 13

The Proposed PPR Tree

[Figure: BCD-4221 8-bit and 3-bit counter correction. In the BCD-4221 sum correction block, an 8-bit counter and 3-bit counters, built from 3:2 compressors and a half adder (HA), count the carries C[0] … C[13] and produce the correction bits u, v and z.]

The 8-bit BCD-4221 counter is faster than a binary counter (its delay is only two 3:2 CSA stages). 3-bit counters are used to generate a BCD-4221 decimal correction digit by using only one 3:2 compressor. This balances the paths in the decimal 6:2 PPR tree and reduces the critical path.

[Figure: 6:2 decimal PPR tree block, with signals Ai (4221), Bi (4*4221), Hi (4221) and Si (4221), built from 3:2 compressors and ×2 blocks.]

slide-14
SLIDE 14

Using Hybrid (Multiple) BCD Codes

[Figure: BCD codes used in each stage. BCD-8421 operands; PPG (XS-3); ODDS partial products; binary PPR tree and BCD-4221 sum correction block (4221, 2*4221 and 4*4221 rows); BCD-4221 PPR tree; adder setup (excess-6 and BCD-8421); decimal adder; BCD-8421 product.]

slide-15
SLIDE 15

Advantages of the Proposed PPR Tree

(1) A BCD-4221 counter is faster than a binary counter (an 8-bit counter has two 3:2 CSA stages, and a 3-bit counter has one 3:2 CSA stage).
(2) A non-fixed-size BCD-4221 counter correction block is used to balance the paths and reduce the critical path delay of the decimal 6:2 PPR tree.
(3) The final two PP rows are generated using a decimal PPR tree based on BCD-4221, which is easy to convert to BCD-8421.

slide-16
SLIDE 16

Evaluation

Area and Delay (LE-Based Model) for the Proposed 16×16-Digit Multiplier

Block            Delay (#FO4)   Area (#NAND2)
PPG Stage        10.2           14900
PPR Tree         25.3           14306
Adder Setup      3.2            1050
Decimal Adder    11.5           2400
Total            50.2           32656
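As a quick consistency check (an illustration added here, not part of the slides), the totals can be recomputed from the per-block figures. The dictionary layout is an assumption; the numbers are those listed for the four blocks.

```python
from decimal import Decimal

# (delay in FO4, area in NAND2 equivalents) per block, as listed in the table
blocks = {
    "PPG Stage": (Decimal("10.2"), 14900),
    "PPR Tree": (Decimal("25.3"), 14306),
    "Adder Setup": (Decimal("3.2"), 1050),
    "Decimal Adder": (Decimal("11.5"), 2400),
}

total_delay = sum(delay for delay, _ in blocks.values())
total_area = sum(area for _, area in blocks.values())
ppr_delay_share = blocks["PPR Tree"][0] / total_delay

print(total_delay, total_area, round(ppr_delay_share, 2))
```

The PPR tree accounts for about half of the total delay, which is why the proposed design concentrates on that block.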

slide-17
SLIDE 17

Evaluation

Area and Delay (LE-Based) Comparison for Different BCD Multiplier Designs

Design               Delay (#FO4)   Area (#NAND2)
Non-Redundant [13]   58.3           35750
Redundant [7]        51.4           30600
Proposed             50.2           32656

Proposed compared with [13]: delay -13.89%, area -8.65%
Proposed compared with [7]: delay -2.23%, area +6.05%

[7] A. Vazquez, E. Antelo, and J. Bruguera, “Fast Radix-10 Multiplication Using Redundant BCD Codes”, IEEE Transactions on Computers, vol. 63, no. 8, pp. 1902–1914, Aug. 2014.
[13] A. Vazquez, E. Antelo, and P. Montuschi, “Improved Design of High-Performance Parallel Decimal Multipliers”, IEEE Transactions on Computers, vol. 59, no. 5, pp. 679–693, May 2010.

slide-18
SLIDE 18

Evaluation

Area and Delay Comparison Using the NanGate 45 nm Open Cell Library

Design               Delay (ns)   Ratio   Area (μm²)   Ratio
Proposed             3.21         1       43053.5      1
Non-Redundant [13]   3.66         1.14    48326.1      1.12

The proposed design reduces the delay by 12.30% and the area by 10.9% compared with [13]. [7] reduces the delay by 10.75% and the area by 11.1% compared with [13] (no direct comparison is possible, as some parts of the PPR circuit of [7] are not provided in detail).
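The ratios and reductions quoted for the synthesis results follow directly from the two rows of the table. A small check in Python (an added illustration; the variable names are hypothetical):

```python
proposed_delay_ns, proposed_area_um2 = 3.21, 43053.5
ref13_delay_ns, ref13_area_um2 = 3.66, 48326.1

# Ratios are normalised to the proposed design.
delay_ratio = ref13_delay_ns / proposed_delay_ns
area_ratio = ref13_area_um2 / proposed_area_um2

# Reductions are expressed relative to the non-redundant design [13].
delay_reduction_pct = (ref13_delay_ns - proposed_delay_ns) / ref13_delay_ns * 100
area_reduction_pct = (ref13_area_um2 - proposed_area_um2) / ref13_area_um2 * 100

print(round(delay_ratio, 2), round(area_ratio, 2))
print(round(delay_reduction_pct, 1), round(area_reduction_pct, 1))
```

Note that the ratio columns use the proposed design as the base, while the percentage reductions use [13] as the base, so 1.14 and 12.30% describe the same delay gap.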

slide-19
SLIDE 19

Conclusion

  • The design of parallel decimal multipliers is studied.
  • A parallel decimal multiplier based on a new PPR tree is proposed, using:
      • a BCD-4221 sum correction block with non-fixed-size counters;
      • a decimal PPR tree based on the BCD-4221 decimal digit 3:2 compressor.
  • The proposed parallel decimal multiplier is faster than previous best designs.

slide-20
SLIDE 20

Thank you! Questions?