


  1. Efficient VLSI architectures for baseband signal processing in wireless base-station receivers
     Sridhar Rajagopal, Srikrishna Bhashyam, Joseph R. Cavallaro, and Behnaam Aazhang
     This work is supported by Nokia, TI, TATP and NSF.

  2. Motivation
     Computationally complex algorithms for base-stations:
     – multiple users, high data rates
     – matrix inversions; floating-point accuracy needed
     – DSP solutions infeasible for real-time operation [S. Das ’99]*
     Real-time implementations for the baseband receiver?
     – multiuser channel estimation
     * S. Das et al., “Arithmetic Acceleration Techniques for Wireless Base-station Receivers”, Asilomar 1999.

  3. Contributions
     New estimation scheme
     – designed from an implementation perspective
     – bit-streaming, fixed-point architecture
     – reduced complexity, same error rate performance
     Real-time architecture design
     – exploits bit-level parallelism
     – area-constrained and time-constrained designs
     – real time with minimum area

  4. Baseband signal processing
     [Figure: base-station receiver block diagram: signals from multiple users arrive at the antenna; multiuser channel estimation (training, tracking) feeds multiuser detection, followed by decoding into information bits.]

  5. Channel estimation
     [Figure: User 1 and User 2 reach the base station over direct and reflected paths, with noise and multiple-access interference (MAI).]
     Channel estimation estimates the unknown fading amplitudes and asynchronous delays.

  6. Need for multiuser channel estimation
     – Detector performance depends on the accuracy of the channel estimates
     – Best estimator: maximum likelihood => jointly estimate the parameters of all users => multiuser channel estimation
     – Single-user sliding correlator used for implementation

  7. Multiuser channel estimation algorithm
     $R_{bb} = \sum_{i=1}^{L} b_i b_i^T \in \mathbb{R}^{2K \times 2K}$
     $R_{br} = \sum_{i=1}^{L} b_i r_i^H \in \mathbb{C}^{2K \times N}$
     $A = R_{bb}^{-1} R_{br} \in \mathbb{C}^{2K \times N}$
     where
     – $b_i \in \{-1, 1\}^{2K}$: training/tracking bits
     – $r_i \in \mathbb{C}^{N}$: received signal
     – L: number of bits in the training/tracking window
     – N: spreading gain (typically fixed, e.g. 32)
     – K: number of users (variable, K ≤ N)
     – A: maximum-likelihood channel estimate
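The batch estimate above can be sketched in NumPy. This is a minimal floating-point sketch, assuming the L bit vectors $b_i$ are stacked as rows of an (L, 2K) array and the received vectors $r_i$ as rows of an (L, N) complex array; the function name `ml_channel_estimate` is hypothetical. It solves $R_{bb} A = R_{br}$ rather than forming $R_{bb}^{-1}$ explicitly, but it still pays for the matrix inversion that slide 2 flags as costly.

```python
import numpy as np


def ml_channel_estimate(bits, received):
    """Batch ML channel estimate A = R_bb^{-1} R_br (hypothetical helper).

    bits:     (L, 2K) array of +/-1 training bits, one b_i per row
    received: (L, N) complex array, one received vector r_i per row
    returns:  (2K, N) complex estimate A
    """
    b = np.asarray(bits, dtype=float)
    r = np.asarray(received, dtype=complex)
    r_bb = b.T @ b             # sum_i b_i b_i^T, shape (2K, 2K), real
    r_br = b.T @ r.conj()      # sum_i b_i r_i^H, shape (2K, N), complex
    # Solve R_bb A = R_br instead of forming the explicit inverse.
    return np.linalg.solve(r_bb, r_br)
```

With noise-free synthetic data generated as $r_i = A^H b_i$ for a known A (an illustrative model, not one taken from the slides), the function recovers A exactly, provided $R_{bb}$ is invertible, which requires L ≥ 2K.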

  8. Outline
     – Background
     – Channel estimation: an implementation perspective
     – VLSI architectures: area-constrained, time-constrained, area-time efficient
     – DSP comparisons and conclusions

  9. Iterative scheme for channel estimation
     $A^{(i)} = A^{(i-1)} - \mu \left( R_{bb}^{(i)} A^{(i-1)} - R_{br}^{(i)} \right)$
     $R_{bb}^{(i)} = R_{bb}^{(i-1)} + b_L b_L^T - b_0 b_0^T$
     $R_{br}^{(i)} = R_{br}^{(i-1)} + b_L r_L^H - b_0 r_0^H$
     – Bit-streaming, method of gradient descent
     – Stable convergence for a suitably chosen step size µ
     – Simple fixed-point architecture
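The sliding-window updates and gradient step above can be sketched as follows. This is a floating-point sketch, not the fixed-point datapath of the slides; the class `SlidingChannelEstimator`, its method names, and the zero initial estimate are assumptions for illustration. The window holds the L most recent $(b_i, r_i)$ pairs; each new bit adds the $b_L$ terms and removes the oldest $b_0$ terms, so both correlation matrices are updated without recomputing the full sums, and then one gradient step is taken on A.

```python
from collections import deque

import numpy as np


class SlidingChannelEstimator:
    """Floating-point sketch of the bit-streaming estimator (hypothetical API)."""

    def __init__(self, bits, received, mu):
        b = np.asarray(bits, dtype=float)          # (L, 2K) +/-1 bits
        r = np.asarray(received, dtype=complex)    # (L, N) received samples
        self.window = deque(zip(b, r))             # (b_i, r_i) pairs, oldest first
        self.r_bb = b.T @ b                        # sum_i b_i b_i^T over the window
        self.r_br = b.T @ r.conj()                 # sum_i b_i r_i^H over the window
        self.a = np.zeros_like(self.r_br)          # assumed initial estimate A^(0) = 0
        self.mu = mu

    def update(self, b_new, r_new):
        b_new = np.asarray(b_new, dtype=float)
        r_new = np.asarray(r_new, dtype=complex)
        b_old, r_old = self.window.popleft()       # b_0, r_0 leave the window
        self.window.append((b_new, r_new))         # b_L, r_L enter the window
        # R_bb^(i) = R_bb^(i-1) + b_L b_L^T - b_0 b_0^T
        self.r_bb += np.outer(b_new, b_new) - np.outer(b_old, b_old)
        # R_br^(i) = R_br^(i-1) + b_L r_L^H - b_0 r_0^H
        self.r_br += np.outer(b_new, r_new.conj()) - np.outer(b_old, r_old.conj())
        # A^(i) = A^(i-1) - mu (R_bb^(i) A^(i-1) - R_br^(i))
        self.a = self.a - self.mu * (self.r_bb @ self.a - self.r_br)
        return self.a
```

Gradient descent on this quadratic is stable for $0 < \mu < 2/\lambda_{max}(R_{bb})$. The diagonal of $R_{bb}$ is all L, so its eigenvalues average L; for a window much longer than 2K, an assumed step size such as µ = 1/(2L) stays well inside that bound.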

  10. Simulations: static multipath channel
     [Figure: bit error rate (BER, 10^-3 to 10^-1) versus signal-to-noise ratio (SNR, 4 to 12) for the iterative channel estimator, complexity $O(K^2N)$, and the original channel estimator, complexity $O(K^3 + K^2N)$; SINR = 0 dB, 3 paths, 150 training bits, spreading gain N = 31, K = 15 users.]

  11. Outline
     – Background
     – Channel estimation: an implementation perspective
     – VLSI architectures: area-constrained, time-constrained, area-time efficient
     – DSP comparisons and conclusions

  12. Design specifications
     – 32 users (K)
     – spreading code length 32 (N)
     – target data rate 128 Kbps: about 4000 cycles available per bit at 500 MHz (500 × 10^6 / 128 × 10^3 ≈ 3,906)
     – single-cycle addition/multiplication

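  13. Task decomposition (per bit, tracking window of L bits)
     – Inputs: bit vectors $b_0$, $b_L$ (2K × 1 bit each) and received samples $r_0$, $r_L$ (N × 8 bits each)
     – Correlation matrices: update $R_{br}$ in $O(2KN)$ and $R_{bb}$ in $O(2K^2)$ 8-bit operations
     – Iterate: update the channel estimate A in $O(4K^2N)$ 8-bit operations and pass it to the detector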
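Reading the O(·) labels of this decomposition with their leading constants gives a quick per-bit operation count. The helper `per_bit_operations` is hypothetical and simply evaluates those expressions; it shows that the A update dominates the work per bit.

```python
def per_bit_operations(K, N):
    """Per-bit operation counts, taking the slide's O(.) labels literally."""
    return {
        "R_bb update": 2 * K**2,      # O(2K^2): 1-bit products into counters
        "R_br update": 2 * K * N,     # O(2KN): 8-bit additions/subtractions
        "A update": 4 * K**2 * N,     # O(4K^2 N): 8-bit multiplications
    }


ops = per_bit_operations(K=32, N=32)
for task, count in ops.items():
    print(f"{task}: {count}")
```

For N = K = 32 this prints 2048 for each correlation-matrix update and 131072 for the A update; 131,072 = 128 × 1024 is consistent with the "128,000 cycles" quoted for the single-multiplier design.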

  14. Architecture design
     – $R_{bb}^{(i)} = R_{bb}^{(i-1)} + b_L b_L^T - b_0 b_0^T$: XNOR gates, UP/DOWN counters
     – $R_{br}^{(i)} = R_{br}^{(i-1)} + b_L r_L^H - b_0 r_0^H$: 8-bit adders
     – $A^{(i)} = A^{(i-1)} - \mu \left( R_{bb}^{(i)} A^{(i-1)} - R_{br}^{(i)} \right)$: 8-bit multipliers [Schulte ’93]*
     * Schulte, Swartzlander, “Truncated Multiplication with Correction Constant”, Workshop on VLSI Signal Processing, 1993.
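Because every training bit is ±1, a product of two bits is +1 exactly when the bits agree, which is what an XNOR gate detects, and multiplying a received sample by a bit only selects between adding and subtracting it. The sketch below checks this bit-level reformulation of the $R_{bb}$ and $R_{br}$ updates; the logic encoding (+1 → 1, −1 → 0) and all function names are assumptions for illustration, and `r_l`, `r_0` stand for one 8-bit real component of the conjugated received samples.

```python
def to_logic(bit):
    """Map a +/-1 training bit to a logic level (assumed encoding: +1 -> 1, -1 -> 0)."""
    return 1 if bit > 0 else 0


def xnor(x, y):
    """XNOR of two logic levels: 1 when they agree."""
    return 1 - (x ^ y)


def update_rbb_entry(count, bl_j, bl_k, b0_j, b0_k):
    """Up/down counter update: count + b_L[j] b_L[k] - b_0[j] b_0[k]."""
    # The product b_L[j] b_L[k] is +1 exactly when the XNOR output is 1.
    count = count + 1 if xnor(to_logic(bl_j), to_logic(bl_k)) else count - 1
    # Remove the contribution of the bit vector leaving the window.
    count = count - 1 if xnor(to_logic(b0_j), to_logic(b0_k)) else count + 1
    return count


def update_rbr_entry(acc, bl_j, r_l, b0_j, r_0):
    """Adder-only update: acc + b_L[j] r_l - b_0[j] r_0, no multiplier needed."""
    acc = acc + r_l if bl_j > 0 else acc - r_l
    acc = acc - r_0 if b0_j > 0 else acc + r_0
    return acc
```

This is the reason the correlation updates need no multipliers at all: only the A update, which multiplies 8-bit values by 8-bit values, uses them.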

  15. Area-constrained: minimum area, not real time
     [Figure: area-constrained datapath implementing the $R_{bb}$, $R_{br}$ and A updates of slide 14 with one up/down counter, a single 8-bit multiply-accumulate (MAC) unit, add/subtract units, subtractors, a right shift (>>), MUX/DEMUX and load/store logic; inputs $b_0$, $b_L$, $r_0$, $r_L$.]

  16. Area-constrained: hardware used
     – Counter: 1 × 8-bit; 8 FA cells; total 8
     – Multiplier: 1 × 8-bit; 64 FA cells, ×2 for complex; total 128
     – Adders: 3 × 8-bit + 2 × 16-bit; 56 FA cells, ×2 for complex; total 112
     – Total area: 248 FA cells
     – Total time: $4K^2N$ ≈ 128,000 cycles (N = K = 32)

  17. Time-constrained: real time, large area
     [Figure: fully parallel datapath for the same $R_{bb}$, $R_{br}$ and A updates: K(2K−1) 1-bit products $b_L b_L^T$ and $b_0 b_0^T$ for $R_{bb}$, 2KN-wide 8-bit and 16-bit add/subtract stages for $R_{br}$, and MUX units, multipliers, subtractors and a right shift (>>) producing the channel estimate A; inputs $b_0$, $b_L$ (2K × 1 bit) and $r_0$, $r_L$ (N × 8 bits).]

  18. Time-constrained: hardware used
     – Counters: $2K^2$ × 8-bit; $16K^2$ FA cells; total $16K^2$
     – Multipliers: $4K^2N$ × 8-bit; $256K^2N$ FA cells, ×2 for complex; total $512K^2N$
     – Adders: $2KN$ × 16-bit + $2KN$ × 8-bit + $4K^2N$ × 16-bit; $48KN + 64K^2N$ FA cells, ×2 for complex; total $96KN + 128K^2N$
     – Total area: ≈ 20,000,000 FA cells (N = K = 32)
     – Total time: $\log_2(2K)$ = 6 cycles
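The $\log_2(2K)$ total time can be related to a binary adder tree: each entry of $R_{bb}^{(i)} A^{(i-1)}$ is a sum of 2K products, and if all products are formed in parallel, a tree that adds disjoint pairs at every level needs $\lceil \log_2(2K) \rceil$ levels. This reading of the cycle count is an assumption, and `adder_tree_sum` is a hypothetical helper that only models the tree depth, not the exact cycle accounting of the slides.

```python
def adder_tree_sum(values):
    """Sum a non-empty sequence with a binary adder tree; return (total, levels).

    Each level adds disjoint pairs in parallel, so the number of levels is
    ceil(log2(len(values))).
    """
    level_values = list(values)
    levels = 0
    while len(level_values) > 1:
        paired = [level_values[i] + level_values[i + 1]
                  for i in range(0, len(level_values) - 1, 2)]
        if len(level_values) % 2:          # an odd element passes through unchanged
            paired.append(level_values[-1])
        level_values = paired
        levels += 1
    return level_values[0], levels
```

For K = 32, summing 2K = 64 products takes 6 levels, matching the 6 cycles in the table above.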

  19. Area-time efficient architecture design
     Area-constrained
     – single 8-bit multiplier
     – $4K^2N$ cycles (128,000) [3.81 Kbps, 248 FA cells]
     Time-constrained
     – $4K^2N$ 8-bit multipliers
     – $\log_2(2K)$ cycles (6) [83.33 Mbps, 20,000,000 FA cells]
     Goal: real time with minimum area
     – different parallelism levels for the multipliers

  20. Area-time efficient: real time, minimum area
     [Figure: area-time efficient datapath for the same $R_{bb}$, $R_{br}$ and A updates: 2K up/down counters for $R_{bb}$, 2K 8-bit multipliers, adders and subtractors, a right shift (>>), MUX/DEMUX units and load/store memory for $R_{br}$; inputs $b_0$, $b_L$ (2K × 1 bit) and $r_0$, $r_L$ (N × 8 bits).]

  21. Area-time efficient: hardware used
     – Counters: $2K$ × 8-bit; $16K$ FA cells; total $16K$
     – Multipliers: $2K$ × 8-bit; $128K$ FA cells, ×2 for complex; total $256K$
     – Adders: $2K$ × 16-bit + 2 × 8-bit + 1 × 16-bit; $32K + 32$ FA cells, ×2 for complex; total $64K + 64$
     – Total area: ≈ 10,000 FA cells (N = K = 32)
     – Total time: $2KN$ ≈ 2,000 cycles
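The three hardware tables and the cycle counts on slide 19 can be cross-checked with a small model. The sketch transcribes the table entries (64 FA cells per 8-bit multiplier, a factor 2 for complex data) and compares cycles per bit with the budget implied by the design specification (500 MHz / 128 Kbps); the function names and the `arch` labels are hypothetical.

```python
import math

CLOCK_HZ = 500e6      # 500 MHz clock (design specification)
TARGET_BPS = 128e3    # 128 Kbps target data rate (design specification)


def fa_cells(arch, K, N):
    """Full-adder cell count transcribed from the hardware tables."""
    if arch == "area":
        counter = 1 * 8
        mult = 1 * 64 * 2                                  # one 8-bit multiplier, x2 complex
        adders = (3 * 8 + 2 * 16) * 2
    elif arch == "time":
        counter = 2 * K**2 * 8
        mult = 4 * K**2 * N * 64 * 2                       # 4K^2 N multipliers
        adders = (2 * K * N * 16 + 2 * K * N * 8 + 4 * K**2 * N * 16) * 2
    elif arch == "area-time":
        counter = 2 * K * 8
        mult = 2 * K * 64 * 2                              # 2K multipliers
        adders = (2 * K * 16 + 2 * 8 + 1 * 16) * 2
    else:
        raise ValueError(f"unknown architecture: {arch}")
    return counter + mult + adders


def cycles_per_bit(arch, K, N):
    """Cycle counts per bit as listed for each architecture."""
    return {"area": 4 * K**2 * N,
            "time": math.ceil(math.log2(2 * K)),
            "area-time": 2 * K * N}[arch]


budget = CLOCK_HZ / TARGET_BPS     # about 3906 cycles per bit
for arch in ("area", "time", "area-time"):
    cycles = cycles_per_bit(arch, 32, 32)
    print(arch, fa_cells(arch, 32, 32), cycles, cycles <= budget)
```

For N = K = 32 the model gives 248, 21,086,208 and 10,816 FA cells and 131,072, 6 and 2,048 cycles per bit; the round figures in the tables (128,000 cycles, 20,000,000 and 10,000 FA cells, 2,000 cycles) appear to be rounded from these. Only the area-constrained design exceeds the roughly 3,906-cycle budget.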

  22. Outline
     – Background
     – Channel estimation: an implementation perspective
     – VLSI architectures: area-constrained, time-constrained, area-time efficient
     – DSP comparisons and conclusions

  23. DSP comparisons
     DSPs:
     – cannot exploit bit-level parallelism
     – store bits inefficiently
     – cannot replace bit multiplications by additions/subtractions
     Implementation: clock rate; full adder cells; data rate
     – C67 DSP: 166 MHz; –; 1.02 Kbps
     – Area-constrained: 500 MHz; 248; 3.81 Kbps
     – Area-time efficient: 500 MHz; ~10^4; 256 Kbps
     – Time-constrained: 500 MHz; ~2 × 10^7; 83.33 Mbps

  24. Scalability of architectures
     Design for the maximum number of users in the system.
     With fewer users:
     – turn off functional units to reduce power
     – reconfigure the hardware for higher data rates (FPGA)
     Investigating a K-user design built from K/2-user designs.
     Investigating DSP extensions.

  25. Conclusions
     New estimation scheme
     – designed from an implementation perspective
     – bit-streaming, fixed-point architecture
     – reduced complexity, same error rate performance
     Real-time architecture designs
     – exploit bit-level parallelism
     – area-constrained and time-constrained designs
     – real time with minimum area
     => Real-time architectures for baseband signal processing
