Computer Generation of Efficient Software Viterbi Decoders

  1. Carnegie Mellon. Computer Generation of Efficient Software Viterbi Decoders. Frédéric de Mesmay, Srinivas Chellappa, Franz Franchetti, Markus Püschel. Electrical and Computer Engineering, Carnegie Mellon University. Co-Founder SpiralGen, Inc. Sponsors: DARPA DESA program, ONR, NSF-NGS/ITR, NSF-ACR, Mercury, and Intel

  2. Viterbi Decoder. Error correction: forward error correction; digital cellular (CDMA, GSM), modems, satellite/deep space communications, 802.11 wireless LANs; software defined radio (SDR). Pattern recognition: speech recognition, text recognition, computational linguistics, bioinformatics. Example codes: GSM (TCH/FS), K=5, rate=1/2; CDMA2000/UMTS/IS-95, K=9, rate=1/3; NASA Cassini Orbiter, K=15, rate=1/6. SDR requires efficient Viterbi decoder software implementations
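
To make the parameters on slide 2 concrete: the constraint length K is the number of bits held in the encoder's shift register, and rate 1/2 means two output bits per input bit. The following is a minimal sketch of such an encoder, not the authors' code. The names `parity` and `conv_encode`, the bit-per-byte layout, and the K=3 generator polynomials 7 and 5 (octal) used in the usage note are illustrative assumptions and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR of all bits of x (1 if an odd number of bits are set). */
static unsigned parity(unsigned x) {
    unsigned p = 0;
    while (x) {
        p ^= x & 1u;
        x >>= 1;
    }
    return p;
}

/*
 * Rate-1/2 convolutional encoder with constraint length k (hypothetical helper).
 * in:  n input bits, one bit per byte.
 * out: 2*n output bits, one bit per byte.
 * poly0, poly1: generator polynomials given as k-bit masks.
 */
static void conv_encode(const uint8_t *in, size_t n, unsigned k,
                        unsigned poly0, unsigned poly1, uint8_t *out) {
    assert(k >= 2 && k <= 16);
    unsigned mask = (1u << k) - 1u;
    unsigned sr = 0; /* shift register, newest bit in the least significant position */
    for (size_t i = 0; i < n; i++) {
        sr = ((sr << 1) | (in[i] & 1u)) & mask;
        out[2 * i]     = (uint8_t)parity(sr & poly0);
        out[2 * i + 1] = (uint8_t)parity(sr & poly1);
    }
}
```

Calling `conv_encode(in, 4, 3, 7u, 5u, out)` on the input bits 1 0 1 1 yields the output pairs 11 10 00 01. A decoder for constraint length K has to track 2^(K-1) states per received symbol group, which is why large K (such as K=15) is expensive in software.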

  3. Software Defined Radio. WiFi transmitter on Intel Atom dual-core. [Plot: run time per OFDM symbol [μs] (axis 0 to 30) vs. data rate [Mbit/s] (6 to 54). Curves: straightforward C code, but minimizing op count; best standard C code; Spiral: computer generated (parallelism: 2 threads, 4-16 way SIMD). Annotations: 8x, 6.3x, realtime.] Compilers fail to optimize: 50x

  4. Spiral: Viterbi Software Generation. “Click”: Push-button code generation http://www.spiral.net/software/viterbi.html

  5. Spiral: Generated SSE Viterbi Code

      void viterbi_ccsds(unsigned char *Y, unsigned char *X, unsigned char *syms,
                         unsigned char *dec, unsigned char *Branchtab) {
        for(int i9 = 0; i9 <= 1026; i9++) {
          unsigned char a75, a81;
          int a73, a92;
          ...
          a71 = ((__m128i *) X);
          s18 = *(a71);
          a72 = (a71 + 2);
          s19 = *(a72);
          a73 = (4 * i9);
          a74 = (syms + a73);
          a75 = *(a74);
          a76 = _mm_set1_epi8(a75);
          a77 = ((__m128i *) Branchtab);
          a78 = *(a77);
          a79 = _mm_xor_si128(a76, a78);
          b6 = (a73 + syms);
          a80 = (b6 + 1);
          a81 = *(a80);
          a82 = _mm_set1_epi8(a81);
          a83 = (a77 + 2);
          a84 = *(a83);
          a85 = _mm_xor_si128(a82, a84);
          t13 = _mm_avg_epu8(a79, a85);
          a86 = ((__m128i ) t13);
          a87 = _mm_srli_epi16(a86, 2);
          a88 = ((__m128i ) a87);
          t14 = _mm_and_si128(a88, _mm_set_epi8(63, 63, 63, 63, 63, 63, 63, 63,
                                                63, 63, 63, 63, 63, 63, 63, 63));
          t15 = _mm_subs_epu8(_mm_set_epi8(63, 63, 63, 63, 63, 63, 63, 63,
                                           63, 63, 63, 63, 63, 63, 63, 63), t14);
          m23 = _mm_adds_epu8(s18, t14);
          m24 = _mm_adds_epu8(s19, t15);
          m25 = _mm_adds_epu8(s18, t15);
          m26 = _mm_adds_epu8(s19, t14);
          a89 = _mm_min_epu8(m24, m23);
          ...
        }
        ...
      }

     “Click”: Push-button code generation http://www.spiral.net/software/viterbi.html
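
The generated code on slide 5 is hard to read because each statement works on 16 bytes at once. As a reading aid, here is a scalar sketch of what one byte lane of the visible statements appears to compute: a branch metric from two XORs, a rounded average and a shift, followed by saturating adds and a minimum (add-compare-select). All function and type names are hypothetical. The second survivor, `min(m25, m26)`, is an assumption, since that part of the slide is elided.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar analogue of _mm_adds_epu8: unsigned add that saturates at 255. */
static uint8_t adds_u8(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255u ? 255u : s);
}

/* Scalar analogue of _mm_avg_epu8: average rounded up. */
static uint8_t avg_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}

static uint8_t min_u8(uint8_t a, uint8_t b) { return a < b ? a : b; }

/* Branch metric in 0..63 from two received symbols and two table entries. */
static uint8_t branch_metric(uint8_t sym0, uint8_t sym1,
                             uint8_t exp0, uint8_t exp1) {
    uint8_t avg = avg_u8((uint8_t)(sym0 ^ exp0), (uint8_t)(sym1 ^ exp1));
    /* In the SIMD code the AND with 63 clears bits that the 16-bit shift
       pulls in from the neighbouring byte; on a single byte it is a no-op. */
    return (uint8_t)((avg >> 2) & 63u);
}

typedef struct {
    uint8_t survivor0; /* min(m24, m23), as on the slide */
    uint8_t survivor1; /* min(m25, m26), assumed: elided on the slide */
} acs_out;

/* Add-compare-select butterfly on two old path metrics s_a and s_b. */
static acs_out acs_butterfly(uint8_t s_a, uint8_t s_b, uint8_t bm) {
    assert(bm <= 63);
    uint8_t bm_c = (uint8_t)(63u - bm); /* complementary metric (t15) */
    uint8_t m23 = adds_u8(s_a, bm);
    uint8_t m24 = adds_u8(s_b, bm_c);
    uint8_t m25 = adds_u8(s_a, bm_c);
    uint8_t m26 = adds_u8(s_b, bm);
    acs_out r;
    r.survivor0 = min_u8(m24, m23);
    r.survivor1 = min_u8(m25, m26);
    return r;
}
```

For example, `acs_butterfly(10, 20, 5)` computes the four candidates 15, 78, 68 and 25 and keeps 15 and 25. A complete decoder would also record which candidate won in each lane so that the traceback can recover the decoded bits; that step is not shown in the excerpt and is omitted here.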

  6. Organization: Spiral; Generating software Viterbi decoders; Performance results; Summary

  7. Organization: Spiral; Generating software Viterbi decoders; Performance results; Summary

  8. Automatic Performance Tuning. Current vicious circle: whenever a new platform comes out, the same functionality needs to be rewritten and reoptimized. Automatic performance tuning: BLAS: ATLAS, PHiPAC; linear algebra: Sparsity/OSKI, Flame; sorting; Fourier transform: FFTW; linear transforms (and Viterbi): Spiral; … others. New problem class: software Viterbi decoders. Proceedings of the IEEE special issue, Feb. 2005

  9. What is Spiral? Traditionally: high performance library optimized for given platform. Spiral approach: Spiral → high performance library optimized for given platform. Comparable performance

  10. Idea: Common Abstraction and Rewriting. Model: common abstraction = spaces of matching formulas = domain-specific language. [Diagram: architecture space (parameter ν) and algorithm space (parameter μ), each mapped by abstraction into the common model; rewriting defines, search picks; optimization.] Architectural parameter: vector length, #processors, … Kernel: problem size, algorithm choice

  11. Program Generation in Spiral. Flow: problem specification (transform) → Algorithm Generation, Algorithm Optimization → algorithm → Implementation, Code Optimization → C code → Compilation, Compiler Optimizations → fast executable; Search uses measured performance and controls the generation stages. Spiral: complete automation of the implementation and optimization task. Basic ideas: declarative representation of algorithms; rewriting systems to generate and optimize algorithms at a high level of abstraction. Reference: Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005

  12. Some Kernels as Operator Formulas. Linear Transforms. Viterbi Decoding [Figure: convolutional encoder and Viterbi decoder; bit strings shown: 11 10 01 01 10 10 11 00, 010001, 11 10 00 01 10 01 11 00, 010001]. Matrix-Matrix Multiplication. Synthetic Aperture Radar (SAR): preprocessing, matched filtering, interpolation, 2D iFFT

  13. Same Approach for Different Paradigms: threading, vectorization, GPUs, Verilog for FPGAs

  14. Organization: Spiral; Generating software Viterbi decoders; Performance results; Summary

  15. Structure of Viterbi Decoders. [Figure: state machine with states 00, 01, 10, 11 and transitions labeled input/output (0/00, 1/11, 0/10, 1/01, 0/11, 1/00, 0/01, 1/10); Viterbi trellis (data flow) showing states across stages.] Key observation: similarity to Walsh-Hadamard transform (WHT)
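
The similarity noted on slide 15 can be illustrated with a small sketch: one stage of a constant-geometry WHT and one trellis stage read and write memory in the same butterfly pattern, and only the arithmetic differs (add and subtract versus add-compare-select). This is an illustration under stated assumptions, not the VL formulation from the talk. The function names, the state numbering (old states 2j and 2j+1 feed new states j and j+n/2), and the complementary branch metric `bm_max - bm[j]` (modeled on the generated code on slide 5) are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* One constant-geometry WHT stage on n values (n even).
   Applying it log2(n) times computes the WHT of size n. */
static void wht_stage(const int *in, int *out, size_t n) {
    assert(n % 2 == 0);
    size_t h = n / 2;
    for (size_t j = 0; j < h; j++) {
        out[j]     = in[2 * j] + in[2 * j + 1];
        out[j + h] = in[2 * j] - in[2 * j + 1];
    }
}

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/* One trellis stage on n path metrics (n even): same reads and writes as
   wht_stage, with (+, -) replaced by add-compare-select.
   bm holds one branch metric per butterfly, each in 0..bm_max. */
static void viterbi_stage(const unsigned *old_pm, unsigned *new_pm,
                          const unsigned *bm, unsigned bm_max, size_t n) {
    assert(n % 2 == 0);
    size_t h = n / 2;
    for (size_t j = 0; j < h; j++) {
        assert(bm[j] <= bm_max);
        unsigned a = old_pm[2 * j];
        unsigned b = old_pm[2 * j + 1];
        unsigned m = bm[j];
        unsigned mc = bm_max - bm[j]; /* metric of the complementary branch */
        new_pm[j]     = min_u(a + m, b + mc);
        new_pm[j + h] = min_u(a + mc, b + m);
    }
}
```

Applying `wht_stage` twice to {1, 2, 3, 4} gives {10, -2, -4, 0}, the size-4 WHT of the input. Because the data flow is shared, the structural transformations that Spiral already uses for transforms, such as vectorization by rewriting, can plausibly carry over to the decoder's forward pass.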

  16. Viterbi Language (VL): VL in Backus-Naur Form (BNF); Viterbi decoder forward pass in VL

  17. Compiling VL To Code

  18. Vectorization Through Rewriting: Vectorization Rule Set; Vectorized Viterbi Decoder

  19. VL Compilation System. [Diagram components: VL expression; VL compiler; scalar decoder; execution; metric spread; overflow factors; target architecture; vectorization by rewriting; VL compiler; peephole optimization; vectorized decoder]

  20. Organization: Spiral; Generating Software Viterbi Decoders; Performance results; Summary

  21. Comparison to Hand-Tuned Code. Karn’s implementation: hand-written assembly for 4 specific Viterbi codes. [Plot legend: Spiral 16-way, 8-way, 4-way, scalar; Karn 16-way, 8-way, 4-way, scalar.] Single core of Core2 Extreme (quad-core), 3 GHz, Intel C++ compiler 10.0

  22. Vectorization Speed-Up. Single core of Core2 Extreme (quad-core), 3 GHz, Intel C++ compiler 10.0

  23. Data Rate Results. [Plot: decoders for rate 1/4; performance (kbit/s), log scale from 1 to 100,000, vs. constraint length K from 6 to 16; curves: 16-way, 8-way, 4-way, scalar.] Single core of Core2 Extreme (quad-core), 3 GHz, Intel C++ compiler 10.0

  24. Organization: Spiral; Generating Software Viterbi Decoders; Performance results; Summary

  25. Summary. Platforms are powerful yet complicated; optimization will stay a hard problem (image: Intel). Automatic generation of Viterbi decoder from high-level specification. Spiral: program generation and autotuning can provide full automation (diagram labels: architecture M(ν), kernel A(μ)). Performance of Spiral’s Viterbi decoders is competitive with expert hand tuning

  26. (Part of the) Spiral Team www.spiral.net
