Computer Generation of Efficient Software Viterbi Decoders

Carnegie Mellon
Computer Generation of Efficient Software Viterbi Decoders
Frédéric de Mesmay, Srinivas Chellappa, Franz Franchetti, Markus Püschel
Electrical and Computer Engineering, Carnegie Mellon University
Co-Founder SpiralGen, Inc.
Sponsors: DARPA DESA program, ONR, NSF-NGS/ITR, NSF-ACR, Mercury, and Intel

Viterbi Decoder
- Error correction
  - Forward Error Correction
  - Digital cellular (CDMA, GSM), modems, satellite/deep space communications, 802.11 wireless LANs
  - Software defined radio (SDR)
- Pattern recognition
  - Speech recognition
  - Text recognition
  - Computational linguistics
  - Bioinformatics
GSM (TCH/FS): K=5, rate=1/2
CDMA2000/UMTS/IS-95: K=9, rate=1/3
NASA Cassini Orbiter: K=15, rate=1/6
SDR requires efficient Viterbi decoder software implementations

Software Defined Radio
WiFi transmitter on Intel Atom dual-core
[Plot: run time per OFDM symbol [μs] (0 to 30) vs. data rate [Mbit/s] (6 to 54). Curves: straightforward C code (but minimizing op count); best standard C code; Spiral: computer generated (parallelism: 2 threads, 4-16 way SIMD). Plot annotations: 8x, 6.3x realtime.]
Compilers fail to optimize: 50x

Spiral: Viterbi Software Generation
“Click”: Push-button code generation
http://www.spiral.net/software/viterbi.html

Spiral: Generated SSE Viterbi Code

    void viterbi_ccsds(unsigned char *Y, unsigned char *X, unsigned char *syms,
                       unsigned char *dec, unsigned char *Branchtab) {
        for(int i9 = 0; i9 <= 1026; i9++) {
            unsigned char a75, a81;
            int a73, a92;
            ...
            a71 = ((__m128i *) X);
            s18 = *(a71);
            a72 = (a71 + 2);
            s19 = *(a72);
            a73 = (4 * i9);
            a74 = (syms + a73);
            a75 = *(a74);
            a76 = _mm_set1_epi8(a75);
            a77 = ((__m128i *) Branchtab);
            a78 = *(a77);
            a79 = _mm_xor_si128(a76, a78);
            b6 = (a73 + syms);
            a80 = (b6 + 1);
            a81 = *(a80);
            a82 = _mm_set1_epi8(a81);
            a83 = (a77 + 2);
            a84 = *(a83);
            a85 = _mm_xor_si128(a82, a84);
            t13 = _mm_avg_epu8(a79, a85);
            a86 = ((__m128i ) t13);
            a87 = _mm_srli_epi16(a86, 2);
            a88 = ((__m128i ) a87);
            t14 = _mm_and_si128(a88, _mm_set_epi8(63, 63, 63, 63, 63, 63, 63, 63,
                                                  63, 63, 63, 63, 63, 63, 63, 63));
            t15 = _mm_subs_epu8(_mm_set_epi8(63, 63, 63, 63, 63, 63, 63, 63,
                                             63, 63, 63, 63, 63, 63, 63, 63), t14);
            m23 = _mm_adds_epu8(s18, t14);
            m24 = _mm_adds_epu8(s19, t15);
            m25 = _mm_adds_epu8(s18, t15);
            m26 = _mm_adds_epu8(s19, t14);
            a89 = _mm_min_epu8(m24, m23);
            ...
        }
        ...
    }
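To make the visible intrinsics easier to follow, here is a scalar sketch of what one byte lane of that excerpt appears to compute: a branch metric derived from two received symbols, followed by an add-compare-select (ACS) step with saturating adds. The names sat_add_u8, branch_metric, acs_butterfly and acs_result are hypothetical and not part of the generated code. The second minimum and the decision bits are assumptions, since the excerpt elides everything after a89.

```c
#include <assert.h>
#include <stdint.h>

/* Per-byte behavior of _mm_adds_epu8: unsigned add that saturates at 255. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Branch metric for one lane, following the visible intrinsics:
   XOR each received symbol with its Branchtab entry, take the rounded
   average (_mm_avg_epu8), then shift right by 2 and mask to 6 bits
   (_mm_srli_epi16 followed by the AND with 63). */
static uint8_t branch_metric(uint8_t sym0, uint8_t sym1,
                             uint8_t bt0, uint8_t bt1) {
    unsigned x0 = (unsigned)(sym0 ^ bt0);
    unsigned x1 = (unsigned)(sym1 ^ bt1);
    unsigned avg = (x0 + x1 + 1u) >> 1;
    return (uint8_t)((avg >> 2) & 63u);
}

/* Hypothetical result type for one butterfly. */
typedef struct {
    uint8_t metric0, metric1;     /* surviving path metrics */
    uint8_t decision0, decision1; /* 1 if the path from s_b survived */
} acs_result;

/* One ACS butterfly. s_a and s_b play the roles of s18 and s19,
   bm the role of t14. */
static acs_result acs_butterfly(uint8_t s_a, uint8_t s_b, uint8_t bm) {
    assert(bm <= 63);
    uint8_t bm_c = (uint8_t)(63 - bm);  /* t15 = 63 - t14 */
    uint8_t m23 = sat_add_u8(s_a, bm);
    uint8_t m24 = sat_add_u8(s_b, bm_c);
    uint8_t m25 = sat_add_u8(s_a, bm_c);
    uint8_t m26 = sat_add_u8(s_b, bm);
    acs_result r;
    r.metric0 = (m24 < m23) ? m24 : m23;  /* a89 = min(m24, m23) */
    r.decision0 = (uint8_t)(m24 < m23);   /* assumption: elided in the excerpt */
    r.metric1 = (m26 < m25) ? m26 : m25;  /* assumption: elided in the excerpt */
    r.decision1 = (uint8_t)(m26 < m25);   /* assumption: elided in the excerpt */
    return r;
}
```

The SSE version applies the same arithmetic to 16 byte lanes at once, which is why it relies on 8-bit saturating operations rather than wider integers.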

Organization
- Spiral
- Generating software Viterbi decoders
- Performance results
- Summary

Automatic Performance Tuning
- Current vicious circle: whenever a new platform comes out, the same functionality needs to be rewritten and reoptimized
- Automatic performance tuning
  - BLAS: ATLAS, PHiPAC
  - Linear algebra: Sparsity/OSKI, Flame
  - Sorting
  - Fourier transform: FFTW
  - Linear transforms (and Viterbi): Spiral
  - … others
- New problem class: software Viterbi decoders
Proceedings of the IEEE special issue, Feb. 2005

What is Spiral?
- Traditionally: high performance library optimized for given platform
- Spiral approach: Spiral produces a high performance library optimized for given platform
- Comparable performance

Idea: Common Abstraction and Rewriting
- Model: common abstraction = spaces of matching formulas = domain-specific language
- Architectural parameter: vector length, #processors, …
- Kernel: problem size, algorithm choice
[Diagram labels: architecture space (ν), algorithm space (μ), abstraction, defines, rewriting, search, pick, optimization]

Program Generation in Spiral
Spiral: complete automation of the implementation and optimization task
Basic ideas:
- Declarative representation of algorithms
- Rewriting systems to generate and optimize algorithms at a high level of abstraction
[Diagram: Problem specification (transform) -> Algorithm Generation, Algorithm Optimization -> algorithm -> Implementation, Code Optimization -> C code -> Compilation, Compiler Optimizations -> fast executable. A Search block takes performance as input and controls the algorithm and implementation levels.]
Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005

Some Kernels as Operator Formulas
- Linear transforms
- Viterbi decoding: convolutional encoder and Viterbi decoder. Bit strings shown in the diagram: 010001 (encoder input and decoder output); 11 10 00 01 10 01 11 00 and 11 10 01 01 10 10 11 00 (coded sequences between encoder and decoder)
- Matrix-matrix multiplication
- Synthetic Aperture Radar (SAR): preprocessing, matched filtering, interpolation, 2D iFFT

Same Approach for Different Paradigms
- Threading
- Vectorization
- GPUs
- Verilog for FPGAs

Organization
- Spiral
- Generating software Viterbi decoders
- Performance results
- Summary

Structure of Viterbi Decoders
- State machine: states 00, 01, 10, 11; transition labels 0/00, 1/11, 0/10, 1/01, 0/11, 1/00, 0/01, 1/10
- Viterbi trellis (data flow): axes are states and stages
- Key observation: similarity to Walsh-Hadamard transform (WHT)
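The state machine and trellis can be made concrete with a small scalar sketch. It assumes the common K=3, rate-1/2 code with generators 7 and 5 (octal), whose eight input/output labels match the set shown in the state machine; the slide itself does not name the code, and the assignment of labels to edges follows this assumed code. It also assumes hard-decision (Hamming) metrics and a zero tail. All function names and MAX_STAGES are hypothetical. This is a reference-style illustration, not the code Spiral generates.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

enum { K = 3, NSTATES = 1 << (K - 1), MAX_STAGES = 64 };

/* A state holds the previous two input bits, newest in bit 1.
   Returns the 2-bit output (g0 in bit 1, g1 in bit 0) for input u. */
static unsigned conv_output(unsigned s, unsigned u) {
    unsigned s1 = (s >> 1) & 1u, s0 = s & 1u;
    unsigned g0 = u ^ s1 ^ s0; /* generator 7 (binary 111) */
    unsigned g1 = u ^ s0;      /* generator 5 (binary 101) */
    return (g0 << 1) | g1;
}

static unsigned next_state(unsigned s, unsigned u) {
    return (u << 1) | (s >> 1);
}

/* Encodes n bits plus K-1 zero tail bits; writes n + K - 1 symbols. */
static void conv_encode(const unsigned char *bits, size_t n,
                        unsigned char *syms) {
    unsigned s = 0;
    for (size_t t = 0; t < n + K - 1; t++) {
        unsigned u = (t < n) ? (bits[t] & 1u) : 0u;
        syms[t] = (unsigned char)conv_output(s, u);
        s = next_state(s, u);
    }
}

/* Hamming distance between two 2-bit symbols. */
static unsigned hamming2(unsigned a, unsigned b) {
    unsigned x = (a ^ b) & 3u;
    return (x & 1u) + (x >> 1);
}

/* Forward pass (add-compare-select per stage), then traceback from
   state 0. syms holds n + K - 1 symbols; bits receives n decoded bits. */
static void viterbi_decode(const unsigned char *syms, size_t n,
                           unsigned char *bits) {
    const unsigned INF = UINT_MAX / 2;
    size_t stages = n + K - 1;
    unsigned metric[NSTATES] = { 0, INF, INF, INF };
    unsigned char prev[MAX_STAGES][NSTATES];
    assert(stages <= MAX_STAGES);
    memset(prev, 0, sizeof prev);

    for (size_t t = 0; t < stages; t++) {
        unsigned next[NSTATES] = { INF, INF, INF, INF };
        for (unsigned s = 0; s < NSTATES; s++) {
            if (metric[s] >= INF) continue; /* state not reachable yet */
            for (unsigned u = 0; u < 2; u++) {
                unsigned ns = next_state(s, u);
                unsigned m = metric[s] + hamming2(conv_output(s, u), syms[t]);
                if (m < next[ns]) {
                    next[ns] = m;
                    prev[t][ns] = (unsigned char)s;
                }
            }
        }
        memcpy(metric, next, sizeof metric);
    }

    /* The newest bit of a state is the input bit that led into it. */
    unsigned s = 0;
    for (size_t t = stages; t-- > 0; ) {
        if (t < n) bits[t] = (unsigned char)(s >> 1);
        s = prev[t][s];
    }
}
```

Each stage updates every state from exactly two predecessors. That regular two-input, two-output exchange between stages is the butterfly pattern behind the similarity to the WHT noted above.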

Viterbi Language (VL)
- VL in Backus-Naur Form (BNF)
- Viterbi decoder forward pass in VL

Compiling VL To Code

Vectorization Through Rewriting
- Vectorization rule set
- Vectorized Viterbi decoder

VL Compilation System
[Diagram: VL expression -> VL compiler -> scalar decoder -> execution -> metric spread, overflow factors -> vectorization by rewriting (input: target architecture) -> VL compiler -> peephole optimization -> vectorized decoder]

Organization
- Spiral
- Generating Software Viterbi Decoders
- Performance results
- Summary

Comparison to Hand-Tuned Code
Karn’s implementation: hand-written assembly for 4 specific Viterbi codes
[Plot legend: Spiral (16-way, 8-way, 4-way, scalar); Karn (16-way, 8-way, 4-way, scalar)]
Single core of Core2 Extreme (quad-core), 3 GHz, Intel C++ compiler 10.0

Vectorization Speed-Up
Single core of Core2 Extreme (quad-core), 3 GHz, Intel C++ compiler 10.0

Data Rate Results
Decoders for rate 1/4
[Plot: performance (kbit/s, log scale, 1 to 100,000) vs. constraint length K (6 to 16). Curves: 16-way, 8-way, 4-way, scalar.]
Single core of Core2 Extreme (quad-core), 3 GHz, Intel C++ compiler 10.0

Organization
- Spiral
- Generating Software Viterbi Decoders
- Performance results
- Summary

Summary
- Platforms are powerful yet complicated: optimization will stay a hard problem (Image: Intel)
- Automatic generation of Viterbi decoder from high-level specification
- Spiral: program generation and autotuning can provide full automation
- Performance of Spiral’s Viterbi decoders is competitive with expert hand tuning

(Part of the) Spiral Team
www.spiral.net
