Low-latency software LDPC decoders for x86 multi-core devices

  1. Low-latency software LDPC decoders for x86 multi-core devices
     Bertrand LE GAL and Christophe JEGO
     IMS laboratory, CNRS UMR 5218, Digital Circuits and Systems team
     Bordeaux-INP, University of Bordeaux, France
     IEEE International Workshop on Signal Processing Systems (SIPS)
     October 3rd, 2017, Lorient, France
     firstname.lastname@ims-bordeaux.fr

  2. Historically, software decoders were limited to…
     - Benchmarking of decoding algorithms
     - Validating and comparing error correction code families or code construction techniques
     - Parameter optimization
     - Estimation of hardware decoder performance before development
     [Figures: Tanner graph with check nodes C(1)…C(m) and variable nodes V(1)…V(n); block diagrams of hardware decoder architectures]

  3. Currently they can fulfill other real-time performance requirements
     - Software decoders are at least as fast as many hardware circuits.
     - They provide design and runtime flexibility.
     - Throughputs are higher than 1 Gbps on multi-core or many-core devices.
     - Currently, they are compatible with some industrial use cases.
     - Processing latencies of hundreds of µs or ms are too high.
     - Consecutive frame configurations can be different (N, rate), which rules out exploiting inter-frame parallelism [1].
     [1] OpenAirInterface 5G software alliance for democratising wireless innovation

  6. The processing performance of GPU & CPU devices
     [Figures: NVIDIA Tegra K1 GPU; INTEL Core-i7 processor]
     Multicore device (e.g. INTEL Core-i7)
     - One chip composed hierarchically of physical processor cores (4), each with a SIMD unit (1).
     - During 1 clock cycle, a SIMD instruction can perform 32 computations on 8-bit fixed-point data => 32 8b-oper.
     - During 1 clock cycle, a physical processor (superscalar) can perform up to 6 SIMD instructions => 192 8b-oper.
     - During 1 clock cycle, a Core-i7 processor can execute 4 cores x 6 SIMD instructions => 768 8b-oper.
     - With a 1 to 3 GHz clock frequency, it delivers (theoretically) a high processing performance.
     GPU devices (e.g. NVIDIA Titan GPU)
     - One chip composed hierarchically of stream processors (14) and cores (2688).
     - Each stream processor controls a set of cores (192).
     - During 1 clock cycle, 2688 floating-point operations can be executed.
     - However, more computations are required to hide processing and memory access latencies.
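The peak figures quoted on this slide follow from simple multiplication. The short Python sketch below reproduces that arithmetic; the constant and function names are illustrative, and the per-second figure is a theoretical upper bound only, since it assumes every issue slot is filled on every cycle.

```python
# Peak operations per clock cycle, using only the figures quoted on the slide.
SIMD_LANES_8BIT = 32        # 8-bit operations per SIMD instruction
SIMD_INSTR_PER_CYCLE = 6    # SIMD instructions per core per cycle (slide's figure)
CPU_CORES = 4

GPU_STREAM_PROCESSORS = 14
GPU_CORES_PER_STREAM_PROCESSOR = 192

ops_per_core = SIMD_LANES_8BIT * SIMD_INSTR_PER_CYCLE
ops_per_cpu = ops_per_core * CPU_CORES
gpu_cores = GPU_STREAM_PROCESSORS * GPU_CORES_PER_STREAM_PROCESSOR


def peak_ops_per_second(ops_per_cycle, clock_hz):
    """Theoretical peak: assumes every cycle issues the maximum number of operations."""
    return ops_per_cycle * clock_hz


print(ops_per_core, ops_per_cpu, gpu_cores)  # 192 768 2688
```

At the upper end of the quoted clock range (3 GHz), `peak_ops_per_second(ops_per_cpu, 3_000_000_000)` gives about 2.3 × 10^12 8-bit operations per second, a bound that real decoders will not reach.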

  7. The structure of standardized LDPC codes
     [Figure: WIMAX 576 × 288 LDPC code, Z = 24; a Z × Z shifted identity matrix; the reconstructed H matrix]
     ๏ Standardized H matrices have a Quasi-Cyclic (QC) structure,
       ➡ Compressed matrix definition,
       ➡ Z expansion factor,
       ➡ Shifting coefficients,
     ๏ This QC structure of the H matrix
       ➡ Reduces the H memory footprint,
       ➡ Limits the data dependencies during decoding, making parallel computing easy,
     ๏ From a hardware point of view, the Z factor « enforces »,
       ➡ Z processing units,
       ➡ Z memory banks,
       ➡ One or two Z × Z data interleavers.
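As an illustration of the compressed definition, the Python sketch below expands a small base matrix of shifting coefficients into the full binary H matrix. The toy base matrix, the use of -1 for an all-zero block, and the shift direction are assumptions made for this example; they are not taken from the WiMAX standard or from the slides.

```python
def shifted_identity(z, shift):
    """Z x Z identity matrix whose row r has its 1 in column (r + shift) mod Z."""
    return [[1 if col == (row + shift) % z else 0 for col in range(z)]
            for row in range(z)]


def zero_block(z):
    """Z x Z all-zero block."""
    return [[0] * z for _ in range(z)]


def expand_qc(base, z):
    """Expand a compressed base matrix into the full binary H matrix.

    Assumed convention: an entry s >= 0 is an identity block shifted by s,
    and -1 is an all-zero block.
    """
    h = []
    for base_row in base:
        blocks = [shifted_identity(z, s) if s >= 0 else zero_block(z)
                  for s in base_row]
        for r in range(z):
            # Concatenate row r of every block to form one full row of H.
            h.append([bit for block in blocks for bit in block[r]])
    return h


# Hypothetical 2 x 3 base matrix with Z = 3, giving a 6 x 9 H matrix.
base = [[0, 1, -1],
        [2, -1, 0]]
H = expand_qc(base, z=3)
```

Only the base matrix and Z need to be stored: for a 576 × 288 code with Z = 24, that is a 12 × 24 table of coefficients instead of a 288 × 576 binary matrix.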

  8. The standardized LDPC codes structure
     [Figure: hardware decoder with Z elements: system interface (k information bits, IO status bits), memory units MU (LLR Ti), FSM controller with control signals, interleavers ∏ / ∏⁻¹, processing units PU with register files]
     ๏ Standardized H matrices have a Quasi-Cyclic (QC) structure,
       ➡ Compressed matrix definition,
       ➡ Z expansion factor,
       ➡ Shifting coefficients,
     ๏ This QC structure of the H matrix
       ➡ Reduces the H memory footprint,
       ➡ Limits the data dependencies during decoding, making parallel computing easy,
     ๏ From a hardware point of view, the Z factor « enforces » the design,
       ➡ Z processing units,
       ➡ Z memory banks,
       ➡ One or two Z × Z data interleavers.
     Hardware design of a Z decoder structure is possible even for Z = {7, 13, 420}.

  9. Parallelization of the LDPC decoding process (1/3)
     [Figures: Tanner graphs with variable nodes V0 to V7 and check nodes C0 to C3, one per scheme]
     Parallelization of CN kernels (intra-frame)
     - Parallelization is limited by the CN degrees,
     - Horizontal SIMD processing (poor efficiency),
     - Requires unaligned memory accesses to VNs.
     Parallelization across CN kernels
     - Like in hardware architectures (Q CNs of the same degree),
     - Unaligned memory accesses to VNs,
     - Needs matrix reordering (not always possible: unstructured codes).
     Parallelization across frames
     - Very regular computation processing (including memory accesses),
     - Not evaluated in hardware architectures (high latency),
     - Requires data reordering at the beginning of the decoding.
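The reordering required by the across-frames scheme can be pictured as an interleaving of the input frames, so that the values of one variable node from all frames sit next to each other and fill one SIMD word. The Python sketch below shows that layout change on plain lists; the function names are hypothetical and a real decoder would do this on packed 8-bit data.

```python
def interleave_frames(frames):
    """Reorder F frames of N LLRs so the F values of each VN are contiguous.

    Output layout: [V0 of frames 0..F-1, V1 of frames 0..F-1, ...].
    """
    n = len(frames[0])
    if any(len(frame) != n for frame in frames):
        raise ValueError("all frames must have the same length")
    return [frame[i] for i in range(n) for frame in frames]


def deinterleave_frames(packed, num_frames):
    """Inverse reordering, applied to the decoder output."""
    return [packed[f::num_frames] for f in range(num_frames)]


frames = [[1, 2, 3], [10, 20, 30]]   # 2 frames of 3 LLRs (toy values)
packed = interleave_frames(frames)   # [1, 10, 2, 20, 3, 30]
```

After this step every memory access to a variable node is one aligned load of F neighbouring values, which is what makes the scheme regular.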

  10. Parallelization of the LDPC decoding process (2/3)
     [Figures: Tanner graphs with variable nodes V0 to V7 and check nodes C0 to C3, one per scheme]
     Parallelization of CN kernels
     - Parallelization is limited by the CN degrees,
     - Horizontal SIMD processing (poor efficiency),
     - Requires unaligned memory accesses to VNs.
     Parallelization across CN kernels
     - Like in hardware architectures (Q CNs of the same degree),
     - Unaligned memory accesses to VNs,
     - Needs matrix reordering (not always possible: unstructured codes).
     SIMD parallelization across frames [1] (inter-frame)
     - Regular computation processing,
     - High memory footprint at runtime (buffering),
     - Decoding processing latency is high (100 µs).
     [1] High-throughput multi-core LDPC decoders based on x86 processor, B. Le Gal and C. Jego, IEEE TPDS, 2016
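To see why inter-frame processing is regular, consider one check-node update applied lane by lane, each lane holding a different frame. The Python sketch below uses the min-sum update rule as an assumption (the slides do not name the update rule), and the function name and data layout are illustrative.

```python
def cn_update_min_sum(inputs):
    """Min-sum check-node update for one CN, applied lane-wise over F frames.

    inputs[d][f] is the message from the d-th connected VN in frame f.
    outputs[d][f] combines all OTHER edges of the same frame:
    product of their signs times the minimum of their magnitudes.
    """
    degree = len(inputs)
    lanes = len(inputs[0])
    outputs = [[0] * lanes for _ in range(degree)]
    for f in range(lanes):  # a SIMD instruction would process all lanes at once
        for d in range(degree):
            others = [inputs[k][f] for k in range(degree) if k != d]
            negatives = sum(1 for x in others if x < 0)
            sign = -1 if negatives % 2 else 1
            outputs[d][f] = sign * min(abs(x) for x in others)
    return outputs


# Degree-3 check node, 2 frames (toy values).
messages = [[3, -1], [-2, 4], [5, 6]]
result = cn_update_min_sum(messages)  # [[-2, 4], [3, -1], [-2, -1]]
```

Every lane executes the same sequence of operations on the same edges, so no lane masking or data shuffling is needed; the cost is that all F frames must be buffered and complete together, which is where the memory footprint and latency come from.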

  11. Parallelization of the LDPC decoding process (3/3)
     [Figures: Tanner graphs with variable nodes V0 to V7 and check nodes C0 to C3, one per scheme; Z CNs processed together]
     Parallelization of CN kernels
     - Parallelization is limited by the CN degrees,
     - Horizontal SIMD processing (poor efficiency),
     - Requires unaligned memory accesses to VNs.
     Parallelization across CN kernels (intra-frame)
     - Low latency (like in hardware architectures),
     - Should be quite efficient (when Z > SIMD width),
     - Irregular accesses to VNs => performance penalties,
     - Limited to QC LDPC codes.
     Parallelization across frames
     - Very regular computation processing (including memory accesses),
     - Not evaluated in hardware architectures (high latency),
     - Requires data reordering at the beginning of the decoding.
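The irregular VN accesses of the intra-frame scheme come from the cyclic shifts of the QC structure. The Python sketch below assumes that, for a block with shift s, check node j of a group of Z reads variable node (j + s) mod Z, and then splits the Z values into SIMD words. The shift convention, the function names, and the SIMD width of 8 are assumptions chosen for a small example.

```python
def rotated_block(vn_block, shift):
    """Values read by the Z check nodes of one group through a shifted
    identity block: CN j reads VN (j + shift) mod Z (assumed convention)."""
    z = len(vn_block)
    return [vn_block[(j + shift) % z] for j in range(z)]


def simd_chunks(values, width):
    """Split values into SIMD words of `width` lanes (the last may be partial)."""
    return [values[i:i + width] for i in range(0, len(values), width)]


Z = 24                       # expansion factor of the WiMAX example
vn_indices = list(range(Z))  # VN positions inside one Z-wide block
words = simd_chunks(rotated_block(vn_indices, shift=5), width=8)
# words[0] starts at VN 5 (unaligned load); words[2] wraps around the block end.
```

With a shift of 5, the first word starts at offset 5 rather than on a word boundary and the last word is `[21, 22, 23, 0, 1, 2, 3, 4]`, so it must be assembled from two separate memory regions. Both effects are the kind of access penalty the slide refers to.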
