
Microelectronic System Design Research Group, University Kaiserslautern, www.eit.uni-kl.de/wehn
A Case Study in Reliability


  1. Microelectronic System Design Research Group, University Kaiserslautern, www.eit.uni-kl.de/wehn
     A Case Study in Reliability-Aware Design: A Resilient LDPC Decoder Architecture
     Norbert Wehn
     MPSoC 07, Awaji Island, Hyogo, Japan, June 2007
     This work was partially sponsored by the BMBF Initiative „Autonome Integrierte Systeme“ (Autonomous Integrated Systems).

     Motivation
     • Extreme static & dynamic variations will result in unreliable components
     • How to build reliable systems with an unreliable "physical layer"?
     • Resilient architectures tolerating variability and sporadic errors
     [Figure labels: α-particle, cosmic neutrons]

  2. Case Study: LDPC Decoder
     • Emerging killer applications
       – Recognition, Mining, Synthesis (RMS)
       – Probabilistic belief propagation algorithms
     • LDPC decoding is representative for RMS algorithms
       – Hot topic in wireless communications (WiMAX, DVB-S2, WiFi, space applications)
       – High throughput, low latency requirements, large flexibility
     • Communication and memory centric architecture
     • Sources of unreliability
       – E.g. timing errors in the communication network due to cross talk and voltage noise
       – E.g. soft errors in memories and communication network
     Goal: increase LDPC decoder reliability for a given system performance with minimum hardware overhead and throughput degradation

     Error Resilient LDPC Decoder
     [Figure: abstraction levels of LDPC decoding: application, algorithm, architecture, subblocks, unreliable physical layer (transistor, circuit)]
     • Large design space for resilient architectures
       – Spatial and time redundancy, e.g. TMR (space), ARQ (time)
       – Error detection/correction codes, e.g. CRC, Hamming codes
       – Application resilience (probabilistic & iterative)

  3. Algorithm/Architecture/EDC Codesign
     • ALGORITHM: investigation w.r.t. fault tolerance and error sensitivity, e.g.
       – Single/two phase belief propagation, layered belief propagation algorithms
       – Sum-Product, 3-Min, Min-Sum
     • ARCHITECTURE: select robust architecture, e.g.
       – Single-phase, two-phase
       – Sign-magnitude, 2K (two's complement)
       – Critical signals
     • SUBBLOCK: identify "reliability sensitivity" for each subblock
       – Select appropriate technique for each subblock to increase SYSTEM reliability
     All steps are strongly interrelated!

     UKL LDPC Decoder Implementations

     LDPC code              DVB-S2         DVB-S2          WiMax (802.16e)  WiFi (802.11n)    U-S LDPC (UWB)
     Codeword size          64800          64800           576-2304         648, 1296, 1944   9600
     Code rate              1/4-9/10       1/4-9/10        1/2-5/6          1/2-5/6           3/4
     Parallelism            90             360             24-96            27-81             80
     Max. iterations        50-15          50-15           25-20            25-20             7
     Clock (65 nm)          400 MHz        400 MHz         400 MHz          400 MHz           528 MHz
     Area [mm²], VNP        0.130          0.217           0.110            0.096             0
     Area [mm²], CNP        0.328          1.200           0.470            0.395             0.212
     Area [mm²], network    0.046          0.270           0.206            0.065             0.027
     Area [mm²], memory     3.357          4.428           0.551            0.467             0.265
     Overall area [mm²]     3.86           6.11            1.33             1.02              0.50
     Net throughput         60-708 Mbps    0.23-2.68 Gbps  48-333 Mbps      54-281 Mbps       1.63 Gbps
     Latency                270-82 µs      69-21 µs        6.0-5.7 µs       6.0-5.8 µs        4.4 µs
     Max. efficiency        183 Mbps/mm²   430 Mbps/mm²    250 Mbps/mm²     274 Mbps/mm²      3.2 Gbps/mm²
     Infobit/cycle          0.15-1.77      0.58-6.70       0.12-0.83        0.14-0.70         3.08

     Quantization: 6 bit
     Algorithm (values in source order, column assignment not recoverable): 3-Min; MinSum+MSF/Lay.
     Architecture (values in source order, column assignment not recoverable): 1-phase; PN branch; Combined; 1-phase; Layered

     • Selected WiMax standard as case study
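The derived rows of the implementation table can be cross-checked against the others. The following is a minimal Python sketch for the WiMax column. It assumes that overall area is the sum of the four sub-block areas, that max. efficiency is peak net throughput divided by overall area, and that infobit/cycle is peak net throughput divided by clock frequency. These relations are inferred from the numbers and are not stated on the slide; the function names are illustrative.

```python
# Figures for the WiMax (802.16e) decoder, copied from the implementation table.
AREA_MM2 = {"VNP": 0.110, "CNP": 0.470, "Network": 0.206, "Memory": 0.551}
PEAK_NET_THROUGHPUT_MBPS = 333.0
CLOCK_MHZ = 400.0


def overall_area(parts):
    """Assumed relation: overall area is the sum of the sub-block areas."""
    return sum(parts.values())


def max_efficiency(throughput_mbps, area_mm2):
    """Assumed relation: peak net throughput per unit area, in Mbps/mm^2."""
    return throughput_mbps / area_mm2


def infobits_per_cycle(throughput_mbps, clock_mhz):
    """Assumed relation: information bits delivered per clock cycle."""
    return throughput_mbps / clock_mhz


area = overall_area(AREA_MM2)
print(f"overall area    : {area:.3f} mm^2")
print(f"max. efficiency : {max_efficiency(PEAK_NET_THROUGHPUT_MBPS, area):.0f} Mbps/mm^2")
print(f"infobit/cycle   : {infobits_per_cycle(PEAK_NET_THROUGHPUT_MBPS, CLOCK_MHZ):.2f}")
```

Under these assumptions the results land close to the tabulated 1.33 mm², 250 Mbps/mm², and 0.83 infobit/cycle; small differences come from rounding in the table.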

  4. Single-Phase 3-Min Algorithm
     [Figure: decoder block diagram with channel, sum, and message (MSG) RAMs, VFUs, controller, permutation network Π, CFUs, and inverse permutation network Π⁻¹]

     Soft Errors in Memories
     [Figure: the same block diagram extended with ENC2 and ED/PUN(0) units]
     • Encoding (ENC2): MSB of channel values doubled
     • Error detection: comparison of the MSB
     • Error correction: puncturing, i.e. channel values are set to 0 (algorithmic fault tolerance)
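The three memory-protection steps (ENC2, error detection, puncturing) can be sketched in Python. This is an illustration under stated assumptions, not the decoder's actual logic: the 6-bit word width is taken from the quantization listed for the implementations, the position of the duplicated MSB in the stored word is hypothetical, and the names `enc2` and `ed_pun0` are made up for this example.

```python
WIDTH = 6                    # assumed channel-value width (6-bit quantization)
MASK = (1 << WIDTH) - 1


def enc2(value):
    """Store a copy of the MSB above the word (hypothetical bit layout)."""
    msb = (value >> (WIDTH - 1)) & 1
    return (msb << WIDTH) | (value & MASK)


def ed_pun0(stored):
    """Compare both MSB copies; on mismatch puncture the value to 0."""
    msb_copy = (stored >> WIDTH) & 1
    value = stored & MASK
    msb = (value >> (WIDTH - 1)) & 1
    return value if msb_copy == msb else 0


word = 0b101101
stored = enc2(word)
print(ed_pun0(stored) == word)             # no error: value passes through
print(ed_pun0(stored ^ (1 << 5)))          # bit flip in the MSB: punctured
print(ed_pun0(stored ^ 1) == (word ^ 1))   # flip in a lower bit: not detected
```

Only the MSB is protected in this sketch, so a flip in a lower bit goes through unnoticed. A punctured value of 0 is treated here as "no information", which the iterative decoder is expected to tolerate.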

  5. Channel RAM
     MTBF (bit flipping) = 2 ms ≈ 10^15 FIT

     Message RAM
     [Figure: decoder block diagram with ENC2 and ED/PUN(0) units]
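The MTBF-to-FIT conversion can be reproduced with a few lines of Python. The sketch uses the standard definition of FIT as failures per 10^9 hours of operation; the function name is illustrative.

```python
SECONDS_PER_HOUR = 3600.0
FIT_REFERENCE_HOURS = 1e9    # FIT counts failures per 10^9 hours


def mtbf_to_fit(mtbf_seconds):
    """Convert a mean time between failures in seconds to a FIT rate."""
    mtbf_hours = mtbf_seconds / SECONDS_PER_HOUR
    return FIT_REFERENCE_HOURS / mtbf_hours


print(f"{mtbf_to_fit(2e-3):.2e} FIT")
```

For an MTBF of 2 ms this gives about 1.8 × 10^15 FIT, so the slide's 10^15 FIT is an order-of-magnitude figure.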

  6. Message RAM
     • Inherent fault tolerance of the belief propagation algorithm

     Permutation Network: Soft and Timing Errors
     [Figure: VFUs with ED/PUN(0) and ENC2 units, controller, permutation network Π, CFUs with ED, PUN(0), and ENC2 units, inverse permutation network Π⁻¹]

  7. Data Representation
     K2 (two's complement) versus sign/magnitude:
     • Sign/magnitude reduces power and noise
     • Algorithmic fault tolerance: only the sign is important!
     Error rate 10^-4: 90000 bits/iteration => 9 erroneous bits/iteration

     Permutation Networks
     [Figure: signals sign_i, magn_i, rst, and par_i with XOR gates before and after permutation network Π, feeding CFU_i]
     • Encoding
       – Sign bit doubled, toggle redundant sign every clock cycle (timing errors)
     • Error detection and correction
       – Error in input message: all output messages of this check node are set to 0
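One way to read the toggled redundant sign is sketched below in Python. The assumptions are: the redundant bit `par` is the sign XORed with a bit that alternates every clock cycle, the receiver knows the current cycle parity, and a timing error shows up as a message left over from the previous cycle. The `rst` signal from the figure is not modeled, and all function names are hypothetical.

```python
def encode(sign, magn, cycle):
    """Attach a redundant sign bit that is inverted on odd clock cycles."""
    return (sign, magn, sign ^ (cycle & 1))


def is_valid(message, cycle):
    """Undo the toggle for the current cycle and compare with the sign."""
    sign, _magn, par = message
    return (par ^ (cycle & 1)) == sign


def guarded_check_node(messages, cycle, update):
    """Run `update` only if every input passes; otherwise output all zeros."""
    if all(is_valid(m, cycle) for m in messages):
        return update(messages)
    return [0] * len(messages)


fresh = encode(1, 5, cycle=4)
stale = encode(1, 5, cycle=3)      # value left over from the previous cycle
print(is_valid(fresh, cycle=4))    # consistent message
print(is_valid(stale, cycle=4))    # stale message fails the check
print(is_valid((0, 5, 1), 4))      # flipped sign bit fails the check
```

Because the toggle bit differs between consecutive cycles, a message that arrives one cycle late fails the comparison even though its sign and redundant bit were consistent when it was sent. A plain duplicated sign would not catch that case.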

  8. Permutation Networks

     Check Nodes
     [Figure: VFUs, permutation network Π, controller, CFU with two sign calculation units, a magnitude calculation unit, control, EDC control, ED control, and PUN(0), inverse permutation network Π⁻¹]
     • Encoding: sign calculation doubled / controller tripled
     • Error correction
       – Message puncturing: reuse the PUN unit that handles errors in the permutation network
       – Controller: 2 out of 3 voter
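The 2-out-of-3 voter for the tripled controller can be sketched as a bitwise majority function. This is a generic Python illustration; representing the controller outputs as integers and the name `vote_2_of_3` are assumptions made for the example.

```python
def vote_2_of_3(a, b, c):
    """Bitwise majority: each output bit follows at least two of the inputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b1010
faulty = 0b0110                     # one controller copy with two wrong bits
print(vote_2_of_3(good, good, faulty) == good)
```

A single faulty copy is outvoted regardless of which bits it corrupts. Two copies that fail in the same bit position are not corrected.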

  9. Check Nodes

     Putting all together
