XED: EXPOSING ON-DIE ERROR DETECTION INFORMATION FOR STRONG MEMORY RELIABILITY
Prashant Nair, Georgia Tech
Vilas Sridharan, AMD Inc.
Moinuddin Qureshi, Georgia Tech
ISCA 43, June 20th 2016, Seoul, Republic of Korea

INTRODUCTION
DRAM Scaling → High Capacity Memories
Two types of DRAM faults:
• Scaling Faults [ArchShield ISCA’13, CiDRA HPCA’15]
• Runtime Faults [Sridharan et al., SC13]

[Figure 2: Exponential increase in aspect ratio of DRAM cells with scaling to smaller technology nodes (redrawn from [5]). Axes: aspect ratio of storage node (H/b) versus technology node (nm). Source: S. J. Hong (Hynix), IEDM 2010]

Fault Mode   Transient Fault Rate (FIT)   Permanent Fault Rate (FIT)
Bit          14.2                         18.6
Word         1.4                          0.3
Column       1.4                          5.6
Row          0.2                          8.2
Bank         0.8                          10
Total        18                           42.7

ON-DIE ECC: MITIGATE SCALING FAULTS
DRAM vendors plan to use “On-Die ECC”
• Mitigates scaling faults transparently
• Enables good DIMM with bad chips (yield)
• Part of: LPDDR4, DDR4, DDR5 (proposed)

[Figure: x8 DIMM with 8 chips serving REQUEST A; each chip holds DATA and ECC]
[Figure: 64 bits of data and 8 bits of ECC pass through a (72,64) ECC block (Detect, Correct) that outputs 64 bits of correct data]
On-Die ECC: Single Error Correction, Double Error Detection code (SECDED)
On-Die ECC fixes scaling faults invisibly
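
The (72,64) SECDED code named above can be illustrated with a textbook extended Hamming construction. This is a minimal sketch under stated assumptions, not the layout any DRAM vendor uses: the bit ordering, the placement of the overall parity bit at index 0, and the names `encode` and `decode` are all hypothetical.

```python
# Sketch of a (72,64) SECDED code built as an extended Hamming code.
# Assumed layout (not from the slides): index 0 holds the overall parity bit,
# indices 1..71 are Hamming positions with check bits at powers of two.

CHECK_POSITIONS = (1, 2, 4, 8, 16, 32, 64)


def encode(data: int) -> list[int]:
    """Encode a 64-bit integer into a 72-bit codeword (list of 0/1 values)."""
    code = [0] * 72
    bit = 0
    for pos in range(1, 72):
        if pos not in CHECK_POSITIONS:
            code[pos] = (data >> bit) & 1
            bit += 1
    for p in CHECK_POSITIONS:
        # Check bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 72) if i & p) % 2
    code[0] = sum(code) % 2  # overall parity makes the total weight even
    return code


def decode(code: list[int]) -> tuple[str, int]:
    """Return (status, data); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 72):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < 72:
        # Odd overall parity: treat as one flipped bit at index `syndrome`
        # (index 0 is the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity with a nonzero syndrome: double error detected.
        status = "uncorrectable"
    data = 0
    bit = 0
    for pos in range(1, 72):
        if pos not in CHECK_POSITIONS:
            data |= code[pos] << bit
            bit += 1
    return status, data


word = 0x0123456789ABCDEF
codeword = encode(word)
codeword[10] ^= 1                # inject a single-bit fault
status, recovered = decode(codeword)
print(status, hex(recovered))
```

In this sketch a single flipped bit is repaired inside the chip, so the memory controller never observes it, which is the sense in which on-die ECC works invisibly.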

MITIGATING RUNTIME FAULTS
Runtime faults
• Chip faults common
• Need strong ECC

Fault Mode   Transient Fault Rate (FIT)   Permanent Fault Rate (FIT)
Bit          14.2                         18.6
Word         1.4                          0.3
Column       1.4                          5.6
Row          0.2                          8.2
Bank         0.8                          10
Total*       18                           42.7
*Sridharan+ SC13

[Figure: ECC DIMM (9 chips): 8 data chips plus 1 ECC chip]

Runtime chip faults → Chipkill (strong ECC)
[Figure: a READ accesses 18 DRAM chips, drawn as two groups of 8 data chips plus 1 ECC chip]
Cost: 18 chips, performance and power inefficient

GOAL AND CHALLENGE
GOAL: Use On-Die ECC to mitigate runtime faults
“Chipkill-level reliability using x8 ECC-DIMM”
CHALLENGE: On-Die ECC is invisible; expose it without changing the memory interface

OUTLINE
• BACKGROUND
• XED
• CASE STUDIES
• EVALUATION
• SUMMARY

USING PARITY + FAILED LOCATION
What if the chip can inform that it failed?
[Figure: 8 data chips (D0 to D7) and a parity chip (PA) feed the memory controller; the chip holding D2 reports FAIL]
Parity + Location → Reconstruct data for faulty chip
Fix chip-faults using only 9 chips
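
The parity-plus-location idea can be sketched as an erasure repair: once the failed chip's position is known, its word is the XOR of the parity word and the surviving data words. The snippet below is a hedged illustration; the function names and the sample data values are hypothetical.

```python
from functools import reduce
from operator import xor


def make_parity(data_words: list[int]) -> int:
    """Parity word PA: bitwise XOR of all data words D0..D7."""
    return reduce(xor, data_words, 0)


def reconstruct(words: list[int], parity: int, failed: int) -> int:
    """Rebuild the word of the chip at index `failed` from the other chips plus PA."""
    survivors = [w for i, w in enumerate(words) if i != failed]
    return reduce(xor, survivors, parity)


# Eight 64-bit words, one per data chip (values are made up for the example).
data = [0x1111111111111111 * (i + 1) for i in range(8)]
pa = make_parity(data)

received = list(data)
received[2] = 0                     # chip 2 failed; its content is unusable
repaired = reconstruct(received, pa, failed=2)
print(hex(repaired))
```

The repair needs only the one parity chip, but it works only if something tells the controller which chip failed; plain parity can signal a mismatch but not its location.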

XED: EXPOSED ON-DIE ERROR DETECTION
XED consists of three components
• Strong detection in addition to SEC
• Parity-based correction
• Transparently identifying faulty chip

XED: ON-DIE ECC AS DETECTION CODE
On-Die Error Strong Detection + Correction Code
CRC-8 ATM code instead of Hamming code

Failure type          Corrects?   Detects?
Single-bit failures   Yes         Yes
Chip failures         No          Yes (99.9%)

[Figure: 64-bit data passes through Detect and Correct blocks]
On-Die ECC can detect chip-failures
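
To make the detection role of the code concrete, here is a minimal sketch of a CRC-8 check over one 64-bit word, using the generator polynomial x^8 + x^2 + x + 1 (0x07) associated with ATM header error control. The zero initial value, the absence of a final XOR, the big-endian byte order, and the helper names are assumptions for illustration; the slides do not specify them.

```python
POLY = 0x07  # x^8 + x^2 + x + 1, the CRC-8 polynomial used by ATM HEC


def crc8_atm(data: bytes) -> int:
    """Bitwise CRC-8 with polynomial 0x07, initial value 0, no final XOR (assumed)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ POLY) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def store(word: int) -> tuple[bytes, int]:
    """Return the 8 data bytes and the 8 check bits kept alongside them."""
    raw = word.to_bytes(8, "big")
    return raw, crc8_atm(raw)


def error_detected(raw: bytes, check: int) -> bool:
    """True when the stored check bits no longer match the data."""
    return crc8_atm(raw) != check


raw, check = store(0x0123456789ABCDEF)

# Corrupt one whole byte: an 8-bit burst, far beyond what SEC could correct.
damaged = bytearray(raw)
damaged[3] ^= 0xFF
print(error_detected(raw, check), error_detected(bytes(damaged), check))
```

An 8-bit CRC is guaranteed to catch any error burst of up to 8 bits, which is why the byte-wide corruption above is always flagged. Arbitrary multi-bit corruption is caught only with high probability, not with certainty.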

XED: RAID-3 BASED CORRECTION
If we could expose On-Die Error Detection → Chipkill
[Figure: 8 data chips plus a parity chip; On-Die ECC detected the failure in one chip]
Reconstruct data in failed chip

EXPOSE ON-DIE ERROR INFO
OPTION 1: Use additional wires
[Figure: each of the 9 chips sends its word (D0 to D7, PA) to the memory controller; the failed chip also raises a FAIL signal on an extra wire]
• Incompatible with DDR memory standards
• Needs a new protocol
• Worse for pin-constrained future systems!

OPTION 2: Use additional burst/transaction
[Figure: after sending D0 to D7 and PA, each chip also sends a status to the memory controller: one chip reports FAIL, the others report OK]
• Additional 12.5% to 100% bandwidth overheads
• Performance and power inefficient
Expose On-Die error detection with minor changes
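
One plausible reading of the 12.5% to 100% range, offered here as an assumption rather than something the slides state, is that a read normally occupies a burst of 8 beats: one extra beat for status costs 1/8 of a read, while a full extra transaction doubles the traffic. The arithmetic can be sketched as:

```python
BURST_LENGTH = 8  # assumption: beats per read burst, not stated on the slide


def overhead_percent(extra_beats: int, burst_length: int = BURST_LENGTH) -> float:
    """Extra bus occupancy relative to one normal read, in percent."""
    return 100.0 * extra_beats / burst_length


one_extra_beat = overhead_percent(1)                    # status in one added beat
one_extra_transaction = overhead_percent(BURST_LENGTH)  # status in a second full read
print(one_extra_beat, one_extra_transaction)
```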

XED: ON-DIE ERROR INFO FOR FREE
On detecting an error, the DRAM chip sends a 64-bit “Catch-Word” (CW) instead of data
[Figure: the memory controller receives D0, D1, CW, D3 to D7, and PA, 64 bits from each chip]
• Chips provisioned with a unique Catch-Word
• No additional wires/bandwidth overheads
• Compatible with existing memory protocols
64-bit Catch-Words identify the faulty chip

XED: MUX TO SEND CATCH-WORDS
[Figure: the Detect block selects between the Correct path (64-bit data) and the CW; the chip outputs Data or CW]
Simple MUX to choose between Data and Catch-Word
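
The chip-side behavior amounts to a two-way select on the read path. The following is a minimal sketch; the `DramChip` class, its fields, and the catch-word value are hypothetical, and `error_detected` stands in for the output of the on-die detection logic.

```python
from dataclasses import dataclass


@dataclass
class DramChip:
    """Toy model of one chip's read path (names are illustrative only)."""
    catch_word: int  # unique 64-bit pattern provisioned for this chip

    def read(self, stored_word: int, error_detected: bool) -> int:
        # The MUX: a detected error selects the catch-word, otherwise the data.
        return self.catch_word if error_detected else stored_word


chip2 = DramChip(catch_word=0xDEADBEEFCAFEF00D)  # hypothetical value
healthy = chip2.read(0x0123456789ABCDEF, error_detected=False)
faulty = chip2.read(0x0123456789ABCDEF, error_detected=True)
print(hex(healthy), hex(faulty))
```

Because the catch-word travels in the same 64 bits the data would have used, the read looks like any other read on the bus.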

WHY DO CATCH-WORDS WORK?
Catch Word (CW) ≠ Valid Data (D2)
Then → PA ≠ D0 ⊕ D1 ⊕ CW ⊕ … ⊕ D7
[Figure: the memory controller receives D0, D1, CW, D3 to D7, and PA, 64 bits from each chip]
Location Identified
D2 = D0 ⊕ D1 ⊕ D3 ⊕ … ⊕ PA
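
The controller-side steps (notice the parity mismatch, find the chip whose word equals its catch-word, rebuild that word by XOR) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it handles a single faulty data chip only, and the function name, the catch-word values, and the sample data are hypothetical.

```python
from functools import reduce
from operator import xor


def controller_read(words: list[int], parity: int, catch_words: list[int]) -> list[int]:
    """words[i] is what data chip i sent; catch_words[i] is that chip's catch-word.

    Returns the data words after repairing at most one faulty data chip.
    """
    if reduce(xor, words, parity) == 0:
        return list(words)  # parity matches: accept the words as-is
    suspects = [i for i, w in enumerate(words) if w == catch_words[i]]
    if len(suspects) != 1:
        raise ValueError("cannot locate a single faulty data chip")
    bad = suspects[0]
    repaired = list(words)
    # D_bad = XOR of all other data words and PA.
    repaired[bad] = reduce(xor, (w for i, w in enumerate(words) if i != bad), parity)
    return repaired


# Hypothetical per-chip catch-words and data.
catch_words = [0xC0DE000000000000 | i for i in range(8)]
data = [0x1111111111111111 * (i + 1) for i in range(8)]
pa = reduce(xor, data, 0)

sent = list(data)
sent[2] = catch_words[2]            # chip 2 detected an error and sent its CW
result = controller_read(sent, pa, catch_words)
print(result == data)
```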

Catch Word (CW) = Valid Data (D2) [Collision]
Then → PA = D0 ⊕ D1 ⊕ CW ⊕ … ⊕ D7
[Figure: the memory controller receives D0, D1, CW, D3 to D7, and PA, 64 bits from each chip]
No Error as Parity Matches
Catch-Word collision: Doesn’t affect correctness
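
The collision case can be checked with a few lines: if the stored data happens to equal the catch-word, the word on the bus is the correct data whether or not the chip flagged an error, so parity still matches. The values and names below are hypothetical and serve only to illustrate the argument.

```python
from functools import reduce
from operator import xor

CATCH_WORD = 0xDEADBEEFCAFEF00D  # hypothetical catch-word of chip 2


def parity_matches(words: list[int], parity: int) -> bool:
    """True when PA equals the XOR of all received data words."""
    return reduce(xor, words, parity) == 0


# D2 legitimately holds a value identical to chip 2's catch-word.
data = [0x10, 0x20, CATCH_WORD, 0x40, 0x50, 0x60, 0x70, 0x80]
pa = reduce(xor, data, 0)

# Case 1: chip 2 is healthy and sends D2, which merely looks like the CW.
healthy_read = list(data)

# Case 2: chip 2 detects an error and sends the CW, which equals the true D2.
faulty_read = list(data)
faulty_read[2] = CATCH_WORD

collision_ok = parity_matches(healthy_read, pa) and parity_matches(faulty_read, pa)
print(collision_ok, healthy_read == data, faulty_read == data)
```

In both cases the controller sees matching parity and uses the words as received, and those words are the correct data.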