RELIABILITY OF RESISTIVE MEMORIES Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 7810: Advanced Computer Architecture
Overview ¨ Upcoming deadlines ¤ April 6th: student paper presentation ¨ This lecture ¤ Hard errors in resistive memories ¤ Increasing reliability by replication, ECP, SAFER, FREE-p ¤ Resistive computing
Recall: Resistive vs. Dynamic RAM ¨ Phase-Change RAM ¤ Nonvolatile ¤ Projected to be more scalable ¤ Cells may be written individually ¤ Slower, with more energy-intensive writes ¤ Susceptible to hard errors ¨ DRAM ¤ Volatile, charge based ¤ Difficult to further scale down the capacitor ¤ All of the accesses are through the row buffer ¤ Faster, with acceptable energy consumption ¤ Vulnerable to soft errors
Solutions to Memory Hard Errors ¨ Accept failure of some fraction of pages ¤ Map failed pages out of logical memory ¨ Wear-level data pages/blocks, and within blocks ¤ Shift/rotate data randomly (intervals/locations) ¨ Differential writes ¤ Write only cells with values that change ¨ Correct errors when possible ¤ Error correction techniques
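The differential-write bullet above can be illustrated with a minimal sketch. It assumes an 8-bit word and a hypothetical helper name, `differential_write`; a real controller would do the read-compare-write in hardware.

```python
def differential_write(stored: int, new: int, width: int = 8):
    """Return the new word and the bit positions that actually get programmed.

    Hypothetical helper: only cells whose value changes are written,
    so unchanged cells see no additional wear.
    """
    mask = (1 << width) - 1
    changed = (stored ^ new) & mask  # 1 wherever old and new values differ
    positions = [i for i in range(width) if (changed >> i) & 1]
    return new & mask, positions


value, written = differential_write(0b10101100, 0b10100101)
print(written)  # [0, 3]: only two of the eight cells are programmed
```

Here six of the eight cells keep their old value and are never touched.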
Error Correction Techniques ¨ No correction (detection only) ¤ Inefficient ¤ A page must be retired when the first cell fails ¨ SECDED ECC ¤ With a 12.5% memory overhead ¤ Figure: 8 chips at 8 bits/chip supply 64 data bits; SEC/SECDED adds 7/8 check bits, a 10.9%/12.5% overhead
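The 10.9% and 12.5% figures follow from the Hamming bound: a single-error-correcting code over k data bits needs the smallest r with 2^r >= k + r + 1, and SECDED adds one overall parity bit. A short sketch of that arithmetic (the function names are illustrative):

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r such that 2**r >= data_bits + r + 1 (Hamming bound)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def overhead(data_bits: int, secded: bool):
    """Return (check bits, overhead in percent of the data bits)."""
    check = sec_check_bits(data_bits) + (1 if secded else 0)
    return check, 100.0 * check / data_bits


print(overhead(64, secded=False))  # (7, 10.9375)
print(overhead(64, secded=True))   # (8, 12.5)
```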
Error Correction Techniques (cont.) ¨ SECDED ECC ¤ A page must be retired when a block within the page suffers a second error
Error Correction Codes ¨ Good for soft errors ¤ Transient errors ¨ Not good for hard errors ¤ ECC has high entropy and can hasten wear-out ¤ Flipping just one data bit changes about half of ECC bits
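The "about half of ECC bits" claim can be checked with a small experiment. The sketch below assumes a textbook Hamming SEC code over 64 data bits (7 check bits, data placed at non-power-of-two codeword positions); the code used in an actual memory system may differ, and the function names are illustrative.

```python
def hamming_check_bits(data: int, k: int = 64) -> int:
    """Check bits of a textbook Hamming SEC code over k data bits.

    Data bits occupy the non-power-of-two codeword positions (1-based).
    The check-bit vector equals the XOR of the positions of all set data bits.
    """
    checks = 0
    pos = 1
    placed = 0
    while placed < k:
        if pos & (pos - 1):  # not a power of two, so this is a data position
            if (data >> placed) & 1:
                checks ^= pos
            placed += 1
        pos += 1
    return checks


def flipped_check_bits(data: int, bit: int, k: int = 64) -> int:
    """Number of check bits that change when one data bit is flipped."""
    before = hamming_check_bits(data, k)
    after = hamming_check_bits(data ^ (1 << bit), k)
    return bin(before ^ after).count("1")


avg = sum(flipped_check_bits(0, b) for b in range(64)) / 64
print(f"average check bits flipped per data-bit flip: {avg:.2f} of 7")
```

Every data write that changes even one bit therefore rewrites several check cells, which is why the check bits tend to wear out faster than the data they protect.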
Dynamically Replicated Memory ¨ Goal: handle hard errors by pairing two pages that have faults in different locations; replicate data across the two pages ¨ How: errors are detected with parity bits; replica reads are issued if the initial read is faulty [ASPLOS’10]
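A minimal sketch of the pairing idea: two faulty pages are compatible when their fault locations do not overlap, so every location is readable from at least one of the two copies. The greedy matching and the names `compatible` and `pair_pages` are assumptions for illustration, not the pairing algorithm from the paper.

```python
def compatible(faults_a: set, faults_b: set) -> bool:
    """Two pages can back each other up if no location is faulty in both."""
    return faults_a.isdisjoint(faults_b)


def pair_pages(faulty_pages: dict) -> list:
    """Greedily pair pages; faulty_pages maps page id -> set of faulty offsets.

    Illustrative policy only: pages that find no partner stay unpaired.
    """
    unpaired = list(faulty_pages)
    pairs = []
    while unpaired:
        p = unpaired.pop(0)
        for q in unpaired:
            if compatible(faulty_pages[p], faulty_pages[q]):
                pairs.append((p, q))
                unpaired.remove(q)
                break
    return pairs


pages = {"A": {3, 17}, "B": {17}, "C": {5}}
print(pair_pages(pages))  # [('A', 'C')]
```

Pages A and B both fail at offset 17, so they cannot be paired; A and C can.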
Dynamically Replicated Memory ¨ Improves the lifetime of PCM by up to 40x over conventional error-detection techniques [ASPLOS’10]
Error Correction Pointers ¨ Key idea: instead of using ECC to handle a few transient faults in DRAM, use error-correcting pointers to handle hard errors in specific locations ¨ For a 512-bit line with 1 failed bit, maintain a 9-bit field to track the failed location and another bit to store the value in that location ¨ Can store multiple such pointers and can recover from faults in the pointers too [ISCA’10]
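The pointer mechanism can be sketched as follows. Each correction entry holds a 9-bit pointer (2^9 = 512) plus one replacement bit, and the line carries one "full" bit. Applying entries in order so that later entries override earlier ones is an assumption of this sketch, as are the function names.

```python
LINE_BITS = 512
POINTER_BITS = 9  # 2**9 == 512, enough to address any cell in the line


def ecp_overhead_bits(n_entries: int) -> int:
    """Storage for n correction entries plus the single 'full' bit."""
    return n_entries * (POINTER_BITS + 1) + 1


def ecp_read(raw_cells: list, entries: list) -> list:
    """Patch the raw line with (pointer, replacement_bit) entries.

    Assumption: entries are applied in order, so a later entry that
    points at the same cell takes precedence over an earlier one.
    """
    data = list(raw_cells)
    for pointer, replacement in entries:
        data[pointer] = replacement
    return data


raw = [0] * LINE_BITS
raw[5] = 1                       # cell 5 is stuck at 1, but 0 was intended
corrected = ecp_read(raw, [(5, 0)])
print(corrected[5])              # 0
print(ecp_overhead_bits(6))      # 61 bits for six entries
```

Six entries cost 61 bits per 512-bit line, roughly 11.9%, which is slightly below the 12.5% of SECDED.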
Error Correction Pointers ¨ Figure: a 512-bit line (data cells 511 to 0) with one correction entry, made of a 9-bit correction pointer (bits 8 to 0) and a replacement cell (R), plus a "Full?" bit [ISCA’10]
Error Correction Pointers ¨ Figure: a 512-bit line (data cells 511 to 0) with multiple correction entries (5 to 0), each a 9-bit pointer plus a replacement cell (R), and a "Full?" field [ISCA’10]
Error Correction Pointers ¨ Figure: a 512-bit line with two correction entries in use, each a 9-bit pointer plus a replacement cell (R) ¨ What if a correction entry fails? [ISCA’10]
Stuck-At-Fault Error Recovery ¨ Observation: a failed cell with a stuck-at value is still readable ¨ Goal: either write the word or its flipped version so that the failed bit is made to store the stuck-at value ¨ For multi-bit errors, the line can be partitioned such that each partition has a single error ¨ Errors are detected by verifying a write; recently failed bit locations are cached so multiple writes can be avoided [MICRO’10]
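For a single stuck cell, the write-or-flip idea can be sketched as below. The sketch assumes an 8-bit word, one known stuck cell, and a separate inversion flag stored alongside the word; the names are illustrative, and SAFER's partitioning for multiple faults is not modeled.

```python
def safer_write(word: int, stuck_pos: int, stuck_val: int, width: int = 8):
    """Choose the word or its complement so the stuck cell already holds the right bit.

    Returns (value to store, invert flag).
    """
    mask = (1 << width) - 1
    word &= mask
    wanted = (word >> stuck_pos) & 1
    invert = wanted != stuck_val
    return ((word ^ mask) if invert else word), invert


def program_cells(value: int, stuck_pos: int, stuck_val: int) -> int:
    """Model the array: the stuck cell ignores the write and keeps stuck_val."""
    return (value & ~(1 << stuck_pos)) | (stuck_val << stuck_pos)


def safer_read(stored: int, invert: bool, width: int = 8) -> int:
    """Undo the inversion, if any, to recover the original word."""
    mask = (1 << width) - 1
    return (stored ^ mask) if invert else stored


word = 0b10110010
stored, inv = safer_write(word, stuck_pos=1, stuck_val=0)
cells = program_cells(stored, stuck_pos=1, stuck_val=0)
print(inv, safer_read(cells, inv) == word)  # True True
```

Bit 1 of the word is 1 but the cell is stuck at 0, so the complement is stored and the flag records the inversion.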
Stuck-At-Fault Error Recovery ¨ Three partition candidates in SAFER ¤ How to detect two failures? (read the paper) [MICRO’10]
Stuck-At-Fault Error Recovery ¨ Fail recovery [MICRO’10]
Multi-tiered ECC for Hard/Soft Errors ¨ FREE-p: fine-grained remapping with ECC and embedded pointer ¤ Re-use a “dead” 64B block for storing a remap pointer ¤ Architectural techniques to accelerate address remapping ¨ Detection/correction at the memory controller ¤ Allow simple NVRAM devices ¤ Tolerate hard/soft errors in the cell array, periphery, etc. [HPCA’11]
FREE-p ¨ Embed a 64-bit pointer within a faulty block ¤ There are still-functional bits in a faulty block ¤ 1-bit D/P flag per 64B block: identifies whether a block holds data or a remap pointer ¤ Avoid chained remapping: always embed the FINAL pointer [HPCA’11]
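A minimal sketch of the D/P flag and the final-pointer rule. The class `FreePSketch`, its spare-block list, and the assumption that the failed block's data has already been recovered (for example by ECC) are all hypothetical; the real design involves the memory controller and operating system.

```python
class FreePSketch:
    """Toy model of block remapping with an embedded pointer and a D/P flag."""

    def __init__(self, spares):
        self.flag = {}      # block address -> "D" (data) or "P" (pointer); default "D"
        self.content = {}   # block address -> payload, or the embedded pointer
        self.spares = list(spares)  # hypothetical pool of free blocks

    def _resolve(self, addr):
        # A single hop is enough because the home block always holds the final pointer.
        if self.flag.get(addr, "D") == "P":
            return self.content[addr]
        return addr

    def write(self, addr, payload):
        self.content[self._resolve(addr)] = payload

    def read(self, addr):
        return self.content.get(self._resolve(addr))

    def block_failed(self, addr):
        """Move addr's data to a spare and store the FINAL pointer in the home block."""
        old = self._resolve(addr)
        new = self.spares.pop(0)
        self.content[new] = self.content.get(old)  # assumes the data was recoverable
        self.flag[addr] = "P"
        self.content[addr] = new                   # overwrite, never chain


mem = FreePSketch(spares=[100, 101])
mem.write(7, "payload")
mem.block_failed(7)       # block 7 dies; data moves to block 100
mem.block_failed(7)       # block 100 dies too; block 7 now points straight to 101
print(mem.content[7], mem.read(7))  # 101 payload
```

After the second failure the home block points directly at block 101, so a read never follows more than one pointer.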
Capacity vs. Lifetime [HPCA’11]
Resistive Computation ¨ Leverage STT-MRAM for energy efficiency ¤ Near-zero leakage power ¤ Low-energy read operation ¨ Goal: selectively migrate on-chip storage and combinational logic to STT-MRAM to reduce power ¤ On-chip storage: caches, TLBs, register files, queues ¤ Combinational logic: lookup-table (LUT) based computing [ISCA’10]
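LUT-based computing replaces a block of combinational logic with a table indexed by its inputs. The sketch below builds 3-input lookup tables for a full adder; the choice of a full adder and the helper names are illustrative, not taken from the paper.

```python
from itertools import product


def build_lut(fn, n_inputs: int) -> list:
    """Precompute fn for every input combination; the first input is the index MSB."""
    return [fn(*bits) for bits in product((0, 1), repeat=n_inputs)]


def lut_eval(lut: list, *bits: int) -> int:
    """Evaluate the logic function with one table read instead of gates."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return lut[index]


sum_lut = build_lut(lambda a, b, c: a ^ b ^ c, 3)
carry_lut = build_lut(lambda a, b, c: (a & b) | (c & (a ^ b)), 3)

print(lut_eval(sum_lut, 1, 1, 0), lut_eval(carry_lut, 1, 1, 0))  # 0 1
```

Each table has 2^3 = 8 one-bit entries; the table contents are written once and only read afterwards, which fits a memory with cheap reads and costly writes.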
Hybrid CMT Pipeline ¨ Small arrays and simple logic in CMOS ¨ Large arrays and complex logic in STT-MRAM ¨ Figure: pipeline diagram with each block marked as Pure CMOS, STT-MRAM LUTs, or STT-MRAM Arrays (blocks include I$, I-TLB, Reg File, Func Unit, ALU, FPU, ST Buf, D$, D-TLB, Shared L2$ Banks, and MC 0 to MC 3 queues and logic) [ISCA’10]
System Power ¨ Figure: system power chart [ISCA’10]
System Performance ¨ Figure: system throughput normalized to CMOS (y-axis 0 to 1), per benchmark [ISCA’10]