
Emerging NVM Enabled Storage Architecture: From Evolution to Revolution



  1. Emerging NVM Enabled Storage Architecture: From Evolution to Revolution
     Yiran Chen, Electrical and Computer Engineering, University of Pittsburgh
     Sponsors: NSF, DARPA, AFRL, and HP Labs

  2. Outline
     • Introduction
     • Evolution with eNVM:
       – On-chip high-speed storage;
       – Off-chip secondary storage;
     • Revolution with eNVM:
       – Memristor-based neuromorphic accelerator
     • Conclusion

  3. Conventional Memory Scaling
     Intrinsic difficulty of charge-based computing and storage!
     [Figure: DRAM density (Mb/chip), data rate (Mbps) across EDO, SDRAM (133), DDR1 (200-400), DDR2 (400-800) and DDR3 (800-1600), capacitor aspect ratio (A/R) and T_OX versus year (1990-2010) and technology node; Burj Khalifa (A/R=6) shown for comparison.]
     Roadmap (M, P, A, G, C, V rows as labeled on the slide):
     – 2012-2013 (38nm-32nm): M: Stacked MIM; P: Planar; A: 6F², bWL; G: poly/SiO2; C: Si; V: 1.35V
     – 2014-2015 (29nm-22nm): M: Stacked MIM; P: Planar, HKMG; A: 6F², bWL; G: HKMG; C: Si; V: 1.2V
     – 2016-2017 (22nm-16nm): M: Stacked MIM; P: Planar; A: 6F², bBL, LBL, 1T1C (VFET); G: HKMG; C: Si; V: 1.1V
     – 2018-2019 (16nm-14nm): M: FBRAM, STT-RAM, RRAM, PCRAM; P: Planar; A: 4F², 1T, 1T1R, 1TMTJ (VFET); G: HKMG; C: Si; V: ~1V
     Sources: ASML, ITRS, IMEC, Hynix, IBM

  4. Emerging Nonvolatile Memory

  5. Memory Technologies Comparison
     Technologies (columns): SRAM, DRAM, NAND FLASH, STT-RAM, PCRAM, ReRAM
     – Data Retention: N, 4 ms, 10 y, >10 y, >10 y, >10 y
     – Memory Cell (F²): 6 120-140 7-9 4 8 4 <1
     – Read Time: 0.2 ns, 2 ns, 0.1 ms, 5-10 ns, 12 ns, 5-10 ns
     – Write/Erase Time: 70 ps, 1 ns, 1/0.1 ms, <10 ns, <50 ns, <10 ns
     – Number of Rewrites: 10^16, 10^16, 10^5, 10^15, 10^8, 10^15
     – Power Consumption, Read/Write: Low Low High Low Low High Low Low Low
     – Power Consumption other than R/W: Leakage Current, Refresh Power, None, None, None, None
     Source: ITRS ERD workshop presentation by Prof. Y. Chen

  6. Challenges:
     • Identifying the evolutionary applications that can
       – Be easily and seamlessly integrated into the current memory hierarchy and computing platform;
       – Fully leverage the advantages of emerging NVM;
       – Not be easily replaced by an alternative technology or architecture.
     • Inventing a revolutionary computing and storage architecture that can
       – Offer a high-performance, power-efficient, and scalable computing model;
       – Provide a truly seamless integration between computing and memory.

  7. Outline
     • Introduction
     • Evolution with eNVM:
       – On-chip high-speed storage;
         • STT-RAM based 3D cache for CPU.
         • Racetrack based register file for GPU.
       – Off-chip secondary storage;
     • Revolution with eNVM:
       – Memristor-based neuromorphic accelerator.
     • Conclusion

  8. STT-RAM based 3D cache
     Spin-Transfer Torque Random Access Memory: a scalable technology
     [Figure: 1T-1MTJ STT-RAM schematic. Labels: bit-line, word-line, source-line; magnetic tunneling junction (MTJ) with free layer, MgO layer and reference layer; writing '1' and writing '0'.]

  9. SRAM vs. MRAM (STT-RAM)
     Values are listed as SRAM / MRAM:
     – Area (65nm): 3.66 mm² / 3.30 mm²
     – Capacity/Bank: 128KB / 512KB
     – Read latency: 2.25 ns / 2.32 ns
     – Write latency: 2.26 ns / 11.02 ns
     – Read energy: 0.90 nJ / 0.86 nJ
     – Write energy: 0.80 nJ / 5.00 nJ
     Cache configurations and leakage power:
     – 2MB (16x128KB) SRAM cache: 2.09 W
     – 8MB (16x512KB) MRAM cache: 0.26 W
     • Pros: Low leakage power, high density.
     • Cons: Long write latency and large write power.
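The trade-off in the table above can be made concrete with a little arithmetic. The sketch below only recombines the figures quoted on the slide; it assumes the area figure applies to one bank, which the slide does not state explicitly, and the dictionary keys and helper names are illustrative.

```python
# Figures copied from the SRAM vs. MRAM slide (65nm).
# Assumption: the quoted area is for a single bank.
sram = {"area_mm2": 3.66, "bank_kb": 128, "write_ns": 2.26,
        "write_nj": 0.80, "cache_mb": 2, "leak_w": 2.09}
mram = {"area_mm2": 3.30, "bank_kb": 512, "write_ns": 11.02,
        "write_nj": 5.00, "cache_mb": 8, "leak_w": 0.26}


def density_kb_per_mm2(m):
    """Bank capacity per unit area."""
    return m["bank_kb"] / m["area_mm2"]


def leakage_w_per_mb(m):
    """Leakage power of the whole cache, normalized by its capacity."""
    return m["leak_w"] / m["cache_mb"]


# Pros: how much denser and how much less leaky MRAM is.
density_gain = density_kb_per_mm2(mram) / density_kb_per_mm2(sram)
leakage_reduction = leakage_w_per_mb(sram) / leakage_w_per_mb(mram)

# Cons: how much slower and more expensive an MRAM write is.
write_latency_penalty = mram["write_ns"] / sram["write_ns"]
write_energy_penalty = mram["write_nj"] / sram["write_nj"]

print(f"density gain:          {density_gain:.2f}x")
print(f"leakage per MB:        {leakage_reduction:.2f}x lower")
print(f"write latency penalty: {write_latency_penalty:.2f}x")
print(f"write energy penalty:  {write_energy_penalty:.2f}x")
```

The ratios show why the following slides focus on hiding write latency rather than on read performance: reads are nearly identical, while writes are several times slower and costlier.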

  10. STT-RAM based 3D cache
      • Baseline 3D Architecture
        – Core Layer + Cache Layers.
        – NUCA caches with NOC connections.
      [Figure: two-layer 3D stack. Labels: Layer 1, Layer 2, core, cache controller, cache banks with routers (R), horizontal hop, vertical hop, TSV, data migration.]
      G. Sun, X. Dong, Y. Xie, J. Li, Y. Chen, HPCA, 2009.

  11. STT-RAM based 3D cache
      • Challenges: long write latency of STT-RAM.
      • Solution 1 (S1): Read-Preemptive Write Buffer.
      [Figure: cores issue read and write requests to the STT-RAM caches; write requests pass through a write buffer (FIFO). Two cases are annotated for a read arriving during a write: "Write is almost done." and "Write just begins."]
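The two annotated cases suggest a policy in which a read preempts a write that has just begun but waits for one that is almost done. The following is a minimal sketch of such a policy, not the authors' implementation: the `PREEMPT_THRESHOLD` value, the `Bank` class and its method names are hypothetical, and the latencies simply reuse the MRAM figures quoted on the earlier comparison slide.

```python
from collections import deque
from typing import Optional

WRITE_LATENCY = 11.02     # ns, MRAM write latency from the comparison slide
READ_LATENCY = 2.32       # ns, MRAM read latency from the comparison slide
PREEMPT_THRESHOLD = 0.5   # assumption: preempt only writes less than half done


class Bank:
    """One STT-RAM cache bank fronted by a FIFO write buffer (hypothetical model)."""

    def __init__(self) -> None:
        self.write_buffer: deque = deque()       # pending write addresses, FIFO order
        self.current_write: Optional[int] = None
        self.write_start = 0.0

    def enqueue_write(self, addr: int) -> None:
        self.write_buffer.append(addr)

    def start_next_write(self, now: float) -> None:
        """Begin the oldest buffered write if the bank is idle."""
        if self.current_write is None and self.write_buffer:
            self.current_write = self.write_buffer.popleft()
            self.write_start = now

    def read(self, now: float) -> float:
        """Return the time at which the read data is available."""
        if self.current_write is None:
            return now + READ_LATENCY
        progress = (now - self.write_start) / WRITE_LATENCY
        if progress >= 1.0:
            # The write had already completed before the read arrived.
            self.current_write = None
            return now + READ_LATENCY
        if progress < PREEMPT_THRESHOLD:
            # "Write just begins": cancel it and retry later, keeping FIFO order.
            self.write_buffer.appendleft(self.current_write)
            self.current_write = None
            return now + READ_LATENCY
        # "Write is almost done": let it finish, then serve the read.
        finish = self.write_start + WRITE_LATENCY
        self.current_write = None
        return finish + READ_LATENCY


early = Bank()
early.enqueue_write(0x40)
early.start_next_write(now=0.0)
early_done = early.read(now=1.0)    # arrives when the write has barely started

late = Bank()
late.enqueue_write(0x40)
late.start_next_write(now=0.0)
late_done = late.read(now=10.0)     # arrives when the write is nearly complete
```

In this sketch a preempted write returns to the head of the buffer, so the cost of preemption is the repeated write work; the threshold trades that wasted work against read latency.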

  12. STT-RAM based 3D cache
      • Solution 2 (S2): SRAM-MRAM Hybrid L2 Cache
      [Figure: cores connected through TSVs to MRAM banks and SRAM banks; cache organized as 31-way STT-RAM plus 1-way SRAM, compared with 32-way STT-RAM.]

  13. STT-RAM based 3D cache
      • Result (S1 & S2):
        – Performance is improved by 4.91% compared with the STT-RAM baseline.
        – Power consumption is reduced by 73.5%.
      [Figure: IPC and power for 2M-SRAM-DNUCA, 8M-MRAM-DNUCA and 8M Hybrid DNUCA.]

  14. Outline
      • Introduction
      • Evolution with eNVM:
        – On-chip high-speed storage;
          • STT-RAM based 3D cache for CPU.
          • Racetrack based register file for GPU.
        – Off-chip secondary storage;
      • Revolution with eNVM:
        – Memristor-based neuromorphic accelerator.
      • Conclusion

  15. Racetrack for GPU
      • Racetrack
      • Racetrack cell:
        – Two fixed pinning regions: free region and fixed region
        – Racetrack: magnetic track
        – Inject current to move cell
        – Access port
      [Figure: racetrack cell. Labels: BL, WWL, RWL, SL; reference layer, free layer, pinning layers; write '0', write '1', read.]

  16. Racetrack for GPU
      • Benefits from Racetrack:
        – Extremely small cell size;
      • Major challenges:
        – Delay and energy caused by shifting.
      • Warp register remapping (WRR)
        – 60.0% of the RF is allocated during execution
        – Non-optimal warp register mapping: max shift distance of 8 cells
        – WRR interleaves the warp registers across the access ports: max shift distance of 4 cells
      [Figure: racetrack register file. Labels: Warp 0, arbitrator, shift controller, write/read/shifter driver, row decoder, WWL, RWL, BL, SL, column mux, sense amplifier arrays.]
      M. Mao, W. Wen, Y. Zhang, Y. Chen, H. Li, DAC 2014

  17. Racetrack for GPU
      • Write buffer
        – "Piggyback-write" to write back to the RF from the write buffer;
        – Rely on the track movement triggered by the read requests;
        – Positive side effect: filter the redundant RF R/W by leveraging RAW and WAW.
      [Figure: numbered steps 1-9 of the write-buffer data path; output labeled "To EXE/MEM".]
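The write-buffer behavior listed above can be sketched as follows. This is an illustration under stated assumptions, not the design from the DAC 2014 paper: the class name, the `reg % num_tracks` mapping, and the rule that a read on a track drains every buffered write for that same track are all hypothetical simplifications of "piggybacking" on read-triggered track movement.

```python
class PiggybackWriteBuffer:
    """Hypothetical model of a write buffer in front of a racetrack register file."""

    def __init__(self, num_tracks: int) -> None:
        self.num_tracks = num_tracks
        self.pending = {}    # register id -> latest buffered value
        self.rf = {}         # register id -> value stored in the racetrack RF
        self.rf_reads = 0    # RF accesses that actually shift a track
        self.rf_writes = 0   # write-backs that reach the RF

    def track_of(self, reg: int) -> int:
        # Assumption: registers are striped across tracks by simple modulo.
        return reg % self.num_tracks

    def write(self, reg: int, value: int) -> None:
        # WAW filtering: a newer write to the same register replaces the
        # buffered one, so only the last value is ever written back.
        self.pending[reg] = value

    def read(self, reg: int) -> int:
        # RAW filtering: a read of a buffered register is served from the
        # buffer and never touches the RF.
        if reg in self.pending:
            return self.pending[reg]
        self.rf_reads += 1
        track = self.track_of(reg)
        # Piggyback-write: the read moves this track anyway, so drain the
        # buffered writes that target the same track.
        for r in [r for r in self.pending if self.track_of(r) == track]:
            self.rf[r] = self.pending.pop(r)
            self.rf_writes += 1
        return self.rf.get(reg, 0)


buf = PiggybackWriteBuffer(num_tracks=4)
buf.write(1, 10)
buf.write(1, 11)               # WAW: replaces the buffered 10
forwarded = buf.read(1)        # RAW: served from the buffer
buf.write(5, 7)                # register 5 shares track 1 with register 1
missed = buf.read(9)           # RF read on track 1 drains registers 1 and 5
```

After the last read, two writes have reached the RF even though three were issued, and only one of the two reads needed a track shift.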

  18. Racetrack for GPU
      • Experiment results:
        – Baseline: SRAM-based register files.
        – Energy reduction: 59%.
        – Performance improvement: 4%.

  19. Outline
      • Introduction
      • Evolution with eNVM:
        – On-chip high-speed storage;
        – Secondary storage;
          • PCRAM and NAND hybrid SSD;
      • Revolution with eNVM:
        – Memristor-based neuromorphic accelerator.
      • Conclusion

  20. Hybrid SSD
      • Memory hierarchy
        – On-chip memory: 1~30 cycles
        – Off-chip memory: 100~300 cycles
        – Solid State Disk (Flash): 25K~2M cycles
      [Figure: an erase unit of pages PN=0 to PN=n, each marked V. Labels: page mode, random access, erase-before-write (EBW), in-place update (IPU).]
      Courtesy: Al Fazio (Intel)

  21. PRAM (PCM) Cell
      • One transistor/diode and one GST (GeSbTe) element.
      • In-place updating (IPU)
      [Figure: two cell cross-sections, one with amorphous GST (high resistance: '0') and one with crystalline GST (low resistance: '1'). Labels: top electrode, GST, heater, bottom electrode, +N, substrate.]

  22. Hybrid SSD
      • Conventional SSD: FLASH.
      • Promising candidate: PRAM (phase change).
      • To combine the benefits of both technologies:
        – Hybrid SSD.
      • Two uses:
        – Performance;
        – Reliability.

  23. Hybrid SSD: performance enhancement
      [Figure: Erase Unit 1, Erase Unit 2 and Erase Unit 3, each a list of pages PN=0 to PN=n marked V or I. Labels: Merge Operation (time consuming), (Empty Pages). PN=Page Number; V=Valid; I=Invalid.]
      Erase Unit = 128/256KB, Page = 512 Bytes ~ 8KB
      G. Sun, Y. Joo, Y. Chen, Y. Xie, Y. Chen, H. Li, HPCA, 2010.

  24. Hybrid SSD: performance enhancement
      Hybrid Architecture
      [Figure: physical view and structural view of an erase unit. The data region resides in NAND flash; the log region resides in PRAM and supports in-place updating. Other labels: data buffer, memory, sector (512 Bytes).]
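A minimal sketch of how such a split might behave, assuming updates are absorbed by the PRAM log and only merged into NAND when the log fills. The class, its fields, and the merge trigger are hypothetical; the slide itself only states which region lives in which technology.

```python
class HybridEraseUnit:
    """Hypothetical model: NAND data region plus a PRAM log region."""

    def __init__(self, pages, log_capacity: int) -> None:
        self.data = list(pages)        # NAND data region (erase-before-write)
        self.log = {}                  # PRAM log region: page number -> latest payload
        self.log_capacity = log_capacity
        self.erase_count = 0           # NAND erases caused by merges

    def write(self, pn: int, payload) -> None:
        # A page already in the log is updated in place in PRAM; a new page
        # needs a free log entry, so a full log forces a merge first.
        if pn not in self.log and len(self.log) >= self.log_capacity:
            self.merge()
        self.log[pn] = payload

    def read(self, pn: int):
        # The log always holds the newest copy of a page, if it holds one at all.
        return self.log.get(pn, self.data[pn])

    def merge(self) -> None:
        # Fold logged updates into the data region: one erase of the unit,
        # followed by reprogramming its pages.
        for pn, payload in self.log.items():
            self.data[pn] = payload
        self.log.clear()
        self.erase_count += 1


unit = HybridEraseUnit(["a", "b", "c", "d"], log_capacity=2)
unit.write(0, "x")
unit.write(0, "y")     # in-place update in PRAM, no NAND erase
unit.write(1, "z")
unit.write(2, "w")     # log is full, so a merge runs before this write
```

Repeated updates to a hot page cost no NAND erase in this model, which is the intuition behind using the hybrid for both performance and reliability.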

  25. Different Log Assignments
      • Static log assignment: fixed assignment of a log region to each erase unit's data region.
      • Group log assignment: organize log pages in groups.
      • Dynamic log assignment: dynamic assignment of the log region.

  26. Hybrid SSD: performance enhancement

  27. Outline
      • Introduction
      • Evolution with eNVM:
        – On-chip high-speed storage;
        – Secondary storage;
      • Revolution with eNVM:
        – Memristor-based neuromorphic accelerator.
      • Conclusion

  28. Computing: Present and Future
      [Figure: power density (mW/mm²) and clock frequency (MHz) versus year, 1990-2010. Annotations: hot plate, rocket launch, nuclear reactor, multi-core, neural network.]
      • New trend: multi-core, advanced power management, large on-chip storage.
      • Future: heterogeneous system, brain-like computing.
      Source: CPU DB, Intel

  29. Brain: The Most Efficient Computing Machine
      • Brain: extremely complex organ; 15-30B neurons; 4 km/mm³; 35 W.
      • Neocortex: gray matter (6 layers) and white matter; signals travel within and between layers.
      • Neuron: processes signals from other neurons.
      • Synapse: memory; weights signals.
      • Neural network
