normally off computing for smart systems



  1. "Normally off" computing for smart systems. Cache and main memory architecture based on MRAM: application to High Performance Computing and Embedded Systems
     Lionel Torres, Univ. Montpellier, France
     Thanks to: G. Sassatelli, A. Gamatie, P. Benoit, P. Nouet, D. Novo, G. Dinatale, A. Todri, A. Virazel, L. Latorre, M. Robert, G. Patrigeon, P. Y. Peneau, J. Modad, F. Ouattara, J. Lopes, O. Coi, K. Sevin
     RTNS 2018

  2. General context: current IC integration challenges
     ● Energy is critical
     ● Applications need more and more performance
     ● Current technology (CMOS) limitations: integration is more and more complex, 10^9 transistors/cm²
     ● Reliability is a problem today: X% of systems encounter an uncorrectable error per year (X ranging from 1 to 5%)
     [Figure: power density (W/cm²) versus time, 1950 to 2010, for bipolar and CMOS machines (IBM 360 through Pentium 4), shown alongside a technology life cycle (embryonic, growing, mature, aging). Source: Bernie Meyerson, IBM]

  3. General context
     Technology target: CMOS < 20 nm
     ● To transport 1 bit → 1 pJ/mm
     ● To transport 10^9 bits in 1 s (1 GHz) → 1 pJ/mm × 10^9 = 1 mW/mm
     ● 64-bit bus → 64 mW/mm
     ● On a real IC → several W/cm²
     ● Computation, one bit transition → 1 aJ
     ● Computation, 10^9 bit transitions in 1 s → 1 aJ × 10^9 = 1 nW
     → It is better to "compute" than to "transport" information
     → In-memory computing is certainly interesting
     → Reminder: minimal energy to change 1 bit of information: $k_B T \ln 2$ → 2.85 zJ
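The arithmetic on this slide can be checked with a short script. This is a minimal sketch: the per-bit energies (1 pJ/mm for transport, 1 aJ per bit transition) and the 10^9 bits/s rate come from the slide, while the function names and the 298 K temperature used for the $k_B T \ln 2$ bound are assumptions made for illustration.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K, exact SI value


def power_watts(energy_per_bit_j, bits_per_second):
    """Average power when every bit costs a fixed energy."""
    return energy_per_bit_j * bits_per_second


def landauer_limit_j(temperature_k):
    """Minimal energy k_B * T * ln(2) to change one bit of information."""
    return K_BOLTZMANN * temperature_k * math.log(2)


RATE = 1e9  # bits per second (1 GHz), from the slide

transport_w_per_mm = power_watts(1e-12, RATE)  # 1 pJ/mm per bit
bus_w_per_mm = 64 * transport_w_per_mm         # 64-bit bus
compute_w = power_watts(1e-18, RATE)           # 1 aJ per bit transition
landauer_j = landauer_limit_j(298.0)           # assumed temperature: 298 K

print(f"transport: {transport_w_per_mm * 1e3:.1f} mW/mm")
print(f"64-bit bus: {bus_w_per_mm * 1e3:.1f} mW/mm")
print(f"compute: {compute_w * 1e9:.1f} nW")
print(f"k_B T ln 2 at 298 K: {landauer_j * 1e21:.2f} zJ")
```

Under these figures, moving a bit across one millimetre costs about a million times more than flipping it, which is the motivation the slide gives for computing close to the memory.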

  4. One challenge: the memory
     ● Today, 50% of the silicon area of an IC is memory
     ● Watch out for (static) energy!
     [Figure: chip floorplan showing L1, L2 and L3 cache blocks]

  5. Technology evolution
     Current memories:
     ● SRAM for fast access
     ● DRAM for applications
     ● Flash (mass storage)
     ● …
     Emerging memories (resistance switching memory):
     ● Magnetic tunneling junctions
     ● Phase change memory
     ● Programmable metallization cells
     ● OxRRAM
     ● …
     Universal memory ("non-volatile memory"):
     • SRAM performance
     • Size of DRAM/Flash
     • Non-volatility
     • Scalability
     Emerging memories offer non-volatility, speed and endurance => disruption of the memory hierarchy?

  6. Spin Technology
     Conductance of magnetic metal plates is larger in the presence of a magnetic field perpendicular to the current flow (William Thomson, 1824-1907)
     Currently known as Anisotropic Magnetoresistance (AMR)
     Resistance variation attained: 2% to 5% at room temperature (RT)

  7. Spin Technology
     Peter Grünberg and Albert Fert, 2007 Nobel Prize in Physics
     ● Thin stacks of ferromagnetic/non-magnetic (FM/NM) metals show a conductance increase of up to 100% when subjected to a magnetic field
     G. Binasch et al., 1989; M. N. Baibich et al., 1988

  8. Spin Technology: tunnel magnetoresistance
     $R(\theta) = R_P + \Delta R \, \frac{1 - \cos\theta}{2}$
     $\theta = 0 \rightarrow R = R_P$
     $\theta = 180^\circ \rightarrow R = R_P + \Delta R = R_{AP}$
     $\mathrm{TMR}\,\% = \frac{\Delta R}{R_P} = \frac{R_{AP} - R_P}{R_P}$
     Typical TMR is between 150% and 250% (or more)
     M. Bowen et al. Nearly total spin polarization…
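The angular dependence of the junction resistance can be sketched in a few lines, assuming it reads $R(\theta) = R_P + \Delta R \, (1 - \cos\theta)/2$ with $\mathrm{TMR} = (R_{AP} - R_P)/R_P$. The function names and the resistance values below are hypothetical, chosen only so that the TMR lands in the 150% to 250% range quoted on the slide.

```python
import math


def mtj_resistance(theta_rad, r_p, delta_r):
    """Resistance of a magnetic tunnel junction for a relative angle theta
    between the two magnetic layers (0 = parallel, pi = antiparallel)."""
    return r_p + delta_r * (1.0 - math.cos(theta_rad)) / 2.0


def tmr_ratio(r_p, r_ap):
    """Tunnel magnetoresistance ratio, (R_AP - R_P) / R_P."""
    return (r_ap - r_p) / r_p


# Hypothetical values, not taken from the slides.
R_P = 1000.0      # ohms, parallel state
DELTA_R = 2000.0  # ohms, resistance swing

r_parallel = mtj_resistance(0.0, R_P, DELTA_R)
r_antiparallel = mtj_resistance(math.pi, R_P, DELTA_R)
tmr = tmr_ratio(r_parallel, r_antiparallel)

print(f"R_P = {r_parallel:.0f} ohm, R_AP = {r_antiparallel:.0f} ohm, TMR = {tmr:.0%}")
```

The two stable states (parallel and antiparallel) are what a memory cell reads back as 0 and 1; a larger TMR gives a wider sensing margin.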

  9. Spin Technology
     ● Compatible with CMOS
     ● Non-volatile memory
     ● Switching time < 1 ns
     ● Writing current: 10 µA to 100 µA
     ● Density ×4 vs SRAM
     ● Immune to radiation
     Samsung demonstrator (8 Mbit STT-MRAM), 2016

  10. Motivations
     • One way forward: move towards non-volatile systems using emerging NVMs (MRAM, PCRAM, FeRAM, ReRAM)
     • Current NVM issues: speed, dynamic energy, reliability …
     [Figure: two system diagrams with CPU, GPU, caches and (e)FPGA on a high-performance bus. The first uses SRAM caches, embedded Flash, and controllers for external Flash and external DRAM; the second uses NV caches, on-chip non-volatile MRAM, and controllers for external MRAM]
     Where and how to place MRAM to:
     • reduce total power consumption?
     • keep the same performance or improve it?

  11. Contributions
     1. Evaluation of MRAM-based cache memory hierarchy:
        • Exploration flow and extraction of memory activity
        • L1 and L2 caches based on STT-MRAM and TAS-MRAM
     2. Non-volatile computing:
        • Instant-on/off capability for embedded processor
        • Analysis and validation of Rollback mechanism

  12. MRAM applied to cache
     Possible studies:
     ● Performance comparison
     ● New architectures
     ● 3D-stacking capability of MRAM
     ● Non-volatile cache
     Trade-off axes: energy, speed, area. Potential applications?
     Cache: SRAM vs MRAM
     ● Benefits of MRAM: low leakage, high density, non-volatility
     ● Hybrid SRAM/MRAM cache: a fast part (SRAM) and a slow part (MRAM)
     Take advantage of MRAM: low leakage, high density, non-volatility
     Mitigate drawbacks of MRAM: high write latency, high write energy
     [Figure: on-chip system with CPU, SRAM/NV cache, high-performance bus and DDR controller to external DRAM; 3D stack of NV memory on a logic layer]

  13. MRAM applied to cache: NVM exploration flow
     1. Define the architecture
        § Single/multi-core
     2. Explore MRAM-based cache configurations
        § L1, L2, L3, hybrid…
     3. Extract useful information
        § Runtime, cache energy, cache transactions…
     Flow:
     ● Architecture level: benchmarks run on the gem5* full-system simulator, giving execution time, number of reads/writes, number of hits/misses
     ● Circuit level: NVM memory array modeling (NVSim**, SPICE…) and a memory prototype, giving latency, access energy, static power
     ● Result: total L1/L2 energy consumption
     * N. Binkert et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, Aug. 2011.
     ** X. Dong et al., "NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Jul. 2012.
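The way the two levels of the flow meet can be sketched as follows. This is an illustrative assumption about how activity counts and circuit-level figures combine into a total cache energy (dynamic energy per access plus leakage over the run time); the slides do not give the exact formula, and the class names, field names and numbers here are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheModel:
    """Circuit-level figures, e.g. as produced by a tool such as NVSim."""
    read_energy_nj: float
    write_energy_nj: float
    leakage_mw: float


@dataclass
class CacheActivity:
    """Architecture-level figures, e.g. as extracted from simulator statistics."""
    reads: int
    writes: int
    exec_time_s: float


def total_energy_mj(model, activity):
    """Dynamic energy (per-access cost times access count) plus static energy
    (leakage power times execution time), in millijoules."""
    dynamic_nj = (activity.reads * model.read_energy_nj
                  + activity.writes * model.write_energy_nj)
    dynamic_mj = dynamic_nj * 1e-6                        # nJ -> mJ
    static_mj = model.leakage_mw * activity.exec_time_s   # mW * s = mJ
    return dynamic_mj + static_mj


# Hypothetical inputs, for illustration only.
model = CacheModel(read_energy_nj=0.3, write_energy_nj=0.05, leakage_mw=100.0)
activity = CacheActivity(reads=10_000_000, writes=5_000_000, exec_time_s=0.5)
energy = total_energy_mj(model, activity)
print(f"total cache energy: {energy:.2f} mJ")
```

Keeping the circuit model and the activity trace in separate structures mirrors the flow: the same simulator run can be re-costed against several memory technologies without re-simulating, as long as latency differences are small enough not to change the activity itself.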

  14. MRAM applied to cache: experimental setup
     From single to multi-core architecture:
     ● ARMv7 ISA
     ● Private L1 instruction/data caches
     ● Shared L2 (additional levels of cache possible)
     ● Main memory: DDR3
     Technologies: SRAM 45 nm (baseline), STT-MRAM 45 nm, SRAM 120 nm (baseline), TAS-MRAM 130 nm
     [Figure: cores 1 to N, each with a private L1 I/D cache, sharing an L2 cache in front of DDR3 main memory]

  15. MRAM applied to cache: circuit-level analysis, models (NVSim) & prototype
     Area:
     Node    Technology  512 kB L2 (mm²)  32 kB L1 (mm²)
     45 nm   SRAM        1.36             0.091
     45 nm   STT-MRAM    0.82             0.117
     120 nm  SRAM        9.7              -
     120 nm  TAS-MRAM    11.7             -
     [Figure: area (mm², log scale from 0.01 to 100) of SRAM and STT-MRAM versus cache capacity, from 8 kB to 4 MB]
     ● MRAM is denser for large cache capacity
     ● MRAM cell size is smaller than that of SRAM
     ● MRAM needs large transistors for write
     ● TAS-MRAM cache is larger due to field lines

  16. MRAM applied to cache: circuit-level analysis, models (NVSim) & prototype
     512 kB L2 cache:
     Node    Technology  Read latency (ns)  Read energy (nJ)  Write latency (ns)  Write energy (nJ)  Standby leakage (mW)
     45 nm   SRAM        4.28               0.27              2.87                0.02               320
     45 nm   STT-MRAM    2.61               0.28              6.25                0.05               23
     120 nm  SRAM        5.95               1.05              4.14                0.08               82
     120 nm  TAS-MRAM    35                 4.62              35                  1.96               10
     Ratio annotations on the slide: /2.2, ≈, /14, 2.1, 2.5 for the 45 nm pair; /8 for the 120 nm pair
     Read: STT-MRAM ≈ SRAM, TAS-MRAM > SRAM. Write: MRAM > SRAM. Standby: MRAM << SRAM
     32 kB L1 cache:
     Node    Technology  Read latency (ns)  Read energy (nJ)  Write latency (ns)  Write energy (nJ)  Standby leakage (mW)
     45 nm   SRAM        1.25               0.024             1.05                0.006              22
     45 nm   STT-MRAM    1.94               0.095             5.94                0.04               3.3
     Ratio annotation on the slide: /7
     Read: MRAM > SRAM. Write: MRAM > SRAM. Standby: MRAM << SRAM
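The ratio annotations can be cross-checked against the table values. The sketch below assumes the column assignment read latency, read energy, write latency, write energy, standby leakage for the 512 kB L2 cache at 45 nm; the figures are copied from the slide, while the names `FIELDS`, `L2_512KB_45NM` and `ratios` are hypothetical.

```python
FIELDS = ("read_latency_ns", "read_energy_nj",
          "write_latency_ns", "write_energy_nj", "leakage_mw")

# 512 kB L2 cache at 45 nm, values as read from the slide's table.
L2_512KB_45NM = {
    "SRAM": (4.28, 0.27, 2.87, 0.02, 320.0),
    "STT-MRAM": (2.61, 0.28, 6.25, 0.05, 23.0),
}


def ratios(baseline, candidate):
    """Candidate-over-baseline ratio for every metric (below 1 means lower)."""
    return {name: c / b for name, b, c in zip(FIELDS, baseline, candidate)}


stt_vs_sram = ratios(L2_512KB_45NM["SRAM"], L2_512KB_45NM["STT-MRAM"])
for name, value in stt_vs_sram.items():
    print(f"{name}: x{value:.2f}")

leakage_reduction = 1.0 / stt_vs_sram["leakage_mw"]
print(f"leakage divided by {leakage_reduction:.1f}")
```

With this reading, leakage drops by a factor close to 14 and write energy rises by a factor of 2.5, which matches two of the annotations; the write penalty is the price paid for the leakage gain.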

  17. MRAM applied to cache: case study
     Quad-core architecture:
     ● Frequency: 1 GHz
     ● ARMv7 ISA
     ● Private L1 I/D: 32 kB per core
     ● Shared L2: 512 kB
     ● DDR3 main memory: 512 MB
     Benchmarks:
     ● SPLASH-2
       ╶ Mostly high performance computing
     ● PARSEC
       ╶ Animation, data mining, computer vision, media processing

  18. MRAM applied to cache: architecture-level analysis (gem5)
     Metrics: read/write ratio, L2/L1 access ratio, static/dynamic energy ratio
     Number of accesses:
     Benchmark  L1 cache                         L2 cache
     SPLASH-2   ~2 billion (0.5 billion/CPU)     ~26 million
     PARSEC     ~12 billion (3 billion/CPU)      ~16 million
     Static energy share of total cache energy: L2 → 90%, L1 → 80%

  19. MRAM-based L2: execution time
     [Figure: normalized execution time per benchmark for STT-MRAM L2 (45 nm) and TAS-MRAM L2 (130 nm), relative to the SRAM baseline]
     [Figure: cache miss rate (%) over execution time for barnes and ocean2]
     Observations:
     ● STT shows good performance
       ╶ L2 has a small impact on overall performance
     ● For TAS, 14% penalty on average (SPLASH-2)
       ╶ Depends on the application (cache miss rate, L1/L2 access ratio)

  20. MRAM-based L2: total L2 cache energy consumption
     [Figure: normalized L2 energy per benchmark for STT-MRAM L2 (45 nm) and TAS-MRAM L2 (130 nm), relative to the SRAM baseline]
     [Figure: read and write bandwidth (GB/s) over execution time for fluidanimate and radix, with markers at the end of each run]
     Observations:
     ● Up to 90% gain for STT
     ● From 40% to 90% gain for TAS
       ╶ Due to the very low leakage of MRAM-based cache
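A back-of-the-envelope estimate shows why low leakage dominates these gains. The sketch below is not the authors' method: it assumes that the baseline L2 energy is 90% static (the share quoted for L2 earlier in the deck), that leakage scales by the 45 nm STT-MRAM/SRAM ratio of 23 mW to 320 mW, and that dynamic energy and execution time stay unchanged. The function name and parameters are hypothetical.

```python
def normalized_cache_energy(static_share, leakage_ratio,
                            time_ratio=1.0, dynamic_ratio=1.0):
    """Cache energy relative to the baseline (baseline = 1.0).

    static_share  : fraction of baseline energy that is static (0..1)
    leakage_ratio : candidate leakage power / baseline leakage power
    time_ratio    : candidate execution time / baseline execution time
    dynamic_ratio : candidate dynamic energy / baseline dynamic energy
    """
    if not 0.0 <= static_share <= 1.0:
        raise ValueError("static_share must be between 0 and 1")
    dynamic = (1.0 - static_share) * dynamic_ratio
    static = static_share * leakage_ratio * time_ratio
    return dynamic + static


# Assumed inputs: 90% static share, leakage 23 mW vs 320 mW.
stt_l2 = normalized_cache_energy(static_share=0.90, leakage_ratio=23.0 / 320.0)
print(f"normalized L2 energy: {stt_l2:.2f}, gain: {1.0 - stt_l2:.0%}")
```

Under these simplifying assumptions the STT-MRAM L2 needs roughly one sixth of the baseline energy, which is in line with the "up to 90%" figure; the `time_ratio` and `dynamic_ratio` parameters are there to explore slower or more write-hungry technologies such as TAS-MRAM.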
