A Comprehensive Analytical Performance Model of DRAM Caches
Authors: Nagendra Gulur*, Mahesh Mehendale*, and R. Govindarajan+
Presented by: Sreepathi Pai§
*Texas Instruments  +Indian Institute of Science  §University of Texas, Austin
6th ACM/SPEC International Conference on Performance Engineering, 2015
Talk Outline
• Introduction to stacked DRAM Caches
• Background (An overview of ANATOMY§)
• ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
• Evaluation
• Insights
• Conclusions
§ ANATOMY: An Analytical Model of Memory System Performance (published at the 2014 ACM International Conference on Measurement and Modeling of Computer Systems)
Stacked DRAM
• DRAM vertically stacked over the processor die.
• Stacked DRAMs offer:
  – High bandwidth
  – High capacity
  – Moderately low latency
• Several proposals organize this large DRAM as a last-level cache.
[Picture courtesy Bryan Black, from the MICRO 2013 keynote]
Processor Organization with a DRAM Cache
[Figure: Cores 0..N, each with L1I/L1D caches, share an L2 that serves as the last-level SRAM cache (LLSC). A tag predictor (metadata on SRAM) steers LLSC misses: predicted hits go to the vertically stacked DRAM cache (metadata on DRAM); misses go through the memory controller to off-chip main memory.]
Talk Outline
• Introduction to stacked DRAM Caches
• Background (An overview of ANATOMY)
• ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
• Evaluation
• Insights
• Conclusions
Overview of a DRAM-based Memory
[Figure: The memory controller sends control and address signals to, and exchanges data with, a DIMM. A DIMM contains ranks; a rank contains devices; a device contains banks. Each DRAM bank is organized as rows and columns, with bank logic and a row buffer through which read and write operations are served.]
Basic DRAM Operations
• ACTIVATE: bring data from the DRAM core into the row buffer
• READ/WRITE: perform read/write operations on the contents of the row buffer
• PRECHARGE: store data back to the DRAM core (ACTIVATE discharges the cells' capacitors) and return the bitlines to a neutral voltage
[Timing diagram: three memory requests (Miss, Hit, Miss) produce the command stream PRE ACT RD | RD | PRE ACT RD.]
• Bank-Level Parallelism (BLP): parallelism across banks improves performance, though bank-switching delays can hurt it
• Row-buffer hits (RBH) are faster and consume less power
ANATOMY: An Analytical Model of Memory
Two components:
1) A queuing model of the memory system
   – Captures organizational and technological characteristics
   – Takes workload characteristics as input
2) Use of workload characteristics
   – Locality and parallelism in the workload's memory accesses
Analytical Model for Memory System Performance
The memory system is modeled as queuing servers in series: an address bus server, multiple M/D/1 bank servers (banks 1..N), and a data bus server. For an M/D/1 queue with arrival rate λ, service rate µ, and utilization ρ = λ/µ, the mean queuing delay is

  Q = ρ / (2µ(1 − ρ))

Service times:
• Address bus: (RBH·1 + (1 − RBH)·3) · BUS_CYCLE_TIME (a row-buffer hit needs one command; a miss needs three: PRE, ACT, RD)
• Bank: t_CL·RBH + (t_CL + t_PRE + t_RCD)·(1 − RBH)
• Data bus: Burst_Length · BUS_CYCLE_TIME

Latency = Q_addr + Q_bank + Q_data + 1/µ_addr + 1/µ_bank + 1/µ_data
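The slide's queuing model can be sketched in a few lines of Python. This is a minimal illustration: the formulas come from the slide, but the timing values, the uniform spread of requests across banks, and the parameter defaults are made-up assumptions for the example.

```python
def mdl_queue_delay(lam, mu):
    """Mean queuing delay Q = rho / (2*mu*(1 - rho)) for an M/D/1 queue."""
    rho = lam / mu
    assert rho < 1, "server must not be saturated"
    return rho / (2 * mu * (1 - rho))

def memory_latency(lam, rbh, burst_length=4, bus_cycle=1.25,
                   t_cl=15.0, t_pre=15.0, t_rcd=15.0, n_banks=8):
    """Average memory latency per the slide: queuing delays plus service
    times at the address bus, bank, and data bus servers.
    lam is in requests per ns; all timings are illustrative, in ns."""
    # Service times: a row-buffer hit needs 1 command, a miss needs 3 (PRE, ACT, RD)
    s_addr = (rbh * 1 + (1 - rbh) * 3) * bus_cycle
    s_bank = t_cl * rbh + (t_cl + t_pre + t_rcd) * (1 - rbh)
    s_data = burst_length * bus_cycle
    # Assume requests spread evenly across banks (bank-level parallelism)
    q_addr = mdl_queue_delay(lam, 1 / s_addr)
    q_bank = mdl_queue_delay(lam / n_banks, 1 / s_bank)
    q_data = mdl_queue_delay(lam, 1 / s_data)
    return q_addr + q_bank + q_data + s_addr + s_bank + s_data
```

For example, `memory_latency(lam=0.02, rbh=0.6)` is dominated by the bank service time (27 ns at 60% row-buffer hit rate) plus modest queuing delays.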
Validation: Model Accuracy
[Chart: % error (−12.5% to +12.5%) in Latency, RBH, and BLP estimates for experiments E1–E15 and their average.]
• Low errors in RBH, BLP, and latency estimation
  – Average errors of 3.9%, 4.2%, and 4%, respectively
• ANATOMY predicts trends accurately
Talk Outline
• Introduction to stacked DRAM Caches
• Background (An overview of ANATOMY)
• ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
• Evaluation
• Insights
• Conclusions
ANATOMY-Cache Model
[Figure: processor with stacked DRAM — the L2 (LLSC) feeds a tag predictor; hits go to the vertically stacked DRAM cache, misses through the memory controller to off-chip main memory.]
Key parameters that govern performance:
• Arrival rate
• Tag access time
• Cache hit rate
• Cache RBH
• Cache miss penalty
Extending ANATOMY to DRAM Caches
• Two ANATOMY instances: one for the DRAM cache (ANATOMY-Cache) and one for main memory (ANATOMY-Mem).
• The models are fed by the output of the tag server and by each other's outputs:
  – Predicted cache hits (sent to the cache)
  – Requests with no predictions (sent to the cache)
  – Line fills and writeback requests from main memory (sent to the cache)
  – Predicted misses (sent to main memory)
  – Misses, line fills, and writebacks from the cache (sent to main memory)
• We compute the latencies at the cache (L_Cache) and at memory (L_Mem) using ANATOMY.
Obtaining the Average LLSC Miss Penalty
• L_Cache and L_Mem are combined to estimate the average LLSC miss penalty.
• But first, we discuss the estimation of the key parameters that govern L_Cache and L_Mem.
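How L_Cache and L_Mem combine depends on the DRAM-cache hit rate. As a hedged first-order sketch — my own illustrative assumption, not necessarily the paper's exact formula:

```python
def avg_llsc_miss_penalty(l_cache, l_mem, h_cache):
    """Illustrative first-order combination (an assumption, not the paper's
    derivation): an LLSC miss that hits in the DRAM cache costs L_Cache;
    a DRAM-cache miss pays the cache look-up plus the memory latency L_Mem."""
    return h_cache * l_cache + (1 - h_cache) * (l_cache + l_mem)

# e.g. L_Cache = 50 ns, L_Mem = 200 ns, 80% DRAM-cache hit rate
print(round(avg_llsc_miss_penalty(50.0, 200.0, 0.8), 2))  # -> 90.0
```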
Estimating Key Parameters…
• Arrival rate
• Tag access time
• Cache hit rate
• Cache RBH
• Cache miss penalty
[Figure: processor with stacked DRAM, as before.]
Estimating the Cache Arrival Rate
• The arrival rate at the cache is the sum of several streams of accesses:
  – Predicted hits
  – No predictions
  – Line fills and writebacks
Summarizing the Cache Arrival Rate

Stream          | Request Rate          | Notes
Predicted Hits  | λ·h_pred·h_cache      |
No predictions  | λ·(1 − h_pred)        | Sent to the cache for tag look-up
Line Fills      | λ·(1 − h_cache)·B_s   | B_s is the cache block size
Writebacks      | λ·(1 − h_cache)·w     | w is the fraction of misses that cause write-backs

λ_cache = λ·h_pred·h_cache + λ·(1 − h_pred) + λ·(1 − h_cache)·B_s + λ·(1 − h_cache)·w
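The summation above can be sketched directly; the parameter values in the example are made up for illustration.

```python
def cache_arrival_rate(lam, h_pred, h_cache, b_s, w):
    """Arrival rate at the DRAM cache as the sum of the four request
    streams from the slide: predicted hits, unpredicted requests
    (tag look-ups), line fills, and writebacks."""
    predicted_hits = lam * h_pred * h_cache
    no_predictions = lam * (1 - h_pred)        # sent to the cache for tag look-up
    line_fills = lam * (1 - h_cache) * b_s     # B_s: cache block size factor
    writebacks = lam * (1 - h_cache) * w       # w: fraction of misses causing write-backs
    return predicted_hits + no_predictions + line_fills + writebacks
```

With λ = 1.0, h_pred = 0.9, h_cache = 0.8, B_s = 1, and w = 0.3, the cache sees λ_cache = 0.72 + 0.1 + 0.2 + 0.06 = 1.08 requests per unit time.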