A Comprehensive Analytical Performance Model of DRAM Caches
Authors: Nagendra Gulur*, Mahesh Mehendale*, and R. Govindarajan+
Presented by: Sreepathi Pai§
*Texas Instruments  +Indian Institute of Science  §University of Texas, Austin
6th ACM/SPEC International Conference on Performance Engineering, 2015
Talk Outline
• Introduction to stacked DRAM Caches
• Background (An overview of ANATOMY§)
• ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
• Evaluation
• Insights
• Conclusions
§ ANATOMY: An Analytical Model of Memory System Performance (published at the 2014 ACM International Conference on Measurement and Modeling of Computer Systems)
Stacked DRAM
• DRAM vertically stacked over the processor die.
• Stacked DRAMs offer:
  – High bandwidth
  – High capacity
  – Moderately low latency
• Several proposals organize this large DRAM as a last-level cache.
[Picture courtesy Bryan Black, from the MICRO 2013 keynote]
Processor Organization with a DRAM Cache
[Figure: Cores 0..N, each with L1I/L1D caches, share an L2 that serves as the last-level SRAM cache (LLSC). A tag predictor (metadata on SRAM) steers LLSC misses: predicted hits go to the vertically stacked DRAM cache (metadata on DRAM); misses go through the memory controller to off-chip main memory.]
Talk Outline
• Introduction to stacked DRAM Caches
• Background (An overview of ANATOMY)
• ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
• Evaluation
• Insights
• Conclusions
Overview of a DRAM-based Memory
[Figure: The memory controller sends control and address signals to, and exchanges data with, a DIMM. A DIMM contains ranks; a rank contains devices; a device contains banks. Each DRAM bank is organized as rows and columns, with bank logic and a row buffer through which read and write operations are served.]
Basic DRAM Operations
• ACTIVATE: bring data from the DRAM core into the row buffer
• READ/WRITE: perform read/write operations on the contents of the row buffer
• PRECHARGE: store data back to the DRAM core (ACTIVATE discharges the cells' capacitors) and return the bitlines to a neutral voltage
[Timing diagram: three memory requests (Miss, Hit, Miss) produce the command stream PRE ACT RD | RD | PRE ACT RD.]
• Bank-Level Parallelism (BLP): parallelism across banks improves performance, though bank-switching delays can hurt it
• Row-buffer hits (RBH) are faster and consume less power
ANATOMY: An Analytical Model of Memory
Two components:
1) A queuing model of the memory system
   – Captures organizational and technological characteristics
   – Takes workload characteristics as input
2) Use of workload characteristics
   – Locality and parallelism in the workload's memory accesses
Analytical Model for Memory System Performance
The memory system is modeled as queuing servers in series: an address bus server, multiple M/D/1 bank servers (banks 1..N), and a data bus server. For an M/D/1 queue with arrival rate λ, service rate µ, and utilization ρ = λ/µ, the mean queuing delay is

  Q = ρ / (2µ(1 − ρ))

Service times:
• Address bus: (RBH·1 + (1 − RBH)·3) · BUS_CYCLE_TIME (a row-buffer hit needs one command; a miss needs three: PRE, ACT, RD)
• Bank: t_CL·RBH + (t_CL + t_PRE + t_RCD)·(1 − RBH)
• Data bus: Burst_Length · BUS_CYCLE_TIME

Latency = Q_addr + Q_bank + Q_data + 1/µ_addr + 1/µ_bank + 1/µ_data
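The slide's queuing model can be sketched in a few lines of Python. This is a minimal illustration: the formulas come from the slide, but the timing values, the uniform spread of requests across banks, and the parameter defaults are made-up assumptions for the example.

```python
def mdl_queue_delay(lam, mu):
    """Mean queuing delay Q = rho / (2*mu*(1 - rho)) for an M/D/1 queue."""
    rho = lam / mu
    assert rho < 1, "server must not be saturated"
    return rho / (2 * mu * (1 - rho))

def memory_latency(lam, rbh, burst_length=4, bus_cycle=1.25,
                   t_cl=15.0, t_pre=15.0, t_rcd=15.0, n_banks=8):
    """Average memory latency per the slide: queuing delays plus service
    times at the address bus, bank, and data bus servers.
    lam is in requests per ns; all timings are illustrative, in ns."""
    # Service times: a row-buffer hit needs 1 command, a miss needs 3 (PRE, ACT, RD)
    s_addr = (rbh * 1 + (1 - rbh) * 3) * bus_cycle
    s_bank = t_cl * rbh + (t_cl + t_pre + t_rcd) * (1 - rbh)
    s_data = burst_length * bus_cycle
    # Assume requests spread evenly across banks (bank-level parallelism)
    q_addr = mdl_queue_delay(lam, 1 / s_addr)
    q_bank = mdl_queue_delay(lam / n_banks, 1 / s_bank)
    q_data = mdl_queue_delay(lam, 1 / s_data)
    return q_addr + q_bank + q_data + s_addr + s_bank + s_data
```

For example, `memory_latency(lam=0.02, rbh=0.6)` is dominated by the bank service time (27 ns at 60% row-buffer hit rate) plus modest queuing delays.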
Validation: Model Accuracy
[Chart: % error (−12.5% to +12.5%) in Latency, RBH, and BLP estimates for experiments E1–E15 and their average.]
• Low errors in RBH, BLP, and latency estimation
  – Average errors of 3.9%, 4.2%, and 4%, respectively
• ANATOMY predicts trends accurately
Talk Outline
• Introduction to stacked DRAM Caches
• Background (An overview of ANATOMY)
• ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations
• Evaluation
• Insights
• Conclusions
ANATOMY-Cache Model
[Figure: processor with stacked DRAM — the L2 (LLSC) feeds a tag predictor; hits go to the vertically stacked DRAM cache, misses through the memory controller to off-chip main memory.]
Key parameters that govern performance:
• Arrival rate
• Tag access time
• Cache hit rate
• Cache RBH
• Cache miss penalty
Extending ANATOMY to DRAM Caches
• Two ANATOMY instances: one for the DRAM cache (ANATOMY-Cache) and one for main memory (ANATOMY-Mem).
• The models are fed by the output of the tag server and by each other's outputs:
  – Predicted cache hits (sent to the cache)
  – Requests with no predictions (sent to the cache)
  – Line fills and writeback requests from main memory (sent to the cache)
  – Predicted misses (sent to main memory)
  – Misses, line fills, and writebacks from the cache (sent to main memory)
• We compute the latencies at the cache (L_Cache) and at memory (L_Mem) using ANATOMY.
Obtaining the Average LLSC Miss Penalty
• L_Cache and L_Mem are combined to estimate the average LLSC miss penalty.
• But first, we discuss the estimation of the key parameters that govern L_Cache and L_Mem.
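How L_Cache and L_Mem combine depends on the DRAM-cache hit rate. As a hedged first-order sketch — my own illustrative assumption, not necessarily the paper's exact formula:

```python
def avg_llsc_miss_penalty(l_cache, l_mem, h_cache):
    """Illustrative first-order combination (an assumption, not the paper's
    derivation): an LLSC miss that hits in the DRAM cache costs L_Cache;
    a DRAM-cache miss pays the cache look-up plus the memory latency L_Mem."""
    return h_cache * l_cache + (1 - h_cache) * (l_cache + l_mem)

# e.g. L_Cache = 50 ns, L_Mem = 200 ns, 80% DRAM-cache hit rate
print(round(avg_llsc_miss_penalty(50.0, 200.0, 0.8), 2))  # -> 90.0
```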
Estimating Key Parameters…
• Arrival rate
• Tag access time
• Cache hit rate
• Cache RBH
• Cache miss penalty
[Figure: processor with stacked DRAM, as before.]
Estimating the Cache Arrival Rate
• The arrival rate at the cache is the sum of several streams of accesses:
  – Predicted hits
  – No predictions
  – Line fills and writebacks
Summarizing the Cache Arrival Rate

Stream          | Request Rate          | Notes
Predicted Hits  | λ·h_pred·h_cache      |
No predictions  | λ·(1 − h_pred)        | Sent to the cache for tag look-up
Line Fills      | λ·(1 − h_cache)·B_s   | B_s is the cache block size
Writebacks      | λ·(1 − h_cache)·w     | w is the fraction of misses that cause write-backs

λ_cache = λ·h_pred·h_cache + λ·(1 − h_pred) + λ·(1 − h_cache)·B_s + λ·(1 − h_cache)·w
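The summation above can be sketched directly; the parameter values in the example are made up for illustration.

```python
def cache_arrival_rate(lam, h_pred, h_cache, b_s, w):
    """Arrival rate at the DRAM cache as the sum of the four request
    streams from the slide: predicted hits, unpredicted requests
    (tag look-ups), line fills, and writebacks."""
    predicted_hits = lam * h_pred * h_cache
    no_predictions = lam * (1 - h_pred)        # sent to the cache for tag look-up
    line_fills = lam * (1 - h_cache) * b_s     # B_s: cache block size factor
    writebacks = lam * (1 - h_cache) * w       # w: fraction of misses causing write-backs
    return predicted_hits + no_predictions + line_fills + writebacks
```

With λ = 1.0, h_pred = 0.9, h_cache = 0.8, B_s = 1, and w = 0.3, the cache sees λ_cache = 0.72 + 0.1 + 0.2 + 0.06 = 1.08 requests per unit time.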