ISCA 2010
The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies
Jeffrey Stuecheli (1,2), Dimitris Kaseridis (1), David Daly (3), Hillery C. Hunter (3) & Lizy K. John (1)
(1) ECE Department, The University of Texas at Austin; (2) IBM Corp., Austin; (3) IBM Thomas J. Watson Research Center
Laboratory for Computer Architecture, 6/21/2010
Background: Memory Terminology
• Target system: multi-core CMP
  – 8-16 cores (and up)
  – Shared cache and memory subsystem
• Terminology: Channel / Rank / Chip / Bank
• Area of focus: improving scheduling of the memory interface as many cores combine with DRAM technology challenges
Motivation: Memory Wall (Labyrinth)
• Traditional concern is read latency
  – Fixed at ~26 ns
• Beyond latency, many parameters limit efficient utilization
• Data bus frequency: 2x each DDRx generation
  – DDR 200-400, DDR2 400-1066, DDR3 800-1600
  – But internal latency is ~constant
• Fixed latencies
  – Bank precharge (50 ns, ~7 operations @ 1066 MHz)
  – Write → Read turnaround (7.5 ns, ~2 operations @ 1066 MHz)
Motivation: Implications
• Scheduling efficiency
  – Reads → critical path to execution
  – Writes → decoupled
• Queuing
  – We need more write buffering (make the most of each opportunity to execute writes)
  – Not read buffering, due to the latency criticality of loads
The Virtual Write Queue
• Grow effective write reordering by an order of magnitude through a two-level structure
  – Writes can only execute out of the physical write queue
  – Keep the physical queue full with a good mix of operations
  – The physical write queue becomes a staging ground, covering the latency to pull data from the LLC
[Diagram: LLC sets (MRU way … LRU way, dirty lines toward LRU) forming the Virtual Write Queue; the Cache Cleaner moves dirty lines into the Physical Write Queue that feeds the DRAM Scheduler.]
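As a concrete illustration of the two-level idea, here is a minimal C sketch of the physical write queue that sits next to the DRAM scheduler, with the LLC's dirty LRU lines acting as the much larger virtual queue. All type and field names (phys_wq_entry_t, PHYS_WQ_DEPTH, etc.) are invented for illustration, not taken from the paper's design.

```c
#include <stdbool.h>
#include <stdint.h>

#define PHYS_WQ_DEPTH 32              /* small buffer near the DRAM scheduler */
#define NUM_RANKS     4

/* One write staged for DRAM, with its target resources pre-decoded. */
typedef struct {
    uint64_t addr;                    /* physical address of the dirty line */
    uint64_t row;                     /* DRAM row (page) the write targets  */
    uint8_t  rank;
    uint8_t  bank;
    bool     valid;
} phys_wq_entry_t;

/* The "virtual" write queue is not a separate buffer: it is the dirty data
 * already resident in the LLC.  Only entries in this small physical queue
 * can actually be issued to DRAM; the cache cleaner refills it from the LLC. */
typedef struct {
    phys_wq_entry_t entry[PHYS_WQ_DEPTH];
    uint32_t        per_rank_count[NUM_RANKS];  /* fullness tracked per rank */
} phys_write_queue_t;
```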
VWQ Details
VWQ Details: Cache → Memory Writeback Evolution
• Forced Writeback: traditional approach to writeback.
• Eager Writeback: decouple cache fill from writeback with early "eager" writeback of dirty data (Lee, MICRO 2000).
• Scheduled Writeback: our proposal. Place writeback under the control of the memory scheduler.
VWQ Details: Filling the Physical Write Queue
• Key concept:
  – Relatively few classes of writes:
    • Rank classification: which rank?
    • Page mode: quality level
    • Bank conflicts: avoid writes to the same bank but a different page
  – Physical write queue content:
    • Maintain high-quality writes in the structure
    • Keep writes queued for each rank
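A hedged sketch of the write-classification idea, reusing the types from the earlier sketch; the helper classify_write and the three-way quality split are assumptions for illustration, not the authors' exact policy.

```c
typedef enum {
    WR_PAGE_HIT,        /* same rank/bank/row as a queued write: cheapest      */
    WR_NEUTRAL,         /* touches a bank with no queued write: acceptable     */
    WR_BANK_CONFLICT    /* same bank, different row: forces precharge/activate */
} wr_quality_t;

static wr_quality_t classify_write(const phys_write_queue_t *q,
                                   uint8_t rank, uint8_t bank, uint64_t row)
{
    wr_quality_t quality = WR_NEUTRAL;
    for (int i = 0; i < PHYS_WQ_DEPTH; i++) {
        if (!q->entry[i].valid) continue;
        if (q->entry[i].rank != rank || q->entry[i].bank != bank) continue;
        if (q->entry[i].row == row)
            return WR_PAGE_HIT;       /* can stream in page mode              */
        quality = WR_BANK_CONFLICT;   /* avoid admitting this write for now   */
    }
    return quality;
}
```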
VWQ Details: Address Mapping
• The cache set address contains:
  – All rank selection bits
  – All bank selection bits
  – Some number of column bits (address within a DRAM page)
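Because the set index carries the full rank and bank selection, the target DRAM resource can be recovered from the set number alone, without reading tags. A minimal sketch with made-up bit positions (the real mapping depends on the system's address interleave):

```c
/* Illustrative layout only: low set-index bits select the bank, then rank. */
#define BANK_SHIFT  0
#define BANK_BITS   3                 /* 8 banks per rank                   */
#define RANK_SHIFT  (BANK_SHIFT + BANK_BITS)
#define RANK_BITS   2                 /* 4 ranks                            */

static inline uint8_t set_to_bank(uint32_t set_index)
{
    return (uint8_t)((set_index >> BANK_SHIFT) & ((1u << BANK_BITS) - 1));
}

static inline uint8_t set_to_rank(uint32_t set_index)
{
    return (uint8_t)((set_index >> RANK_SHIFT) & ((1u << RANK_BITS) - 1));
}
```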
VWQ Details: The Cache Cleaner
• Goal: fast/efficient search of the large LLC directory
• Based around a Set State Vector (SSV)
• The SSV enables efficient communication of dirty lines to be cleaned
• The cleaner selects lines based on the current physical write queue contents
  – Keep it full with a uniform mix of operations to each DRAM resource
[Diagram: Set State Vector alongside the LLC (MRU way … LRU way, dirty lines toward LRU); the Cache Cleaner fills the Physical Write Queue feeding the DRAM Scheduler.]
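One possible reading of the SSV-driven search, again only a sketch: one bit per LLC set records whether its LRU way is dirty, and the cleaner scans for a set whose index maps to the rank/bank the physical write queue currently needs (set_to_rank/set_to_bank from the mapping sketch above). The linear scan and bit layout are assumptions, not the paper's hardware.

```c
#define NUM_SETS 8192

static uint64_t ssv[NUM_SETS / 64];   /* Set State Vector: 1 bit per set    */

/* Find a set whose LRU way is dirty and which maps to the wanted resource. */
static int find_dirty_set_for(uint8_t want_rank, uint8_t want_bank)
{
    for (uint32_t set = 0; set < NUM_SETS; set++) {
        bool lru_dirty = (ssv[set / 64] >> (set % 64)) & 1u;
        if (lru_dirty &&
            set_to_rank(set) == want_rank &&
            set_to_bank(set) == want_bank)
            return (int)set;          /* candidate set for the cleaner      */
    }
    return -1;                        /* nothing to clean for this target   */
}
```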
VWQ Details: Read/Write Priority in the Scheduler
• Goal: defer write operations as long as possible
  – Forced Writeback: queuing depth is quite limited.
  – Eager Writeback: the write queue is always full; how do we know when we must execute writes?
  – Virtual Write Queue: monitor overall fullness on a per-rank basis. Much larger effective buffering capability.
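A sketch of the deferral policy implied by per-rank fullness monitoring; the high/low water marks and hysteresis are illustrative assumptions, not the paper's tuned values.

```c
#define WRITE_HIGH_WATER 0.80         /* start draining writes for a rank   */
#define WRITE_LOW_WATER  0.50         /* hand the bus back to reads         */

static bool draining[NUM_RANKS];      /* per-rank hysteresis state          */

/* occupancy[r]: fraction of rank r's virtual write queue (dirty LLC lines
 * plus physical-queue entries) that is currently full.                     */
static bool should_issue_write(int rank, const double occupancy[NUM_RANKS])
{
    if (occupancy[rank] > WRITE_HIGH_WATER)
        draining[rank] = true;
    else if (occupancy[rank] < WRITE_LOW_WATER)
        draining[rank] = false;
    return draining[rank];            /* otherwise keep deferring writes    */
}
```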
Evaluation/Results
Evaluation/Results: Bandwidth Improvement Example
• From the SPEC mcf workload
[Plot: memory bus utilization (0 to 0.9) over 350 million instructions, Baseline vs. VWQ.]
Evaluation/Results: Virtual Write Queue IPC Gains
• Each experiment consists of 8 copies of the same benchmark
  – IPC was observed to be uniform across cores (the symmetrical system was fair)
• Improvements in 1-, 2-, and 4-rank systems
  – Largest improvement with 1 rank, due to the exposed "Write to Read Same Rank" penalty
[Chart: IPC improvement (0-25%) for 1-, 2-, and 4-rank systems across hmmer, quantum, mcf, omnetpp, bzip2, bwaves, cactus, dealII, gcc, GemsFDTD, leslie3d, soplex, and AVG.]
Evaluation/Results: Power Reduction Due to Increased Write Page-Mode Access
• Overall DRAM power reduction is shown
Conclusion
• Memory scheduling is critical to CMP design
• We must leverage all state in the SoC/CMP
Thank You, Questions?
Laboratory for Computer Architecture, The University of Texas at Austin & IBM Austin & IBM T. J. Watson Research Center