Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO
Panagiota Nikolaou (1), Yiannakis Sazeides (1), Lorena Ndreu (1), Marios Kleanthous (2)
(1) University of Cyprus, (2) MAP S.Platis
MICRO 48, Waikiki, Hawaii, December 5th 2015
Today’s Datacenters
• More than 285 million sq ft of datacenter floor space [Emerson, 2011]
• More than 510,000 datacenters worldwide [Emerson, 2011]
• Large-scale datacenters: more than 10,000 commodity servers, costing many millions of dollars per month
Datacenter Cost
[Pie chart of datacenter cost]
• DRAM Opex & Capex expenses: 31%
• Other Capex expenses: 49%
• Other Opex expenses: 20%
[Analysis using COST-ET tool, D. Hardy 2013]
DRAM Protection Cost
[Pie chart of datacenter cost, with DRAM split into data and ECC]
• DRAM protection Opex & Capex: 8%
• DRAM Opex & Capex expenses: 23%
• Other Capex & Opex cost: 69%
[Analysis using COST-ET tool, D. Hardy 2013]
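The shares on this slide can be cross-checked with a few lines of arithmetic. The sketch below only uses the three percentages shown above; the variable names are illustrative and not part of the AMPRA or COST-ET tools.

```python
# Cost shares read off the slide, in percent of total datacenter cost.
DRAM_PROTECTION_PCT = 8   # DRAM protection (ECC) Opex & Capex
DRAM_DATA_PCT = 23        # DRAM Opex & Capex without the protection part
OTHER_PCT = 69            # all other Capex & Opex

# DRAM as a whole: should agree with the 31% on the previous slide.
dram_total_pct = DRAM_PROTECTION_PCT + DRAM_DATA_PCT

# Fraction of the DRAM-related cost that is spent on protection.
protection_share_of_dram = DRAM_PROTECTION_PCT / dram_total_pct

print(f"DRAM total: {dram_total_pct}% of datacenter cost")
print(f"Protection share of DRAM cost: {protection_share_of_dram:.1%}")
```

Under these numbers, roughly a quarter of the DRAM-related cost goes to protection.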
Do we need DRAM protection?
[Borucki, IRPS 2008], [K. Lim, ISCA 2009], [Daniel Bowers, “Server Trends”]
[Figure: DRAM FITs per server]
• Google failure study [Barroso, 2009]
• Large DRAM field studies [V. Sridharan 2012, 2013]
DRAM protection is essential!
DRAM protection choices
              ChipkillDC   ChipkillSC   SECDED
Cost          +            ++           +++
Reliability   +++          ++           +
Performance   ++           +++          +
DRAM protection selection
[Diagram: Application, AMPRA*]
*Analyzer of Memory Protection and Failures Implications on TCO (AMPRA tool), site: http://www2.cs.ucy.ac.cy/carch/xi/ampra_tco.php
Our Proposition & Our Contribution
[Diagram of the AMPRA tool]
• Inputs: application characteristics, error protection techniques, power, reliability, performance, …
• Models: DRAM FIT Model, SDC Model, DIMM Energy Model, Thermal Model, Availability/MTTF Model, Server Performance Model, DIMM Cost Model, TCO Model
• Output: best DRAM protection technique
Related work
• [Y. Luo DSN 2014] proposes a heterogeneous memory protection scheme and analyzes its cost
  Differences from this work:
  – Performance and power implications of memory protection techniques
  – Co-located services
  – Datacenter cost
• No other related work considers this range of parameters
Outline
• Proposed Framework (AMPRA tool)
• Use Case
• Experimental Framework
• Results
• Conclusions
Proposed framework (AMPRA tool)
[Block diagram of the AMPRA tool; recoverable labels listed below]
• Models: Thermal Model, DRAM FIT Model, DRAM SDC Derating Model, Availability/MTTF Model, DIMM Energy Model, DIMM Cost Model, Server Performance Model, TCO Model
• Workload and configuration inputs: ECC technique; reference ECC technique; server configuration (#cores, interleaving type, #channels, DIMMs/channel); #threads for the online service; #threads for the offline service; average utilization; utilization profile per day for the online service; system configuration; DC configuration; #DC servers
• DRAM inputs: FITs per mode (transient, permanent, physical location); reference DRAM FIT; device width and size; #devices/DIMM; DRAM brand, technology, grade and frequency; published data
• Reliability and maintenance inputs: non-DRAM component reference MTTF; NDE derating factor; HW and SW repair options and their MTTR; model for proactive replacement; maintenance model for replacement of faulty components; target reliability (MTTF_SDC)
• Intermediate results: component and DIMM temperature; FITS_DUE, FITS_CE, FITS_NDE, derated FITS_NDE, FITS_SDC; component MTTF for replacement; DIMM energy; DIMM cost; server performance degradation (PD); total extra servers
• Final results: TCO, system reliability
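The framework moves between FIT rates, MTTF and availability. The sketch below shows the standard conversions only: one FIT is one failure per 10^9 device-hours, and steady-state availability is MTTF / (MTTF + MTTR). It assumes independent components whose FIT rates add (a series system). The function names and all numeric values are hypothetical and are not taken from the AMPRA tool.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def fit_to_mttf_hours(fit: float) -> float:
    """Convert a FIT rate into a mean time to failure, in hours."""
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return HOURS_PER_FIT_UNIT / fit


def server_fit(dimm_fit: float, dimms_per_server: int, other_fit: float = 0.0) -> float:
    """Sum FIT rates, assuming independent components in series (an assumption)."""
    return dimm_fit * dimms_per_server + other_fit


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTTF and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical example values, for illustration only.
total_fit = server_fit(dimm_fit=100.0, dimms_per_server=4, other_fit=600.0)
mttf = fit_to_mttf_hours(total_fit)
print(f"Server FIT: {total_fit:.0f}, MTTF: {mttf:.0f} hours")
print(f"Availability with a 24 h repair time: {availability(mttf, 24.0):.6f}")
```

A per-technique version would feed different FIT values (for example FITS_DUE or FITS_SDC) into the same conversions.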
Use Case
Bandwidth vs. Latency vs. Reliability vs. Power
• Chipkill with Dual Channel Implementation (ChipkillDC)
• Chipkill with Single Channel Implementation (ChipkillSC)
• 16 ECC bits for 128 data bits: 144-bit codeword
[Diagram: the memory controller reads a 64B block as eight 8B transfers over 72-bit (data + ECC) DIMM interfaces]
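The codeword numbers on this slide imply a fixed storage overhead and a fixed number of codewords per cache block. A small sketch of that arithmetic, using only the figures above (16 ECC bits, 128 data bits, 64B block); the variable names are illustrative:

```python
DATA_BITS = 128    # data bits per codeword
ECC_BITS = 16      # ECC bits per codeword
BLOCK_BYTES = 64   # block size shown in the diagram

codeword_bits = DATA_BITS + ECC_BITS                   # total codeword width
ecc_overhead = ECC_BITS / DATA_BITS                    # extra storage per data bit
codewords_per_block = (BLOCK_BYTES * 8) // DATA_BITS   # codewords needed per block
bits_per_block = codewords_per_block * codeword_bits   # bits moved per block read

print(f"Codeword: {codeword_bits} bits, ECC overhead: {ecc_overhead:.1%}")
print(f"Codewords per {BLOCK_BYTES}B block: {codewords_per_block}")
print(f"Bits transferred per block: {bits_per_block}")
```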
FIT model
• ChipkillDC:
  – Detects all the errors in 2 devices
  – Corrects all the errors in 1 device
• ChipkillSC:
  – Cannot detect all the errors in 2 devices
  – Corrects all the errors in 1 device
ChipkillDC can provide better reliability than ChipkillSC
Performance and Power model
How it works (ChipkillDC):
[Diagram: read of a 64B block as 8B transfers from two DIMMs, each 72 bits wide (data + ECC), to the memory controller]
• Requires accessing two DIMMs
• Codeword in a single burst
• Short latency
• Low bandwidth
• High power consumption
Performance and Power model
How it works (ChipkillSC):
[Diagram: read of a 64B block as 8B transfers from one DIMM, 72 bits wide (data + ECC), to the memory controller; the 144-bit codeword spans two transfers]
• Requires accessing one DIMM
• Codeword in two bursts
• Long latency
• High bandwidth
• Less power consumption
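The one-versus-two contrast between the two implementations follows from the interface width. The sketch below is a simplification that treats each "burst" on the slides as one full-width transfer over the accessed DIMMs; the 72-bit width and 144-bit codeword come from the slides, while the function name is hypothetical.

```python
import math

CODEWORD_BITS = 144    # 128 data bits + 16 ECC bits
DIMM_WIDTH_BITS = 72   # per-DIMM interface width shown in the diagrams


def transfers_per_codeword(dimms_accessed: int) -> int:
    """Full-width transfers needed to move one codeword (simplified model)."""
    if dimms_accessed < 1:
        raise ValueError("at least one DIMM must be accessed")
    return math.ceil(CODEWORD_BITS / (DIMM_WIDTH_BITS * dimms_accessed))


chipkill_dc = transfers_per_codeword(dimms_accessed=2)  # two DIMMs side by side
chipkill_sc = transfers_per_codeword(dimms_accessed=1)  # one DIMM

print(f"ChipkillDC: {chipkill_dc} transfer(s) per codeword")
print(f"ChipkillSC: {chipkill_sc} transfer(s) per codeword")
```

The trade-off is the one listed on the slides: ChipkillDC gets the codeword sooner but ties up two DIMMs per access, while ChipkillSC takes longer per codeword and leaves the other DIMM free.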
Design Space
              ChipkillSC                               ChipkillDC
Reliability   Cannot detect all the errors in 2        Detects all the errors in 2 devices;
              devices; corrects all the errors in      corrects all the errors in 1 device
              1 device
Bandwidth     Accesses one DIMM                        Accesses two DIMMs
Latency       Codeword in two bursts                   Codeword in one burst
Power         Accesses one DIMM                        Accesses two DIMMs
What happens with the cost?
• Application characteristics: memory intensive, compute intensive, etc.
• Co-running applications
Online and Offline Services
• Online services: high QoS requirements
• Offline services: no QoS constraints
• Co-location: improves server utilization and reduces TCO
[Diagram: online and offline services mapped onto cores 0 to 3, sharing the memory controller and DRAM]
Experimental Framework
• DIMM cost: public data
• FIT model: analytical models
• Performance, power and thermal model: ChipkillSC (Advance Mode), ChipkillDC (Lockstep Mode)
• Server configuration: Intel Xeon E5-5620, 4 cores per CPU, 2 channels per CPU, 1 DIMM per channel
• Workloads:
  1. Web Search (QoS requirements)
  2. MapReduce:
     a. 500MB (CPU intensive)
     b. 49000MB (memory intensive)
• DC configuration: 50,000 server modules, DC depreciation over 15 years
• TCO model: extension of the COST-ET tool [D. Hardy 2013]
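One way a performance penalty can turn into cost is through extra servers needed to keep throughput constant. The sketch below illustrates that idea with the 50,000 server modules from the DC configuration. The formula is an assumption made for illustration (each server delivers a fraction 1 - PD of its baseline throughput); it is not the AMPRA or COST-ET formula, and the degradation value is hypothetical.

```python
import math


def extra_servers(num_servers: int, perf_degradation: float) -> int:
    """Servers to add so total throughput matches the undegraded baseline.

    Assumes (hypothetically) that every server delivers a fraction
    (1 - perf_degradation) of its baseline throughput.
    """
    if not 0.0 <= perf_degradation < 1.0:
        raise ValueError("perf_degradation must be in [0, 1)")
    required = math.ceil(num_servers / (1.0 - perf_degradation))
    return required - num_servers


DC_SERVERS = 50_000    # server modules, from the DC configuration
EXAMPLE_PD = 0.05      # hypothetical 5% performance degradation

print(f"Extra servers at {EXAMPLE_PD:.0%} degradation: "
      f"{extra_servers(DC_SERVERS, EXAMPLE_PD)}")
```

Even a small per-server slowdown scales into thousands of additional servers at this datacenter size, which is why performance enters the TCO comparison alongside DIMM cost and energy.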
DRAM Protection Implications on Performance
[Figure; legend: WS = Web Search, MR500 = MapReduce 500MB, MR49000 = MapReduce 49000MB]
DRAM Protection Implications on Power
[Figure; legend: WS = Web Search, MR500 = MapReduce 500MB, MR49000 = MapReduce 49000MB]
DRAM Protection Implications on Cost
[Figure; legend: WS = Web Search, MR500 = MapReduce 500MB, MR49000 = MapReduce 49000MB]
• Underlines the importance of understanding the usage and characteristics of all the services to be run in a DC before making memory protection design choices
• Highlights the need for the proposed framework!
Usage
• Datacenter designers: select the processor and protection technique
• Researchers: investigate the implications of new ideas related to DRAM failures and DRAM protection techniques
• Service providers: find how to charge for running offline services and make up for the increase in TCO due to co-location
More in the paper
• Detailed explanation of each model
• DRAM grades and how they affect TCO
• Results for other protection techniques (SECDED)
• Power and performance results for more applications
Conclusions
• DRAM is one of the dominant cost contributors in a DC
• Different protection techniques have different TCO implications
• The framework encapsulates all the parameters and aims to determine the cost-effective protection technique for a DC
• The results highlight the need for the framework
  – Without it, deciding which DRAM protection technique is best for a DC setup is not straightforward
Future Work
• Evaluate TCO for more online and offline services
• Explore the cost benefits of new ECC schemes
• Validate the framework using detailed logs from a real DC
AMPRA tool download site: http://www2.cs.ucy.ac.cy/carch/xi/ampra_tco.php