Recovery Oriented Computing

Slide 1
Recovery Oriented Computing
Dave Patterson
University of California at Berkeley
Patterson@cs.berkeley.edu
http://roc.CS.Berkeley.EDU/
September 2001

Slide 2
Outline
• What have we been doing
• Motivation for a new Challenge: making things work (including endorsements)
• What have we learned
• New Challenge: Recovery-Oriented Computing
• Examples: benchmarks, prototypes

Slide 3
Goals, Assumptions of last 15 years
• Goal #1: Improve performance
• Goal #2: Improve performance
• Goal #3: Improve cost-performance
• Assumptions
  – Humans are perfect (they don’t make mistakes during installation, wiring, upgrade, maintenance or repair)
  – Software will eventually be bug free (good programmers write bug-free code)
  – Hardware MTBF is already very large (~100 years between failures), and will continue to increase

Slide 4
After 15 years improving Performance
• Availability is now a vital metric for servers!
  – near-100% availability is becoming mandatory
    » for e-commerce, enterprise apps, online services, ISPs
  – but, service outages are frequent
    » 65% of IT managers report that their websites were unavailable to customers over a 6-month period
      • 25%: 3 or more outages
  – outage costs are high
    » social effects: negative press, loss of customers who “click over” to competitor
Source: InternetWeek 4/3/2000

Slide 5
Downtime Costs (per Hour)
• Brokerage operations $6,450,000
• Credit card authorization $2,600,000
• Ebay (1 outage 22 hours) $225,000
• Amazon.com $180,000
• Package shipping services $150,000
• Home shopping channel $113,000
• Catalog sales center $90,000
• Airline reservation center $89,000
• Cellular service activation $41,000
• On-line network fees $25,000
• ATM service fees $14,000
Source: InternetWeek 4/3/2000 + Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. "...based on a survey done by Contingency Planning Research."

Slide 6
Jim Gray: Trouble-Free Systems
• Manager
  – Sets goals
  – Sets policy
  – Sets budget
  – System does the rest.
• Everyone is a CIO (Chief Information Officer)
• Build a system
  – used by millions of people each day
  – Administered and managed by a ½ time person.
    » On hardware fault, order replacement part
    » On overload, order additional equipment
    » Upgrade hardware and software automatically.
“What Next? A dozen remaining IT problems”, Turing Award Lecture, FCRC, May 1999. Jim Gray, Microsoft

Slide 7
Lampson: Systems Challenges
• Systems that work
  – Meeting their specs
  – Always available
  – Adapting to changing environment
  – Evolving while they run
  – Made from unreliable components
  – Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
  – Understanding when it doesn’t matter
“Computer Systems Research - Past and Future”, Keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft

Slide 8
Hennessy: What Should the “New World” Focus Be?
• Availability
  – Both appliance & service
• Maintainability
  – Two functions:
    » Enhancing availability by preventing failure
    » Ease of SW and HW upgrades
• Scalability
  – Especially of service
• Cost
  – per device and per service transaction
• Performance
  – Remains important, but it’s not SPECint
“Back to the Future: Time to Return to Longstanding Problems in Computer Systems?”, Keynote address, FCRC, May 1999. John Hennessy, Stanford
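The per-hour figures on the Downtime Costs slide can be turned into a rough cost for a single outage by multiplying by its length. The sketch below is an illustration only and is not from the talk. The function name `outage_cost` is hypothetical, it assumes cost grows linearly with outage length (the slide does not say so), and it reads the Ebay entry as $225,000 per hour over a 22-hour outage.

```python
# Per-hour downtime costs in US dollars, copied from the Downtime Costs slide.
DOWNTIME_COST_PER_HOUR = {
    "Brokerage operations": 6_450_000,
    "Credit card authorization": 2_600_000,
    "Ebay": 225_000,
    "Amazon.com": 180_000,
    "Package shipping services": 150_000,
    "Home shopping channel": 113_000,
    "Catalog sales center": 90_000,
    "Airline reservation center": 89_000,
    "Cellular service activation": 41_000,
    "On-line network fees": 25_000,
    "ATM service fees": 14_000,
}


def outage_cost(service: str, hours: float) -> float:
    """Estimate the cost of one outage.

    Assumption (not from the slides): cost scales linearly with duration.
    """
    if hours < 0:
        raise ValueError("outage duration must be non-negative")
    return DOWNTIME_COST_PER_HOUR[service] * hours


if __name__ == "__main__":
    # The slide lists one 22-hour Ebay outage.
    print(f"Ebay, 22 hours: ${outage_cost('Ebay', 22):,.0f}")
```

Under these assumptions the single Ebay outage alone comes to several million dollars, which is the point the slide makes about outage costs.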

Slide 9
The real scalability problems: AME
• Availability
  – systems should continue to meet quality of service goals despite hardware and software failures
• Maintainability
  – systems should require only minimal ongoing human administration, regardless of scale or complexity: Today, cost of maintenance = 10X cost of purchase
• Evolutionary Growth
  – systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today’s scales, and will only get worse as systems grow

Slide 10
Total Cost of Ownership (IBM)
[Pie chart: HW management 3%; Purchase 20%; Downtime 20%; Administration 13%; Environmental 14%; Backup Restore 30%]
• Administration: all people time
• Backup Restore: devices, media, and people time
• Environmental: floor space, power, air conditioning

Slide 11
Lessons learned from Past Projects which might help AME
• Know how to improve performance (and cost)
  – Run system against workload, measure, innovate, repeat
  – Benchmarks standardize workloads, lead to competition, evaluate alternatives; turns debates into numbers
• Major improvements in Hardware Reliability
  – 1990 Disks 50,000 hour MTBF to 1,200,000 in 2000
  – PC motherboards from 100,000 to 1,000,000 hours
• Yet Everything has an error rate
  – Well designed and manufactured HW: >1% fail/year
  – Well designed and tested SW: > 1 bug / 1000 lines
  – Well trained people doing routine tasks: 1%-2%
  – Well run collocation site (e.g., Exodus): 1 power failure per year, 1 network outage per year

Slide 12
Lessons learned from Past Projects for AME
• Maintenance of machines (with state) expensive
  – ~5X to 10X cost of HW
  – Stateless machines can be trivial to maintain (Hotmail)
• System admin primarily keeps system available
  – System + clever human working during failure = uptime
  – Also plan for growth, software upgrades, configuration, fix performance bugs, do backup
• Software upgrades necessary, dangerous
  – SW bugs fixed, new features added, but stability?
  – Admins try to skip upgrades, be the last to use one
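The disk MTBF figures in the hardware reliability bullets (50,000 hours in 1990, 1,200,000 hours in 2000) can be related to a per-year failure rate with a short calculation. This is a minimal sketch, not something from the talk: it assumes a constant failure rate (an exponential lifetime model), and the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_failure_rate(mtbf_hours: float) -> float:
    """Probability that one unit fails at least once in a year.

    Assumption (not from the slides): failures follow an exponential
    model with constant rate 1 / MTBF.
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(mtbf_hours: float, units: int) -> float:
    """Expected number of failures per year across a fleet of identical units."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return units * HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    for label, mtbf in [("1990 disk", 50_000), ("2000 disk", 1_200_000)]:
        print(f"{label}: {annual_failure_rate(mtbf):.2%} per year")
    print(f"1,000 disks at 1,200,000 h: "
          f"{expected_failures_per_year(1_200_000, 1000):.1f} failures/year")
```

Under this model the 1990 figure corresponds to roughly 16% of disks failing per year and the 2000 figure to under 1%. Even at the better rate, a site with a thousand disks should still expect several failures a year, which is why scale keeps hardware failure relevant.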

Slide 13
Lessons learned from Past Projects for AME
[Chart: Cause of System Crashes, 1985, 1993, 2001 (est.); categories: Hardware failure; Operating System failure; System management: actions + N/problem; Other: app, power, network failure]
• Failures due to people up, hard to measure
  – VAX crashes ‘85, ‘93 [Murp95]; extrap. to ‘01
  – HW/OS 70% in ‘85 to 28% in ‘93. In ‘01, 10%?
  – How get administrator to admit mistake? (Heisenberg?)

Slide 14
Lessons learned from Internet
• Realities of Internet service environment:
  – hardware and software failures are inevitable
    » hardware reliability still imperfect
    » software reliability thwarted by rapid evolution
    » Internet system scale exposes second-order failure modes
  – system failure modes cannot be modeled or predicted
    » commodity components do not fail cleanly
    » black-box system design thwarts models
    » unanticipated failures are normal
  – human operators are imperfect
    » human error accounts for ~50% of all system failures
Sources: Gray86, Hamilton99, Menn99, Murphy95, Perrow99, Pope86

Slide 15
Lessons learned from Past Projects for AME
[Pie charts: Number of Outages; Minutes of Failure; categories: Human-company, Human-external, Hardware Failure, Software Failure, Overload, Vandalism, Act of Nature]
• “Sources of Failure in the Public Switched Telephone Network,” Kuhn
  – FCC Records 1992-1994; IEEE Computer, 30:4 (Apr 97)
  – Overload (not sufficient switching to lower costs) another 6% outages, 44% minutes

Slide 16
Learning from other fields: PSTN
• FCC-collected data on outages in the US public-switched telephone network
  – metric: breakdown of customer calls blocked by system outages (excluding natural disasters). Jan-June 2001
[Pie chart: blocked calls by cause; categories: Human-co., Human-ext., HW failures, SW failure, Vandalism]
Human error accounts for 56% of all blocked calls
  – comparison with 1992-4 data shows that human error is the only factor that is not improving over time
