Recovery Oriented Computing
Dave Patterson
University of California at Berkeley
Patterson@cs.berkeley.edu
http://roc.CS.Berkeley.EDU/
September 2001
Slide 1

Outline
• What have we been doing
• Motivation for a new Challenge: making things work (including endorsements)
• What have we learned
• New Challenge: Recovery-Oriented Computing
• Examples: benchmarks, prototypes
Slide 2

Goals, Assumptions of last 15 years
• Goal #1: Improve performance
• Goal #2: Improve performance
• Goal #3: Improve cost-performance
• Assumptions
– Humans are perfect (they don’t make mistakes during installation, wiring, upgrade, maintenance or repair)
– Software will eventually be bug free (good programmers write bug-free code)
– Hardware MTBF is already very large (~100 years between failures), and will continue to increase
Slide 3

After 15 years of improving Performance
• Availability is now a vital metric for servers!
– near-100% availability is becoming mandatory
» for e-commerce, enterprise apps, online services, ISPs
– but, service outages are frequent
» 65% of IT managers report that their websites were unavailable to customers over a 6-month period
• 25%: 3 or more outages
– outage costs are high
» social effects: negative press, loss of customers who “click over” to competitor
Source: InternetWeek 4/3/2000
Slide 4
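To make "near-100% availability" concrete, the short sketch below converts an availability fraction into the downtime it allows per year. It is an illustration only and not part of the talk: the function name `annual_downtime_hours` and the choice of availability levels are assumptions, and a year is taken as 8,760 hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Return the hours of downtime per year allowed by an availability fraction.

    `availability` is a fraction between 0 and 1 (for example 0.999),
    not a percentage.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical availability targets, chosen only for illustration.
    for availability in (0.99, 0.999, 0.9999, 0.99999):
        hours = annual_downtime_hours(availability)
        print(f"{availability:.3%} available: "
              f"{hours:.2f} hours/year ({hours * 60:.1f} minutes/year)")
```

Under these assumptions, 99% availability still leaves roughly 88 hours of downtime per year, while 99.999% leaves only a few minutes, which is why the difference between "high" and "near-100%" availability matters for services that are expected to be always on.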
Downtime Costs (per Hour)
• Brokerage operations $6,450,000
• Credit card authorization $2,600,000
• Ebay (1 outage 22 hours) $225,000
• Amazon.com $180,000
• Package shipping services $150,000
• Home shopping channel $113,000
• Catalog sales center $90,000
• Airline reservation center $89,000
• Cellular service activation $41,000
• On-line network fees $25,000
• ATM service fees $14,000
Source: InternetWeek 4/3/2000 + Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. “...based on a survey done by Contingency Planning Research.”
Slide 5

Jim Gray: Trouble-Free Systems
“What Next? A dozen remaining IT problems,” Turing Award Lecture, FCRC, May 1999. Jim Gray, Microsoft
• Manager
– Sets goals
– Sets policy
– Sets budget
– System does the rest.
• Everyone is a CIO (Chief Information Officer)
• Build a system
– used by millions of people each day
– Administered and managed by a ½ time person.
» On hardware fault, order replacement part
» On overload, order additional equipment
» Upgrade hardware and software automatically.
Slide 6

Hennessy: What Should the “New World” Focus Be?
“Back to the Future: Time to Return to Longstanding Problems in Computer Systems?” Keynote address, FCRC, May 1999. John Hennessy, Stanford
• Availability
– Both appliance & service
• Maintainability
– Two functions:
» Enhancing availability by preventing failure
» Ease of SW and HW upgrades
• Scalability
– Especially of service
• Cost
– per device and per service transaction
• Performance
– Remains important, but it’s not SPECint
Slide 7

Lampson: Systems Challenges
“Computer Systems Research - Past and Future,” Keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft
• Systems that work
– Meeting their specs
– Always available
– Adapting to changing environment
– Evolving while they run
– Made from unreliable components
– Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
– Understanding when it doesn’t matter
Slide 8
The real scalability problems: AME
• Availability
– systems should continue to meet quality of service goals despite hardware and software failures
• Maintainability
– systems should require only minimal ongoing human administration, regardless of scale or complexity: Today, cost of maintenance = 10X cost of purchase
• Evolutionary Growth
– systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today’s scales, and will only get worse as systems grow
Slide 9

Total Cost of Ownership (IBM)
[Figure: pie chart of total cost of ownership: Backup Restore 30%, Purchase 20%, Downtime 20%, Environmental 14%, Administration 13%, HW management 3%]
• Administration: all people time
• Backup Restore: devices, media, and people time
• Environmental: floor space, power, air conditioning
Slide 10

Lessons learned from Past Projects which might help AME
• Know how to improve performance (and cost)
– Run system against workload, measure, innovate, repeat
– Benchmarks standardize workloads, lead to competition, evaluate alternatives; turns debates into numbers
• Major improvements in Hardware Reliability
– 1990 Disks 50,000 hour MTBF to 1,200,000 in 2000
– PC motherboards from 100,000 to 1,000,000 hours
• Yet Everything has an error rate
– Well designed and manufactured HW: >1% fail/year
– Well designed and tested SW: > 1 bug / 1000 lines
– Well trained people doing routine tasks: 1%-2%
– Well run collocation site (e.g., Exodus): 1 power failure per year, 1 network outage per year
Slide 11

Lessons learned from Past Projects for AME
• Maintenance of machines (with state) expensive
– ~5X to 10X cost of HW
– Stateless machines can be trivial to maintain (Hotmail)
• System admin primarily keeps system available
– System + clever human working during failure = uptime
– Also plan for growth, software upgrades, configuration, fix performance bugs, do backup
• Software upgrades necessary, dangerous
– SW bugs fixed, new features added, but stability?
– Admins try to skip upgrades, be the last to use one
Slide 12
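The disk MTBF figures quoted on the hardware reliability slide (50,000 hours in 1990, 1,200,000 hours in 2000) can be related to a per-year failure rate with a small calculation. The sketch below is an illustration, not something from the talk: it assumes a constant failure rate (exponentially distributed lifetimes), and the function names and the 1,000-disk population are hypothetical.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_failure_probability(mtbf_hours: float) -> float:
    """Probability that one unit fails within a year.

    Assumes a constant failure rate of 1 / MTBF, so lifetimes are
    exponentially distributed. This is a simplifying assumption.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(mtbf_hours: float, units: int) -> float:
    """Expected failures per year across `units` devices that are
    replaced on failure, under the same constant-rate assumption."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return units * HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    # MTBF values are the disk figures quoted on the slide;
    # the population of 1,000 disks is a made-up example.
    for year, mtbf in ((1990, 50_000), (2000, 1_200_000)):
        p = annual_failure_probability(mtbf)
        n = expected_failures_per_year(mtbf, units=1000)
        print(f"{year}: MTBF {mtbf:,} h, {p:.2%} chance of failure per disk-year, "
              f"about {n:.1f} failures/year per 1,000 disks")
```

Under this model, even a 1,200,000-hour MTBF corresponds to several failed disks per year in a population of 1,000, which is consistent with the slide's point that everything has an error rate once a system is large enough.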
Lessons learned from Past Projects for AME
[Figure: stacked bar chart “Cause of System Crashes”, share of crashes (0% to 100%) in 1985, 1993, and 2001 (est.), by cause: Other (app, power, network failure); System management (actions + N/problem); Operating System failure; Hardware failure]
• Failures due to people up, hard to measure
– VAX crashes ‘85, ‘93 [Murp95]; extrap. to ‘01
– HW/OS 70% in ‘85 to 28% in ‘93. In ‘01, 10%?
– How to get administrator to admit mistake? (Heisenberg?)
Slide 13

Lessons learned from Internet
• Realities of Internet service environment:
– hardware and software failures are inevitable
» hardware reliability still imperfect
» software reliability thwarted by rapid evolution
» Internet system scale exposes second-order failure modes
– system failure modes cannot be modeled or predicted
» commodity components do not fail cleanly
» black-box system design thwarts models
» unanticipated failures are normal
– human operators are imperfect
» human error accounts for ~50% of all system failures
Sources: Gray86, Hamilton99, Menn99, Murphy95, Perrow99, Pope86
Slide 14

Lessons learned from Past Projects for AME
[Figure: two pie charts, “Number of Outages” and “Minutes of Failure”, by cause: Human-company, Human-external, HW failures, Act of Nature, SW failure, Vandalism]
• “Sources of Failure in the Public Switched Telephone Network,” Kuhn
– FCC Records 1992-1994; IEEE Computer, 30:4 (Apr 97)
– Overload (not sufficient switching to lower costs) another 6% outages, 44% minutes
Slide 15

Learning from other fields: PSTN
• FCC-collected data on outages in the US public-switched telephone network
– metric: breakdown of customer calls blocked by system outages (excluding natural disasters). Jan-June 2001
[Figure: pie chart of blocked calls by cause: Human-co., Human-ext., Hardware Failure, Software Failure, Overload, Vandalism. Human error accounts for 56% of all blocked calls]
– comparison with 1992-4 data shows that human error is the only factor that is not improving over time
Slide 16