
Active Server Availability Feedback - PowerPoint PPT Presentation

Active Server Availability Feedback. James Hamilton (JamesRH@microsoft.com), Microsoft SQL Server, 2002.06.12.


  1. Active Server Availability Feedback
     James Hamilton, JamesRH@microsoft.com
     Microsoft SQL Server
     2002.06.12

  2. Agenda
     - Availability
       - Software complexity
       - Availability study results
     - System Failure Reporting (Watson)
       - Goals
       - System architecture
       - Operation & mechanisms
       - Querying failure data
     - Data Collection Agent (DCA)
       - Goals
       - System architecture
       - What is tracked?
       - Progress & results

  3. S/W Complexity
     - Even server-side software is BIG:
       - Windows2000: over 50 mloc
       - DB: 1.5+ mloc
       - SAP: 37 mloc (4,200 S/W engineers)
     - Tester to Developer ratios often above 1:1
     - Quality per unit line only incrementally improving
     - Current massive testing investment not solving problem
     - New approach needed:
       - Assume S/W failure inevitable
       - Redundant, self-healing systems right approach
     - We first need detailed understanding of what is causing downtime

  4. Availability Study Results
     - 1985 Tandem study (Gray):
       - Administration: 42% downtime
       - Software: 25% downtime
       - Hardware: 18% downtime
     - 1990 Tandem study (Gray):
       - Administration: 15%
       - Software: 62%
     - Most studies have admin contribution much higher
     - Observations:
       - H/W downtime contribution trending to zero
       - Software & admin costs dominate & growing
       - We're still looking at 10 to 15 year-old research

  5. Agenda (same outline as slide 2)

  6. Watson Goals
     - Instrument SQL Server:
       - Track failures during customer usage
       - Report failure & debug data to dev team
       - Goal is to fix big ticket issues proactively
     - Instrumented components:
       - Setup
       - Core SQL Server engine
       - Replication
       - OLAP Engine
       - Management tools
     - Also in use by:
       - Office (Watson technology owner)
       - Windows XP
       - Internet Explorer
       - MSN Explorer
       - Visual Studio 7
       - …

  7. What data do we collect?
     - For crashes: Minidumps
       - Stack, System Info, Modules-loaded, Type of Exception, Global/Local variables
       - 0-150k each
     - For setup errors:
       - Darwin Log
       - setup.exe log
     - 2nd Level if needed by bug-fixing team:
       - Regkeys, heap, files, file versions, WQL queries

  8. Watson user experience:
     - Server side is registry key driven rather than UI
     - Default is "don't send"
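The "don't send" default on slide 8 amounts to an opt-in check. A minimal sketch follows, assuming the registry values have already been read into a plain dictionary; the setting name `EnableErrorReporting` is hypothetical and is not taken from the talk or from any real product.

```python
def reporting_enabled(settings: dict) -> bool:
    """Return True only when an administrator has explicitly opted in.

    `settings` stands in for values read from the registry.
    The key name "EnableErrorReporting" is a made-up placeholder.
    """
    # A missing key means the default applies: don't send.
    return settings.get("EnableErrorReporting", 0) == 1


print(reporting_enabled({}))                           # False
print(reporting_enabled({"EnableErrorReporting": 1}))  # True
```

Treating a missing value as "off" means a freshly installed server never uploads anything until someone changes the setting.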

  9. Crash Reporting UI
     - Server side upload events written to event log rather than UI

  10. information back to users
     - 'More information' hyperlink on Watson's Thank You dialog can be set to problem-specific URL

  11. Key Concept: Bucketing
     - Categorize & group failures by certain 'bucketing parameters':
       - Crash: AppName, AppVersion, ModuleName, ModuleVersion, Offset into module…
         - SQL uses stack signatures rather than failing address as buckets
       - Setup Failures: ProdCode, ProdVer, Action, ErrNum, Err0, Err1, Err2
     - Why bucketize?
       - Ability to limit data gathering
       - Per bucket hit counting
       - Per bucket server response
       - Custom data gathering
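The bucketing idea on slide 11 can be sketched in a few lines. This is an illustration only, not Watson's implementation: the report layout (a dictionary per crash), the field names, and the `max_dumps` threshold are all assumptions. It uses the generic crash parameters; the SQL Server variant described on the slide would substitute a stack signature for the module offset.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class CrashBucket(NamedTuple):
    """Bucketing parameters for a crash, following the slide's list."""
    app_name: str
    app_version: str
    module_name: str
    module_version: str
    offset: int  # offset into the failing module


def bucket_key(report: dict) -> CrashBucket:
    """Derive the bucket for one crash report (hypothetical dict layout)."""
    return CrashBucket(
        report["app_name"],
        report["app_version"],
        report["module_name"],
        report["module_version"],
        report["offset"],
    )


def count_hits(reports: Iterable[dict]) -> Counter:
    """Per-bucket hit counting: reports with equal parameters share a bucket."""
    return Counter(bucket_key(r) for r in reports)


def should_collect(hits: Counter, bucket: CrashBucket, max_dumps: int = 3) -> bool:
    """Limit data gathering: ask for a dump only while the bucket has few hits.

    The threshold of 3 is an arbitrary illustrative value.
    """
    return hits[bucket] < max_dumps
```

Because the key is an immutable tuple, two reports with identical parameters land in the same counter slot, which is what makes per-bucket hit counts and per-bucket responses cheap to look up.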

  12. The payoff of bucketing
     - Small number of S/W failures dominate customer experienced failures
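One way to check the claim on slide 12 against real hit counts is to measure how much of the total the largest buckets cover. The sketch below assumes hit counts are already available as a `Counter`; the bucket names and numbers are made up for illustration and are not data from the talk.

```python
from collections import Counter


def top_bucket_coverage(hits: Counter, k: int) -> float:
    """Fraction of all reported failures that fall in the k largest buckets."""
    total = sum(hits.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in hits.most_common(k)) / total


# Made-up hit counts for illustration only.
hits = Counter({"bucket_a": 60, "bucket_b": 25, "bucket_c": 10, "bucket_d": 5})
print(top_bucket_coverage(hits, 2))  # 0.85
```

With these invented counts, fixing the two largest buckets would address 85% of the reported failures.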

  13. Watson's Server Farm

  14. Watson Bug Report Query

  15. Watson Tracking Data

  16. Watson Drill Down

  17. Agenda (same outline as slide 2)
