Monkeys in Lab Coats: Automating Failure Testing Research at

  1. Monkeys in Lab Coats: Automating Failure Testing Research at

  2. The whole is greater than the sum of its parts. - Aristotle [Metaphysics]

  3. The Professor vs The Practitioner. Peter Alvaro: Ex-Berkeley, Ex-Industry; Assistant Prof @ Santa Cruz; misses the calm of PhD life; likes prototyping stuff. Kolton Andrus: Ex-Netflix, Ex-Amazon; ‘Chaos’ Engineer; misses his actual pager; likes breaking stuff.

  4. Measures of Success. Academic: H-Index; grant warchest; department ranking. Industry: availability (e.g. 99.99% uptime); number of incidents; reduce operational burden.

  5. An Unlikely Team?

  6. Works Great! but ... it’s manual

  7. Surely there is a better way ...

  8. Free lunch?

  9. The End? (Academia + Industry)

  10. Let’s build it “Can we, pretty please?”

  11. Freedom and Responsibility Core Value

  12. Responsibility (Academic / Industry): prove that it works; show that it scales; find real bugs

  13. The Big Idea: Lineage-Driven Fault Injection

  14. What could possibly go wrong? Consider a computation involving 100 services. Search space: 2^100 executions

  15. “Depth” of bugs Single Faults Search Space: 100 executions

  16. “Depth” of bugs Combination of 4 faults Search Space: 3M executions

  17. “Depth” of bugs Combination of 7 faults Search Space: 16B executions
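
The search-space figures on slides 14 to 17 can be checked with a few lines of Python. This is a minimal sketch that assumes a "depth" of k faults means choosing k of the 100 services to fail, so that the count is a binomial coefficient; the name `executions_with_k_faults` is hypothetical and not from the talk. Under that assumption, depth 4 gives about 3.9 million combinations (the slide quotes 3M) and depth 7 gives about 16 billion.

```python
from math import comb

SERVICES = 100  # number of services, as stated on the slides


def executions_with_k_faults(k, n=SERVICES):
    """Number of ways to pick exactly k failing services out of n (assumed model)."""
    return comb(n, k)


for depth in (1, 4, 7):
    print(f"depth {depth}: {executions_with_k_faults(depth):,} executions")

# Every subset of services may fail, so the full space is 2^100.
print(f"all subsets: {2 ** SERVICES:,} executions")
```

Summing the binomial counts over every depth from 0 to 100 gives back 2^100, which is why exhaustive or random search over the whole space is hopeless while shallow depths remain tractable.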

  18. Random Search. Search space: 2^100 executions

  19. Engineer-guided Search Search Space: ???

  20. Fault-tolerance “is just” redundancy

  21. How do we find the redundancy? Could a bad ‘thing’ ever happen? Why did a good ‘thing’ happen?

  22. Lineage-driven fault injection. Why did a good thing happen? Consider its lineage. [Lineage graph nodes: The write is stable; Stored on RepA; Stored on RepB; Bcast1; Bcast2; Client; Client]

  23. Lineage-driven fault injection. Why did a good thing happen? Consider its lineage. What could have gone wrong? Faults are cuts in the lineage graph. Is there a cut that breaks all supports? [Lineage graph as on slide 22]

  24. Lineage-driven fault injection. Why did a good thing happen? Consider its lineage. What could have gone wrong? Faults are cuts in the lineage graph. Is there a cut that breaks all supports? [Lineage graph as on slide 22]

  25. What would have to go wrong? (RepA OR Bcast1) [Lineage graph as on slide 22]

  26. What would have to go wrong? (RepA OR Bcast1) AND (RepA OR Bcast2) [Lineage graph as on slide 22]

  27. What would have to go wrong? (RepA OR Bcast1) AND (RepA OR Bcast2) AND (RepB OR Bcast2) [Lineage graph as on slide 22]

  28. What would have to go wrong? (RepA OR Bcast1) AND (RepA OR Bcast2) AND (RepB OR Bcast2) AND (RepB OR Bcast1) [Lineage graph as on slide 22]

  29. Lineage-driven fault injection. Hypothesis: {Bcast1, Bcast2} [Lineage graph as on slide 22]
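
The step from the four-clause formula on slide 28 to the hypothesis on slide 29 can be reproduced by brute force. The sketch below is an illustration only: it enumerates candidate fault sets with `itertools.combinations` rather than calling a real solver, and the function name `minimal_fault_sets` is hypothetical. Each clause is read as "at least one of these events must fail to break this support".

```python
from itertools import combinations

# Clauses copied from slide 28; each clause is one support of "the write is stable".
CLAUSES = [
    {"RepA", "Bcast1"},
    {"RepA", "Bcast2"},
    {"RepB", "Bcast2"},
    {"RepB", "Bcast1"},
]


def minimal_fault_sets(clauses):
    """Return the minimal sets of faults that hit every clause (break every support)."""
    events = sorted(set().union(*clauses))
    found = []
    for size in range(1, len(events) + 1):
        for candidate in combinations(events, size):
            chosen = set(candidate)
            if any(prior <= chosen for prior in found):
                continue  # a smaller solution is already contained in this one
            if all(chosen & clause for clause in clauses):
                found.append(chosen)
    return found


HYPOTHESES = minimal_fault_sets(CLAUSES)
print(HYPOTHESES)
```

For these clauses the search yields two minimal candidates, {Bcast1, Bcast2} and {RepA, RepB}; the first is the hypothesis shown on slide 29.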

  30. Search Space Reduction. Each experiment finds a bug, OR reduces the search space

  31. The prototype system “Molly”. Recipe: 1. Start with a successful outcome. Work backwards. 2. Ask why it happened: Lineage. 3. Convert lineage to a boolean formula and solve. 4. Lather, rinse, repeat. [Cycle diagram labels: 1. Success; Why?; 2. Lineage; Encode; 3. CNF; Solve; Fail; 4. REPEAT]
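
The four-step recipe can be sketched as a small loop. This is a toy illustration under stated assumptions, not Molly's implementation: the system under test is a hypothetical function `toy_run` with a primary path and a fallback path, lineage is modelled as the set of support paths observed in a successful run, and the "solve" step is a brute-force hitting-set search instead of a real solver. All names (`ldfi`, `hitting_sets`, `toy_run`) are invented for this example.

```python
from itertools import combinations


def hitting_sets(paths, max_faults):
    """Yield fault sets, smallest first, that cut every known support path."""
    components = sorted(set().union(*paths))
    for size in range(1, max_faults + 1):
        for candidate in combinations(components, size):
            if all(set(candidate) & path for path in paths):
                yield frozenset(candidate)


def ldfi(run, max_faults):
    """Repeat: solve for a hypothesis, inject it, then learn new lineage or report a bug."""
    paths = set(run(frozenset()))  # step 1: start with a successful outcome
    tried = set()
    while True:
        # steps 2 and 3: encode the known lineage and solve for an untried hypothesis
        untried = (h for h in hitting_sets(paths, max_faults) if h not in tried)
        hypothesis = next(untried, None)
        if hypothesis is None:
            return None, len(tried)  # no hypothesis left within the fault budget
        tried.add(hypothesis)
        surviving = run(hypothesis)  # inject the faults and observe the outcome
        if not surviving:
            return hypothesis, len(tried)  # the outcome failed: a bug
        paths |= surviving  # step 4: new redundancy learned, repeat


def toy_run(faults):
    """Hypothetical system: use the primary path, else the fallback; return the path used."""
    primary = frozenset({"Bcast1", "RepA"})
    fallback = frozenset({"Bcast2", "RepB"})
    for path in (primary, fallback):
        if not (path & faults):
            return {path}
    return set()


BUG, EXPERIMENTS = ldfi(toy_run, max_faults=2)
print(sorted(BUG), EXPERIMENTS)
```

In this toy run the first experiment drops Bcast1 and succeeds through the fallback, which adds the fallback path to the known lineage; the second experiment drops both broadcasts and fails. Each experiment either finds a bug or shrinks the space of remaining hypotheses.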

  32. The Big Idea Meets Production

  33. Step 1: Start with a successful outcome. [Cycle diagram as on slide 31]

  34. What is success?

  35. “Start with the customer and work backwards” Leadership Principle

  36. Lesson 1 Work backwards from what you know

  37. Step 2: Ask why it happened. [Cycle diagram as on slide 31]

  38. Request Tracing

  39. Request Tracing

  40. Alternate Execution

  41. Evolution over time

  42. Redundancy through History

  43. Lesson 2 Meet in the middle

  44. Step 3: Solve. [Cycle diagram as on slide 31]

  45. A “small” matter of code

  46. Step 4: Lather, rinse, repeat. [Cycle diagram as on slide 31]

  47. Turn the crank, right?

  48. Idempotence

  49. Bins and Balls. [Diagram: requests r and r’ placed into bins Class 1, Class 2, Class 3, [...], Class n]

  50. Predicting Request Graphs Request Class n

  51. Predicting Request Graphs Request Class n Some function f: Requests → Classes

  52. Predicting Request Graphs. F(Request) = Class n

  53. Solve the Machine Learning problem? or the Failure Testing one?

  54. Simplest thing that will work?

  55. Falcor Path Mapping: ["bookmarks", "recent"], ["playlist", 0, "name"], ["ratings"] => "bookmarks,playlist,ratings"
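
One way to read the mapping on slide 55 is that a request's class is the comma-joined list of the first key of each Falcor path it asks for. The sketch below implements that reading; the function name `request_class` is hypothetical, and the deduplication and alphabetical sorting are assumptions made so that the same set of paths always yields the same label (the slide's single example is consistent with either input order or sorted order).

```python
def request_class(paths):
    """Map a list of Falcor-style paths to a class label built from each path's first key."""
    roots = {str(path[0]) for path in paths if path}  # ignore empty paths
    return ",".join(sorted(roots))


EXAMPLE = [["bookmarks", "recent"], ["playlist", 0, "name"], ["ratings"]]
LABEL = request_class(EXAMPLE)
print(LABEL)
```

A cheap deterministic function like this stands in for the learned function f: Requests → Classes from the earlier slides: requests with the same label are assumed to produce the same request graph.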

  56. Lesson 3 Adapt the theory to the reality

  57. Many moons passed...

  58. Does it work? YES!

  59. Case study: “Netflix AppBoot”. Services: ~100. Search space (executions): 2^100 (roughly 1,000,000,000,000,000,000,000,000,000,000). Experiments performed: 200. Critical bugs found: 11.

  60. Future Work: search prioritization; richer device metrics; richer lineage collection; request class creation; exploring temporal interleavings; better experiment selection

  61. Lessons: Work backwards from what you know; Meet in the middle; Adapt the theory to the reality

  62. Academia + Industry

  63. Academia + Industry Academia Industry

  64. Thank You! Peter Alvaro: @palvaro, palvaro@ucsc.edu. Kolton Andrus: @KoltonAndrus, kolton@gremlininc.com

  65. References ● Netflix Blog on ‘Automated Failure Testing’ http://techblog.netflix.com/2016/01/automated-failure-testing.html ● Netflix Blog on ‘Failure Injection Testing’ techblog.netflix.com/2014/10/fit-failure-injection-testing.html ● ‘Lineage Driven Fault Injection’ http://people.ucsc.edu/~palvaro/molly.pdf ● ‘Automating Failure Testing Research at Scale’ https://people.ucsc.edu/~palvaro/socc16.pdf

  66. Photo Credits ● http://etc.usf.edu/clipart/4000/4048/children_7_lg.gif ● http://cdn.c.photoshelter.com/img-get2/I0000MIN8fL0q8AA/fit=1000x750/taiwan-hiking-river-tracing-walking.jpg ● http://i.imgur.com/iWKad22.jpg ● https://blogs.endjin.com/2014/05/event-stream-manipulation-using-rx-part-2/ ● http://youpivot.com/category/features/ ● https://www.cloudave.com/33427/boards-need-evolve-time/ ● https://www.linkedin.com/pulse/amelia-packager-missing-data-imputation-ramprakash-veluchamy
