Go Green, Stay Green: Fixing the intermittent failures in your CI


  1. Go Green, Stay Green — Fixing the intermittent failures in your CI Greg Law, co-founder and CTO https://undo.io

  2. From 1990 - 2005 development hardly changed

  3. In the last ten years everything has changed. Test OK? What does this mean? 100% test coverage? (Obviously not.) 100% reliable test suite? Absolutely!

  4. The productivity vs quality tradeoff (chart axes: Productivity, Quality)

  5. The productivity vs quality tradeoff (chart axes: Productivity, Quality)

  6. Arithmetic lesson
     ● 50 tests, once per week. Half solid, half 99% reliable: 25 x 4 = 100 runs of the unreliable tests per month, at a 1% failure rate = 1 failure per month.
     ● 50,000 tests per hour. Half solid, half 99% reliable: 25,000 * 0.01 = 250 failures per hour; 250 * 24*7*4.333 = 182,000 per month.
     ● 50,000 tests per hour. 99% solid, 1% @ 99.9% reliable: 50,000 * 0.01 * 0.001 = 0.5 failures per hour, i.e. 1 failure every 2 hours; 168/2 = 84 failures per week; 84 * 4.333 = 364 per month.
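The slide's arithmetic can be reproduced in a few lines. This is a minimal sketch, not part of the original deck: the function name `expected_failures` and the constants are illustrative, and it assumes, as the slide does, that a month is about 4.333 weeks (4 weeks in the first scenario) and that failures are simply proportional to the number of runs.

```python
# Illustrative reconstruction of the "Arithmetic lesson" slide.
# All names here are hypothetical; only the numbers come from the slide.

HOURS_PER_MONTH = 24 * 7 * 52 / 12  # 168 hours/week * ~4.333 weeks = 728 hours


def expected_failures(runs: float, unreliable_fraction: float, failure_rate: float) -> float:
    """Expected number of spurious failures among `runs` test executions.

    `unreliable_fraction` is the share of tests that are not fully solid,
    `failure_rate` is how often one of those tests fails spuriously.
    """
    return runs * unreliable_fraction * failure_rate


# Scenario 1: 50 tests once per week, half of them 99% reliable.
# The slide rounds a month to 4 weeks here.
weekly_suite = expected_failures(50 * 4, 0.5, 0.01)

# Scenario 2: 50,000 tests per hour, half of them 99% reliable.
hourly_half_unreliable = expected_failures(50_000 * HOURS_PER_MONTH, 0.5, 0.01)

# Scenario 3: 50,000 tests per hour, only 1% unreliable, and those 99.9% reliable.
hourly_mostly_solid = expected_failures(50_000 * HOURS_PER_MONTH, 0.01, 0.001)

for label, value in [
    ("50 tests/week, half at 99%", weekly_suite),
    ("50,000 tests/hour, half at 99%", hourly_half_unreliable),
    ("50,000 tests/hour, 1% at 99.9%", hourly_mostly_solid),
]:
    print(f"{label}: about {value:,.0f} failures per month")
```

Even the third scenario, which is far more reliable per test than the first, produces hundreds of spurious failures a month purely because of volume.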

  7. The productivity vs quality tradeoff: go green, stay green (chart axes: Productivity, Quality)
     ● Going green is the hard bit.
     ● But it’s essential.
     Step 1: exclude flaky tests.
     Ever-growing backlog of tests that are flaky where no-one understands why.
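One way to picture "Step 1: exclude flaky tests" is a rerun-based triage that keeps only consistently passing tests in the gating suite. The sketch below is an assumption about how such a step might look, not the speaker's tooling: `classify`, `split_suite`, and the stand-in tests are all hypothetical names, and each test is modelled as a function returning `True` on pass.

```python
from collections.abc import Callable
from itertools import cycle


def classify(test: Callable[[], bool], runs: int = 20) -> str:
    """Rerun `test` and label it by whether its outcomes agree."""
    outcomes = {bool(test()) for _ in range(runs)}
    if outcomes == {True}:
        return "solid"
    if outcomes == {False}:
        return "broken"  # fails every time: a real, reproducible failure
    return "flaky"       # mixed outcomes: candidate for quarantine


def split_suite(tests: dict[str, Callable[[], bool]], runs: int = 20) -> dict[str, list[str]]:
    """Group test names into solid, broken, and flaky buckets."""
    buckets: dict[str, list[str]] = {"solid": [], "broken": [], "flaky": []}
    for name, test in tests.items():
        buckets[classify(test, runs)].append(name)
    return buckets


# Hypothetical stand-in tests; the intermittent one alternates pass/fail
# so that the demonstration itself is deterministic.
_alternating = cycle([True, False])
demo_suite = {
    "test_always_passes": lambda: True,
    "test_always_fails": lambda: False,
    "test_intermittent": lambda: next(_alternating),
}

buckets = split_suite(demo_suite, runs=4)
print(buckets)
```

A finite number of reruns will miss rare flakiness: a test that fails 0.1% of the time passes 20 reruns in a row about 98% of the time, which is part of why the quarantined backlog keeps growing.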

  8. CI/CD vision assumes reliable/repeatable testing - what to do?
     ● Write only deterministic tests? Not viable because deterministic tests are unable to catch non-deterministic errors (e.g. race conditions). Excludes fuzz testing and other powerful techniques.
     ● Remove the flaky tests? Viable, but has the obvious flaw of reducing coverage. The flaky tests are often the most interesting.
     ● Fix the flaky tests? Gee thanks, great advice(!)

  9. The intermittent test failures kill us. 1000s more tests every hour. Even a 0.1% failure rate is very bad news. Most of them probably don’t really matter. So we’ll come back to them later, it should be less hectic next week. Ever-growing backlog of tests that are flaky where no-one understands why.
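To see why even a 0.1% failure rate is bad news, consider the chance that a whole run comes back green. The sketch below assumes each test fails independently with the same probability, and the suite size of 1,000 is an illustrative figure rather than one from the slides; `probability_all_green` is a hypothetical helper name.

```python
def probability_all_green(num_tests: int, failure_rate: float) -> float:
    """Chance that every test passes, assuming independent spurious failures."""
    return (1.0 - failure_rate) ** num_tests


# Assumed, illustrative suite: 1,000 tests, each failing spuriously 0.1% of the time.
num_tests = 1_000
failure_rate = 0.001

p_green = probability_all_green(num_tests, failure_rate)
expected_red_tests = num_tests * failure_rate

print(f"chance of a fully green run: {p_green:.1%}")
print(f"expected spurious failures per run: {expected_red_tests:.1f}")
```

Under these assumptions only a little over a third of runs are fully green even though nothing is actually broken, so a red build stops carrying information.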

  10. Continuous Integration Stress Testing of SAP HANA
      • SAP HANA as an enterprise-class, in-memory database management system
      • OLTP and OLAP, relational and NoSQL functionality in a single system
      • Complex codebase
      • Very strict quality and governance processes
      • Sophisticated continuous integration platform
      • Large functional and performance test harness (see Rehmann@RDSS 2014)
      • “Regular” tests plus highly parallel, multi-user stress tests (PMUT)
      • Arbitrary database operations (DML, DDL, etc) in parallel
      • High amount of stress for system resources
      • Complements other tests with explorative/non-deterministic testing
      • Similar approaches with other systems (“chaos monkey”)

  11. Software Flight Recording Technology
      Record the program’s execution; replay and freeze-frame at any time; single-step backwards and forwards.
      Find out why the program made the decisions it did.

  12. The solution
      • SAP uses Live Recorder from Undo to record multi-user stress test (PMUT) runs
      • When a failure occurs the recording is kept and handed over to developers to diagnose
      • Turns the sporadic problem into a 100% reproducible one
      • SAP developers use Live Recorder’s interactive reversible debugger – UndoDB – on the recording to diagnose the root cause of the problem

  13. Captured in test and diagnosed with Live Recorder
      ● A number of sporadic memory leaks and memory corruption defects
      ● Several issues in the networking code, including the incorrect flushing of a receive buffer and sporadically releasing channels in cases of timeout, resulted in queries incorrectly aborting
      ● Incorrect parallel access to a shared data-structure which resulted in very subtle sporadic problems which were hard to reproduce
      ● Very sporadic race condition in SAP HANA’s asynchronous garbage collection for in-memory table structures during table unloads under heavy system load
      ● A race condition in SAP HANA’s transaction management cache with the potential of incorrectly reusing cached session data

  14. Questions @gregthelaw https://undo.io/resources/gdb-watchpoint/
