Illusions of Certainty: What the brain can teach us about software engineering
Julie Pitt, Co-founder, Order of Magnitude Labs (@yakticus)
relevant links found here: github.com/yakticus/IllusionsOfCertainty
“For the things we have to learn before we can do them, we learn by doing them.” ― Aristotle, The Nicomachean Ethics
today we will discuss
➔ a BIG reason why software projects are unpredictable
➔ how to help computers better understand what we mean
➔ how to make our software systems more resilient
➔ how to better understand what our software systems are doing
life
life: a generative model with an interface to the world (diagram labels: the world, senses, generative model, action)
survival
working working not working
software as a generative model (diagram labels: the world, input, the code, output)
software as a generative model (diagram labels: the world, input, infinite precision, the code, output)
misjudging uncertainty in software (diagram labels: perception, reality)
human precision: don’t hurt people; be nice to people; don’t kill humans; keep humans alive; respect human life
machine precision: don’t hurt people; be nice to people; don’t kill humans; keep humans alive; respect human life
the cliffs of infinite precision (diagram labels: the happy path, utterly broken)
how do we get to this? optimal resilience degraded
ways we can cheat
➔ property tests
➔ remedy-first design
➔ build intuitive insight
property tests
test suite as a generative model (diagram labels: test suite, system under test, x, y)
individual test cases are often too precise (diagram labels: desired behavior, tests (“training examples”), software system, state space)
testing an addition function: F# example (diagram labels: state space, ✅ test passes) credit: http://fsharpforfunandprofit.com/posts/property-based-testing/
overfitting to tests (diagram labels: state space, bug) credit: http://fsharpforfunandprofit.com/posts/property-based-testing/
property tests combat overfitting (diagram labels: state space, bug) credit: http://fsharpforfunandprofit.com/posts/property-based-testing/
property tests: let’s review
- test suites are generative models
- describe the properties of your system
- requires less precision
- test the properties
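The overfitting argument above can be sketched in a few lines. This is a hand-rolled illustration in Python, not the F# code from the credited post: `overfit_add` is a hypothetical implementation written only to satisfy two example tests, and the three properties (commutativity, zero as identity, adding 1 twice equals adding 2) are assumed here as reasonable properties of addition.

```python
import random

def add(x, y):
    """The implementation we actually want."""
    return x + y

def overfit_add(x, y):
    """Hypothetical implementation that memorizes the example tests."""
    if (x, y) == (1, 2):
        return 3
    if (x, y) == (2, 2):
        return 4
    return 0

def example_tests_pass(f):
    # Two precise "training examples" from the state space.
    return f(1, 2) == 3 and f(2, 2) == 4

def properties_hold(f, trials=200, seed=0):
    # Check properties on random points instead of hand-picked ones.
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-1000, 1000)
        y = rng.randint(-1000, 1000)
        if f(x, y) != f(y, x):          # order of arguments is irrelevant
            return False
        if f(x, 0) != x:                # adding zero changes nothing
            return False
        if f(f(x, 1), 1) != f(x, 2):    # adding 1 twice equals adding 2
            return False
    return True

if __name__ == "__main__":
    for f in (add, overfit_add):
        print(f.__name__, example_tests_pass(f), properties_hold(f))
```

Both implementations pass the example tests, but only `add` satisfies the properties. A property-testing library would add input shrinking and smarter generators; the loop above only shows the idea.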
remedy-first design
RESTful service: client falls off cliff
input: GET /api/metadata/12345
output: { “status”: “failure”, “error”: { “errorCode”: 234, “description”: “database timeout” } }
each error has a precise cause: endpoint moved; read timeout; connection pool exhausted; failover endpoints expired; connect timeout; token expired; user error; key rotation; account problem; credentials revoked; insufficient permissions
remedies are imprecise: the same causes are grouped under four remedies: REDIRECT, RETRY, RE-AUTHENTICATE, DISPLAY ERROR
remedy tells the client how to ease pain
{ “status”: “failure”, “failure”: { “action”: “RETRY”, “error”: { “errorCode”: 234, “description”: “database timeout” } } }
(“action” is the remedy: actionable)
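A client that branches on the remedy rather than on the error code might look like the sketch below. The field names `status`, `failure`, and `action` come from the slide; everything else is an assumption for illustration: the exact spelling of the remedy strings other than `RETRY`, the `"status": "success"` value, the fallback to `DISPLAY ERROR` for unknown remedies, and the hypothetical `send` callable.

```python
import json

# Remedy names as shown on the slides; spelling on the wire is assumed.
KNOWN_REMEDIES = {"REDIRECT", "RETRY", "RE-AUTHENTICATE", "DISPLAY ERROR"}

def choose_remedy(response_text):
    """Return the remedy to apply, or None if the call succeeded."""
    body = json.loads(response_text)
    if body.get("status") != "failure":
        return None
    action = body.get("failure", {}).get("action")
    # Assumed policy: an unknown or missing remedy is shown to the user.
    return action if action in KNOWN_REMEDIES else "DISPLAY ERROR"

def call_with_remedies(send, max_retries=3):
    """Call send() (a hypothetical function returning the response text)
    and retry only while the service says RETRY."""
    text = None
    for _ in range(max_retries + 1):
        text = send()
        remedy = choose_remedy(text)
        if remedy != "RETRY":
            return remedy, text
    # Retries exhausted: stop retrying and surface the error.
    return "DISPLAY ERROR", text

if __name__ == "__main__":
    sample = ('{"status": "failure", "failure": {"action": "RETRY", '
              '"error": {"errorCode": 234, "description": "database timeout"}}}')
    print(choose_remedy(sample))
```

The client never inspects `errorCode`. The precise cause stays available for logging, while the control flow depends only on the small set of remedies.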
What about failures that weren’t predicted?
failure comes in many forms: AWS outage - 2012/10/22
-> DNS change didn’t propagate
-> indirectly triggered a latent memory leak
-> insufficient alerting; failovers happened too little, too late
-> API throttling affected some customers more than others
-> many popular internet services down for hours
failure comes in many forms: AWS scheduled maintenance - 2014/09/25
-> time-sensitive security update on 10% of EC2 nodes
-> required reboot of those nodes
-> possible impact to customer applications running on those nodes
failure comes in many forms: AWS DynamoDB outage - 2015/09/20
-> DynamoDB failed in us-east-1 region
-> dozens of dependent services also failed
-> many prominent internet services were taken down for hours
Netflix was prepared
meet simian army
- OSS project by Netflix
- deliberately causes failures in a controlled manner
- e.g., randomly takes down AWS EC2 nodes, a datacenter, or a region
- validates whether the system handles failure
simian army -> cultural change
- failure is the norm
- simulates the nature of failure and not the cause
- we can’t predict all causes of failure
remedy-first design: let’s review
- design with remedies in mind
  - # remedies << # causes
- test resilience during business hours
  - find out what you’re up against when wide awake
- use a tool that is agnostic to causes
  - e.g., simian army
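A cause-agnostic resilience test can be sketched as a toy simulation. This is not Simian Army's API: `Node`, `Client`, `chaos_round`, and `survives_chaos` are hypothetical names, and the "cluster" is a list of in-memory objects. The point is that the test kills nodes at random without modeling why they died, and the client applies one remedy (try the next node) whatever the cause.

```python
import random

class Node:
    """Hypothetical service node that can be taken down."""
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"

class Client:
    """Client whose only remedy is failing over to the next node."""
    def __init__(self, nodes):
        self.nodes = nodes

    def get(self, request):
        for node in self.nodes:
            try:
                return node.handle(request)
            except ConnectionError:
                continue  # same remedy regardless of the cause
        raise RuntimeError("all nodes failed")

def chaos_round(nodes, rng):
    """Take down one randomly chosen live node."""
    victim = rng.choice([n for n in nodes if n.alive])
    victim.alive = False
    return victim

def survives_chaos(node_count=3, kills=2, seed=0):
    """Kill nodes one at a time; report whether requests keep succeeding."""
    rng = random.Random(seed)
    nodes = [Node(f"node-{i}") for i in range(node_count)]
    client = Client(nodes)
    for _ in range(kills):
        chaos_round(nodes, rng)
        try:
            client.get("GET /api/metadata/12345")
        except RuntimeError:
            return False
    return True

if __name__ == "__main__":
    print(survives_chaos(node_count=3, kills=2))
    print(survives_chaos(node_count=2, kills=2))
```

With three nodes the system survives two kills; with two nodes it does not, which is the kind of limit you would rather discover during business hours.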
intuitive feedback
is it working?
logs: easy to produce
logs: hard to consume
charts
charts: easier to consume, but still hard
we want the whole picture
solution: leverage our intuition
thought experiment What if your software system’s interactions sounded like cars on the road?
intuitive feedback: let’s review
- humans want to know “is it working?”
- the tools of today inhibit us from seeing the big picture
- we need tools that leverage our intuition
  - e.g., vizceral & TBD
conclusion
curiosity-driven tests (diagram labels: system under test, senses, test agent (neural network), action)
mapping the state space through exploration: begin testing random states without expectations (diagram: state space)
mapping the state space through exploration: gradually build a model containing expectations (diagram: state space)
mapping the state space through exploration: model capable of recognizing anomalies (diagram: state space)
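The three exploration steps above can be illustrated with a deliberately crude stand-in. The slides propose a neural-network test agent; the sketch below swaps in a lookup table of observed output ranges, which is an assumption made to keep the example short. `system_under_test`, `ExplorationModel`, and `explore` are hypothetical names.

```python
import random

def system_under_test(x):
    """Hypothetical system whose output depends on its input state."""
    return 2 * x + 1

class ExplorationModel:
    """Table of observed output ranges per input bucket (not a neural net)."""
    def __init__(self, bucket_size=10):
        self.bucket_size = bucket_size
        self.ranges = {}  # bucket index -> (lowest output, highest output)

    def _bucket(self, x):
        return x // self.bucket_size

    def observe(self, x, y):
        # Step 2: gradually build expectations from what was seen.
        b = self._bucket(x)
        lo, hi = self.ranges.get(b, (y, y))
        self.ranges[b] = (min(lo, y), max(hi, y))

    def is_anomaly(self, x, y):
        # Step 3: flag outputs outside what exploration has seen.
        b = self._bucket(x)
        if b not in self.ranges:
            return False  # unexplored region: no expectation yet
        lo, hi = self.ranges[b]
        return not (lo <= y <= hi)

def explore(system, model, trials=2000, seed=0):
    # Step 1: test random states without expectations.
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(0, 99)
        model.observe(x, system(x))

if __name__ == "__main__":
    model = ExplorationModel()
    explore(system_under_test, model)
    print(model.is_anomaly(5, system_under_test(5)))
    print(model.is_anomaly(5, 500))
```

No test case states what the output for input 5 should be. The expectation comes entirely from exploration, and an output far outside the observed range is reported as an anomaly.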
self-healing systems (diagram labels: senses: telemetry; ops agent (neural network); action: deployment, scaling, failover, etc.)
let’s review working working not working
goal: change the landscape
the end.
links github.com/yakticus/IllusionsOfCertainty