Apps Behaving Badly • Michael Brunton-Spall • Lisa van Gelder
The inevitability of failure • Systems will fail • Architect for failure
System independence • Each system should cope on its own • Some systems are critical • Redundancy where necessary • This is not “Scaling”
[Architecture diagram: Core CMS, Discussion, Zeitgeist and MPs Expenses as separate systems; labels include Apache, Django, Java, GAE, DB and EC2]
When it all goes wrong
Apply fences • Remove misbehaving servers from load balancers • Turn off expensive features • Make your site go faster at the expense of dynamic content
Don’t start with root cause analysis • You don’t need to know what went wrong • Fix the symptoms first • Then work out the cause
Causation analysis for fun and profit • Devs and Ops are good at guessing • Devs and Ops are bad at guessing correctly
How to analyse a failure • Loosely based on “Analysis of Competing Hypotheses” • Developed at the CIA
Hypothesis testing • Hard to prove causation • Easy to prove non-correlation • Look for evidence that a hypothesis is false
Generate lots of hypotheses
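The disproof-first idea can be sketched as a small evidence matrix in the spirit of Analysis of Competing Hypotheses: each piece of evidence is recorded against the hypotheses it contradicts, and hypotheses are ranked by how much evidence counts against them. The hypotheses and evidence below are invented for illustration and do not come from a real incident.

```python
# Hypothetical incident data: for each piece of evidence, the set of
# hypotheses it is INCONSISTENT with (that is, evidence against them).
EVIDENCE_AGAINST = {
    "db query times normal in app log": {"slow database"},
    "only one server affected": {"slow database", "traffic surge"},
    "long GC pauses in JVM log": set(),
}
HYPOTHESES = ["slow database", "traffic surge", "GC thrashing"]


def rank_hypotheses(hypotheses, evidence_against):
    """Return (hypothesis, contradictions) pairs, fewest contradictions first."""
    contradictions = {hypothesis: 0 for hypothesis in hypotheses}
    for rejected in evidence_against.values():
        for hypothesis in rejected:
            contradictions[hypothesis] += 1
    # sorted() is stable, so ties keep the order the hypotheses were listed in.
    return sorted(contradictions.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for hypothesis, count in rank_hypotheses(HYPOTHESES, EVIDENCE_AGAINST):
        print(f"{count} contradiction(s): {hypothesis}")
```

The hypothesis with the fewest contradictions is not proven; it is simply the one that has survived the evidence gathered so far.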
How do you get the proof?
Allocate Priorities and Staff
Logs, Logs, Logs, Logs • Trigger a stack dump on hanging servers • Back up / copy logs of the affected server • JVM log • Stdout • Application log
Stack traces, heap dumps, core dumps • Get as much info as possible • Heap dumps can take a long time, so only if necessary
Log analysis is your friend • Simple tools for a simple life • Grep, Cut, Uniq, Sort • Find the bit of log you are interested in • Calculate duration and order by slowest • Sed, Awk
zgrep "RequestLoggingFilter - Request for.*completed in " "$LOGFILE" | grep -v " /management/" | cut -d" " -f1,2,3,10,13 > "$COMPLETED_REQUESTS_FILE"
cut -d" " -f5 "$COMPLETED_REQUESTS_FILE" | sort -nr | uniq -c | awk '{ SUM += $1; print $2, SUM }' > "$CUMULATIVE_REQUESTS_AT_OR_ABOVE_RESPONSE_TIME_FILE"
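The last stage of that pipeline (sort -nr, uniq -c, then a running sum in awk) answers the question "how many requests took at least this long?". A minimal Python sketch of the same calculation follows. The function name and the sample durations are assumptions for illustration; the real input would be field 5 of the completed-requests file.

```python
from collections import Counter


def cumulative_at_or_above(durations):
    """Pair each distinct duration with the number of requests that took
    at least that long, slowest first (mirrors sort -nr | uniq -c | awk)."""
    counts = Counter(durations)
    running_total = 0
    result = []
    for duration in sorted(counts, reverse=True):
        running_total += counts[duration]
        result.append((duration, running_total))
    return result


if __name__ == "__main__":
    # Hypothetical durations, not taken from a real log.
    sample = [120, 45, 120, 3000, 45, 45]
    for duration, total in cumulative_at_or_above(sample):
        print(duration, total)
```

Reading the output from the top shows the slow tail first, which is usually where an incident investigation wants to start.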
Write what you need • Log Analyser • MySQL database • Parses application logs • Can now query database • What DB calls does this URL make? • What URLs make this DB call?
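A log analyser of this kind can be sketched in a few lines. This is not the Guardian's tool: it swaps MySQL for Python's built-in sqlite3 so the example is self-contained, and the log line format (url=... dbcall=...), the table name and the function names are all hypothetical, since the slides do not show the real application log format.

```python
import re
import sqlite3

# Hypothetical log line format; the real application log format is not
# given in the slides.
LINE = re.compile(r"url=(?P<url>\S+) dbcall=(?P<dbcall>\S+)")


def load_log(conn, lines):
    """Parse log lines and store one (url, dbcall) row per matching line."""
    conn.execute("CREATE TABLE IF NOT EXISTS calls (url TEXT, dbcall TEXT)")
    rows = [m.group("url", "dbcall") for m in map(LINE.search, lines) if m]
    conn.executemany("INSERT INTO calls VALUES (?, ?)", rows)


def dbcalls_for_url(conn, url):
    """What DB calls does this URL make?"""
    cursor = conn.execute(
        "SELECT DISTINCT dbcall FROM calls WHERE url = ? ORDER BY dbcall", (url,)
    )
    return [row[0] for row in cursor]


def urls_for_dbcall(conn, dbcall):
    """What URLs make this DB call?"""
    cursor = conn.execute(
        "SELECT DISTINCT url FROM calls WHERE dbcall = ? ORDER BY url", (dbcall,)
    )
    return [row[0] for row in cursor]
```

Once the logs are in a table, both questions from the slide become one-line queries, and new questions do not need a new grep pipeline.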
It’s everybody’s responsibility • Accessing logs • Database analytics • Building tools to help
Do it ASAP before it happens again. • Crack team starts analysis within minutes if possible • Sometimes crack team is just 1 person
Preventing Emergencies
Core systems vs Periphery systems • Core systems must be reliable and up • Periphery systems may be down • But preferably are not!
What is a microapp? • A periphery system • Can be released in isolation • Can be less reliable • Can be less performant • Subject to a timeout • Its components collapse on failure
Microapps • How we create separation of systems • Similar to SSIs - HTML placeholders • Powered by HTTP • Load balancers, Proxies, Caching
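The placeholder, timeout and collapse behaviour can be sketched as follows. The placeholder syntax, the one-second timeout and the function names are assumptions for illustration; the slides only say that microapps are HTML placeholders filled over HTTP.

```python
import re
import urllib.request

# Hypothetical placeholder syntax; the slides only say "HTML placeholders".
PLACEHOLDER = re.compile(r'<!--\s*microapp\s+src="([^"]+)"\s*-->')


def fetch_component(url, timeout_seconds=1.0):
    """Fetch a microapp fragment over HTTP; return '' (collapse) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read().decode("utf-8")
    except Exception:
        # A periphery system must never break the core page, so any error
        # (timeout, refused connection, bad response) collapses the component.
        return ""


def render(page_html, fetch=fetch_component):
    """Replace each placeholder with its fragment, or nothing if it failed."""
    return PLACEHOLDER.sub(lambda match: fetch(match.group(1)), page_html)
```

Passing the fetch function in as a parameter keeps the core page logic testable without a network, and is also the natural place to put a cache in front of the microapp.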
Switches
Feature switches • Turn on or off features as necessary • HTTP URLs to expose switches • POST not GET • Switch dashboard to see status
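A per-server switch endpoint might look like the sketch below, using only Python's standard library. The switch names, the /switch/<name> path and the status=on|off form field are assumptions chosen for illustration; state changes only on POST, and GET returns a plain-text dashboard.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

# Per-server, in-memory switch state. Switch names are hypothetical.
SWITCHES = {"comments": True, "related-content": True}


def set_switch(name, status):
    """Apply a status of 'on' or 'off'; return False for unknown input."""
    if name not in SWITCHES or status not in ("on", "off"):
        return False
    SWITCHES[name] = status == "on"
    return True


class SwitchHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expects POST /switch/<name> with a form body such as status=off.
        name = self.path.rsplit("/", 1)[-1]
        length = int(self.headers.get("Content-Length", 0))
        form = parse_qs(self.rfile.read(length).decode("utf-8"))
        status = form.get("status", [""])[0]
        self.send_response(200 if set_switch(name, status) else 400)
        self.end_headers()

    def do_GET(self):
        # Read-only dashboard view; GET never changes state.
        body = "\n".join(
            f"{name}: {'on' if enabled else 'off'}"
            for name, enabled in sorted(SWITCHES.items())
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8080):
    """Run the switch endpoint; the host and port are placeholders."""
    HTTPServer((host, port), SwitchHandler).serve_forever()
```

Keeping state changes on POST means crawlers, link prefetchers and browser history cannot flip a switch by accident.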
Per server or global? • Global requires shared state • Global lets you flick switch once for all servers • Per server is less complex • Lets you turn a feature on for a single server
Simple tools for simple tasks • for x in 01 02 03 04; do curl -d status=off http://server$x/switch/x; done • Now you have global switches :) • As compared to using ZooKeeper
Switchable Microapps • Ability to turn off an entire microapp • Collapse all relevant components • Helpful if microapp is slow
Responsibility and Authority • Do not need to get “approval” to turn off any microapp • Operations team can make judgement calls • Need to ensure app can be brought back ASAP
Emergency Mode
Emergency Mode • Rendering a page takes time • As a news site we have unexpected surges in traffic • We need to be able to trade off dynamic pages for speed • Often one page gets sudden heavy traffic
Page Pressing • Emergency mode needs a bit more oomph • Not just an in-memory cache, but a full page cache • Stored on disk as generated HTML • Served as static files, therefore over 1200 pps
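Page pressing amounts to writing the fully generated HTML to disk under a path derived from the URL, so that the web server can serve it as a static file. The sketch below is one possible layout, not the Guardian's implementation; the function names and the ".html" naming scheme are assumptions.

```python
import tempfile
from pathlib import Path


def pressed_path(root, url_path):
    """Map a URL path such as /world/story to a file under the cache root."""
    relative = url_path.strip("/") or "index"
    root = Path(root).resolve()
    target = (root / f"{relative}.html").resolve()
    if root not in target.parents:
        raise ValueError(f"path escapes page cache: {url_path}")
    return target


def press_page(root, url_path, html):
    """Write fully generated HTML to disk so it can be served statically."""
    target = pressed_path(root, url_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    temporary = target.with_name(target.name + ".tmp")
    temporary.write_text(html, encoding="utf-8")
    temporary.replace(target)  # rename into place so readers never see a partial page
    return target


def read_pressed(root, url_path):
    """Return the pressed HTML, or None if the page has not been pressed."""
    target = pressed_path(root, url_path)
    return target.read_text(encoding="utf-8") if target.is_file() else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_root:
        press_page(cache_root, "/world/story", "<html>pressed</html>")
        print(read_pressed(cache_root, "/world/story"))
```

In practice the static files would be served directly by the web server or load balancer rather than read back through application code; read_pressed is only there to make the sketch easy to check.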
Really cache everything • HTML page is fully generated • Except for microapps • Emergency mode for CMS doesn’t affect microapps • Microapp Cache for microapps
Caching an infinite set • There are lots of pages on the Guardian site • 1.4 million pieces of content • 25,000 keyword pages • http://www.guardian.co.uk/travel/france+travel/skiing • Can’t cache them all
Cache what’s important • Content - when modified • including during emergency mode • Navigation - Every 2 weeks • can force page press • Automatic (e.g. tag combiners) - Never • Automatic but important - Every 2 weeks
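The pressing rules above can be written down as a small decision function. The page kind names and the function signature are assumptions for illustration; only the intervals (on modification, every 2 weeks, never) come from the slide.

```python
from datetime import datetime, timedelta

# Re-press intervals taken from the slide; None means "never on a schedule".
# The kind names themselves are hypothetical labels.
REPRESS_INTERVAL = {
    "navigation": timedelta(weeks=2),
    "automatic": None,
    "automatic-important": timedelta(weeks=2),
}


def needs_press(kind, last_pressed, now, modified_at=None):
    """Decide whether a page should be (re)pressed to the page cache."""
    if kind == "content":
        # Content is pressed when modified, including during emergency mode.
        return last_pressed is None or (
            modified_at is not None and modified_at > last_pressed
        )
    interval = REPRESS_INTERVAL[kind]
    if interval is None:
        return False
    return last_pressed is None or now - last_pressed >= interval
```

A forced page press, as mentioned for navigation pages, would simply bypass this check and press the page regardless.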
Monitoring • Or how do I know what to turn off?
Always provide stats • Consistent format • Aggregate stats at each level
Indicate where issues are • Check high up in architecture first • Indicates what is causing the problem • Breakdown to next level
Automatic switches • Release valves • Emergency mode • Database off mode
Switch if a threshold is met • If average page response time is higher than threshold • Reset after timeout (say 60 seconds) • Prevents Ping-Pong of switches • Really handy for GC issues, Network issues etc.
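A threshold switch with a hold period can be sketched as below. The class name, the millisecond units and the injectable clock are assumptions for illustration; the 60 second default follows the slide. Once tripped, the switch stays tripped for the whole hold period regardless of later readings, which is what prevents ping-pong.

```python
import time


class AutomaticSwitch:
    """Trips when the average response time exceeds a threshold and stays
    tripped for a hold period, so it cannot flap on and off."""

    def __init__(self, threshold_ms, hold_seconds=60.0, clock=time.monotonic):
        self.threshold_ms = threshold_ms
        self.hold_seconds = hold_seconds
        self.clock = clock
        self.tripped_at = None

    def record(self, average_response_ms):
        """Feed in the latest average page response time."""
        if average_response_ms > self.threshold_ms and not self.is_tripped():
            self.tripped_at = self.clock()

    def is_tripped(self):
        """True while the switch is within its hold period."""
        if self.tripped_at is None:
            return False
        if self.clock() - self.tripped_at >= self.hold_seconds:
            self.tripped_at = None  # reset after the timeout
            return False
        return True
```

If the site is still slow when the hold period ends, the next reading above the threshold trips the switch again, so a short GC pause costs about a minute of emergency mode while a sustained problem keeps it on.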
Summary
Summary • Expect Failure • Plan for failure • At 4am • Keep it simple • Keep everything independent
Summary • When it does go wrong • Fix the symptoms first • Then find out what actually went wrong • Start straight away • Log everything, all the time
Thank You • Michael Brunton-Spall • michael.brunton-spall@guardian.co.uk • @bruntonspall • Lisa van Gelder • lisa.van-gelder@guardian.co.uk • @techbint
Giant Furry Rat - “Lost land of the Volcano” courtesy of BBC natural history unit • Panic Button - http://www.flickr.com/photos/trancemist/361935363/ • Long Meg Sidings - http://www.flickr.com/photos/ingythewingy/5243875486/ • Server Rack - http://www.flickr.com/photos/jamisonjudd/2433102356/ • Release Valve - http://www.flickr.com/photos/kayveeinc/4107697872 • Ancient Planet - http://www.flickr.com/photos/gsfc/4479185727/ • Solar system - http://www.flickr.com/photos/gsfc/4479185727 • Gauges - http://www.flickr.com/photos/dgoodphoto/5264024028 • Logs - http://www.flickr.com/photos/catzrule/5693655199 • Higgs boson - http://www.flickr.com/photos/jurvetson/4233962874 • Toolbox - http://www.flickr.com/photos/jrhode/4632887921 • Don’t Panic sign used with permission • Guardian Team used with permission