Apps Behaving Badly Michael Brunton-Spall Lisa van Gelder
The inevitability of failure • Systems will fail • Architect for failure
System independence • Each system should cope on its own • Some systems are critical • Redundancy where necessary • This is not “Scaling”
[Architecture diagram. Systems: Core CMS, Discussion, Zeitgeist, MPs Expenses. Component labels: Apache, Django, Java, GAE, DB, EC2]
When it all goes wrong
Apply fences • Remove misbehaving servers from load balancers • Turn off expensive features • Make your site go faster at the expense of dynamic content
Don’t start with root cause analysis • You don’t need to know what went wrong • Fix the symptoms first • Then work out the cause
Causation analysis for fun and profit • Devs and Ops are good at guessing • Devs and Ops are bad at guessing correctly
How to analyse a failure • Loosely based on “Analysis of Competing Hypotheses” • Written for the CIA
Hypothesis testing • Hard to prove causation • Easy to prove non-correlation • Look for evidence that a hypothesis is false
Generate lots of hypotheses
How do you get the proof?
Allocate Priorities and Staff
Logs, Logs, Logs, Logs • Trigger a stack dump on hanging servers • Back up / copy the logs of the affected server • JVM log • Stdout • Application log
Stack traces, heap dumps, core dumps • Get as much info as possible • Heap dumps can take a long time, so take them only if necessary
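The log-gathering steps above can be sketched as a small shell function. This is only an illustration under stated assumptions: `snapshot_logs`, the destination directory, and `JVM_PID` are hypothetical names, not the team's actual tooling. The commented `kill -QUIT` line relies on standard JVM behaviour: a JVM that receives SIGQUIT writes a thread dump to its stdout.

```shell
# snapshot_logs DEST_DIR FILE...
# Copies each existing log file into DEST_DIR, preserving timestamps,
# so the evidence survives log rotation or a server restart.
# (Function name and arguments are hypothetical.)
snapshot_logs() {
  dest=$1
  shift
  mkdir -p "$dest" || return 1
  for log_file in "$@"; do
    if [ -f "$log_file" ]; then
      cp -p "$log_file" "$dest/" || return 1
    fi
  done
  return 0
}

# Ask a hanging JVM for a thread dump before copying its stdout log.
# JVM_PID is a placeholder for the process id of the affected server.
# kill -QUIT "$JVM_PID"
```

Files that do not exist are skipped rather than treated as errors, since not every server will have every log.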
Log analysis is your friend • Simple tools for a simple life • Grep, Cut, Uniq, Sort • Find the bit of log you are interested in • Calculate duration and order by slowest • Sed, Awk
zgrep "RequestLoggingFilter - Request for.*completed in " "$LOGFILE" | grep -v " /management/" | cut -d" " -f1,2,3,10,13 > "$COMPLETED_REQUESTS_FILE"
cat "$COMPLETED_REQUESTS_FILE" | cut -d " " -f5 | sort -nr | uniq -c | awk '{ SUM += $1; print $2, SUM }' > "$CUMULATIVE_REQUESTS_AT_OR_ABOVE_RESPONSE_TIME_FILE"
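To see what the second pipeline computes, here is the same `sort | uniq -c | awk` chain wrapped in a function and fed a handful of made-up response times. The function name and the sample numbers are assumptions for illustration; the pipeline itself is the one from the slide.

```shell
# cumulative_at_or_above: reads one response time per line on stdin and
# prints "<time> <number of requests that took at least that long>",
# slowest first. (Function name is hypothetical.)
cumulative_at_or_above() {
  sort -nr | uniq -c | awk '{ SUM += $1; print $2, SUM }'
}

# Made-up sample: six requests with three distinct response times.
printf '%s\n' 120 45 120 300 45 45 | cumulative_at_or_above
# prints:
# 300 1
# 120 3
# 45 6
```

Reading the output bottom-up gives the tail of the distribution: one request took 300 or more, three took 120 or more, and all six took 45 or more.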
Write what you need • Log Analyser • MySQL database • Parses application logs • Can now query database • What DB calls does this URL make? • What URLs make this DB call?
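The two questions on this slide can be illustrated without a database. The sketch below swaps the MySQL store for a flat file and assumes the logs have already been parsed into one "URL, tab, DB call" pair per line; that file layout and both function names are hypothetical.

```shell
# Assumed input: one "URL<TAB>DB call" pair per line, already extracted
# from the application logs. This layout is hypothetical.

# calls_for_url URL FILE: which DB calls does this URL make?
calls_for_url() {
  awk -F'\t' -v wanted="$1" '$1 == wanted { print $2 }' "$2" | sort -u
}

# urls_for_call DB_CALL FILE: which URLs make this DB call?
urls_for_call() {
  awk -F'\t' -v wanted="$1" '$2 == wanted { print $1 }' "$2" | sort -u
}
```

Once the data outgrows a flat file, the same two lookups map naturally onto indexed queries in a real database, which is the approach the slide describes.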
It’s everybody’s responsibility • Accessing logs • Database analytics • Building tools to help
Do it ASAP before it happens again. • Crack team starts analysis within minutes if possible • Sometimes crack team is just 1 person
Preventing Emergencies
Core systems vs Periphery systems • Core systems must be reliable and up • Periphery systems may be down • But preferably are not!
What is a microapp? • A periphery system • Can be released in isolation • Can be less reliable • Can be less performant • Protected by a timeout • Components collapse on failure
Microapps • How we create separation of systems • Similar to SSIs (server-side includes) - HTML placeholders • Powered by HTTP • Load balancers, Proxies, Caching
Switches
Feature switches • Turn on or off features as necessary • HTTP URLs to expose switches • POST not GET • Switch dashboard to see status
Per server or global? • Global requires shared state • Global lets you flick switch once for all servers • Per server is less complex • Lets you turn a feature on for a single server
Simple tools for simple tasks • for x in 01 02 03 04; do curl -d status=off http://server$x/switch/x; done • Now you have global switches :) • As compared to using ZooKeeper
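The one-line loop gives no feedback if a server fails to switch. A slightly more defensive sketch is shown below; it keeps the slide's `/switch/<name>` path and `status` field, while the function name, the 5-second timeout, and the `CURL` override (there only so the function can be exercised without a network) are assumptions.

```shell
# flip_switch_everywhere SWITCH_NAME STATUS HOST...
# POSTs status=STATUS to http://HOST/switch/SWITCH_NAME on every host,
# reports hosts that failed on stderr, and returns the failure count.
# (Function name, timeout, and the CURL override are hypothetical.)
flip_switch_everywhere() {
  switch_name=$1
  status=$2
  shift 2
  failed=0
  for host in "$@"; do
    # -f: treat HTTP errors as failures, -sS: quiet but show errors,
    # -m 5: give up on a hung server after 5 seconds, -d: send a POST body.
    if ! ${CURL:-curl} -fsS -m 5 -d "status=$status" \
        "http://$host/switch/$switch_name" >/dev/null; then
      echo "FAILED: $host" >&2
      failed=$((failed + 1))
    fi
  done
  return "$failed"
}

# Example (not run here):
# flip_switch_everywhere x off server01 server02 server03 server04
```

The timeout matters most during an incident, when the server you are trying to switch may be the one that is hanging.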
Switchable Microapps • Ability to turn off an entire microapp • Collapse all relevant components • Helpful if microapp is slow
Responsibility and Authority • Do not need to get “approval” to turn off any microapp • Operations team can make judgement calls • Need to ensure app can be brought back ASAP
Emergency Mode
Emergency Mode • Rendering a page takes time • As a news site we have unexpected surges in traffic • We need to be able to trade off dynamic pages for speed • Often one page gets sudden heavy traffic
Page Pressing • Emergency mode needs a bit more oomph • Not just an in-memory cache, but a full page cache • Stored on disk as generated HTML • Served as static files, therefore over 1200 pps
Really cache everything • HTML page is fully generated • Except for microapps • Emergency mode for CMS doesn’t affect microapps • Microapp Cache for microapps
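Page pressing, as described above, amounts to writing the fully generated HTML to disk where a web server can serve it as a static file. A minimal sketch follows; the function name, the `index.html` directory layout, and the cache directory are assumptions, not the Guardian's actual implementation.

```shell
# press_page URL_PATH CACHE_DIR
# Reads fully generated HTML on stdin and stores it under CACHE_DIR.
# Writing to a temporary file and renaming it means a reader never
# sees a half-written page. (Name and layout are hypothetical.)
press_page() {
  target="$2/${1#/}/index.html"
  mkdir -p "$(dirname "$target")" || return 1
  tmp_file="$target.tmp.$$"
  cat > "$tmp_file" && mv "$tmp_file" "$target"
}

# Example (not run here): press whatever the renderer produced.
# render_page /travel/skiing | press_page /travel/skiing /var/cache/pressed
```

`render_page` in the example stands for whatever produces the page HTML; it is a placeholder, not a real command.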
Caching an infinite set • There are lots of pages on the Guardian site • 1.4 million pieces of content • 25,000 keyword pages • http://www.guardian.co.uk/travel/france+travel/skiing • Can’t cache them all
Cache what’s important • Content - when modified • Including during emergency mode • Navigation - every 2 weeks • Can force a page press • Automatic (e.g. tag combiners) - never • Automatic but important - every 2 weeks
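The refresh rules on this slide can be written down as a small decision function. The page-type names, the function name, and the use of page age in days are assumptions made for the sketch; the rules themselves (on modification, every 2 weeks, never) come from the slide.

```shell
# should_press PAGE_TYPE AGE_DAYS MODIFIED
# Prints "yes" if the page should be pressed (re-cached) now, else "no".
# PAGE_TYPE is one of: content, navigation, automatic, automatic-important.
# MODIFIED is "yes" or "no". (All names here are hypothetical.)
should_press() {
  page_type=$1
  age_days=$2
  modified=$3
  case "$page_type" in
    content)
      # Content is pressed whenever it is modified.
      if [ "$modified" = yes ]; then echo yes; else echo no; fi
      ;;
    navigation|automatic-important)
      # Pressed every 2 weeks.
      if [ "$age_days" -ge 14 ]; then echo yes; else echo no; fi
      ;;
    automatic)
      # Automatic pages such as tag combiners are never pressed.
      echo no
      ;;
    *)
      echo "unknown page type: $page_type" >&2
      return 2
      ;;
  esac
}
```

A forced page press, which the slide allows for navigation pages, would simply bypass this check.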
Monitoring • Or how do I know what to turn off?
Always provide stats • Consistent format • Aggregate stats at each level
Indicate where issues are • Check high up in the architecture first • Indicates what is causing the problem • Break down to the next level
Automatic switches • Release valves • Emergency mode • Database off mode
Switch if a threshold is met • If average page response time is higher than threshold • Reset after timeout (say 60 seconds) • Prevents Ping-Pong of switches • Really handy for GC issues, Network issues etc.
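One way to read the threshold-plus-timeout rule is sketched below: once the switch trips it stays on for a fixed hold period regardless of new readings, and only then is the threshold checked again. That reading, the function name, the state file, and the integer millisecond inputs are all assumptions for illustration.

```shell
# evaluate_switch AVG_MS THRESHOLD_MS HOLD_SECS NOW_EPOCH STATE_FILE
# Prints "on" if the automatic switch should be on, else "off".
# STATE_FILE records when the switch last tripped. While the hold period
# has not elapsed the switch stays on, which prevents ping-pong.
# (Function name and state-file mechanism are hypothetical.)
evaluate_switch() {
  avg_ms=$1
  threshold_ms=$2
  hold_secs=$3
  now=$4
  state_file=$5
  if [ -f "$state_file" ]; then
    tripped_at=$(cat "$state_file")
    if [ $((now - tripped_at)) -lt "$hold_secs" ]; then
      echo on
      return 0
    fi
  fi
  if [ "$avg_ms" -gt "$threshold_ms" ]; then
    # Threshold exceeded: trip (or re-trip) and restart the hold period.
    echo "$now" > "$state_file"
    echo on
  else
    rm -f "$state_file"
    echo off
  fi
}

# Example (not run here): 500 ms threshold, 60 second hold.
# evaluate_switch "$AVG_MS" 500 60 "$(date +%s)" /tmp/emergency-mode.state
```

Passing the current time in as an argument keeps the function easy to exercise with fixed timestamps.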
Summary
Summary • Expect failure • Plan for failure • At 4am • Keep it simple • Keep everything independent
Summary • When it does go wrong • Fix the symptoms first • Then find out what actually went wrong • Start straight away • Log everything, all the time
Thank You • Michael Brunton-Spall • michael.brunton-spall@guardian.co.uk • @bruntonspall • Lisa van Gelder • lisa.van-gelder@guardian.co.uk • @techbint
Image credits • Giant Furry Rat - “Lost Land of the Volcano”, courtesy of BBC Natural History Unit • Panic Button - http://www.flickr.com/photos/trancemist/361935363/ • Long Meg Sidings - http://www.flickr.com/photos/ingythewingy/5243875486/ • Server Rack - http://www.flickr.com/photos/jamisonjudd/2433102356/ • Release Valve - http://www.flickr.com/photos/kayveeinc/4107697872 • Ancient Planet - http://www.flickr.com/photos/gsfc/4479185727/ • Solar system - http://www.flickr.com/photos/gsfc/4479185727 • Gauges - http://www.flickr.com/photos/dgoodphoto/5264024028 • Logs - http://www.flickr.com/photos/catzrule/5693655199 • Higgs boson - http://www.flickr.com/photos/jurvetson/4233962874 • Toolbox - http://www.flickr.com/photos/jrhode/4632887921 • Don’t Panic sign used with permission • Guardian Team used with permission