Apps Behaving Badly, by Michael Brunton-Spall and Lisa van Gelder


  1. Apps Behaving Badly Michael Brunton-Spall Lisa van Gelder

  2. The inevitability of failure • Systems will fail • Architect for failure

  3. System independence • Each system should cope on its own • Some systems are critical • Redundancy where necessary • This is not “Scaling”

  4. Core CMS Discussion Apache Django Apache Apache Zeitgeist Apache Java GAE Java MPs Expenses DB Apache EC2

  5. When it all goes wrong

  6. Apply fences • Remove misbehaving servers from load balancers • Turn off expensive features • Make your site go faster at expense of dynamic content

  7. Don’t start with root cause analysis • You don’t need to know what went wrong • Fix the symptoms first • Then work out the cause

  8. Causation analysis for fun and profit • Devs and Ops are good at guessing • Devs and Ops are bad at guessing correctly

  9. How to analyse a failure • Loosely based on “Analysis of Competing Hypotheses” • Written for the CIA

  10. Hypothesis testing • Hard to prove causation • Easy to prove non-correlation • Evidence that this hypothesis is false
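
The idea of hunting for evidence against each hypothesis, rather than for it, can be sketched as a small ranking function. This is only an illustration: the function name and the example hypotheses and observations are made up, not taken from the talk.

```python
def rank_hypotheses(hypotheses, evidence):
    """Order hypotheses by how little evidence contradicts them.

    `evidence` maps an observation to the set of hypotheses it is
    inconsistent with. Evidence that merely fits a hypothesis is ignored,
    because it rarely separates one explanation from another.
    """
    contradictions = {h: 0 for h in hypotheses}
    for refuted in evidence.values():
        for h in refuted:
            if h in contradictions:
                contradictions[h] += 1
    # sorted() is stable, so ties keep the order they were listed in.
    return sorted(hypotheses, key=lambda h: contradictions[h])


# Hypothetical incident: three guesses and three observations.
hypotheses = ["database overloaded", "GC pauses", "bad release"]
evidence = {
    "DB query times were normal": {"database overloaded"},
    "no deploy in the last 24h": {"bad release"},
    "only one server affected": {"database overloaded"},
}
ranking = rank_hypotheses(hypotheses, evidence)
```

The least-contradicted hypothesis comes first, which suggests where to spend staff time next; it does not prove that hypothesis true.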

  11. Generate lots of hypotheses

  12. How do you get the proof?

  13. Allocate Priorities and Staff

  14. Logs, Logs, Logs, Logs • Trigger a stack dump on hanging servers • Back up / copy logs of the affected server • JVM log • Stdout • Application log

  15. Stack traces, heap dumps, core dumps • Get as much info as possible • Heap dumps can take a long time, so only if necessary

  16. Log analysis is your friend • Simple tools for a simple life • Grep, Cut, Uniq, Sort • find the bit of log you are interested in • calculate duration and order by slowest • Sed, Awk

  17. zgrep "RequestLoggingFilter - Request for.*completed in " $LOGFILE | grep -v " /management/" | cut -d" " -f1,2,3,10,13 > $COMPLETED_REQUESTS_FILE
      cat $COMPLETED_REQUESTS_FILE | cut -d " " -f5 | sort -nr | uniq -c | awk '{ SUM += $1; print $2, SUM }' > $CUMULATIVE_REQUESTS_AT_OR_ABOVE_RESPONSE_TIME_FILE
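
For readers less fluent in `sort | uniq -c | awk`, here is a rough Python equivalent of the second pipeline's final stage. It assumes the response times have already been pulled out of the log as numbers, because the exact log line format is not shown in the slides; the function name is made up for illustration.

```python
from collections import Counter


def cumulative_at_or_above(durations):
    """For each distinct response time, slowest first, count the requests
    that took at least that long (the same result as
    `sort -nr | uniq -c | awk '{ SUM += $1; print $2, SUM }'`)."""
    counts = Counter(durations)
    running_total = 0
    rows = []
    for duration in sorted(counts, reverse=True):
        running_total += counts[duration]
        rows.append((duration, running_total))
    return rows


# Six hypothetical response times in milliseconds.
rows = cumulative_at_or_above([120, 40, 120, 900, 40, 40])
```

Reading the rows from the top answers "how many requests were at least this slow?", which is usually the first question asked during an incident.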

  18. Write what you need • Log Analyser • MySQL database • Parses application logs • Can now query database • What DB calls does this URL make? • What URLs make this DB call?
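
A minimal sketch of the two queries on this slide. It uses an in-memory SQLite database instead of MySQL so that it runs anywhere, and the table layout, function names and sample rows are all assumptions; the real log analyser's schema is not described in the talk.

```python
import sqlite3


def build_index(records):
    """Load (url, db_call) pairs, as parsed from application logs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE db_calls (url TEXT, db_call TEXT)")
    conn.executemany("INSERT INTO db_calls VALUES (?, ?)", records)
    return conn


def calls_for_url(conn, url):
    """What DB calls does this URL make?"""
    rows = conn.execute(
        "SELECT DISTINCT db_call FROM db_calls WHERE url = ? ORDER BY db_call",
        (url,),
    )
    return [row[0] for row in rows]


def urls_for_call(conn, db_call):
    """What URLs make this DB call?"""
    rows = conn.execute(
        "SELECT DISTINCT url FROM db_calls WHERE db_call = ? ORDER BY url",
        (db_call,),
    )
    return [row[0] for row in rows]


# Hypothetical parsed log records.
conn = build_index([
    ("/travel/france", "findArticle"),
    ("/travel/france", "findTags"),
    ("/sport", "findArticle"),
])
```

Once the logs are in a table, both directions of the question are a one-line query rather than another round of grep.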

  19. It’s everybody’s responsibility • Accessing logs • Database analytics • Building tools to help

  20. Do it ASAP before it happens again. • Crack team starts analysis within minutes if possible • Sometimes crack team is just 1 person

  21. Preventing Emergencies

  22. Core systems vs Periphery systems • Core systems must be reliable and up • Periphery systems may be down • But preferably are not!

  23. What is a microapp • A periphery system • Can be released in isolation • Can be less reliable • Can be less performant • Timeout • Components collapse

  24. Microapps • How we create separation of systems • Similar to SSIs - HTML placeholders • Powered by HTTP • Load balancers, Proxies, Caching
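
The placeholder, timeout and collapse behaviour can be sketched as below. The placeholder syntax, the 0.5 second timeout and the function names are assumptions made for the example; the slides do not show the real markup or limits.

```python
import re
from urllib.request import urlopen

# Hypothetical placeholder syntax: <!-- microapp src="http://host/path" -->
PLACEHOLDER = re.compile(r'<!--\s*microapp\s+src="([^"]+)"\s*-->')


def http_fetch(url, timeout_s=0.5):
    """Fetch a microapp fragment over HTTP, giving up after timeout_s."""
    with urlopen(url, timeout=timeout_s) as response:
        return response.read().decode("utf-8")


def render(page_html, fetch=http_fetch):
    """Replace each placeholder with its fragment. A microapp that is slow
    or broken collapses to nothing instead of breaking the core page."""
    def substitute(match):
        try:
            return fetch(match.group(1))
        except Exception:
            return ""  # collapse the component
    return PLACEHOLDER.sub(substitute, page_html)
```

Passing `fetch` as a parameter keeps the core page logic testable without a network, and it is also the natural place to put a cache or a switch check in front of the HTTP call.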

  25. Switches

  26. Feature switches • Turn on or off features as necessary • HTTP URLs to expose switches • POST not GET • Switch dashboard to see status
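
A minimal per-server switch board, to show why state changes go through POST only: GET requests can be repeated by crawlers, prefetchers and monitoring, so they should never flip a switch. The class, method names and status codes are illustrative assumptions, not the Guardian's implementation.

```python
class SwitchBoard:
    """In-memory feature switches for a single server."""

    def __init__(self, defaults):
        self._state = dict(defaults)

    def is_on(self, name):
        return self._state[name]

    def handle(self, method, name, status=None):
        """Handle a request for /switch/<name>; returns (http_status, body)."""
        if name not in self._state:
            return (404, "unknown switch")
        if method == "GET":
            # Read-only: any status parameter on a GET is ignored.
            return (200, "on" if self._state[name] else "off")
        if method == "POST" and status in ("on", "off"):
            self._state[name] = (status == "on")
            return (200, status)
        return (400, "bad request")

    def dashboard(self):
        """Snapshot of every switch, for a status page."""
        return {name: ("on" if on else "off")
                for name, on in sorted(self._state.items())}
```

Application code would wrap an expensive feature in `if board.is_on("comments"):` (a hypothetical switch name), so turning it off takes effect on the next request without a deploy.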

  27. Per server or global? • Global requires shared state • Global lets you flick switch once for all servers • Per server is less complex • Lets you turn a feature on for a single server

  28. Simple tools for simple tasks • for x in 01 02 03 04; do curl -d status=off http://server$x/switch/x; done • Now you have global switches :) • As compared to using ZooKeeper

  29. Switchable Microapps • Ability to turn off an entire microapp • Collapse all relevant components • Helpful if microapp is slow

  30. Responsibility and Authority • Do not need to get “approval” to turn off any microapp • Operations team can make judgement calls • Need to ensure app can be brought back ASAP

  31. Emergency Mode

  32. Emergency Mode • Rendering a page takes time • As a news site we have unexpected surges in traffic • We need to be able to trade off dynamic pages for speed • Often one page gets sudden heavy traffic

  33. Page Pressing • Emergency mode needs a bit more oomph • Not just in memory cache, but a full page cache • Stored on disk as generated HTML • Served as static files, therefore over 1200 pps
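
Page pressing can be sketched as "render once, write the HTML to disk, read it back while emergency mode is on". The file naming scheme and function names here are assumptions, and in production the pressed files would be served directly by the web server rather than by application code as in this stand-in.

```python
from pathlib import Path
from urllib.parse import quote


def pressed_path(root, url_path):
    """Map a URL path to one file under root. Percent-encoding the whole
    path means "/" and "+" cannot create or escape directories."""
    return Path(root) / (quote(url_path, safe="") + ".html")


def press_page(root, url_path, render):
    """Render the page once and store the generated HTML on disk."""
    target = pressed_path(root, url_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_suffix(".tmp")
    tmp.write_text(render(url_path), encoding="utf-8")
    tmp.replace(target)  # rename, so readers never see a half-written file
    return target


def serve(root, url_path, render, emergency_mode):
    """Use the pressed copy in emergency mode, else render dynamically."""
    target = pressed_path(root, url_path)
    if emergency_mode and target.exists():
        return target.read_text(encoding="utf-8")
    return render(url_path)
```

A page that was never pressed still falls through to dynamic rendering, so emergency mode degrades gracefully for the long tail of URLs.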

  34. Really cache everything • HTML page is fully generated • Except for microapps • Emergency mode for CMS doesn’t affect microapps • Microapp Cache for microapps

  35. Caching an infinite set • There are lots of pages on the Guardian site • 1.4 million pieces of content • 25,000 keyword pages • http://www.guardian.co.uk/travel/france+travel/skiing • Can’t cache them all

  36. Cache what’s important • Content - when modified • including during emergency mode • Navigation - Every 2 weeks • can force page press • Automatic (e.g. tag combiners) - Never • Automatic but important - Every 2 weeks
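
The pressing rules on this slide can be written down as one decision function. The category names, the argument names and the reading that "force" applies to navigation pages only are assumptions drawn from the bullet order, not a documented policy.

```python
from datetime import datetime, timedelta

TWO_WEEKS = timedelta(weeks=2)


def should_press(kind, now, last_pressed=None, modified=False, forced=False):
    """Decide whether a page should be (re)pressed to the disk cache."""
    if kind == "content":
        # Pressed whenever it is modified, even during emergency mode.
        return modified
    if kind == "automatic":
        # e.g. tag combiner pages: too many to cache, so never pressed.
        return False
    if kind in ("navigation", "automatic_important"):
        if kind == "navigation" and forced:
            return True
        return last_pressed is None or now - last_pressed >= TWO_WEEKS
    raise ValueError("unknown page kind: " + kind)
```

Keeping the policy in one place makes it easy to see that the infinite set of automatic pages is deliberately left out of the cache.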

  37. Monitoring • Or how do I know what to turn off?

  38. Always provide stats • Consistent format • Aggregate stats at each level

  39. Indicate where issues are • Check high up in architecture first • Indicates what is causing the problem • Breakdown to next level

  40. Automatic switches • Release valves • Emergency mode • Database off mode

  41. Switch if a threshold is met • If average page response time is higher than threshold • Reset after timeout (say 60 seconds) • Prevents Ping-Pong of switches • Really handy for GC issues, Network issues etc.
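
A sketch of an automatic switch that trips when the average response time crosses a threshold and stays tripped for a fixed hold period, which is what stops it ping-ponging. The class name, the rolling window of 100 samples and the injectable clock are assumptions for the example; only the threshold and the 60 second reset come from the slide.

```python
import time
from collections import deque


class AutoSwitch:
    """Trips on a high average response time, resets after hold_seconds."""

    def __init__(self, threshold_ms, hold_seconds=60, window=100,
                 clock=time.monotonic):
        self.threshold_ms = threshold_ms
        self.hold_seconds = hold_seconds
        self._samples = deque(maxlen=window)  # rolling window of timings
        self._clock = clock
        self._tripped_at = None

    def record(self, response_ms):
        """Feed in one page response time."""
        self._samples.append(response_ms)
        average = sum(self._samples) / len(self._samples)
        if self._tripped_at is None and average > self.threshold_ms:
            self._tripped_at = self._clock()
            # Start from a clean window so old slow samples cannot
            # re-trip the switch the moment it resets.
            self._samples.clear()

    def tripped(self):
        """True while the switch is held on (e.g. emergency mode active)."""
        if (self._tripped_at is not None
                and self._clock() - self._tripped_at >= self.hold_seconds):
            self._tripped_at = None
        return self._tripped_at is not None
```

Because the reset depends only on elapsed time, a short GC pause or network blip costs one hold period of reduced functionality and then clears itself without anyone being paged.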

  42. Summary

  43. Summary • Expect Failure • Plan for failure • At 4am • Keep it simple • Keep everything independent

  44. Summary • When it does go wrong • Fix the symptoms first • Then find out what actually went wrong • Start straight away • Log everything, all the time

  45. Thank You • Michael Brunton-Spall • michael.brunton-spall@guardian.co.uk • @bruntonspall • Lisa van Gelder • lisa.van-gelder@guardian.co.uk • @techbint • Giant Furry Rat - “Lost land of the Volcano” courtesy of BBC natural history unit • Panic Button - http://www.flickr.com/photos/trancemist/361935363/ • Long Meg Sidings - http://www.flickr.com/photos/ingythewingy/5243875486/ • Server Rack - http://www.flickr.com/photos/jamisonjudd/2433102356/ • Release Valve - http://www.flickr.com/photos/kayveeinc/4107697872 • Ancient Planet - http://www.flickr.com/photos/gsfc/4479185727/ • Solar system - http://www.flickr.com/photos/gsfc/4479185727 • Gauges - http://www.flickr.com/photos/dgoodphoto/5264024028 • Logs - http://www.flickr.com/photos/catzrule/5693655199 • Higgs boson - http://www.flickr.com/photos/jurvetson/4233962874 • Toolbox - http://www.flickr.com/photos/jrhode/4632887921 • Don’t Panic sign used with permission • Guardian Team used with permission
