Anomaly Detection, Fault Tolerance, Anticipation: Patterns
John Allspaw, SVP, Tech Ops
QCon London 2012, Wednesday, March 7, 2012
Four Cornerstones (Erik Hollnagel):
Knowing What To Expect (Anticipation)
Knowing What To Do (Response)
Knowing What To Look For (Monitoring)
Knowing What Has Happened (Learning)
Anomaly Detection
• Getting at the state of health
• Evaluating the state of health
• Components AND systems
Supervisory Example: Active health check
Monitor runs check_http against the Component (webserver); a 200 OK response means exit 0.
Pros: Easy to implement. Easy to understand. Well-known pattern.
Cons: Messaging can fail. Scalability is limited.
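A minimal sketch of this active-check pattern, assuming Nagios-style exit codes (0 = OK, 2 = CRITICAL); the URL and timeout are hypothetical:

```python
# Active health check: a monitor polls the component and maps the
# result to a Nagios-style exit code (0 = OK, 2 = CRITICAL).
import urllib.request

def check_http(url, timeout=1.0):
    """Return 0 if the component answers 200 OK, 2 otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if resp.status == 200 else 2
    except OSError:
        # connection refused, timeout, DNS failure, non-2xx, etc.
        return 2
```

A supervisor would run this on a schedule and call `sys.exit(check_http(...))`, which is where the messaging and scalability limits above come from.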
Supervisor Sensitivity
1 sec timeout, 1 retry, 3 sec interval: up to ~2.9s elapsed in the previous interval, plus a 1s timeout, a 3s wait, and a 1s retry timeout before the failure registers (7.9 sec exposure).
Supervisor Sensitivity
Monitor → Component (webserver): schedule latency (max = N), request latency (max = 0.9s); Component → Monitor: response latency (max = 0.9s); 200 OK → exit 0.
Supervisor Sensitivity
How many seconds of errors can you tolerate serving?
Supervisory Example: Passive health check (interval-based)
Component (webserver) checks itself on an interval, e.g. DISK consumption within bounds → exit 0.
Pros: Efficient. Scalability is different. Fewer moving parts. Less exposure. Can submit to multiple places.
Cons: Nonideal for network-based services. Different tuning (windowed expectation).
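A sketch of the disk-consumption example above, run by the component itself on an interval; the 90% bound and the path are assumptions:

```python
# Passive health check: the component checks its own disk
# consumption and reports 0 when within bounds.
import shutil

def disk_within_bounds(path="/", max_used_fraction=0.90):
    """Return True if disk usage at `path` is within bounds."""
    usage = shutil.disk_usage(path)
    return (usage.used / usage.total) <= max_used_fraction

def passive_check():
    # exit-code convention from the slide: 0 = OK
    return 0 if disk_within_bounds() else 1
```

The result can then be submitted to multiple places (a monitor, a metrics store) rather than waiting to be polled.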
Supervisory Example: Passive health check
The Component EMITs a result (✓) each interval; until the next emit, the current interval's state is unknown (?), bounded by the schedule and interval latencies.
Exposure = (Schedule + Interval) * UnknownConsecutiveIntervals + 1
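The exposure formula can be sketched as a function. The slide's grouping of the "+1" is ambiguous; it is read here as multiplying by (unknown consecutive intervals + 1), which matches the "previous interval" term in the earlier worked example, and that reading is an assumption:

```python
# Worst-case detection exposure for an interval-based passive check,
# per the slide's formula (grouping of the +1 assumed as shown).
def exposure(schedule, interval, unknown_consecutive_intervals=0):
    return (schedule + interval) * (unknown_consecutive_intervals + 1)
```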
Frequency and Transience
Probability of false positives: short intervals, low # of retries, short timeouts.
Probability of nondetection: long intervals, high # of retries, long timeouts.
In-Line Example: Passive application event logging
The application publishes events to the monitor on demand.
Pros: On-demand publish.
Cons: Onus is on the app. Can't be 100% sure it's working.
Positive events (sales, registrations, etc.) and negative events (errors, exceptions, etc.): lack or presence of data mean different things, so history is paramount.
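A sketch of this pattern; the in-memory counter stands in for a real metrics backend (e.g. Graphite) and is an assumption:

```python
# Passive application event logging: the app publishes events on
# demand; the monitor compares observed counts against history,
# because absence of data can itself be the anomaly.
from collections import Counter

EVENTS = Counter()

def emit(event):
    """App-side: publish an event (sale, registration, error...)."""
    EVENTS[event] += 1

def rate_looks_normal(event, historical_count, tolerance=0.5):
    """Monitor-side: compare against history, not against zero."""
    observed = EVENTS[event]
    return abs(observed - historical_count) <= tolerance * historical_count
```

Note that for a positive event like "sale", zero observations against a nonzero history is an alarm; for a negative event like "error", a spike against a near-zero history is.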
Context
Evaluation
What is 'abnormal'?
[Chart: Response Time vs. Time]
Static Thresholds
[Chart: Response Time vs. Time, with static Warning and Critical threshold lines]
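A static-threshold evaluation in the Warning/Critical style charted above; the 6.0 and 8.0 levels are hypothetical:

```python
# Static thresholds: fixed Warning and Critical lines, regardless
# of time of day or recent history.
def classify(response_time, warning=6.0, critical=8.0):
    if response_time >= critical:
        return "CRITICAL"
    if response_time >= warning:
        return "WARNING"
    return "OK"
```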
Static Thresholds
Context: Normal?
Context: 24 hours
Context: 7 days
Context: Normal But Noisy
Context: Smoothing?
Context: Holt-Winters Exponential Smoothing
Recent points influence a forecast, with exponentially decreasing influence backwards in time.
en.wikipedia.org/wiki/Exponential_smoothing
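The building block can be sketched in a few lines. This is single exponential smoothing only; full Holt-Winters adds trend and seasonal terms, which this sketch omits:

```python
# Single exponential smoothing: each forecast mixes the newest
# observation with the previous forecast, so older points have
# exponentially decreasing influence.
def exponential_smoothing(series, alpha=0.5):
    forecast = [series[0]]
    for x in series[1:]:
        forecast.append(alpha * x + (1 - alpha) * forecast[-1])
    return forecast
```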
Context: "Aberrant Behavior Detection in Time Series for Network Monitoring" (Brutlag)
http://static.usenix.org/events/lisa00/full_papers/brutlag/brutlag_html/
Dynamic Thresholds
Dynamic Thresholds
[Chart: raw data with dynamic upper and lower bounds]
Dynamic Thresholds
Hrm....
Dynamic Thresholds
Holt-Winters Aberration: Ah!
Dynamic Thresholds
Graphite metrics collection w/ Holt-Winters aberrations: http://graphite.wikidot.com/
Nagios check for Graphite data: https://github.com/etsy/nagios_tools/blob/master/check_graphite_data
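A hedged sketch of the aberration idea (not Graphite's actual implementation): a point is aberrant when it falls outside a confidence band around the forecast, here forecast ± k times a running deviation; the smoothing factors and band width k are assumptions:

```python
# Holt-Winters-style aberration detection, simplified: flag points
# that land outside forecast +/- k * running deviation.
def aberrations(series, alpha=0.5, gamma=0.5, k=3.0):
    flagged = []
    forecast, deviation = series[0], 0.0
    for i, x in enumerate(series[1:], start=1):
        if deviation > 0 and abs(x - forecast) > k * deviation:
            flagged.append(i)
        # update the running deviation, then the forecast
        deviation = gamma * abs(x - forecast) + (1 - gamma) * deviation
        forecast = alpha * x + (1 - alpha) * forecast
    return flagged
```

The band tracks the data, so a value that is normal at peak traffic can still be flagged at 4am, which is exactly what static thresholds cannot do.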
Four Cornerstones (Erik Hollnagel):
Knowing What To Expect (Anticipation)
Knowing What To Do (Response)
Knowing What To Look For (Monitoring)
Knowing What Has Happened (Learning)
FAULT TOLERANCE
Detection of fault X → triggers corrective action Y → clean up, report back (RECOVERY OR MASKING)
Variation Tolerance
Adaptive Systems
Expected Variation
New disturbances arise → compensation is exhausted.
Disturbance exceeds Expected Variation: control → compensation → decompensation. (Woods, 2011)
A disturbance within Expected Variation is a Variation; a disturbance that exhausts compensation is a Fault.
Variations != Faults
Dead. Corrupt. Late. Wrong.
Fault Tolerance: Redundancy
Spatial (server, network, process)
Temporal (checkpoint, "rollback")
Informational (data in N locations)
Spatial Redundancy
Spatial Redundancy: Active/Active
Spatial Redundancy: Active/Passive
Spatial Redundancy: Roaming Spare, Dedicated Spare
In-Line Fault Tolerance
PHP App (thrift client) → Thrift → Search (Lucene/Solr)
• Connect timeout
• Send timeout
• Receive timeout
In-Line Fault Tolerance
App → Search (Lucene/Solr): connection fails (X)
1. App attempts connection, can't
2. Caches APC user object with 60s "TTL", key=server:port
3. Moves to next server in rotation, skipping any found in APC
In-Line Fault Tolerance
http://thrift.apache.org/ /lib/php/src/TSocketPool.php
In-Line Fault Tolerance
Pros: Distributed checking and perspective. Handles transient failures. Auto-recovery.
Cons: Onus is on the app for implementation.
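A sketch of the TSocketPool-style failover described above: a failed host is remembered under a server:port key with a 60s "TTL" (APC in the PHP original), and the rotation skips any host still marked down. The in-memory dict stands in for APC and is an assumption:

```python
# In-line fault tolerance: remember failed hosts briefly, skip them
# in rotation, and auto-recover once the "TTL" expires.
import time

DOWN = {}            # "server:port" -> time marked down
RETRY_INTERVAL = 60  # seconds, per the slide

def mark_down(host, port, now=None):
    DOWN[f"{host}:{port}"] = time.time() if now is None else now

def is_down(host, port, now=None):
    now = time.time() if now is None else now
    marked = DOWN.get(f"{host}:{port}")
    return marked is not None and (now - marked) < RETRY_INTERVAL

def next_host(rotation, now=None):
    """Return the first (host, port) in rotation not marked down."""
    for host, port in rotation:
        if not is_down(host, port, now):
            return (host, port)
    return None
```

Because every client instance does this independently, the checking is distributed, and recovery is automatic: once 60s pass, the host simply rejoins the rotation.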
Fault Tolerance: Nagios Event Handlers
Attempt to recover from specific conditions; chain together recovery actions.
http://nagios.sourceforge.net/docs/3_0/eventhandlers.html
If (fault X) then HUP process; re-check
  If (OK) then notify+exit
  ELSE hard restart process; re-check
    If (OK) then notify+exit
    ELSE remove from production; notify+exit
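The escalation chain above can be sketched as a function; the recovery steps are passed in as hypothetical callables (HUP, hard restart, remove from production), since a real event handler would shell out to service-specific commands:

```python
# Event-handler escalation: try the gentlest recovery first,
# re-check after each step, stop as soon as one works.
def escalate(recheck, hup, hard_restart, remove_from_production):
    hup()
    if recheck():
        return "recovered: HUP"
    hard_restart()
    if recheck():
        return "recovered: restart"
    remove_from_production()
    return "removed from production"
```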
How many seconds of errors can you tolerate serving?
Fail Closed
When a fault is found and can't be recovered or masked, operations cease to protect the rest of the system from damage.
Depth and Dependencies
Load Balancers → Monitor health check → App → DB
Depth and Dependencies
WARNING: Don't be too crazy.
Fail Closed: Aggregate Cluster Checking
If (clusterfail > 25%) then notify+exit ELSE OK
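The aggregate check on the slide, sketched directly: only escalate when more than 25% of the cluster fails, so a single bad node doesn't fail the whole tier closed:

```python
# Fail-closed aggregate cluster check: per the slide,
# if (clusterfail > 25%) then notify+exit, else OK.
def cluster_status(results, max_fail_fraction=0.25):
    """results: list of booleans, True = node check passed."""
    failed = sum(1 for ok in results if not ok)
    if failed / len(results) > max_fail_fraction:
        return "CRITICAL: notify+exit"
    return "OK"
```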