Draining the Flood: Combating Alert Fatigue
Yu Chen
The Alert Flood in Baidu
• The volume of alerts is high
  – More than 100 alerts per person per day
    • Daytime: ~75% of alerts in 17 hours
    • Nighttime: ~25% of alerts in 7 hours
• Highly redundant
  – # effective alerts / # alert SMS < 0.15
Observations & Solutions
• Duplicate ratio: 58%
  – Reason: persistent alerts; correlated alerts
  – Solution: alert grouping
• Attention ratio: ~25% (at night time)
  – Reason: over-aggressive alert importance; delivery behavior
  – Solution: alert importance levels and level calibration
• Receivers per alert: 3
  – Reason: ineffective oncall procedure
  – Solution: oncall schedule and escalation
• Single-instance alerts: 88%
  – Reason: > 40% only require simple operations to recover
  – Solution: automatic self-healing
Alert Grouping
• Simple grouping
  – Remove simple duplicates
• Cross-module patterns
  – Reveal underlying issues
• Network connectivity detection
  – Suppress alert surges
Simple Grouping
• Grouping based on natural dimensions (see the sketch below)
  – Alert rule name
  – Deployment structure
    • Product, Module, Cluster, Instance
    • Datacenter, Machine
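A minimal sketch of dimension-based grouping, assuming an alert carries its rule name and deployment coordinates; the Alert fields and function names are illustrative, not the actual Baidu schema:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alert:
    # Illustrative fields; the real alert schema is not shown on the slides.
    rule_name: str
    product: str
    module: str
    cluster: str
    instance: str
    datacenter: str
    machine: str
    fire_time: int


def group_key(alert: Alert, level: str = "cluster") -> tuple:
    """Build a grouping key from the natural dimensions:
    the alert rule name plus a prefix of the deployment structure."""
    dims = {
        "product":  (alert.product,),
        "module":   (alert.product, alert.module),
        "cluster":  (alert.product, alert.module, alert.cluster),
        "instance": (alert.product, alert.module, alert.cluster, alert.instance),
    }
    return (alert.rule_name,) + dims[level]


def group_alerts(alerts, level="cluster"):
    """Collapse simple duplicates: alerts sharing a key form one group."""
    groups = defaultdict(list)
    for a in alerts:
        groups[group_key(a, level)].append(a)
    return groups
```

Grouping at the instance level yields messages like the one on the next slide, where all abnormal instances of one rule are reported together.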
Grouping Result
{group.ab-zxcvq.AB.all:instance:B_zxcvq_FATAL}{overall abnormal instance ratio: 1.36054%}{abnormal (2): 0.opr-zty5-zxcvq-000-cc.AB.bjdc, 1.opr-zty5-zxcvq-000-cc.AB.bjdc}{05-02 16:49:36 - 16:54:09}{http://dwz.cn/…}
• Rule name
  – group.ab-zxcvq.AB.all:instance:B_zxcvq_FATAL
  – An instance-level alert
• Ratio
  – 1.36054%
• Instance list
  – 0.opr-zty5-zxcvq-000-cc.AB.bjdc
  – 1.opr-zty5-zxcvq-000-cc.AB.bjdc
• Time
  – 05-02 16:49:36 - 16:54:09
• Link to detail page
  – http://dwz.cn/…
Delivery with Grouping
[Diagram: alert source → linger buffer → delivered alerts (A:rule1, A:rule2, A:rule1)]
Linger buffer contents:
  Alert info | Fire time | Linger time
  A:rule1    | 5         | 20
  A:rule2    | 10        | 30
  B:rule3    | 20        | 40
  A:rule1    | 20        | 20
  C:rule4    | 25        | 60
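A sketch of the linger-buffer idea in the diagram, assuming each rule has a configurable linger time and that alerts mapping to the same group key while one is lingering get merged into a single delivery; class and method names are assumptions:

```python
import heapq


class LingerBuffer:
    """Hold each alert group until its linger time expires, merging
    duplicates that arrive in the meantime into one delivery."""

    def __init__(self, linger_time_by_rule):
        self.linger_time_by_rule = linger_time_by_rule  # e.g. {"rule1": 20}
        self.pending = {}      # group key -> buffered alerts
        self.deadlines = []    # min-heap of (deliver_at, group key)

    def on_alert(self, key, rule_name, fire_time, alert):
        """Buffer an incoming alert; start a linger timer for new groups."""
        if key not in self.pending:
            deliver_at = fire_time + self.linger_time_by_rule.get(rule_name, 0)
            heapq.heappush(self.deadlines, (deliver_at, key))
            self.pending[key] = []
        self.pending[key].append(alert)

    def poll(self, now):
        """Return the merged groups whose linger time has expired by `now`."""
        delivered = []
        while self.deadlines and self.deadlines[0][0] <= now:
            _, key = heapq.heappop(self.deadlines)
            delivered.append((key, self.pending.pop(key)))
        return delivered
```

With the numbers on the slide, the A:rule1 alert fired at time 20 would merge into the A:rule1 group already lingering since time 5 and the two would go out as one delivery.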
Cross-Module Patterns
• Caller / Callee
  – Both alert when the callee is in trouble
• Association rule mining (see the sketch below)
  – A transaction window starts from every alert
[Diagram: alert timeline across modules A, B, C, D (A:rule1, B:rule2, C:rule3, A:rule3, D:rule4, …);
 example transactions: {A:rule1, B:rule2, C:rule3}, {B:rule2, C:rule3, D:rule4}, {C:rule3, B:rule2};
 mined rules of the form M:ruleX → N:ruleY]
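A sketch of the transaction-window construction and pairwise rule mining described above; the window length and the support/confidence thresholds are illustrative assumptions:

```python
from collections import Counter
from itertools import permutations


def build_transactions(alert_stream, window=300):
    """One transaction per alert: all distinct module:rule items whose
    fire time falls within `window` seconds after that alert."""
    events = sorted(alert_stream)  # (fire_time, "module:rule") tuples
    transactions = []
    for i, (t0, _) in enumerate(events):
        items = {item for t, item in events[i:] if t - t0 <= window}
        transactions.append(items)
    return transactions


def mine_rules(transactions, min_support=3, min_confidence=0.8):
    """Mine pairwise association rules of the form M:ruleX -> N:ruleY."""
    item_count = Counter()
    pair_count = Counter()
    for items in transactions:
        item_count.update(items)
        pair_count.update(permutations(items, 2))
    rules = []
    for (x, y), n in pair_count.items():
        confidence = n / item_count[x]
        if n >= min_support and confidence >= min_confidence:
            rules.append((x, y, confidence))
    return rules
```

Rules that survive the thresholds can then be used to group the consequent alert under the antecedent at delivery time.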
Network Connectivity
• A network device failure can cause a large number of alerts
• Such a failure should trigger alerts across
  – Most rules
  – Most products
• Detected with a heuristic rule (sketched below)
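One possible form of that heuristic, assuming coverage thresholds of around 80% (the thresholds and function name are assumptions): if an alert surge in a short window spans most products and most rules, treat it as a network connectivity event and suppress the individual alerts.

```python
def looks_like_network_failure(recent_alerts, all_products, all_rules,
                               product_coverage=0.8, rule_coverage=0.8):
    """Heuristic: a network device failure tends to fire alerts across
    most products and most rules at roughly the same time."""
    products = {a.product for a in recent_alerts}
    rules = {a.rule_name for a in recent_alerts}
    return (len(products) >= product_coverage * len(all_products)
            and len(rules) >= rule_coverage * len(all_rules))


# If this returns True for the alerts in the last few minutes, the surge can
# be suppressed and replaced by a single "possible network connectivity
# issue" alert to the network oncall.
```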
Linger Time
• Configurable
  – Differs across alert rules
• Adds extra delay before alerts are received
  – Less punctual delivery
• Need better ways to balance grouping against timeliness
Attention Ratio
• Check for receiver activity within an interval after the alert (sketch below)
  – Access log of the monitoring system
    • Viewing the alert detail
    • Viewing relevant curves
  – Login log of the production machine
• Activity present: the alert was attended
• Activity absent: the alert was ignored
• Only applied to night-time alerts
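A sketch of how the attention check could be implemented against the two log sources above; the log record fields, alert attributes, and the one-hour check window are all assumptions:

```python
def is_attended(alert, monitor_access_log, machine_login_log, window=3600):
    """An alert counts as attended if, within `window` seconds after it fired,
    a receiver either viewed it in the monitoring system (alert detail or
    relevant curves) or logged in to the affected production machine."""
    t0, t1 = alert.fire_time, alert.fire_time + window
    viewed = any(e.user in alert.receivers
                 and t0 <= e.time <= t1
                 and e.target in (alert.id, alert.curve_id)
                 for e in monitor_access_log)
    logged_in = any(e.user in alert.receivers
                    and t0 <= e.time <= t1
                    and e.host == alert.machine
                    for e in machine_login_log)
    return viewed or logged_in


def attention_ratio(night_alerts, monitor_access_log, machine_login_log):
    """Only night-time alerts are considered, as on the slide."""
    attended = sum(is_attended(a, monitor_access_log, machine_login_log)
                   for a in night_alerts)
    return attended / len(night_alerts) if night_alerts else 0.0
```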
Alert Calibration
• Importance levels (see the policy sketch below)
  – Critical: SMS + phone call to all receivers
  – Major: SMS + escalation
  – Warning: SMS without escalation
  – Notice: Mail
• Attention ratio should be consistent with the levels
  – Pushed by managers
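The level-to-channel mapping above can be written down as a small policy table; this layout is illustrative, not the actual configuration format, and treating Critical as not needing escalation (since it already pages everyone) is an assumption:

```python
# Delivery behaviour per importance level, as listed on the slide.
DELIVERY_POLICY = {
    "critical": {"channels": ("sms", "phone"), "receivers": "all",    "escalation": False},
    "major":    {"channels": ("sms",),         "receivers": "oncall", "escalation": True},
    "warning":  {"channels": ("sms",),         "receivers": "oncall", "escalation": False},
    "notice":   {"channels": ("mail",),        "receivers": "oncall", "escalation": False},
}


def delivery_plan(level):
    """Look up how an alert of the given importance level is delivered."""
    return DELIVERY_POLICY[level]


# Example: delivery_plan("major") -> SMS to the oncall engineer, with escalation.
```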
Alert Receivers
• Typical receivers of an alert
  – Primary oncall engineer
  – Secondary oncall engineer
  – Oncall engineer lead
  – Senior engineer
  – Manager
• The primary oncall engineer usually handles the alerts
  – But alerts are always sent to everyone
Oncall Escalation
• Alerting stages
  – One fixed stage
    • Primary, secondary
  – Zero or more escalation stages
[Diagram: Primary → (a minutes) → Escalation 1 (Secondary) → (b minutes) → …]
Oncall Escalation
[Screenshot: oncall schedule configuration showing the fixed stage and the escalation stages]
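A sketch of the staged delivery shown in the diagram two slides back, assuming each stage carries a configurable delay (the `a`/`b` minutes) and that escalation stops once someone acknowledges the alert; the schedule format and the acknowledgement check are assumptions:

```python
import time


def notify(receivers, alert):
    # Stand-in for the real delivery channel (SMS / phone call).
    print(f"notify {receivers}: {alert}")


def run_escalation(alert, schedule, is_acknowledged):
    """Walk through the alerting stages: one fixed stage (primary, secondary)
    followed by zero or more escalation stages, each reached only if the
    alert is still unacknowledged after the configured delay.

    schedule example (illustrative):
        [("primary",     ["alice"],       0),
         ("escalation1", ["bob"],         10 * 60),   # a minutes
         ("escalation2", ["oncall-lead"], 20 * 60)]   # b minutes
    """
    for stage_name, receivers, delay_seconds in schedule:
        time.sleep(delay_seconds)
        if is_acknowledged(alert):
            return stage_name            # handled before reaching this stage
        notify(receivers, alert)
    return None                          # exhausted all stages unacknowledged
```

Since alerts are no longer broadcast to everyone at once, the typical receiver count per alert drops while senior engineers and managers are still reached if nobody responds.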
Automatic Self-healing
• Lazy log purge
  – Set an alert on disk free space
  – Delete some logs when the alert triggers
• Granularity
  – Instance level
    • "bin_control restart"
  – Module/Cluster level
    • "curl master.a.com"
• The alert itself (see the handler sketch below)
  – Is not delivered
  – Can still be viewed in the alert log
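A sketch of the lazy log purge and the granularity-based actions; the commands come from the slide, but the rule names, log paths, and the registry structure are illustrative assumptions:

```python
import glob
import os
import subprocess


def purge_old_logs(log_dir, keep=5):
    """Lazy log purge: triggered by the disk-free-space alert, delete the
    oldest log files and keep only the most recent `keep` of them."""
    logs = sorted(glob.glob(os.path.join(log_dir, "*.log")), key=os.path.getmtime)
    for path in logs[:-keep]:
        os.remove(path)


def append_to_alert_log(alert, path="/home/work/alert.log"):
    """Record the suppressed alert so it can still be reviewed later."""
    with open(path, "a") as f:
        f.write(f"{alert}\n")


# Self-healing actions keyed by (rule, granularity).
SELF_HEALING_ACTIONS = {
    ("disk_free_space", "instance"): lambda a: purge_old_logs("/home/work/log"),
    ("instance_fatal",  "instance"): lambda a: subprocess.run(["bin_control", "restart"]),
    ("cluster_fatal",   "cluster"):  lambda a: subprocess.run(["curl", "master.a.com"]),
}


def handle(alert):
    """If a self-healing action is registered, run it and log the alert
    instead of delivering it; otherwise fall back to normal delivery."""
    action = SELF_HEALING_ACTIONS.get((alert.rule_name, alert.granularity))
    if action is None:
        return False          # no self-healing: deliver as usual
    action(alert)
    append_to_alert_log(alert)
    return True               # suppressed: handled automatically
```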
Management Support
• Alert importance calibration
  – Lowering importance levels
• Oncall escalation
  – Include the attention ratio in work evaluation
Result: the weekly number of alerts decreased by 85%
[Chart: number of alerts per week, with Total, Daytime, and Night series]
Remarks
• Reducing redundant alerts by
  – Mining alert correlations for grouping
  – Estimating the attention ratio for importance calibration
  – A receiver escalation mechanism
  – An alert self-healing mechanism
• Also helpful for understanding the root causes of issues
chenyu07@baidu.com