POLYGRAPH: SYSTEM FOR DYNAMIC REDUCTION OF FALSE ALERTS IN LARGE-SCALE IT SERVICE DELIVERY ENVIRONMENTS
SANGKYUM KIM (UIUC)
WINNIE CHENG, SHANG GUO, LAURA LUAN, DANIELA ROSU (IBM RESEARCH)
ABHIJIT BOSE (GOOGLE)
USENIX ATC’11 (June 2011, Portland, OR)

Background
  Large-scale IT service delivery systems
    - No longer confined to racks within a single data center
    - Increasing adoption of virtualization and cloud computing
  Our focus
    - Monitoring alerts
    - Significant portion of alerts are false
  Polygraph
    - Mine historical alerts to dynamically adjust monitoring policies

Basic Alert Policy Types

  Type          Example
  IF A;         IF (System.Virtual_Memory_Percent_Used > 90)
  IF A AND B;   IF (NT_Physical_Disk.Disk_Time > 80) AND (NT_Physical_Disk.Disk_Time ≤ 90)
  IF A OR B;    IF (SMP_CPU.CPU_Status = ‘off-line’) OR (SMP_CPU.Avg_CPU_Busy_15 > 95)

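The three policy shapes above can be read as composable predicates over a monitoring sample. The following is a minimal sketch under that reading, not Polygraph's actual rule engine: the helper names (`all_of`, `any_of`), the dictionary-based sample format, and the sample values are all assumptions made for illustration.

```python
from typing import Callable, Dict, Union

# Hypothetical sample format: attribute name -> latest reading.
Sample = Dict[str, Union[float, str]]
Condition = Callable[[Sample], bool]


def all_of(*conditions: Condition) -> Condition:
    """IF A AND B: fires only when every condition holds."""
    return lambda sample: all(c(sample) for c in conditions)


def any_of(*conditions: Condition) -> Condition:
    """IF A OR B: fires when at least one condition holds."""
    return lambda sample: any(c(sample) for c in conditions)


def memory_high(sample: Sample) -> bool:
    """IF A: a single threshold condition."""
    return sample["System.Virtual_Memory_Percent_Used"] > 90


disk_busy = all_of(
    lambda s: s["NT_Physical_Disk.Disk_Time"] > 80,
    lambda s: s["NT_Physical_Disk.Disk_Time"] <= 90,
)

cpu_trouble = any_of(
    lambda s: s["SMP_CPU.CPU_Status"] == "off-line",
    lambda s: s["SMP_CPU.Avg_CPU_Busy_15"] > 95,
)

# Illustrative readings only; not taken from the talk.
sample: Sample = {
    "System.Virtual_Memory_Percent_Used": 93.0,
    "NT_Physical_Disk.Disk_Time": 85.0,
    "SMP_CPU.CPU_Status": "on-line",
    "SMP_CPU.Avg_CPU_Busy_15": 40.0,
}

fired = {
    "memory_high": memory_high(sample),
    "disk_busy": disk_busy(sample),
    "cpu_trouble": cpu_trouble(sample),
}
```

Keeping each threshold as its own small predicate means a tuning step can replace one number without touching how conditions are combined.
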
Polygraph System Architecture
  [Figure: architecture diagram; labels recoverable from the slide are listed below]
  Data sources
    - System/event source, ticket source
    - Incident management: problem tickets
    - Resource utilization and performance
    - System configuration data
    - Operation data (SLA, maintenance schedule, …)
  Polygraph
    - False alert detector (alerts in, false alerts out)
    - Alert policy generator (tunes alert policies, outputs proposed alert policies)
    - Alert policy evaluator/simulator (outputs new/modified alert policies; events on new rules)
    - Expert review
    - Policy deployment
  Monitoring system
    - Monitoring rule dispatcher (rule changes)
    - Monitoring agent on each monitored server (CPU, disk, memory, app)
    - Rules, alerts, management events

Host-based Alert Policy Threshold Adjustment
  [Figure: resource values of alerts for a host on an axis from 0 to 400. Marked levels, from lowest to highest: current threshold; max resource of false alerts; min resource of real alerts]

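One way to read the figure: for a given host, any threshold at or above the largest value seen in a false alert and below the smallest value seen in a real alert suppresses the false alerts while still firing on the real ones. The sketch below assumes a `value > threshold` policy and picks the lowest such threshold. The function name, the choice of the lowest threshold, and the fallback when the two sets overlap are assumptions, not details confirmed by the slide.

```python
from typing import Sequence


def adjust_host_threshold(
    current: float,
    false_values: Sequence[float],
    true_values: Sequence[float],
) -> float:
    """Propose a per-host threshold for a `value > threshold` policy.

    false_values: resource values observed in this host's false alerts.
    true_values:  resource values observed in this host's real alerts.
    """
    if not false_values:
        return current  # nothing to suppress on this host

    max_false = max(false_values)
    if true_values and max_false >= min(true_values):
        # The two sets overlap, so no single threshold separates them.
        # Assumption: leave the policy unchanged in that case.
        return current

    # Lowest threshold that silences every observed false alert;
    # never lower the threshold that is already deployed.
    return max(current, max_false)


# Hypothetical numbers for illustration only.
proposed = adjust_host_threshold(50, [60, 120, 200], [325, 360])
```

With these made-up inputs the proposed threshold is 200: every observed false alert falls at or below it and every observed real alert stays above it.
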
Time-based Alert Policy Threshold Adjustment (I)
  Finding patterns for false alerts
    - Example: periodic patterns
    - They might include true alerts
  [Figure: alert timeline by day, 2010-05-04 to 2010-05-26]

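The slide does not say how periodic patterns are mined. As one simple illustration, the sketch below treats "periodic" as "recurs at the same hour on most days" and flags hours of the day in which false alerts appeared on a large share of the observed days. The function name, the hour-of-day bucketing, and the `min_day_fraction` parameter are all assumptions.

```python
from collections import defaultdict
from datetime import datetime
from typing import Iterable, List


def periodic_false_hours(
    false_alert_times: Iterable[datetime],
    min_day_fraction: float = 0.8,
) -> List[int]:
    """Return hours of the day in which false alerts recur on most days.

    An hour qualifies when false alerts occurred in it on at least
    `min_day_fraction` of the days that had any false alert at all.
    """
    times = list(false_alert_times)
    all_days = {t.date() for t in times}
    if not all_days:
        return []

    days_per_hour = defaultdict(set)
    for t in times:
        days_per_hour[t.hour].add(t.date())

    return sorted(
        hour
        for hour, days in days_per_hour.items()
        if len(days) / len(all_days) >= min_day_fraction
    )


# Hypothetical data: a false alert at 02:00 on five consecutive days,
# plus one stray false alert at 13:00.
alerts = [datetime(2010, 5, day, 2, 0) for day in range(4, 9)]
alerts.append(datetime(2010, 5, 6, 13, 30))
hours = periodic_false_hours(alerts)
```

A window found this way can still contain real alerts, which is why a second pass over the true alerts is needed before suppressing anything.
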
Time-based Alert Policy Threshold Adjustment (II)
  Finding patterns for true alerts
    - Mine true ranges
    - User-specified threshold given to decide the width of true range
  Example (true range threshold: 1 hour)
    - Times marked in figure: 3pm, 4pm, 8pm
    - True ranges: (2-5pm), (7-9pm)

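The example is consistent with a simple procedure: widen each marked time by the true range threshold on both sides, then merge ranges that overlap. The sketch below implements that reading and reproduces the slide's ranges. It is an interpretation rather than the confirmed algorithm; the function name is hypothetical, and ranges that wrap past midnight are not handled.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Range = Tuple[datetime, datetime]


def mine_true_ranges(
    true_alert_times: Iterable[datetime],
    width: timedelta,
) -> List[Range]:
    """Widen each true alert by +/- width and merge overlapping ranges."""
    intervals = sorted((t - width, t + width) for t in true_alert_times)
    merged: List[Range] = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# The date is arbitrary; only the time of day matters here.
day = datetime(2010, 5, 4)
true_alerts = [day.replace(hour=15), day.replace(hour=16), day.replace(hour=20)]
ranges = mine_true_ranges(true_alerts, timedelta(hours=1))
# ranges covers 14:00-17:00 and 19:00-21:00, i.e. (2-5pm) and (7-9pm)
```

A wider threshold protects more of the time around each real alert at the cost of suppressing fewer false ones, which is the trade-off the user-specified threshold controls.
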
Experiments
  [Figure: three plots, each with Rate (%) on the y-axis]
    - Host-based threshold adjustment: rate vs. train data size (5 to 25 days)
    - Host and time-based threshold adjustment: rate vs. train data size (5 to 25 days)
    - True range threshold effect: rate vs. true range threshold (30, 60, 120, 180 min)
  Legend
    - P1 Total Detected False Events
    - P1 Detected False Events from Hosts with True Sets
    - P2 Total Detected False Events
    - P2 Detected False Events from Hosts with True Sets

Discussion
  Leverage operational data for alert policy tuning
    - Anti-virus (20% of a customer’s alerts)
  Weighted scheme
    - Put emphasis on recent input
  Impact of change operations
    - Integration of service management data is necessary
  Leverage server similarity
    - Grouping similar servers provides a better training dataset

Conclusion
  How to reduce false alerts
    - Polygraph tunes alert policies based on historical data
  To improve recall, we utilized
    - Localized feature: host
        High recall, barely misses true events
    - Time-dependent behavior
        Higher recall, reasonable precision

Questions?