
Background Large-scale IT service delivery systems No longer - PowerPoint PPT Presentation

POLYGRAPH : SYSTEM FOR DYNAMIC REDUCTION OF FALSE ALERTS IN LARGE-SCALE IT SERVICE DELIVERY ENVIRONMENTS SANGKYUM KIM (UIUC) WINNIE CHENG, SHANG GUO, LAURA LUAN, DANIELA ROSU (IBM RESEARCH) ABHIJIT BOSE (GOOGLE) USENIX ATC11 (June 2011,


  1. POLYGRAPH: System for Dynamic Reduction of False Alerts in Large-Scale IT Service Delivery Environments
     Sangkyum Kim (UIUC); Winnie Cheng, Shang Guo, Laura Luan, Daniela Rosu (IBM Research); Abhijit Bose (Google)
     USENIX ATC'11 (June 2011, Portland, OR)

  2. Background
     - Large-scale IT service delivery systems
       - No longer confined to racks within a single data center
       - Increasing adoption of virtualization and cloud computing
     - Our focus
       - Monitoring alerts
       - A significant portion of alerts are false
     - Polygraph
       - Mines historical alerts to dynamically adjust monitoring policies

  3. Basic Alert Policy Types
     Type: IF A;
       Example: IF (System.Virtual_Memory_Percent_Used > 90)
     Type: IF A AND B;
       Example: IF (NT_Physical_Disk.Disk_Time > 80) AND (NT_Physical_Disk.Disk_Time ≤ 90)
     Type: IF A OR B;
       Example: IF (SMP_CPU.CPU_Status = 'off-line') OR (SMP_CPU.Avg_CPU_Busy_15 > 95)
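The three policy types above can be sketched as a small evaluator. This is only an illustration, not Polygraph's or the monitoring product's rule engine: the tuple encoding, the `holds` and `fires` helpers, and the sample dictionaries are all assumptions made for the example.

```python
import operator
from typing import Callable, Dict, List, Tuple

# A condition is (metric name, comparison symbol, threshold value).
# This encoding is hypothetical; the slides only show the textual rule form.
Condition = Tuple[str, str, object]

# Comparison symbols that appear in the example policies.
COMPARATORS: Dict[str, Callable[[object, object], bool]] = {
    ">": operator.gt,
    "<=": operator.le,
    "=": operator.eq,
}


def holds(condition: Condition, sample: Dict[str, object]) -> bool:
    """Check one condition against one monitoring sample."""
    metric, symbol, value = condition
    return COMPARATORS[symbol](sample[metric], value)


def fires(conditions: List[Condition], combiner: str, sample: Dict[str, object]) -> bool:
    """Combine conditions with 'AND' or 'OR'; a one-condition policy works with either."""
    results = [holds(c, sample) for c in conditions]
    return all(results) if combiner == "AND" else any(results)


# IF A AND B
disk_policy = [
    ("NT_Physical_Disk.Disk_Time", ">", 80),
    ("NT_Physical_Disk.Disk_Time", "<=", 90),
]
# IF A OR B
cpu_policy = [
    ("SMP_CPU.CPU_Status", "=", "off-line"),
    ("SMP_CPU.Avg_CPU_Busy_15", ">", 95),
]
```

For instance, `fires(disk_policy, "AND", {"NT_Physical_Disk.Disk_Time": 85})` evaluates both disk conditions against a single made-up sample.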

  4. Polygraph System Architecture
     [Figure: architecture diagram. Labeled components and flows:]
     - Inputs: system/event source (resource utilization and performance, alerts, events); ticket source (incident management, problem tickets); system configuration data; operation data (SLA, maintenance schedule, …)
     - Polygraph: false alert detector (false alerts); alert policy generator (tune alert policies, proposed alert policies); alert policy evaluator/simulator (events on new rules, new/modified alert policies)
     - Expert review, then policy deployment
     - Monitoring system: monitoring rule dispatcher (rule change); monitored server with monitoring agent (CPU, disk, memory, app); system management (monitoring rules, monitoring alerts, events)

  5. Host-based Alert Policy Threshold Adjustment
     [Figure: chart with a value axis from 0 to 400, marking three levels from top to bottom: min resource of real alerts, max resource of false alerts, current threshold]
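One way to read the figure above: when the largest resource value seen in a host's false alerts is below the smallest value seen in its real alerts, the threshold can be raised into the gap between them. The sketch below assumes that reading. The function name, the midpoint rule, and the fallback behaviour are assumptions for illustration, not the adjustment rule stated in the slides.

```python
from typing import Sequence


def adjust_host_threshold(
    current: float,
    false_alert_values: Sequence[float],
    real_alert_values: Sequence[float],
) -> float:
    """Propose a per-host threshold for a policy of the form 'value > threshold'.

    Hypothetical rule: place the threshold halfway between the max resource
    value of false alerts and the min resource value of real alerts.
    """
    if not false_alert_values:
        return current  # nothing to suppress
    max_false = max(false_alert_values)
    if not real_alert_values:
        # No real alerts observed: raise just enough to silence known false alerts.
        return max(current, max_false)
    min_real = min(real_alert_values)
    if max_false >= min_real:
        # False and real alerts overlap; a single threshold cannot separate them.
        return current
    return max(current, (max_false + min_real) / 2)
```

With made-up values, `adjust_host_threshold(50, [120, 180, 200], [350, 375])` proposes a threshold in the gap between 200 and 350, while overlapping histories leave the current threshold unchanged.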

  6. Time-based Alert Policy Threshold Adjustment (I)
     - Finding patterns for false alerts
       - Example: periodic patterns
       - They might include true alerts
     [Figure: alert timeline with a date axis running from 2010-05-04 to 2010-05-26]

  7. Time-based Alert Policy Threshold Adjustment (II)
     - Finding patterns for true alerts
       - Mine true ranges
       - A user-specified threshold decides the width of a true range
     - Example: true range threshold of 1 hour gives true ranges (2-5pm), (7-9pm)
     [Figure: timeline marked at 3pm, 4pm, and 8pm]
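The true-range example above can be reproduced with a short sketch, assuming the 3pm, 4pm, and 8pm marks are true alert times and that each true alert is widened by the threshold on both sides, with overlapping windows merged. That interpretation, the function names, and the minutes-since-midnight encoding are assumptions; ranges that wrap past midnight are not handled.

```python
from typing import List, Sequence, Tuple

Range = Tuple[int, int]  # (start, end) in minutes since midnight


def mine_true_ranges(true_alert_minutes: Sequence[int], threshold_min: int) -> List[Range]:
    """Widen each true alert by +/- threshold_min and merge overlapping windows."""
    ranges: List[Range] = []
    for t in sorted(true_alert_minutes):
        start, end = t - threshold_min, t + threshold_min
        if ranges and start <= ranges[-1][1]:
            # Overlaps (or touches) the previous window: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


def in_true_range(minute: int, ranges: Sequence[Range]) -> bool:
    """True if an alert time falls inside any mined true range."""
    return any(start <= minute <= end for start, end in ranges)


# True alerts at 3pm, 4pm, and 8pm with a 1-hour threshold.
ranges = mine_true_ranges([15 * 60, 16 * 60, 20 * 60], 60)
# ranges == [(840, 1020), (1140, 1260)], i.e. (2-5pm) and (7-9pm)
```

Under this reading, an alert at 6pm falls outside both ranges and would be a candidate for suppression, while one at 4:30pm would be kept.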

  8. Experiments
     [Figure: three plots]
     - Host-based threshold adjustment: Rate (%) vs. Train Data Size (Day), 5 to 25 days
     - Host and Time-based threshold adjustment: Rate (%) vs. Train Data Size (Day), 5 to 25 days
     - True range threshold effect: Rate (%) vs. True Range Threshold (min), 30 to 180 minutes
     - Legend: P1 Total Detected False Events; P1 Detected False Events from Hosts with True Sets; P2 Total Detected False Events; P2 Detected False Events from Hosts with True Sets

  9. Discussion
     - Leverage operational data for alert policy tuning
       - Antivirus (20% of a customer's alerts)
     - Weighted scheme
       - Put emphasis on recent input
     - Impact of change operations
       - Integration of service management data is necessary
     - Leverage server similarity
       - Grouping similar servers provides a better training dataset
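The slides do not say how the weighted scheme would emphasize recent input. As one possible illustration only, the sketch below uses exponential decay by alert age; the half-life parameter, the function names, and the `(age, is_false)` input format are all hypothetical.

```python
from typing import Sequence, Tuple


def recency_weight(age_days: float, half_life_days: float = 7.0) -> float:
    """Weight that halves every half_life_days (an assumed decay scheme)."""
    return 0.5 ** (age_days / half_life_days)


def weighted_false_ratio(
    alerts: Sequence[Tuple[float, bool]], half_life_days: float = 7.0
) -> float:
    """Fraction of false alerts, where each alert is (age in days, is_false)
    and recent alerts count more than old ones."""
    total = sum(recency_weight(age, half_life_days) for age, _ in alerts)
    if total == 0:
        return 0.0
    false_weight = sum(
        recency_weight(age, half_life_days) for age, is_false in alerts if is_false
    )
    return false_weight / total
```

A ratio computed this way leans toward what a host has done lately, so a policy tuned on it would react faster to a recent change in behaviour than one tuned on an unweighted history.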

  10. Conclusion
     - How to reduce false alerts
       - Polygraph tunes alert policies based on historical data
     - To improve recall, we utilized
       - Localized feature: Host
         - High recall, barely miss true events
       - Time-dependent behavior
         - Higher recall, reasonable precision

  11. Questions?
