Incident Management: Making sure things go right - PowerPoint PPT Presentation

Incident Management “Making sure things go right when they inevitably go wrong.” Gareth Eason, HEAnet for TF-NOC, Zürich, 2011-06-29

Agenda • HEAnet background: What do we do? • Why manage incidents? • How does HEAnet manage incidents? • Implementation of a new incident management system • Lessons learned

Who are HEAnet? • HEAnet is Ireland's research and education network (NREN) • Set up in 1983 as a collaborative body by the seven Irish universities and the Higher Education Authority • Became a non-profit, limited company in 1997 • Approximately 50 staff today

Network members • 7 Universities & DIT • 13 Institutes of Technology • 16 third-level colleges and VECs • 24 non-profit and research organisations • Government & Administrative bodies • In excess of 180,000 end users • 4,000 primary and post-primary schools

Affiliations & Representations National: • IBEC – TIF / Telecoms Internet Federation • INEX / Internet Neutral Exchange • ISPAI / Internet Service Provider Association of Ireland International: • EU funded Framework Projects • RIPE Network Co-ordination Centre (NCC) • DANTE / TERENA (37 countries) • GÉANT / NREN Consortium Policy Committee • JANET (UK) and JANET-CERT • MoU with Internet2 / NGI

What do we do? • Provide high quality Internet services to our members • Enable research and learning through leading edge shared services • Act as a representative body for the ICT education & research community • Facilitate innovation and collaboration • Ensure value for money

Network Trends 1991-

Milestones [Timeline figure] • 2008 to 2010: Schools 100 Mbit/s connections; first 10 Gbps client connections; resilience; wireless • Strategy 2011 - 2013: Data storage; national data centre; next generation network; cloud computing; wireless

What is an incident? • An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service. • Typically, something has gone wrong • Sources: – Automated alerts – Customers – NOC observations – Suppliers

Why manage incidents? Top 3 reasons to manage incidents: 1. Keep customers happy 2. Keep customers happy 3. Keep customers happy Distant 4th reason: 4. Continuous Service Improvement


Why manage incidents? “You can't manage what you don't measure” Measure, manage and continually improve service
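One way to make "measure" concrete is a mean-time-to-resolve figure computed from ticket timestamps. The sketch below is illustrative only: the `Ticket` record, its field names and the sample dates are assumptions, not HEAnet's schema or data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Ticket:
    # Hypothetical record; field names are illustrative only.
    opened: datetime
    resolved: Optional[datetime] = None


def mean_time_to_resolve(tickets):
    """Average open-to-resolve duration over resolved tickets, or None if there are none."""
    durations = [
        (t.resolved - t.opened).total_seconds()
        for t in tickets
        if t.resolved is not None
    ]
    if not durations:
        return None
    return timedelta(seconds=mean(durations))


# Made-up sample: two resolved tickets (1 h and 3 h) and one still open.
tickets = [
    Ticket(datetime(2011, 4, 5, 9, 0), datetime(2011, 4, 5, 10, 0)),
    Ticket(datetime(2011, 4, 6, 9, 0), datetime(2011, 4, 6, 12, 0)),
    Ticket(datetime(2011, 4, 7, 9, 0)),  # still open, excluded from the mean
]
print(mean_time_to_resolve(tickets))  # 2:00:00
```

Open tickets are excluded rather than counted as zero, so the figure is not flattered by incidents that are still in progress.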

How does HEAnet manage? • Fundamentally process driven • Supported by tools • Managed by NOC staff • People are the most critical [Diagram labels: Process, Tools, NOC personnel]

Implementation • Good people – Experienced and know what they are doing • Good processes – Tried, tested and continually improved • Poor tool support – Custom; built for a need 7 years ago – No support – Inflexible; not practical to extend – Not all incidents captured

A new tool • Evaluate available tools – Remedy, OTRS, RT, ... • Propose replacement tool • Map existing processes to new tool • Amend tool / processes to match • Plan migration to new tool • Decommission old tool

Requirements • No external facing change • Federated auth, with bypass • Integration with existing datasets • Integration with monitoring systems • Standalone capable • Resilient • DR plan (#2 item for reinstatement) • Scalable, supportable, maintainable

Requirements • Automation & Aggregation – Automate what we can – Facilitate everything else • Ensure clear, well understood, robust procedures are – in place and – will be followed / enabled • Leverage Upgrades in Core RT
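The "Automation & Aggregation" requirement could be sketched as follows: alerts for the same circuit that arrive close together are folded into one incident instead of opening one ticket each. This is a minimal sketch under stated assumptions; the `Alert` and `Incident` types, the per-circuit grouping key and the 15-minute window are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    # Hypothetical alert shape; not an RT or HEAnet format.
    circuit: str
    received: datetime
    message: str


@dataclass
class Incident:
    circuit: str
    first_seen: datetime
    last_seen: datetime
    alerts: list = field(default_factory=list)


def aggregate(alerts, window=timedelta(minutes=15)):
    """Fold alerts on the same circuit into one incident when each
    arrives within `window` of the previous alert on that circuit."""
    latest = {}  # circuit -> most recent incident for that circuit
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.received):
        current = latest.get(alert.circuit)
        if current is not None and alert.received - current.last_seen <= window:
            current.alerts.append(alert)
            current.last_seen = alert.received
        else:
            current = Incident(alert.circuit, alert.received, alert.received, [alert])
            latest[alert.circuit] = current
            incidents.append(current)
    return incidents


# Made-up sample: two alerts on circuit-a five minutes apart, one on
# circuit-b, and a later circuit-a alert outside the window.
sample = [
    Alert("circuit-a", datetime(2011, 6, 29, 9, 0), "link down"),
    Alert("circuit-b", datetime(2011, 6, 29, 9, 2), "high latency"),
    Alert("circuit-a", datetime(2011, 6, 29, 9, 5), "link down"),
    Alert("circuit-a", datetime(2011, 6, 29, 10, 0), "link down"),
]
incidents = aggregate(sample)
print([(i.circuit, len(i.alerts)) for i in incidents])
```

A sliding window (measured from the last alert, not the first) keeps a flapping circuit in one incident for as long as it keeps alerting.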

Design • Two separate data centres • API for integration [Diagram, built up over several slides. Components shown: alerting e-mail, RT ticketing, management UI, RT cache, middleware, DB, and APIs for client info, service & circuit info and supplier info; the UI, API, middleware and DB appear in both data centres, with failover between them]
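From a client's point of view, failover between the two data centres might look like the sketch below: try the primary API and fall back to the secondary if it is unreachable. Everything here is an assumption for illustration, including the endpoint URLs, the `request` callable and the response shape; the talk does not describe how the failover is actually implemented.

```python
from typing import Callable, Sequence


class AllEndpointsFailed(Exception):
    """Raised when no endpoint in the list could be reached."""


def call_with_failover(endpoints: Sequence[str],
                       request: Callable[[str], dict]) -> dict:
    """Try each endpoint in order and return the first successful response."""
    errors = []
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except ConnectionError as exc:
            errors.append(f"{endpoint}: {exc}")
    raise AllEndpointsFailed("; ".join(errors))


def fake_request(endpoint: str) -> dict:
    # Stand-in for a real HTTP call: the primary is simulated as down.
    if "primary" in endpoint:
        raise ConnectionError("unreachable")
    return {"served_by": endpoint, "client": "example-college"}


# Hypothetical endpoint names.
ENDPOINTS = ["https://primary.example/api", "https://secondary.example/api"]
result = call_with_failover(ENDPOINTS, fake_request)
print(result["served_by"])  # https://secondary.example/api
```

Passing the request function in as a parameter keeps the sketch runnable without a network and makes the failover path easy to exercise in a test.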

Buy in • Management buy-in – Reporting – Better customer service • NOC buy-in – Easier to track incidents – Better integration makes life easier • Client buy-in – Looks the same, but better service

Buy in • NOC involved from day #1 • Suggestions tracked – Fogbugz • 3-month migration from old to new – 5th April 2011 (go-live) – 1st July 2011 (turn off mousetrap)

Continuous improvement • E-mail filters • RT interface – Agile methodology – Multiple releases since 5th April • AssetDB launched 28th June 2011 – Plan for integration

Platform • Two sites: Primary and Secondary / Failover • Each site runs three environments: Production, Staging, Development • Sysadmin & NOC: Production and Staging • s/w dev team: Development

Outcomes • Much better issue tracking • More tickets [Chart: Network Operation Centre tickets opened per quarter (Q1 to Q4), 2006 to 2011]
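A per-quarter count like the one charted above can be derived from ticket open dates alone. The sketch below assumes a plain list of dates as input; the sample dates are made up and are not HEAnet's ticket figures.

```python
from collections import Counter
from datetime import date


def tickets_per_quarter(opened_dates):
    """Count tickets by (year, quarter), returned in chronological order."""
    counts = Counter()
    for d in opened_dates:
        quarter = (d.month - 1) // 3 + 1  # Jan-Mar -> 1, ..., Oct-Dec -> 4
        counts[(d.year, quarter)] += 1
    return dict(sorted(counts.items()))


# Made-up sample dates for illustration.
sample_dates = [
    date(2011, 1, 10),
    date(2011, 4, 5),
    date(2011, 4, 20),
    date(2011, 6, 28),
]
report = tickets_per_quarter(sample_dates)
print(report)  # {(2011, 1): 1, (2011, 2): 3}
```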

Outcomes • Much better reporting

Lessons learned • Good incident management => Good customer service • Good process is key • Tool must support the process • Integration is key • Automation is great • Reporting is vital

Lessons learned • Have a DR plan (Disaster Recovery) • Test it • Break stuff, and test it again • Test it some more • Test it again How do you manage incidents if they break the tool?

Lessons Learned • Support the process • Integrate • Automate • Report • Leverage community development • Have a DR plan • Test, test, test some more!
