Automated Troubleshooting of Live Site Issues Sriram Srinivasan - PowerPoint PPT Presentation

Automated Troubleshooting of Live Site Issues
Sriram Srinivasan, PayPal SRE
May 23, 2017

About Me
• MTS 2, Software Engineer @ PayPal
• Site Reliability Engineer

Agenda
• A bit about PayPal SRE
• Troubleshooting Challenges
• Manual Troubleshooting Process
• Requirements of the Automated Troubleshooting Platform
• Evolution of the Architecture
• Architecture in Detail
• Major features of the Automated Troubleshooting Platform
• How to troubleshoot any type of Issue through Workflows
• Future Plans

A bit about PayPal SRE
• Focus on the key aspects of Site Reliability:
  - Availability
  - Performance
  - Change Management
  - Monitoring and Alerting
  - Incident Management
• Troubleshoot and drive resolution of live issues (from every domain) across the company.

Troubleshooting Challenges
• Manual:
  - Not having enough data to troubleshoot
  - Knowledge of the area/domain
  - Multiple signal generators
  - Inherent urgency in resolving
  - Takes time (due to human intervention)
  - Past troubleshooting knowledge not always leveraged
  - Low priority issues don’t get enough attention
  - Expiry of the logs
• Landscape:
  - Newer Products/Flows
  - People, Product & Bugs changing teams
  - Growing number of issues as we grow
  - Growing signal generators
  - Troubleshooting system generated Alerts not scalable

Manual Troubleshooting Process
• Issue Comprehension
• Categorize Issue (System vs Application)
• Look for Samples
• Tag Samples with the corresponding logs (spanning multiple applications)
• Check further in:
  - Stack Trace (Logs from the point of entry)
  - Recent Pushes (pertaining to the application/service)
  - Deployment Logs
  - Databases
  - In-house Alerting & Monitoring tools
  - In-house Admin tool
  - Code base
  - Production box
  - Bug Tracker/Ticketing Systems
  - …

Requirements
• Explicit Functional Requirements:
  - Automate the troubleshooting process
• Implicit Functional Requirements:
  - Provision to talk to disparate signal generators/data sources (like log servers, DB, …) synchronously or asynchronously
  - Adaptable to the growing number of signal generators/data sources
  - Ability to troubleshoot any type of issue/alert
  - Troubleshooting data augmentation/enrichment
  - Assimilation of the results from various data sources
  - Retain the relevant logs/troubleshooting info forever
  - Single place to view the auto-troubleshooting result
  - Build a Platform

Evolution of the Architecture
• Key Abstractions:
  - Identify the Type of Issue/Alert (Workflow)
  - Workflow has the say on how to troubleshoot (control strategy)
  - Augment the Troubleshooting Data
  - Invoke various Fetchers in the order prescribed (diverse specialized modules)
  - Gather results in a common place
  - Assimilate / Solution is incrementally constructed
• Architecture Patterns:
  - Multitier / n-tier architecture
  - Service-oriented architecture (SOA)
  - Presentation-abstraction-control / MVC
  - Blackboard system
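
The blackboard idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's actual code: the fetcher functions, the dictionary used as the blackboard, and the issue ID are all hypothetical. Each fetcher reads what is already on the shared board and posts its own result, and the ordered list plays the role of the control strategy.

```python
from typing import Callable, Dict, List

# All names here are hypothetical; the talk does not show its real interfaces.
Blackboard = Dict[str, object]
Fetcher = Callable[[Blackboard], None]


def fetch_stack_trace(board: Blackboard) -> None:
    # A specialized module: reads the issue and posts its finding to the board.
    board["stack_trace"] = f"trace for {board['issue_id']}"


def fetch_recent_pushes(board: Blackboard) -> None:
    # Placeholder data standing in for a real deployment lookup.
    board["recent_pushes"] = ["hypothetical-push-1"]


def assimilate(board: Blackboard) -> None:
    # Runs last and summarizes whatever the earlier fetchers contributed.
    found = sorted(k for k in board if k not in ("issue_id", "summary"))
    board["summary"] = "collected: " + ", ".join(found)


def run(issue_id: str, control_strategy: List[Fetcher]) -> Blackboard:
    # The solution is built incrementally on one shared structure.
    board: Blackboard = {"issue_id": issue_id}
    for fetcher in control_strategy:
        fetcher(board)
    return board


result = run("ISSUE-1", [fetch_stack_trace, fetch_recent_pushes, assimilate])
print(result["summary"])
```

Keeping every result on one shared structure is what lets a later step, such as `assimilate` here, work without knowing which fetchers ran before it.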

Architecture

Workflows
• A Workflow specifies how to enrich the troubleshooting data and which Fetchers are required (including the order of invocation).
• Each Workflow is described in JSON and is simply a union of various Sections (or Directives).
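
The talk does not show the Workflow schema, so the following is only a guess at what such a JSON description might look like. Every key (`workflow`, `enrich`, `fetchers`, `order`) and every fetcher name is an assumption made for illustration. The sketch loads the JSON and derives the fetcher invocation order from it.

```python
import json

# Hypothetical Workflow: one enrichment section and one fetcher section.
# The real section names and layout are not given in the slides.
WORKFLOW_JSON = """
{
  "workflow": "checkout-failure",
  "enrich": {"add_fields": ["correlation_id", "time_window"]},
  "fetchers": [
    {"name": "recent_pushes", "order": 2},
    {"name": "log_search", "order": 1},
    {"name": "db_lookup", "order": 3}
  ]
}
"""


def fetcher_plan(workflow_text: str) -> list:
    """Return fetcher names in the order the Workflow prescribes."""
    workflow = json.loads(workflow_text)
    ordered = sorted(workflow["fetchers"], key=lambda f: f["order"])
    return [f["name"] for f in ordered]


plan = fetcher_plan(WORKFLOW_JSON)
print(plan)
```

Because the order lives in the data rather than in code, supporting a new product or flow would mean adding a new JSON document, not changing the runner.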

Major features of the Automated Troubleshooting Platform
• Pluggable:
  - Fetchers are pluggable. We can add as many Fetchers (for data sources) as we want.
  - The language used to develop a Fetcher is not fixed.
• Expandable:
  - Add as many Workflows (Products/Flows) as needed. A Workflow specifies which Fetchers are invoked and in what order.
  - Issues and various types of Alerts can be triaged.
• Scalable by Design:
  - Asynchronous invocation of Fetchers.
  - The underlying technologies also help.
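
One way to picture pluggable, asynchronously invoked Fetchers is a small registry combined with `asyncio`. This is a sketch under assumptions: the `register` decorator, the fetcher names, and the in-process Python design are all hypothetical. The slides say Fetchers may be written in any language, which suggests they run as separate services in the real platform. The example also assumes the two fetchers are independent of each other, so they can run concurrently.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical registry: adding a data source means registering one function.
FETCHERS: Dict[str, Callable[[str], Awaitable[str]]] = {}


def register(name: str):
    def decorator(func):
        FETCHERS[name] = func
        return func
    return decorator


@register("logs")
async def fetch_logs(issue_id: str) -> str:
    await asyncio.sleep(0.01)  # stands in for a call to a log server
    return f"logs for {issue_id}"


@register("deployments")
async def fetch_deployments(issue_id: str) -> str:
    await asyncio.sleep(0.01)  # stands in for a call to a deployment system
    return f"deployments for {issue_id}"


async def troubleshoot(issue_id: str, names: List[str]) -> Dict[str, str]:
    # gather() runs the fetchers concurrently and keeps results in input order.
    results = await asyncio.gather(*(FETCHERS[n](issue_id) for n in names))
    return dict(zip(names, results))


report = asyncio.run(troubleshoot("ISSUE-1", ["logs", "deployments"]))
print(report)
```

Fetchers that depend on each other's output would instead be awaited one after another, in the order the Workflow prescribes.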

Benefits
• Fast Triaging of all Issues & Alerts:
  - All issues and alerts are auto-triaged in minutes.
  - Reduces MTTT (Mean Time to Triage) and thus MTTR (Mean Time to Resolve).
• Lower Cost to the Company:
  - Reduces the sustaining budget of teams. Teams can spend their effort on building other cool features.
• Customer Satisfaction:
  - Better customer satisfaction, as logs are available forever and we don't need to go back to our customers.
• Better Insights:
  - As a single platform, it holds all the issues and their resolutions, so it can provide various insights.
  - Past triaging knowledge is leveraged for future troubleshooting.

Future Plans
• Platform Usage:
  - Continuously evolve the platform by adding more Fetchers
• Disposition:
  - Smart Issue Classification & Intelligent Issue Routing
• Data Platform:
  - Cataloguing each Issue with additional information (resolution details, additional notes)
  - Insights generation sliced by product, flow, and root cause
• Proactive Measures:
  - Use the data to identify where most issues are coming from and invest there

Questions?
