Lessons From Building Automation For a Large Distributed Database - PowerPoint PPT Presentation

Lessons From Building Automation For a Large Distributed Database
Leigh Johnson, Ameet Kotian
October 1, 2019
bit.ly/2ntsVSL

Presenters
Leigh Johnson (she/her/hers), Staff DRE, Slack. Google Developer Expert (Machine Learning). @grepLeigh
Ameet Kotian (he/him/his), Staff DRE, Slack. Past - SRE, Twitter. @ameetkotian

Slack’s mission is to make people’s working lives simpler, more pleasant, and more productive.

Agenda
1. Evolution of Slack Automation
2. Case Study: Self-Healing Databases
3. Lessons
4. Q & A

Evolution of Database Automation: Self-Healing Stateful Systems in 4 Steps
1. Monitoring Alerts
2. Checklists
3. Scripts
4. Automated Workflows

Monitoring Alerts
● Metrics
● Events
● Logs
● Traces
● Dashboards
● PagerDuty

Just follow these 19 easy steps!
Checklists
● Runbooks
● Shared documents
● Lots of hand-over
● Limited by team capacity

Convert Runbooks to Code

Just follow these 5 easy steps!
Manual Scripts
$ ./provision.sh
$ ./backup_restore.sh
$ ./validate_replication.sh
$ ./service_discovery_stuff.sh
$ ./deprovision_server.sh
● Difficult to maintain
● Context switching
● API contracts
● Multi-step process
or...

Just follow this 1 easy step!
Write a Do-Everything Script
$ ./fix-it-now.sh

Automated Workflows
● Systems (not humans)
● Detects failure mode
● Executes appropriate response
● Fail Open
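The detect-then-respond loop on this slide can be illustrated with a small dispatcher. This is a minimal sketch, not Slack's implementation: the failure-mode names, `replace_host`, and `page_human` are hypothetical, and "fail open" is read here as "when the automation cannot act safely, take no automated action and hand over to a human", which is an interpretation of the slide rather than something it spells out.

```python
from typing import Callable, Dict


# Hypothetical responses; none of these names come from the talk.
def replace_host(host: str) -> str:
    return f"replace:{host}"


def page_human(host: str, reason: str) -> str:
    return f"page:{host}:{reason}"


# Hypothetical mapping from detected failure mode to automated response.
RESPONSES: Dict[str, Callable[[str], str]] = {
    "replica_down": replace_host,
    "primary_down": replace_host,
}


def handle_failure(
    failure_mode: str,
    host: str,
    responses: Dict[str, Callable[[str], str]] = RESPONSES,
) -> str:
    """Run the response registered for a failure mode, else fall back to a human."""
    response = responses.get(failure_mode)
    if response is None:
        # Unknown failure mode: take no automated action.
        return page_human(host, f"unrecognized failure mode {failure_mode!r}")
    try:
        return response(host)
    except Exception as exc:
        # The automation itself failed: hand over instead of retrying blindly.
        return page_human(host, f"automation error: {exc}")
```

Keeping the fallback in one place means every new failure mode inherits the same hand-over behavior by default.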

You will still need firefighters after installing a fire suppression system. Automation is not a magic bullet.

Automated Workflows are an Investment
● Many teams stop automating at the scripts stage
● Instrument, monitor, support n + 1 systems

Do Due Diligence
If automation is an investment...
● Quantify value proposition
● Measure return on investment
● Include qualitative data

Are you ready to kill your pet databases?

How did we take Slack from...

Case Study: Building automation for remediating database failures

Goals
● Reliably detect MySQL host failures
● Automatically remediate failed MySQL hosts
● Scale (security fixes, kernel upgrades, etc.)

Slack’s database architecture
● Self-managed MySQL on AWS i3 instances
● Data is sharded across thousands of hosts
● Two main types of clusters: Legacy and Vitess

1. Legacy shard
● Application-level, team-sharded, active primary-primary MySQL setup
● Some shards have read replicas

For more details... Strength in Numbers: Slack's Database Architecture. 2nd Oct 2019, 1:30 PM. Guido Iaquinti, Josh Varner

What is Vitess? (Slack is moving to Vitess)
● Database solution for MySQL
● Deploy, scale and manage large MySQL clusters
● Built on top of MySQL replication and InnoDB
● MySQL features + scalability of a NoSQL database
● Open source project by YouTube (Google)
● Started in 2010
● Cloud Native Computing Foundation endorsed project
● Ability to run each component in a container

2. Vitess shard
● Primary-replica MySQL setup
● ~40% of Slack’s database queries served by Vitess

For more details on Vitess... My First 90 Days with Vitess. 2nd Oct 2019, 11:00 AM. Morgan Tocker
https://vitess.io/
https://vitess.slack.com/

What if there is a failure? Legacy shard Vitess shard

Salvage or replace?

Always replace!

Failure detection

Auto remediation requires accurate failure signals

We use Orchestrator for automatic master failovers on our Vitess clusters, and it met all of our requirements for detecting host failures.
https://github.com/github/orchestrator

Primary failures Legacy shard Vitess shard

Use Orchestrator hooks triggered on master failovers
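Orchestrator runs the commands listed in hook settings such as `PostMasterFailoverProcesses` and substitutes placeholders like `{failureType}`, `{failureCluster}`, `{failedHost}`, and `{successorHost}` into the command line. A hook script could turn those arguments into a failure event for downstream automation. The sketch below is an assumption about how such a hook might look, not the script shown in the talk; the event shape and the function names `build_failure_event` and `main` are hypothetical.

```python
import json
from typing import Dict, List


def build_failure_event(argv: List[str]) -> Dict[str, str]:
    """Turn the hook's positional arguments into a failure event.

    Assumes the hook is configured as (hypothetical path):
      /path/to/hook.py {failureType} {failureCluster} {failedHost} {successorHost}
    """
    if len(argv) != 4:
        raise ValueError(
            "expected: failure_type cluster failed_host successor_host"
        )
    failure_type, cluster, failed_host, successor_host = argv
    return {
        "event": "primary_failover",  # hypothetical event name
        "failure_type": failure_type,
        "cluster": cluster,
        "failed_host": failed_host,
        "successor_host": successor_host,
    }


def main(argv: List[str]) -> str:
    # A real hook would hand this to the remediation system;
    # the sketch only serializes the event.
    return json.dumps(build_failure_event(argv), sort_keys=True)
```

Called with the arguments Orchestrator passes, `main(sys.argv[1:])` yields a JSON document that an event handler can consume.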

Replica failures Legacy shard Vitess shard

Use the Orchestrator problems API to detect replica failures
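Orchestrator exposes an `/api/problems` endpoint that returns the instances it currently considers problematic. As a rough sketch, a poller could decode that JSON and keep only replicas with a non-empty problem list. The field names used below (`Key`, `MasterKey`, `Problems`) are assumptions about the response shape and should be checked against the Orchestrator version in use; `failed_replicas` is a hypothetical helper.

```python
from typing import List, Tuple


def failed_replicas(instances: List[dict]) -> List[Tuple[str, int, List[str]]]:
    """Pick replicas that report at least one problem.

    `instances` is the decoded JSON list from the problems endpoint.
    Field names are assumed, not taken from the talk.
    """
    failed = []
    for inst in instances:
        problems = inst.get("Problems") or []
        master = inst.get("MasterKey") or {}
        # Treat an instance as a replica if it points at a master host.
        is_replica = bool(master.get("Hostname"))
        if is_replica and problems:
            key = inst.get("Key") or {}
            failed.append((key.get("Hostname"), key.get("Port"), list(problems)))
    return failed
```

Filtering on the replica side keeps this path separate from primary failures, which arrive through the failover hooks instead.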

Orchestrator can be used to generate accurate failure signals
● Orchestrator is distributed and uses multiple probes to detect failures.
● It uses knowledge of MySQL state and replication to detect failures.
● Rich set of APIs
● Has inbuilt concept of a shard
● Downtime/Maintenance mode

Automated Remediation at Scale

Failure Event Provision Event Handler Workflow ???

Automation Components
● Task Queue
● Scheduler
● APIs
● Workflows
● Audit Logs
● Web UI

Failure Event Provision Event Handler Workflow Celery

Distributed Task Queue Framework: Celery
● Manual Script
● Task API
● Workflows
● Queue Isolation, Rate Limits, Retry Behavior
● Scheduler
● Web UI: pip install celery-flower
● Slack Notifications: pip install celery-slack-webhooks (github.com/leigh-johnson/celery-slack-webhooks)
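Celery supplies workflows (for example `chain`) and per-task retry behavior out of the box. To make the semantics concrete without depending on a broker, here is a library-free sketch of the same two ideas: steps run in order, each feeding its result to the next, and a failing step is retried with exponential backoff. This is an illustration only, not Celery's API and not the code from the talk; `run_with_retries` and `run_workflow` are hypothetical names.

```python
import time
from typing import Any, Callable, Sequence


def run_with_retries(
    step: Callable[[Any], Any],
    payload: Any,
    max_retries: int = 3,
    base_delay: float = 0.0,
) -> Any:
    """Call step(payload), retrying with exponential backoff on any exception."""
    for attempt in range(max_retries + 1):
        try:
            return step(payload)
        except Exception:
            if attempt == max_retries:
                # Retries exhausted: surface the failure to the caller.
                raise
            time.sleep(base_delay * (2 ** attempt))


def run_workflow(
    steps: Sequence[Callable[[Any], Any]],
    payload: Any,
    **retry_opts: Any,
) -> Any:
    """Run steps in order, feeding each step's result into the next."""
    for step in steps:
        payload = run_with_retries(step, payload, **retry_opts)
    return payload
```

In a real deployment the task queue also gives you what this sketch lacks: durable state between steps, isolation between queues, and rate limits enforced across workers.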

Distributed Lock: Any Strongly Consistent DB
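One common way to build a lock on a strongly consistent database is to rely on a uniqueness constraint: whoever inserts the row for a lock name first holds the lock. The sketch below uses SQLite purely as a stand-in for whichever database is actually used; the table layout and the `init_locks`, `acquire`, and `release` helpers are hypothetical, and the talk does not say which store or schema was chosen.

```python
import sqlite3


def init_locks(conn: sqlite3.Connection) -> None:
    """Create the lock table; the primary key enforces one owner per lock."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS locks ("
        "name TEXT PRIMARY KEY, owner TEXT NOT NULL)"
    )
    conn.commit()


def acquire(conn: sqlite3.Connection, name: str, owner: str) -> bool:
    """Try to take the lock; return False if someone else already holds it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO locks (name, owner) VALUES (?, ?)", (name, owner)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def release(conn: sqlite3.Connection, name: str, owner: str) -> bool:
    """Release the lock only if this owner holds it."""
    with conn:
        cur = conn.execute(
            "DELETE FROM locks WHERE name = ? AND owner = ?", (name, owner)
        )
    return cur.rowcount == 1
```

The sketch has no expiry, so a crashed holder would keep the lock forever; a production version would likely add a lease or timestamp column.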

Lessons learned

Three important things - Safety, safety, safety...

Automation software is just like regular software... use the same scalability/reliability principles

Automation software is just regular software... release early and often

You need a rollout strategy… And a rollback strategy

Show Value Proposition

Commitment to automation

Questions? We are hiring DREs and SREs in San Francisco and Dublin.
