Lessons From Building Automation For a Large Distributed Database

Lessons From Building Automation For a Large Distributed Database. Leigh Johnson, Ameet Kotian. October 1, 2019. bit.ly/2ntsVSL. Presenters: Leigh Johnson (she/her/hers), Staff DRE, Slack; Ameet Kotian (he/him/his), Staff DRE, Slack.


  1. Lessons From Building Automation For a Large Distributed Database. Leigh Johnson, Ameet Kotian. October 1, 2019. bit.ly/2ntsVSL

  2. Presenters: Leigh Johnson (she/her/hers), Staff DRE, Slack; Google Developer Expert (Machine Learning); @grepLeigh. Ameet Kotian (he/him/his), Staff DRE, Slack; Past: SRE, Twitter; @ameetkotian.

  3. Slack’s mission is to make people’s working lives simpler, more pleasant, and more productive.

  4. Agenda 1. Evolution of Slack Automation 2. Case Study: Self-Healing Databases 3. Lessons 4. Q & A

  5. Evolution of Database Automation: Self-Healing Stateful Systems in 4 Steps. Monitoring Alerts; Checklists; Scripts; Automated Workflows.

  6. Monitoring Alerts: Metrics, Events, Logs, Traces, Dashboards, PagerDuty.

  7. Checklists: Just follow these 19 easy steps! Runbooks. Shared documents. Lots of hand-over. Limited by team capacity.

  8. Convert Runbooks to Code

  9. Convert Runbooks to Code

  10. Manual Scripts: Just follow these 5 easy steps! $ ./provision.sh; $ ./backup_restore.sh; $ ./validate_replication.sh; $ ./service_discovery_stuff.sh; $ ./deprovision_server.sh. Difficult to maintain. Context switching. API contracts. Multi-step process or...

  11. Just follow this 1 easy step! Write a Do-Everything Script: $ ./fix-it-now.sh

  12. Automated Workflows: Systems (not humans) detect the failure mode and execute the appropriate response. Fail open.

  13. You will still need firefighters after installing a fire suppression system. Automation is not a magic bullet.

  14. Automated Workflows are an Investment. Many teams stop automating at the scripts stage. Instrument, monitor, support n + 1 systems.

  15. Do Due Diligence: If automation is an investment... Quantify value proposition. Measure return on investment.

  16. Do Due Diligence: If automation is an investment... Quantify value proposition. Measure return on investment. Include qualitative data.

  17. Are you ready to kill your pet databases?

  18. How did we take Slack from...

  19. Presenters: Leigh Johnson (she/her/hers), Staff DRE, Slack; Google Developer Expert (Machine Learning); @grepLeigh. Ameet Kotian (he/him/his), Staff DRE, Slack; Past: SRE, Twitter; @ameetkotian.

  20. Case Study: Building automation for remediating database failures

  21. Goals

  22. Goals: Reliably detect MySQL host failures.

  23. Goals: Reliably detect MySQL host failures. Automatically remediate failed MySQL hosts.

  24. Goals: Reliably detect MySQL host failures. Automatically remediate failed MySQL hosts. Scale (security fixes, kernel upgrades, etc.).

  25. Slack’s database architecture

  26. Slack’s database architecture: Self-managed MySQL on AWS i3 instances. Data is sharded across thousands of hosts. Two main types of clusters: Legacy and Vitess.

  27. 1. Legacy shard: Application-level, team-sharded, active primary-primary MySQL setup.

  28. 1. Legacy shard: Application-level, team-sharded, active primary-primary MySQL setup. Some shards have read replicas.

  29. For more details... Strength in Numbers: Slack's Database Architecture. 2nd Oct 2019, 1:30 PM. Guido Iaquinti, Josh Varner.

  30. Slack is moving to Vitess. What is Vitess? Database solution for MySQL. Deploy, scale and manage large MySQL clusters. Built on top of MySQL replication and InnoDB. MySQL features + scalability of a NoSQL database. Open source project by YouTube (Google). Started in 2010. Cloud Native Computing Foundation endorsed project. Ability to run each component in a container.

  31. 2. Vitess shard: Primary-replica MySQL setup. ~40% of Slack’s database queries are served by Vitess.

  32. For more details on Vitess... My First 90 Days with Vitess. 2nd Oct 2019, 11:00 AM. Morgan Tocker. https://vitess.io/ https://vitess.slack.com/

  33. What if there is a failure? Legacy shard Vitess shard

  34. Salvage or replace?

  35. Always replace!

  36. Failure detection

  37. Auto remediation requires accurate failure signals

  38. We use Orchestrator for automatic master failovers of the Vitess clusters, and it met all our requirements for detecting host failures. https://github.com/github/orchestrator

  39. Primary failures Legacy shard Vitess shard

  40. Use Orchestrator hooks, which are triggered on master failovers.
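
As a rough illustration of a failover hook: Orchestrator can run external commands after a master failover (for example via its `PostMasterFailoverProcesses` setting), substituting placeholders such as `{failedHost}`, `{successorHost}`, `{failureCluster}` and `{failureType}` into the command line. The sketch below is a hypothetical hook that turns those arguments into a failure event. The flag names, the event fields and the `replace_host` action are assumptions for illustration, not taken from the talk.

```python
import argparse
import json


def build_event(argv):
    """Parse hook arguments into a failure event (hypothetical schema)."""
    parser = argparse.ArgumentParser(description="Failover hook sketch")
    # Flag names are illustrative; Orchestrator would fill the values from
    # placeholders like {failedHost} in the configured command line.
    parser.add_argument("--failed-host", required=True)
    parser.add_argument("--successor-host", required=True)
    parser.add_argument("--cluster", required=True)
    parser.add_argument("--failure-type", required=True)
    args = parser.parse_args(argv)
    return {
        "event": "primary_failover",
        "failed_host": args.failed_host,
        "successor_host": args.successor_host,
        "cluster": args.cluster,
        "failure_type": args.failure_type,
        # "Always replace": the failed host is never salvaged.
        "action": "replace_host",
    }


def main(argv):
    """Emit the event as JSON; a real hook would hand it to an event handler."""
    print(json.dumps(build_event(argv), sort_keys=True))
    return 0
```

A script entry point would call `main(sys.argv[1:])`. In a real setup the event would presumably be enqueued for the provisioning workflow rather than printed.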

  41. Replica failures Legacy shard Vitess shard

  42. Use the Orchestrator problems API to detect replica failures.
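
A minimal sketch of polling for replica failures, assuming Orchestrator's HTTP endpoint `/api/problems` returns a JSON list of instances with fields such as `Key`, `MasterKey`, `IsLastCheckValid` and `IsDowntimed`. The field names are recalled from Orchestrator's instance JSON and should be checked against the deployed version. The filtering rule (skip downtimed hosts, skip hosts without a master) is an assumption for illustration, not the presenters' logic.

```python
import json
import urllib.request


def fetch_problems(base_url, timeout=5.0):
    """Fetch the list of problematic instances from an Orchestrator server."""
    url = f"{base_url}/api/problems"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def failed_replicas(instances):
    """Return (hostname, port) pairs for replicas whose last check failed."""
    failed = []
    for inst in instances:
        # Hosts in downtime/maintenance mode are deliberately ignored.
        if inst.get("IsDowntimed"):
            continue
        # Assumed: a replica has a non-empty master hostname.
        if not inst.get("MasterKey", {}).get("Hostname"):
            continue
        # Assumed: an invalid last check means the host is unreachable.
        if inst.get("IsLastCheckValid", True):
            continue
        key = inst.get("Key", {})
        failed.append((key.get("Hostname"), key.get("Port")))
    return failed
```

Keeping the filter separate from the HTTP call makes the decision logic easy to test against recorded API responses.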

  43. Orchestrator can be used to generate accurate failure signals. ● Orchestrator is distributed and uses multiple probes to detect failures. ● It uses knowledge of MySQL state and replication to detect failures. ● Rich set of APIs. ● Has inbuilt concept of a shard. ● Downtime/Maintenance mode.

  44. Automated Remediation at Scale

  45. Failure Event Provision Event Handler Workflow ???

  46. Automation Components Task Queue Scheduler APIs

  47. Automation Components Workflows Audit Logs Web UI

  48. Failure Event Provision Event Handler Workflow Celery

  49. Celery (Distributed Task Queue Framework): Manual Script.

  50. Celery (Distributed Task Queue Framework): Task API.

  51. Celery (Distributed Task Queue Framework): Task API; Workflows.

  52. Celery (Distributed Task Queue Framework): Task API; Workflows.

  53. Celery (Distributed Task Queue Framework): Task API; Workflows; Queue Isolation; Rate Limits; Retry Behavior.

  54. Celery (Distributed Task Queue Framework): Task API; Workflows; Queue Isolation; Rate Limits; Retry Behavior; Scheduler.

  55. Celery (Distributed Task Queue Framework): Task API; Workflows; Queue Isolation; Rate Limits; Retry Behavior; Scheduler; Web UI (pip install celery-flower).

  56. Celery (Distributed Task Queue Framework): Task API; Workflows; Queue Isolation; Rate Limits; Retry Behavior; Scheduler; Web UI; Slack Notifications (pip install celery-slack-webhooks, github.com/leigh-johnson/celery-slack-webhooks).
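
To make queue isolation, rate limits, retry behavior and the scheduler more concrete, here is a sketch of a Celery configuration module, of the kind loaded with `app.config_from_object(...)`. The setting names (`task_routes`, `task_annotations`, `beat_schedule` and so on) are standard Celery settings. The task names, queue names and numbers are hypothetical and not taken from the talk.

```python
# Hypothetical celeryconfig.py for a database automation service.

# Queue isolation: each workflow type gets its own queue, so a backlog of
# slow provisioning tasks cannot starve other work.
task_default_queue = "default"
task_routes = {
    "workflows.provision_replacement": {"queue": "provision"},
    "workflows.deprovision_host": {"queue": "deprovision"},
}

# Rate limits and retry defaults, applied per task. Celery enforces
# rate_limit per worker instance, not globally across the cluster.
# max_retries and default_retry_delay apply when a task calls self.retry().
task_annotations = {
    "workflows.provision_replacement": {
        "rate_limit": "10/m",
        "max_retries": 3,
        "default_retry_delay": 60,  # seconds
    },
}

# Acknowledge a task only after it finishes, and prefetch one task at a
# time, so a crashed worker does not silently drop a remediation.
task_acks_late = True
worker_prefetch_multiplier = 1

# Scheduler (celery beat): run a periodic sweep every 60 seconds.
beat_schedule = {
    "sweep-replica-problems": {
        "task": "workflows.check_replica_problems",
        "schedule": 60.0,
    },
}
```

With late acknowledgement a task may run more than once after a worker crash, so this setup only makes sense if the workflow steps are idempotent.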

  57. Distributed Lock: Any Strongly Consistent DB.
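
One way to build a lock on a strongly consistent database is to rely on a uniqueness constraint: whoever inserts the row for a lock name holds the lock until it is released or expires. The sketch below uses Python's built-in `sqlite3` purely as a local stand-in for such a database. The table layout and the function names are assumptions for illustration, not the presenters' implementation.

```python
import sqlite3
import time


def init_locks(conn):
    """Create the lock table; the primary key enforces one holder per name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS locks ("
        "name TEXT PRIMARY KEY, owner TEXT NOT NULL, expires_at REAL NOT NULL)"
    )
    conn.commit()


def acquire(conn, name, owner, ttl, now=None):
    """Try to take the lock for ttl seconds; return True on success."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on an unhandled error
        # Clear an expired holder so a crashed worker cannot block forever.
        conn.execute(
            "DELETE FROM locks WHERE name = ? AND expires_at <= ?", (name, now)
        )
        try:
            conn.execute(
                "INSERT INTO locks (name, owner, expires_at) VALUES (?, ?, ?)",
                (name, owner, now + ttl),
            )
        except sqlite3.IntegrityError:
            return False  # someone else still holds a live lock
    return True


def release(conn, name, owner):
    """Release the lock only if this owner holds it; return True if released."""
    with conn:
        cur = conn.execute(
            "DELETE FROM locks WHERE name = ? AND owner = ?", (name, owner)
        )
    return cur.rowcount == 1
```

A workflow could call `acquire(conn, "shard-7", "worker-a", ttl=300)` before touching a shard, so that two remediations never run against the same shard at once. The expiry is a trade-off: it must be longer than the slowest expected workflow step.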

  58. Lessons learned

  59. Three important things - Safety, safety, safety...

  60. Automation software is just like regular software... use the same scalability/reliability principles

  61. Automation software is just regular software... release early and often

  62. You need a rollout strategy… And a rollback strategy

  63. Show Value Proposition

  64. Commitment to automation

  65. Questions? We are hiring DREs and SREs in San Francisco and Dublin.
